Monday 29 August 2011

Implementation of Gibbs sampling for Naive Bayes text classification

This post is about an implementation of the Gibbs sampling approach described by Resnik & Hardisty (2010), Gibbs sampling for the uninitiated, CS-TR-4956. The paper is a very good tutorial on Gibbs sampling, and it works through an example of Gibbs sampling for text classification using "Naive" Bayes. I add the quotation marks because the approach presented there is slightly more informed than the traditional Naive Bayes method.

In the traditional Naive Bayes method, given a document $j$, the task is to find the label $L_j$ that maximises the probability $P(L_j|W_j)$, where $W_j$ is a vector of features. Applying Bayes' rule transforms the problem into one of finding the label that maximises:

$$
P(L_j|W_j) = \frac{P(W_j|L_j) P(L_j)}{P(W_j)}
$$


Since $P(W_j)$ is constant for a given document $j$, we can ignore the denominator. The prior $P(L_j)$ is estimated by counting the number of times label $L_j$ is assigned to a document in the training data, and dividing by the total number of training documents.

To compute $P(W_j|L_j)$ we introduce the naive assumption that all features are independent of each other, and therefore $P(W_j|L_j)$ is the product of $P(W_{ji}|L_j)$ over each separate feature value $W_{ji}$. Each individual $P(W_{ji}|L_j)$ can then be estimated using the maximum likelihood estimate on the training data, that is, by counting.
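As a concrete illustration, here is a minimal sketch of this counting approach in Python. It is not the code from the paper or from nbgibbs: the names are my own, documents are assumed to be represented as word-count dictionaries, and I add add-alpha smoothing (an assumption on my part) so that unseen words do not produce a zero probability.

import math
from collections import defaultdict

def train(documents, labels):
    """documents: list of {word: count} dicts; labels: list of 0/1 labels."""
    label_counts = defaultdict(int)                      # for the prior P(L_j)
    word_counts = defaultdict(lambda: defaultdict(int))  # for P(W_ji | L_j)
    total_words = defaultdict(int)
    for doc, lab in zip(documents, labels):
        label_counts[lab] += 1
        for word, count in doc.items():
            word_counts[lab][word] += count
            total_words[lab] += count
    return label_counts, word_counts, total_words

def classify(doc, label_counts, word_counts, total_words, vocab_size, alpha=1.0):
    """Return the label maximising log P(L_j) + sum_i W_ji * log P(W_ji | L_j)."""
    n_docs = float(sum(label_counts.values()))
    best_label, best_score = None, float('-inf')
    for lab in label_counts:
        score = math.log(label_counts[lab] / n_docs)     # the prior, by counting
        for word, count in doc.items():
            # add-alpha smoothing avoids log(0) for unseen words
            p = (word_counts[lab][word] + alpha) / (total_words[lab] + alpha * vocab_size)
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = lab, score
    return best_label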

Okay, that's the known Naive Bayes method. Resnik & Hardisty's method simulates this Naive Bayes approach for text classification by offering this generative story of the documents to classify:
  1. The label $L_j$ of a given document is randomly generated by flipping a coin with heads probability $\pi = P(L_j=1)$. In other words and more formally, $L_j$ is sampled from a Bernoulli distribution with probability $\pi$.
  2. Words in document $j$ are generated randomly and independently of each other, according to a probability distribution that depends on whether $L_j = 0$ or $L_j = 1$. This is simulated by throwing a multi-faceted die with as many faces as there are different words; we throw the die once per word we want to generate. The feature $W_{ji}$ indicates the number of times word $i$ has been obtained. In other words and more formally, $W_{ji}$ is sampled from a Multinomial distribution with parameter $\theta$ depending on the value of $L_j$.
Note that step 2 is a little more informed than in a generic Naive Bayes approach, since we are saying that the features $W_{ji}$ are not drawn from some unknown distribution: they follow a Multinomial distribution, and this fact is exploited by Resnik & Hardisty.
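To make the generative story concrete, here is a small sketch that generates artificial documents in exactly this way using numpy. The function name, parameter names and the example values of $\pi$, $\theta_0$ and $\theta_1$ are my own illustrative choices, not taken from the nbgibbs code.

import numpy as np

def generate_corpus(n_docs, doc_length, pi, theta_0, theta_1):
    """Generate (word counts W, labels L) following the generative story above."""
    labels = np.random.binomial(1, pi, size=n_docs)   # step 1: flip the coin per document
    thetas = [np.asarray(theta_0), np.asarray(theta_1)]
    # step 2: throw the word die doc_length times; W[j, i] = count of word i in document j
    counts = np.array([np.random.multinomial(doc_length, thetas[l]) for l in labels])
    return counts, labels

# For example: 50 documents of 100 word tokens over a 5-word vocabulary,
# similar to the artificial data described later in this post.
W, L = generate_corpus(50, 100, pi=0.5,
                       theta_0=[0.4, 0.3, 0.1, 0.1, 0.1],
                       theta_1=[0.1, 0.1, 0.1, 0.3, 0.4])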

Given this generative story, the joint distribution of all document word frequencies $W$, labels $L$, label probability $\pi$, and word distributions $\theta_0$ and $\theta_1$ is derived as described in the paper (I won't go into this here). After some simplifications, the task is handed over to the Gibbs sampler, which is used to estimate the values of this joint distribution in the manner explained so well in the paper. Basically, the idea of the Gibbs sampler is to resample each parameter in turn: remove its current value, compute the probability of the parameter given all the other parameters, and draw a new value from that conditional distribution. This is repeated over all parameters for a number of iterations until we think we have reached a good value... more about this later. The final algorithm looks like this:

At every iteration, given the previously computed counts of all words and documents associated with each label class:
  1. For every document $j$:
    If $j$ is not a training document then:
    • Subtract $j$'s word counts from the total word counts associated with label class $L_j$
    • Subtract $1$ from the count of documents with label $L_j$
    • Assign a new label $L_j$ as described in the paper
    • Add $1$ to the count of documents with label $L_j$ (note that $L_j$ may be different now)
    • Add $j$'s word counts to the total word counts associated with label class $L_j$
  2. Sample new values of $\theta_0$ given the word counts associated with label $0$
  3. Sample new values of $\theta_1$ given the word counts associated with label $1$
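To make the loop concrete, here is a rough sketch of one iteration in Python/numpy. It is my own rendering under assumptions about the data layout, not the nbgibbs code: the exact conditional for the new label (which also involves the hyperparameters of the label prior) is derived in the paper, and here I only use its intuitive proportional form, current class proportion times the likelihood of the document's word counts. Steps 2 and 3 are Dirichlet draws with a symmetric hyperparameter beta, following the paper's model.

import numpy as np

def gibbs_iteration(W, L, train_mask, doc_counts, word_counts, theta, beta=1.0):
    """One Gibbs iteration, updating L, doc_counts, word_counts and theta in place.

    W           : (n_docs, vocab) array of word counts.
    L           : array of current labels (0/1).
    train_mask  : True for training documents, whose labels are never resampled.
    doc_counts  : length-2 array, number of documents currently in each class.
    word_counts : (2, vocab) array, total word counts per class.
    theta       : (2, vocab) array, current word distributions per class.
    """
    for j in range(W.shape[0]):
        if train_mask[j]:
            continue
        old = L[j]
        word_counts[old] -= W[j]     # subtract j's word counts from its class
        doc_counts[old] -= 1         # and decrement that class's document count
        # New label: (roughly) class proportion times the likelihood of j's word
        # counts under that class's theta.  Written naively here and prone to
        # underflow; the log-space fix is sketched after the next paragraph.
        p = np.array([doc_counts[x] * np.prod(theta[x] ** W[j]) for x in (0, 1)])
        L[j] = np.random.binomial(1, p[1] / (p[0] + p[1]))
        word_counts[L[j]] += W[j]    # add j back under the (possibly new) label
        doc_counts[L[j]] += 1
    # Steps 2 and 3: resample each theta from its Dirichlet posterior.
    theta[0] = np.random.dirichlet(word_counts[0] + beta)
    theta[1] = np.random.dirichlet(word_counts[1] + beta)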
For practical reasons I had to make a simple modification. The problem lies in computing the new label $L_j$, which involves a product of $\theta$ values raised to the power of word counts. This computation can easily produce numbers so close to zero that they underflow the floating-point representation of the computer. So I worked with logarithms, changing products into sums and exponents into multiplications. Then, since the final operation in the computation involved a division, I simply subtracted the log values and obtained the final result by raising $e$ to the power of the difference.
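In code, the log-space computation can look like the sketch below, a drop-in replacement for the naive label step in the previous sketch (again my own sketch, not the nbgibbs code; the branch at the end simply keeps the exponent non-positive so the final exponentiation is safe).

import numpy as np

def sample_label(w, doc_counts, theta):
    """Log-space version of the label step: w is the word-count vector of the
    current document, doc_counts the per-class document counts with the current
    document removed, theta the (2, vocab) array of word distributions."""
    # Products become sums, exponents become multiplications (a dot product of
    # the word counts with the log-probabilities).
    log_p = np.array([np.log(doc_counts[x]) + np.dot(w, np.log(theta[x]))
                      for x in (0, 1)])
    # The final division becomes a subtraction of logs, and the result is
    # recovered by raising e to that power.
    if log_p[1] >= log_p[0]:
        p1 = 1.0 / (1.0 + np.exp(log_p[0] - log_p[1]))
    else:
        p1 = np.exp(log_p[1] - log_p[0]) / (1.0 + np.exp(log_p[1] - log_p[0]))
    return np.random.binomial(1, p1)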

I tried the program with artificial documents generated exactly as described in the generative model, and I was pleased to see how well it was able to predict the labels given very few samples.
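For reference, one run over an artificial document set can be sketched by tying the pieces above together; this is my own illustration of that kind of test, not the actual nbgibbs experiment code.

import numpy as np

def run_one_set(W, true_labels, train_fraction, n_iterations=5, beta=1.0):
    """Run the sampler on one document set and return the test-set accuracy."""
    n_docs = W.shape[0]
    n_train = int(train_fraction * n_docs)
    train_mask = np.zeros(n_docs, dtype=bool)
    train_mask[:n_train] = True

    # Initialise: training labels are known, test labels start as coin flips.
    L = np.where(train_mask, true_labels, np.random.binomial(1, 0.5, n_docs))
    doc_counts = np.array([np.sum(L == 0), np.sum(L == 1)])
    word_counts = np.array([W[L == 0].sum(axis=0), W[L == 1].sum(axis=0)])
    theta = np.array([np.random.dirichlet(word_counts[x] + beta) for x in (0, 1)])

    for _ in range(n_iterations):
        gibbs_iteration(W, L, train_mask, doc_counts, word_counts, theta, beta)

    # Accuracy on the test documents, using the labels of the last iteration.
    test = ~train_mask
    return np.mean(L[test] == true_labels[test])

# For example, with the corpus W, L generated by the earlier sketch:
# accuracy = run_one_set(W, L, train_fraction=0.2)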

I also noticed that only two to three iterations were usually required to obtain stable values of $L_j$ (though the values of $\theta$ were not yet stable at that point). The following figure shows the histograms of 10 accuracy distributions. Each distribution is based on 1000 randomly generated document sets, each partitioned into a training and a test set. Each document set contains 50 documents of 100 samples from a multinomial over 5 distinct words, and the Gibbs sampler ran for 5 iterations. I chose the labels of the last iteration (instead of the mode of the last three iterations, which is probably wiser). Each plot corresponds to a different ratio between training and test set. We can see how quickly the system learns:

Note that, for a ratio of 0, the system does not use any training set, so the process is completely unsupervised. The distribution is symmetric because in half of the cases the labels 0 and 1 end up swapped. So, really, if we view the unsupervised process as a case of clustering, its accuracy is effectively double what the raw histogram suggests, since the runs near zero are near-perfect clusterings with inverted labels. As we increase the percentage of training documents, the system has more information to avoid the inversion. I actually found it surprising that we need up to 50% of the documents for training to get rid of the near-zero accuracies; I expected we would need only a handful of documents.

Still, we can see that this approach is very effective with very few training documents. Of course the figure is based on an ideal case where the documents are generated exactly according to the generative model. The next step would be to test the robustness of the model on real-world documents, i.e., on documents that do not follow the generative model.

The implementation is available at http://sourceforge.net/p/nbgibbs. Feel free to download and use the code, and to give feedback here or on the SourceForge page.

Sunday 1 May 2011

Problems with the video drivers in Ubuntu 11.04

Last Friday I upgraded my desktop to Ubuntu 11.04, and yesterday Saturday I upgraded my laptop. In both cases I had problems with the video drivers, and I think I solved them. Just in case someone is having similar problems, here I'll write about the problems I had and what I did to solve them. I'm writing this from memory, so I may have missed some of the many things I tried, done them in the wrong order, or mixed up the exact names of the software I used and the options I selected... but hopefully this information is still useful.

My desktop computer has a dual-monitor setup and an nvidia graphics card. I didn't manage to make the nvidia drivers work with the previous release, so I was using the default drivers, which worked well enough for me since I didn't use the computer for fancy graphics or games. But when I upgraded to Ubuntu 11.04 I had problems with the display.

On the first startup, the monitors froze after logging in. I could only see the desktop background: no menus, no mouse pointer, nothing. Just the background image. I restarted a couple of times into Ubuntu and into the other OS, Windows 7, but whenever I logged into Ubuntu I could only see the background image.

The first thing I did was to boot into the recovery session and select the option to repair broken packages. The system didn't report anything special: no packages were found broken, so nothing was done. And again, after logging in, I could only see the background image.

Next I tried to log in with the failsafe display. Login was successful, but the display behaved in a very strange manner. The monitor on the left had a black screen and the mouse pointer, whereas the monitor on the right had the desktop contents. To click on an item on the monitor on the right I had to use the mouse on the monitor on the left and predict where I was clicking... a very strange setup and very frustrating!

So I disconnected one monitor and restarted again. Now the mouse and the desktop were on the same monitor, but there were no window decorations, so I could not move or resize the windows. And it wasn't displaying the new "unity" desktop but the classic one.

I tried several options. First of all I logged in again, this time asking to re-configure the graphics display (an option I saw when logging in with the failsafe session), but again the windows had no decorations and the desktop had the classic setup instead of the "unity" setup.

Finally I asked to use the generic display when logging in with the failsafe session, and I managed to get the "unity" graphics. I asked to make the new graphics setting permanent, plugged in the second monitor, and logged in again. The "unity" graphics appeared, but the contents of the two monitors were replicated. I opened the monitor manager and selected the "extended desktop" option. Now everything was displayed as I wanted! I was so excited to see it apparently working, and so late for dinner, that I went home without checking whether the windows had decorations... I'll find out tomorrow Monday.

Yesterday Saturday I upgraded the laptop. During the upgrade the system reported errors with the xorg package, so I expected the worst. And indeed, after restarting, the system would not even show the login page. Instead, the monitor showed a mix of colourful pixels.

The first thing I did was to select the option to repair broken packages in the recovery session, and to my relief it reported that some packages were repaired. Yet after restarting, the monitor still showed the mix of colourful pixels in an apparently random distribution. Next I selected the option to reconfigure the graphics, but again nothing interesting happened.

Next I selected the option to use the generic display, and now I could log in and see the desktop. Success! The windows had decorations, but the desktop setup was the classic one, not the "unity" setup, because a message said that the laptop didn't have the hardware to run unity. I couldn't believe that, since the laptop has an nvidia card which is supposed to be powerful enough. I opened the "restricted drivers" interface but the nvidia driver wasn't shown there. I searched in the Synaptic package manager and found several drivers, but they could not be installed because the source was not enabled. So I enabled the source and opened the "restricted drivers" interface again. Now I could see the recommended nvidia drivers there, great! I selected the recommended one, and after restarting I could see the unity interface.

Success! Now I can use the laptop, and hopefully the desktop at my office will work too.

The next thing is to get used to the unity interface. So far it doesn't look as useful as the classic one, so I may end up switching back, but let's give it a few more days.

Monday 21 March 2011

How to install a Python module in Windows?

The Python documentation is very detailed about how to install Python modules: you simply type something like this in a terminal window:

python setup.py install

The problem is, how do you do this on a Windows PC? I tried the following, but it didn't work:
  1. Open a command prompt by typing "cmd" in the "Run..." option of the Start menu (I'm using Windows XP)
  2. Navigate to the folder where we downloaded the Python module
  3. Type python setup.py install
When I tried that I got this error message:

'python' is not recognized as an internal or external command, operable program or batch file.

The problem is that the path of the executable python.exe is not listed in the PATH environment variable. I got around the problem by typing the exact location of the executable:

C:\python27\python setup.py install
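(Another way around this, if you don't want to type the full path every time, is to add the Python folder to the PATH of the current command prompt session. This is just standard cmd syntax, nothing specific to Python:

set PATH=%PATH%;C:\Python27

After that, plain python setup.py install works until you close that command prompt.)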

But then I got another error! It turns out that the package I wanted to install requires another package called setuptools... back to the drawing board.

Fortunately I found documentation about setuptools. The module is packaged for a range of platforms including Windows, so I only needed to run its installer to install the package. And the good thing about this module is that it provides a program called easy_install, which is located in C:\Python27\scripts. With this program you can install many other packages: it will download the package, install it, and resolve any dependencies by downloading and installing additional packages.

So, to install a new Python package, say suds, on a Windows PC, my current sequence of steps is:

  1. Open a command prompt as before.
  2. Type C:\Python27\scripts\easy_install suds
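To check that the package is now available, you can ask Python to import it (using suds as the example again):

C:\Python27\python -c "import suds"

If that prints no error, the installation worked.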

End of story! And it works, for me anyway. You can find additional documentation about easy_install here.