
Monday, 29 August 2011

Implementation of Gibbs sampling for Naive Bayes text classification

This post is about my implementation of the Gibbs sampler described by Resnik & Hardisty (2010), Gibbs sampling for the uninitiated, CS-TR-4956. The paper is a very good tutorial on Gibbs sampling, and the authors work through an example of Gibbs sampling for text classification using "Naive" Bayes. I add the quotation marks because the approach presented is slightly more informed than the traditional Naive Bayes method.

In the traditional Naive Bayes method, given a document j, the task is to find the label L_j that maximises the probability P(L_j|W_j), where W_j is a vector of features. Applying Bayes' rule, the problem is transformed into one of finding the label that maximises:

P(L_j|W_j) = \frac{P(W_j|L_j) P(L_j)}{P(W_j)}


Since P(W_j) is constant for a given document j, we can ignore the denominator. The prior P(L_j) is estimated by counting the number of times label L_j is assigned to a document in the training data and dividing by the number of training documents.

To compute P(W_j|L_j) we introduce the naive assumption that all features are independent of each other, so that P(W_j|L_j) is the product of P(W_{ji}|L_j) over every individual feature value W_{ji}. Each individual P(W_{ji}|L_j) can then be estimated using the maximum likelihood estimate on the training data, that is, by counting.
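To make the counting concrete, here is a minimal sketch of this classical Naive Bayes estimation in Python. This is not the implementation linked at the end of the post; the function names, the representation of W as a document-by-word count matrix, and the add-one smoothing are my own choices for illustration.

import numpy as np

def train_nb(W, L, n_labels=2):
    """Estimate P(L) and P(w|L) by counting on the training data.
    W is an (n_docs, vocab_size) matrix of word counts, L an array of labels."""
    priors = np.array([(L == k).sum() for k in range(n_labels)]) / len(L)
    likelihoods = np.empty((n_labels, W.shape[1]))
    for k in range(n_labels):
        counts = W[L == k].sum(axis=0) + 1.0    # add-one smoothing (my addition)
        likelihoods[k] = counts / counts.sum()  # P(w_i | L = k)
    return priors, likelihoods

def classify(w, priors, likelihoods):
    """Return the label maximising log P(L) + sum_i w_i * log P(w_i | L)."""
    scores = np.log(priors) + w @ np.log(likelihoods).T
    return int(np.argmax(scores))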

Okay, that is the well-known Naive Bayes method. Resnik & Hardisty's method mirrors this Naive Bayes approach to text classification by offering the following generative story for the documents to classify:
  1. The label L_j of a given document is randomly generated by flipping a coin with heads probability \pi = P(L_j=1). In other words and more formally, L_j is sampled from a Bernoulli distribution with probability \pi.
  2. Words in document j are generated randomly and independently of each other, according to a probability that depends on whether L_j = 0 or L_j = 1. This is simulated by throwing a multi-faceted die with as many faces as there are distinct words; we throw the die once per word we want to generate. The feature W_{ji} indicates the number of times word i has been obtained. In other words and more formally, W_{ji} is sampled from a Multinomial distribution whose probability distribution \theta depends on the value of L_j.
Note that step 2 is a little more informed than in a generic Naive Bayes approach, since we are saying that the features W_{ji} are not drawn from some unknown distribution: they come from a Multinomial distribution, and this fact is exploited by Resnik & Hardisty.
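As an illustration, this generative story can be sketched in a few lines of Python. The particular values of \pi, \theta_0 and \theta_1 below are arbitrary; the 5-word vocabulary and 100-word documents match the experiments described later in this post.

import numpy as np

rng = np.random.default_rng(0)

def generate_corpus(n_docs, doc_len, pi, theta0, theta1):
    """Step 1: flip a coin with heads probability pi to obtain each label L_j.
    Step 2: draw the word-count vector W_j from Multinomial(doc_len, theta_{L_j})."""
    labels = rng.binomial(1, pi, size=n_docs)
    thetas = np.vstack([theta0, theta1])
    W = np.array([rng.multinomial(doc_len, thetas[l]) for l in labels])
    return W, labels

# Example with 5 distinct words and 100 words per document
theta0 = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
theta1 = np.array([0.1, 0.1, 0.1, 0.3, 0.4])
W, labels = generate_corpus(50, 100, pi=0.5, theta0=theta0, theta1=theta1)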

Given this generative story, the joint distribution of all document word frequencies W, labels L, label probability \pi, and word distributions \theta_0 and \theta_1 is derived as described in the paper (I won't go into this here). After some simplifications, the task is handed over to the Gibbs sampler, which is used to estimate these quantities in the manner that the paper explains so well. Basically, the Gibbs sampler visits one variable at a time: it removes the variable's current value, computes the probability distribution of that variable given all the others, and samples a new value from it. This is repeated over all variables a number of times until we think we have reached good values... more about this later. The final algorithm looks like this (a code sketch follows the list):

At every iteration, given the previously computed counts of all words and documents associated with each label class:
  1. For every document j:
    If j is not a training document then:
    • Subtract j's word counts from the total word counts associated with label class L_j
    • Subtract 1 from the count of documents with label L_j
    • Sample a new label L_j as described in the paper
    • Add 1 to the count of documents with label L_j (note that L_j may be different now)
    • Add j's word counts to the total word counts associated with label class L_j
  2. Sample new values of \theta_0 given the word counts associated with label 0
  3. Sample new values of \theta_1 given the word counts associated with label 1
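Here is a minimal sketch of one such iteration in Python. This is not the actual implementation linked below; I am assuming a conditional for L_j of the form used in the paper's derivation, namely proportional to the number of documents currently assigned to each class (plus a hyperparameter) times the multinomial likelihood \prod_i \theta_{L_j,i}^{W_{ji}}, with each \theta resampled from a Dirichlet over the class word counts plus a hyperparameter; the exact hyperparameter terms in the paper may differ slightly.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(W, labels, is_train, doc_counts, word_counts, theta,
                gamma_pi=1.0, gamma_theta=1.0):
    """One iteration of the sampler sketched above.
    doc_counts[k]: number of documents currently labelled k.
    word_counts[k]: total word counts over documents currently labelled k."""
    for j in range(len(W)):
        if is_train[j]:
            continue                              # training labels stay fixed
        k = labels[j]
        doc_counts[k] -= 1                        # remove document j from its class
        word_counts[k] -= W[j]
        # Unnormalised conditional for each candidate label
        # (direct products; these can underflow, see the note below)
        p = np.array([(doc_counts[x] + gamma_pi) * np.prod(theta[x] ** W[j])
                      for x in (0, 1)])
        labels[j] = rng.binomial(1, p[1] / p.sum())   # sample the new label
        doc_counts[labels[j]] += 1                # add document j back
        word_counts[labels[j]] += W[j]
    # Steps 2 and 3: resample theta_0 and theta_1 from a Dirichlet
    theta = np.vstack([rng.dirichlet(word_counts[x] + gamma_theta)
                       for x in (0, 1)])
    return theta

Before the first sweep, the test-document labels can be initialised at random, the counts computed from that initial assignment, and theta initialised with a first Dirichlet draw (or uniformly).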
For practical reasons I had to make a simple modification. The problem is the computation of the new label L_j, which involves a product of \theta values raised to the power of word counts. This product can easily yield numbers so close to zero that they fall below the computer's floating-point range (underflow). So I used logarithmic values, changing products into sums and exponents into multiplications. Then, since the final operation in the computation involved a division, I simply subtracted the log values and obtained the final result by raising e to the power of the difference.
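For illustration, the label-probability computation in log space might look something like this (again a sketch with my own function name, using the same assumed form of the conditional as in the previous snippet):

import numpy as np

def label_one_probability(W_j, doc_counts, theta, gamma_pi=1.0):
    """P(L_j = 1 | everything else), computed in log space:
    products become sums, exponents become multiplications, and the final
    division becomes a subtraction followed by exponentiation."""
    log_p = np.array([np.log(doc_counts[x] + gamma_pi) + np.dot(W_j, np.log(theta[x]))
                      for x in (0, 1)])
    # P(1) = exp(log_p1) / (exp(log_p0) + exp(log_p1)) = 1 / (1 + exp(log_p0 - log_p1))
    return 1.0 / (1.0 + np.exp(log_p[0] - log_p[1]))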

I tried the program with artificial documents generated exactly as described in the generative model, and I was pleased to see how well it was able to predict the labels given very few samples.

I also noticed that only two to three iterations were usually required to obtain stable values of L_j (though the values of \theta were not yet stable at that point). The following figure shows the histograms of 10 distributions. Each distribution uses 1000 randomly generated document sets, each partitioned into a training and a test set. Each set contains 50 documents of 100 samples from a multinomial over 5 distinct words, and the Gibbs sampling used 5 iterations. I chose the labels of the last iteration (instead of the mode of the last three iterations, which is probably wiser). Each plot represents a distribution with a different ratio between training and test set. We can appreciate how quickly the system learns:

Note that, for a ratio of 0, the system does not use any training set and therefore the process is completely unsupervised. We see a symmetric distribution due to inversions between 0 and 1 in half of the cases. So, really, if we view the unsupervised process as a case of clustering, its accuracy should be doubled. As we increase the percentage of training documents, the system has more information to avoid inversion. I actually found it surprising that we need up to 50% of the documents for training to get rid of the zeros (fully inverted labellings); I expected we would need only a handful of documents.

Still, we can see that this approach is very effective with very few training documents. Of course, the figure is based on an ideal case in which the documents are generated exactly according to the generative model. The next step would be to test the robustness of the model on real-world documents, i.e., on documents that do not follow the generative model.

The implementation is available at http://sourceforge.net/p/nbgibbs. Feel free to download and use the code, and to give feedback here or on the SourceForge page.