This post is related to the Gibbs sampling implementation described by Resnik & Hardisty (2010), Gibbs sampling for the uninitiated, CS-TR-4956. The referred paper is a very good tutorial on Gibbs sampling, and they work through an example of Gibbs sampling for text classification using "Naive" Bayes. I add quotation marks because the approach presented is slightly more informed than the traditional Naive Bayes method.
In the traditional Naive Bayes method, the idea is that, given a document $j$, the task is to find the label $L_j$ that maximises the probability $P(L_j|W_j)$, where $W_j$ is a vector of features. This approach uses the Bayes rule, and as a result the problem is transformed to one of finding the label that maximises:
$$
P(L_j|W_j) = \frac{P(W_j|L_j) P(L_j)}{P(W_j)}
$$
Since $P(W_j)$ is constant in the document $j$, we can ignore the denominator. The prior $P(L_j)$ is estimated by counting the number of times label $L_j$ is assigned to a document in the training data, and dividing by the number of training documents.
To compute $P(W_j|L_j)$ we introduce the naive assumption that all features are independent of each other, and therefore $P(W_j|L_j)$ is the product of all $P(W_{ji}|L_i)$, for each $W_{ji}$ separate feature value. Each individual $P(W_{ji}|L_i)$ can then be estimated using the maximum likelihood estimate on the training data, that is, by counting.
Okay, that's the known Naive Bayes method. Resnik & Hardisty's method simulates this Naive Bayes approach for text classification by offering this generative story of the documents to classify:
Given this generative story, the joint distribution of all document word frequencies $W$, labels $L$, label probability $\pi$, and word distributions $\theta_0$ and $\theta_1$ are computed as described in the paper (I won't go into this here). After some simplifications, the task is handed over to the Gibbs sampler, which is used to estimate the values of this joint distribution in the manner that is explained in the paper so well. Basically, the idea of the Gibbs sampler is to sample a new value of a parameter by removing this parameter, computing the probability of this parameter given the other parameters, and sampling a new value of the parameter. This is done with all parameters a number of times until we think we have reached a good value... more about this later. The final algorithm looks like this:
At every iteration, given the counts previously computed of all words and documents associated to a label class:
I tried the program with artificial documents generated exactly as described in the generative model, and I was pleased to see how well it was able to predict the labels given very few samples.
I also noticed that only two to three iterations were usually required to obtain stable values of $L_j$ (though the values of theta were not stable at that moment). The following figure represents the histograms of 10 distributions. Each distribution uses 1000 document sets randomly generated and then partitioned into a training and a test set. Each set contains 50 documents of 100 samples from a multinomial with 5 distinct words, and the Gibbs sampling used 5 iterations. I chose the labels of the last iteration (instead of the mode of the last three iterations, which is probably wiser). Each plot represents a distribution where the ratio between training and test set differs. We can appreciate how quickly the system learns:
Note that, for a ratio of 0, the system does not use any training set and therefore the process is completely unsupervised. We appreciate a symmetric distribution due to inversions between 0 and 1 in half of the cases. So, really, the accuracy of the unsupervised process, as we view it as a case of clustering, should be doubled. As we increase the percentage of training documents, the system has more information to avoid inversion. I actually found it surprising that we need up to 50% of the documents for training to get rid of the zeros, I expected we would need only a handful of documents.
But we can really see that this approach is very effective with very few documents. Of course the figure is based on an ideal case when the documents are generated exactly as per the generative model. The next step would be to test the robustness of the model against real-world documents, i.e., against documents that do not follow the generative model.
The implementation is available in http://sourceforge.net/p/nbgibbs. Feel free to download and use the code, and give feedback here or in the sourceforge page.
In the traditional Naive Bayes method, the idea is that, given a document $j$, the task is to find the label $L_j$ that maximises the probability $P(L_j|W_j)$, where $W_j$ is a vector of features. This approach uses the Bayes rule, and as a result the problem is transformed to one of finding the label that maximises:
$$
P(L_j|W_j) = \frac{P(W_j|L_j) P(L_j)}{P(W_j)}
$$
Since $P(W_j)$ is constant in the document $j$, we can ignore the denominator. The prior $P(L_j)$ is estimated by counting the number of times label $L_j$ is assigned to a document in the training data, and dividing by the number of training documents.
To compute $P(W_j|L_j)$ we introduce the naive assumption that all features are independent of each other, and therefore $P(W_j|L_j)$ is the product of all $P(W_{ji}|L_i)$, for each $W_{ji}$ separate feature value. Each individual $P(W_{ji}|L_i)$ can then be estimated using the maximum likelihood estimate on the training data, that is, by counting.
Okay, that's the known Naive Bayes method. Resnik & Hardisty's method simulates this Naive Bayes approach for text classification by offering this generative story of the documents to classify:
- The label $L_j$ of a given document is randomly generated by flipping a coin with heads probability $\pi = P(L_j=1)$. In other words and more formally, $L_j$ is sampled from a Bernoulli distribution with probability $\pi$.
- Words in the document $j$ are generated randomly and independently of each other, according to a probability that depends on the chosen $L_j =0$ or $L_j = 1$. This is simulated by throwing a multi-faceted die with as many faces as different words. We throw the die once per word we want to generate. The feature $W_{ji}$ indicates the number of times a word $i$ has been obtained. In other words and more formally, $W_{ji}$ is sampled from a Multinomial distribution with probability distribution theta depending on the value of $L_j$.
Given this generative story, the joint distribution of all document word frequencies $W$, labels $L$, label probability $\pi$, and word distributions $\theta_0$ and $\theta_1$ are computed as described in the paper (I won't go into this here). After some simplifications, the task is handed over to the Gibbs sampler, which is used to estimate the values of this joint distribution in the manner that is explained in the paper so well. Basically, the idea of the Gibbs sampler is to sample a new value of a parameter by removing this parameter, computing the probability of this parameter given the other parameters, and sampling a new value of the parameter. This is done with all parameters a number of times until we think we have reached a good value... more about this later. The final algorithm looks like this:
At every iteration, given the counts previously computed of all words and documents associated to a label class:
- For every document $j$:
If $j$ is not a training document then:- Subtract $j$'s word counts from the total word counts associated to label class $L_j$
- Subtract $1$ from the count of documents with label $L_j$
- Assign a new label $L_j$ as described in the paper
- Add $1$ to the count of documents with label $L_j$ (note that $L_j$ may be different now)
- Add $j$'s word counts to the total word counts associated to label class $L_j$
- Sample new values of $\theta_0$ given word counts associated to label $0$
- Sample new values of $\theta_1$ given word counts associated to label $1$
I tried the program with artificial documents generated exactly as described in the generative model, and I was pleased to see how well it was able to predict the labels given very few samples.
I also noticed that only two to three iterations were usually required to obtain stable values of $L_j$ (though the values of theta were not stable at that moment). The following figure represents the histograms of 10 distributions. Each distribution uses 1000 document sets randomly generated and then partitioned into a training and a test set. Each set contains 50 documents of 100 samples from a multinomial with 5 distinct words, and the Gibbs sampling used 5 iterations. I chose the labels of the last iteration (instead of the mode of the last three iterations, which is probably wiser). Each plot represents a distribution where the ratio between training and test set differs. We can appreciate how quickly the system learns:
Note that, for a ratio of 0, the system does not use any training set and therefore the process is completely unsupervised. We appreciate a symmetric distribution due to inversions between 0 and 1 in half of the cases. So, really, the accuracy of the unsupervised process, as we view it as a case of clustering, should be doubled. As we increase the percentage of training documents, the system has more information to avoid inversion. I actually found it surprising that we need up to 50% of the documents for training to get rid of the zeros, I expected we would need only a handful of documents.
But we can really see that this approach is very effective with very few documents. Of course the figure is based on an ideal case when the documents are generated exactly as per the generative model. The next step would be to test the robustness of the model against real-world documents, i.e., against documents that do not follow the generative model.
The implementation is available in http://sourceforge.net/p/nbgibbs. Feel free to download and use the code, and give feedback here or in the sourceforge page.