\[\definecolor{lviolet}{RGB}{114,0,172} \definecolor{lgreen}{RGB}{45,177,93} \definecolor{lred}{RGB}{251,0,29} \definecolor{lblue}{RGB}{18,110,213} \definecolor{circle}{RGB}{217,86,16} \definecolor{average}{RGB}{203,23,206}\]
\[ {\color{lgreen} P ( \theta | D )} \color{black} = \frac{\color{lblue}P ( D | \theta ) \color{black} \color{lviolet}P( \theta)} {\color{black}P ( D )}\]
Given observations \( D \), we update our beliefs about the parameters \( \theta \) which govern the underlying data generating process. We compute the posterior distribution of the parameters as the product of the likelihood times the prior beliefs about the parameters, suitably normalized.
Scenario: We toss a coin repeatedly and would like to estimate the probability \( \theta \) of it landing heads. This is a simple sequence of Bernoulli experiments, where \( X_1, \ldots, X_n \sim \operatorname{Bern}( \theta) \). If we toss the coin three times and observe \( D = \{ H,H,H \} \), the likelihood of the data is given by
\[ \color{lblue} P(D|\theta) = \theta^3 (1 - \theta)^0 = \theta^3. \]
The MLE is \( \hat \theta = 1 \). As we can see, the frequentist approach yields point estimates that are entirely driven by the data. In contrast, a Bayesian analysis requires a prior for \( \theta \) to be specified.
A Beta prior is conjugate to the Bernoulli likelihood and yields tractable results. \[ P(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1 } \] The posterior distribution then evaluates to \[ P(\theta | D) \propto \theta^{3 + \alpha -1} (1 - \theta)^{ 0 + \beta - 1} \]
The MAP (maximum a posteriori) estimate is the mode of this Beta posterior, \(\hat \theta = \frac{3 + \alpha - 1}{3 + 0 + \alpha + \beta - 2}\). With \(\alpha = \beta = 100\), we get \(\hat \theta = \frac{102}{201} \approx 0.51\). Almost no influence of the data! Conclusion: Whereas frequentist point estimates trust the data entirely, Bayesian estimates are conservative in nature: much more data is required before the beliefs encoded in the prior distribution are washed out. You might ask: What does this have to do with the modeling of text documents?
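Before turning to text, here is a minimal numerical sketch of the coin example in Python (the variable names are mine; the counts and the hyperparameters \(\alpha = \beta = 100\) are the ones used above):

```python
# A minimal sketch of the coin example above (not part of the original notes):
# 3 heads, 0 tails, and a Beta(alpha, beta) prior on theta.
heads, tails = 3, 0
alpha, beta = 100, 100

# Frequentist MLE: driven entirely by the data.
theta_mle = heads / (heads + tails)

# Bayesian MAP estimate: mode of the Beta(heads + alpha, tails + beta) posterior.
theta_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print(f"MLE: {theta_mle:.3f}")  # 1.000
print(f"MAP: {theta_map:.3f}")  # 0.507 -- the strong prior barely moves
```

With a weaker prior such as \(\alpha = \beta = 1\) (a uniform prior), the MAP estimate coincides with the MLE of 1.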
Generalization of the Bernoulli distribution: the multinomial. Assume we have access to a vocabulary of all possible words:
\[ V = \{ \color{circle} Cat \color{black}, Dog, Lunch, \color{average} Mouse \color{black}, \color{lred} Eats \color{black}, \ldots \} \]
We make the bag-of-words assumption: the order of words in a document does not matter. The probability of the text \(D_i = \text{"Cat eats mouse"}\) given the word distribution \(\vec \theta\) is then \[ P(D_i \mid \vec \theta) \propto \color{circle} \theta_1 \color{black} \cdot \color{average} \theta_4 \color{black} \cdot \color{lred} \theta_5 \]
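As a small sketch of this bag-of-words computation (the vocabulary indices follow the list above; the concrete values in `theta` are made up purely for illustration):

```python
# Vocabulary with fixed indices, following the list above (0-based here).
vocab = {"cat": 0, "dog": 1, "lunch": 2, "mouse": 3, "eats": 4}

# Illustrative word distribution theta (values made up, not from the notes).
theta = [0.2, 0.2, 0.1, 0.1, 0.05]

def doc_prob(words, theta):
    """Bag-of-words probability: product of per-word probabilities,
    ignoring word order (up to the multinomial coefficient)."""
    p = 1.0
    for w in words:
        p *= theta[vocab[w]]
    return p

print(doc_prob(["cat", "eats", "mouse"], theta))  # theta_1 * theta_4 * theta_5 = 0.001
```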
\(K\) topics, each endowed with a word distribution \(\vec \theta_k\). For example, consider again
\[ V = \{ \color{circle} Cat, Dog, \color{lblue} Lunch, \color{circle} Mouse \color{black}, \color{lblue} Eats \color{black}, \ldots \} \]
Then we might have \[ \color{circle} \vec \theta_{\text{Animals}} = (0.2 , 0.2, 0.03, 0.1, 0.05, \ldots) \] and \[ \color{lblue} \vec \theta_{\text{Food}} = (0.03,0.04,0.3,0.07,0.2, \ldots) \]
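To see how the two topics explain the same document differently, here is a small sketch that scores "Cat eats mouse" under each of the word distributions given above (only the first five vocabulary entries are used; the scoring function mirrors the bag-of-words product from before):

```python
# Vocabulary indices as above: Cat, Dog, Lunch, Mouse, Eats (0-based here).
vocab = {"cat": 0, "dog": 1, "lunch": 2, "mouse": 3, "eats": 4}

# First five entries of the topic-specific word distributions from above.
theta_animals = [0.2, 0.2, 0.03, 0.1, 0.05]
theta_food    = [0.03, 0.04, 0.3, 0.07, 0.2]

def doc_prob(words, theta):
    # Bag-of-words product of per-word probabilities under one topic.
    p = 1.0
    for w in words:
        p *= theta[vocab[w]]
    return p

doc = ["cat", "eats", "mouse"]
print(doc_prob(doc, theta_animals))  # 0.2 * 0.05 * 0.1  = 0.001
print(doc_prob(doc, theta_food))     # 0.03 * 0.2 * 0.07 = 0.00042
```

Under the Animals topic the document is more probable than under the Food topic, matching the intuition that "Cat eats mouse" is about animals.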
Each document is created as follows: