March 31, 2009

Latent Dirichlet allocation

"Latent Dirichlet allocation," D. Blei, A. Ng, and M. Jordan. . Journal of Machine Learning Research, 3:993–1022, January 2003

Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments. The basic idea of LDA is that each document is represented as a random mixture over latent topics, where each topic generates words from a fixed conditional distribution, and words and topics are infinitely exchangeable within a document. Compared with other latent topic models, LDA overcomes the limiting one-topic-per-document assumption of the mixture of unigrams model and the overfitting problem of pLSI by treating the topic mixture weights as a k-parameter hidden random variable, which yields a smooth distribution on the topic simplex. For inference, LDA finds the optimal variational parameters by minimizing the KL divergence between the variational distribution and the true posterior, and applies a variational EM algorithm to obtain approximate empirical Bayes estimates of the parameters alpha and beta. In conclusion, LDA is a simple model for dimensionality reduction, and its modularity and extensibility make it easy to embed in more complex models and applications.
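As a concrete illustration of the generative process described above, here is a minimal sketch in Python (not from the paper). The number of topics K, the vocabulary size V, and the specific values of alpha and beta are illustrative assumptions, and numpy is used for the Dirichlet and multinomial draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper): K topics, V vocabulary words.
K, V = 3, 10
alpha = np.full(K, 0.1)                    # Dirichlet prior over per-document topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)   # K x V topic-word distributions (rows sum to 1)

def generate_document(n_words):
    """Sample one document following LDA's generative process."""
    theta = rng.dirichlet(alpha)           # per-document topic mixture, a point on the topic simplex
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)         # draw a latent topic for this word position
        w = rng.choice(V, p=beta[z])       # draw a word from that topic's word distribution
        words.append(w)
    return words

print(generate_document(8))
```

Because theta is drawn per document rather than per corpus, each document can mix several topics, which is exactly what distinguishes LDA from the single-topic mixture of unigrams model.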
