Image Primer: September 2014

Topic Modelling using LDA:

In machine learning, feature extraction is a critical step when working with text data. Commonly used approaches include LDA (Latent Dirichlet Allocation) and TF-IDF (Term Frequency-Inverse Document Frequency). In this post, I will go through the basics of topic modelling using LDA in Python.

The algorithm by M. Hoffman et al. [1] uses stochastic optimization to maximize the variational objective function for the Latent Dirichlet Allocation (LDA) topic model. From a programming point of view, it is easier to use an existing LDA implementation in Python [2]. Finding a good number of topics is an important step. The best value depends on the data, including the number and types of attributes, so it's important to try several values by trial and error.
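To make the idea concrete, here is a minimal, dependency-free sketch of LDA on a toy corpus. Note that it uses collapsed Gibbs sampling, not the online variational method of [1], simply because it fits in a few lines. The function names (`fit_lda`, `avg_log_likelihood`), the hyperparameters `alpha` and `eta`, and the tiny corpus are all my own assumptions for illustration; for real data you would use a library such as [2].

```python
import math
import random

def fit_lda(docs, n_topics, n_iter=100, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables updated during sampling
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start with a random topic for every token
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token from the counts
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + eta)
                    / (topic_total[k] + n_words * eta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                # Put the token back under its new topic
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed, normalized estimates: documents x topics and topics x words
    theta = [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
             for row in doc_topic]
    phi = [[(c + eta) / (topic_total[k] + n_words * eta) for c in topic_word[k]]
           for k in range(n_topics)]
    return theta, phi, vocab

def avg_log_likelihood(docs, theta, phi, vocab):
    """Average per-token log-likelihood under the fitted model."""
    word_id = {w: i for i, w in enumerate(vocab)}
    total, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][k] * phi[k][word_id[w]] for k in range(len(phi)))
            total += math.log(p)
            n_tokens += 1
    return total / n_tokens

# Hypothetical toy corpus, already tokenized
docs = [
    "ball team goal match team".split(),
    "match goal player ball".split(),
    "election vote party vote".split(),
    "party election minister vote".split(),
]

# Trial and error over the number of topics
for n_topics in (2, 3):
    theta, phi, vocab = fit_lda(docs, n_topics)
    print(n_topics, round(avg_log_likelihood(docs, theta, phi, vocab), 3))
```

Each row of `theta` is the topic mixture of one document and can serve as its feature vector. The score here is computed on the training documents, which tends to favour more topics, so a fairer comparison would score held-out documents instead.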

The basic flow to create a classifier is as follows. First, feature extraction is done using LDA. Singular Value Decomposition (SVD) is then applied to the generated feature vectors for feature reduction, because some unimportant features may be present. Finally, different classifiers can be trained on the reduced feature vectors.
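The reduction and classification steps of this flow can be sketched as follows, again without external dependencies. Everything here is an assumption made for illustration: the document-topic vectors and labels are made-up toy data, the SVD step is approximated by plain power iteration with deflation, and a nearest-centroid rule stands in for whichever classifier you would actually choose.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_right_singular_vectors(rows, n_components, n_iter=200, seed=0):
    """Approximate the leading right singular vectors by power iteration."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for _ in range(n_iter):
            # One power-iteration step: v <- A^T (A v)
            av = [dot(row, v) for row in rows]
            v = [sum(a * row[j] for a, row in zip(av, rows))
                 for j in range(n_features)]
            # Deflation: remove the directions that were already found
            for c in components:
                overlap = dot(v, c)
                v = [x - overlap * y for x, y in zip(v, c)]
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        components.append(v)
    return components

def reduce_features(rows, components):
    """Project each feature vector onto the retained directions."""
    return [[dot(row, c) for c in components] for row in rows]

def fit_centroids(features, labels):
    """Nearest-centroid 'training': average the vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to vec."""
    def sq_dist(label):
        return sum((x - y) ** 2 for x, y in zip(vec, centroids[label]))
    return min(centroids, key=sq_dist)

# Hypothetical document-topic vectors (4 topics), as LDA might produce
doc_topic = [
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.70, 0.20],
    [0.05, 0.05, 0.60, 0.30],
    [0.10, 0.05, 0.65, 0.20],
]
labels = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]

# Feature reduction: keep 2 of the 4 dimensions
components = top_right_singular_vectors(doc_topic, n_components=2)
reduced = reduce_features(doc_topic, components)

# Train on the reduced vectors, then classify a new document
centroids = fit_centroids(reduced, labels)
new_doc = [0.60, 0.20, 0.10, 0.10]
new_reduced = reduce_features([new_doc], components)[0]
print(predict(centroids, new_reduced))
```

The important detail is that a new document must be projected with the same `components` that were computed from the training vectors before it is passed to the classifier. In practice, a library SVD routine and a proper classifier would replace both helpers.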

[1] "Online Learning for Latent Dirichlet Allocation" by Matthew D. Hoffman, David M. Blei, and Francis Bach, presented at NIPS 2010.

[2] https://pypi.python.org/pypi/lda

September 14, 2014

Topic Modelling of Text Data using LDA in Python