Wednesday, 29 December 2010
This blog will document an experimental use of topic modelling to develop a site for browsing a multithreaded collection of documents. We start from a collection of documents (each viewed as a bag of words), and use Latent Dirichlet Allocation (LDA) to model each document as a mixture of topics. A topic is a probability distribution over words. Once we fix the number of topics, LDA provides a set of topics, together with the proportions in which they should be mixed in each document to best approximate our collection.
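To make the "mixture of topics" idea concrete, here is a small sketch in Python. It does not run LDA; it only shows how a document's word distribution follows once the topics and the mixing proportions are known. The topic names, words and probabilities are invented for illustration and do not come from the cables.

```python
# Toy illustration (all names and numbers are made up, not MALLET output).
# A topic is a probability distribution over words.
topics = {
    "trade": {"tariff": 0.5, "export": 0.4, "embassy": 0.1},
    "diplomacy": {"embassy": 0.6, "minister": 0.3, "tariff": 0.1},
}

# A document is described by the proportions in which the topics are mixed.
doc_mixture = {"trade": 0.25, "diplomacy": 0.75}


def document_word_distribution(mixture, topics):
    """Return the word distribution implied by a mixture of topics."""
    words = {}
    for topic, weight in mixture.items():
        for word, prob in topics[topic].items():
            # Each topic contributes its word probability, scaled by its weight.
            words[word] = words.get(word, 0.0) + weight * prob
    return words


word_probs = document_word_distribution(doc_mixture, topics)
for word, prob in sorted(word_probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {prob:.3f}")
```

LDA works in the opposite direction: it sees only the words in each document and infers both the topics and the proportions.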
We use the MALLET toolkit from UMass Amherst to perform this analysis.
We use the Wikileaks #cablegate collection of cables as an example corpus.
This is a work in progress. We hope it will rapidly improve.
Two documents are similar if they are modelled by similar collections of topics, so we can present the structure of the collection of cables by linking each one to its nearest neighbour. You can access the source of each document by clicking on the node that represents it.
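The nearest-neighbour linking can be sketched roughly as follows. This is an illustration under assumptions, not the code behind the site: the distance measure (Hellinger distance between topic proportions) is our choice for the example, and the cable names and proportions are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


def nearest_neighbours(doc_topics):
    """Link each document to the other document with the closest topic mixture."""
    links = {}
    for name, mixture in doc_topics.items():
        candidates = [
            (hellinger(mixture, other_mixture), other)
            for other, other_mixture in doc_topics.items()
            if other != name
        ]
        links[name] = min(candidates)[1]  # smallest distance wins
    return links


# Hypothetical topic proportions for three documents (each row sums to 1).
doc_topics = {
    "cable_a": [0.8, 0.1, 0.1],
    "cable_b": [0.7, 0.2, 0.1],
    "cable_c": [0.1, 0.1, 0.8],
}

links = nearest_neighbours(doc_topics)
print(links)
```

Every document gets exactly one outgoing link, so the result is a sparse graph that stays readable even for a large collection. Note that the relation is not symmetric: a document's nearest neighbour may itself be closer to some third document.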
Just as two documents can be linked by a common topic, two topics can be linked by the documents they have in common. So we can also present the structure of the set of topics inferred by LDA in a similar (actually dual) fashion. You can inspect the most frequent words in each topic by hovering over the node that represents it.
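One way to read the duality: if the document-topic proportions are laid out as a table with one row per document, then each column describes a topic by the documents that use it. The sketch below transposes such a table and links each topic to its most similar topic. The cosine similarity and the sample numbers are assumptions made for the example, not a description of what the site computes.

```python
import math


def transpose(doc_topics):
    """Turn per-document topic proportions into per-topic document weights."""
    names = list(doc_topics)
    n_topics = len(doc_topics[names[0]])
    return [[doc_topics[name][t] for name in names] for t in range(n_topics)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_topics(topic_docs):
    """Link each topic index to the other topic that shares the most documents."""
    links = {}
    for i, u in enumerate(topic_docs):
        candidates = [
            (cosine(u, v), j) for j, v in enumerate(topic_docs) if j != i
        ]
        links[i] = max(candidates)[1]  # largest similarity wins
    return links


# Hypothetical topic proportions for three documents and three topics.
doc_topics = {
    "cable_a": [0.6, 0.3, 0.1],
    "cable_b": [0.5, 0.4, 0.1],
    "cable_c": [0.1, 0.1, 0.8],
}

topic_links = nearest_topics(transpose(doc_topics))
print(topic_links)
```

Here topics 0 and 1 are linked to each other because they appear together in the same two documents, which is the sense in which the topic view mirrors the document view.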
We are now working on linking these two views.
We are eager to have feedback on the interface and how it can help to discover information in a previously unconnected collection of documents.