Thursday 30 December 2010

Update

The index page at http://bit.ly/wikitopics now provides the briefest of instructions, and access to the topic and document maps. This blog is at http://wikitopics.blogspot.com

I've made the links between topics and their nearest neighbours active. Hover over the link between two topics to see the ID of a cable that supports this link, and click on the link to visit this document.

I'll add similar functionality to the document maps tomorrow. There, the links between documents are supported by common topics.

I also plan to provide local maps of the neighbourhood of each document in the document space, and tools for selecting documents by choosing combinations of topics.
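As a rough idea of what selecting documents by a combination of topics could look like, here is a minimal Python sketch. It is not the site's code: the cable IDs, the topic proportions, the function name `select_documents` and the 0.2 threshold are all made up for illustration.

```python
# Hypothetical topic proportions for three cables (each row sums to 1).
doc_topics = {
    "cable_a": [0.5, 0.4, 0.1],
    "cable_b": [0.1, 0.2, 0.7],
    "cable_c": [0.3, 0.1, 0.6],
}


def select_documents(doc_topics, chosen_topics, threshold=0.2):
    """Return the IDs of documents in which every chosen topic
    has a proportion of at least `threshold` (an assumed cut-off)."""
    return sorted(
        doc_id
        for doc_id, weights in doc_topics.items()
        if all(weights[k] >= threshold for k in chosen_topics)
    )


# Documents that mix topic 0 and topic 2 in noticeable amounts.
selected = select_documents(doc_topics, [0, 2])
```

A real tool would probably let the reader adjust the threshold, since a fixed cut-off is only one of several reasonable choices.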

Wednesday 29 December 2010

Wikileaks Topics

cables


topics


This blog will document an experimental use of topic modelling to develop a site for browsing a multithreaded collection of documents. We start from a collection of documents (each viewed as a bag of words), and use Latent Dirichlet Allocation (LDA) to model each document as a mixture of a number of topics. A topic is a probability distribution over words. Once we choose a fixed number of topics, LDA provides a set of topics and the proportions in which they should be mixed in each document to best approximate our collection.
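To make "a mixture of topics" concrete, here is a small Python sketch. It does not fit an LDA model. It only shows how a set of topics and a set of mixing proportions together imply a distribution over words for one document. The topic names, words and numbers are invented for illustration.

```python
# Hypothetical toy topics: each one is a probability distribution over words.
topics = {
    "trade": {"export": 0.5, "tariff": 0.3, "embassy": 0.2},
    "security": {"border": 0.6, "embassy": 0.4},
}


def mix_topics(proportions, topics):
    """Return the word distribution obtained by mixing the topics
    in the given proportions (which should sum to 1)."""
    word_probs = {}
    for name, weight in proportions.items():
        for word, p in topics[name].items():
            word_probs[word] = word_probs.get(word, 0.0) + weight * p
    return word_probs


# A document that is three quarters "trade" and one quarter "security".
doc_mixture = {"trade": 0.75, "security": 0.25}
doc_word_probs = mix_topics(doc_mixture, topics)
```

LDA works in the opposite direction: it starts from the observed words and infers both the topics and the proportions that best explain them.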

We use the MALLET tools from UMASS to perform this analysis.

We use the Wikileaks #cablegate collection of cables as an example corpus.

This is a work in progress. We hope it will rapidly improve.

Two documents are similar if they are modelled by similar collections of topics, so we can present the structure of the collection of cables by linking each one to its nearest neighbour. You can access the source of each document by clicking on the node that represents it.
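The nearest-neighbour linking can be sketched in a few lines of Python. This is an illustration under assumptions, not the code behind the maps: we assume cosine similarity between topic-proportion vectors as the measure of closeness, and the cable IDs and proportions are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(doc_topics):
    """Map each document ID to the ID of its most similar other document."""
    links = {}
    for doc_id, vec in doc_topics.items():
        scored = [
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in doc_topics.items()
            if other_id != doc_id
        ]
        links[doc_id] = max(scored)[1]
    return links


# Hypothetical topic proportions for three cables.
doc_topics = {
    "cable_a": [0.8, 0.1, 0.1],
    "cable_b": [0.7, 0.2, 0.1],
    "cable_c": [0.1, 0.1, 0.8],
}
links = nearest_neighbours(doc_topics)
```

Other measures, such as a divergence between the two topic distributions, would serve equally well here. The all-pairs loop is fine for a toy example but would need an index or approximation for a large collection.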

Just as two documents can be linked by a common topic, two topics can be linked by the documents they have in common. So we can also present the structure of the set of topics inferred by LDA in a similar (actually dual) fashion. You can inspect the most frequent words in each topic by hovering over the node that represents it.
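The duality amounts to reading the same table of proportions by column instead of by row. The Python sketch below transposes a toy document-by-topic table and then picks, for a pair of topics, one document that supports a link between them. The scoring rule (the product of the two topic proportions) is our assumption for illustration, as are the cable IDs and numbers.

```python
def transpose(doc_topics):
    """Turn {doc: [weight per topic]} into {topic index: {doc: weight}}."""
    topic_docs = {}
    for doc_id, weights in doc_topics.items():
        for k, w in enumerate(weights):
            topic_docs.setdefault(k, {})[doc_id] = w
    return topic_docs


def supporting_document(topic_docs, k1, k2):
    """Pick the document in which topics k1 and k2 are jointly most
    prominent, scored by the product of their proportions (an assumption)."""
    return max(
        topic_docs[k1],
        key=lambda d: topic_docs[k1][d] * topic_docs[k2][d],
    )


# Hypothetical topic proportions for three cables.
doc_topics = {
    "cable_a": [0.5, 0.4, 0.1],
    "cable_b": [0.1, 0.2, 0.7],
    "cable_c": [0.3, 0.1, 0.6],
}
topic_docs = transpose(doc_topics)
support_0_1 = supporting_document(topic_docs, 0, 1)
support_0_2 = supporting_document(topic_docs, 0, 2)
```

Once the table is transposed, each topic is a vector over documents, so the same nearest-neighbour idea used for documents applies to topics unchanged.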

We are now working on linking these two views.

We are eager to have feedback on the interface and how it can help to discover information in a previously unconnected collection of documents.