Dynamic Clustering of Streaming Short Documents
published: Sept. 25, 2016, recorded: August 2016, views: 1525
Report a problem or upload filesIf you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Clustering technology has found numerous applications in mining textual data. It was shown to enhance the performance of retrieval systems in various different ways, such as identifying different query aspects in search result diversification, improving smoothing in the context of language modeling, matching queries with documents in a latent topic space in ad-hoc retrieval, summarizing documents etc. The vast majority of clustering methods have been developed under the assumption of a static corpus of long (and hence textually rich) documents. Little attention has been given to streaming corpora of short text, which is the predominant type of data in Web 2.0 applications, such as social media, forums, and blogs. In this paper, we consider the problem of dynamically clustering a streaming corpus of short documents. The short length of documents makes the inference of the latent topic distribution challenging, while the temporal dynamics of streams allow topic distributions to change over time. To tackle these two challenges we propose a new dynamic clustering topic model - DCT - that enables tracking the time-varying distributions of topics over documents and words over topics. DCT models temporal dynamics by a short-term or long-term dependency model over sequential data, and overcomes the difficulty of handling short text by assigning a single topic to each short document and using the distributions inferred at a certain point in time as priors for the next inference, allowing the aggregation of information. At the same time, taking a Bayesian approach allows evidence obtained from new streaming documents to change the topic distribution. Our experimental results demonstrate that the proposed clustering algorithm outperforms state-of-the-art dynamic and non-dynamic clustering topic models in terms of perplexity and when integrated in a cluster-based query likelihood model it also outperforms state-of-the-art models in terms of retrieval quality.
Link this pageWould you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !