Mining Topics in Documents: Standing on the Shoulders of Big Data

author: Zhiyuan (Brett) Chen, Department of Computer Science, University of Illinois at Chicago
published: Oct. 8, 2014,   recorded: August 2014,   views: 3673


Related Open Educational Resources

Related content

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.


Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. However, in practice, many document collections do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recently years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-link (meaning that two words should be in the same topic) and cannot-link (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.

See Also:

Download slides icon Download slides: kdd2014_chen_mining_topics_01.pdf (590.1┬áKB)

Help icon Streaming Video Help

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Reviews and comments:

Comment1 Hung, October 19, 2015 at 10:15 a.m.:

Sorry, I really want to revive your said but I can not because My english is very bad.Thanks

Write your own review or comment:

make sure you have javascript enabled or clear this field: