Document Clustering via Dirichlet Process Mixture Model with Feature Selection

author: Guan Yu, Hong Kong Polytechnic University
published: Oct. 1, 2010, recorded: July 2010, views: 5692



An essential issue in document clustering is estimating the appropriate number of clusters into which a document collection should be partitioned. In this paper, we propose a novel approach, namely DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters, with the number of clusters determined automatically by the Dirichlet process mixture model; and 2) to identify the discriminative words and separate them from irrelevant noise words via a stochastic search variable selection technique. We explore the performance of our proposed approach on both a synthetic dataset and several realistic document datasets. The comparison between our proposed approach and state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.
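
To illustrate how a Dirichlet process prior can leave the number of clusters open, the following minimal sketch draws one partition from a Chinese restaurant process, the prior over partitions that a Dirichlet process induces. It is not the DPMFS algorithm from the talk: it ignores document content and feature selection entirely, and the function name `crp_partition`, the concentration parameter `alpha`, and the `seed` argument are assumptions made for this illustration only.

```python
import random


def crp_partition(n_items, alpha, seed=0):
    """Draw one partition of n_items from a Chinese restaurant process prior.

    Hypothetical helper for illustration; alpha (> 0) is the concentration
    parameter. Larger alpha tends to produce more clusters.
    """
    rng = random.Random(seed)
    assignments = []  # cluster index assigned to each item, in arrival order
    sizes = []        # current number of items in each cluster
    for _ in range(n_items):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, sizes


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 10.0):
        _, sizes = crp_partition(200, alpha, seed=1)
        print(f"alpha={alpha}: {len(sizes)} clusters, sizes={sizes}")
```

The number of clusters is never passed in; it emerges from the draw. In a full mixture model, these prior weights would be combined with the likelihood of each document under each cluster before sampling an assignment.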

See Also:

Download slides: kdd2010_yu_dcd_01.ppt (489.0 KB)


Reviews and comments:

Comment 1, Andrew Polar, March 9, 2012 at 7:18 p.m.:

In research presented on my site, I found that mathematical methods such as LSA, PLSA, LDA and SVM work much worse than elementary vector approaches such as HAC, NB, RF, k-NN, k-means and SOM. I can achieve the same result as shown in the report by identifying and filtering out common words, meaning those that are evenly distributed over all files, and using the remaining words to create clusters. The advantage of a mathematical approach can be proven only by comparison to a non-mathematical method such as HAC. Experiments by the author of a method are always biased and should be ignored. Only experiments made by someone the author never met, conducted on data the author never saw, can be a proof of concept. When I conducted such experiments, I found the complete opposite of what the authors of the methods said.
