Document Clustering via Dirichlet Process Mixture Model with Feature Selection
published: Oct. 1, 2010, recorded: July 2010, views: 5691
Description
An essential issue in document clustering is estimating the appropriate number of clusters into which a document collection should be partitioned. In this paper, we propose a novel approach, namely DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters, with the number of clusters determined automatically by the Dirichlet process mixture model; and 2) to identify the discriminative words and separate them from irrelevant noise words via a stochastic search variable selection technique. We explore the performance of the proposed approach on both a synthetic dataset and several realistic document datasets. A comparison between our approach and state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.
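To illustrate why a Dirichlet process prior does not require the number of clusters to be fixed in advance, the following is a minimal sketch of the Chinese restaurant process, the partition distribution induced by a Dirichlet process. It samples from the prior only and is not the DPMFS inference procedure; the function name `crp_assignments`, the concentration parameter name `alpha`, and the `seed` argument are assumptions made for this illustration.

```python
import random


def crp_assignments(n_docs, alpha, seed=0):
    """Sample cluster labels for n_docs items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (hypothetical name): larger
    values make opening a new cluster more likely.
    """
    rng = random.Random(seed)
    labels = []
    sizes = []  # sizes[k] = number of items currently assigned to cluster k
    for _ in range(n_docs):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1    # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    labels = crp_assignments(n_docs=200, alpha=1.0)
    print("clusters used:", len(set(labels)))
```

The number of occupied clusters is an outcome of the sampling rather than an input. In a full mixture model, these prior weights would be combined with the likelihood of each document under each cluster.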
Reviews and comments:
In research presented on my site (semanticsearchart.com), I found that mathematical methods such as LSA, PLSA, LDA and SVM work much worse than elementary vector approaches such as HAC, NB, RF, kNN, k-means and SOM. I can achieve the same result as shown in the report by identifying common words as those that are evenly distributed over all files, filtering them out, and using the remaining words to create clusters. The advantage of a mathematical approach can be proven only by comparison with a non-mathematical method such as HAC. Experiments by the author of a method are always biased and should be ignored. Only experiments made by someone the author has never met, conducted on data the author has never seen, can be a proof of concept. When I conducted such experiments, I found the complete opposite of what the authors of the methods said.
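The word-filtering idea described in the comment above could be sketched as follows. This is one possible reading, not the commenter's actual procedure: "evenly distributed" is interpreted here as a high normalized entropy of a word's counts across documents, and the names `evenness`, `discriminative_vocabulary` and the threshold `max_evenness` are all assumptions.

```python
import math


def evenness(word, docs):
    """Normalized entropy of a word's counts across docs.

    docs is a list of token lists. The result is 1.0 when the word occurs
    equally often in every document and 0.0 when it occurs in only one.
    """
    counts = [doc.count(word) for doc in docs]
    total = sum(counts)
    if total == 0 or len(docs) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(len(docs))


def discriminative_vocabulary(docs, max_evenness=0.9):
    """Keep only words whose spread across documents is not too even."""
    vocab = {w for doc in docs for w in doc}
    return {w for w in vocab if evenness(w, docs) <= max_evenness}


if __name__ == "__main__":
    docs = [["the", "cat"], ["the", "dog"], ["the", "stock"], ["the", "stock"]]
    print(sorted(discriminative_vocabulary(docs)))
```

The retained words would then be passed to whichever clustering method is being used; the threshold would need tuning on real data.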