Web Spam Challenge 2007 Track II - Secure Computing Corporation Research
published: Jan. 28, 2008, recorded: September 2007, views: 5652
Report a problem or upload filesIf you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
To discriminate spam Web hosts/pages from normal ones, text-based and link-based data are provided forWeb Spam Challenge Track II. Given a small part of labeled nodes (about 10%) in aWeb linkage graph, the challenge is to predict other nodes’ class to be spam or normal.We extract features from link-based data, and then combine them with text-based features. After feature scaling, Support Vector Machines (SVM) and Random Forests (RF) are modeled in the extremely high dimensional space with about 5 million features. Stratified 3-fold cross validation for SVM and out-of-bag estimation for RF are used to tune the modeling parameters and estimate the generalization capability. On the small corpus for Web host classification, the best F-Measure value is 75.46% and the best AUC value is 95.11%. On the large corpus for Web page classification, the best F-Measure value is 90.20% and the best AUC value is 98.92%.
Link this pageWould you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !