Multi-Label Learning with Millions of Categories
published: May 28, 2013, recorded: September 2012, views: 3274
Our objective is to build an algorithm for classifying a data point into a set of labels when the output space contains millions of categories. This is a relatively novel setting in supervised learning and brings forth interesting challenges: efficient training and prediction, learning from only positively labeled data with missing and incorrect labels, and handling label correlations. We propose a random-forest-based solution that tackles these issues jointly. We develop a novel extension of random forests for multi-label classification which can learn from positive data alone and can scale to large data sets. We generate real-valued beliefs indicating the state of each label and adapt our classifier to train on these belief vectors so as to compensate for missing and noisy labels. In addition, we modify the random forest cost function to avoid overfitting in high-dimensional feature spaces and to learn short, balanced trees. Finally, we write highly efficient training routines which let us train on problems with more than a hundred million data points, sparse feature vectors with over a million dimensions, and over ten million categories. Extensive experiments reveal that our proposed solution is not only significantly better than other multi-label classification algorithms but also more than 10% better than the state-of-the-art NLP-based techniques for suggesting bid phrases for online search advertisers.
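The abstract does not give the modified cost function, so the following is only an illustrative sketch of the general idea: scoring a candidate tree split on real-valued belief vectors while penalizing unbalanced children. Everything here is an assumption made for illustration, not the authors' method: the function names, the use of a Gini impurity over normalized mean beliefs, the `balance_weight` parameter, and the binary "feature present / absent" split rule.

```python
from collections import defaultdict


def label_distribution(beliefs):
    """Sum sparse belief vectors (dict: label -> belief) and normalize
    the totals into a distribution over labels."""
    totals = defaultdict(float)
    for vec in beliefs:
        for label, value in vec.items():
            totals[label] += value
    mass = sum(totals.values())
    if mass == 0:
        return {}
    return {label: v / mass for label, v in totals.items()}


def gini(beliefs):
    """Gini impurity of the normalized mean belief vector (assumed measure)."""
    dist = label_distribution(beliefs)
    if not dist:
        return 0.0
    return 1.0 - sum(p * p for p in dist.values())


def split_cost(left, right, balance_weight=0.5):
    """Size-weighted impurity of the two children plus a hypothetical
    penalty that grows as the children become unequal in size."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
    imbalance = abs(len(left) - len(right)) / n
    return impurity + balance_weight * imbalance


def best_feature_split(points, features, balance_weight=0.5):
    """points: list of (set of active sparse features, belief dict).
    Returns (cost, feature) for the cheapest non-trivial split, or None."""
    best = None
    for f in features:
        left = [b for feats, b in points if f in feats]
        right = [b for feats, b in points if f not in feats]
        if not left or not right:
            continue  # skip splits that leave one child empty
        cost = split_cost(left, right, balance_weight)
        if best is None or cost < best[0]:
            best = (cost, f)
    return best
```

In this sketch, each data point carries a sparse set of active features and a sparse dictionary of label beliefs, which mirrors the sparse, positive-only setting described above. A larger `balance_weight` favors splits that divide the data evenly, which tends to keep trees short.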