A Dirty Model for Multi-task Learning
published: Jan. 12, 2011, recorded: December 2010, views: 575
We consider the multiple linear regression problem, in a setting where some of the relevant features could be shared across the tasks. A lot of recent research has studied the use of L1/Lq norm block-regularizations with q > 1 for such (possibly) block-structured problems, establishing strong guarantees on recovery even under high-dimensional scaling, where the number of features scales with the number of observations. However, these papers also caution that the performance of such block-regularized methods depends heavily on the extent to which the features are shared across tasks. Indeed, they show that if the extent of overlap is less than a threshold, or even if the parameter values in the shared features are highly uneven, then block L1/Lq regularization could actually perform worse than simple separate elementwise L1 regularization. We are far away from a realistic multi-task setting: not only does the set of relevant features have to be exactly the same across tasks, but their values have to be as well. Here, we ask the question: can we leverage support and parameter overlap when it exists, but not pay a penalty when it does not? Indeed, this falls under a more general question of whether we can model such dirty data, which may not fall into a single neat structural bracket (all block-sparse, or all low-rank, and so on). Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we decompose the parameters into two components and regularize these differently. We show, both theoretically and empirically, that our method strictly and noticeably outperforms both L1 and L1/Lq methods over the entire range of possible overlaps. We also provide theoretical guarantees that the method performs well under high-dimensional scaling.
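The decomposition idea can be illustrated with a minimal sketch. It is not the authors' implementation; it only assumes that the parameter matrix (one row per feature, one column per task) is split into a block-sparse part B and an elementwise-sparse part S. The block norm here uses q = 2 purely because its shrinkage step has a simple closed form; the abstract only requires q > 1. All function names and the weights `lam_b` and `lam_s` are hypothetical.

```python
import math

def soft_threshold(S, lam):
    """Elementwise L1 shrinkage: move every entry of S toward zero by lam."""
    return [[math.copysign(max(abs(v) - lam, 0.0), v) for v in row] for row in S]

def row_group_threshold(B, lam):
    """Block L1/L2 shrinkage: scale each row (one feature across all tasks)
    by its L2 norm, so a feature is kept or dropped for all tasks together."""
    out = []
    for row in B:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out

def dirty_penalty(B, S, lam_b, lam_s):
    """Combined regularizer: block norm on B plus elementwise L1 norm on S."""
    block = sum(math.sqrt(sum(v * v for v in row)) for row in B)
    elementwise = sum(abs(v) for row in S for v in row)
    return lam_b * block + lam_s * elementwise

def combine(B, S):
    """Full parameter matrix used for prediction: the sum of both components."""
    return [[b + s for b, s in zip(rb, rs)] for rb, rs in zip(B, S)]

if __name__ == "__main__":
    B = [[3.0, 4.0], [0.3, 0.4]]   # shared-support component
    S = [[0.0, 0.0], [0.0, 1.5]]   # task-specific component
    print(row_group_threshold(B, 1.0))
    print(soft_threshold(S, 1.0))
    print(dirty_penalty(B, S, 2.0, 0.5))
```

In a proximal-gradient style solver, these two shrinkage steps would be applied to B and S separately after each gradient step on the squared loss of `combine(B, S)`; that loop is omitted here to keep the sketch short.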