Follow the Leader with Dropout Perturbations
published: July 15, 2014, recorded: June 2014, views: 80
Report a problem or upload filesIf you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L∗ of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L∗.
Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(L∗lnK−−−−−−√+lnK) regret as a function of L∗, and optimal O(lnK) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others.
A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.
Link this pageWould you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !