Contextual Bandit Algorithms with Supervised Learning Guarantees, incl. discussion by Brendan McMahan

author: Brendan McMahan, Google, Inc.
author: Lev Reyzin, Georgia Institute of Technology
published: May 6, 2011,   recorded: April 2011,   views: 6103


Related Open Educational Resources

Related content

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.


We address the problem of competing with any large set of $N$ policies in the non-stochastic bandit setting, where the learner must repeatedly select among $K$ actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of Auer et al. called Exp4.P, which with high probability incurs regret at most $O(\sqrt{KT\ln N})$. Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically in a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC-dimension $d$ while incurring regret at most $ O(\sqrt{Td\ln T})$ with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning.

See Also:

Download slides icon Download slides: aistats2011_reyzin_bandit_01.pdf (1.8┬áMB)

Help icon Streaming Video Help

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Write your own review or comment:

make sure you have javascript enabled or clear this field: