Using upper confidence bounds to control exploration and exploitation

Published on 2007-02-25, 6493 views

NIPS Workshop 2006 - Whistler

Using upper confidence bounds to control exploration and exploitation 00:01

Contents 00:09

Exploration vs. Exploitation 01:15

Exploration vs. Exploitation: Some Applications 02:08

Bandit Problems – “Optimism in the Face of Uncertainty” 03:06

Parametric Bandits [Lai & Robbins] 05:32

Bounds 07:16

UCB1 Algorithm (Auer et al., 2002) 08:26

TITLE 10:41

Bandits in Continuous Time 11:02

Formal framework 13:00

Evaluating allocation rules (policies) 14:24

Gain, action values and regret 16:19

Model-based UCB 19:11

Algorithm 22:07

Regret bound 24:52

Key proposition 28:37

Open problems 29:40

Levente Kocsis, Remi Munos 31:06

Bandits with large action-spaces 31:23

Structure helps! 31:46

UCT: Upper Confidence based Tree search 32:51

Example (t=1) 33:13

Example (t=2) 34:18

Example (t=3) 34:45

Example (t=4) 35:04

What is the next time a suboptimal action is sampled? 35:17

UCT variations 36:01

Theoretical results 37:53

Planning in MDPs: Sailing 39:16

Results in games 40:22

Thank you! 41:02
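The outline names the UCB1 algorithm (Auer et al., 2002) and the principle of "optimism in the face of uncertainty" without stating the rule itself. For reference, here is a minimal Python sketch of the standard UCB1 rule: play each arm once, then always play the arm with the largest empirical mean plus the exploration bonus sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j the number of plays of arm j. UCT, also listed in the outline, applies a rule of this form at each node of the search tree when choosing which child to descend into. The Bernoulli arms, the function name `ucb1`, and its parameters are illustrative assumptions, not material from the talk.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on hypothetical Bernoulli arms and return the pull count per arm.

    arm_means: success probability of each arm (an illustrative assumption).
    horizon:   total number of plays, which must be at least the number of arms.
    """
    k = len(arm_means)
    if horizon < k:
        raise ValueError("horizon must allow every arm to be played once")
    rng = random.Random(seed)
    counts = [0] * k    # n_j: number of times arm j has been played
    sums = [0.0] * k    # total reward collected from arm j

    def pull(j):
        # Draw a Bernoulli reward from arm j and update its statistics.
        reward = 1.0 if rng.random() < arm_means[j] else 0.0
        counts[j] += 1
        sums[j] += reward

    # Initialisation: play every arm once so each n_j is positive.
    for j in range(k):
        pull(j)

    # n is the number of plays made so far (k after initialisation).
    for n in range(k, horizon):
        # Optimism: empirical mean plus the bonus sqrt(2 ln n / n_j).
        indices = [
            sums[j] / counts[j] + math.sqrt(2.0 * math.log(n) / counts[j])
            for j in range(k)
        ]
        pull(max(range(k), key=indices.__getitem__))

    return counts


counts = ucb1([0.2, 0.5, 0.8], horizon=10_000)
print(counts)  # most plays should go to the arm with mean 0.8
```

Because the bonus shrinks as an arm is played more often, suboptimal arms are still sampled occasionally but increasingly rarely, which is the exploration and exploitation trade-off the talk discusses.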