Policy Search for RL

Published on Jul 27, 2017
8517 Views

Chapter list

Reinforcement learning - policy optimization 00:00
Reinforcement Learning 00:38
Policy optimization 01:22
Policy optimization - 1 02:03
Why policy optimization 03:40
Example Policy Optimization Success Stories 05:08
Policy Optimization in the RL Landscape 07:25
Outline 08:17
Pathwise Derivatives (PD) / BackPropagation Through Time (BPTT) 09:49
Pathwise Derivatives (PD) / BackPropagation Through Time (BPTT) - 1 10:53
Path Derivative for Stochastic f - Additive Noise 13:56
Path Derivative for Stochastic f - reparameterization trick 14:41
Stochastic Dynamics f 15:24
Stochastic f, R and π_θ 15:41
Stochastic f, R and π_θ and s0 16:01
PD/BPTT Policy Gradients: Complete Algorithm 16:22
SVG(inf) 20:56
SVG variants 22:04
SVG(1) 24:26
SVG(0) 27:09
SVG(k) 30:17
SVG(0) -> DPG 30:39
Deep Deterministic Policy Gradient (DDPG) 31:28
DDPG Results 32:33
Outline - 1 34:03
Black Box Gradient Computation 34:38
Solution 2: Fix random seed 35:02
Solution 2: Fix random seed - 1 35:46
Solution 2: Fix random seed - 2 36:24
Learning to Hover 36:49
Gradient-Free Methods 37:24
Cross-Entropy Method 37:58
Cross-Entropy Method - 1 39:33
Closely Related Approaches 40:13
Applications 41:48
Cross-Entropy / Evolutionary Methods 42:42
Considerations 43:13
Outline - 2 45:51
Likelihood Ratio Policy Gradient 52:39
Likelihood Ratio Policy Gradient - 1 53:50
Derivation from Importance Sampling 56:28
Likelihood Ratio Gradient: Validity 58:24
Likelihood Ratio Gradient: Intuition 58:59
Let’s decompose path into states and actions 01:00:39
Likelihood ratio gradient estimate 01:01:51
Likelihood ratio gradient estimate - 1 01:02:10
Likelihood ratio gradient estimate: baseline 01:02:35
Likelihood ratio and temporal structure 01:04:10
Pseudo-code reinforce aka vanilla policy gradient 01:05:08
Outline - 3 01:06:02
Step-sizing and trust regions 01:06:14
What’s in a step-size? 01:06:29
Step-sizing and trust regions - 1 01:07:25
Step-sizing and trust regions - 2 01:08:04
Evaluating the KL 01:08:35
Evaluating the KL - 1 01:09:19
Evaluating the KL - 2 01:11:04
Evaluating the KL - 3 01:11:43
Experiments in Locomotion 01:13:02
Learning Curves - Comparison 01:13:54
Learning Curves - Comparison - 1 01:14:08
Atari Games 01:14:13
Outline - 4 01:14:49
Recall Our Likelihood Ratio PG Estimator 01:15:00
Estimation of V^π 01:15:58
Recall Our Likelihood Ratio PG Estimator 01:16:46
Variance Reduction by Discounting 01:17:59
Reducing Variance by Function Approximation 01:18:16
Reducing Variance by Function Approximation - 1 01:18:59
Actor-Critic with A3C or GAE 01:19:32
Async Advantage Actor Critic (A3C) 01:21:29
A3C - labyrinth 01:21:53
GAE: Effect of gamma and lambda 01:22:23
Learning Locomotion (TRPO + GAE) 01:23:20
Outline - 5 01:25:03
Stochastic Computation Graphs 01:25:21
Food for thought 01:26:43
Current frontiers 01:27:28
Current frontiers - 1 01:27:43
How to learn more and get started? 01:27:55
How to learn more and get started? - 1 01:28:02
How to learn more and get started? - 2 01:28:08