Policy Search for RL

Published on 2017-07-27, 8565 Views

Pieter Abbeel

DLSS & RLSS 2017 - Montreal

Presentation

Reinforcement learning - policy optimization 00:00

Reinforcement Learning 00:38

Policy optimization 01:22

Policy optimization - 1 02:03

Why policy optimization 03:40

Example Policy Optimization Success Stories 05:08

Policy Optimization in the RL Landscape 07:25

Outline 08:17

Pathwise Derivatives (PD) / BackPropagation Through Time (BPTT) 09:49

Pathwise Derivatives (PD) / BackPropagation Through Time (BPTT) - 1 10:53

Path Derivative for Stochastic f - Additive Noise 13:56

Path Derivative for Stochastic f - reparameterization trick 14:41

Stochastic Dynamics f 15:24

Stochastic f, R and π_θ 15:41

Stochastic f, R and π_θ and s_0 16:01

PD/BPTT Policy Gradients: Complete Algorithm 16:22

SVG(inf) 20:56

SVG variants 22:04

SVG(1) 24:26

SVG(0) 27:09

SVG(k) 30:17

SVG(0) -> DPG 30:39

Deep Deterministic Policy Gradient (DDPG) 31:28

DDPG Results 32:33

Outline - 1 34:03

Black Box Gradient Computation 34:38

Solution 2: Fix random seed 35:02

Solution 2: Fix random seed - 1 35:46

Solution 2: Fix random seed - 2 36:24

Learning to Hover 36:49

Gradient-Free Methods 37:24

Cross-Entropy Method 37:58

Cross-Entropy Method - 1 39:33

Closely Related Approaches 40:13

Applications 41:48

Cross-Entropy / Evolutionary Methods 42:42

Considerations 43:13

Outline - 2 45:51

Likelihood Ratio Policy Gradient 52:39

Likelihood Ratio Policy Gradient - 1 53:50

Derivation from Importance Sampling 56:28

Likelihood Ratio Gradient: Validity 58:24

Likelihood Ratio Gradient: Intuition 58:59

Let’s decompose path into states and actions 01:00:39

Likelihood ratio gradient estimate 01:01:51

Likelihood ratio gradient estimate - 1 01:02:10

Likelihood ratio gradient estimate: baseline 01:02:35

Likelihood ratio and temporal structure 01:04:10

Pseudo-code REINFORCE aka vanilla policy gradient 01:05:08

Outline - 3 01:06:02

Step-sizing and trust regions 01:06:14

What’s in a step-size? 01:06:29

Step-sizing and trust regions - 1 01:07:25

Step-sizing and trust regions - 2 01:08:04

Evaluating the KL 01:08:35

Evaluating the KL - 1 01:09:19

Evaluating the KL - 2 01:11:04

Evaluating the KL - 3 01:11:43

Experiments in Locomotion 01:13:02

Learning Curves - Comparison 01:13:54

Learning Curves - Comparison - 1 01:14:08

Atari Games 01:14:13

Outline - 4 01:14:49

Recall Our Likelihood Ratio PG Estimator 01:15:00

Estimation of V^π 01:15:58

Recall Our Likelihood Ratio PG Estimator 01:16:46

Variance Reduction by Discounting 01:17:59

Reducing Variance by Function Approximation 01:18:16

Reducing Variance by Function Approximation - 1 01:18:59

Actor-Critic with A3C or GAE 01:19:32

Async Advantage Actor Critic (A3C) 01:21:29

A3C - labyrinth 01:21:53

GAE: Effect of gamma and lambda 01:22:23

Learning Locomotion (TRPO + GAE) 01:23:20

Outline - 5 01:25:03

Stochastic Computation Graphs 01:25:21

Food for thought 01:26:43

Current frontiers 01:27:28

Current frontiers - 1 01:27:43

How to learn more and get started? 01:27:55

How to learn more and get started? - 1 01:28:02

How to learn more and get started? - 2 01:28:08