Optimization I

Published on 2018-10-11

3002 Views

Jimmy Ba

DLRL Summer School 2018 - Toronto

Presentation

Tutorial on: Optimization I (00:00)

Outline (00:03)

Neural networks (00:23)

Why is learning difficult - 1 (01:22)

Why is learning difficult - 2 (03:03)

Why is learning difficult - 3 (05:44)

How to train neural networks with random search - 1 (06:23)

How to train neural networks with random search - 2 (07:23)

How to train neural networks with random search - 3 (08:27)

How to train neural networks with random search - 4 (09:23)

How well does random search work? - 1 (11:21)

How well does random search work? - 2 (13:07)

Gradient descent and back-propagation (14:08)

Gradient descent - 1 (14:40)

Gradient descent - 2 (15:15)

How well does gradient descent work? - 1 (16:56)

How well does gradient descent work? - 2 (17:49)

Momentum: smooth gradient with moving average (20:54)

Stochastic gradient descent: improve efficiency - 1 (22:34)

Stochastic gradient descent: improve efficiency - 2 (24:08)

Stochastic gradient descent: improve efficiency - 3 (24:59)

Revisit gradient descent - 1 (25:49)

Revisit gradient descent - 2 (26:11)

Natural gradient descent (27:25)

Fisher information matrix - 1 (28:30)

Fisher information matrix - 2 (30:53)

When first-order methods fail (32:42)

Second-order optimization algorithms (33:02)

Second-order method algorithms (36:03)

Find a good preconditioning matrix (38:40)

When first-order methods work well - 1 (39:31)

When first-order methods work well - 2 (40:13)

When first-order methods fail (40:45)

Learning on a single machine (41:24)

Distributed learning (42:54)

Here is the training plot of a state-of-the-art ResNet trained on 8 GPUs (43:53)

Scalability of the “black-box” optimization algorithms (44:56)

Background: Natural gradient for neural networks - 1 (46:30)

Background: Natural gradient for neural networks - 2 (47:01)

Background: Natural gradient for neural networks - 3 (47:28)

Background: Kronecker-factored natural gradient - 1 (47:36)

Background: Kronecker-factored natural gradient - 2 (47:40)

Background: Kronecker-factored natural gradient - 3 (47:54)

Distributed K-FAC natural gradient (49:40)

Scalability experiments (51:59)

Thank you (56:21)