NIPS '07 Workshop on Efficient Machine Learning
Pascal

Speeding Up Stochastic Gradient Descent

author: Yoshua Bengio, University of Montreal

Description

In order to tackle large-scale learning problems whose solution necessarily involves a large model with many tunable parameters, difficult non-convex optimization has to be performed efficiently. Computational complexity arguments strongly suggest that deep architectures will be necessary to represent the kind of complex functions that AI involves. Unfortunately, training such architectures is a difficult optimization problem, so efficient approximate iterative optimization becomes the key to good generalization, more so than the regularization techniques that have been so well studied in the last two decades. Furthermore, because of the size of the data sets involved in such tasks, it is imperative that computation scale no more than linearly with the number of training examples. In many cases, the algorithm to beat is stochastic gradient descent, and comparisons have to be made by looking at the curve of test error versus computation time. Following recent interest in online versions of second-order optimization methods, we present computational tricks that yield a linear-time variant of natural gradient optimization. Another issue, particularly difficult to address in the optimization of multi-layer neural networks, is how to parallelize efficiently. With SMP machines becoming cheaper and easier to use, we compare and discuss different strategies for parallelizing the training of multi-layer neural networks, showing that naive approaches fail but that those taking the communication bottleneck into account yield impressive speed-ups.
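To make the idea of a natural gradient step whose cost is linear in the number of parameters more concrete, here is a minimal sketch. It is not the method presented in the talk. It assumes that C is the (uncentered) covariance of k recent gradient vectors, that a ridge term lam is added for invertibility, and it uses the standard Woodbury identity so that only a k x k system is solved. The function names `solve` and `natural_gradient_direction` and the parameter `lam` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A is a small k x k matrix (list of lists), b a list of length k.
    """
    k = len(b)
    # Build the augmented matrix [A | b] without modifying the inputs.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution.
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (M[i][k] - s) / M[i][i]
    return x


def natural_gradient_direction(grads, g, lam=0.1):
    """Return (C + lam*I)^{-1} g, with C = (1/k) * sum_i grads[i] grads[i]^T.

    grads: list of k gradient vectors, each of length n (assumed k << n)
    g:     gradient vector of length n to be rescaled
    lam:   ridge term (an assumption of this sketch)

    By the Woodbury identity, with G the n x k matrix of stacked gradients:
        (lam*I + G G^T / k)^{-1} g = (g - G (k*lam*I + G^T G)^{-1} G^T g) / lam
    so the cost is O(n k^2 + k^3), linear in n, and no n x n matrix is formed.
    """
    k = len(grads)
    # k x k Gram matrix G^T G with k*lam added on the diagonal.
    small = [
        [
            sum(a * b for a, b in zip(grads[i], grads[j]))
            + (k * lam if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]
    # Right-hand side G^T g.
    rhs = [sum(a * b for a, b in zip(gi, g)) for gi in grads]
    coeffs = solve(small, rhs)
    # (g - G coeffs) / lam, computed one coordinate at a time.
    return [
        (g[p] - sum(coeffs[i] * grads[i][p] for i in range(k))) / lam
        for p in range(len(g))
    ]
```

A parameter update would then subtract a learning rate times the returned direction. In an online setting the set of stored gradients would have to be refreshed as training proceeds; how to do that cheaply is outside the scope of this sketch.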

Slides
0:00 Speeding Up Stochastic Gradient Descent
0:19 Summary
1:34 Machine Vision Example
2:41 Computation Graph and Depth
2:44 Current Learning Algorithms: Depth
2:45 Computation Graph and Depth
2:45 Machine Vision Example
3:43 Computation Graph and Depth
4:11 Current Learning Algorithms: Depth
4:24 Gist of Results on Depth of Architecture
5:06 Insufficient Depth
5:10 Optimizing Deep Architectures
6:45 What Happened in 2006?
9:42 Why Online?
13:33 Underfitting, not Overfitting
15:54 Natural Gradient
17:04 Natural Gradient Minimizes Overfitting - 1
19:53 Natural Gradient Minimizes Overfitting - 2
22:05 Exact Natural Gradient is Impractical
23:04 Low Rank Approximations of C and C^{-1} g
24:42 Block Diagonal Modelling
26:01 - Questions
27:42 Experimental Results: Rectangles Data
28:13 Computing Faster Products
28:26 Comparing Different BLAS
29:35 On Actual Neural Net Code
29:37 Parallelization on SMPs
30:55 Data-Parallel Stochastic Gradient
32:22 The Big Picture
32:54 Experiments
33:04 Straw Man
33:16 Results per Update
33:53 Computational Speed-Up vs Convergence Slow-Down
34:35 - Questions
36:11 - Questions
38:33 - Questions
