Speeding Up Stochastic Gradient Descent
Description
In order to tackle large-scale learning problems whose solution necessarily involves a large model with many tunable parameters, difficult non-convex optimization has to be performed efficiently. Computational complexity arguments strongly suggest that deep architectures will be necessary to represent the kind of complex functions that AI involves. Unfortunately, training them poses difficult optimization problems, and efficient approximate iterative optimization, rather than the regularization techniques that have been so well studied in the last two decades, becomes the key to obtaining good generalization. Furthermore, because of the size of the data sets involved in such tasks, it is imperative that computation scale no more than linearly with the number of training examples. In many cases, the algorithm to beat is stochastic gradient descent, and comparisons have to be made by looking at the curve of test error versus computation time. Following recent interest in online versions of second-order optimization methods, we present computational tricks that yield a linear-time variant of natural gradient optimization. Another issue, particularly difficult to address in the optimization of multi-layer neural networks, is how to parallelize efficiently. With SMP machines becoming cheaper and easier to use, we compare and discuss different strategies for parallelizing the training of multi-layer neural networks, showing that naive approaches fail but that those taking the communication bottleneck into account yield impressive speed-ups.
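As a rough illustration of how a natural gradient direction C⁻¹g can be obtained in time linear in the number of parameters, the sketch below uses a low-rank approximation of C. It is not the method presented in the lecture. It assumes that C is approximated by a damped average of outer products of k recent gradients, C ≈ λI + (1/k) Σ gᵢgᵢᵀ, and applies the Woodbury identity so that only a k × k system is solved. The function names and the damping parameter `lam` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination
    with partial pivoting. Intended only for the k x k inner system."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = m[r][k] - sum(m[r][c] * x[c] for c in range(r + 1, k))
        x[r] = s / m[r][r]
    return x


def natural_gradient_direction(grads, g, lam=1e-2):
    """Return C^{-1} g for C = lam*I + (1/k) * sum_i g_i g_i^T.

    grads: list of k recent gradient vectors, each of length n (assumption).
    g:     current gradient, length n.
    lam:   damping term (assumption); must be positive.

    Woodbury identity: C^{-1} g = (g - G^T a) / lam, where G stacks the
    gradients as rows and a solves (lam*k*I + G G^T) a = G g.
    Cost is O(n k^2 + k^3), i.e. linear in n for fixed k.
    """
    k = len(grads)
    if k == 0:
        return [x / lam for x in g]
    # k x k Gram matrix of the stored gradients, plus damping on the diagonal.
    small = [[dot(grads[i], grads[j]) + (lam * k if i == j else 0.0)
              for j in range(k)] for i in range(k)]
    rhs = [dot(gi, g) for gi in grads]
    a = solve(small, rhs)
    return [(g[j] - sum(a[i] * grads[i][j] for i in range(k))) / lam
            for j in range(len(g))]


# Usage: one damped natural-gradient step on a parameter vector.
params = [0.5, -0.2, 0.1]
recent_grads = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
current_grad = [1.0, 2.0, 3.0]
step = natural_gradient_direction(recent_grads, current_grad, lam=0.1)
params = [p - 0.01 * s for p, s in zip(params, step)]
```

The full n × n matrix C is never formed; only inner products between gradient vectors are needed, which is what keeps the cost linear in the number of parameters when k is held fixed.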
| Time | Slide |
| 0:00 | Speeding Up Stochastic Gradient Descent |
| 0:19 | Summary |
| 1:34 | Machine Vision Example |
| 2:41 | Computation Graph and Depth |
| 2:44 | Current Learning Algorithms: Depth |
| 2:45 | Computation Graph and Depth |
| 2:45 | Machine Vision Example |
| 3:43 | Computation Graph and Depth |
| 4:11 | Current Learning Algorithms: Depth |
| 4:24 | Gist of Results on Depth of Architecture |
| 5:06 | Insufficient Depth |
| 5:10 | Optimizing Deep Architectures |
| 6:45 | What Happened in 2006? |
| 9:42 | Why Online? |
| 13:33 | Underfitting, not Overfitting |
| 15:54 | Natural Gradient |
| 17:04 | Natural Gradient Minimizes Overfitting - 1 |
| 19:53 | Natural Gradient Minimizes Overfitting - 2 |
| 22:05 | Exact Natural Gradient is Impractical |
| 23:04 | Low Rank Approximations of C and C⁻¹g |
| 24:42 | Block Diagonal Modelling |
| 26:01 | - Questions |
| 27:42 | Experimental Results: Rectangles Data |
| 28:13 | Computing Faster Products |
| 28:26 | Comparing Different BLAS |
| 29:35 | On Actual Neural Net Code |
| 29:37 | Parallelization on SMPs |
| 30:55 | Data-Parallel Stochastic Gradient |
| 32:22 | The Big Picture |
| 32:54 | Experiments |
| 33:04 | Straw Man |
| 33:16 | Results per Update |
| 33:53 | Computational Speed-Up vs Convergence Slow-Down |
| 34:35 | - Questions |
| 36:11 | - Questions |
| 38:33 | - Questions |