## Lock-Free Approaches to Parallelizing Stochastic Gradient Descent

author: Benjamin Recht, University of Wisconsin - Madison
published: Jan. 25, 2012,   recorded: December 2011,   views: 374

# Slides

- Lock-Free Approaches to Parallelizing Stochastic Gradient Descent
- Incremental Gradient Descent
- Example: Computing the mean - 01
- Example: Computing the mean - 02
- Convergence Rates - 01
- Convergence Rates - 02
- SGD and BIG Data
- Is SGD inherently Serial?
- HOGWILD!
- “Sparse” Function:
- HOGWILD!
- “Sparse” Function:
- Sparse Support Vector Machines
- Matrix Completion
- Graph Cuts
- Convergence Theory
- Hogs gone wild!
- Convergence Theory
- Hogs gone wild!
- Speedups
- JELLYFISH - 01
- JELLYFISH - 02
- JELLYFISH - 01
- JELLYFISH - 03
- Example Optimization - 01
- Example Optimization - 02
- Example Optimization - 03
- Simplest Problem? Least Squares
- Simple Question
- What about generically? - 01
- What about generically? - 02
- What about generically? - 03
- What about generically? - 01
- What about generically? - 02
- But what about on average?
- Summary


# Description

Stochastic Gradient Descent (SGD) is a very popular optimization algorithm for solving data-driven machine learning problems. SGD is well suited to processing large amounts of data due to its robustness to noise, rapid convergence rates, and predictable memory footprint. Nevertheless, SGD seems to be impeded by many of the classical barriers to scalability: (1) SGD appears to be inherently sequential, (2) SGD assumes uniform sampling from the underlying data set, resulting in poor locality, and (3) current approaches to parallelizing SGD require performance-destroying, fine-grained communication. This talk aims to refute the conventional wisdom that SGD inherently suffers from these impediments. Specifically, I will show that SGD can be implemented in parallel with minimal communication, with no locking or synchronization, and with strong spatial locality. I will provide both theoretical and experimental evidence of linear speedups on several benchmark optimization problems on multicore workstations. Finally, I will close with a discussion of a challenging problem raised by our implementations, relating arithmetic and geometric means of matrices. Joint work with Feng Niu, Christopher Re, and Stephen Wright.
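To make the lock-free idea concrete, here is a minimal HOGWILD!-style sketch in pure Python. It is illustrative only: the synthetic sparse least-squares data, the learning rate, and names such as `worker` are assumptions, not taken from the talk. Several threads run plain SGD on one shared weight vector with no locks or synchronization; because each example touches only a few coordinates, concurrent updates rarely collide, which is the sparsity condition the talk's convergence theory relies on.

```python
import random
import threading

random.seed(0)
d, n, nnz = 40, 1500, 3                      # dimension, examples, nonzeros/example
w_true = [random.gauss(0, 1) for _ in range(d)]

# Each example touches only `nnz` random coordinates (the "sparse" setting).
examples = []
for _ in range(n):
    coords = random.sample(range(d), nnz)
    vals = [random.gauss(0, 1) for _ in coords]
    target = sum(v * w_true[c] for c, v in zip(coords, vals))
    examples.append((coords, vals, target))

w = [0.0] * d                                # SHARED parameters, updated lock-free

def worker(chunk, lr=0.05, epochs=30):
    """Plain SGD on squared loss; reads and writes `w` without any lock."""
    for _ in range(epochs):
        for coords, vals, target in chunk:
            err = sum(v * w[c] for c, v in zip(coords, vals)) - target
            for c, v in zip(coords, vals):
                w[c] -= lr * err * v         # sparse in-place update, no lock taken

# Split the data across 4 threads that all hammer on `w` concurrently.
chunks = [examples[i::4] for i in range(4)]
threads = [threading.Thread(target=worker, args=(ch,)) for ch in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

mse = sum((sum(v * w[c] for c, v in zip(cs, vs)) - tgt) ** 2
          for cs, vs, tgt in examples) / n
print(f"final MSE: {mse:.4f}")
```

Despite occasional lost or stale updates from the unsynchronized writes, the final mean squared error lands near zero, since collisions on any single coordinate are rare when updates are sparse.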