## Beyond stochastic gradient descent for large-scale machine learning

published: Oct. 29, 2014, recorded: September 2014, views: 174

# Slides

# Related content

# Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our**to describe your request and upload the data.**

__ticket system__*Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.*

# Description

Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ("large n") and each of these is large ("large p"). In this setting, online algorithms such as stochastic gradient descent which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. Given n observations/iterations, the optimal convergence rates of these algorithms are O(1/\sqrt{n}) for general convex functions and reaches O(1/n) for strongly-convex functions. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newtonbased stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt)

# Link this page

Would you like to put a link to this lecture on your homepage?

Go ahead! Copy the HTML snippet !

## Write your own review or comment: