## Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

author: Ohad Shamir, Microsoft Research New England
published: Jan. 25, 2012,   recorded: December 2011,   views: 164

# Slides

- Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
- Stochastic Convex Optimization
- Strongly Convex Stochastic Optimization (parts 01–02)
- Better Algorithms
- Related Work
- This Work (parts 01–06)
- Smooth F (parts 01–03)
- Non-Smooth F
- Warm-up
- Second Example
- Fixing SGD (parts 01–02)
- Experiments (parts 01–02)
- Conclusions and Open Problems


# Description

Stochastic gradient descent (SGD) is a simple and popular method for solving the stochastic optimization problems that arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), obtained by running SGD for T iterations and returning the average point. However, recent results showed that a different algorithm can achieve an optimal O(1/T) rate. This might lead one to believe that standard SGD is suboptimal, and perhaps should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging really can be Ω(log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no other change to the algorithm is necessary. We also present experimental results that support our findings, and point out open problems.
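The modified averaging the abstract alludes to can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's implementation: it runs SGD with step size 1/(λt) on a λ-strongly convex scalar objective using a noisy-gradient oracle `grad` (a hypothetical callable), and compares the standard full average of the iterates against a suffix average that keeps only the last α fraction of them.

```python
import random

def sgd_with_averages(grad, w0, T, lam, alpha=0.5):
    """Run T steps of SGD with step size 1/(lam*t).

    grad  -- unbiased noisy gradient oracle: w -> stochastic gradient at w
    lam   -- strong convexity parameter of the objective
    alpha -- fraction of final iterates kept by the suffix average

    Returns (full_average, suffix_average) of the iterates.
    """
    w = w0
    iterates = []
    for t in range(1, T + 1):
        w = w - (1.0 / (lam * t)) * grad(w)  # standard SGD update
        iterates.append(w)
    full_avg = sum(iterates) / T
    start = int((1.0 - alpha) * T)           # drop the early iterates
    suffix_avg = sum(iterates[start:]) / (T - start)
    return full_avg, suffix_avg

# Toy usage: minimize F(w) = 0.5*(w - 3)^2 (lam = 1) with Gaussian gradient noise.
random.seed(0)
noisy_grad = lambda w: (w - 3.0) + random.gauss(0.0, 1.0)
full_avg, suffix_avg = sgd_with_averages(noisy_grad, w0=0.0, T=1000, lam=1.0)
```

Both averages converge to the minimizer w* = 3 in this smooth toy problem; the paper's point is that on non-smooth problems the full average can lag by a log(T) factor, while the suffix-style average does not.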