
The Parameter Server

Published on Jan 16, 2013 · 9,374 views

In this talk I will discuss a number of vignettes on scaling optimization and inference. Despite arising from very different contexts (graphical model inference, convex optimization, neural networks)

Chapter list

Scaling with the Parameter Server: Variations on a Theme 00:00
Thanks 00:57
Practical Distributed Inference 01:21
Motivation - Data & Systems 01:56
Commodity Hardware 01:59
The Joys of Real Hardware 02:13
Scaling problems 02:33
Some Problems (1) 04:02
Some Problems (2) 04:34
Some Problems (3) 04:35
Multicore Parallelism (1) 04:38
Multicore Parallelism (2) 04:42
Stochastic Gradient Descent 06:55
Guarantees 07:18
Speedup on TREC 08:58
LDA Multicore Inference (1) 09:22
LDA Multicore Inference (2) 10:01
General strategy 10:48
This was easy ... (1) 11:25
This was easy ... (2) 11:26
This was easy ... (3) 11:27
Parameter Server: 30,000 ft view 11:38
Why (not) MapReduce? 12:15
General parallel algorithm template (1) 13:37
General parallel algorithm template (2) 15:07
Desiderata 15:56
Random Caching Trees (1) 16:21
Random Caching Trees (2) 17:33
Argmin Hash 18:42
Distributed Hash Table (1) 21:05
Distributed Hash Table (2) 22:19
Distributed Hash Table (3) 22:20
Distributed Hash Table (4) 22:21
Distributed Hash Table (5) 22:21
Exact Synchronization 22:21
Motivation - Latent Variable Models 22:28
Distribution 23:02
Preserving the polytope 23:40
Example - User Profiling (1) 25:35
Example - User Profiling (2) 25:38
Distribution (1) 26:15
Distribution (2) 26:21
Synchronization (1) 26:33
Synchronization (2) 29:19
Weak scaling (more data = more machines) (1) 29:35
Weak scaling (more data = more machines) (2) 29:42
Exact Synchronization in a Nutshell 30:06
Approximate Synchronization & Dual Decomposition 31:22
Motivation - Distributed Optimization 31:28
Properties 32:24
Dual Decomposition to the rescue 33:15
Synchronous Variant (MapReduce) 34:11
Asynchronous Variant 35:04
Convergence (synchronous vs. asynchronous) (1) 36:23
Convergence (synchronous vs. asynchronous) (2) 36:26
Acceleration (single CPU vs. 32 machines) 36:46
Weak scaling (more data = more machines) 36:54
Even more parameter server variants 37:22
Multicore 38:35