Probabilistic numerics for deep learning

Published on Jul 27, 2017 · 6147 views

Chapter list

Probabilistic numerics for deep learning (00:00)
Probabilistic numerics treats computation as a decision - 1 (02:07)
Probabilistic numerics treats computation as a decision - 2 (02:22)
Probabilistic numerics treats computation as a decision - 3 (02:30)
Probabilistic numerics is the study of numerical methods as learning algorithms. (02:46)
Global optimisation considers objective functions that are multi-modal and often expensive to evaluate. (03:29)
The Rosenbrock function is expressible in closed form. (04:22)
Computational limits form the core of the optimisation problem. (05:48)
We are epistemically uncertain about f(x, y) because we cannot afford its computation. (06:32)
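For concreteness, the Rosenbrock function in its standard two-dimensional form, as a minimal Python sketch (the function and its minimum are standard; the variable names are mine):

def rosenbrock(x, y, a=1.0, b=100.0):
    # Standard two-dimensional Rosenbrock function: cheap to write
    # down in closed form, but its long, curved valley makes it a
    # classic test problem for global optimisation.
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

# The global minimum is 0, attained at (x, y) = (a, a^2) = (1, 1).
print(rosenbrock(1.0, 1.0))  # 0.0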
Probabilistic modelling of functions (08:02)
Probability theory represents an extension of traditional logic, allowing us to reason in the face of uncertainty. (08:20)
A probability is a degree of belief. This might be held by any agent: a human, a robot, a pigeon, etc. (11:06)
‘I’ is the totality of an agent’s prior information. An agent is (partially) defined by I. (11:40)
The Gaussian distribution allows us to produce distributions for variables conditioned on any other observed variables. (13:29)
A Gaussian process is the generalisation of a multivariate Gaussian distribution to a potentially infinite number of variables. (16:48)
A Gaussian process provides a non-parametric model for functions, defined by mean and covariance functions. (17:56)
Gaussian processes are specified by a covariance function, which flexibly allows the expression of model properties such as periodicity. (18:17)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 1 (19:04)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 2 (20:08)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 3 (20:22)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 4 (20:22)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 5 (20:39)
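To make the conditioning concrete: a minimal sketch of Gaussian process regression with a squared-exponential covariance, using the standard Gaussian conditioning formulae (the data, lengthscale and jitter values are illustrative assumptions):

import numpy as np

def k(x1, x2, lengthscale=0.5):
    # Squared-exponential covariance between two sets of 1-D inputs.
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.array([-1.0, 0.0, 1.5])    # observed inputs
y = np.sin(x)                     # observed values
xs = np.linspace(-2.0, 2.0, 5)    # test inputs

# For jointly Gaussian (y, f*), conditioning on y gives
#   mean = K(xs, x) K(x, x)^{-1} y
#   cov  = K(xs, xs) - K(xs, x) K(x, x)^{-1} K(x, xs)
K = k(x, x) + 1e-8 * np.eye(len(x))        # jitter for stability
mean = k(xs, x) @ np.linalg.solve(K, y)
cov = k(xs, xs) - k(xs, x) @ np.linalg.solve(K, k(x, xs))

print(mean)                    # posterior mean at the test inputs
print(np.sqrt(np.diag(cov)))   # pointwise predictive standard deviations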
Bayesian optimisation as decision theory (20:41)
Bayesian optimisation is the approach of probabilistically modelling f(x, y), and using decision theory to make optimal use of computation. (20:59)
By defining the costs of observation and uncertainty, we can select evaluations optimally by minimising the expected loss with respect to a probability distribution. (21:31)
We define a loss function that is the lowest function value found after our algorithm ends. (22:18)
This loss function makes computing the expected loss simple: we’ll take a myopic approximation and consider only the next evaluation. (24:29)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 1 (25:22)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 2 (27:12)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 3 (28:15)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 4 (28:49)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 5 (29:04)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 6 (29:13)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 7 (29:20)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 8 (29:25)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 9 (33:10)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 10 (36:38)
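Under a Gaussian process, that myopic expected loss has a closed form. With η the lowest value found so far and a Gaussian predictive distribution N(μ, σ²) at a candidate input, the expected new minimum E[min(η, y)] follows from a standard Gaussian identity (a sketch in my own notation, not the talk’s exact formulation):

from scipy.stats import norm

def expected_loss(mu, sigma, eta):
    # Expected value of min(eta, y) for y ~ N(mu, sigma^2), where eta
    # is the lowest value found so far. With u = (eta - mu) / sigma,
    #   E[min(eta, y)] = eta (1 - Phi(u)) + mu Phi(u) - sigma phi(u).
    # Minimising this over candidates (equivalently, maximising the
    # expected improvement eta - E[min(eta, y)]) picks the next point.
    u = (eta - mu) / sigma
    return eta * (1.0 - norm.cdf(u)) + mu * norm.cdf(u) - sigma * norm.pdf(u)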
Tuning is used to cope with model parameters (such as periods). (36:48)
Bayesian optimisation gives a powerful method for such tuning. (38:39)
Snoek, Larochelle and Adams (2012) used Bayesian optimisation to tune convolutional neural networks. (39:01)
Bayesian optimisation is useful in automating structured search over the number of hidden layers, learning rates, dropout rates, the number of hidden units per layer and L2 weight constraints. (41:49)
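Putting the pieces together, a minimal Bayesian optimisation loop for a single hyperparameter might look as follows. This is a sketch only: the synthetic objective, candidate grid and kernel settings are illustrative assumptions, with scikit-learn’s GP standing in for the talk’s model.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(log_lr):
    # Stand-in for an expensive evaluation, e.g. the validation loss
    # of a network trained with learning rate 10**log_lr.
    return (log_lr + 3.0) ** 2 + 0.1 * np.sin(5.0 * log_lr)

candidates = np.linspace(-6.0, 0.0, 200).reshape(-1, 1)
X = [[-5.0], [-1.0]]                        # initial design
Y = [objective(x[0]) for x in X]

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  normalize_y=True).fit(X, Y)
    mu, sigma = gp.predict(candidates, return_std=True)
    eta = min(Y)
    u = (eta - mu) / np.maximum(sigma, 1e-9)
    # Myopic expected loss E[min(eta, y)], as in the closed form above.
    el = eta * (1 - norm.cdf(u)) + mu * norm.cdf(u) - sigma * norm.pdf(u)
    x_next = float(candidates[np.argmin(el)][0])
    X.append([x_next])
    Y.append(objective(x_next))

print("best log10 learning rate:", X[int(np.argmin(Y))], "with loss", min(Y))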
Bayesian stochastic optimisation (42:33)
Using only a subset of the data (a mini-batch) gives a noisy likelihood evaluation. (42:47)
If we use Bayesian optimisation on these noisy evaluations, we can perform stochastic learning. (44:36)
Lower-variance evaluations (on larger subsets) are higher cost: let’s also Bayesian optimise over the fidelity of our evaluations! (45:22)
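A minimal sketch of such a fidelity-controlled evaluation (the data and loss are illustrative assumptions): batch size trades the cost of an evaluation against its variance, so it can be optimised alongside the parameters themselves.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)

def noisy_loss(theta, batch_size):
    # Unbiased mini-batch estimate of the full-data squared loss.
    # Small batches are cheap but high-variance; large batches are
    # expensive but low-variance.
    batch = rng.choice(data, size=batch_size, replace=False)
    return np.mean((batch - theta) ** 2)

# Same theta, increasing fidelity: estimates concentrate around the
# full-data value as batch_size grows.
for b in (10, 100, 10_000):
    print(b, noisy_loss(2.0, b))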
Quiz: which of these sequences is random? - 1 (49:01)
Quiz: which of these sequences is random? - 2 (50:19)
A random number (52:19)
Integration beats optimisation (56:18)
The naïve fitting of models to data performed by optimisation can lead to overfitting. (56:41)
Bayesian averaging over ensembles of models reduces overfitting, and provides more honest estimates of uncertainty. (57:08)
Our model (57:43)
Averaging requires integrating over the many possible states of the world consistent with the data: this is often non-analytic. (59:43)
Numerical integration (quadrature) is ubiquitous. (01:00:12)
Optimisation is an unreasonable way of estimating a multi-modal or broad likelihood integrand. (01:01:52)
If optimising, flat optima are often a better representation of the integral than narrow optima. (01:03:26)
Bayesian quadrature makes use of a Gaussian process surrogate for the integrand (the same as you might use for Bayesian optimisation). (01:04:29)
Gaussian-distributed variables are jointly Gaussian with any affine transform of them. (01:05:31)
A function over which we have a Gaussian process is jointly Gaussian with any integral or derivative of it, as integration and differentiation are linear. (01:06:59)
We can use observations of an integrand ℓ in order to perform inference for its integral, Z: this is known as Bayesian quadrature. (01:07:49)
Bayesian quadrature generalises and improves upon traditional quadrature. (01:09:21)
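Since integration is linear, the GP posterior over the integrand induces a Gaussian posterior over its integral Z. A minimal sketch for a squared-exponential kernel and a Gaussian prior measure, where the required kernel mean is available in closed form (the integrand, lengthscale and node placement are illustrative assumptions):

import numpy as np

ell, sigma = 0.8, 1.0   # kernel lengthscale; width of the N(0, sigma^2) measure

def k(x1, x2):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

def kernel_mean(x):
    # z(x) = integral of k(x, x') N(x'; 0, sigma^2) dx', in closed
    # form for this kernel and measure.
    s2 = ell ** 2 + sigma ** 2
    return ell / np.sqrt(s2) * np.exp(-0.5 * x ** 2 / s2)

x = np.linspace(-3.0, 3.0, 9)   # quadrature nodes
y = np.exp(-x ** 2)             # observations of the integrand ℓ

# Posterior mean of Z = integral of ℓ(x) N(x; 0, sigma^2) dx is z^T K^{-1} y.
K = k(x, x) + 1e-10 * np.eye(len(x))
Z_mean = kernel_mean(x) @ np.linalg.solve(K, y)

print(Z_mean)   # the exact value for this integrand is 1/sqrt(3) ≈ 0.577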
Quiz: what is the convergence rate of Monte Carlo? - 1 (01:18:46)
Quiz: what is the convergence rate of Monte Carlo? - 2 (01:19:35)
Monte Carlo (01:20:36)
Probabilistic numerics views the selection of samples as a decision problem. (01:22:04)
Our method (Warped Sequential Active Bayesian Integration) converges quickly in wall-clock time for a synthetic integrand. (01:29:45)
WSABI-L converges quickly in integrating out hyperparameters in a Gaussian process classification problem (CiteSeerX data). (01:29:52)
Probabilistic numerics offers the propagation of uncertainty through numerical pipelines. (01:29:53)
Probabilistic numerics treats computation as a decision (01:29:54)