Probabilistic numerics for deep learning

Published on Jul 27, 2017 · 6,148 views

Chapter list

Probabilistic numerics for deep learning (00:00)
Probabilistic numerics treats computation as a decision - 1 (02:07)
Probabilistic numerics treats computation as a decision - 2 (02:22)
Probabilistic numerics treats computation as a decision - 3 (02:30)
Probabilistic numerics is the study of numerical methods as learning algorithms. (02:46)
Global optimisation considers objective functions that are multi-modal and often expensive to evaluate. (03:29)
The Rosenbrock function is expressible in closed form. (04:22)
Computational limits form the core of the optimisation problem. (05:48)
We are epistemically uncertain about f(x,y) because we cannot afford to compute it. (06:32)
Probabilistic modelling of functions (08:02)
Probability theory represents an extension of traditional logic, allowing us to reason in the face of uncertainty. (08:20)
A probability is a degree of belief. This might be held by any agent – a human, a robot, a pigeon, etc. (11:06)
‘I’ is the totality of an agent’s prior information. An agent is (partially) defined by I. (11:40)
The Gaussian distribution allows us to produce distributions for variables conditioned on any other observed variables (see the conditioning formula below). (13:29)
A Gaussian process is the generalisation of a multivariate Gaussian distribution to a potentially infinite number of variables. (16:48)
A Gaussian process provides a non-parametric model for functions, defined by mean and covariance functions (see the regression sketch below). (17:56)
Gaussian processes are specified by a covariance function, which flexibly allows the expression of e.g. … (18:17)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 1 (19:04)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 2 (20:08)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 3 (20:22)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 4 (20:22)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 5 (20:39)
Bayesian optimisation as decision theory (20:41)
Bayesian optimisation is the approach of probabilistically modelling f(x,y), and using decision theory to make optimal use of computation. (20:59)
By defining the costs of observation and uncertainty, we can select evaluations optimally by minimising the expected loss with respect to a probability distribution. (21:31)
We define the loss as the lowest function value found by the time our algorithm ends. (22:18)
This loss function makes computing the expected loss simple: we’ll take a myopic approximation and consider only the next evaluation (see the expected-loss sketch below). (24:29)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 1 (25:22)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 2 (27:12)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 3 (28:15)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 4 (28:49)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 5 (29:04)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 6 (29:13)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 7 (29:20)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 8 (29:25)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 9 (33:10)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 10 (36:38)
Tuning is used to cope with model parameters (such as periods). (36:48)
Bayesian optimisation gives a powerful method for such tuning. (38:39)
Snoek, Larochelle and Adams (2012) used Bayesian optimisation to tune convolutional neural networks. (39:01)
Bayesian optimisation is useful in automating structured search over the number of hidden layers, learning rates, dropout rates, the number of hidden units per layer and L2 weight constraints. (41:49)
Bayesian stochastic optimisation (42:33)
Using only a subset of the data (a mini-batch) gives a noisy likelihood evaluation. (42:47)
If we use Bayesian optimisation on these noisy evaluations, we can perform stochastic learning. (44:36)
Lower-variance evaluations (on larger subsets) are higher cost: let’s also Bayesian optimise over the fidelity of our evaluations! (45:22)
Quiz: which of these sequences is random? - 1 (49:01)
Quiz: which of these sequences is random? - 2 (50:19)
A random number (52:19)
Integration beats optimisation (56:18)
The naïve fitting of models to data performed by optimisation can lead to overfitting. (56:41)
Bayesian averaging over ensembles of models reduces overfitting, and provides more honest estimates of uncertainty. (57:08)
Our model (57:43)
Averaging requires integrating over the many possible states of the world consistent with data: this is often non-analytic. (59:43)
Numerical integration (quadrature) is ubiquitous. (01:00:12)
Optimisation is an unreasonable way of estimating a multi-modal or broad likelihood integrand. (01:01:52)
If optimising, flat optima are often a better representation of the integral than narrow optima. (01:03:26)
Bayesian quadrature makes use of a Gaussian process surrogate for the integrand (the same as you might use for Bayesian optimisation). (01:04:29)
Gaussian distributed variables are joint Gaussian with any affine transform of them. (01:05:31)
A function over which we have a Gaussian process is joint Gaussian with any integral or derivative of it, as integration and differentiation are linear (see the identities below). (01:06:59)
We can use observations of an integrand ℓ in order to perform inference for its integral, Z: this is known as Bayesian quadrature (see the sketch below). (01:07:49)
Bayesian quadrature generalises and improves upon traditional quadrature. (01:09:21)
Quiz: what is the convergence rate of Monte Carlo? - 1 (01:18:46; see the numerical check below)
Quiz: what is the convergence rate of Monte Carlo? - 2 (01:19:35)
Monte Carlo (01:20:36)
Probabilistic numerics views the selection of samples as a decision problem. (01:22:04)
Our method (Warped Sequential Active Bayesian Integration) converges quickly in wall-clock time for a synthetic integrand. (01:29:45)
WSABI-L converges quickly in integrating out hyperparameters in a Gaussian process classification problem (CiteSeerx data). (01:29:52)
Probabilistic numerics offers the propagation of uncertainty through numerical pipelines. (01:29:53)
Probabilistic numerics treats computation as a decision (01:29:54)
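
The conditioning property in the chapter at 13:29 has a standard closed form. This identity is not taken from the talk's slides, but is the usual statement: for jointly Gaussian variables, conditioning one block on another gives another Gaussian.

```latex
\begin{bmatrix} x_a \\ x_b \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix},
\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}
\right)
\;\Rightarrow\;
p(x_a \mid x_b)
= \mathcal{N}\!\bigl(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\;
\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\bigr)
```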
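
A minimal sketch (not from the talk) of Gaussian process regression as in the chapter at 17:56, assuming a squared-exponential covariance and a small observation-noise jitter; all function names and hyperparameter values are illustrative.

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0, signal_var=1.0):
    # Squared-exponential covariance: k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2)).
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-6):
    # Condition the GP prior on the observations (x_train, y_train) and
    # return the posterior mean and pointwise variance at x_test.
    K = se_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = se_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    cov = se_kernel(x_test, x_test) - v.T @ v
    return mean, np.diag(cov)

x_obs = np.array([-2.0, -1.0, 0.5, 2.0])
y_obs = np.sin(x_obs)
mu, var = gp_posterior(x_obs, y_obs, np.linspace(-3.0, 3.0, 7))
```

Note the complexity growing with the data, as the chapters at 19:04 onward emphasise: the Cholesky factorisation costs O(n^3) in the number of observations.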
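
For the myopic expected loss in the chapter at 24:29, a standard closed form follows from the GP posterior: if the next evaluation is Y ~ N(mu, sigma^2) and eta is the lowest value found so far, then with z = (eta - mu) / sigma, E[min(eta, Y)] = eta + (mu - eta) Phi(z) - sigma phi(z). The sketch below assumes this noiseless setting; the names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def myopic_expected_loss(mu, sigma, eta):
    # E[min(eta, Y)] for Y ~ N(mu, sigma^2), where eta is the lowest
    # function value found so far. Minimising this over candidate inputs
    # recovers the classic expected-improvement acquisition rule.
    z = (eta - mu) / sigma
    return eta + (mu - eta) * norm.cdf(z) - sigma * norm.pdf(z)

# Illustrative GP posterior at three candidate inputs.
mu = np.array([0.1, -0.3, 0.0])
sigma = np.array([0.5, 0.2, 1.0])
print(myopic_expected_loss(mu, sigma, eta=-0.1).argmin())
```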
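
The linearity property in the chapter at 01:06:59 is what makes Bayesian quadrature work: a GP prior on the integrand induces a Gaussian on its integral. For ℓ ~ GP(m, k) and Z = ∫ ℓ(x) π(x) dx, the standard identities are:

```latex
\mathbb{E}[Z] = \int m(x)\,\pi(x)\,dx,
\qquad
\mathbb{V}[Z] = \iint k(x, x')\,\pi(x)\,\pi(x')\,dx\,dx'
```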
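
A minimal one-dimensional Bayesian quadrature sketch for the chapter at 01:07:49, assuming a zero-mean GP with squared-exponential covariance over the integrand and a Gaussian prior measure, for which the kernel mean is analytic. This is an illustrative sketch, not the talk's WSABI method.

```python
import numpy as np

def se_kernel(a, b, lam=0.5, s2=1.0):
    # Squared-exponential covariance with lengthscale lam and signal variance s2.
    return s2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lam ** 2)

def bq_posterior_mean(x, ell, lam=0.5, s2=1.0, b=0.0, B=1.0):
    # Posterior mean of Z = int ell(x) N(x; b, B) dx, given observations ell
    # of the integrand at nodes x, under a GP(0, k) prior. The kernel mean
    # z(x') = int k(x, x') N(x; b, B) dx is analytic for this kernel.
    K = se_kernel(x, x, lam, s2) + 1e-10 * np.eye(len(x))
    z = s2 * np.sqrt(lam ** 2 / (lam ** 2 + B)) * np.exp(
        -0.5 * (x - b) ** 2 / (lam ** 2 + B))
    return z @ np.linalg.solve(K, ell)

nodes = np.linspace(-3.0, 3.0, 15)
ell_obs = np.exp(-nodes ** 2)                # integrand evaluated at the nodes
print(bq_posterior_mean(nodes, ell_obs))     # true value is 1/sqrt(3) ~ 0.577
```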
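
The quiz at 01:18:46 concerns the convergence rate of simple Monte Carlo, which is O(N^{-1/2}) in the number of samples, independent of dimension. A quick numerical check (not from the talk) on the toy integral of x^2 over [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0 / 3.0                       # int_0^1 x^2 dx
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    # Root-mean-squared error over repeated Monte Carlo estimates,
    # each using n uniform samples.
    estimates = [np.mean(rng.uniform(size=n) ** 2) for _ in range(100)]
    rmse = np.sqrt(np.mean((np.asarray(estimates) - true_value) ** 2))
    print(n, rmse)   # RMSE shrinks roughly 10x for every 100x more samples
```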