Welcome to bayesfunc’s documentation!¶
bayesfunc is a PyTorch library providing a number of state-of-the-art priors and variational approximate posteriors over functions.
In particular, we implement:
Global inducing variational inference for neural networks (https://arxiv.org/abs/2005.08140)
Global inducing variational inference for deep Gaussian processes (https://arxiv.org/abs/2005.08140)
Deep kernel processes (https://arxiv.org/abs/2010.01590)
In addition, we implement a number of more standard methods, primarily to give fair, easy-to-implement comparisons:
Mean field variational inference
Sparse (deep) Gaussian process inference
Library Conventions¶
bayesfunc introduces a number of PyTorch modules mirroring the standard PyTorch nn API.
As such, modules can be combined and networks created using e.g. nn.Sequential. See Examples for further details.
However, these modules differ from their standard counterparts in a couple of ways.
Sample and minibatch¶
In standard PyTorch, inputs to modules are tensors, and the zeroth index usually represents a minibatch. For instance, if we had a minibatch of size 128,
>>> import torch
>>> import torch.nn as nn
>>> m = nn.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])
However, here we are sampling from an approximate posterior over functions, and we will often want to draw multiple different samples from that approximate posterior (e.g. to reduce the variance of our estimate of the ELBO). As such, we introduce a new convention: the zeroth index gives the sample of the approximate posterior, and the first index gives the minibatch. Consider possibly the simplest module defined in our library, FactorisedLinear, which does mean-field variational inference (MFVI) over the weights of a fully-connected layer. For instance, if we wanted to do the equivalent of the above, with just one sample from the posterior over functions that we apply to every input in the minibatch,
>>> import bayesfunc as bf
>>> m = bf.FactorisedLinear(20, 30)
>>> input = torch.randn(1, 128, 20)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([1, 128, 30])
Note that both the input and output are rank 3 tensors, whereas in the original pure PyTorch example, they were only rank 2 tensors.
If we wanted to draw 10 samples of the function and apply them to our inputs, we would need to give the module a tensor with shape (10, minibatch, in_features).
To replicate the inputs 10 times efficiently, we use expand:
>>> m = bf.FactorisedLinear(20, 30)
>>> input = torch.randn(1, 128, 20).expand(10, -1, -1)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([10, 128, 30])
Note that, despite the inputs being the same for different samples, the outputs are different, because we have applied different functions:
>>> (input[0, 0, 0] == input[1, 0, 0]).item()
True
>>> (output[0, 0, 0] == output[1, 0, 0]).item()
False
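The convention can also be illustrated without bayesfunc. The following is a minimal pure-PyTorch sketch, assuming a factorised Gaussian over the weights of a single linear map; the variable names (`mean`, `log_var`, `weights`) are hypothetical and this is not how FactorisedLinear is implemented internally. It shows that `expand` adds the sample dimension as a view, and that a batched matrix product applies a different sampled function to each copy of the minibatch.

```python
import torch

torch.manual_seed(0)
samples, mbatch, in_features, out_features = 10, 128, 20, 30

# Hypothetical variational parameters of a factorised Gaussian over weights.
mean = torch.zeros(in_features, out_features)
log_var = torch.full((in_features, out_features), -2.0)

# One reparameterised weight sample per function sample: (samples, in, out).
eps = torch.randn(samples, in_features, out_features)
weights = mean + (0.5 * log_var).exp() * eps

# expand adds the sample dimension as a view, so no data is copied:
# the stride along the sample dimension is zero.
x = torch.randn(1, mbatch, in_features).expand(samples, -1, -1)

# Batched matrix product: each sample's weights act on the same minibatch.
output = x @ weights

print(output.shape)   # torch.Size([10, 128, 30])
print(x.stride()[0])  # 0
```

Because the expanded input shares memory across samples, the cost of drawing more samples falls on the weights and outputs rather than on the data.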
Propagate function¶
To run a network, you must use the propagate function. This function prepares the network (by optionally loading up a previously sampled set of weights), runs the network, returning the output, \(\log P(f) - \log Q(f)\), and the sampled weights used,
- bayesfunc.propagate(f, *args, sample_dict=None, detach=True)¶
The ONLY way to run the neural networks defined in bayesfunc. Replaces f(input), which will now fail silently!
- Parameters
f – the bayesfunc function
input – input to the function
- Keyword Arguments
sample_dict – optional dictionary of sampled weights, to allow using the same weights for multiple different inputs
detach – if true, we detach parameters in sample_dict, stopping propagation of gradients
- outputs:
output: neural network output (as in output = f(input)).
logpq: \(\log P(f) - \log Q(f)\), the difference of prior and approximate posterior log-probabilities.
output_sample_dict: a dictionary containing all the sampled weights used in the network. If sample_dict is set, we have output_sample_dict == sample_dict.
Warning
Only properly implemented for GILinear, GIConv2d, FactorisedLinear and FactorisedConv2d. Everything else will run, but will independently sample a new function on every invocation, ignoring the sample_dict input argument.
In standard use, sample_dict is never set and output_sample_dict is never needed. These only become useful e.g. in continual learning.
Computing the ELBO¶
In variational inference, we use the ELBO as the loss,
\(\mathcal{L}(\phi) = E_{Q_\phi}[\sum_{i=1}^N \log P(y_i|x_i, f) + \log P(f) - \log Q_\phi(f)]\)
where \(x_i\) is a single input (e.g. image), \(y_i\) is a single output (e.g. label) and there are \(N\) datapoints in total. The bayesfunc library defines priors, \(P(f)\), and approximate posteriors, \(Q_\phi(f)\), over functions, where \(\phi\) are the trainable parameters of the approximate posterior.
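For reference, a standard identity (not specific to bayesfunc) relates this objective to the log marginal likelihood. Writing \(P(f|x_{1:N}, y_{1:N})\) for the true posterior over functions, a symbol introduced here only for illustration:

```latex
\mathcal{L}(\phi)
  = \log P(y_{1:N} | x_{1:N})
    - \mathrm{KL}\left( Q_\phi(f) \,\|\, P(f | x_{1:N}, y_{1:N}) \right)
  \le \log P(y_{1:N} | x_{1:N})
```

Since the KL divergence is non-negative, maximising the ELBO with respect to \(\phi\) tightens the bound and moves \(Q_\phi(f)\) towards the true posterior.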
In practice, we use a minibatched estimate of the loss, averaged across datapoints,
\(\frac{1}{N} \hat{\mathcal{L}}(\phi) = \frac{1}{B} \sum_{i\in \mathcal{B}_j}\log P(\text{data}_i| f) + \frac{1}{N} (\log P(f) - \log Q_\phi(f))\)
Here, \(\mathcal{B}_j\) is the \(j\)th minibatch, \(B\) is the size of a minibatch, and \(f\) has been sampled from the approximate posterior, \(Q_\phi(f)\). Critically, the first term is just the negative of the standard neural-network loss (e.g. the cross-entropy),
\(\frac{1}{B} \sum_{i\in \mathcal{B}_j}\log P(\text{data}_i| f)\) = - average cross entropy
The second term is given by the bayesfunc library, as the second value returned by bf.propagate,
\(\log P(f) - \log Q_\phi(f)\) = bf.propagate(net, input)[1]
As such, the full training loop might look like:
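import torch.nn.functional as F

for x, y in dataloader:
    # include a sample dimension: (1, minibatch, in_features)
    x = x.expand(1, -1, -1)
    # compute the output and log P(f) - log Q(f)
    output, logpq, _ = bf.propagate(net, x)
    # average log-likelihood: minus the mean cross-entropy of the
    # network output (with the sample dimension dropped)
    log_like = -F.cross_entropy(output[0], y, reduction="mean")
    # the objective, where N is the number of datapoints
    obj = log_like + logpq/N
    optimizer.zero_grad()
    (-obj.mean()).backward()
    optimizer.step()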
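To make the two terms of the objective concrete, here is a self-contained sketch of a single ELBO step that does not use bayesfunc at all. It assumes a hand-written mean-field Gaussian posterior over the weights of one linear layer and an IID Gaussian prior with variance 1/in_features; all names (`q_mean`, `q_log_std`, `prior`) are hypothetical, and `logpq` is computed by hand here rather than returned by bf.propagate.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

torch.manual_seed(0)
N, B, in_features, classes = 1000, 128, 20, 5  # N: assumed dataset size

# Hypothetical mean-field posterior Q and IID Gaussian prior P over weights.
q_mean = torch.zeros(in_features, classes, requires_grad=True)
q_log_std = torch.full((in_features, classes), -3.0, requires_grad=True)
prior = Normal(0.0, in_features ** -0.5)

x = torch.randn(1, B, in_features)   # (samples, minibatch, features)
y = torch.randint(classes, (B,))     # integer class labels

# Draw one function (weight sample) with the reparameterisation trick.
q = Normal(q_mean, q_log_std.exp())
w = q.rsample()

# log P(f) - log Q(f) for the sampled weights.
logpq = prior.log_prob(w).sum() - q.log_prob(w).sum()

# Average log-likelihood over the minibatch: minus the mean cross-entropy.
logits = x[0] @ w
log_like = -F.cross_entropy(logits, y, reduction="mean")

# Per-datapoint ELBO estimate; maximise it by minimising its negative.
obj = log_like + logpq / N
(-obj).backward()
print(obj.item(), q_mean.grad.shape)
```

Dividing `logpq` by N rather than B is what keeps the estimate on the per-datapoint scale of the averaged ELBO given above.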
Wrapper for global inducing methods¶
Many of our function approximators require “global” inducing points, i.e. optimized pseudo-inputs that look like standard data-items.
These modules (i.e. GILinear, GIConv2d, GIGP, DKP) require an InducingWrapper,
- bayesfunc.InducingWrapper(net, inducing_batch, *args, **kwargs)¶
Combines incoming test/train data with learned inducing inputs, then strips away the inducing outputs, leaving just the function evaluated at the test/train locations.
- Parameters
net (nn.Module) – The underlying function approximator, represented as PyTorch modules, to be wrapped.
inducing_batch (int) – The number of inducing points.
- Keyword Arguments
inducing_shape (Optional[torch.Size]) – The size of the inducing inputs, including inducing_batch as the first dimension. Default: None.
inducing_data (Optional[torch.Tensor]) – The values of the inducing inputs. Useful to e.g. initialize the inducing points on top of datapoints. Default: None.
fixed (Bool) – Do we fix the inducing point locations? Default: False.
Must specify one and only one of inducing_shape or inducing_data.
Example
>>> import bayesfunc as bf
>>> import torch as t
>>> import torch.nn as nn
>>>
>>> in_features = 20
>>> hidden_features = 50
>>> out_features = 30
>>>
>>> m1 = bf.GILinear(in_features, hidden_features, inducing_batch=100)
>>> m2 = bf.GILinear(hidden_features, out_features, inducing_batch=100)
>>> net = nn.Sequential(m1, m2)
>>>
>>> net = bf.InducingWrapper(net, 100, inducing_shape=(100, in_features))
>>> output, _, _ = bf.propagate(net, t.randn(3, 128, in_features))
>>> output.shape
torch.Size([3, 128, 30])
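The data flow described above (prepend the inducing inputs along the minibatch dimension, run the network, discard the inducing outputs) can be sketched in plain PyTorch. `ToyInducingWrapper` below is a hypothetical stand-in written for illustration, not bayesfunc.InducingWrapper itself, and it wraps a deterministic `nn.Linear` purely to show the shapes.

```python
import torch
import torch.nn as nn


class ToyInducingWrapper(nn.Module):
    """Hypothetical illustration of the wrapper's data flow."""

    def __init__(self, net, inducing_batch, inducing_shape):
        super().__init__()
        assert inducing_shape[0] == inducing_batch
        self.net = net
        self.inducing_batch = inducing_batch
        # Learned pseudo-inputs, shared across all function samples.
        self.inducing_data = nn.Parameter(torch.randn(*inducing_shape))

    def forward(self, x):
        samples = x.shape[0]
        # Give the inducing inputs a sample dimension without copying.
        xi = self.inducing_data.expand(samples, *self.inducing_data.shape)
        # Inducing points come first along the minibatch dimension.
        out = self.net(torch.cat([xi, x], dim=1))
        # Strip the inducing outputs, keeping the test/train outputs.
        return out[:, self.inducing_batch:]


net = ToyInducingWrapper(nn.Linear(20, 30), 100, (100, 20))
out = net(torch.randn(3, 128, 20))
print(out.shape)  # torch.Size([3, 128, 30])
```

Placing the inducing points first matches the assumption, stated for GILinear and GIConv2d below, that the first inducing_batch elements of the minibatch are inducing.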
Structured kernels for kernel-based methods¶
To implement kernel-based methods efficiently, we can’t propagate the full \((P_\text{i}+P_\text{t})\times(P_\text{i}+P_\text{t})\) covariance matrix, where \(P_\text{i}\) is the number of inducing points, and \(P_\text{t}\) is the number of test/training points, as \(P_\text{t}\) could be very large. Instead, we propagate a special type:
- class bayesfunc.KG(ii, it, tt)¶
Simple container class for different components of a covariance matrix. You shouldn’t need to use this unless you are developing your own kernels.
- arg:
ii: \(P_\text{i}\times P_\text{i}\) covariance matrix for inducing points. shape=(samples, inducing_batch, inducing_batch)
it: \(P_\text{i}\times P_\text{t}\) covariance matrix between inducing points and test/train points. shape=(samples, inducing_batch, mbatch)
tt: \(P_\text{t}\) diagonal variances for test/train points. shape=(samples, 1, mbatch)?
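As a rough illustration of why this structure is cheaper, the sketch below builds the three blocks for a simple linear kernel without ever forming the full \(P_\text{t}\times P_\text{t}\) matrix. `ToyKG` and `linear_kernel_kg` are hypothetical names used only for this example, the shape of `tt` follows the (tentative) shape listed above, and nothing here is taken from the bayesfunc source.

```python
from typing import NamedTuple

import torch


class ToyKG(NamedTuple):
    """Hypothetical container mirroring the ii / it / tt blocks."""
    ii: torch.Tensor  # (samples, inducing_batch, inducing_batch)
    it: torch.Tensor  # (samples, inducing_batch, mbatch)
    tt: torch.Tensor  # (samples, 1, mbatch): diagonal only


def linear_kernel_kg(features, inducing_batch):
    # The first inducing_batch rows are inducing points, the rest are data.
    xi = features[:, :inducing_batch]
    xt = features[:, inducing_batch:]
    ii = xi @ xi.transpose(-1, -2)
    it = xi @ xt.transpose(-1, -2)
    # Only the diagonal of the test/train block: O(P_t), not O(P_t^2).
    tt = (xt * xt).sum(-1).unsqueeze(1)
    return ToyKG(ii, it, tt)


features = torch.randn(3, 100 + 128, 20)
kg = linear_kernel_kg(features, inducing_batch=100)
print(kg.ii.shape, kg.it.shape, kg.tt.shape)
```

The memory cost of the test/train block grows linearly in the minibatch size, which is the point of propagating a structured object rather than a dense covariance matrix.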
Library reference: Bayesian neural networks¶
Simple approximate posteriors for Bayesian neural networks¶
These methods compute an approximate posterior over weights and are relatively simple: they don’t have global inducing points, and therefore don’t need wrapping in InducingWrapper (Wrapper for global inducing methods). That said, you can wrap them if you want, which is mainly useful when combining some of these simpler methods with a global inducing method.
First, we look at factorised methods. They are easy to apply, but often don’t work that well. It can be important to initialise the approximate posterior with very low variance to get them to converge.
- class bayesfunc.FactorisedLinear(in_features, out_features, bias=True, **kwargs)¶
IID Gaussian prior and factorised Gaussian posterior over the weights of a fully-connected layer.
- arg:
in_features: size of each input sample
out_features: size of each output sample
- kwargs:
bias: If set to False, the layer will not learn an additive bias. Default: True
prior: The prior over weights. Default NealPrior.
var_fixed: Defaults to None. If set to a float, it fixes the approximate posterior variance over weights to that value.
var_init_mult: The approximate posterior variance is initialized to be equal to the prior variance, multiplied by var_init_mult. Defaults to 1E-3 such that the variances are initialized to be small.
mean_init_mult: The approximate posterior means are initialized by sampling from the prior, multiplied by mean_init_mult. As there is no particular reason to make this small, it defaults to 1.
log_var_lr: Multiplier for the learning rate for the approximate posterior variances.
- Shape:
Input: (samples, mbatch, in_features)
Output: (samples, mbatch, out_features)
- Random Variables:
weight: the learnable weights of the module, of shape (in_features+bias, out_features), where bias=True or bias=False converts to bias=1 or bias=0 respectively. Note that we implement the bias by appending a vector of ones to the input, so the dimension of the weights depends on the presence of a bias.
- Prior:
IID Gaussian, with variance \(1/\text{in_channels}\)
- Approximate Posterior:
MFVI
Examples
>>> import torch
>>> import bayesfunc as bf
>>> m = bf.FactorisedLinear(20, 30)
>>> input = torch.randn(3, 128, 20)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([3, 128, 30])
- class bayesfunc.FactorisedConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, **kwargs)¶
IID Gaussian prior and factorised Gaussian posterior over the weights of a 2D convolutional layer.
- arg:
in_channels: number of channels in input tensor
out_channels: number of channels in output tensor
kernel_size: size of convolutional kernel
- kwargs:
stride: Standard convolutional stride. Defaults to 1.
padding: Standard convolutional padding. Defaults to 0.
prior: The prior over weights. Default NealPrior.
var_fixed: Defaults to None. If set to a float, it fixes the approximate posterior variance over weights to that value.
var_init_mult: The approximate posterior variance is initialized to be equal to the prior variance, multiplied by var_init_mult. Defaults to 1E-3 such that the variances are initialized to be small.
mean_init_mult: The approximate posterior means are initialized by sampling from the prior, multiplied by mean_init_mult. As there is no particular reason to make this small, it defaults to 1.
log_var_lr: Multiplier for the learning rate for the approximate posterior variances.
- Shape:
Input: (samples, mbatch, in_height, in_width, in_features)
Output: (samples, mbatch, in_height, in_width, out_features)
- Random Variables:
weight: the learnable weights of the module of shape (out_channels, in_channels, in_features, out_features).
- Prior:
IID Gaussian, with variance \(1/(\text{fan-in}*\text{kernel_size}^2)\)
- Approximate Posterior:
MFVI
Examples:
Next, we look at “Local” inducing point methods. These haven’t really been used in neural networks, because the performance doesn’t justify the additional computational cost.
- class bayesfunc.LILinear(in_features, out_features, bias=True, **kwargs)¶
IID Gaussian prior and factorised Gaussian posterior over the weights of a fully-connected layer.
inducing_batch is set to in_features+bias to give the smallest number of inducing points that is complete.
- arg:
in_features: size of each input sample
out_features: size of each output sample
- optional kwargs:
bias: If set to False, the layer will not learn an additive bias. Default: True
prior: Prior over neural network weights. Default: NealPrior.
neuron_prec: Use a different precision parameter for each hidden neuron? Considerably increases computational cost for relatively small performance benefit. Default: False.
log_prec_init: Initial value of the precision parameters. The default assumes that little data is available. Default: -4.
log_prec_lr: Multiplier for the learning rate of the precision parameters. Default: 1.
inducing_targets: Initial value of the inducing targets. Only useful in a single-layer net. Default: None.
inducing_batch: Initial value of the inducing batch. Only useful in a single-layer net. Default: None.
- Shape:
Input: (samples, mbatch, in_features)
Output: (samples, mbatch, out_features)
Examples
>>> import torch
>>> import bayesfunc as bf
>>> m = bf.LILinear(20, 30)
>>> input = torch.randn(3, 128, 20)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([3, 128, 30])
- class bayesfunc.LIConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, **kwargs)¶
IID Gaussian prior and factorised Gaussian posterior over the weights of a 2D convolutional layer.
inducing_batch is set to in_features+bias to give the smallest number of inducing points that is complete.
- arg:
in_channels: number of channels in input tensor
out_channels: number of channels in output tensor
kernel_size: size of convolutional kernel
- optional kwargs:
stride: Standard convolutional stride. Defaults to 1.
padding: Standard convolutional padding. Defaults to 0.
prior: Prior over neural network weights. Default: NealPrior.
inducing_targets: Initial value of the inducing targets (useful at the top-layer, but never necessary). Default: None.
neuron_prec: Use a different precision parameter for each hidden neuron? Considerably increases computational cost for relatively small performance benefit. Default: False.
log_prec_init: Initial value of the precision parameters. The default assumes that little data is available. Default: -4.
log_prec_lr: Multiplier for the learning rate of the precision parameters. Default: 1.
- Shape:
Input: (samples, mbatch, in_height, in_width, in_features)
Output: (samples, mbatch, in_height, in_width, out_features)
Global inducing approximate posteriors for Bayesian neural networks¶
These methods were developed in https://arxiv.org/abs/2005.08140 and give state-of-the-art performance in tasks such as image classification. They require wrapping in InducingWrapper.
- class bayesfunc.GILinear(in_features, out_features, bias=True, **kwargs)¶
IID Gaussian prior and factorised Gaussian posterior over the weights of a fully-connected layer.
- arg:
in_features: size of each input sample
out_features: size of each output sample
- compulsory kwargs:
inducing_batch: This module assumes that the first inducing_batch elements of the minibatch are inducing, and the rest are test/training inputs. Can be combined with InducingWrapper to simplify working with inducing inputs.
- optional kwargs:
bias: If set to False, the layer will not learn an additive bias. Default: True
prior: Prior over neural network weights. Default: NealPrior.
inducing_targets: Initial value of the inducing targets (useful at the top-layer, but never necessary). Default: None.
neuron_prec: Use a different precision parameter for each hidden neuron? Considerably increases computational cost for relatively small performance benefit. Default: False.
log_prec_init: Initial value of the precision parameters. The default assumes that little data is available. Default: -4.
log_prec_lr: Multiplier for the learning rate of the precision parameters. Default: 1.
- Shape:
Input: (samples, mbatch, in_features)
Output: (samples, mbatch, out_features)
Examples
>>> import torch
>>> import bayesfunc as bf
>>> m = bf.GILinear(20, 30, inducing_batch=20)
>>> input = torch.randn(3, 128, 20)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([3, 128, 30])
- class bayesfunc.GIConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, **kwargs)¶
IID Gaussian prior and factorised Gaussian posterior over the weights of a 2D convolutional layer.
- arg:
in_channels: number of channels in input tensor
out_channels: number of channels in output tensor
kernel_size: size of convolutional kernel
- compulsory kwargs:
inducing_batch: This module assumes that the first inducing_batch elements of the minibatch are inducing, and the rest are test/training inputs. Can be combined with InducingWrapper to simplify working with inducing inputs.
- optional kwargs:
stride: Standard convolutional stride. Defaults to 1.
padding: Standard convolutional padding. Defaults to 0.
prior: Prior over neural network weights. Default: NealPrior.
inducing_targets: Initial value of the inducing targets (useful at the top-layer, but never necessary). Default: None.
neuron_prec: Use a different precision parameter for each hidden neuron? Considerably increases computational cost for relatively small performance benefit. Default: False.
log_prec_init: Initial value of the precision parameters. The default assumes that little data is available. Default: -4.
log_prec_lr: Multiplier for the learning rate of the precision parameters. Default: 1.
- Shape:
Input: (samples, mbatch, in_height, in_width, in_features)
Output: (samples, mbatch, in_height, in_width, out_features)
Warning
The inducing targets for this class are only initialised after a pass through the network (because it is only possible to infer the shape of the targets after it has seen an input). As such, you must pass data through the network before calling opt(net.parameters(), lr=...). Not doing so will silently cause poor performance.
Library reference: deep Gaussian processes¶
For deep GPs, the fundamental class is the GIGP, which implements global inducing methods. Everything else (including local-inducing methods) is implemented in terms of GIGP.
- class bayesfunc.GIGP(out_features, inducing_targets=None, log_prec_init=-4.0, log_prec_lr=1.0, inducing_batch=None)¶
Global inducing point Gaussian process. Takes KG as input and returns features.
- arg:
out_features (int): Number of features to output.
- compulsory kwargs:
inducing_batch (int): Number of inducing points.
- optional kwargs:
inducing_targets: Initial setting of the inducing targets. Oly
log_prec_init: Initial value of the precision. Default to little evidence: -4.
log_prec_lr: Precision learning rate multiplier. Default: 1.
For testing
- bayesfunc.KernelGIGP(in_features, out_features, inducing_batch=None, **kwargs)¶
- bayesfunc.KernelLIGP(in_features, out_features, inducing_batch=None, kernel=None, **kwargs)¶
- class bayesfunc.SqExpKernel(in_features, inducing_batch=None)¶
Squared exponential kernel from features.
- arg:
in_features (int):
inducing_batch (int):
- class bayesfunc.SqExpKernelGram(log_lengthscale=0.0)¶
Squared exponential kernel from Gram matrix.
- optional kwargs:
log_lengthscale (float): initial value for the lengthscale. Default: 0.
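A squared exponential kernel can be computed from a Gram matrix alone, because the squared distance between two points is \(G_{aa} + G_{bb} - 2G_{ab}\). The sketch below illustrates that identity; `sq_exp_from_gram` is a hypothetical helper, and the exact scaling and parameterisation used by SqExpKernelGram may differ (for instance, in how the Gram matrix is normalised).

```python
import math

import torch


def sq_exp_from_gram(gram, log_lengthscale=0.0):
    # Squared distances from the Gram matrix: G_aa + G_bb - 2 G_ab.
    diag = torch.diagonal(gram, dim1=-2, dim2=-1)
    sq_dist = diag.unsqueeze(-1) + diag.unsqueeze(-2) - 2 * gram
    lengthscale = math.exp(log_lengthscale)
    return torch.exp(-0.5 * sq_dist / lengthscale ** 2)


torch.manual_seed(0)
x = torch.randn(3, 50, 20, dtype=torch.float64)  # (samples, points, features)
gram = x @ x.transpose(-1, -2)
k = sq_exp_from_gram(gram)

# Reference computed directly from pairwise differences of the features.
diff = x.unsqueeze(-2) - x.unsqueeze(-3)
reference = torch.exp(-0.5 * (diff ** 2).sum(-1))
print(torch.allclose(k, reference))
```

Working from the Gram matrix means the kernel never needs the features themselves, which is what allows it to sit on top of layers that output covariances rather than features.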
- class bayesfunc.ReluKernelGram¶
Relu kernel from Gram matrix.
- optional kwargs:
log_lengthscale (float): initial value for the lengthscale. Default: 0.
Library reference: deep kernel processes¶
- class bayesfunc.IWLayer(inducing_batch)¶
Inverse Wishart layer from a deep kernel process. Takes a KG as input, and returns KG as output.
- arg:
inducing_batch (int): number of inducing inputs
- class bayesfunc.SingularIWLayer(in_features, inducing_batch)¶
Singular Inverse Wishart layer which takes the input features in a deep kernel process. Takes features as input, and returns KG as output.
- arg:
in_features (int): number of features
inducing_batch (int): number of inducing points.