Welcome to bayesfunc’s documentation!¶

bayesfunc is a PyTorch library providing a number of state-of-the-art priors and variational approximate posteriors over functions.

In particular, we implement:

Global inducing variational inference for neural networks (https://arxiv.org/abs/2005.08140)

Global inducing variational inference for deep Gaussian processes (https://arxiv.org/abs/2005.08140)

Deep kernel processes (https://arxiv.org/abs/2010.01590)

In addition, we implement a number of more standard methods, primarily to give fair, easy-to-implement comparisons:

Mean field variational inference

Sparse (deep) Gaussian process inference

Library Conventions¶

bayesfunc introduces a number of PyTorch modules mirroring the standard PyTorch nn API. As such, modules can be combined and networks created using e.g. nn.Sequential. See Examples for further details. However, these modules differ from standard PyTorch modules in a couple of ways.

Sample and minibatch¶

In standard PyTorch, inputs to modules are tensors, and the zeroth index usually represents a minibatch. For instance, if we had a minibatch of size 128,

>>> import torch
>>> import torch.nn as nn
>>> m = nn.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])

However, here we are sampling from an approximate posterior over functions, and often we will want to draw multiple different samples from that approximate posterior (e.g. to reduce the variance of our estimate of the ELBO). As such, we introduce a new convention: the zeroth index gives the sample of the approximate posterior, and the first index gives the minibatch. Consider possibly the simplest module defined in our library, FactorisedLinear, which does mean-field variational inference (MFVI) over the weights of a fully-connected layer. For instance, if we wanted to do the equivalent of the above, with just one sample from the posterior over functions that we apply to every input in the minibatch,

>>> import bayesfunc as bf
>>> m = bf.FactorisedLinear(20, 30)
>>> input = torch.randn(1, 128, 20)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([1, 128, 30])

Note that both the input and output are rank 3 tensors, whereas in the original pure PyTorch example, they were only rank 2 tensors.

If we wanted to draw 10 samples of the function and apply them to our inputs, we would need to give the module a tensor with shape (10, minibatch, in_features). To efficiently replicate the inputs 10 times, we use expand:

>>> m = bf.FactorisedLinear(20, 30)
>>> input = torch.randn(1, 128, 20).expand(10, -1, -1)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([10, 128, 30])

Note that despite the inputs being the same for different samples, the outputs are different, because we have applied a different sampled function to each.

>>> (input[0, 0, 0] == input[1, 0, 0]).item()
True
>>> (output[0, 0, 0] == output[1, 0, 0]).item()
False
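As a side note on why expand is the efficient choice for replicating inputs, the following minimal sketch uses only plain PyTorch (no bayesfunc calls). It shows that expand returns a view with stride 0 along the sample dimension, whereas repeat allocates real copies. The variable names are illustrative.

```python
import torch

# One copy of the minibatch, with a leading sample dimension of size 1.
base = torch.randn(1, 128, 20)

# expand returns a view: the sample dimension gets stride 0, so no data is copied.
expanded = base.expand(10, -1, -1)

# repeat, by contrast, allocates 10 real copies of the data.
repeated = base.repeat(10, 1, 1)

print(expanded.shape)                          # torch.Size([10, 128, 20])
print(expanded.stride()[0])                    # 0
print(expanded.data_ptr() == base.data_ptr())  # True: same underlying memory
print(repeated.data_ptr() == base.data_ptr())  # False: freshly allocated
print(torch.equal(expanded, repeated))         # True: same values either way
```

Because the expanded tensor shares memory across samples, it should be treated as read-only; writing to it in place is not safe.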

Propagate function¶

To run a network, you must use the propagate function. This function prepares the network (by optionally loading up a previously sampled set of weights), runs the network, returning the output, \(\log P(f) - \log Q(f)\), and the sampled weights used,

bayesfunc.propagate(f, *args, sample_dict=None, detach=True)¶

The ONLY way to run the neural networks defined in bayesfunc. Replaces f(input), which will now fail silently!

Parameters

f – the bayesfunc function
input – input to the function

Keyword Arguments

sample_dict – optional dictionary of sampled weights, to allow using the same weights for multiple different inputs
detach – if true, we detach parameters in sample_dict, stopping propagation of gradients

outputs:

output: neural network output (as in output = f(input)).
logpq: \(\log P(f) - \log Q(f)\) the difference of prior and approximate posterior log-probabilities.
output_sample_dict: a dictionary containing all the sampled weights used in the network. If sample_dict is set, we have output_sample_dict == sample_dict.

Warning

Only properly implemented for GILinear, GIConv2d, FactorisedLinear and FactorisedConv2d. Everything else will run, but will independently sample a new function on every invocation, ignoring the sample_dict input argument.

In standard use, sample_dict is never set and output_sample_dict is never needed. These only become useful e.g. in continual learning.
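To make the role of sample_dict concrete, here is a toy sketch in plain PyTorch. The function toy_propagate and the "weight" key are hypothetical stand-ins invented for illustration; they are not bayesfunc's implementation, and the real sample_dict layout may differ. The sketch only shows the idea: passing back a previously returned dictionary reuses the same sampled function, while omitting it draws a new one.

```python
import torch

def toy_propagate(mean, log_var, x, sample_dict=None):
    """Hypothetical stand-in for the sample_dict idea; NOT bayesfunc's implementation."""
    if sample_dict is None:
        # Draw one weight matrix per sample (leading dimension of x).
        eps = torch.randn(x.shape[0], *mean.shape)
        sample_dict = {"weight": mean + (0.5 * log_var).exp() * eps}
    # Batched matmul: (samples, mbatch, in) @ (samples, in, out).
    return x @ sample_dict["weight"], sample_dict

mean, log_var = torch.zeros(20, 30), torch.zeros(20, 30)
x = torch.randn(1, 128, 20).expand(4, -1, -1)

out_a, weights = toy_propagate(mean, log_var, x)                      # fresh sample
out_b, reused = toy_propagate(mean, log_var, x, sample_dict=weights)  # same function
out_c, _ = toy_propagate(mean, log_var, x)                            # new function

print(torch.equal(out_a, out_b))  # True: same weights, same outputs
print(reused is weights)          # True: the dictionary is passed straight through
print(torch.equal(out_a, out_c))  # False: a different function was sampled
```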

Computing the ELBO¶

In variational inference, we use the ELBO as the loss,

\(\mathcal{L}(\phi) = E_{Q_\phi}[\sum_{i=1}^N \log P(y_i|x_i, f) + \log P(f) - \log Q_\phi(f)]\)

where \(x_i\) is a single input (e.g. image), \(y_i\) is a single output (e.g. label) and there are \(N\) datapoints in total. The bayesfunc library defines priors, \(P(f)\), and approximate posteriors, \(Q_\phi(f)\), over functions, where \(\phi\) are the trainable parameters of the approximate posterior.

In practice, we use a minibatched estimate of the averaged loss (averaging across datapoints),

\(\frac{1}{N} \hat{\mathcal{L}}(\phi) = \frac{1}{B} \sum_{i\in \mathcal{B}_j}\log P(\text{data}_i| f) + \frac{1}{N} (\log P(f) - \log Q_\phi(f))\)

Here, \(\mathcal{B}_j\) is the \(j\)th minibatch, \(B\) is the size of a minibatch, and \(f\) has been sampled from the approximate posterior, \(Q_\phi(f)\). Critically, the first term is just the negative of the standard neural-network loss (e.g. the cross-entropy),

\(\frac{1}{B} \sum_{i\in \mathcal{B}_j}\log P(\text{data}_i| f)\) = - average cross entropy

And the second term is given by the bayesfunc library, as the second value returned by bf.propagate,

\(\log P(f) - \log Q_\phi(f)\) = bf.propagate(net, input)[1]

As such, the full training loop might look like:

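import torch.nn.functional as F
import bayesfunc as bf

# net, dataloader, optimizer and N (the total number of datapoints) are defined elsewhere
for x, y in dataloader:
    # add a sample dimension of size 1 at the front
    x = x.unsqueeze(0)
    # compute the output and log P(f) - log Q(f)
    output, logpq, _ = bf.propagate(net, x)
    # average log-likelihood: the negative cross-entropy of the outputs (sample dimension dropped)
    log_like = -F.cross_entropy(output.squeeze(0), y, reduction="mean")
    # the objective (per-datapoint ELBO estimate), where N is the number of datapoints
    obj = log_like + logpq/N
    optimizer.zero_grad()
    (-obj).backward()
    optimizer.step()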
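The scaling of logpq by 1/N in the objective above can be checked with a short derivation. As an assumption not stated in the text, suppose each minibatch \(\mathcal{B}_j\) contains \(B\) datapoints drawn uniformly at random from the \(N\) available. Then the minibatch average is an unbiased estimate of the full-data average,

\(E_{\mathcal{B}_j}\left[\frac{1}{B} \sum_{i\in \mathcal{B}_j} \log P(y_i|x_i, f)\right] = \frac{1}{N} \sum_{i=1}^N \log P(y_i|x_i, f)\)

and taking the expectation over \(f\) sampled from \(Q_\phi\) as well gives

\(E_{Q_\phi} E_{\mathcal{B}_j}\left[\frac{1}{N} \hat{\mathcal{L}}(\phi)\right] = \frac{1}{N} E_{Q_\phi}\left[\sum_{i=1}^N \log P(y_i|x_i, f) + \log P(f) - \log Q_\phi(f)\right] = \frac{1}{N} \mathcal{L}(\phi)\)

Under that assumption, obj is a stochastic estimate of the ELBO divided by the constant \(N\), so maximising it by stochastic gradient ascent targets the same optimum as the full ELBO.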

Wrapper for global inducing methods¶

Many of our function approximators require “global” inducing points, i.e. optimized pseudo-inputs that look like standard data-items. These modules (i.e. GILinear, GIConv2d, GIGP, DKP) require an InducingWrapper,

bayesfunc.InducingWrapper(net, inducing_batch, *args, **kwargs)¶

Combines incoming test/train data with learned inducing inputs, then strips away the inducing outputs, leaving just the function evaluated at the test/train locations.

Parameters

net (nn.Module) – The underlying function approximator, represented as PyTorch modules, to be wrapped.
inducing_batch (int) – The number of inducing points.

Keyword Arguments

inducing_shape (Optional[torch.Size]) – The size of the inducing inputs, including inducing_batch as the first dimension. Default: None.
inducing_data (Optional[torch.Tensor]) – The values of the inducing inputs. Useful to e.g. initialize the inducing points on top of datapoints. Default: None.
fixed (bool) – Do we fix the inducing point locations? Default: False.

Exactly one of inducing_shape or inducing_data must be specified.

Example

>>> import bayesfunc as bf
>>> import torch as t
>>> import torch.nn as nn
>>>
>>> in_features = 20
>>> hidden_features = 50
>>> out_features = 30
>>>
>>> m1 = bf.GILinear(in_features, hidden_features, inducing_batch=100)
>>> m2 = bf.GILinear(hidden_features, out_features, inducing_batch=100)
>>> net = nn.Sequential(m1, m2)
>>>
>>> net = bf.InducingWrapper(net, 100, inducing_shape=(100, in_features))
>>> output, _, _ = bf.propagate(net, t.randn(3, 128, in_features))
>>> output.shape
torch.Size([3, 128, 30])
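The combine-then-strip behaviour described above can be pictured with a plain PyTorch sketch. This is not the wrapper's actual code: the wrapped network is replaced by an identity placeholder and all variable names are illustrative. It relies only on the convention, stated in the GILinear reference, that the first inducing_batch elements of the minibatch dimension are the inducing points.

```python
import torch

samples, inducing_batch, mbatch, in_features = 3, 100, 128, 20
inducing_inputs = torch.randn(inducing_batch, in_features)  # learned in practice
x = torch.randn(samples, mbatch, in_features)

# Prepend the inducing inputs to every sample's minibatch.
x_i = inducing_inputs.expand(samples, -1, -1)
combined = torch.cat([x_i, x], dim=1)  # (samples, inducing_batch + mbatch, in_features)

# The wrapped network would run on `combined` here; identity is a placeholder.
hidden = combined

# Strip the inducing rows, keeping only the outputs at the test/train points.
output = hidden[:, inducing_batch:]

print(combined.shape)  # torch.Size([3, 228, 20])
print(output.shape)    # torch.Size([3, 128, 20])
```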

Structured kernels for kernel-based methods¶

To implement kernel-based methods efficiently, we can’t propagate the full \((P_\text{i}+P_\text{t})\times(P_\text{i}+P_\text{t})\) covariance matrix, where \(P_\text{i}\) is the number of inducing points, and \(P_\text{t}\) is the number of test/training points, as \(P_\text{t}\) could be very large. Instead, we propagate a special type:

class bayesfunc.KG(ii, it, tt)¶

Simple container class for different components of a covariance matrix. You shouldn’t need to use this unless you are developing your own kernels.

arg:

ii: \(P_\text{i}\times P_\text{i}\) covariance matrix for inducing points. shape=(samples, inducing_batch, inducing_batch)
it: \(P_\text{i}\times P_\text{t}\) covariance matrix between inducing points and test/train points. shape=(samples, inducing_batch, mbatch)
tt: \(P_\text{t}\) diagonal variances for test/train points. shape=(samples, 1, mbatch)?
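To illustrate the saving from storing only these three blocks, the sketch below computes them for a squared exponential kernel in plain PyTorch. The helper sq_exp is hypothetical, the blocks are kept as separate tensors rather than placed in a bayesfunc.KG object, and tt is stored here with shape (samples, mbatch); the exact layout bayesfunc uses may differ.

```python
import torch

def sq_exp(a, b, lengthscale=1.0):
    """Squared exponential kernel between the rows of a and b, batched over samples."""
    sq_dist = torch.cdist(a, b) ** 2
    return torch.exp(-0.5 * sq_dist / lengthscale**2)

samples, inducing_batch, mbatch, features = 3, 10, 128, 5
x_i = torch.randn(samples, inducing_batch, features)  # inducing inputs
x_t = torch.randn(samples, mbatch, features)          # test/train inputs

ii = sq_exp(x_i, x_i)  # inducing-inducing block
it = sq_exp(x_i, x_t)  # inducing-test/train block
# Only the diagonal of the test/train block; for this kernel k(x, x) = 1.
tt = torch.ones(samples, mbatch)

print(ii.shape, it.shape, tt.shape)

# Entries per sample: full covariance matrix versus the three stored blocks.
full = (inducing_batch + mbatch) ** 2
kept = inducing_batch**2 + inducing_batch * mbatch + mbatch
print(full, kept)  # 19044 1508
```

The stored size grows linearly in the number of test/train points, whereas the full matrix grows quadratically.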

Library reference: Bayesian neural networks¶

Simple approximate posteriors for Bayesian neural networks¶

These methods compute an approximate posterior over weights and are relatively simple: they don’t have global inducing points, and therefore don’t need wrapping in InducingWrapper (Wrapper for global inducing methods). That said, you can wrap them if you want, which is mainly useful when combining some of these simpler methods with a global inducing method.

First, we look at factorised methods. They are easy to apply, but often don’t work that well. It can be important to initialise the approximate posterior with very low variance to get them to converge.
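As a rough picture of what a factorised (mean-field) posterior over the weights of a linear layer involves, here is a sketch in plain PyTorch. It is not bayesfunc's implementation: the prior variance of 1/in_features and the initialisation are assumptions chosen to mirror the var_init_mult and mean_init_mult descriptions below, and all variable names are illustrative.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
in_features, out_features, samples = 20, 30, 10
prior_var = 1.0 / in_features  # assumed prior variance of 1/fan-in
var_init_mult = 1e-3           # small initial posterior variance

# Variational parameters: one mean and one log-variance per weight.
mean = (prior_var ** 0.5) * torch.randn(in_features, out_features)
log_var = torch.full((in_features, out_features), prior_var * var_init_mult).log()

prior = Normal(0.0, prior_var ** 0.5)
posterior = Normal(mean, (0.5 * log_var).exp())

# Reparameterised draw of `samples` independent weight matrices.
weights = posterior.rsample((samples,))  # (samples, in_features, out_features)
# log P(f) - log Q(f), one value per sample.
logpq = (prior.log_prob(weights) - posterior.log_prob(weights)).sum((-1, -2))

x = torch.randn(1, 128, in_features).expand(samples, -1, -1)
output = x @ weights  # (samples, mbatch, out_features)
print(output.shape, logpq.shape)
```

With such a narrow initial posterior, the sampled functions start out nearly identical to one another.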

class bayesfunc.FactorisedLinear(in_features, out_features, bias=True, **kwargs)¶

IID Gaussian prior and factorised Gaussian posterior over the weights of a fully-connected layer.

arg:

in_features: size of each input sample
out_features: size of each output sample

kwargs:

bias: If set to False, the layer will not learn an additive bias. Default: True
prior: The prior over weights. Default NealPrior.
var_fixed: Defaults to None. If set to a float, it fixes the approximate posterior variance over weights to that value.
var_init_mult: The approximate posterior variance is initialized to be equal to the prior variance, multiplied by var_init_mult. Defaults to 1E-3 such that the variances are initialized to be small.
mean_init_mult: The approximate posterior means are initialized by sampling from the prior, multiplied by mean_init_mult. As there is no particular reason to make this small, it defaults to 1.
log_var_lr: Multiplier for the learning rate for the approximate posterior variances.

Shape:

Input: (samples, mbatch, in_features)
Output: (samples, mbatch, out_features)

Random Variables:

weight: the learnable weights of the module of shape (in_features+bias, out_features), where bias=True or bias=False converts to bias=1 or bias=0. Note that we implement the bias by appending a vector of ones to the input, so the dimension of the weights depends on the presence of a bias.

Prior:

IID Gaussian, with variance \(1/\text{in_features}\)

Approximate Posterior:

MFVI

Examples

>>> import torch
>>> import bayesfunc as bf
>>> m = bf.FactorisedLinear(20, 30)
>>> input = torch.randn(3, 128, 20)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([3, 128, 30])
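The weight shape (in_features+bias, out_features) noted above comes from folding the bias into the weight matrix. The following plain PyTorch sketch (illustrative names, no bayesfunc calls) checks that appending a feature of ones to the input and a bias row to the weights reproduces the usual affine map.

```python
import torch

in_features, out_features = 20, 30
x = torch.randn(3, 128, in_features)  # (samples, mbatch, in_features)
weight = torch.randn(in_features, out_features)
bias = torch.randn(out_features)

# Standard affine map with a separate bias vector.
separate = x @ weight + bias

# Fold the bias in: append a feature of ones to x and a bias row to the weights.
ones = torch.ones(*x.shape[:-1], 1)
x_aug = torch.cat([x, ones], dim=-1)                # (3, 128, in_features + 1)
w_aug = torch.cat([weight, bias[None, :]], dim=0)   # (in_features + 1, out_features)
folded = x_aug @ w_aug

print(w_aug.shape)                                  # torch.Size([21, 30])
print(torch.allclose(separate, folded, atol=1e-4))  # True
```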

class bayesfunc.FactorisedConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, **kwargs)¶

IID Gaussian prior and factorised Gaussian posterior over the weights of a 2D convolutional layer.

arg:

in_channels: number of channels in input tensor
out_channels: number of channels in output tensor
kernel_size: size of convolutional kernel

kwargs:

stride: Standard convolutional stride. Defaults to 1.
padding: Standard convolutional padding. Defaults to 0.
prior: The prior over weights. Default NealPrior.
var_fixed: Defaults to None. If set to a float, it fixes the approximate posterior variance over weights to that value.
var_init_mult: The approximate posterior variance is initialized to be equal to the prior variance, multiplied by var_init_mult. Defaults to 1E-3 such that the variances are initialized to be small.
mean_init_mult: The approximate posterior means are initialized by sampling from the prior, multiplied by mean_init_mult. As there is no particular reason to make this small, it defaults to 1.
log_var_lr: Multiplier for the learning rate for the approximate posterior variances.

Shape:

Input: (samples, mbatch, in_height, in_width, in_features)
Output: (samples, mbatch, in_height, in_width, out_features)

Random Variables:

weight: the learnable weights of the module of shape (out_channels, in_channels, in_features, out_features).

Prior:

IID Gaussian, with variance \(1/(\text{fan-in}*\text{kernel_size}^2)\)

Approximate Posterior:

MFVI

Examples:

Next, we look at “Local” inducing point methods. These haven’t really been used in neural networks, because the performance doesn’t justify the additional computational cost.

class bayesfunc.LILinear(in_features, out_features, bias=True, **kwargs)¶

IID Gaussian prior and factorised Gaussian posterior over the weights of a fully-connected layer. inducing_batch is set to in_features+bias to give the smallest number of inducing points that is complete.

arg:

in_features: size of each input sample
out_features: size of each output sample

optional kwargs:

bias: If set to False, the layer will not learn an additive bias. Default: True
prior: Prior over neural network weights. Defaults: NealPrior.
neuron_prec: Use a different precision parameter for each hidden neuron? Considerably increases computational cost for relatively small performance benefit. Default: False.
log_prec_init: Initial value of the precision parameters. The default assumes that little data is available. Default: -4.
log_prec_lr: Multiplier for the learning rate of the precision parameters. Default: 1.
inducing_targets: Initial value of the inducing targets. Only useful in a single-layer net. Default: None.
inducing_batch: Initial value of the inducing batch. Only useful in a single-layer net. Default: None.

Shape:

Input: (samples, mbatch, in_features)
Output: (samples, mbatch, out_features)

Examples

>>> import torch
>>> import bayesfunc as bf
>>> m = bf.LILinear(20, 30)
>>> input = torch.randn(3, 128, 20)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([3, 128, 30])

class bayesfunc.LIConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, **kwargs)¶

IID Gaussian prior and factorised Gaussian posterior over the weights of a 2D convolutional layer. inducing_batch is set to in_features+bias to give the smallest number of inducing points that is complete.

arg:

in_channels: number of channels in input tensor
out_channels: number of channels in output tensor
kernel_size: size of convolutional kernel

optional kwargs:

stride: Standard convolutional stride. Defaults to 1.
padding: Standard convolutional padding. Defaults to 0.
prior: Prior over neural network weights. Defaults: NealPrior.
inducing_targets: Initial value of the inducing targets (useful at the top-layer, but never necessary). Default: None.
neuron_prec: Use a different precision parameter for each hidden neuron? Considerably increases computational cost for relatively small performance benefit. Default: False.
log_prec_init: Initial value of the precision parameters. The default assumes that little data is available. Default: -4.
log_prec_lr: Multiplier for the learning rate of the precision parameters. Default: 1.

Shape:

Input: (samples, mbatch, in_height, in_width, in_features)
Output: (samples, mbatch, in_height, in_width, out_features)

Global inducing approximate posteriors for Bayesian neural networks¶

These methods were developed in https://arxiv.org/abs/2005.08140 and give state-of-the-art performance in tasks such as image classification. They require wrapping in InducingWrapper.

class bayesfunc.GILinear(in_features, out_features, bias=True, **kwargs)¶

IID Gaussian prior and factorised Gaussian posterior over the weights of a fully-connected layer.

arg:

in_features: size of each input sample
out_features: size of each output sample

compulsory kwargs:

inducing_batch: This module assumes that the first inducing_batch elements of the minibatch are inducing, and the rest are test/training inputs. Can be combined with InducingWrapper to simplify working with inducing inputs.

optional kwargs:

bias: If set to False, the layer will not learn an additive bias. Default: True
prior: Prior over neural network weights. Defaults: NealPrior.
inducing_targets: Initial value of the inducing targets (useful at the top-layer, but never necessary). Default: None.
neuron_prec: Use a different precision parameter for each hidden neuron? Considerably increases computational cost for relatively small performance benefit. Default: False.
log_prec_init: Initial value of the precision parameters. The default assumes that little data is available. Default: -4.
log_prec_lr: Multiplier for the learning rate of the precision parameters. Default: 1.

Shape:

Input: (samples, mbatch, in_features)
Output: (samples, mbatch, out_features)

Examples

>>> import torch
>>> import bayesfunc as bf
>>> m = bf.GILinear(20, 30, inducing_batch=20)
>>> input = torch.randn(3, 128, 20)
>>> output, _, _ = bf.propagate(m, input)
>>> print(output.size())
torch.Size([3, 128, 30])

class bayesfunc.GIConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, **kwargs)¶

IID Gaussian prior and factorised Gaussian posterior over the weights of a 2D convolutional layer.

arg:

in_channels: number of channels in input tensor
out_channels: number of channels in output tensor
kernel_size: size of convolutional kernel

compulsory kwargs:

inducing_batch: This module assumes that the first inducing_batch elements of the minibatch are inducing, and the rest are test/training inputs. Can be combined with InducingWrapper to simplify working with inducing inputs.

optional kwargs:

stride: Standard convolutional stride. Defaults to 1.
padding: Standard convolutional padding. Defaults to 0.
prior: Prior over neural network weights. Defaults: NealPrior.
inducing_targets: Initial value of the inducing targets (useful at the top-layer, but never necessary). Default: None.
neuron_prec: Use a different precision parameter for each hidden neuron? Considerably increases computational cost for relatively small performance benefit. Default: False.
log_prec_init: Initial value of the precision parameters. The default assumes that little data is available. Default: -4.
log_prec_lr: Multiplier for the learning rate of the precision parameters. Default: 1.

Shape:

Input: (samples, mbatch, in_height, in_width, in_features)
Output: (samples, mbatch, in_height, in_width, out_features)

Warning

The inducing targets for this class are only initialised after a pass through the network (because it is only possible to infer the shape of the targets after it has seen an input). As such, you must pass data through the network before calling opt(net.parameters(), lr=...). Not doing so will silently cause poor performance.
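The failure mode in this warning can be reproduced with a toy module in plain PyTorch. LazyTargets is a hypothetical class written for illustration, not part of bayesfunc: it creates one of its parameters on the first forward pass, so an optimizer built beforehand never sees that parameter.

```python
import torch
import torch.nn as nn

class LazyTargets(nn.Module):
    """Hypothetical module whose `targets` parameter is created on the first forward pass."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(()))
        self.register_parameter("targets", None)

    def forward(self, x):
        if self.targets is None:
            # The shape is only known once an input has been seen.
            self.targets = nn.Parameter(torch.zeros(x.shape[1:]))
        return self.scale * x + self.targets

net = LazyTargets()

# Wrong order: the optimizer is built before `targets` exists, so it never updates it.
opt_early = torch.optim.SGD(net.parameters(), lr=0.1)
net(torch.randn(8, 4))  # first pass creates `targets`

# Right order: build the optimizer after one pass through the network.
opt_late = torch.optim.SGD(net.parameters(), lr=0.1)

n_early = sum(len(g["params"]) for g in opt_early.param_groups)
n_late = sum(len(g["params"]) for g in opt_late.param_groups)
print(n_early, n_late)  # 1 2
```

No error is raised in the wrong-order case, which is why the problem shows up only as poor performance.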

Library reference: deep Gaussian processes¶

For deep GPs, the fundamental class is the GIGP, which implements global inducing methods. Everything else (including local-inducing methods) is implemented in terms of GIGP.

class bayesfunc.GIGP(out_features, inducing_targets=None, log_prec_init=-4.0, log_prec_lr=1.0, inducing_batch=None)¶

Global inducing point Gaussian process. Takes KG as input and returns features.

arg:

out_features (int): Number of features to output.

compulsory kwargs:

inducing_batch (int): Number of inducing points.

optional kwargs:

inducing_targets: Initial setting of the inducing targets. Default: None.
log_prec_init: Initial value of the precision. Default to little evidence: -4.
log_prec_lr: Precision learning rate multiplier. Default: 1.

For testing

bayesfunc.KernelGIGP(in_features, out_features, inducing_batch=None, **kwargs)¶

bayesfunc.KernelLIGP(in_features, out_features, inducing_batch=None, kernel=None, **kwargs)¶

class bayesfunc.SqExpKernel(in_features, inducing_batch=None)¶

Squared exponential kernel from features.

arg:

in_features (int): number of input features
inducing_batch (int): number of inducing points

class bayesfunc.SqExpKernelGram(log_lengthscale=0.0)¶

Squared exponential kernel from Gram matrix.

optional kwargs:

log_lengthscale (float): initial value for the log-lengthscale. Default: 0.

class bayesfunc.ReluKernelGram¶

Relu kernel from Gram matrix.

optional kwargs:

log_lengthscale (float): initial value for the log-lengthscale. Default: 0.

Library reference: deep kernel processes¶

class bayesfunc.IWLayer(inducing_batch)¶

Inverse Wishart layer from a deep kernel process. Takes a KG as input, and returns a KG as output.

arg:

inducing_batch (int): number of inducing inputs

class bayesfunc.SingularIWLayer(in_features, inducing_batch)¶

Singular Inverse Wishart layer which takes the input features in a deep kernel process. Takes features as input, and returns a KG as output.

arg:

in_features (int): number of features
inducing_batch (int): number of inducing points.