*Sources:*

**Update:** PyMC4 based on TensorFlow Probability will not be developed further. PyMC3 on Theano with the new JAX backend is the future.

This article demonstrates how to implement a simple Bayesian neural network for regression with an early PyMC4 development snapshot (from Jul 29, 2020). It can be installed with

```
pip install git+https://github.com/pymc-devs/pymc4@1c5e23825271fc2ff0c701b9224573212f56a534
```

I’ll update this article from time to time to cover new features or to fix breaking API changes. The latest update (Aug. 19, 2020) includes the recently added support for variational inference (VI). The following sections assume that you have a basic familiarity with PyMC3. If this is not the case I recommend reading Getting started with PyMC3 first.

```
import logging
import pymc4 as pm
import numpy as np
import arviz as az
import tensorflow as tf
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
%matplotlib inline
print(pm.__version__)
print(tf.__version__)
print(tfp.__version__)
# Mute Tensorflow warnings ...
logging.getLogger('tensorflow').setLevel(logging.ERROR)
```

```
4.0a2
2.4.0-dev20200818
0.12.0-dev20200818
```

## Introduction to PyMC4

PyMC4 uses Tensorflow Probability (TFP) as backend and PyMC4 random variables are wrappers around TFP distributions. Models must be defined as generator functions, using a `yield`

keyword for each random variable. PyMC4 uses coroutines to interact with the generator to get access to these variables. Depending on the context, PyMC4 may sample values from random variables, compute log probabilities of observed values, … and so on. Details are covered in the PyMC4 design guide. Model generator functions must be decorated with ` @pm.model` as shown in the following trivial example:

```
@pm.model
def model(x):
# prior for the mean of a normal distribution
loc = yield pm.Normal('loc', loc=0, scale=10)
# likelihood of observed data
obs = yield pm.Normal('obs', loc=loc, scale=1, observed=x)
```

This models normally distributed data centered at a location `loc`

to be inferred. Inference can be started with `pm.sample()`

which uses the No-U-Turn Sampler (NUTS). Samplers other than NUTS are currently not implemented in PyMC4.

```
# 30 data points normally distributed around 3
x = np.random.randn(30) + 3
# Inference
trace = pm.sample(model(x))
trace
```

```
arviz.InferenceData
- posterior
- sample_stats
- observed_data
```

The returned `trace`

object is an ArviZ `InferenceData`

object. It contains posterior samples, observed data and sampler statistics. The posterior distribution over `loc`

can be displayed with:

```
az.plot_posterior(trace, var_names=['model/loc']);
```

A recent addition to PyMC4 is variational inference and supported methods currently are `advi`

and `fullrank_advi`

. After fitting the model, posterior samples can be obtained from the resulting `approximation`

object (representing a mean-field approximation in this case).

```
fit = pm.fit(model(x), num_steps=10000, method='advi')
trace = fit.approximation.sample(1000)
```

```
az.plot_posterior(trace, var_names=['model/loc']);
```

The history of the variational lower bound (= negative loss) during training can be displayed with

```
plt.plot(-fit.losses)
plt.ylabel('Variational lower bound')
plt.xlabel('Step');
```

which confirms a good convergence after about 10,000 steps. Models can also be composed through nesting and used like other PyMC4 random variables.

```
@pm.model
def prior(name, loc=0, scale=10):
loc = yield pm.Normal(name, loc=loc, scale=scale)
return loc
@pm.model
def model(x):
loc = yield prior('loc')
obs = yield pm.Normal('obs', loc=loc, scale=1, observed=x)
trace = pm.sample(model(x))
az.plot_posterior(trace, var_names=['model/prior/loc']);
```

A more elaborate example is shown below where a neural network is composed of several layers.

## Example dataset

The dataset used in the following example contains `N`

noisy samples from a sinusoidal function `f`

in two distinct regions (`x1`

and `x2`

).

```
def f(x, noise):
"""Generates noisy samples from a sinusoidal function at x."""
return np.sin(2 * np.pi * x) + np.random.randn(*x.shape) * noise
N = 40
noise = 0.1
x1 = np.linspace(-0.6, -0.15, N // 2, dtype=np.float32)
x2 = np.linspace(0.15, 0.6, N // 2, dtype=np.float32)
x = np.concatenate([x1, x2]).reshape(-1, 1)
y = f(x, noise=noise)
x_test = np.linspace(-1.5, 1.5, 200, dtype=np.float32).reshape(-1, 1)
f_test = f(x_test, noise=0.0)
plt.scatter(x, y, marker='o', c='k', label='Samples')
plt.plot(x_test, f_test, 'k--', label='f')
plt.legend();
```

## Bayesian neural network

### Model definition

To model the non-linear relationship between `x`

and `y`

in the dataset we use a ReLU neural network with two hidden layers, 5 neurons each. The weights of the neural network are random variables instead of deterministic variables. This is what makes a neural network a Bayesian neural network. Here, we assume that the weights are independent random variables.

The neural network defines a prior over the mean `loc`

of the data likelihood `obs`

which is represented by a normal distribution. For simplicity, the aleatoric uncertainty (`noise`

) in the data is assumed to be known. Thanks to PyMC4’s model composition support, priors can be defined layer-wise using the `layer`

generator function and composed to a neural network as shown in function `model`

. During inference, a posterior distribution over the neural network weights is obtained.

```
@pm.model
def layer(name, x, n_in, n_out, prior_scale, activation=tf.identity):
w = yield pm.Normal(name=f'{name}_w', loc=0, scale=prior_scale, batch_stack=(n_in, n_out))
b = yield pm.Normal(name=f'{name}_b', loc=0, scale=prior_scale, batch_stack=(1, n_out))
return activation(tf.tensordot(x, w, axes=[1, 0]) + b)
@pm.model
def model(x, y, prior_scale=1.0):
o1 = yield layer('l1', x, n_in=1, n_out=5, prior_scale=prior_scale, activation=tf.nn.relu)
o2 = yield layer('l2', o1, n_in=5, n_out=5, prior_scale=prior_scale, activation=tf.nn.relu)
o3 = yield layer('l3', o2, n_in=5, n_out=1, prior_scale=prior_scale)
yield pm.Normal(name='obs', loc=o3, scale=noise, observed=y)
```

The `batch_stack`

parameter of random variable constructors is used to define the shape of the random variable.

### Inference

Tensorflow will automatically run inference on a GPU if available. With the current version of PyMC4, MCMC inference using NUTS on a GPU is quite slow compared to a multi-core CPU (need to investigate that in more detail). To enforce inference on a CPU set environment variable `CUDA_VISIBLE_DEVICES`

to an empty value. There is no progress bar visible yet during sampling but the following shouldn’t take longer than a few minutes on a modern multi-core CPU.

```
# MCMC inference with NUTS
trace = pm.sample(model(x, y, prior_scale=3), burn_in=100, num_samples=1000)
```

Variational inference is significantly faster but the results are less convincing than the MCMC results. I need to investigate that further to see if I’m doing something wrong or if this is an issue with the current PyMC4 development snapshot. We’ll therefore use the MCMC results in the following section. If you want to see the VI results, run the following cell instead of the previous one.

```
# Variational inference with full rank ADVI
fit = pm.fit(model(x, y, prior_scale=0.5), num_steps=150000, method='fullrank_advi')
# Draw samples from the resulting mean-field approximation
trace = fit.approximation.sample(1000)
```

The full `trace`

can be visualized with `az.plot_trace(trace)`

. Here, we only display the posterior over the last layer weights (without bias).

```
az.plot_posterior(trace, var_names="model/layer/l3_w");
```

### Prediction

To obtain posterior predictive samples for a test set `x_test`

we simply call the `model`

generator function again with the test set as argument. This is a nice improvement over PyMC3 which required to setup a shared Theano variable for setting test set values. Target values are ignored during predictive sampling, only the shape of the target array `y`

matters, hence we set it to an array of zeros with the same shape as `x_test`

.

```
draws_posterior = pm.sample_posterior_predictive(model(x=x_test, y=np.zeros_like(x_test)), trace, inplace=False)
draws_posterior.posterior_predictive
```

<xarray.Dataset> Dimensions: (chain: 10, draw: 1000, model/obs_dim_0: 200, model/obs_dim_1: 1) Coordinates: * chain (chain) int64 0 1 2 3 4 5 6 7 8 9 * draw (draw) int64 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999 * model/obs_dim_0 (model/obs_dim_0) int64 0 1 2 3 4 5 ... 195 196 197 198 199 * model/obs_dim_1 (model/obs_dim_1) int64 0 Data variables: model/obs (chain, draw, model/obs_dim_0, model/obs_dim_1) float32 ... Attributes: created_at: 2020-08-19T12:12:02.008383 arviz_version: 0.9.0

- chain: 10
- draw: 1000
- model/obs_dim_0: 200
- model/obs_dim_1: 1

- chain(chain)int640 1 2 3 4 5 6 7 8 9
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

- draw(draw)int640 1 2 3 4 5 ... 995 996 997 998 999
array([ 0, 1, 2, ..., 997, 998, 999])

- model/obs_dim_0(model/obs_dim_0)int640 1 2 3 4 5 ... 195 196 197 198 199
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199])

- model/obs_dim_1(model/obs_dim_1)int640
array([0])

- model/obs(chain, draw, model/obs_dim_0, model/obs_dim_1)float3226.57001 26.04135 ... -14.317561
array([[[[ 2.65700092e+01], [ 2.60413494e+01], [ 2.57281590e+01], ..., [-4.75241995e+00], [-4.78487253e+00], [-4.75552654e+00]], [[ 3.16064491e+01], [ 3.10987301e+01], [ 3.04416885e+01], ..., [-5.26556778e+00], [-5.20179415e+00], [-5.47270155e+00]], [[ 4.52943230e+01], [ 4.45980377e+01], [ 4.39933929e+01], ..., ... ..., [-5.59512615e+00], [-5.61213923e+00], [-5.73862267e+00]], [[ 2.81977844e+00], [ 2.81224990e+00], [ 3.02726460e+00], ..., [-5.47826385e+00], [-5.42433500e+00], [-5.45992947e+00]], [[ 3.89422917e+00], [ 3.80268312e+00], [ 3.86203289e+00], ..., [-1.39646444e+01], [-1.41936607e+01], [-1.43175611e+01]]]], dtype=float32)

- created_at :
- 2020-08-19T12:12:02.008383
- arviz_version :
- 0.9.0

The predictive mean and standard deviation can be obtained by averaging over chains (axis `0`

) and predictive samples (axis `1`

) for each of the 200 data points in `x_test`

(axis `2`

).

```
predictive_samples = draws_posterior.posterior_predictive.data_vars['model/obs'].values
m = np.mean(predictive_samples, axis=(0, 1)).flatten()
s = np.std(predictive_samples, axis=(0, 1)).flatten()
```

These statistics can be used to plot model predictions and their variances (together with function `f`

and the noisy training data). One can clearly see a higher predictive variance (= higher uncertainty) in regions outside the training data.

```
plt.plot(x_test, m, label='Expected value');
plt.fill_between(x_test.flatten(), m + 2 * s, m - 2 * s, alpha = 0.3, label='Uncertainty')
plt.scatter(x, y, marker='o', c='k')
plt.plot(x_test, f_test, 'k--')
plt.ylim(-1.5, 2.5)
plt.legend();
```

If you think something can be improved in this article (and I’m sure it can) or if I missed other important aspects of PyMC4 please let me know.