Switching to a Bayesian workflow can be frustrating at first, especially when sampling speed becomes the bottleneck and you have no idea why. That is what happened to me when I tried to fit a Dirichlet process mixture model using PyMC3: every attempt at working around the difficulties I ran into was blocked by the speed of the NUTS sampler.

It turns out that variational inference (VI) can help at this stage. However, it comes with its own peculiarities and even some pitfalls. Not having learned about VI during my master’s, I set out to read up on it.

VI is a method for approximating probability densities. In the context of Bayesian modeling, we can use it to approximate the posterior, which makes it an alternative to Markov chain Monte Carlo (MCMC) algorithms. VI has one important advantage: it is very fast compared to MCMC. It also has an important disadvantage: unlike MCMC, which is asymptotically exact, VI gives you no guarantee that the approximation will be close to the posterior you’re actually looking for.

Let’s start by remembering what we’re looking for in a Bayesian inference problem: the posterior, i.e. the density of the model parameters conditional on the data. Mathematically speaking:

\[p(\mathbf{z} \vert \mathbf{x})\]
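Here \(\mathbf{z}\) stands for the model parameters and \(\mathbf{x}\) for the observed data. Writing the posterior out with Bayes’ rule shows where the trouble lies:

\[p(\mathbf{z} \vert \mathbf{x}) = \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})} = \frac{p(\mathbf{z}, \mathbf{x})}{\int p(\mathbf{z}, \mathbf{x}) \, d\mathbf{z}}\]

The denominator, the evidence \(p(\mathbf{x})\), is an integral over all possible parameter values and is intractable for most models of practical interest.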

Commonly used modern methods for getting to the posterior are based on sampling, i.e. Markov chain Monte Carlo algorithms. VI, on the other hand, approaches the problem from an optimisation point of view, making use of a clever trick. Before we get to the trick, let’s specify the objective of the optimisation: the Kullback-Leibler (KL) divergence.

The Kullback-Leibler divergence is a measure of the “difference in information contained within two distributions” 1. In the VI context, specifically, the relevant quantity is the KL divergence from the posterior to the approximating density \(q\): \(KL( q(\mathbf{z}) \| p(\mathbf{z} \vert \mathbf{x} ) )\).
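Expanding the divergence as an expectation with respect to \(q\) shows where the evidence sneaks in:

\[KL(q(\mathbf{z}) \| p(\mathbf{z} \vert \mathbf{x})) = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z}, \mathbf{x})] + \log p(\mathbf{x})\]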

Buried in this divergence is a term that’s very difficult to evaluate: the evidence \(p(\mathbf{x})\). The trick is to optimise another objective that is (1) easier to evaluate and (2) such that maximising it is equivalent to minimising the KL divergence: the evidence lower bound (ELBO).

\[ELBO(q) = \mathbb{E}[\log p(\mathbf{z}, \mathbf{x})] - \mathbb{E}[\log q(\mathbf{z})]\]
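Comparing this with the expansion of the KL divergence above gives

\[ELBO(q) = \log p(\mathbf{x}) - KL(q(\mathbf{z}) \| p(\mathbf{z} \vert \mathbf{x}))\]

Since \(\log p(\mathbf{x})\) doesn’t depend on \(q\), maximising the ELBO is the same as minimising the KL divergence; and since the KL divergence is non-negative, the ELBO really is a lower bound on the (log) evidence, hence the name.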

Blei et al. (2018) note that “the variational objective mirrors the usual balance between likelihood and prior” 2, and this becomes visible once the identity is rearranged:

\[ELBO(q) = \mathbb{E}[\log p(\mathbf{x} \vert \mathbf{z})] - KL(q(\mathbf{z}) \| p(\mathbf{z}))\]

The first term rewards approximations that explain the observed data well, while the second penalises approximations that stray far from the prior.

Great. But how do we specify \(q\)? A commonly used family of probability distributions is the “mean-field variational family”, which is essentially a joint distribution of independent random variables. So we have:

\[q(\boldsymbol{z}) = \prod_{j=1}^{m}q_j(z_j).\]

This makes one of the key properties of VI with the mean-field family clear: we’re assuming that the model parameters are mutually independent. If the true posterior is correlated, the approximation will not be able to capture that correlation.
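To see how much can get lost, here is a minimal sketch with a toy target (not a real model, just an illustration): when the target density is a correlated Gaussian, the best factorised approximation under \(KL(q \| p)\) keeps the right means but ends up with narrow, independent marginals, with per-dimension variance equal to the reciprocal of the corresponding diagonal entry of the target’s precision matrix.

```python
import numpy as np

# Toy target: a strongly correlated bivariate Gaussian "posterior".
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])   # true covariance
Lambda = np.linalg.inv(Sigma)    # precision matrix

# For a Gaussian target, the optimal factorised approximation under
# KL(q || p) is Gaussian in each dimension with variance 1 / Lambda_ii
# (see e.g. Bishop, PRML, section 10.1.2).
mean_field_var = 1.0 / np.diag(Lambda)

print("true marginal variances: ", np.diag(Sigma))   # [1.0, 1.0]
print("mean-field variances:    ", mean_field_var)   # [0.19, 0.19]
# The approximation also has zero correlation by construction,
# while the target's correlation is 0.9.
```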

Now, we need an optimisation algorithm and we’re ready to go. I’ll refer you to Blei et al. (2018) 2 for the details, where you can read about coordinate ascent mean-field variational inference.
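To get a flavour of what coordinate ascent looks like, here is a minimal sketch for a toy model: normally distributed data with unknown mean and precision, a conjugate Normal prior on the mean and a Gamma prior on the precision. The update equations follow the standard derivation (see e.g. Bishop’s PRML, section 10.1.3); the prior values and the simulated data are arbitrary illustrative choices.

```python
import numpy as np

# CAVI for a toy model: x_n ~ Normal(mu, 1/tau) with priors
#   mu | tau ~ Normal(mu0, 1/(lam0 * tau)),  tau ~ Gamma(a0, rate=b0),
# and mean-field approximation q(mu, tau) = q(mu) q(tau).
rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # simulated data
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0         # illustrative prior values

E_tau = 1.0                                     # initial guess for E[tau]
for _ in range(100):
    # Update q(mu) = Normal(mu_N, 1/lam_N), holding q(tau) fixed
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, rate=b_N), holding q(mu) fixed
    a_N = a0 + (N + 1) / 2
    E_sq_data = np.sum((x - mu_N) ** 2) + N / lam_N        # E_q[sum (x_n - mu)^2]
    E_sq_prior = lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N)    # E_q[lam0 (mu - mu0)^2]
    b_N = b0 + 0.5 * (E_sq_data + E_sq_prior)
    E_tau = a_N / b_N

print(f"q(mu):  mean={mu_N:.3f}, sd={lam_N ** -0.5:.3f}")
print(f"q(tau): mean={E_tau:.3f} (true precision {1 / 1.5 ** 2:.3f})")
```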

Now let’s turn to what’s happening when we use the variational API of PyMC3. Here you can see that the default of the method= parameter is ADVI (automatic differentiation variational inference). ADVI automatically turns variational inference for a given probabilistic model into an optimisation problem that can be solved with gradient-based methods. Its significant advantage is that we don’t have to work out any model-specific maths or technical requirements to make the approximation work: any model with continuous parameters that we can write in PyMC3 (the same goes for Stan) can be run through ADVI.
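As a minimal sketch of what that looks like in code, here is a made-up toy regression (the model and the numbers are purely illustrative) fitted with ADVI via pm.fit and then sampled from like an MCMC trace:

```python
import numpy as np
import pymc3 as pm

# Made-up toy regression data, purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 + 1.5 * x + rng.normal(scale=0.3, size=100)

with pm.Model() as model:
    intercept = pm.Normal("intercept", mu=0.0, sigma=10.0)
    slope = pm.Normal("slope", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("y_obs", mu=intercept + slope * x, sigma=sigma, observed=y)

    # ADVI is the default method of pm.fit, but let's be explicit
    approx = pm.fit(n=20000, method="advi")
    # Draw samples from the fitted approximation, analogous to an MCMC trace
    trace = approx.sample(1000)

print(pm.summary(trace))
```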

Equipped with the magic provided by PyMC3 and with a basic understanding of variational inference, it’s time to see it in practice!

  1. https://ermongroup.github.io/cs228-notes/inference/variational/ 

  2. https://arxiv.org/abs/1601.00670