Gradient estimation for stochastic computation graphs

Deep generative models (DGMs) are powerful tools for learning a joint distribution over observed and unobserved data. Examples of DGMs include variational auto-encoders (VAEs) and generative adversarial networks (GANs). DGMs are typically estimated so as to maximise a lower bound on the log-likelihood of observations. The key technique here is variational inference (VI), in particular, amortised VI (which can be thought of as VI powered by NNs).

A lot of the success of NNs in supervised learning is due to the flexibility of maximum likelihood estimation powered by stochastic optimisation. As long as we can express the probability of observations through a tractable and differentiable computation graph, we can count on the back-propagation algorithm (or other automatic differentiation toolboxes) to obtain gradient estimates on mini-batches of data.
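As a minimal sketch of this recipe (a toy example, not from any paper below): maximum likelihood estimation of a Gaussian mean by stochastic gradient ascent on mini-batches. The gradient is written by hand here, standing in for what back-propagation would compute automatically; all names and constants are illustrative choices.

```python
import random

random.seed(0)

# Toy data: observations from a Gaussian with unknown mean (true mean 2.0).
data = [random.gauss(2.0, 1.0) for _ in range(1000)]

mu = 0.0          # parameter to estimate
lr = 0.1          # learning rate
batch_size = 50

for step in range(200):
    batch = random.sample(data, batch_size)
    # d/dmu of the average log-likelihood of N(x | mu, 1) is mean(x - mu);
    # in practice back-propagation would produce this gradient for us.
    grad = sum(x - mu for x in batch) / batch_size
    mu += lr * grad

# mu drifts towards the sample mean of the data (close to 2.0).
```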

The main research challenge in DGMs is bringing the full power of automatic differentiation and stochastic optimisation to latent-variable models, where intractable marginals need to be approximated by Monte Carlo estimation.
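To make the Monte Carlo idea concrete, here is a toy sketch (my own illustration, with made-up model choices): the marginal p(x) = E_{p(z)}[p(x|z)] is approximated by averaging the likelihood over samples from the prior.

```python
import random
import math

random.seed(0)

def normal_pdf(x, mean, std=1.0):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Toy latent-variable model: z ~ Bernoulli(0.5), x | z ~ N(z, 1).
# Here the marginal p(x) is a simple sum over z, but in general it is
# intractable; a Monte Carlo estimate averages p(x|z) over prior samples.
x_obs = 0.0
n_samples = 100_000
estimate = sum(normal_pdf(x_obs, float(random.random() < 0.5))
               for _ in range(n_samples)) / n_samples

exact = 0.5 * normal_pdf(x_obs, 0.0) + 0.5 * normal_pdf(x_obs, 1.0)
```

With enough samples the estimate concentrates around the exact marginal, which is what makes stochastic optimisation of such objectives viable.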

This list contains papers that in my view one should read to navigate more easily through literature on deep generative models and their applications.

The depths

Some of the developments here are rather theoretical and require digging deep; you may skip this section on a first pass and get back to it on demand, as you grow more comfortable with the landscape.

Variational inference

Our first mandatory checkpoint is VI.

I would start with a historical read; it will help you understand what the ELBO (VI's objective) accomplishes.
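If it helps to see what the ELBO accomplishes numerically, here is a toy check (my own sketch, with arbitrary probabilities) of the identity log p(x) = ELBO + KL(q || p(z|x)), which is why maximising the ELBO both tightens the bound on the evidence and pulls q towards the true posterior:

```python
import math

# Toy discrete model: z in {0, 1}, p(z) = 0.5, fixed likelihoods p(x|z).
p_z = [0.5, 0.5]
p_x_given_z = [0.8, 0.3]          # likelihood of one fixed observation x
p_x = sum(p_z[z] * p_x_given_z[z] for z in (0, 1))   # exact evidence

q = [0.7, 0.3]                    # some approximate posterior q(z|x)

# ELBO = E_q[log p(x, z) - log q(z)]
elbo = sum(q[z] * (math.log(p_z[z] * p_x_given_z[z]) - math.log(q[z]))
           for z in (0, 1))

# KL(q || p(z|x)), with the exact posterior p(z|x) by Bayes' rule.
post = [p_z[z] * p_x_given_z[z] / p_x for z in (0, 1)]
kl = sum(q[z] * math.log(q[z] / post[z]) for z in (0, 1))

# log p(x) = ELBO + KL, and since KL >= 0 we get ELBO <= log p(x).
```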

Then read about VI from the point of view of what it was initially proposed for: approximate posterior inference in Bayesian modelling.

If you just care about deep generative models, and you don’t really plan to look into Variational Bayes and modelling with conditionally conjugate models, you can skip the next block in your first pass through the list:

Here we get to variational inference in deep learning, but at this point we are going to be using the score function estimator rather than the famous reparameterised gradient. I think this makes for a better order.
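A minimal sketch of the score function estimator (a toy of my own, not taken from the papers): it uses the identity ∇_θ E_{q_θ}[f(z)] = E_{q_θ}[f(z) ∇_θ log q_θ(z)], here for a Bernoulli(θ) latent where the exact gradient is f(1) - f(0).

```python
import random

random.seed(0)

theta = 0.3
f = lambda z: (z - 2.0) ** 2      # arbitrary function of the latent

# For z ~ Bernoulli(theta), E[f(z)] = theta*f(1) + (1-theta)*f(0),
# so grad_theta E[f(z)] = f(1) - f(0) exactly.
exact_grad = f(1) - f(0)

n = 200_000
total = 0.0
for _ in range(n):
    z = 1 if random.random() < theta else 0
    # score: d/dtheta log q(z; theta) for the Bernoulli pmf
    score = z / theta - (1 - z) / (1 - theta)
    total += f(z) * score
estimate = total / n
```

Note the estimator only needs samples of z and the density of q, not a differentiable sampling path, which is why it applies to discrete latents; the price is high variance, which motivates the baselines and control variates discussed further down.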

If you want to go the extra mile, read the REINFORCE estimator paper:

Now we get to the territory of reparameterised gradients.
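The core trick, in a toy sketch of my own: rewrite z ~ N(μ, σ²) as z = μ + σε with ε ~ N(0, 1), so the expectation is over a fixed distribution and the gradient can pass through the sample. Here f(z) = z² is an arbitrary choice, giving the exact gradient ∇_μ E[z²] = 2μ.

```python
import random

random.seed(0)

mu, sigma = 1.5, 0.8

# Reparameterise z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, 1).
# Then grad_mu E[f(z)] = E[f'(mu + sigma * eps)]; for f(z) = z^2, f'(z) = 2z.
n = 100_000
estimate = sum(2 * (mu + sigma * random.gauss(0.0, 1.0))
               for _ in range(n)) / n

exact_grad = 2 * mu               # since E[z^2] = mu^2 + sigma^2
```

Compared with the score function estimator above, this path-wise estimator typically has much lower variance, but it requires f to be differentiable and the sampling procedure to be expressible as a differentiable transformation of noise.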

And here is a good view on implementation using automatic differentiation toolkits:

If you care about semi-supervised learning, you will want to read this one:

The reparameterised gradient was initially developed for a Gaussian approximate posterior, but we can go beyond that in at least three ways. We can design more expressive approximations by extending the hierarchy of the inference model:

We can use known distributions that are not (directly) reparameterisable:

Or we can focus on being able to sample and to evaluate the density at a point, without restricting ourselves to a known closed-form family, by using a normalising flow.
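The mechanics of a flow in one toy step (my own sketch; real flows stack many learned invertible maps): push base noise through an invertible transformation and track the log-determinant of its Jacobian via the change-of-variables formula, log q(z) = log p(ε) - log |dz/dε|.

```python
import math

# One-step affine flow: z = a * eps + b with eps ~ N(0, 1).
a, b = 2.0, -1.0

def base_log_pdf(eps):
    # log density of the standard normal base distribution
    return -0.5 * eps ** 2 - 0.5 * math.log(2 * math.pi)

def flow_log_pdf(z):
    eps = (z - b) / a             # invert the flow
    # change of variables: subtract log |dz/deps| = log |a|
    return base_log_pdf(eps) - math.log(abs(a))

# Sanity check: this flow pushes N(0, 1) to N(b, a^2), whose log density
# we can write down directly.
z = 0.7
reference = (-0.5 * ((z - b) / a) ** 2
             - math.log(abs(a)) - 0.5 * math.log(2 * math.pi))
```

An affine map is of course trivially a Gaussian again; the point of the papers below is composing many such invertible steps (with tractable Jacobians) so the resulting density is far more expressive while sampling and density evaluation stay cheap.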

If you are curious about understanding more about the challenges behind optimising the evidence lower bound (ELBO), you will like the following:

If you are really serious about VI, you start questioning the ELBO, and you wonder why one should use the KL divergence at all. Then start here:

Baselines and Control variates
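Before the papers, the idea in a toy sketch of my own (reusing the Bernoulli example from the score function section above, with a hand-picked baseline that would be learned or estimated in practice): subtracting a baseline b from f(z) leaves the expectation of the score-function estimator unchanged, because E[score] = 0, but it can cut the variance substantially.

```python
import random

random.seed(0)

theta = 0.3
f = lambda z: (z - 2.0) ** 2
baseline = 3.1                    # here simply E[f(z)]; learned in practice

def score(z):
    # d/dtheta log Bernoulli(z; theta)
    return z / theta - (1 - z) / (1 - theta)

n = 50_000
plain, centred = [], []
for _ in range(n):
    z = 1 if random.random() < theta else 0
    plain.append(f(z) * score(z))                  # raw score-function term
    centred.append((f(z) - baseline) * score(z))   # baseline-subtracted term

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Both estimators share the expectation f(1) - f(0) = -3,
# but the centred one has markedly lower variance.
```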

Discrete variables and relaxations
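To fix ideas before the papers, here is a toy sketch of my own of the Gumbel-softmax (a.k.a. Concrete) relaxation: perturb the logits with Gumbel noise and take a temperature-controlled softmax, yielding a relaxed one-hot sample through which reparameterised gradients can flow.

```python
import random
import math

random.seed(0)

def gumbel_softmax(logits, tau):
    # Gumbel noise: g_i = -log(-log(u_i)), u_i ~ Uniform(0, 1);
    # then a temperature-tau softmax of the perturbed logits.
    noisy = [l - math.log(-math.log(random.random())) for l in logits]
    mx = max(x / tau for x in noisy)
    exps = [math.exp(x / tau - mx) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

sample = gumbel_softmax([0.5, 1.0, -0.3], tau=0.5)
# A relaxed one-hot vector: entries lie in [0, 1] and sum to 1;
# as tau -> 0 it approaches a hard one-hot categorical sample.
```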

More on Normalising flows

Implicit models

Some NLP Applications