# Gradient estimation for stochastic computation graphs

Deep generative models (DGMs) are powerful tools for learning a joint distribution over observed and unobserved data. Examples of DGMs include variational auto-encoders (VAEs) and generative adversarial networks (GANs). DGMs are typically estimated by maximising a lower bound on the log-likelihood of the observations. The key technique here is variational inference (VI), in particular amortised VI (which can be thought of as VI powered by neural networks, NNs).

A lot of the success of NNs in supervised learning is due to the flexibility of maximum likelihood estimation powered by stochastic optimisation. As long as we can express the probability of observations through a tractable and differentiable computation graph, we can count on back-propagation (or, more generally, automatic differentiation) to obtain gradient estimates on mini-batches of data.

The main research challenge in DGMs concerns extending the full power of automatic differentiation and stochastic optimisation to latent-variable models, where intractable marginals need to be approximated by Monte Carlo estimation.
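To make "approximating a marginal by Monte Carlo" concrete, here is a minimal sketch on a toy model where the answer happens to be known. The model (z ~ N(0, 1), x | z ~ N(z, 1)) and the names `gaussian_pdf` and `mc_marginal` are assumptions made for illustration only; in a real DGM the marginal has no closed form to compare against.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mc_marginal(x, num_samples, rng):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    z = rng.normal(size=num_samples)                   # z ~ N(0, 1), the prior
    return np.mean(gaussian_pdf(x, mean=z, var=1.0))   # average of p(x | z)

rng = np.random.default_rng(0)
x = 1.0
mc_estimate = mc_marginal(x, num_samples=200_000, rng=rng)
# Assumption: for this toy model the marginal is N(x | 0, 2) in closed form.
exact_marginal = gaussian_pdf(x, mean=0.0, var=2.0)
print(mc_estimate, exact_marginal)
```

Sampling from the prior is the simplest choice, but it tends to become very inefficient in higher dimensions, which is one motivation for the variational methods listed below.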

This list contains papers that, in my view, one should read to navigate the literature on deep generative models and their applications more easily.

# The depths

Some of the developments are very theoretical and you need to dig deep. You may skip this section and come back to it on demand, as you grow more comfortable with the landscape.

- A Stochastic Approximation Method
- Control variates
- Exponential families

# Variational inference

Our first mandatory checkpoint is VI.

I would start with a historical read: it will help you understand what the ELBO (VI’s objective) accomplishes.
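For orientation, the identity behind the ELBO can be written as follows. The notation is an assumption of this sketch and does not come from any particular paper in the list: $x$ are observations, $z$ are latent variables, $p_\theta$ is the generative model and $q_\phi$ is the approximate posterior.

$$
\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right]}_{\text{ELBO}} + \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)
$$

Since the KL term is non-negative, the ELBO is a lower bound on $\log p_\theta(x)$. Maximising it with respect to $\phi$ tightens the bound by pushing $q_\phi(z \mid x)$ towards the true posterior.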

Then read about VI from the point of view of what it was initially proposed for: approximate posterior inference in Bayesian modelling.

If you just care about deep generative models, and you don’t really plan to look into Variational Bayes and modelling with conditionally conjugate models, you can skip the next block in your first pass through the list:

- Variational Bayesian Inference with Stochastic Search
- Stochastic Variational Inference
- Black Box Variational Inference

Here we get to variational inference in deep learning, but at this point we are going to be using the *score function estimator* rather than the famous *reparameterised gradient*. I think this makes for a better order.
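As a rough illustration of the score function estimator, the sketch below estimates the gradient of E[f(z)] with respect to the mean of a unit-variance Gaussian. The objective f(z) = z² and the function name `score_function_grad` are assumptions chosen so that the exact answer (2μ) is available for comparison.

```python
import numpy as np

def score_function_grad(mu, num_samples, rng):
    """Estimate d/dmu E_{z ~ N(mu, 1)}[f(z)] for the toy objective f(z) = z**2."""
    z = rng.normal(loc=mu, scale=1.0, size=num_samples)
    f = z ** 2
    # Score of a unit-variance Gaussian: d/dmu log N(z | mu, 1) = z - mu
    score = z - mu
    # f is only evaluated, never differentiated
    return np.mean(f * score)

rng = np.random.default_rng(0)
mu = 1.5
sf_estimate = score_function_grad(mu, num_samples=200_000, rng=rng)
exact = 2.0 * mu  # E[z**2] = mu**2 + 1, so the gradient is 2 * mu
print(sf_estimate, exact)
```

Because f is treated as a black box, the same recipe applies to discrete latent variables and non-differentiable objectives, at the price of high variance.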

If you want to go the extra mile, read the *REINFORCE estimator* paper:

Now we get to the territory of reparameterised gradients.

- Doubly Stochastic Variational Bayes for non-Conjugate Inference
- Auto-Encoding Variational Bayes
- Stochastic Backpropagation and Approximate Inference in Deep Generative Models
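The core idea of the reparameterised gradient can be sketched on a toy problem: write the sample as a deterministic function of the parameter and parameter-free noise, then differentiate through it. The objective f(z) = z², the Gaussian with unit variance, and both function names are assumptions made for illustration. A score function estimate of the same gradient is included only to compare variances.

```python
import numpy as np

def reparam_grad_samples(mu, num_samples, rng):
    """Per-sample reparameterised gradients of f(z) = z**2 with z = mu + eps."""
    eps = rng.normal(size=num_samples)   # noise that does not depend on mu
    z = mu + eps                         # z ~ N(mu, 1), differentiable in mu
    return 2.0 * z                       # chain rule: f'(z) * dz/dmu = 2 z * 1

def score_grad_samples(mu, num_samples, rng):
    """Per-sample score function gradients for the same objective."""
    z = rng.normal(loc=mu, scale=1.0, size=num_samples)
    return z ** 2 * (z - mu)             # f(z) * d/dmu log N(z | mu, 1)

rng = np.random.default_rng(0)
mu = 1.5
rp = reparam_grad_samples(mu, 100_000, rng)
sf = score_grad_samples(mu, 100_000, rng)
print("exact gradient:", 2.0 * mu)
print("reparameterised: mean", rp.mean(), "variance", rp.var())
print("score function:  mean", sf.mean(), "variance", sf.var())
```

Both estimators target the same gradient, but on this toy problem the reparameterised one has a much smaller per-sample variance. In practice an automatic differentiation toolkit computes the derivative through `z`, so only the sampling step needs to be rewritten.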

And here is a good view on implementation using automatic differentiation toolkits:

If you care about semi-supervised learning, you will want to read this one:

The reparameterised gradient was initially developed for a Gaussian approximate posterior, but we can go beyond that in at least three ways. We can design more expressive approximations by extending the hierarchy of the inference model:

We can use *known* distributions that are not (directly) reparameterisable:

- Automatic Differentiation Variational Inference
- Rejection Sampling Variational Inference
- The Generalized Reparameterization Gradient

Or we can focus on being able to sample and to assess the density at a point, without knowing the density function in closed form, by using a *normalising flow*.

- Variational Inference with Normalizing Flows
- Improving Variational Inference with Inverse Autoregressive Flow
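The "sample and assess the density at a point" recipe relies on the change-of-variables formula. The sketch below uses a deliberately tiny one-dimensional flow (an affine map followed by `exp`), which is an assumption made for illustration and not an architecture from the papers above. This particular flow happens to produce a log-normal distribution, so the result can be checked against a known density.

```python
import numpy as np

def flow_sample_and_log_density(m, s, num_samples, rng):
    """Sample z = exp(s * eps + m) and return log q(z) via change of variables."""
    eps = rng.normal(size=num_samples)                       # base sample, eps ~ N(0, 1)
    log_p_eps = -0.5 * eps ** 2 - 0.5 * np.log(2.0 * np.pi)  # base log-density
    h = s * eps + m            # affine layer: log|dh/deps| = log(s)
    z = np.exp(h)              # exp layer:    log|dz/dh|   = h
    log_det = np.log(s) + h    # log-determinants of invertible layers add up
    return z, log_p_eps - log_det

def lognormal_log_pdf(z, m, s):
    """Closed-form log-density, used here only to check the flow."""
    return (-np.log(z) - np.log(s) - 0.5 * np.log(2.0 * np.pi)
            - (np.log(z) - m) ** 2 / (2.0 * s ** 2))

rng = np.random.default_rng(0)
m, s = 0.2, 0.5
z, log_q = flow_sample_and_log_density(m, s, num_samples=1_000, rng=rng)
print(np.max(np.abs(log_q - lognormal_log_pdf(z, m, s))))
```

With more layers, or with layers parameterised by neural networks, the closed form is no longer available, but the bookkeeping stays the same: transform the sample and subtract the log-determinant of each layer.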

If you are curious about the challenges behind optimising the evidence lower bound (ELBO), you will like the following:

- Towards a Deeper Understanding of Variational Autoencoding Models
- InfoVAE: Information Maximizing Variational Autoencoders
- Fixing a Broken ELBO

If you are really serious about VI, you will start questioning the ELBO and wondering why one should use the KL divergence. Then start here:

# Baselines and Control variates

- Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
- Policy Gradient Methods for Reinforcement Learning with Function Approximation
- MuProp: Unbiased Backpropagation for Stochastic Neural Networks
- REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models
- Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
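A constant baseline is probably the simplest control variate for the score function estimator: subtracting it from f leaves the expectation unchanged (the score has zero mean) but can reduce variance considerably. The sketch below uses a single Bernoulli latent variable with a toy objective f(z) = (z - 0.45)². The objective, the parameterisation through a logit `theta`, and all function names are assumptions made for illustration, and the baseline here is simply an average of f over an independent batch.

```python
import numpy as np

def f(z):
    """Toy objective (an assumption for this sketch)."""
    return (z - 0.45) ** 2

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def score_grad_samples(theta, num_samples, rng, baseline=0.0):
    """Per-sample score function gradients of E_{z ~ Bernoulli(sigmoid(theta))}[f(z)]."""
    p = sigmoid(theta)
    z = (rng.random(num_samples) < p).astype(float)  # z ~ Bernoulli(p)
    score = z - p              # d/dtheta log Bernoulli(z | sigmoid(theta))
    return (f(z) - baseline) * score

rng = np.random.default_rng(0)
theta = 0.5
p = sigmoid(theta)
exact = p * (1.0 - p) * (f(1.0) - f(0.0))  # gradient computed by enumeration

# Baseline estimated on an independent batch, so the estimator stays unbiased.
z_baseline = (rng.random(10_000) < p).astype(float)
baseline = f(z_baseline).mean()

plain = score_grad_samples(theta, 100_000, rng)
with_baseline = score_grad_samples(theta, 100_000, rng, baseline=baseline)
print("exact gradient:", exact)
print("no baseline:   mean", plain.mean(), "variance", plain.var())
print("with baseline: mean", with_baseline.mean(), "variance", with_baseline.var())
```

The papers in this section go further by making the baseline depend on the input, or by learning control variates that correlate more strongly with f.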

# Discrete variables and relaxations

- Categorical Reparameterization with Gumbel-Softmax
- The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
- Lost Relatives of the Gumbel Trick
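The two ingredients behind these relaxations can be sketched in a few lines of NumPy: the Gumbel-max trick draws exact categorical samples by adding Gumbel noise to the logits and taking an argmax, and the relaxation replaces the argmax with a temperature-controlled softmax. Function names, the logits, and the temperature value are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def sample_gumbel(shape, rng):
    """Standard Gumbel noise via inverse transform sampling."""
    u = rng.uniform(low=1e-12, high=1.0, size=shape)  # lower bound avoids log(0)
    return -np.log(-np.log(u))

def gumbel_max_sample(logits, num_samples, rng):
    """Exact categorical samples: argmax of Gumbel-perturbed logits."""
    g = sample_gumbel((num_samples, logits.shape[0]), rng)
    return np.argmax(logits + g, axis=-1)

def gumbel_softmax_sample(logits, temperature, num_samples, rng):
    """Relaxed samples on the simplex: softmax instead of argmax."""
    g = sample_gumbel((num_samples, logits.shape[0]), rng)
    return softmax((logits + g) / temperature, axis=-1)

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.0, -1.0])

hard = gumbel_max_sample(logits, 100_000, rng)
freqs = np.bincount(hard, minlength=3) / hard.size
print("empirical frequencies:", freqs)
print("softmax(logits):      ", softmax(logits))

relaxed = gumbel_softmax_sample(logits, temperature=0.5, num_samples=5, rng=rng)
print(relaxed)  # each row is a point on the simplex
```

Lower temperatures push the relaxed samples towards one-hot vectors, and the relaxed sample is a differentiable function of the logits, so reparameterised gradients apply. The gradients are biased with respect to the original discrete objective.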

## More on Normalising flows

- Variational Inference with Normalizing Flows
- Improving Variational Inference with Inverse Autoregressive Flow
- Multiplicative Normalizing Flows for Variational Bayesian Neural Networks
- Conditional Density Estimation with Bayesian Normalising Flows

## Implicit models

## Some NLP Applications

- Generating Sentences from a Continuous Space
- Semantic Parsing with Semi-Supervised Sequential Autoencoders
- Language as a Latent Variable: Discrete Generative Models for Sentence Compression
- Discovering Discrete Latent Topics with Neural Variational Inference
- Multi-space Variational Encoder-Decoders for Semi-supervised Labeled Sequence Transduction
- Deep Generative Model for Joint Alignment and Word Representation