Mark, whom I met at a machine learning study meetup, recommended that I study a research paper about the discrete variational autoencoder. I read it today. As with variational inference in general, it includes many mathematical equations, but what the author wants to say is quite straightforward. Two previous posts, Variational Method and Independent Component Analysis, are relevant to the following discussion.


To understand the paper, we first need to know what an autoencoder is and what a variational autoencoder is, so I want to discuss them today. Many models in machine learning are generative, and so are many neural network models. The autoencoder is a neural network that is trained to attempt to copy its input to its output [1]. However, we do not want a trivial identity mapping. If the hidden layer is too large, the network can simply learn the identity, so an autoencoder passes the data through a compression step. It has practical applications such as denoising and dimensionality reduction. Research on autoencoders is still quite active, since they are not yet well understood.
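The compression idea can be illustrated with a minimal sketch. It assumes a purely linear autoencoder trained by plain gradient descent on synthetic data; the function name `train_linear_autoencoder`, the layer sizes, and the learning rate are hypothetical choices for illustration and do not come from any paper discussed here.

```python
import numpy as np

def train_linear_autoencoder(x, n_hidden, n_steps=2000, lr=0.02, seed=0):
    """Fit x ~ (x @ w_enc) @ w_dec by gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[1]
    w_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))  # encoder weights
    w_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))  # decoder weights
    losses = []
    for _ in range(n_steps):
        code = x @ w_enc                  # compress to n_hidden dimensions
        err = code @ w_dec - x            # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        grad_out = 2.0 * err / err.size   # d(loss) / d(reconstruction)
        grad_dec = code.T @ grad_out
        grad_enc = x.T @ (grad_out @ w_dec.T)
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return w_enc, w_dec, losses

# Toy data: 5 observed features that actually live on a 2-dimensional subspace.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
w_enc, w_dec, losses = train_linear_autoencoder(x, n_hidden=2)
print(losses[0], losses[-1])
```

Because the hidden layer has fewer units than the input, the network cannot copy its input unit by unit; it has to find a compact code from which the input can be rebuilt.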

Variational autoencoder

Then what is a variational autoencoder? As far as I know, the variational autoencoder was first discussed in the Auto-Encoding Variational Bayes (AEVB) paper, and this 14-page paper is not too hard to read and understand.


This is the algorithm introduced in the original paper on the variational autoencoder (VAE). It is quite similar to the way we optimized the probability density in the variational method post. I will point out a couple of differences. Here, in the paper, the model is generative. Let us say $z$ is the source that generates the variable $x$. $\theta$ is the parameter of the generative model for the target probability distribution, $P_{\theta}(\boldsymbol{z} \mid \boldsymbol{x})$, and $\phi$ is the parameter of the approximate proposal distribution, $Q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})$. Basically, we take the derivative of the variational free energy, which is a functional of $Q$, with respect to the $Q$'s and find the optimization equations, as we did in that post.

However, the algorithm in the figure above takes the gradient with respect to the parameters instead. The parameters and the proposed density are conjugate, so they yield the same equations. Besides, $P_{\theta}(\boldsymbol{z} \mid \boldsymbol{x})$ also varies in the autoencoder, so we need to find the extremum with respect to it, too. That is why we use gradient descent with respect to $\theta$ as well.
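For reference, the quantity being optimized can be written out explicitly. This is a sketch in the notation of this section; the symbol $F$ and the joint density $P_{\theta}(\boldsymbol{x}, \boldsymbol{z})$ are introduced here only for illustration.

$$ F(\theta, \phi) = \mathbb{E}_{Q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})} \left[ \log Q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x}) - \log P_{\theta}(\boldsymbol{x}, \boldsymbol{z}) \right] = D_{KL}\left( Q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, P_{\theta}(\boldsymbol{z} \mid \boldsymbol{x}) \right) - \log P_{\theta}(\boldsymbol{x}) $$

Since the KL divergence is non-negative, $-F$ is a lower bound on $\log P_{\theta}(\boldsymbol{x})$. Lowering $F$ with respect to $\phi$ pulls $Q_{\phi}$ toward the posterior, while lowering it with respect to $\theta$ also moves the generative model itself.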

Why variational autoencoder

We have discussed the properties of a good autoencoder. We do not want an identity mapping. The process needs to be automatic and quick. These are well-known characteristics of variational inference. When we use the variational method, we make strong assumptions so that clear relations are easy to obtain from the extremum of the variational free energy. This makes the converged approximation less faithful to the target distribution. However, this tendency works in our favor in the variational autoencoder.

Assumption from the VAE paper

  • General assumption for the variational autoencoder: The data are generated by some random process involving an unobserved continuous random variable $z$. The prior $p_{\theta^*}(z)$ and likelihood $p_{\theta^*}(x|z)$ come from parametric families of distributions $p_{\theta}(z)$ and $p_{\theta}(x|z)$, and their PDFs are differentiable almost everywhere w.r.t. both $\theta$ and $z$.

  • Assumption for the AEVB algorithm: The approximate posterior has the form $q_{\phi}(z|x)$, but please note that the technique can also be applied to the case $q_{\phi}(z)$, i.e. where we do not condition on $x$.

  • Assumption for the following example: The true (but intractable) posterior takes on an approximately Gaussian form with an approximately diagonal covariance. In this case, we can let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure.
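One practical consequence of the diagonal Gaussian assumption is that the KL divergence between the approximate posterior and the prior has a closed form. The sketch below additionally assumes a standard normal prior $\mathcal{N}(0, I)$, which the list above does not state, and the function name and log-variance parameterization are illustrative choices.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# The divergence vanishes when the posterior equals the prior ...
kl_same = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# ... and grows as the mean moves away from the origin.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0])
print(kl_same, kl_shifted)
```

With a full covariance matrix this term would need a determinant and a trace; the diagonal structure reduces it to a sum over dimensions.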

Handwritten digit MNIST data for VAE

Let us see how the VAE works on the MNIST data. The Keras example for VAE was very helpful. If you run the Python code, you will get two figures.

One of them is,


We use $z$, an unobserved continuous random variable. It generates the data and is a 2-dimensional vector. The encoder is then a mapping from $x$ to $z$, and the decoder is the reverse mapping from $z$ to $x$. The scatter plot above shows the 2-dimensional $z$-space, with each point marked by its category among the 10 digits.

From the Gaussian assumption,

$$ \boldsymbol{z} = \boldsymbol{\mu_z} + \boldsymbol{\sigma_z} \epsilon $$

where $\epsilon \sim \mathcal{N}(0, I)$ and $I$ is the $2 \times 2$ identity matrix. Writing $\boldsymbol{z}$ this way makes backpropagation through the network possible. We also assume that the encoder and decoder are multivariate Gaussians with diagonal covariance structure. Then, the means and covariances are computed by linear transformations in the neural networks.

$$ \log p(\boldsymbol{x}|\boldsymbol{z}) = \log \mathcal{N}(\boldsymbol{x} ; \boldsymbol{\mu}, \boldsymbol{\sigma^2I}) $$

$$ \boldsymbol{\mu} = \boldsymbol{W h}+\boldsymbol{b} $$

$$ \log \boldsymbol{\sigma^2} = \boldsymbol{W'} \boldsymbol{h} + \boldsymbol{b'} $$

$$ \boldsymbol{h} = \tanh(\boldsymbol{W''} \boldsymbol{z} + \boldsymbol{b''}) $$

Thus, these relations are very similar to eqs. (6), (7), and (8) in the previous variational method post. To update them, we iteratively fit the neural network with the relations above. To train the model, we used Keras; the code can be found in the Keras example for VAE.
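To make the sampling and decoding steps concrete, here is a minimal NumPy sketch of one forward pass. It is not the Keras example itself. It assumes a single tanh hidden layer $\boldsymbol{h}$ computed from $\boldsymbol{z}$ and a linear layer that outputs $\log \boldsymbol{\sigma^2}$ so that the variance stays positive; the parameter names (`W_h`, `W_mu`, `W_var`, ...) and layer sizes are hypothetical.

```python
import numpy as np

def sample_z(mu_z, sigma_z, rng):
    """Draw z = mu_z + sigma_z * epsilon with epsilon ~ N(0, I)."""
    epsilon = rng.standard_normal(size=np.shape(mu_z))
    return mu_z + sigma_z * epsilon

def decoder_log_likelihood(x, z, params):
    """Return (mu, log_var, log p(x|z)) for a one-hidden-layer Gaussian decoder."""
    h = np.tanh(params["W_h"] @ z + params["b_h"])       # hidden layer
    mu = params["W_mu"] @ h + params["b_mu"]             # mean of p(x|z)
    log_var = params["W_var"] @ h + params["b_var"]      # log of the diagonal variance
    # Log density of a diagonal Gaussian, summed over the data dimensions.
    log_p = -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (x - mu) ** 2 / np.exp(log_var)
    )
    return mu, log_var, float(log_p)

rng = np.random.default_rng(0)
latent_dim, hidden_dim, data_dim = 2, 8, 4   # hypothetical sizes, not MNIST's 784
params = {
    "W_h": rng.normal(scale=0.5, size=(hidden_dim, latent_dim)),
    "b_h": np.zeros(hidden_dim),
    "W_mu": rng.normal(scale=0.5, size=(data_dim, hidden_dim)),
    "b_mu": np.zeros(data_dim),
    "W_var": rng.normal(scale=0.1, size=(data_dim, hidden_dim)),
    "b_var": np.zeros(data_dim),
}
z = sample_z(np.array([1.69, 0.28]), np.array([0.1, 0.1]), rng)
x = rng.normal(size=data_dim)                # stand-in for one data vector
mu, log_var, log_p = decoder_log_likelihood(x, z, params)
print(z, log_p)
```

The randomness enters only through `epsilon`, so `z` is a deterministic, differentiable function of `mu_z` and `sigma_z`. That is what lets gradients flow back into the encoder during training.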

The other figure from the reference is,


This is a good figure, but it is not easy to follow in a tutorial, so I will focus on individual digits and what happens to them.

The 10 samples of $\boldsymbol{z}$ are obtained by encoding $\boldsymbol{x}$.

z_sample = x_test_encoded[:10]
array([[ 0.69002199, -2.90184522],
       [ 0.41241992,  0.603185  ],
       [-3.11358571, -1.56584907],
       [ 0.01399346,  2.09900308],
       [ 2.07487154, -0.32094592],
       [-2.62375212, -1.29237843],
       [ 0.92650926, -0.95425624],
       [ 1.69461596,  0.27989849],
       [-0.33635819,  0.34108013],
       [ 0.34854144, -1.5491755 ]], dtype=float32)

array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9], dtype=uint8)

The label array shows that the 8th and 10th digits are both 9. We train the generator (decoder) on the mapping from $\boldsymbol{z}$ to $\boldsymbol{x}$.

[ 1.69461596  0.27989849]


In sum, from the given data we find the variable $\boldsymbol{z}$, and from it we regenerate the values in the data space.



This is the original data image. The two images above both show the number 9, but they look different: the regenerated one is blurred, and even its shape is deformed. This shows that a good autoencoder avoids the identity mapping.

[ 0.34854144 -1.5491755 ]

Do you see that this z-value is quite different from the one above?
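The difference can be quantified with a small check on the numbers printed earlier. The sketch below simply re-enters the ten encoded values and their labels by hand; the variable names `labels`, `nine_idx`, and `distance` are illustrative.

```python
import numpy as np

# Encoded z values and labels of the first 10 test digits, copied from above.
z_sample = np.array([[ 0.69002199, -2.90184522],
                     [ 0.41241992,  0.603185  ],
                     [-3.11358571, -1.56584907],
                     [ 0.01399346,  2.09900308],
                     [ 2.07487154, -0.32094592],
                     [-2.62375212, -1.29237843],
                     [ 0.92650926, -0.95425624],
                     [ 1.69461596,  0.27989849],
                     [-0.33635819,  0.34108013],
                     [ 0.34854144, -1.5491755 ]], dtype=np.float32)
labels = np.array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9], dtype=np.uint8)

nine_idx = np.flatnonzero(labels == 9)      # zero-based positions of the two 9s
z_a, z_b = z_sample[nine_idx]               # their latent codes
distance = float(np.linalg.norm(z_a - z_b)) # Euclidean distance in z-space
print(nine_idx, distance)
```

The two 9s are separated by roughly 2.3 units in a latent space where most of the ten points lie within about 3 units of the origin, so the same digit class can occupy fairly distant codes.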




  1. I copied the sentence from the famous deep learning book because it is a good and concise explanation of the autoencoder. One of the strongest points of the book is that it also covers very recent research topics. The variational autoencoder's original paper appeared in December 2013.