I announce over and over that the chronicle ordering of the post are irrelevant for beginners’ favor. There are many blanks I skipped. I would fill the holes later.

## Variational method

During my physics coursework and researches, I used this method countlessly. I even had a book of the name. It is quite simple, but also as big topic as being a book. Simply put, it is a technique to find equations and solutions (sometimes approximate solutions) by extremizing functionals which is mainly just integrals of fields, and treat the functions in the integral, as parameters.

Thus, it is such a broad subject, but here we will consider some specific technique to obtain the approximate probability density using Feynman-Bogolubov inequality1 2. The other good thing of the inequality is that it yields the bound. The low bound of the Helmholtz free energy(functional).

## The disposal density as an approximation of the target density

Thus, I assume that you know the variational principle or Feynman-Bogolubov inequality. The trial function in the next example is the disposal probability distribution in order to find the target probability density. We will use the target density appeared in the chapter 24, Exact marginalization, of the book. The exercises 24.3 of the chapter is very good at understanding Bayesian theory and maximum likelihood. I will handle the problem if I have a chance next time. Exact marginalization means that we calculate the probability by integrating over continuous variables (or summation over discrete variables). That is processed by assuming prior distribution.

$$P(\mu, \sigma| Data) \sim \exp(-{ N(\mu-\bar{x})^2+S\over 2 \sigma^2})$$

and the variational free energy (with Bayesian rule), the functional of trial function is

$$\tilde{F}(Q) = \int d\mu~d\sigma ~Q(\mu, \sigma) \log {Q(\mu, \sigma) \over P(Data|\mu, \sigma)P(\mu,\sigma)}$$

and additionally we assume $Q(\mu, \sigma)$ is separable.

$$Q(\mu, \sigma) = Q_{\mu}(\mu)Q_{\sigma}(\sigma)$$

From extrema of the free energy, $\tilde F$ with respect to $Q_{\mu}(\mu)$

and $Q_{\sigma}(\sigma)$, we obtain two conditions to optimize $Q_{\mu}(\mu)$ and $Q_{\sigma}(\sigma)$.

$$Q_{\mu}(\mu) = \mathcal{N}(\mu; \bar{x}, \sigma_Q^2)$$

and

$$Q_{\sigma}(\sigma) = {2 \over \sigma^3} \Gamma({1\over\sigma^2}, b, c)$$

where

$$\sigma_Q^2 = 1/(N\bar{\beta})$$

and

$$\bar{\beta} = \int d\sigma Q_{\sigma}(\sigma) 1/\sigma^2$$

and

$$b = {2 \over N \sigma_Q^2 + S}, ~~~~~ c = N/2$$

Data is given. $N$, $\bar x$, and $S$. Then by the algorithm, Python code is as follows.3

import numpy as np
import scipy.stats as st
import random
import matplotlib.pyplot as plt
import scipy.special as sp
import scipy.integrate as integrate
%precision 12

%matplotlib inline
plt.style.use('ggplot')

def P_mu(x_bar, mu, sigma):
return 1/(2*np.pi*sigma**2)**(2.0)* np.exp(-( (mu - x_bar)**2)/(2*sigma**2))

def Gamma(x, s, c):
return 1/(sp.gamma(c)*s) * (x/s)**(c-1) * np.exp(-x/c)

N = 5.0
S = 1.0

# random intial values
sigma_Q = 0.5*random.random()
b = 10* random.random()
c = 10* random.random()
x_bar = random.random()

# randodm initial Q
def Qmu(X1):
return P_mu(x_bar, X1, sigma_Q )
def Qsigma(X2):
return 2.0/(X2**3.0)*1/(sp.gamma(c)*b) * (1.0/X2**2/b)**(c-1) * np.exp(-1.0/X2**2/b)

# will be integrated
def argument(X2):
return Qsigma(X2) * 1/ X2**2

update = 0
while update < 15:

plt.figure()
x1 = np.arange(0.001, 3.0, 0.01)
x2 = np.arange(0.001, 3.0, 0.01)
X1, X2 = np.meshgrid(x1, x2)
P = 1/(2*np.pi*X2**2)**(N/2.0)* np.exp(-( N*(X1 - 1.0)**2+S)/(2*X2**2))
plt.contour(X1,X2,P)

Q = Qmu(X1) * Qsigma(X2)
plt.contour(X1,X2, Q ,linestyles = 'dotted',colors='k')
plt.axis([0,2,0,1.2])
plt.xlabel('$\mu$')
plt.ylabel('$\sigma$')

if update % 2 == 0:
x_bar = 1.0
sigma_Q = np.sqrt(1/(N*beta_bar))

else:
sigma_Q = np.sqrt(1/(N*beta_bar))
b = 2.0/(N*sigma_Q**2 + S)
c = N/2.0

update += 1


## Effectiveness and weakness of variational method as a tool of the optimization.

From the random initial $Q(\mu, \sigma)$(the dotted line in the figures), it approximately approaches to the target distribution. Theoretically, the peak of $Q_{\sigma}(\sigma)$ should be at $\sigma = 0.50$, and the peak of $P(\mu, \sigma)$ should be at $(\mu,\sigma) = (1.0, 0.45)$. In other words, even the peak of $Q(\sigma)$ is deviated from the peak of the target density in principle. Besides, $Q(\mu, \sigma)$, the approximate density is more compact than the target density. When you want to use the disposal density obtained from this variational principle, you should keep them in mind.

Then, is it a bad approximation? Variational method is good. Really good. In the above plots and in only few turns, it reaches very good approximate solution. MCMC approximation can’t be fast like it. In some image optimization, variational inference shows really amazing efficiency and precision. Why the approximation of the example was not good is, just simple. The assumption was not right. The target density is not separable. Variational method needs strong assumptions to provide exact form of equations. With some incorrect assumptions, it still furnishes great and fast approximation, but often not so exact. Thus, I would use this powerful tool to get the great disposal functions, and might continue the process with other method such as Markov chain Monte Carlo method.

x = np.arange(0.1, 2.0, 0.01)
plt.plot(x,Qsigma(x))
1/(2*np.pi*X2**2)**(N/2.0)* np.exp(-( N*(X1 - 1.0)**2+S)/(2*X2**2))
plt.plot(x,55/(2*np.pi*x**2)**(N/2.0)* np.exp(-S/(2*x**2)))


x = np.arange(0.4, 0.48, 0.001)
plt.plot(x,Qsigma(x))
plt.plot(x,55/(2*np.pi*x**2)**(N/2.0)* np.exp(-S/(2*x**2)))


## Mean field approximation

This variational method is useful to induce the mean field approximation of Ising model. Easily evaluated. In the near future, I would handle Ising model. When I was a little kid, I wondered how the magnet is made. The Ising model resolves the mystery and very useful to study Hopfield net and Boltzmann machine.

## Mathematica

As I mentioned above, I used Mathematica to debug my Python code. Some of you are not familiar to it, so I want to introduce about it to you.

Mathematica is rather unique in that it provides the mathematical symbolic and algebraic computing. It is often very convenient and quick in scientific researches. For example, consider a following example.

def f(x):
return x

def g(x):
return x**2

print f(x) * g(x)


The answer is $x^3$, but many computer programs are null at this. You must put the values of the argument, $x$ or newly define the function returns $x^3$. Mathematica does this procedure with no effort, intuitively.

For the other example, when you want to draw a plot of any continuous function in Python, you should set the (discrete but tiny gapped) lists or arrays. To me, it is quite inconvenient. In Mathematica, set the range of $x$ for your function, that’s all for the plot.

### Then why do I barely use it in machine learning?

Sometimes, this kind of settings blur the key points of the programming. That’s the same as why I do not use completed libraries in this blog. The methods which is supported in those libraries includes many *args and **kwargs, and complex Classes. It disturbs you when you want to fully understand how it works and when you face some bugs or troubles while using it. It might take time to resolve the troubles as much as ones program it from nothing. Of course, you can code most of what you code in other programs, but it is not efficient.

In sum,

• Mathematica is super convenient for simple plotting of your 1-D, 2-D, 3-D graphs.

• Awesome algebraic calculations will save time.

• Best at checking or testing your ideas in beginning level.

1. Feynman is one of my favorite physicists, with Einstein, Dirac. He was a great teacher as well. He had very broad range of interest on the world, so was influential even to data science. Apparently, he was one of the first scientists conceived the possibility of the quantum computer. [return]
2. I won’t prove the inequality. That is equivalent to the proof of the Gibbs’ inequality in the book even though it looks totally different. See the proof using Baker-Cambell-Hausdorff theorem and convexity of exponential function. Free energy in statistical mechanics is an exponential function.) [return]
3. Rather than using the statistical probability library, I defined the distributions. It is much clear to understand how the approximation by variational method works. However, I made a trivial typo mistake in parameter of the $\Gamma(x,b,c)$ function, and it took many hours to find my mistake. At first, I thought the above equation (5) was mistaken, and re-calculated the equations of variations over and over. Definitely the error was in $Q_{\sigma}(\sigma)$, so I compared the numerical values of the distribution with what I obtained by using Mathematica. The gap was big, but I doubted if Python numerics work properly. Very stupid, ain’t I? That kind of numerical mistakes never present the error message, so very fatal. [return]