## Generative model

Generative model is a model for generating all variables including outputs. I will give a very simple example with strong assumptions.

Data $\boldsymbol{x^{(n)} }$ are generated by an unknown matrix, $\boldsymbol{G}$.

$$\boldsymbol{x} = \boldsymbol{G}~\boldsymbol{s}$$

The goal is to find the source variable $\boldsymbol{s}$.

• we assume that the number of sources is equal to the number of observations
• We assume that the latent variables are independently distributed, with marginal distributions
• We assume that the vector $\boldsymbol{x}$ is generated without noise for simplicity.

$$z_i = \phi_i(s_i)$$

and

$$\boldsymbol{W}^{-1} = \boldsymbol{G}$$

and

$$\phi_i(s_i) \equiv {d\ln~p_i(s_i) \over ds_i}.$$

$\phi_i(s_i)$ can be chosen appropriately for the proper probability density. For example, $\phi_i(s_i) = -\tanh(s_i)$ or $CauchyPDF(s_i)$

### Contour plots from the generative model

import random
import numpy as np

import scipy.stats as st
import matplotlib.pyplot as plt
import scipy.special as sp
import scipy.integrate as integrate
%precision 6
from __future__ import division

%matplotlib inline
plt.style.use('ggplot')


Set the matrix and check the matrix multiplications set correctly.

G = np.array([[3/4, 1/2],[1/2, 1]])
G2 = np.array([[2, -1],[-1, 3/2]])

W = np.linalg.inv(G)
W2 = np.linalg.inv(G2)

ax , ay = np.dot(W, [x, y])
ax2 , ay2 = np.dot(W2, [x, y])

zx, zy  = -np.tanh(ax), -np.tanh(ay)
zx2, zy2  = -np.tanh(ax2), -np.tanh(ay2)

px, py = 1/np.cosh(ax), 1/np.cosh(ay)
px2, py2 = 1/np.cosh(ax2), 1/np.cosh(ay2)


Distributions over two observables generated by $1/\cosh$ distributions on the latent variables, for $G$ and $G2$ ( or $W$ and $W2$).


N = 100
x, y = np.linspace(-4,4, N), np.linspace(-4,4, N)
X, Y = np.meshgrid(x, y)
Z , Z2 = np.zeros((N,N)), np.zeros((N,N))
for i in range(N):
for j in range(N):
Z[i,j]=1/np.cosh(np.dot(W, [X[i,j], Y[i,j]]))*1/np.cosh(np.dot(W, [X[i,j], Y[i,j]]))
for i in range(N):
for j in range(N):
Z2[i,j]=1/np.cosh(np.dot(W2, [X[i,j], Y[i,j]]))*1/np.cosh(np.dot(W2, [X[i,j], Y[i,j]]))
plt.contour(Z)
plt.contour(Z2)
plt.colorbar()
plt.show() From the hyper-cosine distribution, it is hard to see the effect of generative matrix, $G$. It has heavier tail than Gaussian but still quite center oriented. Let us try Cauchy distribution. See the contours of the generative distributions when the latent variables have Cauchy distributions.


N = 100
x, y = np.linspace(-2,2, N), np.linspace(-2,2, N)
X, Y = np.meshgrid(x, y)
Z, Z2 = np.zeros((N,N)), np.zeros((N,N))
for i in range(N):
for j in range(N):
Z[i,j]=Cauchy(np.dot(W, [X[i,j], Y[i,j]]))*Cauchy(np.dot(W, [X[i,j], Y[i,j]]))

for i in range(N):
for j in range(N):
Z2[i,j]=Cauchy(np.dot(W2, [X[i,j], Y[i,j]]))*Cauchy(np.dot(W2, [X[i,j], Y[i,j]]))

plt.contour(Z)
plt.contour(Z2)
plt.colorbar()
plt.show() Still the effect is small. I want to see the tail from the Cauchy distribution. To set the proper level of contour, check the maximum and minimum of the function.

Z.min(), Z.max()

(0.000000, 9.123957)


The maximum of the probability density is around 9.1 because I did not normalize the density distribution.


N = 100
x, y = np.linspace(-8,8, N), np.linspace(-8,8, N)
X, Y = np.meshgrid(x, y)
Z, Z2 = np.zeros((N,N)), np.zeros((N,N))
for i in range(N):
for j in range(N):
Z[i,j]=Cauchy(np.dot(W, [X[i,j], Y[i,j]]))*Cauchy(np.dot(W, [X[i,j], Y[i,j]]))

for i in range(N):
for j in range(N):
Z2[i,j]=Cauchy(np.dot(W2, [X[i,j], Y[i,j]]))*Cauchy(np.dot(W2, [X[i,j], Y[i,j]]))

contour_levels = np.arange(0.01, 0.3, 0.01)
plt.contour(Z, contour_levels )
plt.contour(Z2, contour_levels )
plt.colorbar() Now it looks better.

## Learning algorithm and maximum likelihood again1

As I mentioned earlier, the generative matrix, $G$, is unknown. If our initial matrix was improper, we can run the learning algorithm for optimization. However, we can’t arbitrarily change the matrix.

For learning about $\boldsymbol{G}$, investigate the likelihood,

$P(\boldsymbol{x}^{(n)} | \boldsymbol{G}, \mathcal{H})$.

$$\ln~P(\boldsymbol{x}^{(n)}|\boldsymbol{G}, \mathcal{H}) = -\ln ~|\det ~\boldsymbol{G}|+\sum_i \ln~p_i(G^{-1}_{ij}x_j)$$

Take the derivative of the logarithm likelihood, then we obtain

$${\partial \over \partial W_{ij}}\ln~P(\boldsymbol{x}^{(n)}|\boldsymbol{G}, \mathcal{H}) = G_{ij }+x_jz_i$$

Induce of equation (6) may be found in this link2, the proof of the identities of the book, (34.10)-(34.12) are also linked. These identities are very useful to take the derivative with respect to $W{ij}$ or $G{ij}$.

Now for the Python coding example, I will make contour plots by learning.

$$\Delta \boldsymbol{W} = \epsilon ( [\boldsymbol{W^{T}}]^{-1}+\boldsymbol{zx^T})$$

The epsilon is chosen by hand here since we are not really solving the problem by the real data. I put only one generative matrix for each graph to see more clearly.

plt.hold(False)
N = 100
x, y = np.linspace(-4,4, N), np.linspace(-4,4, N)
X, Y = np.meshgrid(x, y)
Z  = np.zeros((N,N))

G = np.array([[3/4, 1/2],[1/2, 1]])
W = np.linalg.inv(G)

turns = 0
epsilon = 0.0001
WN = W
while turns <5:

Z  = np.zeros((N,N))

for i in range(N):
for j in range(N):
Z[i,j]=1/3.14141*1/np.cosh(np.dot(WN, [X[i,j], Y[i,j]]))*1/3.14141*1/np.cosh(np.dot(WN, [X[i,j], Y[i,j]]))

plt.contour(Z, np.array([0.035]))
plt.hold(True)
dW = np.linalg.inv(np.transpose(WN))+np.dot([-np.tanh(np.dot(WN, [x, y]))], np.transpose([x,y]))

WN= WN+epsilon *dW

turns +=1

plt.show()


One contour plot for each turn (same blue plots) plt.hold(False)
N = 100
x, y = np.linspace(-4,4, N), np.linspace(-4,4, N)
X, Y = np.meshgrid(x, y)
Z  = np.zeros((N,N))

G = np.array([[3/4, 1/2],[1/2, 1]])
W = np.linalg.inv(G)

turns = 0
epsilon = 0.0001
WN = W
while turns <5:

Z  = np.zeros((N,N))

for i in range(N):
for j in range(N):
Z[i,j]=1/3.14141*1/np.cosh(np.dot(WN, [X[i,j], Y[i,j]]))*1/3.14141*1/np.cosh(np.dot(WN, [X[i,j], Y[i,j]]))

plt.contour(Z)
plt.hold(True)
dW = np.linalg.inv(np.transpose(WN))+np.dot([-np.tanh(np.dot(WN, [x, y]))], np.transpose([x,y]))

WN= WN+epsilon *dW

turns +=1
plt.colorbar()
plt.show()


See the learnings for many contours. plt.hold(False)
N = 100
x, y = np.linspace(-4,4, N), np.linspace(-4,4, N)
X, Y = np.meshgrid(x, y)
Z  = np.zeros((N,N))

G = np.array([[3/4, 1/2],[1/2, 1]])
W = np.linalg.inv(G)

turns = 0
epsilon = 0.0002
WN = W
while turns <10:

Z  = np.zeros((N,N))

for i in range(N):
for j in range(N):
Z[i,j]=Cauchy(np.dot(WN, [X[i,j], Y[i,j]]))*Cauchy(np.dot(WN, [X[i,j], Y[i,j]]))

plt.contour(Z, np.array([0.35]))
plt.hold(True)
dW = np.linalg.inv(np.transpose(WN))+np.dot([-np.tanh(np.dot(WN, [x, y]))], np.transpose([x,y]))

WN= WN+epsilon *dW

turns +=1

plt.show()


For Cauchy distribution Thus, we carry out that the distribution is getting broader in the specific direction.

### Covaiance and tensors

The name of the section 34.3 in the the book, Information Theory, Inference, and Learning Algorithms, is “A covariant, simpler, and faster learning algorithm”. Nevertheless, I could not find any further explanation about the meaning of covariance here. It is quite confusing because we use the terminology variance to describe the distributions. The covariance in statistics is defined to exhibit how much two values are linearly related to each other. $Cov(f(x),g(x))=E[(f(x)-E[f(x)])(g(x)-E[g(x)])]$ where $E(f)$ is the expectation of a function, $f$.

Here, the covariance is absolutely different thing. I love this book, but sometimes I feel this book is not much self-contained for engineers. The concept is significantly studied in differential geometry or theory of gravitation and high energy physics such as quantum field theory and string theory. Thus, I feel very comfortable to handle the tensorial expressions in differential geometry present in machine learning.

Simply put, tensor is a high dimensional matrix or extended component quantities. Matrix is a mere rank 2-tensor. Rank 3 tensor looks like this. $T_{ijk}$ and

$T_{jk}^i$ and so on.

Why do we, all of a sudden, talk about geometry or general relativity of Einstein? See the equation (6) again. The derivative was taken with respect to matrix, $W_{ij}$. Then, we should make the equation tensorial, which follows the transformation rules of tensors.

The book mentioned about metric and curvature. Both are very well-known tensors. Let’s consider two paths from the same starts to same ends, from $(0,~0)$ to $(1,~1)$. Two observers move along the different paths and measure the time takes to go to the goal. In the flat world, it takes same time. Both movements are just equivalent with different orders. However, imagine something there was hills3 for one path. Then it would take more time because the steep uphill is fatal. In other words, because of the curved spacetime, the path was not commutable, and the quantities to exhibit the commutator is a curvature.

Metric tensor is the most famous rank 2 tensor, $g_{ij}$, which has two components. It appears when we define intervals.

$$ds^2 = g_{ij}dx^idx^j$$

I discern upper index and low index of the vector and tensor. In cartesian spacetime, the metric tensor is an identity matrix, so $x^i$ and $x_i$ are identical, but if the metric tensor has off-diagonal terms, we should distinguish them. The metric tensor indeed plays a role to lower or raise the indices.

$$\partial_i \equiv {\partial \over \partial x^i},~~ and ~~ \partial^i \equiv {\partial \over \partial x_i}$$

As I commented the curvature tensor emerges from the non-commutivity, the non-diagonalized matrix derivative causes a curvature.

$$R^{(ij)(kl)} \equiv {\partial \over \partial W_{ij}}{\partial \over \partial W_{kl}} - {\partial \over \partial W_{kl}}{\partial \over \partial W_{ij}} \ne 0$$

The above rank 4-tensor is such as a curvature tensor4.

In contrast, the curvature in cartesian flat space is none. In this case, we do not need this covariantize steps at all.

$$\partial_i \partial_j - \partial_j \partial_i = 0$$

### Covariant learning algorithm

Thus, I rewrite the equation (34.19) of the book with the tensor convention.

$$\Delta \omega_{i} = \epsilon {\partial L \over \partial \omega_i}$$

This is not covariant in terms of index, too. LHS is low index vector, and RHS is upper index vector because low index derivative is taken as upper index vector. To make the RHS upper index vector, we should multiply a rank 2 tensor with two low indices such as $M_{ij}$. That is the equation (34.20) of the book.

In our case, we take the derivative with respect to $W_{ij}$,

and then we need rank 4-tensor such as $M_{ijkl}$. The most famous rank 4-tensor is a curvature tensor. By multiplying a curvature tensor obtained, we induce the covariant learning algorithm.

The following learning was made with the covariant algorithm.

# covaraint

plt.hold(False)
N = 100
x, y = np.linspace(-4,4, N), np.linspace(-4,4, N)
X, Y = np.meshgrid(x, y)
Z  = np.zeros((N,N))

G = np.array([[3/4, 1/2],[1/2, 1]])
W = np.linalg.inv(G)

turns = 0
epsilon = 0.0001
WN = W
while turns <10:

Z  = np.zeros((N,N))

for i in range(N):
for j in range(N):
Z[i,j]=1/3.14141*1/np.cosh(np.dot(WN, [X[i,j], Y[i,j]]))*1/3.14141*1/np.cosh(np.dot(WN, [X[i,j], Y[i,j]]))

plt.contour(Z)
plt.hold(True)
dW = WN+np.dot([-np.tanh(np.dot(WN, [x, y]))], np.transpose(np.dot(np.dot(np.transpose(WN), WN),[x,y])))

WN= WN+epsilon *dW

turns +=1
plt.colorbar()
plt.show() 1. Again? Did I post about maximum likelihood? Well, I did not. I planned to discuss about it in soft K-means clustering. We will be back to the clustering and handle how maximum likelihood leads us to better clustering, someday. [return]
2. I would create pdf documents in LateX someday, too. Now, please be contented by the ugly handwriting in photos. [return]
3. For example, uphill to $(1,~0)$ and downhill to $(1,~1)$. [return]
4. The tensor is some special case of the Ricci tensor. [return]