## Single neuron still has a lot to say

In the post of the first neural network tutorial, we studied a perceptron as a simple supervised learning machine. The perceptron is an amazing structure to understanding inference.

In the post of the first neural network tutorial, I said I would leave you to find the objective function and and draw the plot of it. I just introduce here.

### Objective function and its contour plot.

import random
import numpy as np

import scipy.stats as st
import matplotlib.pyplot as plt
import scipy.special as sp
import scipy.integrate as integrate
%precision 12
from __future__ import division

%matplotlib inline
plt.style.use('ggplot')

def sigmoid(a):
return 1/(1+np.exp(-a))


# initial x(2d vector) and target pair in 10*10 box

def weight():
return np.array([np.random.uniform(-1,1), np.random.uniform(-1,1), np.random.uniform(-1,1)])


def linreg(x):
return -w/w-x*w/w


Here, I change the initial data points a bit. Make them asymmetric to see asymmetric result of the objective function.

def initial_values2(N): # N x 4 matrix
return np.array([(1, np.random.uniform(0,5), np.random.uniform(0,10),np.random.binomial(1, 0.5) ) for i in range(N)])

def weight():
return np.array([np.random.uniform(-1,1), np.random.uniform(-1,1), np.random.uniform(-1,1)])

N = 100
w = weight()
I = initial_values2(N)
x = I[:,:3]
t = I[:,3]

for i in range(N):
if x[:,2][i]>-0.5*x[:,1][i] + 6:
t[i]=1
else:
t[i]=0


X = x[:,1:]
w1 = np.linspace(-5.0, 5.0, 200)
w2 = np.linspace(-5.0, 5.0, 200)
Z = np.zeros((200,200))

W1, W2 = np.meshgrid(w1, w2)
for i in range(200):
for j in range(200):
w = np.array((W1[i,j],W2[i,j]))
Z[i,j] = np.dot( t, np.log(sigmoid(np.dot(X, w)))) + np.dot((1-t), np.log(1-sigmoid(np.dot(X, w))))

plt.contour(W1,W2,Z, 200)
plt.colorbar() I assumed $w_0 = 0$ for simplicity, and the center of the contour corresponds to the most probable $(w_1, w_2)$ obtained by supervised learning in the previous post.

### Inference from 10 data points

In advance, when you do this kind of matrix multiplications, be familiar to checking the shapes of the matrices in order to check the matrices are well-matched or to decide if you need to transpose some of them.

np.shape(w), np.shape(x), np.shape((t-sigmoid(np.dot(x, w))))

((3,), (100, 3), (100,))


Let us choose 10 data points among $10 \times 10$ box, generated by the following method.

def initial_values(N): # N x 4 matrix
return np.array([(1, np.random.uniform(0,10), np.random.uniform(0,10),np.random.binomial(1, 0.5) ) for i in range(N)])

N = 100
I = initial_values2(N)
x = I[:,:3]
t = I[:,3]

for i in range(N):
if x[:,2][i]>-0.5*x[:,1][i] + 6:
t[i]=1
else:
t[i]=0


X = np.linspace(0.01, 10, 100)
Y = -0.5*X + 6
plt.plot(X,Y,color='r')
N = 10

eta = 0.01
alpha = 0.01
turn = 1
turns = 1000
wlist=[w]
while turn < turns:
a = np.dot(x[:10], w)
y = sigmoid(a)
e = t[:10] - y
g = -np.dot(np.transpose(x[:10]), e) # sum , batch

w = w - eta * ( g + alpha * w )
wlist += [w]
turn += 1
uplist = []
downlist = []
for i in range(N):
if t[i]==1:
uplist = uplist +[x[:10][i]]

else:
downlist = downlist +[x[:10][i]]
uplist = np.array(uplist)
downlist = np.array(downlist)

plt.plot(uplist[:,1], uplist[:,2],'x')
plt.plot(downlist[:,1], downlist[:,2],'o')
X = np.linspace(0.01, 10, 100)
Y = linreg(X)
plt.plot(X,Y,color = 'b')
plt.axis([0,10,0,10]) Note that I intentionally set features 0 if the points are below the linear line (red line of the above plot), y = 12 x + 6, and else feature 1. Previously when we use 100 data points, the supervised learning yielded us quite precise line plot. However, if we use only 10 data, the estimated line plots (blue line of the above plot) are not so good to make inference for the next target point of estimation.

The Bayesian prediction of the next $t^{(N+1)}$ is nothing but an expectation value of $y(\boldsymbol{x}^{(N+1)}| \boldsymbol{w})$. The expectation value as an integration in the probability distribution can be found by Markov chain Monte Carlo sampling.

Of course, the target probability distribution for the sampling is improbable. That has been discussed much in this blog. It is one of defects of Monte Carlo sampling and we also studied more efficient Monte Carlo sampling to handle this problem. Here, we will use the algorithm similar to Hamiltonian Monte Carlo method, called Langevin Monte Carlo method. The difference from the leapfrog algorithm is that we have one more loop to obtain the momentum after $T$ in the leapfrog algorithm. You can review it in the post of the Hamiltonian Monte Carlo sampling.

x = x[:10]
t= t[:10]

def M(w, p, alpha):
return  -np.dot(t, np.log(sigmoid(np.dot(x, w)))) - np.dot((1-t), np.log(1-sigmoid(np.dot(x, w)))) + 0.5*alpha* np.dot(w,w)

def H(w, p, alpha):
return M(w, p, alpha)+np.dot(p,p)

return alpha*w - np.dot((t-sigmoid(np.dot(x, w))),x)

alpha = 0.01

# initial point
w0 = np.array([-10, 2,3])

Listw = [w0]

# number of samples
N = 40000

eta = 0.01
epsilon = np.sqrt(2*eta)

for i in range(N):
p0 = np.random.normal(0,1,3)

ph = p0 - epsilon* gradDescent(w0, alpha)/2.0
wn = w0 + epsilon* ph
pn = ph - epsilon* gradDescent(wn, alpha)/2.0

a = H(wn, pn, alpha)/H(w0, p0, alpha)

if a >=1:
Listw = Listw +[wn]
w0,p0 = wn, pn
else:
if random.random() < a:
Listw = Listw + [wn]
w0,p0 = wn, pn

Listw = np.array(Listw)

len(Listw)

38689


The Markov chain converges after 10000 iterations.

plt.plot(Listw[10000:][:,1], Listw[10000:][:,2], '.')
plt.plot(1.256838,   2.530788, 'o', color='b')
plt.axis([-1,6,-2,7]) Blue dot is the estimated $(w_1, w_2)$ in $\boldsymbol{w}$-space.

We randomly draw 25 samples from the Markov chain Monte Carlo samples.

selected_w = np.array([Listw[10000:][1000*i] for i in range(25)])

X = np.linspace(0.01, 10, 100)
Y = linreg(X)

for i in range(25):
plt.plot(X, -(selected_w[:,0][i]+selected_w[:,1][i]*X)/selected_w[:,2][i])
plt.plot(X,Y, linewidth=3, color='b')

X = np.linspace(0.01, 10, 100)

Y = -0.5*X + 6
plt.plot(X,Y,linewidth=3,color='r')

plt.show() The blue line is the estimated line from $\boldsymbol{w}$, and the red line is the function given to draw the samples, i.e. the exact solution. Both are in the range of the inference estimation.

XX, YY = np.meshgrid(X, X)
Z = sigmoid(-w-w*XX-w*YY)
plt.contour(XX, YY, Z)
plt.colorbar()

plt.plot(uplist[:,1], uplist[:,2],'x')
plt.plot(downlist[:,1], downlist[:,2],'o') This is the probability contour function for $P(t=1,~\boldsymbol{w}, \boldsymbol{x})$, obtained from the expectation value of $y = 1/(\exp(-\boldsymbol{w} \cdot \boldsymbol{x}))$. The estimation for the given data is good, but not at all good for predictions. Already wrong from the exact solution.

XX, YY = np.meshgrid(X, X)
Z = []
for i in range(25):
w = selected_w[i]
Z.append(sigmoid(-w-w*XX-w*YY))


Z = np.array(Z)
np.shape(Z)

(25, 100, 100)

np.shape(sum(Z))

(100, 100)

ZZZ = sum(Z)/30
plt.contour(XX, YY, ZZ, 10)
X = np.linspace(0.01, 10, 100)
Y = -0.5*X + 6
plt.plot(X,Y,color='r', linewidth=3)
plt.colorbar()
plt.plot(uplist[:,1], uplist[:,2],'x')
plt.plot(downlist[:,1], downlist[:,2],'o')

X = np.linspace(0.01, 10, 100)
Y = linreg(X)
plt.plot(X, Y, linewidth=3, color = 'b') This is obtained by taking the average of 25 $y$’s from the Monte Carlo samples. Again, the blue line is the estimated line from $\boldsymbol{w}$, and the red line is the function given to draw the samples, i.e. the exact solution. Bayesian predictions become moderate because we do not have enough data. If we use enough many data, the predictions will become more tight.

This is an example of Bayesian 1-neuron network, and if you construct the multiple neurons network, it is the Bayesian network.