# Neural Network (1): Perceptron and Stochastic Gradient Descent

## Single neuron is amazing

One of the lessons I had during physics program is that we should start to understand small thing deeply however complicated the system which you want to know is. Not just it is easier but also it helps a lot to understand the more complex ones.

Neural network is often compared to black magic. We do not understand why and how exactly so effective it is, but it makes great estimations in some specific matters. To resolve that kind of mystery, and to begin tutorials of deep learning, I want to start from a single neuron, a perceptron. This perceptron study will be very helpful for understanding more complex Hopfield network and Boltzmann machine.

Even this simple, single perceptron is a very good supervised learning machine. We would solve a simple supervised model in 2 dimensional space.

## Stochastic algorithm and advanced reading group

Besides, we will study stochastic gradient descent compared with batch gradient descent, and will see the power of the randomness.

Yesterday afternoon, I found out there is *the advanced reading group on machine learning* at downtown. The topic was surprisingly *a review of the variational inference*. Did not I post a review of the subject? I quickly scanned the review research paper and joined the meetup.

That meeting was very great. It is my first real discussion with Data scientists based on statistics and computer science, not from physics. To see their different perspective on this topic was also interesting^{1}. I also did my best to introduce physicists’ point of view on the topic.

Anyway, one of the big sections in the review paper was stochastic variational inference. As we discussed in the previous post, we should solve differential equations of the free energy, or the objective functional, and the solutions are often the sum of complicated multiplied matrices. When the size of the sample increases, it becomes extremely expensive. To avoid this trouble, data scientists use randomness and it is even magical. Random makes it inaccurate, but help to find what you want faster!

## Perceptron

### Batch gradient descent

We have 100 random 2d position vectors in the $10 \times 10$ box. The vectors have features, 0 or 1. We want to find the best straight line to split the samples. I intentionally set features 0 if the points are below the linear line, y = ^{1}⁄_{2} x + 6, and else feature 1.

```
import random
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import scipy.special as sp
import scipy.integrate as integrate
%precision 6
from __future__ import division
%matplotlib inline
plt.style.use('ggplot')
```

```
def sigmoid(a):
return 1/(1+np.exp(-a))
```

```
# initial x(2d vector) and target pair in 10*10 box
def initial_values(N): # N x 4 matrix
return np.array([(1, np.random.uniform(0,10), np.random.uniform(0,10),np.random.binomial(1, 0.5) ) for i in range(N)])
def weight():
return np.array([np.random.uniform(-1,1), np.random.uniform(-1,1), np.random.uniform(-1,1)])
```

```
def linreg(x):
return -w[0]/w[2]-x*w[1]/w[2]
```

```
N = 100
w = weight()
I = initial_values(N)
x = I[:,:3]
t = I[:,3]
for i in range(N):
if x[:,2][i]>-0.5*x[:,1][i] + 6:
t[i]=1
else:
t[i]=0
eta = 0.01
alpha = 0.01
turn = 1
turns = 1000
wlist=[w]
while turn < turns:
a = np.dot(x, w)
y = sigmoid(a)
e = t - y
g = -np.dot(np.transpose(x), e) # sum , batch
w = w - eta * ( g + alpha * w )
wlist += [w]
turn += 1
uplist = []
downlist = []
for i in range(N):
if t[i]==1:
uplist = uplist +[x[i]]
else:
downlist = downlist +[x[i]]
uplist = np.array(uplist)
downlist = np.array(downlist)
plt.plot(uplist[:,1], uplist[:,2],'x')
plt.plot(downlist[:,1], downlist[:,2],'o')
X = np.linspace(0.01, 10, 100)
Y = linreg(X)
plt.plot(X,Y)
plt.axis([0,10,0,10])
```

Looks good supervised learning enough? I skip the detail explanation about perceptron (**NN**). This is very famous, and broadly-known. I want to focus only on two lines.

```
g = -np.dot(np.transpose(x), e)
w = w - eta * ( g + alpha * w )
```

The first line is the gradient descent. The defined x is a $100 \times 3$ matrix, and e is a 100 dimensional vector. The complexity of the matrix product is $\mathcal{O}(100 \times 3 \times 100)$. With a large number of data is it too expensive.

### Regularization

The change of the weight to minimize the objective function^{2} is usually a multiplication of a learning rate and the gradient descent. However, here, The extra 2nd term, alpha * w is introduced as a *weight decay regularizer*.

One of the most important topics in quantum field theory is the regularization. The detail is mathematically complicated and apart from the machine learning, but the intuition and role of it is very similar to the *weight decay regularizer*.

In physics, it is very consequential to compute the total energy of the system. Most of researches in physics start from obtaining the energy of the concerned systems. For an example, consider a charged particle. It has the radial electric field and the magnitude of it is inverse to the distance. Then, the total energy in the spacetime by the field is proportional to $\int 1/r dr$. This is definitely infinite even in the finite volume including the particle because $log ~r$ diverges as $r \rightarrow 0$. It is a bad news to physicists.

Roughly, the idea is considering a cutoff radius $\Lambda$, which is very tiny but not zero, and split the integral. Then one finite part and the other one, but with some mathematical trick, the other part becomes finite. I know it sounds terrible, but it is true. Well, I would say it is the very illusion that it looks infinite. I think it is related to the question, *what on the earth is the vacuum energy*.

Without *weight decay regulaizer*, the points very near to the line gradually contributes more, and diverges at last. This is an example of overfitting, and *weight decay regulaizer* saves you from the evil infinity.

```
# the graph of y-intrsection vs slope of the linear graph
wlist = np.array(wlist)
plt.plot(-wlist[:,0]/wlist[:,2], wlist[:,1]/wlist[:,2])
plt.plot(-wlist[:,0][-1]/wlist[:,2][-1], wlist[:,1][-1]/wlist[:,2][-1], 'o')
print -wlist[:,0][-1]/wlist[:,2][-1], wlist[:,1][-1]/wlist[:,2][-1]
plt.axis([-10,10,-10,10])
```

The above plot is the graph of y-intersection vs slope of the linear plot during the iterations. Blue point is the last point of iterations. I did not draw the contour plot of the objective function. I will leave it for you. I did not write down the objective function in the post, so you might need to look at the textbook. When you contour plot it, you will find the ellipse around the blue point and the blue point should be about the minimum of the objective function. Note that the there is a clear pattern of approaches.

## Stochastic gradient descent and on-line learning

To dodge the cost problem of large numbered gradient descent, we use the stochastic gradient descent.

```
N = 100
w = weight()
I = initial_values(N)
a= np.zeros(N)
y = np.zeros(N)
e = np.zeros(N)
turns = 1000
wlist =[w]
turn = 1
while turn < turns:
for i in range(N):
a[i] = np.dot(w, x[i])
y[i] = sigmoid(a[i])
e[i] = t[i] - y[i]
eta = 0.1/N**(0.4)
w = w + eta * e[i] * x[i]
wlist += [w]
turn = turn +1
```

```
uplist = []
downlist = []
for i in range(N):
if t[i]==1:
uplist = uplist +[x[i]]
else:
downlist = downlist +[x[i]]
uplist = np.array(uplist)
downlist = np.array(downlist)
plt.plot(uplist[:,1], uplist[:,2],'x')
plt.plot(downlist[:,1], downlist[:,2],'o')
X = np.linspace(0.01, 10, 100)
Y = linreg(X)
plt.plot(X,Y)
plt.axis([0,10,0,10])
```

What happened? Worse than before! I tried to tune this up to make the better approximation, but could not. Before answer the question, see how the algorithm works.

The crucial part is the while loop. Can you notice what the gradient is? I did not clearly express it in the code.

```
g[i] = e[i] * x[i] # The gradient vector is N-dimensional.
```

There is no sum or matrix-multiplications. That saves a lot of cost. Of course, that gradient value is not correct gradient vector, but it is enough for rough trial and errors. Besides, the learning rate, $\eta$, is updated for every turn, and also is getting smaller. During the first a few iterations, it quickly and roughly pursues the approximate solution, and gradually tries better fine tuning.

Here, the example is unfair to the on-line learning, but if the sample is large-numbered, it will be powerful or even magical to reach convergent point faster.^{3} Besides, the linear plot is very sensitive at the value of slope. Sigmoid function is very sensitive around the slope, too.

```
# the graph of y-intrsection vs slope of the linear graph
wlist = np.array(wlist)
plt.plot(-wlist[:,0]/wlist[:,2], wlist[:,1]/wlist[:,2])
plt.plot(-wlist[:,0][-1]/wlist[:,2][-1], wlist[:,1][-1]/wlist[:,2][-1], 'o')
plt.axis([-10,10,-10,10])
```

The first few steps looks very random and the size of the step is decreasing. There is no correct or at least no good convergent point of the learning here.

Thus, to choose the stochastic gradient descent for the example was not right. However, it is very good for our understanding as an easy example. Then, how will you use the on-line learning? Even for this simple example, if my computer was horribly poor working like 60’s, then I would run the on-line learning first, but not too many iterations, and then when I feel it gives not too bad approximation, I will continue to process with the batch gradient descent to obtain fine-tuned approximation. The on-line learning would help us to choose better learning rate as well.

- The only bad thing was that they use too many abbreviations I am not familiar to. Furthermore, I feel that using abbreviations is just trendy all over the world now. It makes me hard to understand internet threads even in my mother tongue languages. I do not like abbreviations.
^{[return]} - I did not write the objective function. Gradient descent is obtained from the gradient of the objective.
^{[return]} - Nevertheless, in on-line learning, it is not so clear if it converges. At least, it does not at my example.
^{[return]}