# Clustering (2) : Soft K-mean

## Hard K-means and responsibilities

If you did not read the first part of the clustering series. Please go check it out. I use the same data points and this post starts from troubleshooting the *hard K-means algorithm* in the previous post.

In the previous post, we defined assignment. The equivalent representation of this assignment of points to clusters is given by responsibilities, $r^{(n)}_k$. In the assignment step, we set $r^{(n)}_k$ to one if mean k is the closest mean to datapoint $ {\textbf x}^{(n)}$; otherwise, $r^{(n)}_k$ is zero.

$$
r^{(n)}_k = \{
\begin{array}{cc}
1 ~~~~~~if ~\hat {k} ~^ {(n)}=k\\

0 ~~~~~~if ~\hat {k} ~^ {(n)}\ne k

\end{array}
$$

In the update step, the means are adjusted to
match the sample means of the data points that they are responsible for. The update step is very similar to how to find *the center of the mass* in physics.

$$ {\textbf m}^{(k)}= {\sum_n ~ r^{(n)}_k {\textbf x}^{(n)} \over R^{(k)}} $$

where $R^{(k)}$ is the total responsibility of mean $k$

$$ R^{(k)}=\sum_n~r^{(n)}_k $$

## Soft K-means

In soft K-means, the K responsibility is

$$ r^{(n)}_k = {exp(-\beta ~ d(\boldsymbol{m}^{(k)},\boldsymbol{x}^{(n)} )) \over \sum_{k’}~exp(-\beta ~ d(\boldsymbol{m}^{(k’)},\boldsymbol{x}^{(n)} ))} $$

and

$$ \boldsymbol{m}^{(k)} = { \sum_n r^{(n)}_k \boldsymbol{x}^{(n)}\over \sum_n r^{(n)}_k} $$

$\beta$ is a *stiffness parameter*. From trials and errors, I set the parameter by hand.

```
import random
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
import math
data1=[]
for i in range(100):
data1=data1+[[0.5+0.5*random.random(),0.5+0.5*random.random()]]
data2=[]
for i in range(100):
data2=data2+[[0.5*random.random(),0.5*random.random()]]
#data3 =[]
#for i in range(30):
# data3=data3+[[0.8,random.random()]]
data=data1+data2#+data3
datax=[]
for i in range(len(data)):
datax=datax+[data[i][0]]
datay=[]
for i in range(len(data)):
datay=datay+[data[i][1]]
plt.scatter(datax,datay)
plt.show()
assign = []
for j in range(4):
assign=assign+[[random.random(),random.random()]]
sum=[]
for i in range(len(data)):
sum=sum+[[]]
r=[]
for i in range(4):
r=r+[[]]
for j in range(len(data)):
r[i]=r[i]+[[]]
update_assign=[]
for k in range(4):
update_assign=update_assign+[[]]
a=0
b=1/(7*np.var(data1))
while a<10:
a=a+1
sum=[]
for i in range(len(data)):
sum=sum+[[]]
for n in range(len(data)):
s=0
for k in range(4):
s=s+np.exp(-b*(LA.norm(np.array(assign[k])-np.array(data[n]))))
sum[n]=s
for j in range(4):
r[j][n]=np.exp(-b*(LA.norm(np.array(assign[j])-np.array(data[n]))))/s
#print a,j,n,r[j][n]
rsum=[]
for i in range(4):
rsum=rsum+[[]]
for k in range(4):
rs=0
for n in range(len(data)):
rs=rs+r[k][n]
rsum[k]=rs
ua=[0,0]
for n in range(len(data)):
ua[0]=ua[0]+r[k][n]*data[n][0]/float(rs)
ua[1]=ua[0]+r[k][n]*data[n][1]/float(rs)
update_assign[k]=ua
assign= update_assign
c=[[],[],[],[]]
for n in range(len(data)):
a=[]
for k in range(4):
a=a+[r[k][n]]
i=a.index(max(a))
c[i]=c[i]+[data[n]]
circle0= plt.Circle(assign[0],1/math.sqrt(b),color='r',fill=False)
circle1= plt.Circle(assign[1],1/math.sqrt(b),color='g',fill=False)
circle2= plt.Circle(assign[2],1/math.sqrt(b),color='y',fill=False)
circle3= plt.Circle(assign[3],1/math.sqrt(b),color='b',fill=False)
fig, ax=plt.subplots()
plt.xlim([-0.3,1.3])
plt.ylim([-0.3,1.3])
ax.add_artist(circle0)
ax.add_artist(circle1)
ax.add_artist(circle2)
ax.add_artist(circle3)
ax.scatter([x for x,y in c[0]],[y for x,y in c[0]],color='r')
ax.scatter([x for x,y in c[1]],[y for x,y in c[1]],color='g')
ax.scatter([x for x,y in c[2]],[y for x,y in c[2]],color='y')
ax.scatter([x for x,y in c[3]],[y for x,y in c[3]],color='b')
plt.show()
```

The radius of the circle equals to $\beta$. When the size of the circle is similar to the size of the cluster, the figure looks good. Note that the 4 circles are split to two group. Two circles for each group are almost overlapped. The colors determined by assignments are not looking good, but the circles show the this machine knows $K = 2$, not 4.

### Maximum likelihood for a mixture of Gaussian and soft K-means clustering

In 2d space, let us assume the probability distribution is *a mixture of two Gaussians*.

$$ P(x|\mu_1, \mu_2, \sigma) = \sum^2_k ~ p_k \frac{1}{\sqrt{2\pi}\sigma}{exp(-(x-\mu_k)^2\over 2\sigma^2} $$

Assume

A data set consists of N points $x_n$ which are assumed to be independent samples from this distribution.

The prior probability of the class label k is $p_1$ = $p_2$ =

^{1}⁄_{2}.Both set has the same standard deviation.

$\mu_i$, $\sigma$ are known.

By *Bayes’ theorem*,

$$ P(k_n = 1|x_n, \boldsymbol{\theta})= {P(x_n|k_n=1,\boldsymbol{\theta})P(k_n=1,\boldsymbol{\theta}) \over P(x_n,\boldsymbol{\theta})} $$

$$ = {P(x_n|k_n=1,\boldsymbol{\theta})P(k_n=1,\boldsymbol{\theta}) \over \sum_{k_n} P(x_n|k_n=1,\boldsymbol{\theta})P(k_n=1,\boldsymbol{\theta}) +P(x_n|k_n=2,\boldsymbol{\theta})P(k_n=2,\boldsymbol{\theta}) } $$

where $\boldsymbol{\theta}=(\mu_k, ~\sigma_k)$.

$$ P(k_n=1,\boldsymbol{\theta}) \equiv p_1,~~P(k_n=2,\boldsymbol{\theta}) \equiv p_2 $$

Then,

$$ P(k_n=1|x_n,\boldsymbol{\theta})= {p_1 \over p_1+p_2 \exp[-(w_1x_n+w0)]} $$

$$ P(k_n=2|x_n,\boldsymbol{\theta})= {p_2 \over p_2+p_1 \exp[-(w_1x_n+w0)]} $$

where $w_1=2(\mu_1-\mu_2)$, $w_0=-(\mu_1-\mu_2)(\mu_1+\mu_2)$.

$$ P(k_n=k|x_n,\boldsymbol{\theta}) \equiv p_{k|n} $$

By assumption, the prior probability $p_1$=$p_2$=^{1}⁄_{2} then,

$$ P(k_n=1|x_n,\boldsymbol{\theta})= {1 \over 1+ \exp[-(w_1x_n+w0)]} $$

$$ P(k_n=2|x_n,\boldsymbol{\theta})= {1 \over 1 + \exp[-(w_1x_n+w0)]} $$

$$ L\equiv \log \Pi_n P(x_n|{\mu_k},~\sigma) $$

then trivially,

$$ {\partial \over \partial{\mu_k}}L = \sum_n {p_{k|n} (x_n-\mu_k)\over \sigma^2} $$

$$ {\partial^2 \over \partial{\mu_k}^2}L = -\sum_n {p_{k|n} \over \sigma^2} $$

The new updated $\boldsymbol \mu’$ should maximize the likelihood. Then,

$$ {\partial \over \partial{\mu_k’}}L = \sum_n {p_{k|n} (x_n-\mu_k’)\over \sigma^2}=0 $$

$$ \sum_n p_{k|n} ~x_n - \sum_n p_{k|n} ~\mu’_k=\sum_n p_{k|n} ~x_n - \mu’_k\sum_n p_{k|n}=0 $$

Therefore,

$$ \mu’_k={\sum_n p_{k|n} ~x_n \over \sum_n p_{k|n} } $$

Note that this equation is exactly the same as the updated means from the responsibilities and data points in soft K-means clustering. $p_{k|n}$ is the responsibility $r^{(k)}_n$.

$$ {\partial \over \partial{\mu_k}}L ~/ { \partial^2 \over \partial{\mu_k}^2}L = {\sum_n p_{k|n} ~x_n - \mu_k\sum_n p_{k|n} \over -\sum_n p_{k|n}}= -\mu’_k+\mu_k $$

Thus,

$$
\mu’_k=\mu_k- {\partial \over \partial{\mu_k}}L ~/ {\partial^2 \over \partial{\mu_k}^2}L

$$

The algorithm we have derived for maximizing the likelihood is identical to the soft K-means algorithm.

## Enhanced soft K-means algorithm

*Enhanced soft K-means* algorithm is nothing but a generalization of the soft K-means. We are also able to obtain the algorithm by *maximizing likelihood* and *Bayes’ theorem*.

As I mentioned above, I chose the stiffness parameter by myself. Also mentioned that as the size of the circle is similar to the size of the cluster, the figure looks good. Note that the 4 circles are split to two group. Apparently, $\beta = 1/\sigma^2$. Let the machine to find the stiffness parameter based on the variance of the K-cluster.

$$ r^{(n)}_k = {\pi_k {1 \over \prod^I_i\sqrt{2\pi}\sigma^{(k)}_i}\exp(-d(\boldsymbol{m}^{(k)},\boldsymbol{x}^{(n)} )/\sigma_k^2) \over \sum_{k’}~\pi_k {1 \over \prod^I_i\sqrt{2\pi}\sigma^{(k’)}_i}\exp(-d(\boldsymbol{m}^{(k)},\boldsymbol{x}^{(n)} )/\sigma_{k’}^2)} $$

$$ \boldsymbol{m}^{(k)} = { \sum_n r^{(n)}_k \boldsymbol{x}^{(n)}\over R^{(k)}} $$

$$ R^{(k)} = \sum_n r^{(n)}_k $$

$$ \sigma_k^2 = {\sum_n r_k^{(n)}(\boldsymbol{x}^{(n)}-\boldsymbol{m}^{(k)})^2 \over I~R^{(k)}} $$

$$ \pi_k = {R^{(k)}\over \sum_k R^{(k)}} $$

$I$ is the dimension of $\boldsymbol{x}$. For our example, $I = 2$.

```
cutoff = 0.00000001
assign = []
for j in range(4):
assign=assign+[[random.random(),random.random()]]
sum=[]
for i in range(len(data)):
sum=sum+[[]]
r=[]
for i in range(4):
r=r+[[]]
for j in range(len(data)):
r[i]=r[i]+[[]]
update_assign=[]
for k in range(4):
update_assign=update_assign+[[]]
I = 2 # dimension
v=[] # variance or 1/beta
for k in range(4):
v=v+[np.var(data)]
p=[]
for k in range(4):
p=p+[1]
rsum=[[],[],[],[]]
shrink = False
while not shrink:
sum=[]
for n in range(len(data)):
sum=sum+[[]]
for n in range(len(data)):
s=0
for k in range(4):
s=s+p[k]/(np.sqrt(2*np.pi*v[k]))**I*np.exp(-1/v[k]*(LA.norm(np.array(assign[k])-np.array(data[n]))))
sum[n]=s
for k in range(4):
r[k][n]=p[k]/(np.sqrt(2*np.pi*v[k]))**I*np.exp(-1/v[k]*(LA.norm(np.array(assign[k])-np.array(data[n]))))/s
#print a,j,n,r[j][n]
for k in range(4):
rs=0
for n in range(len(data)):
rs=rs+r[k][n]
rsum[k]=rs
ua=[0,0]
for n in range(len(data)):
ua[0]=ua[0]+r[k][n]*data[n][0]/float(rsum[k])
ua[1]=ua[1]+r[k][n]*data[n][1]/float(rsum[k])
update_assign[k]=ua
assign= update_assign
v=[0,0,0,0]
for k in range(4):
for n in range(len(data)):
v[k]=v[k]+r[k][n]*LA.norm(np.array(data[n])-np.array(assign[k]))**2/float(I*rsum[k])
if v[k] < cutoff:
shrink = True
rsumsum=0
for k in range(4):
rsumsum=rsumsum+rsum[k]
p[k]=rsum[k]/float(rsumsum)
c=[[],[],[],[]]
for n in range(len(data)):
a=[]
for k in range(4):
a=a+[r[k][n]]
i=a.index(max(a))
c[i]=c[i]+[data[n]]
""
circle0= plt.Circle(assign[0],np.sqrt(v[0]),color='r',fill=False)
circle1= plt.Circle(assign[1],np.sqrt(v[1]),color='g',fill=False)
circle2= plt.Circle(assign[2],np.sqrt(v[2]),color='y',fill=False)
circle3= plt.Circle(assign[3],np.sqrt(v[3]),color='b',fill=False)
fig, ax=plt.subplots()
plt.xlim([-0.3,1.3])
plt.ylim([-0.3,1.3])
ax.add_artist(circle0)
ax.add_artist(circle1)
ax.add_artist(circle2)
ax.add_artist(circle3)
ax.scatter([x for x,y in c[0]],[y for x,y in c[0]],color='r')
ax.scatter([x for x,y in c[1]],[y for x,y in c[1]],color='g')
ax.scatter([x for x,y in c[2]],[y for x,y in c[2]],color='y')
ax.scatter([x for x,y in c[3]],[y for x,y in c[3]],color='b')
plt.show()
```

See the number of colors and circles. Still you can see one yellow point. If you print the size of the 4 circles, you will get two very small radii.

### Cutoff and regularization

Another important thing is that *while loop* checks if the any of $\sigma$ is too tiny, smaller than cutoff. Without it, the algorithm ends up with overfitting. This is a good example of the *regularization* which we studied at the perceptron post.

### More enhanced soft K-means algorithm (spectral clustering)

If the sample looks long in one direction, the circular clustering would not work any more. We can easily generalize the above algorithm. Change only two relations as follows.

$$ r^{(n)}_k = \frac{\pi_k {1 \over \prod^I_i\sqrt{2\pi}\sigma^{(k)}_i}\exp(-\sum^I_i(m^{(k)}_i -x^{(n)}_i))^2/ 2(\sigma^{(k)}_i)^2)}{\sum_{k’}~\pi_k {1 \over \prod^I_i\sqrt{2\pi}\sigma^{(k’)}_i}\exp(-\sum^I_i(m^{(k’)}_i -x^{(n)}_i))^2 /2(\sigma^{(k’)}_i)^2)} $$

$$ \sigma^{2(k)}_i = {\sum_n r_k^{(n)}(x_i^{(n)}-m_i^{(k)})^2 \over R^{(k)}} $$

Also replace few parts from the above Python code.

```
turns = 0
while turns < 100 :
T = 0.04
if turns > 500:
T = 1
sum=[]
for n in range(len(data)):
sum=sum+[[]]
for n in range(len(data)):
s=0
for k in range(4):
s=s+p[k]/(2*np.pi*np.sqrt(vx[k]*vy[k]))*np.exp(-(assignx[k]-datax[n])**2/(2*T*vx[k])-(assigny[k]-datay[n])**2/(2*T*vy[k]))
sum[n]=s
for k in range(4):
r[k][n]=p[k]/(2*np.pi*np.sqrt(vx[k]*vy[k]))*np.exp(-(assignx[k]-datax[n])**2/(2*T*vx[k])-(assigny[k]-datay[n])**2/(2*T*vy[k]))/s
for k in range(4):
rs=0
for n in range(len(data)):
rs=rs+r[k][n]
rsum[k]=rs
ua=[0,0]
for n in range(len(data)):
ua[0]=ua[0]+r[k][n]*datax[n]/float(rsum[k])
ua[1]=ua[1]+r[k][n]*datay[n]/float(rsum[k])
update_assignx[k]=ua[0]
update_assigny[k]=ua[1]
assignx= update_assignx
assigny= update_assigny
for k in range(4):
for n in range(len(data)):
vx[k]=vx[k]+r[k][n]*(datax[n]-assignx[k])**2/float(rsum[k])
vy[k]=vy[k]+r[k][n]*(datay[n]-assigny[k])**2/float(rsum[k])
rsumsum=0
for k in range(4):
rsumsum=rsumsum+rsum[k]
p[k]=rsum[k]/float(rsumsum)
turns += 1
c=[[],[],[],[]]
for n in range(len(data)):
a=[]
for k in range(4):
a=a+[r[k][n]]
i=a.index(max(a))
c[i]=c[i]+[data[n]]
ell =[]
colors = ['r','g','b','y']
for i in range(4):
if c[i] != []:
ell.append(Ellipse(xy=[assignx[i],assigny[i]],width=np.sqrt(vx[i]/4),height=np.sqrt(vy[i]/4),color=colors[i],fill=False))
fig=plt.figure(0)
ax=fig.add_subplot(111, aspect='equal')
plt.xlim([-0.3,1.3])
plt.ylim([-0.3,1.3])
for i in range(len(ell)):
ax.add_artist(ell[i])
for i in range(4):
ax.scatter([x for x,y in c[i]],[y for x,y in c[i]],color=colors[i])
plt.show()
```

## Simulated annealing

In the post about efficient Monte Carlo method, I briefly introduce *simulated annealing* (SA). At that time, the SA was to make a convergent Markov chain faster. The temperature parameter in exponent of probability density distribution can be used to slow down the process. In this long data points example, the algorithm miss to distinguish the group and just make one big cluster quickly. In other words, the objective function finds the local minimum and get stuck around it instead of finding the global minimum. Thus, I used very low temperature to slow down the process, and after finding the long cluster, speed up the process. This SA is useful to avoid overfitting. The other awesome method to find the global minimum among many local minima is a quantum annealing. Using the tunneling quantum effect and transverse field, we jump from minimum to minimum.