As I mentioned at the previous posting, one of the purposes of this blog is to supplement the Github of my data science study. I will gradually post and present all the iPython notebooks or Mathematica notebooks.

I felt there’s no good Python tutorial for spectral clustering (at least from my search). Who can’t use scikit-learn among who is serious about machine learning. It was not difficult to find the theory of spectral clustering as well. However, it was hard to find a Python tutorial for the algorithm of spectral clustering.

The theory for clustering and soft k-means can be found in the book Information Theory, Inference, and Learning Algorithms of David Mackay. In particular, I have read chapter 20 ~ 22 and used the algorithms to make Python codes in various clusterings.

Cluster

Today, the topic of our discussion is K-means clustering, and it is well-known clustering you’ll see at first if you start to learn clustering.

First, a good clustering has predictive power. Second, clusters can be a useful aid to communication because they allow lossy compression. A third reason for making a cluster model is that failures of the cluster model may highlight interesting objects that deserve special attention. A fourth reason for liking clustering algorithms is that they may serve as models of learning processes in neural systems.

What I want to show you is when it does not work well. And then I will also post how we avoid the troubles.

K-means

The K-means algorithm is an algorithm for putting N data points in an M-dimensional space into K clusters. Each cluster is parameterized by a vector ${\textbf m}^{(k)}$ called its mean.

First of all, set K means ${\textbf m}^{(k)}$ to random values. In the assignment step, each data point n is assigned to the nearest mean.

$$\hat {k}~ ^ {(n)} = argmin_k d({\textbf m}^{(k)}, {\textbf x}^{(n)})$$

We will make n-groups whose elements have the smallest distances.

#!/usr/local/Cellar/python/2.7.6/bin/python
# -*- coding: utf-8 -*-

from numpy import *
from matplotlib import pyplot

import random
import numpy as np
import matplotlib.pyplot as plt

N=100

data1=[[0.5+0.5*random.random(),0.5+0.5*random.random()] for i in range(N)]
data2=[[0.5*random.random(),0.5*random.random()] for i in range(N)]

data1=np.array(data1)
data2=np.array(data2)

data = np.concatenate((data1,data2))

X=c_[data[:,0],data[:,1]]

def displaydata():
pyplot.scatter(X[:,0],X[:,1])
pyplot.show()

def findClosestCentroids( X, centroids ):
K = shape( centroids )[0]
m   = shape( X )[0]
idx = zeros( (m, 1) )

for i in range(0, m):
lowest = 999
lowest_index = 0

for k in range( 0, K ):
cost = X[i] - centroids[k]
cost = cost.T.dot( cost )
if cost < lowest:
lowest_index = k
lowest  = cost

idx[i] = lowest_index
return idx + 1 # add 1, since python's index starts at 0

def computeCentroidsLoop(X, idx, K):
m, n = shape( X )
centroids = zeros((K, n))

for k in range(1, K+1):
counter = 0
cum_sum = 0
for i in range( 0, m ):
if idx[i] == k:
cum_sum += X[i]
counter += 1
centroids[k-1] = cum_sum / counter
return centroids

def computeCentroids( X, idx, K ):
m, n = shape( X )
centroids = zeros((K, n))

data = c_[X, idx] # append the cluster index to the X

for k in range( 1, K+1 ):
temp = data[data[:, n] == k] # quickly extract X that falls into the cluster
count = shape( temp )[0]		# count number of entries for that cluster

for j in range( 0, n ):
centroids[k-1, j] = sum(temp[:, j]) / count

return centroids

def runkMeans( X, initial_centroids, max_iters, plot=False ):
K = shape( initial_centroids )[0]
centroids = copy( initial_centroids )
idx = None

for iteration in range( 0, max_iters ):
idx = findClosestCentroids( X, centroids )
centroids = computeCentroids( X, idx, K )

if plot is True:
data = c_[X, idx]

# Extract data that falls in to cluster 1, 2, and 3 respectively, and plot them out
data_1 = data[data[:, 2] == 1]
pyplot.plot( data_1[:, 0], data_1[:, 1], 'ro', markersize=5 )

data_2 = data[data[:, 2] == 2]
pyplot.plot( data_2[:, 0], data_2[:, 1], 'go', markersize=5 )

data_3 = data[data[:, 2] == 3]
pyplot.plot( data_3[:, 0], data_3[:, 1], 'bo', markersize=5 )

data_4 = data[data[:, 2] == 4]
pyplot.plot( data_4[:, 0], data_4[:, 1], 'yo', markersize=5 )

pyplot.plot( centroids[:, 0], centroids[:, 1], 'k*', markersize=14 )

pyplot.xlim([-0.1,1.1])
pyplot.ylim([-0.1,1.1])

pyplot.show( block=True )

return centroids, idx

def kMeansInitCentroids( X, K ):
return np.random.permutation( X )[:K]

def findcentroids(X, centroids):

K = 4

initial_centroids = centroids

idx = findClosestCentroids( X, initial_centroids )
print idx[0:K]

centroids = computeCentroids( X, idx, K )
print centroids

def Kmeans(X,centroids):
K = 4

max_iters = 10

runkMeans( X, centroids, max_iters, plot=True )


Here are the data points. I chose random 50 data from the two boxes with 0.5 lengthy edges. We know the proper K-number for K-means clustering is 2. However, when we set K > 2, the K-means clustering gives a ugly result. Besides, the clustering strongly depends on the initial choice of the 4 assignment points.

set_printoptions(precision=6, linewidth=200)
displaydata()


Choose the K-means for K = 4. The proper K should be 2 for the data points. However, what happens if K = 4? From the below plots, the program is not working well.

For this given example, $K = 2$ is obvious, but for the most realistic problems, we can not always set the correct $K$ number. Thus, our goal is make the machine itself find the $K$.

centroids = array([[random.random(), random.random()], [random.random(), random.random()], [random.random(), random.random()],[random.random(), random.random()]])
findcentroids(X,centroids)

[[ 4.]
[ 4.]
[ 2.]
[ 4.]]
[[ 0.199317  0.208257]
[ 0.79478   0.780224]
[ 0.53369   0.865798]
[ 0.54678   0.551297]]

Kmeans(X,centroids)


The K-means(cetroids) converge to the wrong points in the above figures.

Appendix : to convert iPython notebooks to markdown

$ipython nbconvert Clustering.ipynb --to markdown  or $ jupiter nbconvert Clustering.ipynb --to markdown


This make a simple Clustering.md file and a folder, Clustering_files. The folder includes the figures in the notebooks. You might have to change the name of the folders in case that you store the image files and notebooks in the specific folder rather than _post folder.