**Table of Content**

- What is K-Means Clustering
- How it Works
- Steps in K-Means Clustering
- Assignment of Clusters
- Understanding the Formula
- More Explanation of the K-Means Algorithm
- Example of K-Means Clustering

### 1. What is k-Means Clustering

It is a clustering method that tends to partition your data into partitions called clusters. Let’s say you have you have n data points, the k-means algorithms would assign each of the data point to the cluster with the nearest mean.

Note: the k in k-means represents the number of clusters, that is k clusters.

### 2. How it works

Suppose we have a set of observations *{x _{1}, x_{2},… x_{n}}* which consists in a set of N random variable

*x*(x is a D d-dimensional real vector). The goal is to partition the data set into some number K of clusters, where the value of K is known.

A cluster is a group of data points whose inter-point distances are minimal when compare with distance to points outside the cluster.

The first step is to find the

*m*for

_{k},*k = 1,…, K,*in which

*m*is the mean associated to the

_{k}*k*cluster.

_{th}We now assign each of the data points to clusters, such that the sum of squares of the distances of each data point to its closest mean m

_{k}is minimum.

### 3. Steps of the k-Means Algorithm

**Step 1:** For each unit, x, in the input space, place it in the cluster whose current centroid it is nearest to.**Step 2: **After all points have been assigned (x1,…xn), adjust the locations of the centroids of the k clusters**Step 3:** Reassign all the points to their nearest centroid**Repeat steps 2 and 3 until convergence**

Convergence occurs when the points not move between clusters and the centroids are stabilized

This algorithm would become clearer when we consider an example later in this article.

Let’s now look at a little more details on how the clusters are assigned

### 4. Assignment of Clusters

(based on “Pattern Recognition and Machine Learning” by Christopher M. Bishop)

For each data point x_{n}, we would use a corresponding set of binary variables we denoted as r_{nk} ={0,1} where k =1,…,K.

r_{nk} describes which of the K clusters the data point x_{n} is assigned to. If r_{nk} is assigned to k, then r_{nk} = 1 and r_{nj} = 0 for j not equal to k.

Note the the two superscript in indicates that there would be two loops:**Loop 1(1-N)**: Iterate through all the data points**Loop 2(1-K):** Iterate through all the cluster

Now, for each iteration, you need to calculate a value for J which is called the distortion measure and is given by the sum of squares function:

### 5. Lets try to understand this formula

First note that this __formula __is sometimes called distortion measure since it represent how much each data point say x_{k} is separated from the mean m_{k} of that cluster.

In other words, J represents the sum of squares of the distance of each data point to its assigned vector m_{k}.

- N is to total number of data points,
- K is the number of clusters
- x
_{n}is the vector of measurement n - m
_{k}is the mean for cluster k - r
_{nk}is an indicator variable that indicates whether to assign x_{n}to k

We need to determine the value of {r_{nk}} and mk that gives the least value of J.

This is achieved by the use of k-Means Algorithm

### 6. More Explanation of the k=Means Algorithm

This algorithm is an iterative procedure involving steps:

Input set of points x_{1},…x_{n}

Choose a value for K

Place m_{1},…, m_{k}(let’s call this centroid) are random locations within your data space

Repeat the following steps until convergence

for each point x

- find the nearest centroid, c
_{j}(compute the distance between xi, compute the distance between x_{i}and m_{j}for every centroid m_{j}, and pick the cluster with the minimum distance) - find the nearest centroid
- assign this point x
_{j}to the cluster of the nearest centroid

For each cluster j = 1,…, K, and recompute the centroid position

- Take all the data points that fall into this cluster
- new centroid c
_{j}= mean of all points x - assign to cluster j in the previous step

Stop when none of the cluster assignments change

### 7. Example of k-Means Clustering

The example below is based on

K = 2 and N = 14

Take some time to examine the progression through the steps and make sure you understand how it works