(Edureka) (Code Lab) (ML Spec)

k-means clustering is the quintessential clustering algorithm: it automatically groups unlabeled data into k clusters

Usage

(Google) Typical applications include market segmentation, social network analysis, grouping search results, image segmentation, and anomaly detection

Algorithm

(Intuition) (Algorithm)

  1. Choose k, the number of clusters wanted
  2. Randomly initialize k cluster centroids (each is the center point of a cluster)
  3. Repeat these two steps, which together minimize the cost, until there are few to no changes in EITHER the assignment of points to centroids OR the locations of the cluster centroids
    1. Assign each point to the closest centroid, measured by Euclidean (L2) distance
    2. Recompute the centroid of each newly formed cluster as the average coordinates of all points in that cluster

Choose k

(ML Spec) Choose k based on how the clusters will be used downstream; a common heuristic is the elbow method: plot the distortion J against k and look for a bend where the curve flattens (often there is no clear elbow, so the downstream purpose should drive the choice)
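
A minimal sketch of the elbow method, assuming a dataset X and the run_kmeans / compute_distortion helpers sketched later in these notes:

import matplotlib.pyplot as plt

# run k-means for a range of k values and plot the final distortion for each
ks = range(1, 11)
costs = []
for k in ks:
    centroids, idx = run_kmeans(X, k)                   # sketched in the Code section
    costs.append(compute_distortion(X, idx, centroids))

plt.plot(list(ks), costs, marker="o")  # look for the "elbow" where the curve flattens
plt.xlabel("k (number of clusters)")
plt.ylabel("distortion J")
plt.show()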

Initialize

(ML Spec) Repeat this random initialization multiple times (50 to 1000 runs is typical) and pick the run with the lowest cost/distortion, as shown in the sketch after this list:

  1. Randomly pick K training examples
  2. Set the centroids μ_1, …, μ_K equal to these K examples
  3. Run k-means to convergence and record the final value of the cost function
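
A minimal sketch of this restart loop, assuming a dataset X, a chosen K, and the run_kmeans / compute_distortion helpers sketched later in these notes:

import numpy as np

best_cost = np.inf
best_centroids, best_idx = None, None

for _ in range(100):                                  # 50 to 1000 restarts is typical
    centroids, idx = run_kmeans(X, K)                 # one full k-means run
    cost = compute_distortion(X, idx, centroids)      # distortion J of this run
    if cost < best_cost:                              # keep the best clustering seen so far
        best_cost = cost
        best_centroids, best_idx = centroids, idx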

Minimize cost function

The cost function to be minimized for k-means is called “distortion”:

J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1..m} ‖x^(i) − μ_{c^(i)}‖²

The cost will either decrease or stay the same after each iteration

Notation

  • c^(i) = index of the cluster (1, 2, …, K) to which example x^(i) is currently assigned
  • μ_k = cluster centroid k
  • μ_{c^(i)} = cluster centroid of the cluster to which example x^(i) has been assigned
  • m = number of training examples
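
A minimal sketch of the distortion computation (not part of the lab code below), assuming X, idx, and centroids shaped as in the Code section:

import numpy as np

def compute_distortion(X, idx, centroids):
    """Distortion J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2"""
    # centroids[idx] pairs each example with its assigned centroid, shape (m, n)
    return np.mean(np.sum((X - centroids[idx]) ** 2, axis=1))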

Assumptions

(Raykov et al., 2016)

  • Clusters are spherical, with data points being modeled as lying within a sphere around their cluster centroid.
  • Data has a clustering tendency, and clusters are well-separated.
  • The number k of groupings is fixed and assumed known in advance
  • Because k-means clusters data points purely by their (Euclidean) geometric closeness to the cluster centroid, it implicitly assumes
    • All clusters have the same radius
    • There are few or no outliers; points unusually far from their cluster centroid compared to the rest of that cluster will dramatically impair the results of k-means
    • Each cluster contains roughly the same number of data points, although clusters might have different densities

Code

(Code Lab)

import numpy as np
 
def init_centroids(X, K):
    """
    This function initializes K centroids that are to be 
    used in K-Means on the dataset X
    
    Args:
        X (ndarray): Data points 
        K (int):     number of centroids/clusters
    
    Returns:
        centroids (ndarray): Initialized centroids
    """
    
    # Randomly reorder the indices of examples
    randidx = np.random.permutation(X.shape[0])
    
    # Take the first K examples as centroids
    centroids = X[randidx[:K]]
    
    return centroids
    
def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example
    
    Args:
        X (ndarray): (m, n) Input values      
        centroids (ndarray): k centroids
    
    Returns:
        idx (ndarray): (m,) index of the closest centroid for each example
    
    """
    K = centroids.shape[0]
    m = X.shape[0]
    idx = np.zeros(m, dtype=int)
    
    for i in range(m):
        distances = []  # distances from point i to each of the K centroids
        
        for j in range(K):
            
            # Euclidean (L2) distance; np.linalg.norm defaults to the 2-norm for vectors
            norm = np.linalg.norm(X[i] - centroids[j])
            distances.append(norm)
        
        # get the closest centroid for the current data point
        idx[i] = np.argmin(distances) 
    return idx
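
The double loop above is readable but slow for large m; as a sketch (not part of the lab), the same assignment can be computed with NumPy broadcasting:

def find_closest_centroids_vec(X, centroids):
    # (m, 1, n) - (1, K, n) broadcasts to (m, K, n); taking the norm over the
    # feature axis yields an (m, K) matrix of point-to-centroid distances
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)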
 
def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the 
    data points assigned to each centroid.
    
    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each 
                       example in X. Concretely, idx[i] contains the index of 
                       the centroid closest to example i
        K (int):       number of centroids
    
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    
    m, n = X.shape
    centroids = np.zeros((K, n))
  
    for k in range(K):
        
        # get all points assigned to the centroid k
        points = X[idx == k]
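        # note: assumes cluster k keeps at least one point; np.mean of an
        # empty slice returns NaN (one common fix is to re-initialize that centroid)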
        
        # recompute the centroid as the mean coordinates of all data points in that cluster
        centroids[k] = np.mean(points, axis=0)  # column-wise mean: average coordinate per feature
    
    return centroids
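
The lab ties these helpers together in a driver loop; a minimal sketch (the name run_kmeans and the fixed max_iters stopping rule are assumptions, since the full lab code is not reproduced here):

def run_kmeans(X, K, max_iters=10):
    """Run K-Means for max_iters iterations and return the final clustering"""
    centroids = init_centroids(X, K)
    idx = np.zeros(X.shape[0], dtype=int)

    for _ in range(max_iters):
        idx = find_closest_centroids(X, centroids)    # assignment step
        centroids = compute_centroids(X, idx, K)      # update step

    return centroids, idx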