Randomly initialize $K$ cluster centroids $\mu_1, \dots, \mu_K$ (each is the center point of a cluster)
Repeat these steps until there are few to no changes in EITHER the point assignments to centroids OR the locations of the cluster centroids. Do this by minimizing the distortion cost $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$. A driver-loop sketch follows the two steps below.
Assign each point to the closest centroid: $c^{(i)} := \arg\min_k \lVert x^{(i)} - \mu_k \rVert^2$. The distance is the Euclidean (L2) norm, which is what `np.linalg.norm` computes by default for vectors (NumPy refers to this default as the Frobenius norm in the matrix case).
Recompute the centroids of the newly formed clusters: each centroid becomes the average coordinates of all points in its cluster, $\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}$, where $C_k$ is the set of examples currently assigned to cluster $k$
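The two steps above alternate inside an outer loop. Here is a minimal driver sketch, assuming the helper functions `init_centroids`, `find_closest_centroids`, and `compute_centroids` defined at the end of this section; the name `run_kmeans` and the `max_iters` cap are mine, for illustration:

```python
import numpy as np

def run_kmeans(X, K, max_iters=10):
    """Sketch of the outer K-means loop: alternate the assignment and
    update steps until the assignments stop changing (or max_iters hits)."""
    centroids = init_centroids(X, K)  # random initialization (defined below)
    idx = None
    for _ in range(max_iters):
        # step 1: assign every point to its closest centroid
        new_idx = find_closest_centroids(X, centroids)
        if idx is not None and np.array_equal(new_idx, idx):
            break  # no assignment changed, so the centroids won't move either
        idx = new_idx
        # step 2: move each centroid to the mean of its assigned points
        centroids = compute_centroids(X, idx, K)
    return centroids, idx
```

Stopping when the assignments stop changing is one convenient convergence test; monitoring the cost $J$ (sketched later) is another.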
What if a cluster has 0 points?
DELETE that cluster, and the number of clusters is then $K - 1$ (a sketch of this, plus a common alternative, follows)
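A minimal sketch of how this could look in code, assuming the `centroids` and `idx` arrays from the implementation at the end of this section; the function name `handle_empty_clusters` is mine, and the reinitialize-to-a-random-point branch is a common alternative not stated above:

```python
import numpy as np

def handle_empty_clusters(X, centroids, idx, drop=True):
    """Sketch: deal with clusters that received zero points."""
    K = centroids.shape[0]
    counts = np.bincount(idx, minlength=K)  # how many points each cluster got
    empty = np.where(counts == 0)[0]
    if drop:
        # option 1 (as above): delete the empty clusters; K becomes K - len(empty)
        return np.delete(centroids, empty, axis=0)
    # option 2 (common alternative): reinitialize each empty centroid
    # to a randomly chosen data point, keeping K fixed
    for k in empty:
        centroids[k] = X[np.random.randint(X.shape[0])]
    return centroids
```

Note that after deleting rows the surviving clusters are re-indexed, so `find_closest_centroids` should be rerun before the next update step.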
What if clusters are NOT well-separated?
If the visualization reveals that the data is NOT separated AT ALL, clustering might be inappropriate. Double-check against the assumptions listed below.
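A quick way to run that visual check, assuming two-dimensional data and matplotlib (for higher-dimensional data you would project first, e.g. with PCA); the function name is mine:

```python
import matplotlib.pyplot as plt

def eyeball_separation(X):
    """Sketch: scatter the first two features to check for visible groupings."""
    plt.scatter(X[:, 0], X[:, 1], s=10)
    plt.xlabel("feature 0")
    plt.ylabel("feature 1")
    plt.title("Any visible grouping structure?")
    plt.show()
```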
The cost $J$ will either decrease or stay the same after each iteration; it can never increase, because the assignment step and the update step each minimize $J$ with respect to their own variables (the `compute_cost` sketch below makes this easy to verify numerically).
Why $m$ instead of $K$?
Although $K$ is the number of clusters, $c^{(i)}$ is the variable used for cluster assignment. Because this process works for every training example/data point, the number of cluster-assignment variables in $J$ should be equal to the number of training examples $m$: one of $c^{(1)}, \dots, c^{(m)}$ per example.
See the first loop in `find_closest_centroids` below: it iterates over all $m$ examples, and `idx[i]` is the code's name for $c^{(i)}$.
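The distortion $J$ itself is straightforward to compute; a minimal sketch (the name `compute_cost` is mine) that can double as a convergence check:

```python
import numpy as np

def compute_cost(X, idx, centroids):
    """Distortion J: mean squared distance from each point to its assigned centroid."""
    m = X.shape[0]
    assigned = centroids[idx]  # row i holds mu_{c^(i)}, via fancy indexing
    return np.sum((X - assigned) ** 2) / m
```

Printing `compute_cost` after every iteration of the driver loop should show a non-increasing sequence.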
Notation
$c^{(i)}$ = index of the cluster ($1, 2, \dots, K$) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid number $k$
$\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
Clusters are spherical, with data points modeled as lying within a sphere around their cluster centroid.
Data has a clustering tendency, and clusters are well-separated.
The number of groupings $K$ is fixed and assumed known in advance
Because k-means clusters data points purely by their (Euclidean) geometric closeness to the cluster centroid, it implicitly assumes the following (see the sketch after this list):
All clusters have roughly the same radius
There are few or no outliers, i.e., points unusually far away from their cluster centroid by comparison to the rest of the points in that cluster. Their presence will dramatically impair the results of k-means, since the mean is highly sensitive to extreme values
Each cluster contains roughly the same number of data points, although clusters might have different densities
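To see the equal-radius assumption bite, here is a small synthetic experiment (the data and numbers are mine, for illustration; it reuses the `run_kmeans` driver sketched earlier): a wide cluster next to a tight one.

```python
import numpy as np

rng = np.random.default_rng(0)
# a wide cluster next to a tight one violates the equal-radius assumption
wide  = rng.normal(loc=[0.0, 0.0], scale=2.0, size=(200, 2))
tight = rng.normal(loc=[4.0, 0.0], scale=0.3, size=(50, 2))
X = np.vstack([wide, tight])

centroids, idx = run_kmeans(X, K=2, max_iters=20)
print("points per cluster:", np.bincount(idx, minlength=2))  # true split is 200/50
```

Because the decision boundary between two centroids is their perpendicular bisector, the wide cluster's near tail is typically absorbed by the tight cluster, inflating its count well beyond the true 50 points.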
```python
import numpy as np


def init_centroids(X, K):
    """
    Initializes K centroids that are to be used in K-means on the dataset X

    Args:
        X (ndarray): (m, n) Data points
        K (int):     number of centroids/clusters
    Returns:
        centroids (ndarray): (K, n) Initialized centroids
    """
    # Randomly reorder the indices of examples
    randidx = np.random.permutation(X.shape[0])
    # Take the first K examples as centroids
    centroids = X[randidx[:K]]
    return centroids


def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example

    Args:
        X (ndarray):         (m, n) Input values
        centroids (ndarray): (K, n) centroids
    Returns:
        idx (ndarray): (m,) index of the closest centroid for each example
    """
    K = centroids.shape[0]
    m = X.shape[0]
    idx = np.zeros(m, dtype=int)
    for i in range(m):
        distances = []  # distances from point i to each of the K centroids
        for j in range(K):
            # Euclidean distance (np.linalg.norm's default norm)
            norm = np.linalg.norm(X[i] - centroids[j])
            distances.append(norm)
        # assign point i to the closest centroid
        idx[i] = np.argmin(distances)
    return idx


def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the
    data points assigned to each centroid.

    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing the index of the closest centroid
                       for each example in X. Concretely, idx[i] contains the
                       index of the centroid closest to example i
        K (int):       number of centroids
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    m, n = X.shape
    centroids = np.zeros((K, n))
    for k in range(K):
        # get all points assigned to centroid k
        points = X[idx == k]
        if len(points) == 0:
            # empty cluster: np.mean of an empty slice would produce NaN;
            # see the empty-cluster note above for how to handle this case
            continue
        # recompute the centroid as the mean coordinates of all
        # data points in that cluster (column-wise mean)
        centroids[k] = np.mean(points, axis=0)
    return centroids
```
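A small end-to-end usage example on synthetic blobs (the data here is made up for illustration), combining these helpers with the `run_kmeans` driver sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(42)
# three well-separated synthetic blobs
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(50, 2)),
])

centroids, idx = run_kmeans(X, K=3, max_iters=10)
print("final centroids:\n", centroids)
print("points per cluster:", np.bincount(idx, minlength=3))
```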