I read the paper k-means++: The Advantages of Careful Seeding and didn't quite understand the algorithm it provides, which is:
"Let D(x) denote the shortest distance from a data point x to the closest center we have already chosen.
1a. Choose an initial center c1 uniformly at random
1b. Choose the next center ci, selecting ci = x' ∈ X with probability (D(x')^2) / Sum_of(D(x)^2)
1c. Repeat Step 1b until we have chosen a total of k centers.
2-4. Proceed as with the standard k-means algorithm"
(It's better to look at the algorithm in the link above.)
Especially step 1b: what do they mean by "selecting ci = x' ∈ X with probability (D(x')^2) / Sum_of(D(x)^2)"? Do they mean selecting the element that has the largest proportion? And how can performing such a computation lead to selecting the optimal centroids?
The function D(x) is defined for all points x ∈ X.
In step 1b, we must choose some point to be a new center. We choose randomly among all points (those that are not already centers), but we do not give every point an equal chance: we assign each point its own probability before we choose, and these probabilities must add up to 1.
Consider D(x)^2. We can evaluate this at every point, and add up the values: Sum_of(D(x)^2).
Then we can assign each point x' a probability equal to D(x')^2 / Sum_of(D(x)^2). These probabilities add up to 1, and give a better chance to points far away from all existing centers.
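That weighted draw can be sketched in a few lines of Python. This is an illustrative implementation, not code from the paper; the function name `choose_next_center` and the tuple-based point representation are my own choices. It computes D(x)^2 for every point, then samples one point where each point's chance is its share of the total:

```python
import random

def choose_next_center(points, centers):
    """Pick one point with probability D(x')^2 / Sum_of(D(x)^2).

    points, centers: lists of coordinate tuples (hypothetical example data).
    """
    # D(x)^2: squared distance from x to its nearest already-chosen center
    d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
          for x in points]
    total = sum(d2)
    # Draw r uniformly in [0, total) and walk the cumulative weights.
    # Points that are already centers have D(x)^2 = 0, so they can never
    # be the point that pushes the running sum past r.
    r = random.uniform(0, total)
    acc = 0.0
    for x, w in zip(points, d2):
        acc += w
        if acc > r:
            return x
    return points[-1]  # guard against floating-point round-off
```

Note that no point is ever "selected because it has the largest proportion": a far-away point is merely more likely to be drawn, which is what makes the seeding randomized rather than greedy.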
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
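The whole seeding procedure fits in a short Python sketch. Again this is only illustrative: the name `kmeans_pp_init`, the fixed RNG seed, and the duplicate-point fallback are my own additions, not part of the paper.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding: choose k initial centers from `points`.

    points: list of coordinate tuples; seed fixes the RNG for reproducibility.
    """
    rng = random.Random(seed)
    # Step 1: first center uniformly at random from the data points
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Step 2: D(x)^2 for each point (distance to nearest chosen center)
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in points]
        total = sum(d2)
        if total == 0:
            # Fewer than k distinct points; fall back to repeating a point.
            centers.append(points[0])
            continue
        # Step 3: weighted draw, probability proportional to D(x)^2
        r = rng.uniform(0, total)
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc > r:
                centers.append(x)
                break
    # Steps 2-3 have been repeated until k centers exist (Step 4);
    # Step 5 is then ordinary k-means starting from these centers.
    return centers
```

After this returns, you would run the usual k-means loop (assign points to nearest center, recompute centers) starting from the returned list.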