I had to illustrate the k-means algorithm for my thesis, but I could not find any existing examples that were both simple and looked good on paper. The Python code below does just what I wanted.

# Adapted from http://hackmap.blogspot.com/2007/09/k-means-clustering-in-scipy.html
import numpy
import pylab
from scipy.cluster.vq import kmeans2
# generate 3 sets of normally distributed points around
# different means with different variances
pt1 = numpy.random.normal(1, 0.2, (100, 2))
pt2 = numpy.random.normal(2, 0.5, (300, 2))
pt3 = numpy.random.normal(3, 0.3, (100, 2))
# slightly move sets 2 and 3 (for a prettier output)
pt2[:, 0] += 1
pt3[:, 0] -= 0.5
xy = numpy.concatenate((pt1, pt2, pt3))
# kmeans for 3 clusters: res holds the centroids, idx the cluster index of each point
# (xy is already an (N, 2) array, so it can be passed in directly)
res, idx = kmeans2(xy, 3)
# one RGB color per cluster, looked up by cluster index
palette = numpy.array([[0.4, 1, 0.4], [1, 0.4, 0.4], [0.1, 0.8, 1]])
colors = palette[idx]
# plot colored points
pylab.scatter(xy[:, 0], xy[:, 1], c=colors)
# mark centroids as (X): a hollow circle with a cross on top
pylab.scatter(res[:, 0], res[:, 1], marker='o', s=500, linewidths=2, facecolors='none', edgecolors='black')
pylab.scatter(res[:, 0], res[:, 1], marker='x', s=500, linewidths=2)
# display the figure when run as a script
pylab.show()

The output looks like this (also available in vector format here):

The X’s mark cluster centers. Feel free to use any of these files for whatever purposes. An attribution would be nice, but is not required :-).

14 responses

Thanks for posting your k-means example. I was having some trouble and I couldn’t find any examples until I stumbled onto your implementation. Thank you!

July 4, 2011 10:34 pm

You’re welcome, Seth! Glad I could help.

July 6, 2011 7:09 pm

Maciej! Thank you very much, I have been looking for an example like this for a while. This helped a lot! :)
And by the way, the graphic rocks! matplotlib gave me a big headache in the past.

December 11, 2011 7:18 pm

You are most welcome :-)

December 14, 2011 11:00 pm

Thanks man!

March 14, 2012 8:21 am


Can I apply it to 1D data? I have 1D data, but this code does not work on it (obviously, because it looks for columns). What additions would you recommend to this script to make it work for 1D data too?

December 18, 2012 6:11 pm
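
For the 1D question above, here is a minimal sketch rather than a definitive answer. `kmeans2` wants one row per observation, so one option is to reshape the flat array into a single column before clustering. The sample data, the choice of 2 clusters, and `minit='points'` are all assumptions made for illustration; none of them come from the original script.

```python
import numpy
from scipy.cluster.vq import kmeans2

# hypothetical 1D data: two groups of values around 0 and 10
rng = numpy.random.default_rng(0)
values = numpy.concatenate((rng.normal(0, 0.5, 100), rng.normal(10, 0.5, 100)))

# reshape to (N, 1): one row per observation, a single feature column
column = values.reshape(-1, 1)

# minit='points' takes the initial centroids from the data itself
centroids, labels = kmeans2(column, 2, minit='points')

# centroids has shape (2, 1); labels holds a cluster index for every value
print(centroids.ravel(), labels[:5])
```

For plotting, the values could be drawn along one axis (for example against zeros) and colored by `labels`, in the same way the 2D script colors points by `idx`.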

Thanks for the post, Maciej. I have used kmeans to identify clusters (rings) in a matrix of sea surface height. The objective is to identify the rings and to determine their centroids. But kmeans, like kmeans2, requires the number of clusters to be sought as an input parameter. That is a problem because I usually do not know in advance how many rings will be present in the area. So, I was wondering how to get around this kmeans limitation. Do you have any ideas?

March 15, 2013 9:10 am
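
On the question of not knowing the number of clusters: k-means itself always needs a k, but one common heuristic is to run it for several values of k and look for the point where the distortion stops dropping sharply (the "elbow"). The sketch below uses `scipy.cluster.vq.kmeans`, which returns the centroids together with the mean distance of the points to their nearest centroid. The three-blob test data and the range of k values are assumptions for illustration only, and picking the elbow remains a judgment call rather than an automatic answer.

```python
import numpy
from scipy.cluster.vq import kmeans

# hypothetical data: three well-separated 2D blobs
rng = numpy.random.default_rng(0)
xy = numpy.concatenate([rng.normal(c, 0.2, (100, 2)) for c in (0, 3, 6)])

# try several candidate cluster counts and record the distortion of each
distortions = {}
for k in range(1, 7):
    # kmeans returns (centroids, distortion); distortion is the mean
    # Euclidean distance from each point to its nearest centroid
    centroids, distortion = kmeans(xy, k)
    distortions[k] = distortion
    print(k, round(float(distortion), 3))
```

With data like this, the distortion would be expected to fall steeply up to the true number of blobs and flatten afterwards; on real data such as the ring example the curve is usually less clear-cut.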

Hello, thanks for the info.
Is it possible to use scipy k-means in a capacitated k-means?
If so, how?

July 29, 2013 6:52 pm

Hi Maciej,

Thank you so much for your post! It was extremely useful.
However, I might need your help. I’m working on a raw dataset of crimes in the city of Chicago and I’m trying to cluster them according to the type of crime committed using k-means. However, I’m struggling to define the clusters and, above all, to write code for them. Any chance I might get your help?

Thank you

February 24, 2016 10:44 am

Many thanks!!

September 29, 2016 2:24 pm
