Tag Archives: python

Plot covariance estimates in a GaussianMixture cluster


Moving covariance matrix functions from GMM to GaussianMixture in sklearn

So in sklearn 0.18 the GMM model is deprecated and looks to be removed in 0.20. The replacement is GaussianMixture, available from sklearn.mixture. I have some code that currently uses GMM and needed to port it to GaussianMixture. One of the major convenience features of GMM is its set of accessors for the mean, covariances, etc. The covariance matrix itself is critical for many of the clustering applications that motivate the use of GMM in the first place. This post shows how to port that covariance accessor to the promoted GaussianMixture implementation.

Covariance Matrix

Remember that the covariance matrix is a matrix holding the covariances between the elements of two random vectors, X and Y:

Cov[X,Y]=E[(X − E[X])(Y − E[Y])]=E[XY] − E[X]E[Y]

An alternative form, perhaps more expressive when considering the matrix forms that we are dealing with here and using Σ as the covariance matrix, and μ is the mean of any random vector X:

Σ = E[(X − μ)(X − μ)ᵀ] = E[XXᵀ] − μμᵀ
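The identity can be checked numerically; below is a quick NumPy sketch (the data and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # 1000 samples of a 3-dimensional random vector

mu = X.mean(axis=0)                              # estimate of E[X]
centered = X - mu
sigma = centered.T @ centered / len(X)           # E[(X - mu)(X - mu)^T]
sigma_alt = X.T @ X / len(X) - np.outer(mu, mu)  # E[X X^T] - mu mu^T

assert np.allclose(sigma, sigma_alt)             # both forms agree
```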


Working with a GMM clustering over the iris dataset, there is a dependency on the make_ellipses method published by Ron Weiss (ronweiss@gmail.com).
The most non-trivial change required to move from GMM to GaussianMixture is:

v, w = np.linalg.eigh(gmm._get_covars()[n][row_idx[:, None], col_idx])

which needs to shift to

v, w = np.linalg.eigh(gmm.covariances_[n][row_idx[:, None], col_idx])

in order to work with the newer model package.
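To see the new accessor in isolation, here is a small sketch that fits a GaussianMixture on the iris data and pulls out the same 2D sub-matrix that the ellipse-plotting code needs (the component and feature indices are illustrative):

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

iris = datasets.load_iris()
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(iris.data)

# covariances_ replaces the old GMM._get_covars(): one 4x4 matrix per component
print(gmm.covariances_.shape)  # (3, 4, 4)

# the 2D sub-matrix for features 0 and 1, via the same fancy indexing
idx = np.array([0, 1])
sub = gmm.covariances_[0][idx[:, None], idx]
print(sub.shape)  # (2, 2)
```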
The full listing of the function as I have ported it is then:

import numpy as np
import matplotlib as mpl
import matplotlib.patches  # makes mpl.patches available


def make_ellipses(gmm, ax, x, y):
    """Extract a covariance matrix in 2D from a higher-dimensional feature space.

    Calculates an ellipse along the maximal variance of each component in 2D,
    both the direction and the respective magnitude, i.e., the eigenvectors
    and eigenvalues of the covariance matrix, and writes the resulting ellipse
    onto an existing pyplot plot.

    :param gmm: fitted sklearn GaussianMixture object
    :param ax: plot axis - the 2D subset of the full feature space
    :param x: the first dimension of the 2D plot axis
    :param y: the second dimension of the 2D plot axis
    """
    for n, color in enumerate('rgb'):
        row_idx = np.array([x, y])
        col_idx = np.array([x, y])
        # GMM exposed _get_covars(); GaussianMixture uses the covariances_ attribute
        # v, w = np.linalg.eigh(gmm._get_covars()[n][row_idx[:, None], col_idx])
        v, w = np.linalg.eigh(gmm.covariances_[n][row_idx[:, None], col_idx])
        u = w[0] / np.linalg.norm(w[0])
        angle = np.arctan2(u[1], u[0])
        angle = 180 * angle / np.pi  # convert rads to degrees
        v *= 9
        ell = mpl.patches.Ellipse(gmm.means_[n, [x, y]], v[0], v[1],
                                  180 + angle, color=color)
        ell.set_clip_box(ax.bbox)
        ell.set_alpha(0.5)
        ax.add_artist(ell)

For an excellent discussion of using ellipses to plot the covariances of your GaussianMixture, see GMM covariances.

Iterables vs. Generators in Python


I’ve had some people asking me lately what the difference is between a Python iterable and a Python generator, and when to use each. This is a short write-up showing some of the differences and the benefits of each.


Simply put, an iterable in Python is any object that you can use with the in operator, e.g.:

    A = [1,2,3,4,5]
    for a in A:
        print(a)

will output:

    1
    2
    3
    4
    5

An iterable is an object that holds other items, whether primitives or more complex objects. Lists, tuples, sets, and dicts are all examples of iterables. So are things like strings and files. We use them all the time, quite properly, in our code.

And when we use a comprehension, (hint, hint!) we are also creating an iterable, so

    B = [x for x in range(1,10000)]

creates a list of 9,999 values, from 1 through 9999.

So that’s great. But perhaps we should be concerned about things like efficiency, speed, and memory when building larger applications. Let’s examine a couple of data points. Here I’m using sys.getsizeof to get a fairly decent idea of memory usage.
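A sketch of that measurement (I print rather than hard-code the byte counts, since sys.getsizeof results vary across Python versions and platforms):

```python
import sys

A = [1, 2, 3, 4, 5]
B = [x for x in range(1, 10000)]

# getsizeof reports the shallow size of the list object itself
print(sys.getsizeof(A))  # a handful of machine words
print(sys.getsizeof(B))  # tens of kilobytes of pointer storage
```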


Not surprisingly, B > A. Reasonable, given that B has 9,999 items while A has 5. But think about your application and how long A or B is going to hang around. Are you using them once or many times? Are you losing available memory to a one-time operation, or do you need to access that iterable again and again? If you only need it once, it is costly to build the iterable and then have it hang around.


Welcome the generator, which stores not the values but the method used to compute them. Let’s create a generator that gives the same result as B above.

    C = (x for x in range(1,10000))

The difference in the definition is the use of () rather than the list comprehension’s []. What did we get? Let’s look at the differences.

    <class 'list'>
    <class 'list'>
    <class 'generator'>

But try using C in your print loop and you will get the same result! Size differences?


Now that’s interesting! Not only is C roughly 0.1% of the size of B while giving the same result when used in computation, C is even smaller than A, which holds only 5 values compared to the 9,999 that C can produce!

I got the output I wanted from C with a similar iteration, namely:

    for c in C:
        print(c)

and, sure enough, I got 9,999 values of output. But when I ran it a second time, I got no output! That’s because the generator was exhausted when I iterated over it: a generator can only be consumed once. There are many times when I only need to build an iterable and run over it once in the flow, and in those cases a generator is fine, and I gain considerable efficiency.
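The one-shot behavior is easy to demonstrate: iterate the same generator twice and count what comes out.

```python
C = (x for x in range(1, 10000))

first_pass = sum(1 for _ in C)   # consumes the generator
second_pass = sum(1 for _ in C)  # nothing left to yield

print(first_pass)   # 9999
print(second_pass)  # 0
```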

A more complex example

And these are merely simple iterables. Imagine what happens with more complex structures, e.g.

    letters = ['a', 'b', 'c', 'd', 'e']
    colors = ['red', 'yellow', 'blue', 'green']
    squares = [x*x for x in range(1,10)]

    newlist = [(l, c, s) for l in letters for c in colors for s in squares]
    newgen = ((l, c, s) for l in letters for c in colors for s in squares)

Both newgen and newlist produce objects that can yield the same 180-element group of tuples (5 letters × 4 colors × 9 squares), e.g.:

('a', 'red', 1)
('a', 'red', 4)
('a', 'red', 9)
('a', 'red', 16)
('a', 'red', 25)
('a', 'red', 36)
('a', 'red', 49)
('a', 'red', 64)
('a', 'red', 81)
('a', 'yellow', 1)
...
('e', 'green', 64)
('e', 'green', 81)

Yet look at the type and size differences:

    <class 'list'> 1680
    <class 'generator'> 80

newlist is 21 times larger than newgen. So, as you write iterable objects, especially iterables that feed another computation, consider generators!
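As a quick check that nothing is lost, consuming newgen once reproduces newlist exactly (a sketch):

```python
letters = ['a', 'b', 'c', 'd', 'e']
colors = ['red', 'yellow', 'blue', 'green']
squares = [x*x for x in range(1, 10)]

newlist = [(l, c, s) for l in letters for c in colors for s in squares]
newgen = ((l, c, s) for l in letters for c in colors for s in squares)

consumed = list(newgen)     # drain the generator once
print(len(newlist))         # 180
print(consumed == newlist)  # True -- same tuples, in the same order
```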

async or threaded file downloads in python 3.x

Capturing here some nice examples of using asyncio and threads to manage multiple file downloads in Python 3.x. Go here for a more in-depth discussion.

You could use a thread pool to download files in parallel:

#!/usr/bin/env python3
from multiprocessing.dummy import Pool  # use threads for I/O-bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls)  # download 4 files at a time
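The stdlib concurrent.futures module (Python 3.2+) offers the same map-over-a-thread-pool pattern without reaching for multiprocessing.dummy; a sketch, with the URL list left as a placeholder:

```python
#!/usr/bin/env python3
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

urls = []  # fill in the URLs to fetch

with ThreadPoolExecutor(max_workers=4) as pool:  # 4 downloads at a time
    results = list(pool.map(urlretrieve, urls))
```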

You could also download several files at once in a single thread using asyncio:

#!/usr/bin/env python3
import asyncio
import logging
import os.path
from contextlib import closing
from urllib.parse import urlsplit

import aiohttp  # $ pip install aiohttp


def url2filename(url):
    """Return the basename of the URL's path, e.g. 'http://x/a/b.txt' -> 'b.txt'."""
    return os.path.basename(urlsplit(url).path)


@asyncio.coroutine
def download(url, session, semaphore, chunk_size=1 << 15):
    with (yield from semaphore):  # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        response = yield from session.get(url)
        with closing(response), open(filename, 'wb') as file:
            while True:  # save file
                chunk = yield from response.content.read(chunk_size)
                if not chunk:
                    break
                file.write(chunk)
        logging.info('done %s', filename)
    return filename, (response.status, tuple(response.headers.items()))


urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
with closing(asyncio.get_event_loop()) as loop, \
     closing(aiohttp.ClientSession()) as session:
    semaphore = asyncio.Semaphore(4)
    download_tasks = (download(url, session, semaphore) for url in urls)
    result = loop.run_until_complete(asyncio.gather(*download_tasks))
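On Python 3.5+ the same semaphore-limited fan-out reads more naturally with async def/await; here is a network-free sketch of the structure, where the sleep stands in for the aiohttp I/O:

```python
import asyncio

async def download(url, semaphore):
    async with semaphore:       # limit concurrent downloads
        await asyncio.sleep(0)  # stand-in for the actual network I/O
        return url

async def main(urls):
    semaphore = asyncio.Semaphore(4)
    tasks = (download(u, semaphore) for u in urls)
    return await asyncio.gather(*tasks)  # results keep the input order

urls = ['u1', 'u2', 'u3']
results = asyncio.run(main(urls))
print(results)  # ['u1', 'u2', 'u3']
```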

TDD in shell scripts

So what is the state of the religious war over making *having a bash* more robust with modern patterns? I have written some amazing throw-aways myself over the years, but the TDD approach would tell you that this is now almost always a waste of time: you will want to do it the right way and reuse what you have invested in, or at least be able to when you have to change against plan. Continue reading TDD in shell scripts

List loaded modules in python

Just a handy little snippet to list the modules loaded and available in the namespace.

import sys

print(sys.modules.keys())

yields something like

['copy_reg', 'encodings', 'site', '__builtin__', '__main__', 'encodings.encodings', 'abc', 'posixpath', 'errno', 'encodings.codecs', '_abcoll', 'types', '_codecs', '_warnings', 'genericpath', 'stat', 'zipimport', 'encodings.__builtin__', 'warnings', 'UserDict', 'encodings.utf_8', 'sys', 'codecs', 'readline', 'os.path', 'signal', 'linecache', 'posix', 'encodings.aliases', 'exceptions', 'os']
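Since sys.modules is a plain dict mapping names to module objects, the same idea extends to quick membership checks (a sketch):

```python
import sys

loaded = sorted(sys.modules)  # every module name imported so far
print('sys' in sys.modules)   # True -- sys is always loaded

import json                   # importing a module registers it
print('json' in sys.modules)  # True
```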