Tag Archives: comp sci

Iterables vs. Generators in Python

I’ve had some people asking me lately what the difference is between a Python iterable and a Python generator, and when to use each. This is a short write-up showing some of the differences and the benefits of each.

Iterables

Simply put, an iterable in Python is any object that lets you loop over its items with a for ... in statement, e.g.:

    A = [1,2,3,4,5]
    for a in A:
        print(a)

will output:

1
2
3
4
5

An iterable is an object that holds (or yields) other items, whether primitives or more complex objects. Lists, tuples, sets, and dicts are all examples of iterables, and so are strings and files. We use them all the time, quite properly, in our code.
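To make that concrete, here is a minimal sketch (the variable names are just illustrative) showing that the built-in iter() will hand back an iterator for each of these kinds of objects:

    nums = [1, 2, 3]
    word = "abc"
    table = {"x": 1, "y": 2}

    # Anything iter() accepts is an iterable; each call returns an iterator object.
    print(iter(nums))
    print(iter(word))
    print(iter(table))   # iterating a dict walks over its keys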

And when we use a comprehension (hint, hint!) we are also creating an iterable, so

    B = [x for x in range(1,10000)]

creates a list of 9,999 values, from 1 up to (but not including) 10000.

So that’s great. But perhaps we should be concerned about things like efficiency, speed, and memory when we are building larger applications. Let’s examine a couple of data points. Here I’m using sys.getsizeof to get a fairly decent idea of memory usage.

sys.getsizeof(A)
920
sys.getsizeof(B)
87632

Not surprisingly, B > A, which is reasonable given that B has 9,999 items while A has 5. But think about your application and how long A or B is going to hang around. Are you using it once or many times? Are you giving up available memory for a one-time operation, or do you need to access that iterable again and again? If you only need it once, it is costly to build the whole list and then have it hang around.

Generators

Welcome the generator, which stores not the values themselves but the method used to compute them. Let’s create a generator that gives the same result as B above.

    C = (x for x in range(1,10000))

The difference in the definition is the use of () rather than the list comprehension []. What did we get? Let’s look at the differences.

type(A)
<class 'list'>
type(B)
<class 'list'>
type(C)
<class 'generator'>

But try using C in your print loop and you will get the same result! What about size differences?

sys.getsizeof(A)
920
sys.getsizeof(B)
87632
sys.getsizeof(C)
80

Now that’s interesting! Not only is C roughly 0.09% the size of B, even though both yield the same values when used in a computation, but C is only about 8.7% the size of A, which holds just 5 values compared with C’s 9,999.
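To underline the point, here is a minimal sketch (not taken from the timings above) showing that the reported size of a generator expression stays essentially flat no matter how many values it will eventually produce:

    import sys

    small = (x for x in range(10))
    large = (x for x in range(10_000_000))

    # Both report the same small footprint, because a generator stores only
    # the recipe for producing values, never the values themselves.
    print(sys.getsizeof(small))
    print(sys.getsizeof(large))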

I got the output I wanted from C with a similar iteration, namely:

    for c in C:
        print(c)

and, sure enough, I got 9,999 values of output. But when I ran it a second time, I got no output! That’s because my generator was consumed when I iterated over it: a generator can only be used once. For the many times when I only need to build the iterable and run over it once in the flow, that’s fine, and I gain considerable efficiency.
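Here is a minimal sketch of that one-shot behaviour (illustrative variable names), along with the obvious workaround of simply rebuilding the generator when you need the values again:

    D = (x * x for x in range(5))

    print(list(D))   # [0, 1, 4, 9, 16] -- this consumes the generator
    print(list(D))   # [] -- nothing left; it can only be iterated once

    # Rebuilding is cheap, since only the recipe is stored:
    D = (x * x for x in range(5))
    print(sum(D))    # 30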

A more complex example

And these are merely simple iterables. Imagine what happens with more complex structures, e.g.

    letters = ['a', 'b', 'c', 'd', 'e']
    colors = ['red', 'yellow', 'blue', 'green']
    squares = [x*x for x in range(1,10)]

    newlist = [(l, c, s) for l in letters for c in colors for s in squares]
    newgen = ((l, c, s) for l in letters for c in colors for s in squares)

Both newgen and newlist produce objects that can yield the same 180-element group of tuples, e.g.:

('a', 'red', 1)
('a', 'red', 4)
('a', 'red', 9)
('a', 'red', 16)
('a', 'red', 25)
('a', 'red', 36)
('a', 'red', 49)
('a', 'red', 64)
('a', 'red', 81)
('a', 'yellow', 1)
('a', 'yellow', 4)
...
('e', 'green', 64)
('e', 'green', 81)

Yet look at the type and size differences:

<class 'list'> 1680
<class 'generator'> 80

newlist is 21 times larger than newgen. So, as you write iterable objects, especially iterables that exist only to feed another computation, consider generators!
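As a closing illustration, here is a minimal sketch (reusing the same lists as above) of that pattern: when the tuples exist only to feed another computation, a generator expression can be handed straight to the consuming function, and nothing is materialised in memory:

    letters = ['a', 'b', 'c', 'd', 'e']
    colors = ['red', 'yellow', 'blue', 'green']
    squares = [x * x for x in range(1, 10)]

    # List version: builds all 180 tuples in memory before len() ever sees them.
    count_list = len([(l, c, s) for l in letters for c in colors for s in squares])

    # Generator version: sum() pulls the items one at a time; nothing is stored.
    count_gen = sum(1 for l in letters for c in colors for s in squares)

    print(count_list, count_gen)   # 180 180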

Don't put your auth tokens in your R source code

I’ve been working with the great open data source that is the BLS (Bureau of Labor Statistics). You can get some of the data with the v1 API, but to use the v2 API you need to have a token, which simply takes a registration and a validation. Cheers to BLS, and cheers to mikeasilva for the blsAPI package.

So, now you have your API token and you want to grab some data into R and cook it up. You might do something like:

install_github("mikeasilva/blsAPI")
payload <- list('seriesid'=c('LAUCN040010000000005','LAUCN040010000000006'),
                'startyear'='2010',
                'endyear'='2012',
                'catalog'='true',
                'calculations'='true',
                'annualaverage'='true',
                'registrationKey'= 'MYVERYOWNTOKENREGISTEREDTOME')
response json

Sadly, when you check your code into GitHub, or share it with someone else, they have your API token. A better way exists, padawan. Go to your home directory; running


> normalizePath("~/")

in the R console will tell you where that is if you don’t know. So will a simple

cd

in a shell, but if you know what a shell is, you knew that already :). In your home directory, create a new file called .Renviron, unless it already exists (in which case you might wonder why you are reading this post). In .Renviron you can enter one KEY=VALUE pair per line:


BLS_API_TOKEN=11111111122222222333333333
GITHUB_TOKEN=11111111133333333322222222
BIGDATAUSERNAME=BIGDADDY
BIGDATAPASSWD=ROCKS
KEY=VALUE

and, beautifully, you can grab any and all of these values in your R code with the following:


myValueOfInterest <- Sys.getenv("KEY")
typeof(myValueOfInterest)
[1] "character"

So you can easily pass it as a parameter to those connections, which is all much better than embedding it directly in the source. N.B.: if you happened to make your home directory part of your project directory, don’t commit the .Renviron; better yet, go change your project directory to something more sensible, like a child directory. While you’re at it, look at some of the other methods available via Sys, e.g.:


Sys.setenv()
Sys.unsetenv()

Now your interaction with the v2 API is more like:


payload <- list('seriesid'=c('LAUCN040010000000005','LAUCN040010000000006'),
                'startyear'='2010',
                'endyear'='2012',
                'catalog'='true',
                'calculations'='true',
                'annualaverage'='true',
                'registrationKey'= Sys.getenv("BLS_API_TOKEN"))
response <- blsAPI(payload)
json <- fromJSON(response)

Verbosity and algorithmic defences

The flood of information that is coming through our personal data buses is increasing all the time. I came across a couple of comparative statistics the other day that blew me away, and I wonder if we are all foolishly ignoring that data deluge and betting on algorithms to save the day. Continue reading Verbosity and algorithmic defences

Conditions for success by hard coding

I was reading the latest IEEE Software and came across a piece by [Grady Booch](http://www.handbookofsoftwarearchitecture.com) that I wanted to capture here and reflect upon. The piece is mainly about his experience with systems that grow in complexity over time and the changes that need to be made in dealing with them as they mature through the normal application lifecycle. Continue reading Conditions for success by hard coding

Social Network Analysis tool for LMS fora

[SNAPP](http://www.snappvis.org/) is an interesting browser bookmarklet that will analyse a forum and build a social graph of the participants, providing a way to model interactions of the forum participants. This is a neat project that feeds into the growing domain of learning analytics. Continue reading Social Network Analysis tool for LMS fora

Performance of NoSQL vs SQL

Doing some work looking at the performance of NoSQL engines versus traditional Codd-style relational DBs, and found some actual benchmark data that is interesting and impressive. This approach is already critical in big-data computation in scientific and commercial environments, both in experimentation and in production, and will only become more so unless the licensing model for RDBMS and storage evaporates into yesterday, which I think is highly unlikely. Continue reading Performance of NoSQL vs SQL

Enterprise Architecture definition

Been following a very flouncy thread on *structure smells* in an EA, and the guy starting the thread couldn’t define sufficiently well the starting point for what he meant by *enterprise* and *application* architectures. Here is the [TOGAF](http://www.opengroup.org/togaf/) definition, which most sane people would accept, along with some examples of the other stuff that was coming out, for … comparative value. Continue reading Enterprise Architecture definition

RDFa in TinyMCE for semantic web content markup

[TinyMCE](http://tinymce.moxiecode.com/) is a nice, Ajax-compatible, platform-independent, web-based JavaScript HTML WYSIWYG editor control released as open source under the LGPL by Moxiecode. TinyMCE can convert HTML TEXTAREA fields or other HTML elements to editor instances, and it is very easy to integrate into other content management systems. [RDFaCE](http://aksw.org/Projects/RDFaCE) is an online RDFa content editor based on TinyMCE. It supports different views for semantic content authoring and uses existing Semantic Web APIs to facilitate annotating and editing RDFa content. The RDFaCE system architecture consists of three layers: the rich text editor, the RDFaCE plugin, and the external services layer. Continue reading RDFa in TinyMCE for semantic web content markup