Tag Archives: Dev

async or threaded file downloads in python 3.x

Capturing here some nice examples of using asyncio and threads to manage multiple file downloads in Python 3.x. Go here for more in-depth discussion.

You could use a thread pool to download files in parallel:

#!/usr/bin/env python3
from multiprocessing.dummy import Pool  # use threads for I/O-bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls)  # download 4 files at a time
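The same pattern is also available in the standard library via concurrent.futures. A minimal sketch, with a stand-in fetch function so it runs without a network (swap in urlretrieve for real downloads):

```python
#!/usr/bin/env python3
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # stand-in for urlretrieve(url); just derives a local filename
    return url.rsplit('/', 1)[-1]

urls = ['http://example.com/a.txt', 'http://example.com/b.txt']
with ThreadPoolExecutor(max_workers=4) as pool:  # 4 downloads at a time
    results = list(pool.map(fetch, urls))
# results == ['a.txt', 'b.txt']
```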

You could also download several files at once in a single thread using asyncio:

#!/usr/bin/env python3
import asyncio
import logging
import posixpath
from contextlib import closing
from urllib.parse import urlsplit

import aiohttp  # $ pip install aiohttp

def url2filename(url):
    """Return the basename of the URL path, e.g. '.../b.txt' -> 'b.txt'."""
    return posixpath.basename(urlsplit(url).path)

@asyncio.coroutine
def download(url, session, semaphore, chunk_size=1 << 15):
    with (yield from semaphore):  # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        response = yield from session.get(url)
        with closing(response), open(filename, 'wb') as file:
            while True:  # save file
                chunk = yield from response.content.read(chunk_size)
                if not chunk:
                    break
                file.write(chunk)
        logging.info('done %s', filename)
    return filename, (response.status, tuple(response.headers.items()))

urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
with closing(asyncio.get_event_loop()) as loop, \
     closing(aiohttp.ClientSession()) as session:
    semaphore = asyncio.Semaphore(4)
    download_tasks = (download(url, session, semaphore) for url in urls)
    result = loop.run_until_complete(asyncio.gather(*download_tasks))
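The concurrency-limiting trick there is just an asyncio.Semaphore. On Python 3.5+ the same shape reads more cleanly with async/await; here is a runnable sketch with dummy work standing in for the HTTP fetch:

```python
#!/usr/bin/env python3
import asyncio

async def limited(i, semaphore):
    async with semaphore:  # at most 4 coroutines inside this block at once
        await asyncio.sleep(0)  # stand-in for the actual download I/O
        return i

async def main():
    semaphore = asyncio.Semaphore(4)
    tasks = (limited(i, semaphore) for i in range(10))
    return await asyncio.gather(*tasks)

results = asyncio.run(main())  # asyncio.run() needs Python 3.7+
# results come back in submission order: [0, 1, ..., 9]
```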

qdapRegex library for R

I just found the qdapRegex package for R, part of the larger qdap suite that Jason Gray and Tyler Rinker have put together to support text munging/processing for discourse analysis and the like. There's a lot in there, with four packages: the regex set, some tools, dictionaries, and qdap proper for qualitative-analysis (pre)processing.

Continue reading qdapRegex library for R

Don't put your auth tokens in your R source code

I’ve been working with the great open data source that is BLS (the Bureau of Labor Statistics). You can get some of the data with the v1 API, but to use the v2 API you need to have a token. That simply takes a registration and a validation. Cheers to BLS. And cheers to Mikeasilva for the blsAPI package.

So, now you have your API token and you want to go grab some data into some R and cook it up. So you might do something like:

payload <- list('seriesid'=c('LAUCN040010000000005','LAUCN040010000000006'),
                'registrationKey'='MYVERYOWNTOKENREGISTEREDTOME')
response <- blsAPI(payload)
json <- fromJSON(response)

Sadly, when you check your code into GitHub, or share it with someone else, they have your API token. A better way exists, padawan. Go to your home dir; if you don't know where that is,

> normalizePath("~/")

in the R console will tell you. So will a simple

echo $HOME

in a shell, but if you know what a shell is you knew that already :). In your home dir, edit a new file called .Renviron, unless it already exists (in which case, why are you reading this post?). In .Renviron, you enter one key-value pair per line:

BLS_API_TOKEN=MYVERYOWNTOKENREGISTEREDTOME
and, beautifully, you can grab any and all of these values in your R code with the following:

myValueOfInterest <- Sys.getenv("BLS_API_TOKEN")
> class(myValueOfInterest)
[1] "character"

so you can easily pass it as a parameter to those connections. All much better than embedding it directly into the source. N.B.: if you happened to include your home dir as part of your project dir, don't commit the .Renviron; also, go change your project directory to something more sensible, like a child dir. While you're at it, look at some of the other methods available via Sys, e.g.:

Sys.setenv(), Sys.info(), Sys.time(), Sys.sleep()
Now your interaction with the v2 API is more like:

payload <- list('seriesid'=c('LAUCN040010000000005','LAUCN040010000000006'),
                'registrationKey'=Sys.getenv("BLS_API_TOKEN"))
response <- blsAPI(payload)
json <- fromJSON(response)
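The keep-tokens-out-of-source idea is language-agnostic; Python's equivalent of Sys.getenv() is os.environ. A minimal sketch (setting the variable in-process only so the example is self-contained; normally your shell or environment file provides it):

```python
import os

# set here only for a self-contained demo; real code reads, never hardcodes
os.environ['BLS_API_TOKEN'] = 'MYVERYOWNTOKENREGISTEREDTOME'

token = os.environ.get('BLS_API_TOKEN', '')  # '' if unset, like Sys.getenv()
payload = {'seriesid': ['LAUCN040010000000005', 'LAUCN040010000000006'],
           'registrationKey': token}
```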

Docker goodness with git for builds

So, Docker continues to grow and gain adoption. Google, AWS, OpenStack, etc. are all building in docker utility. Here is a good synopsis of some of the myths about docker and its real benefits and shortcomings.


But make no mistake, containerization is here and only going to grow. There is much activity about how VMWare is having to respond to containers and docker in particular, here, here and here.

What I’m now interested in for enterprise adoption is the building of interfaces on top of the Docker APIs that let ops take advantage of this goodness, allow for separation of duties, and support clean promotion to production in the enterprise. Client libraries for the API exist in many languages, including Go, Python, Java, Ruby, and JavaScript.

Serialize R objects for transport across the wire

I’ve been thinking lately about serialization and transport of R objects. It seems to me that there is still some clunkiness in having modular classes share objects, and that the predominant paradigm is still to store them at rest. But there are options, including save() and saveRDS(), which will happily serialize your object to rest or to a connection. saveRDS() seems the better method unless you need to write multiple objects in one call, in which case you must fall back to save(). Simple to use:

> anObject <- rnorm(100, mean=0, sd=1)
> anObject <- data.frame(x=1:length(anObject), y=anObject)
> saveRDS(object = anObject, file = "/local/path/to/use/objectAtRest.rds")
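For comparison, Python's closest analogue to saveRDS()/readRDS() is pickle: binary serialization of a single object to rest. A minimal sketch (the path is illustrative):

```python
import os
import pickle
import tempfile

anObject = {'x': list(range(1, 101)), 'y': [0.0] * 100}
path = os.path.join(tempfile.gettempdir(), 'objectAtRest.pkl')

with open(path, 'wb') as f:
    pickle.dump(anObject, f)   # like saveRDS(anObject, file=...)
with open(path, 'rb') as f:
    restored = pickle.load(f)  # like readRDS(file=...)
# restored == anObject
```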

In some instances, text as a storage and even transport format may be better: no fighting environment issues when sharing an object, and clean transport across the wire. There is a base method, dput(), which serializes into a text representation for rest or connections, but it is clunky and temperamental; even the base R documentation tells us it is not a good way to share objects. Anyone outside the R bubble would think immediately of XML or, more lightweight, JSON. And there are (at least) three packages that read/write JSON in R.

jsonlite is a package that originally forked from RJSONIO and then underwent a complete rewrite. It has several useful methods:

  • flatten – converts a nested data frame into a 2D data frame
  • prettify, minify – adds/removes indentation in a JSON string
  • serializeJSON – robust, and consequently more verbose, serialization to/from R objects; encodes all data and attributes
  • stream_in, stream_out – line-by-line processing of JSON over a connection; common with large datasets in JSON DBs
  • toJSON, fromJSON – serialize to/from JSON with the type conventions discussed here
  • unbox – utility that marks an atomic data frame or vector as a singleton, for use with restrictive predetermined JSON structures
  • validate – tests whether a string contains valid JSON
  • rbind.pages – combines a list of data frames into a single one, intended to help with paged JSON coming in over the wire

So, if you need textual representations of objects, I would use toJSON() over dput().

# Get JSON over the wire and convert to a local data frame
> aDataFrame <- fromJSON("https://api.github.com/users/hadley/orgs")

# Serialize a local object to JSON text
> anObjectInJSON <- toJSON(anObject, pretty=TRUE)
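The toJSON()/fromJSON() round trip has a direct counterpart in Python's json module, which makes the text-over-the-wire appeal concrete:

```python
import json

obj = {'x': [1, 2, 3], 'y': ['a', 'b', 'c']}
text = json.dumps(obj, indent=2)  # like toJSON(obj, pretty=TRUE)
restored = json.loads(text)       # like fromJSON(text)
# restored == obj
```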

Moving label titles in ggplot

Looking to create a small-multiples plot in ggplot with a wordy y-axis title. Here is the code:

ggplot(myData) +
  aes(x=x, y=y) +
  geom_point() +
  facet_wrap(~a_third_variable) +
  labs(x="XXX", y="Many, many words about YYY")

If you want to see some of the wonderful things one can adjust with the axes, go read the cookbook pages on formats for axes. But I had the y-axis title overlapping the scale ticks on the y-axis, and there was nothing I found in the cookbook to deal with this.

So, to adjust the position, use a theme() call:

theme(axis.title.y = element_text(vjust=0.5))

Remember that this adjusts from the perspective of the axis text, which is rotated for the y-axis: vjust moves perpendicular to the text, so for the y-axis title it shifts the title horizontally, away from the tick labels.

Need to be able to silence warnings on lubridate functions

The lubridate package in R is excellent. It is intuitive for working with all kinds of date methods and very comprehensive. All praise to Hadley for continuing to maintain this excellent addition to the community.

In recent work, I have noticed that methods will correctly handle NA values in the input object, but I don’t see a way to turn off the warnings when irrelevant. A long time ago, NA was breaking methods, but Hadley fixed that with b8e90c.

And it works.

> test <- c("1/1/15", "2/2/15", "3/3/15", NA, "5/5/15")
> test
[1] "1/1/15" "2/2/15" "3/3/15" NA       "5/5/15"
> is.na(test)
[1] FALSE FALSE FALSE  TRUE FALSE
> dmy(test)
[1] "2015-01-01 UTC" "2015-02-02 UTC" "2015-03-03 UTC" NA               "2015-05-05 UTC"

In addition, in the wild, I am using the method on a vector and getting warnings (for the NAs) that I would like to ignore. I haven’t seen a parameter for ignoring warnings, so that would be a nice addition.
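Until such a parameter exists, the blunt workaround in R is suppressWarnings(). The same scoped-silencing pattern exists in Python's warnings module; here is a sketch with a toy parser standing in for dmy():

```python
import warnings

def parse(value):
    # toy stand-in for a lenient parser: warn on missing input, return None
    if value is None:
        warnings.warn('failed to parse')
        return None
    return value

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence warnings in this block only
    results = [parse(v) for v in ['1/1/15', None, '5/5/15']]
# results == ['1/1/15', None, '5/5/15'], with no warning emitted
```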

Real Time BI and CI in OBIEE

I was fortunate enough to meet up with Stewart at IOUG 2015 in Las Vegas, where he was once again peddling his elixir of Agile to the dark underworld of OBIEE BI. I saw Stewart give a seminar last year at IOUG 2014, where he advocated for the XML native format of OBIEE 12c and how it was going to facilitate meaningful VCS for the OBIEE RPD, at least. I went back to the ranch very excited by all of this, but again got the response of "that won’t work here, we don’t work that way." Sighs.

Nonetheless, Stewart and Kevin have moved on from being ACEs to starting a new consultancy called Red Pill Analytics. They have some of their presentations and articles up on their main site, and it is worth a trawl. I will try to write a bit more about this in the next couple of days, but an important idea to highlight is their active selling of development-as-a-service. The model is that you purchase a capacity (small, medium, or large) and then fill the sprint backlog on a regular basis. It’s an agile contract and should work like any other agile structure, but you have access to some BI wizards in the model, so it should inject some rapid pushout of deliverables to prod.

I think that this is froody, both for what it can do for capacity-low OBIEE environments and for what it demonstrates about capacity-based rather than contract-based engagements with third-party organizations. The first is a must-have for organizations with a local lack of talent but high demand; it produces a feasible way forward for enterprise-class platforms like OBIEE while still retaining the ability to respond to the enhancement-request stream from functional areas.

The second is all about a new model of sourcing talent and capacity for an organization from an external service in an agile model. This is great to see, and I suspect that we will see it in many other areas as time moves forward. It truly is a simple but bold extension of the IaaS/PaaS models into pure software development. It should lead to market efficiencies for hot areas as well, since competition should lend itself to growth in the sector. OBIEE, PeopleCode, all of these areas could benefit from this.