Tag Archives: big data

Serialize R objects for transport across the wire

I’ve been thinking lately about serialization and transport of R objects. It seems to me that there is still some clunkiness in having modular classes share objects, and that the predominant paradigm is still to store them at rest. But there are options, including save() and saveRDS(), which will happily serialize your object to a file at rest or to a connection. saveRDS() seems to be the better method unless you need to write multiple objects with one call, in which case you must fall back to save(). It is simple to use:

    > anObject <- (rnorm(100, mean=0, sd=1))
    > anObject <- data.frame(x=1:length(anObject), y=anObject)
    > saveRDS(object = anObject, file = "/local/path/to/use/objectAtRest.rds")
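As a rough sketch of the difference between the two approaches, the snippet below round-trips an object with saveRDS()/readRDS(), writes two objects in one call with save()/load(), and serializes to a raw vector with serialize(), which is one way to get bytes you could push over a connection. The temporary file paths and the names `anotherObject`, `scratch`, and `payload` are made up for illustration.

```r
set.seed(42)
anObject <- data.frame(x = 1:100, y = rnorm(100))
anotherObject <- letters[1:5]

# saveRDS()/readRDS(): one object per file; the reader picks the variable name
rdsPath <- tempfile(fileext = ".rds")
saveRDS(anObject, file = rdsPath)
restored <- readRDS(rdsPath)

# save()/load(): several objects per file, restored under their original names
rdaPath <- tempfile(fileext = ".RData")
save(anObject, anotherObject, file = rdaPath)
scratch <- new.env()                      # load here to avoid clobbering globals
loadedNames <- load(rdaPath, envir = scratch)

# serialize() with connection = NULL returns the bytes as a raw vector
payload <- serialize(anObject, connection = NULL)
received <- unserialize(payload)
```

Loading into a separate environment is a design choice here: load() silently overwrites objects of the same name, which is one reason readRDS() tends to be easier to reason about.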

In some cases, text may be the better storage format and even transport protocol: you avoid fighting environment issues when sharing an object, and it travels cleanly across the wire. There is a base function, dput(), which writes a text representation of an object to a file or connection, but it seems to be very clunky and temperamental. Even the base R documentation tells us it is not a good way to share objects. Anyone outside the R bubble would think immediately of XML or, more lightweight, JSON. And there are (at least) three packages that read and write JSON in R.
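For reference, this is roughly what the dput() route looks like, paired with its reader dget(). It is a minimal sketch on a tiny, made-up data frame (`smallObject`); the output is R source code, so only another R session can read it, and numeric values may not survive bit-for-bit under the default deparse settings.

```r
smallObject <- data.frame(id = 1:3,
                          label = c("a", "b", "c"),
                          stringsAsFactors = FALSE)

# dput() writes R code that would recreate the object
textPath <- tempfile(fileext = ".R")
dput(smallObject, file = textPath)

# The file is plain text and can be inspected or sent as-is
textForm <- readLines(textPath)

# dget() parses and evaluates that text to rebuild the object
rebuilt <- dget(textPath)
```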

jsonlite is a package that started out as a fork of RJSONIO and was then rewritten. It has several useful functions:

  • flatten – converts a nested df into a 2D df
  • prettify, minify – [adds,removes] indentation to a JSON string
  • serializeJSON – robust, and consequently, more verbose, serialization to/from R objects. Encodes all data and attributes
  • stream_in, stream_out – for line-by-line processing of JSON over a connection. Common with large datasets in JSON DBs
  • toJSON, fromJSON – serializes to/from JSON with type conventions discussed here
  • unbox – utility function which marks a length-one atomic vector or one-row df as a singleton, for use with restrictive predetermined JSON structures
  • validate – test if a string contains valid JSON
  • rbind.pages – combine a list of dfs to a single df, intended to help with paged JSON coming in over the wire.

So, if you need textual representations of objects, I would use toJSON() over dput().
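To make a few of the functions listed above concrete, here is a small sketch assuming the jsonlite package is installed. The data frame `df` and the factor `sizes` are invented for illustration. It contrasts the compact toJSON()/fromJSON() pair, which drops R-specific attributes, with serializeJSON()/unserializeJSON(), which keeps them at the cost of a more verbose encoding.

```r
library(jsonlite)

df <- data.frame(id = 1:3, score = c(0.5, 1.25, 2), stringsAsFactors = FALSE)

# toJSON()/fromJSON(): conventional JSON, one record per row
json <- toJSON(df)
back <- fromJSON(json)

# validate() reports whether a string parses as JSON
ok  <- validate(json)
bad <- validate("{not json")

# prettify() adds indentation, minify() strips it again
pretty  <- prettify(json)
compact <- minify(pretty)

# unbox() marks a length-one vector as a scalar rather than a 1-element array
boxed   <- toJSON(list(n = 1))
unboxed <- toJSON(list(n = unbox(1)))

# A factor loses its class through toJSON()/fromJSON() ...
sizes <- factor(c("lo", "hi", "lo"), levels = c("lo", "hi"))
asText <- fromJSON(toJSON(sizes))
# ... but keeps class and levels through serializeJSON()/unserializeJSON()
asFactor <- unserializeJSON(serializeJSON(sizes))
```

Which pair to use depends on the consumer: toJSON() output is what a non-R client would expect, while serializeJSON() is mainly useful when the other end is also R.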

    > library(jsonlite)
    > # Get JSON over the wire and convert to a local df
    > aDataFrame <- fromJSON("https://api.github.com/users/hadley/orgs")
    > # Serialize a local object to a JSON string
    > anObjectInJSON <- toJSON(anObject, pretty=TRUE)
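For larger datasets, the line-by-line functions mentioned above may be a better fit than a single toJSON() call. The following is a minimal sketch, again assuming jsonlite is installed. It uses a temporary file as a stand-in for a real network connection, and the `events` data frame is invented for illustration.

```r
library(jsonlite)

events <- data.frame(id = 1:4,
                     kind = c("open", "click", "click", "close"),
                     stringsAsFactors = FALSE)

# stream_out() writes one JSON record per line to a connection
ndjsonPath <- tempfile(fileext = ".ndjson")
out <- file(ndjsonPath, open = "w")
stream_out(events, out, verbose = FALSE)
close(out)

# Each line is a self-contained JSON document
rawLines <- readLines(ndjsonPath)

# stream_in() reads the records back and binds them into a data frame
inp <- file(ndjsonPath, open = "r")
streamed <- stream_in(inp, verbose = FALSE)
close(inp)
```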

Own your data and the capability to sweat it

The analysis is rolling in about what won it for Obama, and a great deal of the credit is going to big data and analytical models. The data came from public sector and commercial databases, combined into an Obama campaign data warehouse. Then there were real data scientists who knew how to build models and act upon them. Romney’s people were also doing these sorts of things but, importantly for me, chose to outsource much of the effort, and thus were not able to own and exploit as much of the results and models as Obama. This is the salient lesson: in a world of increasing data-centricity, the successful organizations will not view data, and the applications which sweat it, as a commodity. Rather, they will need to see them as a core strategic capability that, if anything, they will need to grow.

LAK 2012

There was some interesting stuff happening at [LAK 2012](http://www.solaresearch.org/events/lak/2012videos/) in Vancouver. I didn’t go, but want to go over some of it here and capture it for later. Much of this will be pushed forward in Denver at [Educause 2012](http://www.educause.edu/annual-conference), which I should be at. This particular talk looked at how to build organizational capacity for learning analytics (LA) inside a higher education (HE) institution: Donald M. Norris and Linda Baer, Panel Proposal: Building Organizational Capacity for Analytics.

Educause Game Changers book worth a look

Educause is well known to the denizens of HE, and they have just released a new book edited by Diana Oblinger, [Game Changers: Education and Information Technologies](http://www.educause.edu/game-changers). With 17 chapters and an additional 21 case studies, the work is a compilation of authors’ views about how “Institutions are finding new ways of achieving higher education’s mission without being crippled by constraints or overpowered by greater expectations”. The authors are a collection of university presidents, provosts, faculty and others who take a serious look at how the face of HE needs to change to sustain itself.

Performance of NoSQL vs SQL

While doing some work on the performance of NoSQL engines versus traditional Codd relational DBs, I found some actual benchmark data that is interesting and impressive. The NoSQL approach is already critical in big data computation in scientific and commercial environments, both in experimentation and in production, and will only become more so unless the licensing model for RDBMS and storage evaporates into yesterday, which I think is highly unlikely.