Tag Archives: R


Make a Rose chart in R using ggplot

I recently got a request to make a rose plot, sometimes called a circumplex or doughnut chart. There are two cases for this kind of
plot. The first is where your data naturally sits in the circumpolar (polar) coordinate system; circular or periodic data fits naturally
in such a chart. The second case is where you take naturally cartesian data and transform it into the circumpolar coordinate system,
often simply for visual effect. Either way, here I will describe how to do this in R (version 3.3.1, "Bug in Your Hair")
and ggplot2 (any 2.x version should work fine).

Naturally circumpolar data

An example of a naturally polar dataset for such a graph can be seen in this rose chart of periodic data.

Polar Data Plot

Naturally cartesian data

However, most people aren’t dealing with this natural coordinate system. Rather, they are working in a traditional cartesian coordinate system – if you don’t know, then with a high degree of probability you should assume you’re in a basically cartesian space.
But we can still achieve the rose chart for this data. Let’s walk through it with some sample data.

library(ggplot2)
library(plyr)

# generate some random data
set.seed(42)
events <- ceiling(10*runif(10)) 
sales <- 1000*runif(10)

# make a dataframe
df <- data.frame(market=1:10, events = events, sales = sales)

Now we have created ten markets, each of which has a number of events (between 1 and 10) and some sales returns (between 0 and 1000).
My data frame ended up looking like:

market events    sales
     1     10 457.7418
     2     10 719.1123
     3      3 934.6722
     4      9 255.4288
     5      7 462.2928
     6      6 940.0145
     7      8 978.2264
     8      2 117.4874
     9      7 474.9971
    10      8 560.3327

We can easily create a bar chart that shows this data:

# make the initial bar chart
p1 <- ggplot(df) +
    aes(x=factor(market), y=sales, fill=factor(market)) +
    geom_bar(width=1, stat="identity")

Calling p1 will give you your version of this plot:

Bar Chart

You could easily make a similar chart for Events by Market.
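For example, a quick sketch of that variant (the object name p1_events is just illustrative):

# the same kind of bar chart, this time plotting events by market
p1_events <- ggplot(df) +
    aes(x = factor(market), y = events, fill = factor(market)) +
    geom_bar(width = 1, stat = "identity")
p1_events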

Translate to a circumpolar coordinate system

To make the data that we have into a rose plot we are going to wrap that bar chart onto itself.

# now simply want to cast the cartesian coord bar chart onto a circumpolar coord system.
p2 <- p1 + scale_y_continuous(breaks = 0:10) +
    coord_polar() + 
    labs(x = "", y = "") +
    scale_fill_discrete(guide_legend(title='Market')) +
    theme(axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank())

Here we have taken the already existing bar chart, p1, and given it a continuous y scale with breaks at 0 through 10 – marking every unit of sales up to 1000 would not add any clarity to the resulting plot, and we are simply trying to give a sense of scale on the y-axis.
We then push onto the polar coordinate system with coord_polar(). That’s it. The remaining calls clean up the presentation: labs() removes the x and y axis labels, scale_fill_discrete() adds a legend for the market factor color map, and the calls to theme() remove all of the axis text and ticks to simplify things further.
Here is what we end up with:

Rose chart

That’s fine, but we lose all sense of perspective on the actual market values, and comparison between markets could perhaps be made easier. Let’s try to add some of that perspective back. Moving back to our original bar chart, let’s add some grid lines that give a better sense of scale along the y-axis.

# to achieve a grid that is visible, we will add
# a variable to the dataframe that we can plot as a separate layer.
# This means that we use plyr::ddply to subset the original data,
# grouped by the market column, and add a new "border" column
# that we can then stack in a separate geom_bar

df2 <- ddply(df, .(market), transform, border = rep(1, events))

p1 <- ggplot(df) +
    aes(x=factor(market)) +
    geom_bar(aes(y=events, fill=factor(market)),
             width=1, 
             stat="identity") +
    geom_bar(data=df2,
             aes(y = border, width = 1), 
             position = "stack", 
             stat = "identity", 
             fill = NA, 
             colour = "black")

First, we computed a second data frame, df2, using ddply() from plyr. This took every market row and added a border column that has a 1 for every event in that market. Have a View() of the data frame and you will see many more rows than df – I have 70 in mine. Each market now has a number of rows equal to how many events there were in that market.
We then built the same sort of bar chart as before, but do note that we have flipped to events for the y-axis. I have reversed what we did before so you can try it out for sales on your own.
Crucially, we added a second geom_bar to the plot object, which uses the df2 data. It builds that bar chart from the border column and stacks the results with no fill and black outlines. Your resulting bar chart looks like:

Bar Chart with Grids

Cast our gridded bar chart to polar coords.

To get a rose chart from this new bar chart is no different from what we did before. All the differences are wrapped up in the generation of p1, so we have kept our code fairly DRY.
Rerunning the generation of p2 with the new p1:

p2 <- p1 + scale_y_continuous(breaks = 0:10) +
    coord_polar() + 
    labs(x = "", y = "") +
    scale_fill_discrete(guide_legend(title='Market')) +
    theme(axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank())

yields:
Rose chart with grids

Yahoo Financial data API

Access financial data from web api at yahoo

Yahoo used to run a very rich API to financial data but, alas, most of its URLs serve no more. There is still a service, but it
is a pale shadow of the former one. The old URLs accepted up to 84 parameters! (See the parameter reference at the bottom of this post.) Now
you can query


http://real-chart.finance.yahoo.com/table.csv?s=AAPL

and you will get a return of Apple’s data, going back for decades:


Date,Open,High,Low,Close,Volume,Adj Close

It’s all comma delimited, and there doesn’t appear to be any secret XML switch. This is all dumbed down from what used to be there.
You can get the same result from another deprecated service they had at
http://ichart.finance.yahoo.com/table.csv?s=GOOG, again replacing
the s param with the symbol you wish to look up. Other params either make no difference to the request or simply
result in a 404, and one doesn’t seem to be able to pull multiple symbols in a single query.
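If the URL still answers for you, pulling the whole history into R is a one-liner. A minimal sketch (the object name is illustrative, and the service may well be switched off by the time you read this):

# read Apple's full price history straight into a data frame
aapl <- read.csv("http://real-chart.finance.yahoo.com/table.csv?s=AAPL",
                 stringsAsFactors = FALSE)
head(aapl)  # Date, Open, High, Low, Close, Volume, Adj.Close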

There are other services that still seem to be running:

Download your CSV

http://download.finance.yahoo.com/d/quotes.csv?s=AAPL+GOOG&f=snl1c1p2&e=.csv

which will give you a downloaded CSV with quotes in the form of:


"AAPL","Apple Inc.",112.72,-0.20,"-0.18%"
"GOOG","Google Inc.",628.59,-9.02,"-1.41%"

This one accepts multiple symbols, and it also seems to accept many of the old parameters referenced below.
Perhaps this should be reworked into a little API, as it’s pretty hairy just now; I might do that in R to see how useful
it could be for time series (a rough sketch follows the notes below). The e extension makes no difference to the format or to the actual file sent. So,

  • s = <+SYM>… – one or more symbols, joined with +
  • f = a string of the field codes listed below (e.g. snl1c1p2)
  • e makes no difference
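Here is the rough sketch promised above – a toy wrapper around the quotes.csv endpoint. The function name and defaults are mine and purely illustrative, and the service itself may change or vanish:

# build the quotes.csv URL from a vector of symbols and a field string
get_quotes <- function(symbols, fields = "snl1c1p2") {
    url <- paste0("http://download.finance.yahoo.com/d/quotes.csv",
                  "?s=", paste(symbols, collapse = "+"),
                  "&f=", fields, "&e=.csv")
    read.csv(url, header = FALSE, stringsAsFactors = FALSE)
}

quotes <- get_quotes(c("AAPL", "GOOG"))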

Want a chart?

You can request a symbol at http://chart.finance.yahoo.com/z?s=GOOG
and you will get back an image of the Google stock chart.
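If you want to keep a copy of that image from R, something like this should do it (the file name is just an example):

# save the chart image to the working directory
download.file("http://chart.finance.yahoo.com/z?s=GOOG",
              destfile = "goog_chart.png", mode = "wb")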

Want a snapshot, no historical, in XML?

You can still reach through to the backend via this quite amazing piece of internet fossil evidence:
http://query.yahooapis.com/v1/public/yql?q=select%20%20from%20yahoo.finance.quotes%20where%20symbol%20in%20%28%22AAPL,GOOG%22%29&env=store://datatables.org/alltableswithkeys
will yield a mostly empty XML block. But, hey, it’s there.

Need a symbol? Go here.

Parameter API

  • a Ask
  • a2 Average Daily Volume
  • a5 Ask Size
  • b Bid
  • b2 Ask (Real-time)
  • b3 Bid (Real-time)
  • b4 Book Value
  • b6 Bid Size
  • c Change & Percent Change
  • c1 Change
  • c3 Commission
  • c6 Change (Real-time)
  • c8 After Hours Change (Real-time)
  • d Dividend/Share
  • d1 Last Trade Date
  • d2 Trade Date
  • e Earnings/Share
  • e1 Error Indication (returned for symbol changed / invalid)
  • e7 EPS Estimate Current Year
  • e8 EPS Estimate Next Year
  • e9 EPS Estimate Next Quarter
  • f6 Float Shares
  • g Day’s Low
  • h Day’s High
  • j 52-week Low
  • k 52-week High
  • g1 Holdings Gain Percent
  • g3 Annualized Gain
  • g4 Holdings Gain
  • g5 Holdings Gain Percent (Real-time)
  • g6 Holdings Gain (Real-time)
  • i More Info
  • i5 Order Book (Real-time)
  • j1 Market Capitalization
  • j3 Market Cap (Real-time)
  • j4 EBITDA
  • j5 Change From 52-week Low
  • j6 Percent Change From 52-week Low
  • k1 Last Trade (Real-time) With Time
  • k2 Change Percent (Real-time)
  • k3 Last Trade Size
  • k4 Change From 52-week High
  • k5 Percent Change From 52-week High
  • l Last Trade (With Time)
  • l1 Last Trade (Price Only)
  • l2 High Limit
  • l3 Low Limit
  • m Day’s Range
  • m2 Day’s Range (Real-time)
  • m3 50-day Moving Average
  • m4 200-day Moving Average
  • m5 Change From 200-day Moving Average
  • m6 Percent Change From 200-day Moving Average
  • m7 Change From 50-day Moving Average
  • m8 Percent Change From 50-day Moving Average
  • n Name
  • n4 Notes
  • o Open
  • p Previous Close
  • p1 Price Paid
  • p2 Change in Percent
  • p5 Price/Sales
  • p6 Price/Book
  • q Ex-Dividend Date
  • r P/E Ratio
  • r1 Dividend Pay Date
  • r2 P/E Ratio (Real-time)
  • r5 PEG Ratio
  • r6 Price/EPS Estimate Current Year
  • r7 Price/EPS Estimate Next Year
  • s Symbol
  • s1 Shares Owned
  • s7 Short Ratio
  • t1 Last Trade Time
  • t6 Trade Links
  • t7 Ticker Trend
  • t8 1 yr Target Price
  • v Volume
  • v1 Holdings Value
  • v7 Holdings Value (Real-time)
  • w 52-week Range
  • w1 Day’s Value Change
  • w4 Day’s Value Change (Real-time)
  • x Stock Exchange
  • y Dividend Yield

R Data Structures

R Data Structures overview by Hadley Wickham

If you are working with any programming language, there is nothing more important to understand fundamentally than
the language’s underlying data structures. Wickham on R Data Structures is an
excellent overview for R programmers.


There are five fundamental data structures in R; a quick sketch constructing each appears after the list.

  • Homogeneous
    1. 1D – Atomic vector
    2. 2D – Matrix
    3. nD – Array
  • Heterogeneous
    1. List
    2. Data frame
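
For quick reference, here is the sketch mentioned above, constructing each of the five:

v <- c(1, 2, 3)                                  # 1D atomic vector (homogeneous)
m <- matrix(1:6, nrow = 2)                       # 2D matrix (homogeneous)
a <- array(1:24, dim = c(2, 3, 4))               # nD array (homogeneous)
l <- list(x = 1:3, y = "a", z = TRUE)            # list (heterogeneous)
d <- data.frame(x = 1:3, y = c("a", "b", "c"))   # data frame (heterogeneous)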

Hadley goes through the five to show how they compare, contrast and, most importantly, how they are interrelated. Important stuff.
He also goes through a small set of exercises to test comprehension. I think that some of these could be used
as the bones of interview questions.

Taken from his book, Advanced R, which
is well worth the price and should be read by serious R folks.

Code syntax tool for R into HTML

So this is useful for R people. Need to place some code up in HTML, want syntax highlighting, and don’t want to fight code and pre tags all day in WordPress? Paste your block into pretty-r:

data(tips, package="reshape2")
 
tipsAnova <- aov(tip~day-1, data=tips)
tipsLM <- lm(tip~day-1, data=tips)
summary(tipsAnova)
summary(tipsLM)

What this is doing is allowing you to maintain syntactically highlighted and well-formed R source code in your HTML pages, easily. You have already written the source in R, cut and paste it into the form, and it will return a clean set of styled HTML for pasting into any web source you need.

qdapRegex library for R

I just found the qdapRegex package for R, part of the larger qdap set of packages that Jason Gray and Tyler Rinker have put together to support text munging/processing for discourse analysis and the like. There’s a lot in there across the four libraries: the regex set, some tools, dictionaries, and qdap proper for qualitative analysis (pre-)processing.
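As a small taste – hedged, since you should check the package documentation for the exact signatures – rm_url() is typical of the rm_XXX family:

library(qdapRegex)

x <- "The package docs live at http://cran.r-project.org/ if you need them"
rm_url(x)                   # the string with the URL stripped out
rm_url(x, extract = TRUE)   # or pull the URL(s) out instead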


Some Six Sigma Quality Control Charts in R

There is a CRAN package in R, authored and maintained by Luca Scrucca, that has many nice functions geared toward Six Sigma work. I just used it to very easily produce an XmR (X-bar and moving range) plot for some data we have been working with. Here is an example of the output:

Approver Process XmR
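Assuming the package in question is Luca Scrucca’s qcc (quality control charts), a minimal individuals-chart sketch looks something like the following. The data are made up, and this is not the exact code behind the plot above:

library(qcc)

# stand-in individual measurements, purely illustrative
x <- c(10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3, 9.7, 10.6)

# individuals (X) chart; qcc computes the centre line and control limits
qcc(x, type = "xbar.one")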


Don't put your auth tokens in your R source code

I’ve been working with the great open data source that is the BLS (Bureau of Labor Statistics). You can get some of the data with the v1 API, but to use the v2 API you need to have a token. That simply takes a registration and a validation. Cheers to BLS. And cheers to Mikeasilva for the blsAPI package.

So, now you have your API token and you want to go grab some data into some R and cook it up. So you might do something like:

# install_github() comes from the devtools package
library(devtools)
install_github("mikeasilva/blsAPI")
library(blsAPI)

payload <- list('seriesid'=c('LAUCN040010000000005','LAUCN040010000000006'),
                'startyear'='2010',
                'endyear'='2012',
                'catalog'='true',
                'calculations'='true',
                'annualaverage'='true',
                'registrationKey'= 'MYVERYOWNTOKENREGISTEREDTOME')
response <- blsAPI(payload)
json <- fromJSON(response)  # fromJSON() from your JSON package of choice, e.g. jsonlite

Sadly, when you check your code into github, or share it with someone else, they have your API token. A better way exists, padawan. Go to your home dir


> normalizePath("~/")

in the R console will tell you if you don't know. So will a simple
cd

in a shell, but if you know what a shell is you knew that already :). In your home dir, create a new file called .Renviron – unless it already exists, which raises the question of why you are reading this post. In .Renviron, you can enter one key-value pair per line:


BLS_API_TOKEN=11111111122222222333333333
GITHUB_TOKEN=11111111133333333322222222
BIGDATAUSERNAME=BIGDADDY
BIGDATAPASSWD=ROCKS
KEY=VALUE

and, beautifully, you can grab any and all of these values in your R code with the following:


myValueOfInterest <- Sys.getenv("KEY")
typeof(myValueOfInterest)
[1] "character"

so you can easily pass it as a parameter to those connections. All much better than embedding it directly into the source. N.B.: If you happened to include your home dir as part of your project dir, don't commit the .Renviron. Also, go change your project directory to something more sensible like a child dir. While you're at it, look at some of the other methods available via Sys, e.g.:


Sys.setenv(SOME_KEY = "some value")   # set an environment variable for this session (illustrative key)
Sys.unsetenv("SOME_KEY")              # and remove it again

Now your interaction with the v2 API is more like:


payload <- list('seriesid'=c('LAUCN040010000000005','LAUCN040010000000006'),
                'startyear'='2010',
                'endyear'='2012',
                'catalog'='true',
                'calculations'='true',
                'annualaverage'='true',
                'registrationKey' = Sys.getenv("BLS_API_TOKEN"))
response <- blsAPI(payload)
json <- fromJSON(response)

Serialize R objects for transport across the wire

I’ve been thinking lately about the serialization and transport of R objects. It seems to me that there is still some clunkiness in having modular classes share objects, and that the predominant paradigm is still to store them at rest. But there are options, including save() and saveRDS(), which will happily serialize your object to a file at rest or to a connection. saveRDS() seems the better method unless you need to write multiple objects with one call, in which case you must fall back to save(). Simple to use:


> anObject <- rnorm(100, mean = 0, sd = 1)
> anObject <- data.frame(x = 1:length(anObject), y = anObject)
> saveRDS(object = anObject, file = "/local/path/to/use/objectAtRest.rds")
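
Reading it back in later (or on another machine) is the mirror call:

> sameObject <- readRDS("/local/path/to/use/objectAtRest.rds")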

In some instances, text as the storage and even transport format may be better – not fighting any environment issues when sharing an object, or wanting to transport cleanly across the wire. There is a base method, dput(), which will render the serialization into text for a file or a connection, but it seems to be clunky and temperamental; even the base R documentation tells us it is not a good way to share objects. Anyone outside the bubble would think immediately of XML or, more lightweight, JSON. And there are at least three packages that read/write JSON in R.

jsonlite is a package that originally forked from RJSONIO and then underwent a complete rewrite. It has several useful methods:

  • flatten – converts a nested df into a 2D df
  • prettify, minify – [adds,removes] indentation to a JSON string
  • serializeJSON – robust, and consequently, more verbose, serialization to/from R objects. Encodes all data and attributes
  • stream_in, stream_out – for line-by-line processing of JSON over a connection. Common with large datasets in JSON DBs
  • toJSON, fromJSON – serializes to/from JSON with type conventions discussed here
  • unbox – utility method which marks atomic df or vector as singleton, for use with restrictive predetermined JSON structures
  • validate – test if a string contains valid JSON
  • rbind.pages – combine a list of dfs to a single df, intended to help with paged JSON coming in over the wire.
So, if you need textual representations of objects, then I would use toJSON() over dput().


library(jsonlite)

# Get JSON over the wire and convert it to a local data frame
aDataFrame <- fromJSON("https://api.github.com/users/hadley/orgs")

# Serialize a local object (anObject from earlier) to a JSON string
anObjectInJSON <- toJSON(anObject, pretty = TRUE)
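
If you want the more faithful round trip that serializeJSON() promises (encoding attributes and types), the pair of calls looks like this sketch – anObject is the data frame from earlier:

# serialize with full type/attribute information, then restore it
jsonText <- serializeJSON(anObject)
restored <- unserializeJSON(jsonText)
identical(names(restored), names(anObject))  # should be TRUE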