Tag Archives: data


Make a Rose chart in R using ggplot

I recently got a request to make a rose plot, sometimes called a circumplex or doughnut chart. There are two cases for this kind of
plot. The first is where you are using data that naturally sits in the circumpolar coordinate system; circular or polar data fits naturally
in such a chart. The second case is where you want to take naturally cartesian data and transform it into the circumpolar
coordinate system, often simply for visual effect. Regardless, here I will describe how to do this in R (version 3.3.1 "Bug in Your Hair")
and ggplot2 (it should work fine in any 2.x version).

Naturally circumpolar data

An example of a natural dataset for such a graph can be seen in this periodic data represented in the rose chart.

Polar Data Plot

Naturally cartesian data

However, most people aren’t dealing with this natural coordinate system. Rather, they are in a traditional cartesian coordinate system – if you don’t know, then with a high degree of probability you should assume that you’re in a basically cartesian space.
But we can still achieve the rose chart for this data. Let’s walk through it with some sample data.


# load the libraries we need (plyr is used later for ddply)
library(ggplot2)
library(plyr)

# generate some random data
events <- ceiling(10*runif(10))
sales <- 1000*runif(10)

# make a dataframe
df <- data.frame(market=1:10, events = events, sales = sales)

Now we have created ten markets, each of which has a number of events (1 to 10) and some sales returns (0 to 1000).
My dataframe ended up looking like:

Market Events Sales
1 10 457.7418
2 10 719.1123
3 3 934.6722
4 9 255.4288
5 7 462.2928
6 6 940.0145
7 8 978.2264
8 2 117.4874
9 7 474.9971
10 8 560.3327

We can easily create a bar chart that shows this data:

# make the initial bar chart
p1 <- ggplot(df) +
    aes(x=factor(market), y=sales, fill=factor(market)) +
    geom_bar(width=1, stat="identity")

Calling p1 will give you your version of this plot:

Bar Chart

You could easily make a similar chart for Events by Market.
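For instance, a sketch of the events version (the name p_events is mine, and the events column uses the sample values from my table above):

```r
library(ggplot2)

# the same bar chart construction as p1, but for events by market
df <- data.frame(market = 1:10,
                 events = c(10, 10, 3, 9, 7, 6, 8, 2, 7, 8))

p_events <- ggplot(df) +
    aes(x = factor(market), y = events, fill = factor(market)) +
    geom_bar(width = 1, stat = "identity")
```

Calling p_events draws the chart, and it can be cast onto polar coordinates with coord_polar() in exactly the same way as p1 below.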

Translate to a circumpolar coordinate system

To make the data that we have into a rose plot we are going to wrap that bar chart onto itself.

# now simply want to cast the cartesian coord bar chart onto a circumpolar coord system.
p2 <- p1 + scale_y_continuous(breaks = seq(0, 1000, by = 100)) +
    coord_polar() +
    labs(x = "", y = "") +
    scale_fill_discrete(guide = guide_legend(title = 'Market')) +
    theme(axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank())

Here we have taken the already existing bar chart, p1, and given it a continuous y scale split into ten divisions – finer breaks would not add any clarity to the resulting plot. We are simply trying to give a sense of scale in the y-axis.
We then push onto the polar coordinate system with coord_polar(). That’s it. The remaining calls help to clean up the presentation: we remove the x and y axis labels and add a legend for the Market factor colour map. Finally, using calls to theme, we remove all of the axis text and ticks to simplify the presentation.
Here is what we end up with:

Rose chart

That’s fine, but we lose all sense of perspective in the actual market values and comparison between markets could perhaps be made simpler. Let’s try to add some perspective. Moving back to our original bar chart, let’s add some grids that give a better sense of scale along the y-axis.

# to achieve a grid that is visible, we will add
# a variable to the dataframe that we can plot as a separate layer.
# This means that we use plyr::ddply to subset the original data,
# grouped by the market column, and add a new "border" column
# that we can then stack in a separate geom_bar

df2 <- ddply(df, .(market), transform, border = rep(1, events))

p1 <- ggplot(df) +
    aes(x = factor(market)) +
    geom_bar(aes(y = events, fill = factor(market)),
             stat = "identity") +
    geom_bar(data = df2,
             aes(y = border, width = 1),
             position = "stack",
             stat = "identity",
             fill = NA,
             colour = "black")

Firstly, we computed a second dataframe using ddply out of plyr. This took every market row and added a border column that has a 1 for every event in that market. Have a View() of the dataframe and you will see many more rows than df – I have 70 in mine. Each market now has a number of rows equal to how many events there were in that market.
We then did the same sort of bar chart as before, but do note that we have flipped to events for the y-axis. I have reversed what we did before so you can try it out for sales on your own.
Crucially, we added a second bar chart to the plot object, which uses the df2 data. It builds that bar chart from the border column data and stacks the results with no fill and black outlines. Your resulting bar chart looks like:
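If you would rather not pull in plyr, the same expansion can be sketched in base R by repeating each market's row once per event (df2_base is my own name for it; the events column uses the sample values from my table above):

```r
# repeat each market's row events-many times, then add a constant border column
df <- data.frame(market = 1:10,
                 events = c(10, 10, 3, 9, 7, 6, 8, 2, 7, 8))
df2_base <- df[rep(seq_len(nrow(df)), df$events), ]
df2_base$border <- 1
nrow(df2_base)  # 70 for this sample: one row per event
```

The result has the same shape as the ddply output and can be passed to the second geom_bar in exactly the same way.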

Bar Chart with Grids

Cast our gridded bar chart to polar coords

To get a rose chart from this new bar chart is no different to what we did before. All the differences are wrapped up in the generation of p1, so we have kept our code fairly DRY.
Rerunning the generation of p2 with the new p1:

p2 <- p1 + scale_y_continuous(breaks = 0:10) +
    coord_polar() + 
    labs(x = "", y = "") +
    scale_fill_discrete(guide = guide_legend(title = 'Market')) +
    theme(axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank())

Rose chart with grids

Yahoo Financial data API

Access financial data from web api at yahoo

Yahoo used to run a very rich API for financial data, but, alas, it no longer serves on most of the URLs. There is still a service, but it
is a pale shadow of the former one. The old URLs took up to 84 parameters! (see the reference for the API at the bottom). Now
you can query


and you will get a return of Apple’s data, going back for decades:

Date,Open,High,Low,Close,Volume,Adj Close

It’s all comma-delimited, and there doesn’t appear to be any secret XML switch. This is all dumbed down from what used to be there.
You can get the same result from another deprecated service they had at
http://ichart.finance.yahoo.com/table.csv?s=GOOG, again replacing
the s param with the symbol you wish to look up. No other params seem to make any difference to the request, or they
result in a 404. One doesn’t seem to be able to pull multiple symbols in a single query.
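These queries are easy to wrap in R (quote_url and get_history are hypothetical names of my own; the endpoint is deprecated and may no longer respond):

```r
# build the table.csv URL for a given symbol
quote_url <- function(symbol) {
  paste0("http://ichart.finance.yahoo.com/table.csv?s=", symbol)
}

# fetch the historical CSV: Date,Open,High,Low,Close,Volume,Adj Close
get_history <- function(symbol) {
  read.csv(quote_url(symbol), stringsAsFactors = FALSE)
}

quote_url("AAPL")
# "http://ichart.finance.yahoo.com/table.csv?s=AAPL"
```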

There are other services that still seem to be running:

Download your CSV


which will give you a downloaded CSV with quotes in the form of:

"AAPL","Apple Inc.",112.72,-0.20,"-0.18%"
"GOOG","Google Inc.",628.59,-9.02,"-1.41%"

This one accepts multiple symbols. It also seems to accept many of the old parameters referenced below.
Perhaps this should be reworked into a little API, as it’s pretty hairy just now. I might do that in R to see how useful
it could be for time series. The extension makes no difference to the format or to the actual file sent. So,

  • s = <+SYM>…
  • f =
  • e no difference
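The quote CSV that comes back can be read straight into a dataframe; here is a sketch using the two lines shown above (the column names are my own labels, not anything the service supplies):

```r
# the raw quote lines, as returned by the download service
csv_text <- '"AAPL","Apple Inc.",112.72,-0.20,"-0.18%"
"GOOG","Google Inc.",628.59,-9.02,"-1.41%"'

# read.csv strips the quotes and types the numeric columns for us
quotes <- read.csv(text = csv_text, header = FALSE,
                   col.names = c("symbol", "name", "last", "change", "pct_change"),
                   stringsAsFactors = FALSE)
quotes$last
# 112.72 628.59
```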

Want a chart?

You can request a chart for a symbol at http://chart.finance.yahoo.com/z?s=GOOG
and you will get back an image of the Google stock chart.

Want a snapshot, no historical, in XML?

You can still reach through to the backend via this quite amazing piece of internet fossil evidence, which will yield a mostly
empty XML block. But, hey, it’s there.

Need a symbol? Go here.

Parameter API

  • a Ask
  • a2 Average Daily Volume
  • a5 Ask Size
  • b Bid
  • b2 Ask (Real-time)
  • b3 Bid (Real-time)
  • b4 Book Value
  • b6 Bid Size
  • c Change & Percent Change
  • c1 Change
  • c3 Commission
  • c6 Change (Real-time)
  • c8 After Hours Change (Real-time)
  • d Dividend/Share
  • d1 Last Trade Date
  • d2 Trade Date
  • e Earnings/Share
  • e1 Error Indication (returned for symbol changed / invalid)
  • e7 EPS Estimate Current Year
  • e8 EPS Estimate Next Year
  • e9 EPS Estimate Next Quarter
  • f6 Float Shares
  • g Day’s Low
  • h Day’s High
  • j 52-week Low
  • k 52-week High
  • g1 Holdings Gain Percent
  • g3 Annualized Gain
  • g4 Holdings Gain
  • g5 Holdings Gain Percent (Real-time)
  • g6 Holdings Gain (Real-time)
  • i More Info
  • i5 Order Book (Real-time)
  • j1 Market Capitalization
  • j3 Market Cap (Real-time)
  • j4 EBITDA
  • j5 Change From 52-week Low
  • j6 Percent Change From 52-week Low
  • k1 Last Trade (Real-time) With Time
  • k2 Change Percent (Real-time)
  • k3 Last Trade Size
  • k4 Change From 52-week High
  • k5 Percent Change From 52-week High
  • l Last Trade (With Time)
  • l1 Last Trade (Price Only)
  • l2 High Limit
  • l3 Low Limit
  • m Day’s Range
  • m2 Day’s Range (Real-time)
  • m3 50-day Moving Average
  • m4 200-day Moving Average
  • m5 Change From 200-day Moving Average
  • m6 Percent Change From 200-day Moving Average
  • m7 Change From 50-day Moving Average
  • m8 Percent Change From 50-day Moving Average
  • n Name
  • n4 Notes
  • o Open
  • p Previous Close
  • p1 Price Paid
  • p2 Change in Percent
  • p5 Price/Sales
  • p6 Price/Book
  • q Ex-Dividend Date
  • r P/E Ratio
  • r1 Dividend Pay Date
  • r2 P/E Ratio (Real-time)
  • r5 PEG Ratio
  • r6 Price/EPS Estimate Current Year
  • r7 Price/EPS Estimate Next Year
  • s Symbol
  • s1 Shares Owned
  • s7 Short Ratio
  • t1 Last Trade Time
  • t6 Trade Links
  • t7 Ticker Trend
  • t8 1 yr Target Price
  • v Volume
  • v1 Holdings Value
  • v7 Holdings Value (Real-time)
  • w 52-week Range
  • w1 Day’s Value Change
  • w4 Day’s Value Change (Real-time)
  • x Stock Exchange
  • y Dividend Yield

R Data Structures

R Data Structures overview by Hadley Wickham

If you are working with any programming language, there is nothing more important to understand fundamentally than
the language’s underlying data structures. Wickham on R Data Structures is an
excellent overview for R programmers.

There are five fundamental data structures in R.

  • Homogeneous
    1. 1D – Atomic vector
    2. 2D – Matrix
    3. nD – Array
  • Heterogeneous
    1. List
    2. Data frame

Hadley goes through the five to show how they compare, how they contrast and, most importantly, how they are interrelated. Important stuff.
He also goes through a small set of exercises to test comprehension. I think that some of these could be used
as the bones of interview questions.
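The taxonomy is easy to see side by side in a few lines of R (the variable names are mine):

```r
v <- c(1, 2, 3)                            # 1D homogeneous: atomic vector
m <- matrix(1:6, nrow = 2)                 # 2D homogeneous: matrix
a <- array(1:24, dim = c(2, 3, 4))         # nD homogeneous: array
l <- list(n = 1, s = "a")                  # heterogeneous: list
d <- data.frame(x = 1:2, y = c("a", "b"))  # heterogeneous: data frame
```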

Taken from his book, Advanced R which
is well worth the price and should be read by serious R folks.

Some Six Sigma Quality Control Charts in R

There is a CRAN package available in R, authored and maintained by Luca Scrucca, that has many nice functions geared toward Six Sigma work. I just used it to produce, very easily, an XmR (individuals and moving range) chart for some data we have been working with. Here is an example of the output:

Approver Process XmR

Continue reading Some Six Sigma Quality Control Charts in R

Customer Intelligence tools and options

Tom Davenport knocked together an interesting [summary of CI tools](http://blogs.hbr.org/cs/2012/08/a_few_weeks_ago_i.html) that has some cursory analysis and applications. Worth a quick gander. Also interesting are some of the comments. It is amusing to see that the common angst of data warehousing is still coming to new audiences as this drive toward larger adoption of data-centricity continues, namely *we need common definitions*. Ahh, ontology. A book that is referenced and praised in the comments, and that I haven’t read yet, is [Customer Worthy](http://www.amazon.com/Customer-Worthy-everyone-organization-Think/dp/0981986919/ref=sr_1_1?ie=UTF8qid=1345216632sr=8-1keywords=Customer+Worthy%2C+Why+and+How+Everyone+Must+Think+Like+a+Customer). I need to read this, I think.

Verbosity and algorithmic defences

The flood of information that is coming through our personal data buses is increasing all the time. I came across a couple of comparative statistics the other day that blew me away, and I wonder if we are all foolishly ignoring that data deluge and betting on algorithms to save the day. Continue reading Verbosity and algorithmic defences

Performance of NoSQL vs SQL

Doing some work looking at the performance of NoSQL engines versus traditional Codd relational DBs, and found some actual benchmark data that is interesting and impressive. This approach is already critical in big-data computation in scientific and commercial environments, both in experimentation and in production, and will only become more so unless the licensing model for RDBMS and storage evaporates into yesterday, which I think is highly unlikely. Continue reading Performance of NoSQL vs SQL