Doing some work looking at the performance of NoSQL engines versus traditional Codd-style relational DBs, and I found some benchmark data that is both interesting and impressive. This approach is already critical to big data computation in scientific and commercial settings, in both experimentation and production, and will only become more so unless the licensing and cost model for RDBMS and storage somehow collapses, which I think is highly unlikely. [Big Data](http://en.wikipedia.org/wiki/Big_data) is no longer a problem set for the 1%; it is increasingly finding its way into mid-size organizations that need to capture, store, search, analyze, and visualize data sets in the realm of TBs up to ZBs (1×10^21 bytes; 1 ZB = 1×10^9 TB). That is the scale at which real problem sets sit in 2012. To give some sense of these numbers: one-way broadcast network capacity was calculated to be 0.432 ZB in 1986, 0.715 ZB in 1993, 1.2 ZB in 2000 and 1.9 ZB in 2007, shown in the following plot. The last data point is the equivalent of some 175 newspapers a day delivered to every person on earth, as reported by Martin Hilbert and Priscila López ("The World's Technological Capacity to Store, Communicate, and Compute Information", Science, 332(6025), 60-65, 2011).
![Network Capacity Trend](http://22.214.171.124/~shawnmeh/blog/wp-content/uploads/2012/01/network_capacity_trend.png)
Data sets are growing to these scales very quickly: in 2007, 1.9 ZB was transmitted through broadcast technology, according to the [University of Southern California](http://uscnews.usc.edu/science_technology/how_much_information_is_there_in_the_world.html). Part of the growth is that there are simply more devices out there creating data: information-sensing mobile devices, aerial sensing technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, wireless sensor networks and so on. According to [Big Blue](http://www-01.ibm.com/software/data/bigdata/), 2.5 quintillion bytes of data are created every day, and 90% of the data in the world today was created within the past two years.
These growing data sets place increasing pressure on computational performance for loading, monitoring, backing up and optimizing the data. Real-time or near-real-time response is a requirement for analytics, and at these scales SQL implementations start to display serious latency (tens of seconds for result sets). Another concern with the traditional approach is how costs scale for the storage and compute engines that service the RDBMS. Big data analytics teams tend to prefer direct-attached storage (DAS) over shared storage architectures, on grounds of performance, complexity, and cost. This need for performance naturally pushes you to place as much as possible in memory, and as the data sets themselves grow on these scales, any efficiency in the memory footprint of the representation must be considered a win.
[NoSQL](http://en.wikipedia.org/wiki/NoSQL) engines win on several of these factors. One area is performance: they avoid joins and usually scale horizontally, allowing the working set to be partitioned across memory at appropriate scales. RDBMS was the big winner for frequent but small read/write transactions; NoSQL performs well on heavy read/write workloads, with real-world implementations proving the technology and approach at Facebook: 50 TB for inbox search, and 21 PB of storage claimed in 2010 in what was then said to be the world's largest [Hadoop](http://hadoop.apache.org/) cluster. Facebook is reported to achieve 1 MB of metadata per 1 GB of data with this [approach](http://perspectives.mvdirona.com/2008/06/30/FacebookNeedleInAHaystackEfficientStorageOfBillionsOfPhotos.aspx). [Yahoo! Inc](http://www.yahoo.com/) released the source of its own production implementation of Hadoop and has been contributing to the code base, providing maintenance and patches. Hadoop is used by others too, including [Amazon](http://amazon.com), [Google](http://www.google.com) and [IBM](http://www.ibm.com).
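To make the horizontal-partitioning idea concrete, here is a minimal sketch of one common scheme, hash-based sharding, where each key is hashed to one of N nodes so the working set splits across machines instead of living on a single RDBMS host. The node names and function are purely illustrative, not any particular NoSQL engine's API (HBase, for instance, actually partitions by key range rather than by hash, as described below).

```python
import hashlib

# Hypothetical cluster membership -- three nodes for illustration.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str, nodes=NODES) -> str:
    """Map a row key deterministically onto one node via a hash."""
    digest = hashlib.md5(key.encode()).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]

# Each key always lands on the same node, so clients can route
# reads and writes without a central coordinator.
placement = {k: shard_for(k) for k in ["user:1", "user:2", "user:3"]}
```

Adding nodes to such a scheme forces rehashing of most keys, which is why production systems typically refine it with consistent hashing or range-based regions.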
So, architecturally, to provide random, realtime read/write access to data in Hadoop's HDFS, you would use something like [HBase](http://hbase.apache.org/), much as [BigTable](http://en.wikipedia.org/wiki/BigTable) runs across GFS. I have had some questions recently about the real performance of these kinds of systems, so I was interested in a recent [study](http://www.aicit.org/ijact/ppl/04_IJACT2-199028IP.pdf) from some guys at the Politehnica University of Bucharest. HBase supports a [MapReduce](http://en.wikipedia.org/wiki/MapReduce) approach to processing, and its data model stores data in labeled tables made of rows (billions) × columns (millions), with table cells versioned by the timestamp of the last insert. Row keys, column names and cell values are all byte arrays, so anything goes. Table rows are sorted by row key, which is the table's primary key and the sole access mechanism.
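The MapReduce model mentioned above can be sketched in a few lines: a map phase emits (key, value) pairs, the framework shuffles pairs by key, and a reduce phase folds each key's values into a result. This is a single-process toy with illustrative function names, not Hadoop's actual API; word count is the canonical example.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted pairs by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {key: sum(values) for key, values in grouped}

docs = ["big data big tables", "big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {'big': 3, 'data': 1, 'tables': 1, 'clusters': 1}
```

The appeal at ZB scales is that map and reduce tasks are independent per key, so the framework can fan them out across a cluster without the coordination cost of a join.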
Row columns are grouped into namespaced column families, e.g., document:sentence and document:word belong to one family, document. HBase then partitions tables into horizontal groups called regions, each holding a subset of the rows. The study's test used two machines hosting two HBase nodes, each running Linux (Ubuntu 9.10 Karmic Koala 32-bit, Sun Java 1.6.0_20) on VMware Player:

- Desktop (slave node): AMD Sempron 2600+ 1.6 GHz, 768 MB DDR RAM, 160 GB hard drive, Windows XP SP2 32-bit host; the VM had 524 MB of RAM, a single processor, a 9 GB hard drive and one network adapter.
- Notebook (cluster master): Intel Core 2 Duo T5550 1.83 GHz, 2 GB DDR RAM, 160 GB hard drive, 64-bit Windows host; the VM had 1044 MB of RAM, a single processor, a 14 GB hard drive and one network adapter set to bridge mode so that the virtual machine appears connected to the physical network.
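The storage model described above — rows sorted by row key, columns namespaced as family:qualifier, and cells keeping timestamped versions — can be sketched as a toy in-memory table. This is illustrative only (the class and a counter standing in for wall-clock timestamps are my own inventions); real HBase persists regions to HDFS and uses the HTable client API.

```python
import bisect
import itertools

class ToyHTable:
    """Toy in-memory sketch of HBase's table model, not the real client."""

    def __init__(self):
        self._row_keys = []            # row keys kept in sorted order
        self._rows = {}                # row -> {(family, qualifier): [(ts, value), ...]}
        self._clock = itertools.count(1)   # stand-in for insert timestamps

    def put(self, row, family, qualifier, value):
        """Write a new timestamped version of one cell."""
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        cell = self._rows[row].setdefault((family, qualifier), [])
        cell.append((next(self._clock), value))

    def get(self, row, family, qualifier):
        """Read the most recent version of a cell, or None."""
        versions = self._rows.get(row, {}).get((family, qualifier), [])
        return versions[-1][1] if versions else None

    def scan(self):
        """Rows come back in row-key order -- the sole access path."""
        return list(self._row_keys)

t = ToyHTable()
t.put(b"row2", "document", "word", b"data")
t.put(b"row1", "document", "word", b"big")
t.put(b"row1", "document", "word", b"bigger")   # newer version wins on read
# t.get(b"row1", "document", "word") == b"bigger"
# t.scan() == [b"row1", b"row2"]
```

Because everything is byte arrays and rows stay sorted, splitting a table into regions is just cutting the sorted key space into contiguous ranges, which is what lets HBase spread a table across nodes.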
The first plot shows this hardware working on tables with increasing numbers of rows: inserts hit a bump at 500,000 rows, but all operations decline nicely in per-row time as the tables reach 1,000,000 rows. This is the right slope!
![Insert and Updates](http://126.96.36.199/~shawnmeh/blog/wp-content/uploads/2012/01/insert_and_updates.png)
The second plot shows reads across the same increasing table sizes, and again the same slope. Performance seems to be impressive on fairly modest hardware.