First of all, apologies for the lack of posts the last month or so. I’ve been busy working on the launch of significant enhancements to the RainStor product and lots of exciting activity with our partners, including our recently announced relationship with Dell. My other focus has been my fast-growing identical twins, Parker and Ryan, who are now 7 months old. So pardon my indulgence as I combine the two into this blog post. As ever, comments are welcome, but be gentle as I’m operating on a sleep deficit :)
As per Wikipedia, Shared Nothing (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. Shared Nothing architectures have become prevalent in the Data Warehousing space, with products such as EMC Greenplum, HP Vertica and Teradata providing Big Data analytic solutions. Hadoop with HDFS is also an example of an SN environment.
With that as a brief backdrop, let me turn my attention to the challenge of being a parent to twins. Along with the obvious extreme lack of sleep, the other conundrum is: do we need to buy two of everything, and if not, how will we share it? If you are a parent of twins (or two siblings), the scenario and questions below will be familiar:
1. Will they both need parallel access to it at the same time?
(Items such as bottles, clothing, pacifiers, and car seats meet this criterion)
This is a simple example of “shared nothing” with no contention: each boy has his own item and can happily focus on it in a self-contained manner. With their own bottles at feeding time, we can feed both boys simultaneously, in half the time it would take sequentially. SN systems such as Hadoop scale the same way: massive parallelization by simply adding more nodes, each with its own locally attached disk, lets them handle Big Data volumes (see the sketch below).
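To make the parallel-feeding analogy concrete, here is a minimal Python sketch of shared-nothing parallelism. It is not Hadoop; the partitions and the process_partition function are made up for illustration, but the principle is the same: each worker owns its data, nothing is shared, and you scale by adding workers.

```python
# A toy illustration of shared-nothing parallelism: each worker "feeds"
# on its own partition independently, with no shared state and no contention.
# The partition contents here are hypothetical, purely for illustration.
from multiprocessing import Pool

# Each node/worker owns its data; nothing is shared between nodes.
PARTITIONS = {
    "node1": ["bottle", "bottle", "pacifier"],
    "node2": ["car seat", "bottle", "blanket"],
}

def process_partition(node):
    """Work only on locally 'attached' data -- no coordination needed."""
    local_data = PARTITIONS[node]
    return node, len(local_data)

if __name__ == "__main__":
    # Scaling out is just adding more workers, one per partition.
    with Pool(len(PARTITIONS)) as pool:
        for node, count in pool.map(process_partition, PARTITIONS):
            print(f"{node} processed {count} items independently")
```

Because the workers never touch each other’s data, there are no locks and no single point of contention, so throughput grows roughly linearly as you add workers.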
2. How will we provide access to an item if one or the other needs it?
(For example, using a fixed or portable changing table)
With a changing table built into a dresser, we are pretty much saying that all diaper changing will be done in a fixed location (at least around the house). We therefore bring each boy to the nursery where the table resides. Hadoop MapReduce does the same thing: processing is moved to keep the work as close to the data as possible, reducing network traffic and avoiding moving the data itself, known as “data shipping”. In contrast, Shared Disk and Shared Everything architectures don’t have a “data shipping” issue because each node has access to all of the data. A toy sketch of this data-locality idea follows.
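Here the sketch is a made-up scheduler that assigns each task to a node already holding the relevant data block, so we ship a small function across the network instead of a large chunk of data. The block locations and the schedule() helper are hypothetical; real Hadoop scheduling is far more involved.

```python
# A minimal sketch of "move the computation to the data". Block names,
# node names, and the scheduling logic are all hypothetical.

# Which nodes hold a replica of each data block (as HDFS would track).
BLOCK_LOCATIONS = {
    "block-1": ["nodeA", "nodeB"],
    "block-2": ["nodeB", "nodeC"],
    "block-3": ["nodeA", "nodeC"],
}

def schedule(blocks):
    """Assign each task to a node that already stores the block,
    shipping the (small) function instead of the (large) data."""
    assignments = {}
    load = {}  # crude load balancing across candidate nodes
    for block, replicas in blocks.items():
        node = min(replicas, key=lambda n: load.get(n, 0))
        assignments[block] = node
        load[node] = load.get(node, 0) + 1
    return assignments

for block, node in schedule(BLOCK_LOCATIONS).items():
    print(f"run the map task for {block} locally on {node}")
```

Shipping the function (a few kilobytes) instead of the block (typically 64–128 MB in HDFS) is what keeps network traffic manageable at scale.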
Obviously Shared Nothing vs. Shared Disk vs. Shared Everything is a much more complex and sophisticated technical topic, which I won’t be covering today. Throw in OLTP vs. OLAP and the Cloud and you have an even spicier debate. If you are interested, check the links below for some good discussion.
- http://jhingran.typepad.com/anant_jhingrans_musings/2010/02/shared-nothing-vs-shared-disks-the-cloud-sequel.html
- http://scaledb.blogspot.com/2010/02/cloud-computing-shared-disk-vs-shared.html
- http://kevinclosson.wordpress.com/2007/08/09/nearly-free-or-not-gridsql-for-enterprisedb-is-simply-better-than-real-application-clusters-it-is-shared-nothing-architecture-after-all/
And as usual, excellent reference material is available through links on Wikipedia:
- The Case for Shared Nothing Architecture by Michael Stonebraker (PDF). Originally published in Database Engineering, Volume 9, Number 1 (1986).
- Independent article comparing Shared Nothing and Shared Disk
- Article on Shared Nothing from the point of view of a Shared Disk vendor (PDF)
- Article on Shared Nothing from the point of view of a Shared Nothing vendor (PDF)
To tie off this post with my two main focuses: RainStor’s unique architecture physically stores data using a “Shared Nothing” paradigm while providing access from any SN node, mitigating the “data shipping” network bandwidth problem by significantly compressing the data. This is one reason why RainStor is an ideal solution for ingesting and querying ever-growing Big Data volumes that must be retained at petabyte scale. Meanwhile, Parker and Ryan are themselves growing and scaling at an alarming rate.