I attended an IEEE-CNSV (Consultants Network of Silicon Valley) meeting on Cloud Computing Paradigms: MapReduce, Hadoop and Cascading in Santa Clara last night. It was extremely well attended, with 100+ people listening to Chris K. Wensel of Scale Unlimited provide an excellent, very technical overview of Hadoop, MapReduce and Cascading, an open source project originated by Chris that provides a feature-rich API for more easily implementing complex distributed processes on Hadoop and MapReduce.
The room was filled with very smart engineers, scientists (one guy had over 50 patents to his name) and an abundance of “expert witnesses”. So I’m sure they had no problem keeping up with Chris’ presentation, which I found very enlightening. However, it took a significant combination of leveraging my Computer Science degree (where I majored in Artificial Intelligence, writing systems in Lisp and Prolog), my years of working with Oracle RDBMS, and a healthy dose of Java and UNIX/Linux terminology comprehension for me to get the most out of the presentation. If you’d rather dig technically deeper into Hadoop, MapReduce and Cascading, I’d suggest you take a look at Chris’ excellent site www.cascading.org.
For those of you still with me here, let me try to sum up in non-technical terms what I think Hadoop and MapReduce are all about.
First, a quick fun fact: the name Hadoop comes from the name of a plush elephant toy (pic to the left is not the original) belonging to the child of the original developer, Doug Cutting (who now works for Cloudera). Hadoop is an open source software platform that allows you to design and run applications that need to store and process humongous amounts of data. We are talking petabytes here, not your garden variety Oracle business application with 20 to 50 terabytes. In everyday terms: as of Jan 2008, Google says it processes about 20 petabytes a day, and the New York Times speculated in 2006 that “the entire works of humankind, from the beginning of recorded history, in all languages” would amount to 50 petabytes of data, which for you data warehousing buffs means it could be stored in a Teradata 12 database. So it’s not surprising that many of the major social networking apps like Facebook (approx. 1.5 petabytes of user photos, roughly 10 billion photos) use Hadoop.
Hadoop distributes data and processing across clusters of computers (commodity hardware), and as a result can execute in parallel, enabling rapid insertion and querying of information. Recent performance benchmarks from Yahoo show Hadoop sorting a petabyte in 16.25 hours and a terabyte in 62 seconds, reclaiming a record previously held by Google. Another benefit of Hadoop is its built-in fault tolerance: Hadoop automatically maintains multiple copies of the data and can redeploy processing in case of failure.
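For the curious, here is a toy sketch of the replication idea in Python. This is a conceptual illustration only, not HDFS code; the node names and block counts are made up, though keeping three copies of each block is indeed Hadoop’s default.

```python
import random

REPLICATION = 3  # Hadoop's default replication factor

# Place each data block on REPLICATION distinct nodes.
nodes = ['node1', 'node2', 'node3', 'node4', 'node5']
blocks = {'block%d' % i: random.sample(nodes, REPLICATION)
          for i in range(4)}

# Simulate a node failure: every block is still readable from a
# surviving replica, so processing can be redeployed elsewhere.
failed = 'node2'
for block, replicas in blocks.items():
    survivors = [n for n in replicas if n != failed]
    print(block, '-> read from', survivors[0])
```

Lose a node and nothing is lost: each block still has at least two live copies, which is exactly why Hadoop can shrug off commodity hardware failures.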
Hadoop implements MapReduce, a software framework introduced by Google for distributed computing. As Chris described in his presentation, MapReduce is actually Map, Group, then Reduce. Without getting too technical, think of the Map phase as feeding in the data as keys and values. For example, assume you have 2 records. The first record has a key of first name ‘Ramon’ with a value of ‘1’. The second record looks exactly the same: another key of ‘Ramon’, also with a value of ‘1’. The Group phase would then reorganize the information into ‘Ramon’ with two 1’s. The Reduce phase would then perform the aggregation (a count), and the stored result would be ‘Ramon’ and ‘2’. This is of course a trivial example and the most common one typically shown in MapReduce examples. But the basic premise is that the data can be efficiently distributed, stored and retrieved at a high rate of performance within the cluster. (Update: For a great way to explain MapReduce using a worker/visual analogy see http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/ and also visit Kristina Chodorow’s (of MongoDB fame) blog for an entertaining Star Trek analogy.)
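To make the three phases concrete, here is a minimal sketch in plain Python (no Hadoop required; the function names are mine, not Hadoop’s API):

```python
from itertools import groupby
from operator import itemgetter

# Map: emit a (key, value) pair for each input record.
def map_phase(records):
    return [(name, 1) for name in records]

# Group: collect all values sharing a key -- Hadoop does this in the
# shuffle/sort step between map and reduce.
def group_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# Reduce: aggregate each key's values, here a simple count.
def reduce_phase(grouped):
    return [(key, sum(values)) for key, values in grouped]

records = ['Ramon', 'Ramon']    # the two records from the example
pairs = map_phase(records)      # [('Ramon', 1), ('Ramon', 1)]
grouped = group_phase(pairs)    # [('Ramon', [1, 1])]
print(reduce_phase(grouped))    # [('Ramon', 2)]
```

In a real cluster the map and reduce functions run in parallel on many machines, and the framework handles the grouping, distribution and fault tolerance for you.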
All sounds good, right? Some of you as old as me 🙂 might feel that this smacks a little of pre-RDBMS file systems back in the early days of computing, where you had to write and manage all of the routines to retrieve and store data through low-level programming. For example, there is no “schema” in Hadoop/MapReduce, nor any transactional boundaries (commit processing). Chris’ Cascading project aims to make using MapReduce easier by allowing you to focus on the fields you want to store and retrieve, rather than the heavy lifting of having to think in raw MapReduce concepts. However, there are still major opportunities for improving ease of use, judging by the number of times Chris mentioned how to “game” or “hack” Hadoop to do what you want.
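To show what “focus on the fields” means, here is a hypothetical, Cascading-flavored pipeline sketched in Python. Cascading itself is a Java API; the Pipe class below is my own invention, purely to illustrate the style of abstraction:

```python
# A made-up, field-oriented pipeline: you name the fields flowing
# through and the operations on them; a framework like Cascading
# plans the underlying map/group/reduce jobs for you.

class Pipe:
    def __init__(self, fields, rows):
        self.fields, self.rows = fields, rows

    def group_by(self, field):
        index = self.fields.index(field)
        groups = {}
        for row in self.rows:
            groups.setdefault(row[index], []).append(row)
        return groups

pipe = Pipe(fields=['first_name', 'count'],
            rows=[('Ramon', 1), ('Ramon', 1)])
totals = {name: sum(row[1] for row in rows)
          for name, rows in pipe.group_by('first_name').items()}
print(totals)  # {'Ramon': 2}
```

The point is the level of abstraction: you reason about named fields and grouping, not about keys, values and job plumbing.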
This is an opportunity being exploited by companies such as Cloudera, who offer consulting services as well as a neat wizard that helps define your Hadoop cluster by answering basic questions about your hardware configuration. Other companies such as Greenplum (eBay, I believe, is a customer) and Aster Data (another high profile customer is MySpace) also use MapReduce (albeit leveraging Postgres rather than Hadoop) in a more user friendly way, allowing SQL statements (or a variant thereof) in their commercial high performance analytical databases, thereby introducing MapReduce to more mainstream business use. That puts them squarely in competition with the plethora of columnar databases out there, such as Vertica and ParAccel, in the land grab for the high performance analytical data warehousing market. Evan Levy of Baseline Consulting recently wrote a nice blog post on this very topic, and Merv Adrian’s blog has the very latest, as he has recently spoken to or visited many of the major analytics DB players.
Since Hadoop is open source, there are a number of initiatives to improve the system and overcome issues that prevent Hadoop from being used in more traditional RDBMS-style processing scenarios. But is this the wise thing to do? IT and computing, like most things in life, are cyclical in nature. Mainframes were outdated and on their way out in the 90s with the rise of client/server computing; a few years later the mainframe, or server-based computing (with thin clients), came back into vogue, and now cloud computing is the rage. Similarly, it appears to me that primitive low-level file storage with programmatic manipulation was succeeded by RDBMS systems with SQL, and now the lower level of file system storage through Hadoop, with “roll your own SQL queries”, is back.
Even though Hadoop is getting more and more popular for processing large datasets, the dirty little secret might be that Hadoop is not quite enterprise class yet, with a single point of failure in the master node, weak security standards and relatively poor binary data compression, which, combined with replication requirements, actually results in several times more physical storage required (three copies is the default replication factor) and comes with data access performance penalties. Also, not everyone can run a Facebook- or Google-sized server farm to perform their calculations, no matter how cheap the storage or server hardware.
It will be interesting to see how everything plays out. Certainly there is no disputing that apps such as Facebook and ShareThis (another Aster Data customer) could only exist today with Hadoop and/or MapReduce-style implementations; Oracle RDBMS simply isn’t up to the task and is more suited to its current business usage. Which brings me back to the context of the original title of my post: will Oracle just sit around and watch Hadoop and MapReduce take off? On the high performance analytics side they do offer Oracle Exadata, but will they make a move to acquire one of the many startups out there? Sybase already has a columnar DB offering through their IQ database; in fact, they have sued Vertica claiming patent infringement. Certainly RDBMSs are better at real-time processing while Hadoop is more batch oriented, but will Oracle or any of the big RDBMS vendors offer anything themselves in the area of MapReduce? Will they get in the game? Or will they stay “Irelephant” in this area and allow Greenplum, Aster Data or any of the columnar database vendors to become the next Oracle for non-RDBMS big data?
Hopefully you got a sense of Hadoop and MapReduce from this post. Like the plush elephant Hadoop is named after, I’m sure it’s something you’ll never forget 🙂
Thanks again to Chris Wensel and the IEEE-CNSV for the Hadoop primer. Chris is presenting again at the SDForum on 5/27/09.
Here’s a blog post on Hadoop from Cloudera’s Christophe Bisciglia titled “5 Common Questions About Hadoop”, which is very informative: http://www.cloudera.com/blog/2009/05/14/5-common-questions-about-hadoop/
Nice post, Ramon. Just a note that Aster Data is a relational database that has created tight coupling of SQL with MapReduce to provide the expressive flexibility and performance of MapReduce, while making it “callable” through standard SQL. This native integration enables a whole new class of powerful SQL/MR functions that can be written and then used by “everyday” business analysts through traditional BI tools. One example of this is a SQL/MR function we’ve developed called “nPath”, which enables elegant time-series analysis of data (used a lot in click-stream analysis) with a single pass of the database, expressible in SQL. Lots of resources here for people wanting to know the business applications of MapReduce in the real world, how we’ve integrated SQL with MapReduce, how to write the functions, and more: http://www.asterdata.com/mapreduce/index.php
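To give a flavor of what a single-pass, time-ordered function like nPath computes, here is a rough Python sketch, an illustration only with made-up data (Aster’s real nPath is expressed in SQL/MR, not Python, and its pattern language is richer than this simple subsequence match):

```python
# One pass over a clickstream sorted by (user, time), looking for the
# ordered pattern search -> product -> checkout for each user.

pattern = ['search', 'product', 'checkout']
events = [  # (user, page), already sorted by user and time
    ('alice', 'search'), ('alice', 'product'), ('alice', 'checkout'),
    ('bob', 'search'), ('bob', 'home'), ('bob', 'product'),
]

matched, progress = [], {}
for user, page in events:
    step = progress.get(user, 0)
    if page == pattern[step]:
        step += 1
        if step == len(pattern):   # full path completed
            matched.append(user)
            step = 0
    progress[user] = step

print(matched)  # ['alice']
```

Doing this in SQL alone typically requires awkward self-joins; pushing the pattern logic into a single scan over the data is the appeal here.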
Thanks Steve. Aster Data has it exactly right and your technology is exactly what the business world needs. Kudos to you.
There is no argument that MapReduce/Hadoop is, and has already proven itself to be, a highly scalable and fault-tolerant mechanism over the cloud for data-intensive operations, to compute and to aggregate; at the end of the day it’s the same old distributed computing via grid jobs that’s showing its magic.
The whole NoSQL movement and everything related to it isn’t about shedding everything that exists and going a new way; it’s about making people aware, letting people out of local maxima and helping them see the world beyond, which is to realize “there is more than one way to do it” (the Perl mantra), and there always is.
Always sticking to traditional approaches or systems to do some next generation or different sort of job surely isn’t the way to go; we have to change the solution space when the problem space changes.
I think the future is about hybrid technology, when it won’t even be required to be called hybrid anyway… We already see combinations of technology working complementarily with each other: when the [Hadoop + HBase + Hive] stack Bradford mentioned is at work, when Facebook’s [Hadoop + Cassandra + Hive] is at work, when LinkedIn’s [Hadoop + Voldemort + RDBMS (Oracle, MySQL)] is at work. (Reference: http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html)
The line between traditional RDBMS and NoSQL systems is already blurring, as we see HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf) combining the power of [Hadoop + RDBMS (PostgreSQL) + Hive] to do the job. MongoDB and Yahoo Sherpa are both working to provide a scalable data storage system with as many friendly querying capabilities as possible. (Reference: http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html)
Very soon, I believe, big vendors like Oracle are also going to introduce such parallel DBMSs, with a hybrid combination of some of these NoSQL system approaches in the backend, as other closed-source data warehousing vendors Greenplum (www.greenplum.com/technology/mapreduce) and Aster Data (http://www.asterdata.com/product/mapreduce.php) have done.
(Reference: http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing)
The future is of parallel DBMSs functioning as one package (Reference: http://www.computerworld.com/s/article/print/9131526/Researchers_Databases_still_beat_Google_s_MapReduce), where we don’t need to worry about integrating the components to make it work. Still, whenever that arrives, one solution of course won’t cater to all problems; problems will evolve with our solutions too. We will adapt, and should continue picking the hat that (most closely) fits the head.