How Much Is That Hadoop Cluster Really Costing You?

Also published at http://rainstor.com/how-much-is-that-hadoop-cluster-really-costing-you/

Last month when we released our RainStor for Big Data Analytics product edition that runs natively on Hadoop, we raised a lot of eyebrows with two of the points that we were making:

  1. Compression can dramatically reduce TCO by cutting the number of Hadoop nodes needed
  2. SQL access to the compressed data in HDFS can be achieved without transferring the data out of Hadoop or using specialized tools

In my post “Feeding the Elephant Peanuts and Making Pig Fly” I talked about how we could achieve massive compression, provide SQL-92 access and boost the performance of MapReduce jobs. This post revisits the first point, TCO. I’ll cover the second point in a future blog post.

The reason I decided to go over the TCO point again is that I had the pleasure of chatting with David Merrill, Hitachi Data Systems Chief Economist (@StoragEcon), on this very topic. I have been a fan of his white papers and noted that he had started writing about Big Data storage economics on his blog, The Storage Economist. For those of you who are unfamiliar with his work, a good example is his white paper Four Principles For Reducing Total Cost of Ownership, which provides a pragmatic and quantifiable look at all of the factors that contribute to operating and running different types of storage.

We talked about his research and analysis and how, from a purely bare-metal CPU, disk and component perspective, commodity clusters such as Hadoop can appear to provide lower TCO on a cost-per-usable-TB basis. However, as his research showed, when cost per written-to TB is used instead, the equation is turned completely upside down. As David concluded, “don’t confuse price and cost, and look at a longer time horizon when planning and building big data storage infrastructures.”
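To make the usable-TB versus written-to-TB distinction concrete, here is a minimal sketch of the arithmetic. The 30% utilization figure is purely an illustrative assumption of mine, not one of David’s published numbers; the $3,000-per-usable-TB figure is the DAS low point mentioned below.

```python
# Illustrative only: cost per usable TB vs. cost per written-to TB.
# The utilization figure is a hypothetical placeholder.

usable_tb = 100                 # capacity available to applications
cost_per_usable_tb = 3000       # dollars per usable TB (acquire + operate)
utilization = 0.30              # assumed fraction of usable capacity actually written to

written_to_tb = usable_tb * utilization
total_cost = usable_tb * cost_per_usable_tb

print(f"Cost per usable TB:     ${total_cost / usable_tb:,.0f}")
print(f"Cost per written-to TB: ${total_cost / written_to_tb:,.0f}")
# At 30% utilization, $3,000 per usable TB becomes $10,000 per written-to TB.
```

The point of the sketch is simply that the less of your purchased capacity you actually write to, the more each written-to terabyte really costs you.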

In our January release on Hadoop we included an infographic illustrating how RainStor’s compression can significantly drive down the physical storage and therefore the number of nodes required. We used a simple operating cost metric of $3,000 per node (containing 12TB of raw disk), which resulted in a TCO savings (buying and operating the cluster) of over $1M over three years for storing 300TB of user data. If you look at David’s numbers in his post, they range from a low of around $3,000 per usable TB for DAS to a cumulative high of $45,000 per written-to TB. Granted, the research was done in 2009 and acquisition costs have plummeted since, but because the price of floor space, heating, cooling and so on just continues to grow, $3K per node can be considered reasonably conservative. His post also pointed out that, in general, the server CPUs with DAS were tasked with a lot of “mundane tasks.”

As part of our conversation, I detailed how RainStor’s unique value and pattern de-duplication process spends CPU cycles up front to build highly compressed partitions. This not only saves physical disk space, but also uses the collected metadata to make data access more intelligent and efficient, and it magnifies the performance of the commodity disks by retrieving more data per block when required. In other words, we use more CPU on load to improve performance overall on access.

All of this reflects only the baseline storage and access costs; it doesn’t yet include the cost of administering the nodes and cluster (both software and personnel), or any development and integration costs, which I will cover in my next post.
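For readers who want to see how that kind of node-count and savings figure might be derived, here is a rough sketch of the arithmetic. Only the 300TB of user data, the 12TB of raw disk per node and the $3,000-per-node operating figure come from the example above; the 20x compression ratio, 3x HDFS replication, $4,000 per-node acquisition cost, and the reading of $3,000 per node as an annual cost are all illustrative assumptions of mine, not figures from the infographic.

```python
import math

# Rough sketch of the node-count and three-year TCO arithmetic.
# Compression ratio, replication factor and acquisition cost are illustrative assumptions.

def nodes_required(user_data_tb, compression_ratio, replication=3, raw_tb_per_node=12):
    """Nodes needed to hold the data on raw disk, after compression and HDFS replication."""
    on_disk_tb = user_data_tb / compression_ratio
    raw_tb = on_disk_tb * replication
    return max(1, math.ceil(raw_tb / raw_tb_per_node))

def three_year_cost(nodes, acquisition_per_node=4000, operating_per_node_per_year=3000):
    """Buy-and-operate cost for the cluster over a three-year horizon."""
    return nodes * (acquisition_per_node + 3 * operating_per_node_per_year)

baseline = nodes_required(300, compression_ratio=1)     # 300 TB of user data, uncompressed
compressed = nodes_required(300, compression_ratio=20)  # hypothetical 20x compression
savings = three_year_cost(baseline) - three_year_cost(compressed)

print(f"{baseline} nodes uncompressed vs. {compressed} nodes compressed")
print(f"Approximate three-year savings: ${savings:,.0f}")
```

With these placeholder inputs the baseline cluster comes out at 75 nodes versus a handful once compressed, and the three-year savings land in the same neighborhood as the infographic’s figure; the exact dollar amount obviously shifts with whatever compression ratio and hardware prices you plug in.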

Bottom line, David’s research and white papers over the years have contributed greatly to our understanding of the overall TCO of storage and to the benefits of widely adopted technologies such as thin provisioning. Now he is pointing out that Big Data Hadoop clusters have hidden hardware operating costs, and it’s best to go into such endeavors with eyes (and pocketbooks) wide open. Meanwhile, here at RainStor we continue to focus on driving down the TCO of your choice of storage and configuration through our database. In the end, David and I both agreed that, services aside, the best TCO lies in the efficient selection and implementation of hardware and software for the given use case, and that the right combination is what will make Big Data manageable and affordable.
