As the Facebook IPO frenzy built toward the pricing and Facebook prepares to start trading this Friday (UPDATE: Facebook has priced at $38, giving it a market cap of about $104B), I got to thinking about how much data I have uploaded or contributed to Facebook over the last five years. It turns out you can get your own personal slice of the Big Data in Facebook back as a tidy zip-file snapshot of everything you have done, uploaded, or had commented on. If you want to try it yourself, take a look at the instructions here.
Since I joined Facebook in 2007, it appears I have generated or uploaded about 1.5GB of data. The zip file I got back (after about 5 hours, counting both file-preparation time and the bandwidth needed to download it) contains a nice HTML index page, which presents a stripped-down version of the photos and comments in chronological order, just like your wall. The basic capability was made available in 2010 and later extended with an enhanced archive option after complaints by Irish users, who reported their concerns to the Irish Data Protection Commissioner.
So how much Big Data is in Facebook, and where is it kept? The details that make the rounds of Hadoop and Big Data conferences focus mainly on the huge clusters behind Facebook's data warehouses, which run on Hadoop and Hive. There was an interesting article on Facebook's corporate blog last year about their massive Hadoop migration (30PB worth) to a larger data center. On a day-to-day basis, however, the repository and platform you actually interact with is still powered by MySQL databases.
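To make that warehouse-versus-serving split a little more concrete, here is a minimal Python sketch of the kind of scan-heavy query an analytics job might run against a Hive warehouse, as opposed to the per-request lookups the live site sends to MySQL. This is purely illustrative: the host, database, and table names are assumptions, and it uses the pyhive package rather than anything Facebook has described.

```python
from pyhive import hive   # pip install 'pyhive[hive]'

# Hypothetical HiveServer2 endpoint and warehouse table (not Facebook's).
conn = hive.Connection(host="hive.example.internal", port=10000,
                       username="analyst", database="warehouse")
cursor = conn.cursor()

# Scan-heavy aggregation over partitioned event logs -- the kind of batch
# analytics job that runs on Hadoop/Hive rather than on the serving MySQL tier.
cursor.execute("""
    SELECT ds, COUNT(*) AS photo_uploads
    FROM photo_upload_events
    WHERE ds >= '2012-05-01'
    GROUP BY ds
    ORDER BY ds
""")
for day, uploads in cursor.fetchall():
    print(day, uploads)
```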
Given the publicity around the idea that traditional "relational" databases can't handle internet scale and that NoSQL databases are the way to go, the fact that Facebook still runs MySQL as its backend is eye-opening. It has prompted database experts such as Michael Stonebraker (of Vertica and VoltDB fame) to claim that Facebook is "trapped in a MySQL fate worse than death". This was followed up by another GigaOM article detailing how Facebook is able to make MySQL scale.
The article details how Facebook does not rely on MySQL alone: a massive layer of memcached servers sits in front of it, acting as an in-memory database, because the MySQL servers on their own couldn't possibly handle the read load of live Facebook traffic. For functionality such as the Facebook Inbox, Hadoop and HBase are used instead, and Hadoop also serves as the backup store for the MySQL data.
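In broad strokes, that read path is the classic cache-aside pattern: check memcached first, fall back to MySQL on a miss, then populate the cache. The sketch below shows the idea in Python; it is not Facebook's actual code, the hostnames, table, and column names are hypothetical, and it assumes the pymemcache and mysql-connector-python packages.

```python
import json

import mysql.connector                      # pip install mysql-connector-python
from pymemcache.client.base import Client   # pip install pymemcache

CACHE_TTL_SECONDS = 300

# Hypothetical endpoints; a real deployment shards both tiers across many hosts.
cache = Client(("memcache.example.internal", 11211))
db = mysql.connector.connect(
    host="mysql.example.internal", user="app", password="secret", database="social"
)

def get_user_profile(user_id):
    """Read-through lookup: memcached absorbs repeat reads, MySQL stays the
    system of record."""
    key = "user:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit: no database round trip

    cursor = db.cursor(dictionary=True)
    cursor.execute("SELECT id, name, joined FROM users WHERE id = %s", (user_id,))
    row = cursor.fetchone()
    cursor.close()
    if row is None:
        return None                         # unknown user; nothing to cache

    row["joined"] = str(row["joined"])      # dates aren't JSON-serialisable
    cache.set(key, json.dumps(row), expire=CACHE_TTL_SECONDS)
    return row

if __name__ == "__main__":
    print(get_user_profile(12345))  # a second call for the same id hits memcached
```

A toy sketch like this glosses over the genuinely hard parts at Facebook's scale, namely invalidating cached entries when MySQL is written to and sharding both tiers across thousands of servers.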
Coming back to my personal download of my Facebook information, I was quite impressed by the time it took (again, about 5 hours) to get a bulk copy of my personal data. I doubt that many users take advantage of the download option today. But with more and more users joining, and more data being uploaded per account, it will be interesting to see whether the MySQL architecture can continue to hold up, and how Facebook's use of Hadoop, Hive, and other Apache projects will evolve for Big Data warehousing and analytics.