The Official Klout Blog

Big Data, Bigger Brains

November 3rd, 2011 by Dave Mariani

The web continues to produce a deluge of data and signals about users and their activities online.  At Klout, we are doing our best to translate these signals into a reliable measure of user influence and reach.  Like many other startups and a growing number of large enterprises, Klout uses Hadoop to crunch large volumes of data using clusters of commodity servers.  We capture and process over 3 billion signals a day, and Hadoop is an excellent, horizontally scalable platform for doing just that.

What Ever Happened to Business Intelligence?

Besides ingesting and processing big data, we also need to deliver deep data analytics for our internal and external customers.  We need to uncover traffic and engagement trends for improving our consumer experiences while driving ROI for our advertisers and business partners.  Hadoop, by its nature, is a batch processing system and is not yet suitable for delivering interactive, “anything by anything” queries.  But there is hope for those who need to support business intelligence workloads for Big Data sets, using open source software and inexpensive hardware.  Hive, which runs on Hadoop, exposes a SQL interface and presents a relational “database” view of your data on top of your raw Hadoop files.  But Hive can’t support real time, highly interactive business intelligence queries because it still behaves like a batch processing system, generating MapReduce code to do its work.
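To make that last point concrete, here is a toy Python sketch (not Hive's actual code) of what a simple `GROUP BY` query turns into once it is compiled down to map, shuffle, and reduce phases over raw files.  The record layout is an illustrative assumption:

```python
from collections import defaultdict

# Toy "raw Hadoop files": (user_id, network, signal_count) records.
records = [
    ("u1", "twitter", 120),
    ("u2", "twitter", 45),
    ("u1", "facebook", 80),
    ("u3", "facebook", 10),
]

# Map phase: emit one (key, value) pair per record.
mapped = [(network, count) for _, network, count in records]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each group, i.e. SELECT network, SUM(signal_count)
# FROM records GROUP BY network.
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)  # {'twitter': 165, 'facebook': 90}
```

Every query, no matter how small, pays the cost of scheduling and running these batch phases end to end, which is why Hive alone can't feel interactive.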

A Business Intelligence “Index” for Hive

At Klout, we developed a unique solution to this problem that leverages Hadoop’s scale and cost effectiveness while delivering “speed of thought” ad hoc queries.  The key was to create a multi-dimensional query “index”, or cube, that sits in front of Hive and serves realtime, ad hoc, “anything by anything” queries using the upcoming version of Microsoft SQL Server Analysis Services (code name “Denali”).  Yes, you read it right.  We use a Windows-based multi-dimensional (MOLAP) product from Microsoft to load 350 million rows of Hive data per day and achieve an average query response time of under 10 seconds on 35 billion rows of data.  All queries are served by a single, $7,000 server with internal RAID storage.
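The idea behind a MOLAP cube is easy to sketch, even though SSAS's internals are far more sophisticated.  Here is a toy Python illustration (the dimensions and fact rows are made up) of pre-aggregating a measure across every combination of dimensions, so that an ad hoc "anything by anything" query becomes a lookup instead of a scan:

```python
from collections import defaultdict
from itertools import combinations

# Toy fact rows: (day, network, country) dimensions plus a "signals" measure.
facts = [
    ("2011-11-01", "twitter", "US", 100),
    ("2011-11-01", "twitter", "UK", 40),
    ("2011-11-01", "facebook", "US", 70),
    ("2011-11-02", "twitter", "US", 90),
]
dims = ("day", "network", "country")

# Build the cube: pre-aggregate the measure for every subset of dimensions.
cube = defaultdict(int)
for *values, measure in facts:
    row = dict(zip(dims, values))
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            key = (subset, tuple(row[d] for d in subset))
            cube[key] += measure

# An "anything by anything" query is now a dictionary lookup, not a table scan.
def query(**filters):
    subset = tuple(sorted(filters, key=dims.index))
    return cube[(subset, tuple(filters[d] for d in subset))]

print(query(network="twitter"))               # 230
print(query(day="2011-11-01", country="US"))  # 170
```

The work of scanning 35 billion rows is paid once, at load time, rather than on every query.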

An Unholy Alliance?

With all of the specialized business intelligence appliances on the market (Vertica, Greenplum, Aster Data, Netezza, Teradata), why did we pick Microsoft SQL Server Analysis Services (SSAS)?  Because it’s a full featured business intelligence engine, it’s in active development, it’s inexpensive, it has widespread query tool support and great documentation, and it scales.  How much does it scale?  At Yahoo!, I deployed SQL Server Analysis Services to support our display advertising business, which ingested 3.5 billion rows a day and delivered average query times of less than 7 seconds on 500 billion rows of data.  Just as important, SSAS provides a true business view of data to end users in the form of a cube with measures and dimensions, hiding the complexities of SQL and delivering a rich semantic layer on top of your raw, unstructured Hadoop data.  Hive can do what it does best by providing a Cloud-based, inexpensive, centralized data warehouse, while SSAS manages all the data aggregates to support realtime ad hoc queries.  In fact, to deliver equivalent, performant cube functionality using traditional SQL databases would require generating thousands of data aggregates, creating complexity and making system changes onerous.
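Where do "thousands of aggregates" come from?  A short back-of-the-envelope sketch in Python (the dimension counts are illustrative, not Klout's schema) shows how quickly the combinations grow:

```python
from math import prod

# With n flat dimensions, a complete set of roll-ups is one aggregate per
# subset of dimensions: 2**n grouping combinations.
def flat_groupings(n_dims: int) -> int:
    return 2 ** n_dims

# With hierarchies (e.g. day -> month -> year), each dimension contributes
# (levels + 1) choices: one per level, plus "not grouped by this dimension".
def hierarchical_groupings(levels_per_dim) -> int:
    return prod(levels + 1 for levels in levels_per_dim)

print(flat_groupings(12))                    # 4096
print(hierarchical_groupings([3, 2, 2, 1]))  # 4 * 3 * 3 * 2 = 72
```

Hand-building and maintaining that many summary tables in a traditional SQL database is exactly the complexity a cube engine absorbs for you.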

What’s Wrong With This Picture?

There’s only one problem with this solution: there is not yet a direct connection between Hive and SSAS for loading a cube.  So, we use Sqoop to load each day’s data into MySQL, essentially using MySQL as a staging area for loading data into the cube.  This solution works and scales pretty well, but it introduces unnecessary data latency.
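Conceptually, the staging hop looks like this minimal Python sketch, using an in-memory SQLite database as a stand-in for the MySQL staging area (in production, Sqoop performs the actual Hive-to-MySQL transfer, and SSAS reads the staged table over a SQL connection; the table and rows here are invented for illustration):

```python
import sqlite3

# Stand-in for the MySQL staging database that Sqoop loads from Hive.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE daily_signals (network TEXT, signals INTEGER)")

# Step 1: the "Sqoop export" -- rows pulled out of Hive for one day land
# in the staging table first, which is where the extra latency comes from.
hive_rows = [("twitter", 165), ("facebook", 90)]
staging.executemany("INSERT INTO daily_signals VALUES (?, ?)", hive_rows)

# Step 2: the cube load -- the cube engine reads the staged rows via SQL.
cube_input = staging.execute(
    "SELECT network, SUM(signals) FROM daily_signals GROUP BY network"
).fetchall()
print(sorted(cube_input))  # [('facebook', 90), ('twitter', 165)]
```

Every daily load pays for two hops (Hive to staging, staging to cube) instead of one, which is the latency a direct Hive connector would eliminate.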

What’s Next

We’re working with Microsoft to develop direct connectivity between Hive and our cube using their just announced support for Hive and Hadoop.  By eliminating the staging server, we will reduce our data latency dramatically and eliminate a key dependency.  The good news is that you can do what we did here at Klout right now, for no cost.  You can download CTP3 of SQL Server Code Named “Denali” for free here or through your MSDN account.  Stay tuned for further updates.

Do you want to take on Big Data challenges like this? We’re looking for great engineers who can think out of the box.  Check us out.


This entry was posted on Thursday, November 3rd, 2011 at 7:32 am and is filed under engineering.