Guus Bosman

software engineering director


Cloudera Sessions

Yesterday I attended the Cloudera Sessions, an event on using Hadoop, HBase and other "big data" tools organized by Cloudera.

Big Data is an interesting field and I enjoyed this well-organized day. Cloudera is a provider of commercial solutions around the open-source Hadoop stack. There were speakers from Cloudera and several of their commercial partners, talking about the practical experiences so far and plans for the future.

An event like this is meant to convince people to use Cloudera's stuff -- but it is also a good way to find out how people are actually using Hadoop in commercial applications. This is the part I liked best. There were several speakers who talked about their (very) recent experience with commercial roll-outs and I spoke to people at my table and over lunch about what they are doing with this technology.

Many companies are still experimenting, but there are several early adopters who have real production deployments. Unsurprisingly, the latter include many start-ups.

Two years ago I learned more about the technology behind Hadoop from a great book, Data-Intensive Text Processing with MapReduce.


Mike Olson, the CEO of Cloudera, gave a nice overview of the market, which he called "Next Gen Data Management". Instead of talking about gigabytes of data, we're moving towards petabytes of data, much of which is machine-generated. By using the MapReduce paradigm on a cluster where every node combines data storage with computing power, an enormous leap in performance and cost-effectiveness can be achieved.

"Pushing the analysis to the data", as he described it, is somewhat similar in approach to the (fanciful) agent approach that was popular at the VU when I graduated there.

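To make the MapReduce idea a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and it was not shown at the event: the function names and the sample log lines are my own assumptions, and everything runs in a single process. It only illustrates the shape of the paradigm: a map step that runs per partition (on a real cluster, on the node that stores that partition), a shuffle that groups values by key, and a reduce step that combines them.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(partitions):
    """Run map, shuffle and reduce over a list of partitions.

    Each inner list stands in for the data stored on one node; here the
    partitions are simply processed one after another in one process.
    """
    mapped = []
    for partition in partitions:
        for record in partition:
            mapped.extend(map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data: two "nodes", each holding a few log lines.
partitions = [
    ["error disk full", "ok"],
    ["error timeout", "ok ok"],
]
counts = run_job(partitions)
print(counts)
```

The point of the design is that the map step only ever looks at one record at a time, so it can run wherever the data happens to live; only the much smaller (key, value) pairs need to travel over the network.
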
Another interesting point was that tools like Hadoop, which are dramatically cheaper than previous big data warehouse solutions, lead to a different approach to data gathering. It used to be that people were afraid to "over-use my data", because there was a hard limit on how much the current data warehouse could hold, whether in data volume or at least under the current license. With Hadoop the approach is much more: let's collect as much as we can, go after everything.

It is clear that Cloudera has encountered some skepticism from 'business' towards Hadoop. In their presentation they strongly emphasized that they will take care of things like upgrades, to free up employees to do actual analysis. Obviously Hadoop is not as mature as existing data warehouse tools, which come from a very well-developed field. I got the feeling that the typical data warehouse organization has pretty high expectations of the maturity of its tools -- a challenge and an opportunity for Cloudera at the same time.

The only partner that was a bit out of place, for me, was HP. Big Data is very much about software, and hardware is really just an afterthought. The speaker wasn't very inspired (or inspiring) either.

Josh Wills, data scientist

I particularly enjoyed the presentation by Josh Wills, a data scientist at Cloudera who has worked on the Google ad engine. He described several techniques for creating models and touched on how they would work in the Hadoop ecosystem.

"My nightmare scenario is that a business person comes to me and says: 'Go find me some insights'. It doesn't really work that way."

He emphasized the need for monitoring and experimenting -- it is impossible to get things right from the get-go. He illustrated this with an example of a team in the 1960s that was able to design a human-powered airplane. That team, unlike the competition, focused very strongly on quick turn-around: they were able to rebuild their crashed airplanes quickly, which allowed them to experiment much more than the others. Interestingly, a member of the audience spoke up and mentioned that he had been one of the grad students working at Caltech at the time, which was cool. The lesson is one that holds true in my field of software engineering as well: don't try to build a complex system from scratch, it won't work. Instead, start with a very simple but working system, and make it better over time.


I'd like to read more on the ETL tools that Syncsort offers -- perhaps these might be of interest to some other teams at my work. Most of all I'd like to learn more about Mahout, for example by starting with Cloudera's 'Building Recommender Systems'. This is fun stuff!
