Guus Bosman

software engineering director


internet

The Unreasonable Effectiveness of Data

Last week I finished a very interesting book, Data-Intensive Text Processing with MapReduce. For those of you interested in such matters, I can recommend this short paper by researchers at Google: "The Unreasonable Effectiveness of Data" (PDF). It makes the case that simple algorithms and models that scale well will outperform sophisticated algorithms and models that scale less well, given enough data.

This is particularly important in the field of human language processing, where two developments are intersecting. First, there is the availability of vast corpora of text harvested from the internet. Second, frameworks such as MapReduce can now provide near-linear scaling of computational power. That means that if you double the number of computers available to a job, it can run at almost exactly twice the speed. That provides the scalability needed to deal with these huge data sets.
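To give a rough feel for why this scales so well, here is a minimal word-count sketch in plain Python. It is not the real MapReduce API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call looks at only one document, so the calls are
    # independent of each other. That independence is what would let a
    # real framework hand them out to many machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the cat sat", "The dog sat"]))
```

Because no map call depends on another, adding machines mostly just means handing out more documents at once; the grouping step in between is where a real system has to do its coordination work.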

This is in contrast to older approaches in the field, where researchers tried to build hand-coded grammars and ontologies, represented as complex networks of relations. As the article points out, this dichotomy is an oversimplification, and in practice researchers combine "deep" approaches with statistical approaches.

From the article:

"So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do."

Cool stuff, and fun to read about.

dailylife

Sasha's birthday

Today is Sasha's birthday.

Happy birthday!

dailylife

Nice weather tomorrow

According to the weather forecast, it will be 75 degrees Fahrenheit tomorrow (24 Celsius). Pretty amazing for February.

My little nephew Jasper is doing well; Ettie told me he has gotten his second tooth.

Français

Gaston Lagaffe

I'm looking forward to my French class tomorrow evening; I'm really enjoying the course. It's a nice group of people, the teacher is very good and the place has a very pleasant atmosphere. Most importantly, I feel my French is improving.

The homework assignment for tomorrow is to describe a couple of situations from a comic book, to practice describing emotions and the passé composé versus the imparfait. The comic book is Gaston Lagaffe, one of my all-time favorite comics.

In Dutch, of course, Gaston is known as Guust Flater, a name very similar to mine. I believe my parents own all the Guust Flater books -- and I read them all a thousand times.

driving

A flat tire on the Durham freeway

Today I had a flat tire, for the first time in 90,000 miles.

I was driving on the Durham freeway when I heard a strange noise, and after a few moments I realized that something was wrong with the car. I couldn't quite tell what it was. I left the freeway, and thankfully there was a parking lot right at the off ramp.

I called AAA roadside service, and within 45 minutes someone came and replaced the tire with the emergency one. It actually didn't look too difficult, and if it ever happens again I might try it myself. I went straight to the car dealership to get a new tire installed.
