Guus Bosman

software engineering director


The Unreasonable Effectiveness of Data

Last week I finished a very interesting book, Data-Intensive Text Processing with MapReduce. For those of you interested in such matters, I can recommend this short paper by researchers at Google: "The Unreasonable Effectiveness of Data" (PDF). It makes the case that simple algorithms and models that scale well will outperform sophisticated algorithms and models that scale less well, given enough data.

This is particularly important in the field of human language processing, where two developments are intersecting. First, there is the availability of vast corpora of text harvested from the internet. Second, programming models such as MapReduce can now provide near-perfect scaling of computational power. That means that if you double the number of computers available to a computation, it can run at (almost) exactly twice the speed. That provides the scalability needed to deal with these huge data sets.
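To make the idea a bit more concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python, using word counting as the example. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the tiny corpus are my own illustration, not taken from the book or the paper. A real framework would spread the map and reduce calls over many machines; this sketch only shows why that is possible.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in a single document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call looks at one document only, so the calls are
    # independent and could run on different machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    # Each reduce_phase call looks at one key only, so these are independent too.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical two-document corpus, for illustration only.
corpus = ["the data is the model", "follow the data"]
counts = word_count(corpus)
```

Because no map call depends on another document and no reduce call depends on another key, adding machines mostly just means handing out more of these independent pieces, which is where the near-perfect scaling comes from.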

This is in contrast to older approaches in the field, where researchers tried to model language using hand-coded grammars and ontologies, represented as complex networks of relations. As the article points out, this dichotomy is an oversimplification, and in practice researchers combine "deep" approaches with statistical approaches.

From the article:

"So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do."

Cool stuff, and fun to read about.

Comments

Jaap

Have a nice day with Baba Marta.
