Guus Bosman

software engineering director



Links & Technology


Liars and Outliers: enabling the trust that society needs to thrive

In February of this year, Bruce Schneier released his latest book, Liars & Outliers -- enabling the trust that society needs to thrive. This accessible book does a good job exploring the scientific theory of trust and collaboration, and it combines a theoretical framework with real-life examples. It does not bring many new insights to people who have followed Schneier's other work, but the theoretical framework is useful and this is a book worth reading.

Bruce Schneier
978-1-118-14330-8
/images/books/liarsoutliers.png
English for work
internet

Software on the Curiosity Mars rover

In a thrilling feat of engineering, the Curiosity rover landed successfully on Mars this week. It is easy to be jaded about some of the technologies we see around us, but landing a car-sized rover safely on Mars is extraordinary.

Here is an animation of Curiosity's landing, and it shows the rover driving around on a different planet.

Hardware

The Curiosity is roughly the size of a car and weighs around 2,000 lbs. Its nuclear power source generates 125 watts of power initially but will slowly degrade; after 14 years it will deliver around 100 watts (the rover's official mission lasts two years).

The rover has 17 cameras, and since the specs were created in 2004, the main cameras have only 2 MP sensors. Similarly, the processing hardware feels a little dated: the CPU runs at 200 MHz, and there is 256 MB of memory and 2 GB of flash solid-state storage. There are two on-board computers; one is configured as a backup and will take over in the event of problems with the main one.

Software

One of the main engineering accomplishments in the project is the software.

With a total cost of around $1.2 billion and much scientific work on the line, a software mistake would be disastrous, and the development and QA approach is very rigorous. I enjoyed reading the Coding Standards for the Jet Propulsion Laboratory. It is a very readable document, and the guidelines are obviously conservative: "Specifically the use of task delays has been the cause of race conditions that have jeopardized the safety of spacecraft." The rover has around 2.5 million lines of C code.
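
That guideline is easy to appreciate even outside flight software. A sleep-based wait only works if another task happens to finish within the delay; the little Python sketch below (my own illustration, not JPL code) contrasts a fragile task delay with waiting on an explicit synchronization primitive.

    import threading
    import time

    data_ready = threading.Event()
    shared = {}

    def producer():
        time.sleep(0.2)              # simulates work that takes a variable amount of time
        shared["value"] = 42
        data_ready.set()

    def fragile_consumer():
        time.sleep(0.1)              # "task delay": hopes the producer is done by now -- a race
        return shared.get("value")   # returns None whenever the producer is slower than the delay

    def robust_consumer():
        data_ready.wait()            # explicit synchronization: no assumptions about timing
        return shared["value"]

    threading.Thread(target=producer).start()
    print(robust_consumer())         # always prints 42; fragile_consumer() would sometimes miss it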

Curiosity runs Wind River's real-time operating system (RTOS), VxWorks, and the software is similar to that of previous rover missions. A team of 30 developers and 10 testers works on the Curiosity rover software, which runs over 130 threads.

A cool presentation gives more details on the testing methodology. It emphasizes analysis of the logs that are generated: a Python-based high-level language is used during testing to check whether the logs show the expected results.
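
I don't know what JPL's test language actually looks like, but the idea of scripting expectations against generated logs is easy to sketch. Here is a minimal, hypothetical Python example; the log format and event names are invented for illustration.

    import re

    # Hypothetical log excerpt -- the format and event names are made up.
    LOG = """
    2012-08-06 05:17:57 INFO  EDL: parachute_deploy ok
    2012-08-06 05:17:59 INFO  EDL: heatshield_separation ok
    2012-08-06 05:18:40 INFO  EDL: sky_crane_start ok
    2012-08-06 05:18:53 INFO  EDL: touchdown ok
    """

    # High-level expectation: these events must appear, in this order, each with status 'ok'.
    EXPECTED = ["parachute_deploy", "heatshield_separation", "sky_crane_start", "touchdown"]

    def check_log(log, expected):
        """Return True if the expected events occur in order and all report 'ok'."""
        events = re.findall(r"EDL: (\w+) (\w+)", log)
        seen = iter(name for name, status in events if status == "ok")
        # Each expected event must be found, in order, in the stream of observed events.
        return all(exp in seen for exp in expected)

    assert check_log(LOG, EXPECTED)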

Upgrade

Right now, the firmware on Curiosity is being upgraded. I have a lot of experience with remote firmware upgrades, and I know how scary this process can be. A senior flight software engineer, speaking to Computerworld:

"It has to work. You don't want to be known as the guy doing the last activity on the rover before you lose contact."

Very cool to see a software upgrade being done 160 million miles away.


Responsive Web Design

This highly readable book introduces Responsive Web Design, a name coined by the author, Ethan Marcotte, for creating pages that work well on different devices, be they mobile phones, tablets, or desktops.

Ethan Marcotte
978-0984442577
/images/books/responsivewebdesign.png
English for work

Scalable Internet Architectures

Scalable Internet Architectures provides a good introduction to scalability and performance engineering for large internet applications. The book has useful high-level discussions and interesting real-world insight but could have benefited from better editing. It would have been even stronger with more focus on theoretical aspects -- which the author explains well -- and less emphasis on specific tools and code snippets. Overall, even though the book is from 2006, it is worth a read, especially for engineers new to the field.

The author of the book, Theo Schlossnagle, is principal at a consulting company and his real-world experience with scalability and other aspects of large-scale engineering clearly shows in the book. He excels at outlining the challenges and possible solutions on a high-level, giving the reader a good background to make informed choices.

Still relevant 6 years later

The book was written in 2006 but most of the material is still relevant; the architectures and concepts that are described are still valid today. The code examples and the recurring emphasis on the author's favorite tools, Spread and Whackamole, are less useful for a book on this level.

The book is almost exclusively focused on the ‘back-end’ server architecture and doesn’t talk much about ‘front-end’ items, except for mentioning that cookies make an excellent 'super local' cache for web applications. Most of the development in the field since 2006 has been client-side, with the possible exception of experimental things like SPDY, Google’s new protocol. It would be interesting to read more about the impact on the back-end architecture of increased Ajax use and streaming partial page rendering such as Facebook’s.

"Developers have no qualms about pushing code live..."

The excellent first three chapters introduce the field of scalability and performance engineering and explain the challenges that occur once an internet application reaches a large scale. The classic tension between flexibility and stability is summarized succinctly, where "developers" are really a proxy for the demands of the business to deal with a changing internal and external world:

"In my experience, developers have no qualms about pushing code live to satisfy urgent business needs without regard to the fact that it may capsize an entire production environment at the most inopportune time. [...] My assumption is that a developer feels that refusing to meet a demand from the business side is more likely to result in termination than the huge finger-pointing that will ensue post-launch".

For me this is a very familiar discussion -- part of being an engineering manager is making these types of judgment calls: when do we push back, when do we take a risk, and what is the risk/benefit trade-off?

High-level problems and solutions

The author is at his best when explaining high-level problems and their possible solutions. He explains the need for horizontal scaling and introduces various techniques that make it possible. He goes into advanced topics but doesn’t forget to cover the basics. For example, there is an excellent walk-through of the performance gains from serving static content vs. dynamic content. This is a good description for people new to the field and it is well illustrated, including the slowness of the initial TCP handshake and the dramatic difference in memory footprint of a 'bare-bones' Apache versus Apache with Perl or PHP compiled in.

An interesting piece of first-hand knowledge is the author's claim that on web servers (in clusters of more than 3 servers) one can expect up to 70% resource utilization. That's a good benchmark to have.

I also liked the explanation of caching semantics. The author illustrates the problems of having shared, non-scalable resources (such as databases) and explains how introducing caches can help create a more scalable architecture. The sample PHP code is helpful in explaining caching and two-tier execution. The book discusses transparent caches, look-aside caches, and distributed caches.
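
The book's sample code is PHP; as a rough illustration of the look-aside idea in Python (my sketch, not the book's code): check the cache first, fall back to the database on a miss, and populate the cache on the way out.

    import time

    cache = {}                      # stand-in for memcached or another distributed cache
    CACHE_TTL = 300                 # seconds; an assumed, arbitrary expiration time

    def slow_database_query(user_id):
        # Placeholder for the shared, hard-to-scale resource.
        time.sleep(0.05)
        return {"id": user_id, "name": "user-%d" % user_id}

    def get_user(user_id):
        """Look-aside cache: try the cache, fall back to the database, then populate."""
        entry = cache.get(user_id)
        if entry is not None and entry["expires"] > time.time():
            return entry["value"]                    # cache hit
        value = slow_database_query(user_id)         # cache miss: hit the database
        cache[user_id] = {"value": value, "expires": time.time() + CACHE_TTL}
        return value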

The descriptions of the various types of database replication were good too -- master-master, master-slave, and even cross-vendor database replication, where an expensive Oracle master is used in combination with open-source PostgreSQL slaves. The latter definitely has its pros and cons and would introduce quite a bit of extra maintenance, but the author is right that it opens the mind to possibilities like that.
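
One practical consequence of master-slave replication is that the application has to route writes to the master and can spread reads over the slaves. A hedged Python sketch of that routing decision (the host names are hypothetical):

    import random

    MASTER = "db-master:5432"
    SLAVES = ["db-slave-1:5432", "db-slave-2:5432"]

    def pick_host(is_write):
        """Writes must go to the master; reads can be load-balanced over the slaves.
        Because of replication lag, a read that must see its own preceding write
        should be sent to the master as well."""
        return MASTER if is_write else random.choice(SLAVES)

    print(pick_host(is_write=True))    # db-master:5432
    print(pick_host(is_write=False))   # one of the slaves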

Peer-to-peer

Throughout the book Schlossnagle discusses peer-to-peer high-availability software. The tools Spread and Whackamole are pushed quite a lot; they are part of a project the author worked on at Johns Hopkins University. This peer-to-peer concept brings in an interesting perspective -- looking at these solutions makes sense to me, although it is not something I have worked with yet. However, the author gets too specific in the last chapters of the book: instead of high-level discussions he delves into the specifics of using Spread for logging, which is a missed opportunity to really discuss the various architectures in that area.

The book is clearly written by someone who has been in the trenches, although the tone is a little cynical at times: "And yes, 1 fault tolerant and N-1 fault tolerant are the same with two machines, but trying to make that argument is good way to look stupid". The book could have benefited from a stronger editor who would have kept those things in check. It is woolly, especially in chapters 4 and 5, and could have been a bit shorter.

Recommended

The book provides a good high-level discussion of concepts such as various caching models, fail-over, and scalability, combined with the real-world experiences of the author. It would have been stronger with a better editor, but it is worth a read, especially for engineers new to the field of large-scale websites.

There are very few books out there that discuss all these aspects on a high level. Perhaps a second edition can fix some of the minor shortcomings, but the book is recommended.

More info: http://scalableinternetarchitectures.com

Theo Schlossnagle
0-672-32699-X
/images/books/scalableinternetarchitectures.jpg
English for work
internet

"We crashed, now what?"

The other day I read an interesting article by researchers from my old Computer Science department in Amsterdam: "We crashed, now what?"

The paper is a short description of an experiment they did with real-time recovery of operating system crashes on the Minix operating system. Minix, of course, is message-driven, with most of the kernel's components running in user space. With some smart bookkeeping they were able to put simple checkpoints in place that allow for successful recovery of crashes of kernel components, caused for example by memory errors. Pretty cool stuff:

"Preliminary results showed that our approach is able to restart even the most critical OS components flawlessly during normal system operation, keeping the system fully functional and without exposing the failure to user processes. For instance, our approach can successfully restart the process manager (PM), which stores and manages the most critical information about all the running processes—both regular and OS-related—in the system. Our preliminary experiments showed that the global state of PM was always correctly restored upon restart and no information was ever lost."

One of the co-authors of the article is Andrew Tanenbaum, professor at the Vrije Universiteit and creator of the Minix operating system.

internet

Uncovering Spoken Phrases in Encrypted Voice over IP Conversations

Today I read 'Uncovering Spoken Phrases in Encrypted Voice over IP Conversations', a very interesting article from the December 2010 issue of ACM Transactions on Information and System Security. (Read the full PDF version here.)

The paper details a gap in the security of encrypted VoIP streams that use variable bit-rate (VBR) compression. The authors had earlier found that it is possible to determine the language that is spoken on such a VoIP call, based on packet lengths. Now they have expanded their research and show that it is possible to detect entire spoken phrases during a VoIP call. On average, their method achieved a recall of 50% and a precision of 51% for a wide variety of phrases spoken by a diverse collection of speakers (some phrases are easier to detect than others; recall varies from 0% to 98%, depending on the length of the phrase and the speaker).

In other words: they can detect fairly well if a certain phrase is being used in a conversation, even though the VoIP conversation is encrypted!

Fundamentally, this is possible because VoIP packets are compressed using variable bit-rate compression and are not typically "padded". Longer phonemes (such as vowels) correspond with longer packets; shorter phonemes (such as fricatives like 's', 'sh' or 'th') produce shorter packets. Using sophisticated statistical analysis, the authors can detect whole phrases.
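
The paper's actual technique is considerably more sophisticated, but the underlying signal is easy to illustrate. The toy Python sketch below (my own simplification, with invented numbers) slides a phrase's expected packet-length profile over an observed stream and scores each position by correlation.

    import numpy as np

    # Invented packet lengths (bytes), purely for illustration.
    observed_stream = np.array([52, 54, 51, 88, 91, 87, 60, 59, 90, 93, 89, 61, 50, 53])
    phrase_profile  = np.array([88, 91, 87, 60, 59, 90, 93, 89])   # expected lengths for one phrase

    def best_match(stream, profile):
        """Slide the profile over the stream; return the best correlation and its offset."""
        best_corr, best_offset = -2.0, 0
        for i in range(len(stream) - len(profile) + 1):
            window = stream[i:i + len(profile)]
            corr = np.corrcoef(window, profile)[0, 1]
            if corr > best_corr:
                best_corr, best_offset = corr, i
        return best_corr, best_offset

    score, offset = best_match(observed_stream, phrase_profile)
    print("best correlation %.2f at packet offset %d" % (score, offset))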

A solution would be to add padding to VoIP packets, but that increases the bandwidth that is needed. Not only does the padding itself increase bandwidth, it also negates a big benefit of VBR compression during quiet periods in a conversation, when one party is listening to the other.

A fun read, quite accessible.

internet

The Unreasonable Effectiveness of Data

Last week I finished a very interesting book, Data-Intensive Text Processing with MapReduce. For those of you interested in such matters, I can recommend this short paper by researchers at Google: "The Unreasonable Effectiveness of Data" (PDF). It makes the case that simple algorithms and models that scale well will outperform sophisticated algorithms and models that scale less well, given enough data.

This is particularly important in the field of human language processing, where two developments are intersecting. First, there is the availability of vast corpora of text harvested from the internet. Second, frameworks such as MapReduce can now provide near-perfect up-scaling of computational power: if you double the number of computers available to an algorithm, it runs (almost) exactly twice as fast. That provides the scalability needed to deal with these huge data sets.

This is in contrast to older approaches in the field, where researchers tried to model hand-coded grammars and ontologies, represented as complex networks of relations. As the article points out, this dichotomy is an oversimplification, and in practice researchers combine "deep" approaches with statistical approaches.

From the article:

"So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do."

Cool stuff, and fun to read about.


Data-Intensive Text Processing with MapReduce

It's beautiful to see a real change in paradigm happening. I remember in college how much I enjoyed programming in functional languages, and how cool it was to be able to look at problems from a different viewpoint. What Google and others have achieved with MapReduce is a similar change in the way of looking at problems.

MapReduce is the name of Google's programming model for processing huge data sets; other companies have since followed suit. I didn't know much about this field and this book is a great introduction. It provides a good description of the foundations, and I love that it describes practical uses, such as machine translation, Google's PageRank, and shortest paths in a graph.

Actually in use

What I like about MapReduce is that it provides an abstraction for distributed computing that is actually being used and is successful. The book shows the scaling characteristics of an example algorithm (stripes for computing word co-occurrence) on Hadoop: an R^2 of 0.997! That means throughput increases almost linearly when you add extra machines.
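
For readers unfamiliar with the 'stripes' pattern the book benchmarks: instead of emitting a separate count for every word pair, the mapper emits one associative array ('stripe') of neighbor counts per word, and the reducer merges the stripes element-wise. A small Python sketch of the idea (not the book's Hadoop code), with the shuffle phase simulated locally:

    from collections import Counter
    from itertools import groupby

    def map_stripes(line, window=2):
        """Emit (word, stripe) pairs; a stripe counts the word's neighbors within the window."""
        words = line.split()
        for i, w in enumerate(words):
            neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            yield w, Counter(neighbors)

    def reduce_stripes(word, stripes):
        """Merge all stripes for a word by element-wise addition."""
        total = Counter()
        for stripe in stripes:
            total.update(stripe)
        return word, total

    lines = ["the quick brown fox", "the lazy brown dog"]
    mapped = sorted((kv for line in lines for kv in map_stripes(line)), key=lambda kv: kv[0])
    for word, group in groupby(mapped, key=lambda kv: kv[0]):
        print(reduce_stripes(word, (stripe for _, stripe in group)))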

Want to read more

This is one of those books that makes you want to read more. For example, since reading this book I've looked into terms such as Zipfian distributions, Brewer's CAP Theorem and Heaps' Law. I still need to learn more about Expectation Maximization and Hidden Markov Models, harking back to some fundamental mathematics I had in college.

I want to read more about machine translation now -- Koehn's book, perhaps. And I definitely want to read the Google article about the "unreasonable effectiveness of data".

This is an excellent book, which provides a very readable introduction to the algorithms and real-world implementations.

Jimmy Lin, Chris Dyer
9781608453429
/images/books/mapreduce.jpg
English for work

HTML5 for Web Designers

HTML5 for Web Designers is a short and pleasant introduction to HTML5.

The book, 87 pages long, is published by the folks at A List Apart, a blog about website design that I follow. It's a quick read -- the book probably took me no more than 30 minutes -- and it gives you the highlights of HTML5. The introduction, with the history of the development of HTML standards, was interesting.

HTML5

Web Forms 2.0 is very useful. I think the microformat-like elements such as mark and time are good additions, but I'm not so sure about the new structural elements. The distinction between article and section is a little confusing, and I'm not sure what their added value is. I'm also not convinced of the benefits of the more flexible nesting and outlining that the author describes.

Obviously, the standardization of video and audio playback is huge (as long as we can all agree on the encoding...).

For my work, the Web Forms 2.0 elements are probably going to be the most useful: marking fields as required, specifying that input fields can take numeric input only, etc. Today we use JavaScript libraries for this. A library like ExtJS already allows you to specify this declaratively, but native browser support would be even better.

The book purposely does not go into the new standardized JavaScript APIs that are part of HTML5; that would be a nice topic to read up on.

Jeremy Keith
978-0984442508
/images/books/html5webdesigners.jpg
English for work
internet

What phone are you?

What your phone says about you...
