The other day I read an interesting article by researchers from my old Computer Science department in Amsterdam: “We crashed, now what?”
The paper is a short description of an experiment they did with real-time recovery from operating system crashes on the Minix operating system. Minix, of course, is message-driven, with most of the operating system’s components running as processes in user space. With some smart bookkeeping they were able to put simple checkpoints in place that allow for successful recovery from crashes of these components, caused for example by memory errors. Pretty cool stuff:
“Preliminary results showed that our approach is able to restart even the most critical OS components flawlessly during normal system operation, keeping the system fully functional and without exposing the failure to user processes. For instance, our approach can successfully restart the process manager (PM), which stores and manages the most critical information about all the running processes—both regular and OS-related—in the system. Our preliminary experiments showed that the global state of PM was always correctly restored upon restart and no information was ever lost.”
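To get a feel for the checkpoint idea, here is a toy sketch in Python. It is my own illustration and not the paper's mechanism: the `ProcessManager` class, the message tuples, and the `run` loop are all made up, and a real system would not deep-copy its whole state on every message. The point is only that a message-driven component has a natural checkpoint at the top of its request loop, so a crash halfway through one request can be rolled back to the state before that request.

```python
import copy


class ProcessManager:
    """Hypothetical stand-in for an OS server that handles one message at a time."""

    def __init__(self):
        self.state = {"procs": {}}  # pid -> process name
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Remember the state as it was before the next request is handled.
        self._saved = copy.deepcopy(self.state)

    def restore(self):
        # Throw away any half-finished changes and go back to the checkpoint.
        self.state = copy.deepcopy(self._saved)

    def handle(self, msg):
        op, pid, name = msg
        if op == "fork":
            self.state["procs"][pid] = name
            if name == "crash":
                # Simulated fault after the state was already modified.
                raise MemoryError("simulated fault mid-request")
        elif op == "exit":
            self.state["procs"].pop(pid, None)


def run(server, messages):
    """Message loop: checkpoint, handle, and roll back if the handler fails."""
    failed = []
    for msg in messages:
        server.checkpoint()
        try:
            server.handle(msg)
        except Exception:
            server.restore()
            failed.append(msg)
    return failed


pm = ProcessManager()
failed = run(pm, [("fork", 1, "init"), ("fork", 2, "crash"),
                  ("fork", 3, "sh"), ("exit", 1, None)])
print(failed, pm.state["procs"])
```

The request that crashed leaves no trace in the process table, and the requests after it are handled as if nothing happened, which is the behaviour the quote describes for PM.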
One of the co-authors of the article is Andrew Tanenbaum, professor at the Vrije Universiteit and creator of the Minix operating system.