Where Andy is wrong about L4
Andy Tanenbaum, in a recent interview with LinuxFr.org, made some comments about L4. Unfortunately, they were mostly wrong.
Make no mistake, Andy is a guy I hold in high regard, he’s done great things over the years. But that doesn’t mean he should get away with making incorrect statements about the main competitor to his Minix system! And I’m not fussed about the outdated number of 300 million L4 deployments (the official figure is 1.5 billion). There are far more substantial issues. So, let’s have a look at what he said.
Andy Quote 1: “The biggest difference is that we [Minix] run the entire operating system as a collection of servers in userland. They [L4] mostly run Linux as a whole unit in userland. This means that a single bug in Linux can crash Linux. While L4 will survive this, all that does is make the reboot faster. It doesn’t solve the reliability problem.”
While it is true that L4 is being used as a virtualization platform to host Linux (something my colleague Hermann Härtig in Dresden has been doing since 1997), and there are phones out there that run Linux on top of L4, that isn’t the whole story. In fact, if all we did was put L4 underneath Linux, we’d do nothing than adding overhead. Clearly, that would be a pointless exercise.
There are several reasons why you’d want to run Linux on top of L4. One is the classical virtualization use case of processor consolidation: running multiple systems side-by-side on the same processor, using virtual-machine encapsulation to isolate one from the other. This is the configuration that is indeed used on some of the phones that ship with L4.
The second reason is legacy support: If you think you can just introduce a new OS API (even if it’s POSIX-compliant) and the world will adapt all its software, you’re dreaming. People want to not only keep re-using their old software, they even want to build new software to the old environments. But that shouldn’t stop you from providing a better environment for safety- and security-critical components.
This trend is indeed very strong in the embedded world, even (or particularly) in the domain of safety- or security-critical devices. I rarely get to talk to someone considering deploying seL4 who doesn’t have a need for a Linux or Windows environment. The important point is that a highly trustworthy microkernel underneath allows this to co-exist safely with the critical functionality.
Andy Quote 2: “There are lots of technical differences between MINIX and L4, but the focus of the projects is very different. We [Andy] have a microkernel so we can run a multiserver system. The entire focus is on multiserver and the ability to restart OS components independently. L4 is being used to run Linux and on phones. This has nothing to do with technical differences. We could have used L4 and they could have used MINIX but it is easier to use your own since you can change it as needs arise.”
Notwithstanding my comments above, the reality is that many of the 1.5 billion L4 deployments have nothing whatsoever to do with Linux! They run L4 on the baseband processor of a smartphone, with Linux (or Windows or iOS or BBOS or xyz) running on a separate applications processor. And the baseband stack is in fact a multi-server design! It consists of multiple (each multithreaded) tasks that communicate with each other. Exactly what Andy claims he does and we don’t…
The statement that “they” (i.e. us) could have used Minix is actually quite wrong. Minix, owing to its design and strong emphasis on simplicity, is much slower than L4. (Their papers typically tell you that throughput is only degraded by a few %. However, that’s typically compared to Minux-2, which isn’t the world’s fastest system either, while a reasonable baseline would be Linux. Furthermore, they don’t tell you how much CPU usage increased – I’ve discussed this with Andy, and unless things have changed a lot recently, then CPU overhead is pretty significant. Pretending that throughput degradation measures overhead while ignoring CPU load is one of my favourite benchmarking crimes!)
Performance happens to be pretty important on a mobile device (whether it’s a phone or an implanted medical device). Throwing away cycles means throwing away battery life, and manufacturers tend to care about this. In my experience, overheads of more than 5% (measured in terms of CPU cycles) are considered unacceptable in this space. Minix certainly wouldn’t be useful in a mobile environment at all!
Andy Quote 3: “They certainly did a nice job [on formal verification of seL4] but it is really hard to verify if it works since the hardware itself is poorly defined and the OS has to deal with the hardware a lot.”
C’mon, Andy, that one’s really cheap! seL4 is the first (and so far only) OS kernel which has a formal (mathematical and machine-checked) proof that its implementation follows the specification. That means nothing less than that the implementation is bug-free with respect to the spec. Something that just cannot be achieved by simplicity and any amount of code inspection or testing!
Andy seems to imply that because the hardware isn’t exactly defined, the verification is somewhat worthless. Of course, if the hardware’s behaviour isn’t defined well enough, then that hurts Minix just as well. But the argument is actually bogus. The verification of seL4 was done for ARM processors, and (unlike x86) they have a very precisely-defined ISA. It’s so well defined that Anthony Fox and Magnus Myreen at Cambridge could turn it into a formal (mathematical) model. And we’re in the process of linking this model to our seL4 code proof.
In reality, the functional correctness proof of seL4 is very powerful. It means that seL4’s behaviour is very precisely defined (there is a formal spec of the kernel). And the implementation adheres to that spec. Which means that in order to prove safety or security properties of a system built on top, it is generally sufficient to prove that these properties hold with respect to the seL4 spec, and by implication they hold for the system. (I’m glossing over a few details here, this argument requires more than functional correctness. However, all the other bits required are either already in place for seL4, or we’re working on them and they should be in place this year!)
So, how about the multi-server issue? This is the classical microkernel approach: you structure the microkernel-based system into multiple components (or servers) and the microkernel enforces isolation between them. This can lead to a more fault-tolerant system, as Andy keeps pointing out (correctly). However, if he seems to imply that this somehow is different from what we do, then that’s simply wrong. In fact, our Trustworthy Systems research agenda uses exactly this component approach. The difference is that we can prove (security or safety or other) properties of our system. No-one else can do this!
On top of that we have a few other cool bits. Such as device drivers which are synthesised from formal specs, and are as such correct by construction! Perfect complement to a formally-verified kernel.
Andy Quote 4 (not actually about L4 but about capability-based OSes like Coyotos): “Capability systems have been around for 40 years. At the recent SOSP conference, M.I.T. professor Jack Dennis got an award for inventing them in 1967. They definitely have potential. My first distributed system, Amoeba, used them. Coyotos, however, is going nowhere.”
Am I to read this as an argument that because CoyotOS is going nowhere, capability systems aren’t?
seL4 is a capability-based system too. And, unlike CoyotOS, it is going places! In fact, capabilities are a core part of the safety/security story in seL4: they allow us to reason precisely about access rights (and information flow) in the system, in a way that is impossible to do with an access-control-list based approach as Minix uses it.
The two approaches are similar as far as OS structure is concerned. The core difference is that with Minix, Andy is trying to get things right by keeping them really simple. He cops a significant performance overhead, yet in the end can only use traditional quality-assurance methods (testing and code inspection) to make a case about the behaviour of his system. Nor can he formally reason about the security/safety properties of the complete system.
In contrast, with seL4 we have a microkernel that is blindingly fast (as fast as any L4 kernel, and no other microkernel has ever been able to beat L4 in performance), yet we can make actual guarantees about its behaviour and that of systems built on top, with the strength of mathematical proof!
Everything else is simply not playing in the same league.