Q: What is the difference between a microkernel?
A: One protection is both different…
I’m of course referring to the latest twist in the microkernel story: Singularity vs L4. The recent public release of the Singularity source by Microsoft has generated some attention, and invited some comparisons with L4. It has also “enriched” the blogosphere with blatantly incorrect statements such as “the use of SIPs effectively eliminates the overhead traditionally incurred by context-switching in conventional microkernels.”
So, what’s going on?
L4 and Singularity can be considered the leading microkernel representatives of two alternative approaches to protection: hardware or software mechanisms. The L4 approach is to use hardware mechanisms (processor modes and the MMU/MPU) to protect processes (and their data) from each other. This is the standard way of providing protection in operating systems, and the idea of microkernels based on this approach goes back to Brinch Hansen’s Nucleus (1970).
The alternative, used by Singularity, is to use software means, specifically language-based protection (using the type system) and static analysis. While most people think this is a revolutionary idea, it is actually quite old. The approach has been used by the Burroughs mainframes since 1961! (I used a B6700 a lot in a former life, and it was sloooooow!)
There is no such thing as a free lunch, and protection is no exception.
The cost of hardware-based protection comes in the form of system-call exceptions (to change the execution mode from user to kernel, aka “trapping into the kernel”) and the cost of setting up, maintaining and tearing down mappings (manipulating page tables and the MMU). Kernel entry costs anything from one cycle (Itanium) via a dozen or so cycles (decent RISC processors) to many hundreds of cycles (x86). Mapping-related costs come mostly in the form of handling TLB misses (typically a few dozen cycles each). For most loads (exhibiting decent locality) the mapping overhead is low, except on processors without a tagged TLB (x86 is again the main offender), where the TLB is completely flushed on each context switch (leading to many TLB reloads just after the switch).
How about software, à la Singularity? Surely, there are no such overheads, as invoking the kernel is just a function call, and a process switch is just a thread switch without a change of addressing context? That’s true, but many overlook the fact that this is balanced by other overheads resulting from language-based protection: a run-time range check must be applied for each pointer-dereference or array-indexing operation, unless one can statically (i.e. at compile time) determine that it is safe (this is where static analysis comes in). All those run-time checks obviously cost. On top of this is the cost of garbage collection, as type-safe languages (like Sing#) have automatic memory management (this is also an aspect of language-based protection).
So, both schemes have a cost. Which is higher? That’s really impossible to say a priori, and in general depends on the nature of the computation. The main reason is that you compare a (relatively small) overhead that occurs all the time (most statements of the programming language) against a much higher overhead that occurs much less frequently (invoking system calls and performing context switches).
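To make the trade-off concrete, here is a minimal sketch of a toy cost model in Python. Everything in it is an assumption for illustration only: the function names, the idea that run-time checks inflate useful work by a constant fraction, and the sample numbers (a 5% overhead, 500 cycles per kernel crossing) are placeholders, not measurements of any real system.

```python
def total_cycles(work_cycles, crossings, per_op_overhead, per_crossing_cost):
    """Toy model: useful work inflated by a constant fractional overhead,
    plus a fixed cost for every kernel crossing (syscall or context switch)."""
    return work_cycles * (1.0 + per_op_overhead) + crossings * per_crossing_cost


def break_even_crossings(work_cycles, sw_overhead, hw_crossing_cost,
                         sw_crossing_cost=0.0):
    """Number of crossings at which both schemes cost the same.

    Hardware protection: no per-operation overhead, expensive crossings.
    Software protection: per-operation overhead, cheap crossings.
    Returns None if hardware crossings are not more expensive than
    software ones, in which case software protection never catches up.
    """
    extra_per_crossing = hw_crossing_cost - sw_crossing_cost
    if extra_per_crossing <= 0:
        return None
    return work_cycles * sw_overhead / extra_per_crossing


# Hypothetical workload: one million cycles of useful work.
WORK = 1_000_000
crossings = break_even_crossings(WORK, sw_overhead=0.05, hw_crossing_cost=500)
print("break-even crossings per million cycles of work:", crossings)
print("hardware total:", total_cycles(WORK, crossings, 0.0, 500))
print("software total:", total_cycles(WORK, crossings, 0.05, 0))
```

Under these made-up numbers, a program has to cross the protection boundary more often than the break-even count before the software scheme comes out ahead; a compute-bound program with few system calls sits well below that point. The model is deliberately crude (it ignores TLB effects and garbage collection pauses, for example), which is exactly why real benchmarks are needed.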
The only way to tell is to build the systems and then benchmark them against each other. It is thus just a little bit curious that the Singularity folks (given that they routinely refer to Singularity as a microkernel) never benchmark it against the top-of-the-line microkernel, i.e. L4. Instead they publish benchmarks comparing with Windows, FreeBSD and Linux. What does this prove?
How should the two be compared? Well, there are two extreme cases we can look at, and, as it happens, the information is all there if you know where to look for it.
The first is to measure the overhead imposed on normal programs by the run-time checks, garbage collection, etc. This is the extreme L4-friendly benchmark, as obviously the cost to a program running on L4 would be zero. The friendly Singularity folks have conveniently measured a lower limit of this cost, published in their white paper Singularity: Rethinking the Software Stack. They find that turning off compiler run-time checks gains 4.7% performance. Note that this is a lower limit on the overhead any code (OS as well as application) will experience on Singularity. The real cost will be higher, as their test probably still had the garbage collector active, and almost certainly didn’t apply all the additional compiler optimisations that are possible once the run-time checks are removed. So, in summary, running any code on Singularity gives you a slowdown of at least about 5% compared to L4. This is for x86, but the processor architecture is unlikely to have a significant effect on this.
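How a 4.7% gain turns into “about 5%” slowdown depends on what “performance” means, which is not spelled out here. The following Python sketch works through two plausible readings; both readings, and the function names, are my assumptions rather than anything stated in the white paper.

```python
GAIN = 0.047  # gain from turning off run-time checks, as quoted above


def slowdown_if_throughput_gain(gain):
    """Reading A (assumption): throughput rises by `gain` without checks.
    Run time with checks is then (1 + gain) times the unchecked run time."""
    return gain


def slowdown_if_runtime_reduction(gain):
    """Reading B (assumption): run time falls by `gain` without checks.
    Run time with checks is then 1 / (1 - gain) times the unchecked run time."""
    return 1.0 / (1.0 - gain) - 1.0


reading_a = slowdown_if_throughput_gain(GAIN)
reading_b = slowdown_if_runtime_reduction(GAIN)
print(f"reading A: {reading_a:.1%} slowdown")
print(f"reading B: {reading_b:.1%} slowdown")
```

Either way the result lands between 4.7% and 5%, so the rounded figure holds regardless of which reading is correct.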
The other extreme is to look at a scenario where two processes do nothing but send messages to each other. This is the extremely L4-unfriendly benchmark, as it is dominated by the cost of hardware protection (kernel entry and exit, switching addressing context), none of which applies to Singularity. The paper Sealing OS Processes to Improve Dependability and Safety measures this as “message ping/pong” (Table 8) and “IPC costs” (Table 9). As I said, they only show measurements for Singularity, Windows, FreeBSD and Linux, not the real benchmark for microkernel performance. Fortunately, L4 numbers are available too, and have been for years: the Karlsruhe team benchmarked L4 IPC on an AMD-64 machine. (It isn’t exactly the same box as was used in the Singularity benchmarks, but it is close enough: 1.6GHz vs 2.0GHz and the same architecture.) The result: cross-process IPC in Singularity costs 803 cycles for one byte and 933 cycles for four bytes, while L4 does it in 230 cycles for up to 24 bytes. Actual time: 402/467ns for Singularity, 144ns for L4.
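The conversion from cycles to nanoseconds is easy to check. In this small Python sketch the cycle counts are the ones quoted above; which clock rate belongs to which machine (2.0GHz for the Singularity box, 1.6GHz for the L4 box) is inferred from the quoted times, and the function and variable names are made up for the example.

```python
def cycles_to_ns(cycles, clock_ghz):
    """At f GHz a processor executes f cycles per nanosecond,
    so elapsed time in ns is cycles / f."""
    return cycles / clock_ghz


# Clock rates inferred from the quoted times (assumption, see lead-in).
SINGULARITY_GHZ = 2.0
L4_GHZ = 1.6

singularity_1_byte_ns = cycles_to_ns(803, SINGULARITY_GHZ)
singularity_4_byte_ns = cycles_to_ns(933, SINGULARITY_GHZ)
l4_ns = cycles_to_ns(230, L4_GHZ)

# Compare both ways: wall-clock time depends on the clock rate,
# cycle counts do not.
ratio_by_time = singularity_1_byte_ns / l4_ns
ratio_by_cycles = 803 / 230

print(f"Singularity: {singularity_1_byte_ns:.1f} ns / {singularity_4_byte_ns:.1f} ns")
print(f"L4:          {l4_ns:.2f} ns")
print(f"ratio by time:   {ratio_by_time:.2f}")
print(f"ratio by cycles: {ratio_by_cycles:.2f}")
```

The ratio comes out a little under three when comparing times and closer to three and a half when comparing cycles, because the L4 measurement was taken on the slower-clocked machine.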
In other words, L4 is about three times faster (roughly 2.8 times in elapsed time, 3.5 times in cycles) in the case that’s maximally biased in Singularity’s favour! So much for “eliminates the overhead traditionally incurred by context-switching in conventional microkernels”…
Now, this is also on a very L4-unfriendly architecture (x86). Things look much better for L4 on processors with a tagged TLB (which is just about every one other than x86). Such processors, like ARM, MIPS, PowerPC, are prevalent in embedded systems. Embedded systems, especially battery-powered ones, also happen to be particularly sensitive to performance. So the result is clear: L4 yet again wins the performance stakes—hands down!
I will compare the two approaches from the point of view of security and relevance to embedded systems in a future blog.