Skip to content

Microkernels Really Do Improve Security

Many of us operating systems researchers, going back at least to the US DoD’s famous 1983 Orange Book, have been saying that a secure/safe system needs to keep its trusted computing base (TCB) as small as possible. For system of significant complexity, this argues for a design where components are protected in their own address spaces by an underlying operating system (OS). This OS is inherently part of the TCB, and such should itself be kept as small as possible, i.e. based on a microkernel, which comprises only some 10,000 lines of code. Which is why I have spent about a quarter century on improving microkernels, culminating in seL4, and getting them into real-world use.


Monolithic vs microkernel-based OS structure.

It is intuitive (although not universally accepted) that a microkernel-based system has security and safety advantages over a large, monolithic OS, such as Linux, Windows or macOS, with their million-lines-of-code kernels. Surprisingly, we lacked quantitative evidence backing this intuition, beyond extrapolating defect density statistics to the huge TCBs of monolithic OSes.

Finally, the evidence is here, and it is compelling

Together with two undergraduate students, I performed an analysis of Linux vulnerabilities listed as critical in the Common Vulnerabilities and Exposures (CVE) database. A vulnerability is tagged critical if it I easy to exploit and leads to full system compromise, including full access to sensitive data and full control over the system.

For each of those documented vulnerabilities we analysed how it would be affected if the attack were performed against a feature-compatible, componentised OS, based on the seL4 microkernel, that minimised the TCB. In other words, an application running on this OS should only be dependent on a minimum of services required to do its job, and no others.

We assume that the application requires network services (and thus depend on the network stack and a NIC driver), persistent storage (file system and a storage device driver) and console I/O. Any OS services not needed by the app should not be able to impact its confidentiality (C), integrity (I) and availability (A). For example, an attack on a USB device should not impact our app. Such a minimal TCB architecture is exactly what microkernels are designed to support.

The facts

The complete results are in a peer-reviewed paper, I’m summarising them here. We find that of all of the 115 critical Linux CVEs, we could analyse 112, the remaining 3 did not have enough information to understand how they could be mitigated. Of those 112:

  • 33 (29%) were eliminated simply by implementing the OS as a componentised system on top of a microkernel! These are attacks against functionality Linux implements in the kernel, while a microkernel-based OS implements them as separate processes. Examples are file systems and device drivers. If a Linux USB device driver is compromised, the attacker gains control over the whole system, because the driver runs with kernel privileges. In a well-designed microkernel-based system, only the driver process is compromised, but since our application does not require USB, it remains completely unaffected.
  • A further 12 exploits (11%) are eliminated if the underlying microkernel is formally verified (proved correct), as we did with seL4. These are exploits against functionality, such as page-table management, that must be in the kernel, even if it’s a microkernel. A microkernel could be affected by such an attack, but in a verified kernel, such as seL4, the flaws which these attacks target are ruled out by the mathematical proofs.
    Taken together, we see that 45 (40%) of the exploits would be completely eliminated by an OS designed based on seL4!
  • Another 19 (17%) of exploits are strongly mitigated, i.e. reduced to a relatively harmless level, only affecting the availability of the system, i.e. the ability of the application to make progress. These are attacks where a required component, such as NIC driver or network stack, is compromised, and as a result compromising the whole Linux system, while on the microkernel it might lead to the network service crashing (and becoming unavailable) without being able to compromise any data. So, in total, 57% of attacks are either completely eliminated or reduced to low severity!
  • 43 exploits (38%) are weakly mitigated with a microkernel design, still posing a serious security threat but no longer qualifying as “critical”. Most of these were attacks that manage to control the GPU, implying the ability to manipulate the frame buffer. This could be used to trick the human user into entering sensitive information that could be captured by the attacker.

Only 5 compromises (4%) were not affected by OS structure. These are attacks that can compromise the system even before the OS assumes control, eg. by compromising the boot loader or re-flashing the firmware, even seL4 cannot defend against attacks that happen before it is running.


Effect of (verified) microkernel-based design on critical Linux exploits.

What can we learn from this?

So you might ask, if verification prevents compromise (as in seL4), why don’t we verify all operating systems, in particular Linux? The answer is that this is not feasible. The original verification of seL4 cost about 12 person years, with many further person years invested since. While this tiny compared to the tens of billions of dollars worth of developer time that were invested in Linux, it is about a factor 2–3 more than it would cost to develop a similar system using traditional quality assurance (as done for Linux). Furthermore, verification effort grows quadratically with the size of the code base. Verifying Linux is completely out of the question.

The conclusion seems inevitable: The monolithic OS design model, used by Linux, Windows, macOS, is fundamentally and irreparably broken from the security standpoint. Security is only achievable with a (verified) microkernel design. Furthermore, using systems like Linux in security- or safety-critical applications is at best grossly negligent and must be considered professional malpractice. It must stop.


Insecure by design – lessons from the Meltdown and Spectre debacle

The disclosure of the Meltdown and Spectre computer vulnerabilities on January 2, 2018 was in many ways unprecedented. It shocked – and scared – even the experts.

The vulnerabilities bypass traditional security measures in the computer and affect billions of devices, from mobile phones to massive cloud servers.

We have, unfortunately, grown used to attacks on computer systems that exploit the inevitable flaws resulting from vast conceptual complexity. Our computer systems are the most complex artefacts humans have ever built, and the growth of complexity has far outstripped our ability to manage it.

A new kind of vulnerability

Meltdown and Spectre are qualitatively different from previous computer vulnerabilities. Not only are they effective across a wide class of computer hardware and operating systems from competing vendors. And not only were the vulnerabilities hiding in plain sight for more than a decade. The really shocking realisation is that Meltdown and Spectre do not exploit flaws in the computer hardware or software.

As Intel stated in its press release, these attacks:

…gather sensitive data from computing devices that are operating as designed.

The ingenuity of the attacks lies in combining seemingly unrelated design features that were thought to be well understood – stuff we teach undergraduate computer science students. The vulnerability is not in any of the individual features, but in the complex interaction between them.

It turns out that computer systems are insecure not because of mistakes made in the implementation, but because of ill-conceived design.

As a community of computer systems experts, we have to ask ourselves how such a debacle is possible, and how a recurrence can be prevented.

We have known for a while that the established “wait for something to happen and then try to fix it” approach – better known as “patch and pray” – does not work even for more common implementation flaws, as witnessed by the proliferation of exploits. It works even less well for such insecure-by-design situations.

Automated evaluation of designs

The fundamental problem is that humans are unable to fully understand the conceptual complexity of modern computer systems and how its seemingly unrelated features might interact. There is no hope that this will change.

But solving complex problems is what machines are increasingly good at. So, the only real solution can be the automated evaluation of designs, with the aim of mathematically proving that under all circumstances a design will behave in a way that is considered secure – in particular by not leaking secret data.

In other words, a design must be considered insecure unless there is a rigorous mathematical proof to the contrary.

This is not an easy ask by any definition, and much more work across many areas of computer science and engineering is needed to make it a reality. But we need to start somewhere, and we need to start now.

We will reap benefits of embarking on such a program long before we achieve the goal of rigorous end-to-end proof. Significant improvements will be achieved through partial results, both in the form of proving weaker properties, and by establishing desired properties in a less rigorous fashion.

For example, an incomplete evaluation may be more feasible than a complete one, and produce a probabilistic result, such as a greatly reduced likelihood of exploits.

Rewriting the hardware-software contract

A necessary, and overdue, first step is a new and improved hardware-software contract.

Computer systems are a combination of hardware and software. The people and companies that develop hardware are largely separate from those developing the software. Given the vastly different skills and experience required, this is inevitable.

To make development practical, both sides work to an interface, called the instruction-set architecture (ISA), which presents the contract between hardware and software functionality.

The problem, clearly exhibited by the Meltdown and Spectre attacks, is that the ISA is under-specified for security, or safety for that matter. It simply does not provide ways to isolate the speed of progress of a computation from other system activities.

The ISA is a functional specification, meaning it defines how the visible state of the machine will (eventually) change if an operation is triggered. It intentionally abstracts away anything to do with time. In particular, it hides how long operations take and how this time depends on the internal state of the machine. The problem is that this internal state depends on potentially confidential data processed by previous operations.

This means that by observing the exact timing of particular sequences of operations, it is possible to infer data that is supposed to be kept secret. This is exactly what happened with Meltdown and Spectre.

The abstraction is there for a good reason: It allows hardware designers to change things “under the hood”, usually in order to improve performance. Consequently, there will be resistance from hardware manufacturers to a tighter contract. But we believe that the refined specifications can be kept abstract enough to retain manufacturers’ ability to innovate, and to avoid exposing confidential IP.

The ConversationThe recent debacle has shown that the ISA is too abstract, making it impossible to tell whether a system is secure or if it will leak secrets. This must change, urgently.

Co-authored with Yuval Yarom, Researcher, CSIRO’s Data61; Senior Lecturer, University of Adelaide

This article was originally published on The Conversation. Read the original article.

Verified software can (and will) be cheaper than buggy stuff

Software is expensive. Verified software is more expensive (although not as much as people think). But we’re working on closing that gap, with the aim of making verified (highest-assurance) software no more expensive than traditionally engineered (no-assurance) code. Our Cogent approach is a major step in that direction.

The cost of verified software

Our paper describing the complete seL4 verification story analysed the cost of designing, implementing and proving the implementation correctness of seL4. We found that the total cost (not counting one-off investments into tools and proof libraries) came to less than $400 per line of code (LOC).

This may look expensive at first, but is less so if taken in context. The Pistachio microkernel, developed a few years earlier in very similar circumstances (university environment with people who knew what they were doing) was only a factor of 2–3 less expensive. Pistachio makes no assurance claims whatsoever (but is generally well-engineered). Take into account that Pistachio experimented far less with kernel design, and its cost does not include the full life-cycle cost (especially later bug-hunting, which doesn’t happen for seL4), you’ll see that we aren’t all that far away.

Another data point is a number quoted by Green Hills some ten years ago: They estimated the full (design, implementation, validation, evaluation and certification) cost of their high-assurance microkernel to be in the order of $1k/LOC.

In other words, we are already much cheaper than traditional high-assurance software, and a factor of 2-3 away from low-assurance software. If we could close the latter gap, then there would be no more excuse for not verifying software.

How to reduce verification cost

The main reason for the high cost of verification is that it’s very labour-intensive. Almost all of the (by now) quarter million lines of seL4 proof code are hand-written, by real humans. And as those numbers show, there is a lot of proof code to write. Just for the seL4 correctness proof (we now have many additional proofs of high-level properties), there were over 20 lines of proof per line of C.

This means there are two obvious avenues to reducing cost:

  1. reducing the ratio of proof lines to code lines, and
  2. generating proofs automatically.

Reducing the amount of proof is possible when the code is written in a more verification-friendly way. Not much can be done there when writing in C, the implementation language of seL4 (for good reason). However if we use a high-level language that is strongly typed and memory safe, many properties we had to prove for seL4 would be automatically enforced by the language. Furthermore, functional languages have many advantages, such as freedom from side effects, and verifying code written in such a language is easier (requires fewer proofs) than imperative-language code.

A glimpse of this we got in the original seL4 verification. We did this in two steps, first refining from the formal kernel spec to an “executable spec” which was essentially Haskell code, and then a second refinement from there to C code. Had we stopped at Haskell, we would have saved about 1/3 of the total proof effort. But the Haskell code was actually intentionally written in an imperative style, and more low-level than you would normally do. This is because verifiers preferred to work on the first refinement, and tried to get as much as possible out of it by pushing the Haskell level as low as possible. Refining to more “normal” Haskell code would have made the first refinement step much simpler/cheaper (at the cost of disproportionately more effort going into the second refinement).

Of course, had we stopped at Haskell we would not have a complete verification story. We would have to trust the Haskell compiler, which is an order of magnitude bigger than seL4, and a big Haskell run time. We therefore did the real implementation in C, and proved its correctness (and later also that it was correctly translated into machine code).

However, the translation from Haskell to C, while it was done by a human, was for the most part fairly mechanical (the main exception being the performance-critical fast paths). In fact, it should be possible to automate this translation. But if you automate the translation (i.e. build a compiler) then it should be possible to prove the correctness of that translation, as it had been done by the CompCert certifying C compiler.

Cogent: Co-generation of code and proof

Our Cogent framework combines the above two observations in order to reduce verification cost:

  • we use a high-level, strongly typed, memory-safe functional
    implementation language — Cogent;
  • we automatically translate Cogent into C, and with the C also
    generate a formal spec of the code, plus a proof that the generated
    C is correct against this spec.

The generated spec is in the logic of the theorem prover, Isabelle, and is at the same abstraction level as the Cogent code (visually equivalent).

Importantly, Cogent compiles into straight C, without the need for a run-time system, so verifying that code is all that’s needed. Cogent code does require some abstract data types (ADTs) implemented in C and to be verified manually, but these are explicitly called by Cogent code, so there’s no library code that gets pulled in under the hood. And there’s certainly no such ugly stuff as garbage collectors.

The combination of a nice, clean functional spec and ability to compile into straight and efficient C without garbage collection is possible through the use of linear types, which restrict the functional representation enough to enable the right translation. The need for a separate ADT library is the cost we pay for this, as the resulting language is not Turing complete.

With this, all we have to do is implement our system in Cogent and then verify that the Cogent code (as represented by the Isabelle spec) is correct against the top-level formal spec. In other words, had we written seL4 in Cogent, we would be done after the first refinement (plus verifying the ADTs). Furthermore, Cogent code is more high-level than our Haskell implementation. We would have easily saved between a third and half of the overall verification effort.

It works!

We didn’t try to use Cogent to (re)implement seL4. After all, seL4 is done, and who wants to redo something that’s already been done? (Note that the linear type system would have forced much of the kernel into external libraries, thus defeating the point of using Cogent.) Instead we picked other important systems code as a case study: file systems.

We implemented two file systems in Cogent. One is a re-implementation of the Linux ext2fs. It’s a complete implementation of the original ext2fs and passes the test suite, but lacks some more recent features (symlinks and ACLs).

The second file system is BilbyFS, a custom flash file system, which in complexity and efficiency sits somewhere in between the Linux standard flash file systems JFFS2 and UBIFS.

Our paper recently published at ASPLOS describes the experience. We found that we’re not yet quite there in terms of performance: while our file systems achieved essentially the same throughput as the hand-written C implementations, CPU load indicates that there’s presently about a factor-two overhead. According to our motto that security is no excuse for bad performance, we’ll have more work to do. One issue was that we overestimated the ability of the C compiler to optimise struct arguments, we get a lot of redundant data copies. But this is fixable by putting more smarts into the Cogent compiler.

But our experience so far shows that we’re on the right track: We can implement real-world systems code in Cogent, and integrate with real systems. For evaluation we loaded our file systems into Linux, but we will use them natively in seL4-based systems as well.

What’s the cost of verification now?

An evaluation of the verification cost is extremely encouraging. We did not completely verify our file systems, but we completely verified the implementation of two external APIs for BilbyFS: sync() and iget(). And the result is stunning: The implementation of all functionality required for sync() is 300 lines of Cogent, iget() is 200 lines. The verification of that functionality required 5,700 lines of proof for sync() and 1,800 lines for iget(). So, the first take-away is that the ratio of proof to code lines seems to be reduced, which is what we hope to see and guarantees some cost reduction.

We see a similar picture when looking at proof effort. In seL4, the cost was 1.4 person years (py) per kLOC, in BiblyFS this is down to 0.6py/kLOC, about a 60% reduction. This is due to a combination of two factors: the functional abstraction provided by Cogent, but also the fact that BilbyFS is much more modular than possible for a high-performance microkernel.

We cannot yet read too much into those numbers, as they are for small pieces of code, and experience shows that effort grows roughly with the square of the size. But the trend is good. And it is roughly in line with what we expect.

But the important part is that we totally eliminate the manual effort in verifying down to C. That’s roughly the equivalent of seL4’s second refinement step, except that we can make this automated step bigger with Cogent (and correspondingly reduce the manual step). As argued above, this alone should result in a 1/3 to 1/2 reduction of effort. And this is what we have already.


More work remains to be done, especially with respect to eliminating the performance gap. But we have taken a big step with respect to reducing the factor 2–3 cost gap of formal verification. We estimated that we’ve at least halved it.

The grand goal remains the elimination of this gap. I’m confident we’ll get there in a few years. And imagine the implications: verified, highest-assurance software at the cost of traditional, no-assurance software. We really may be able to make software bugs a thing of the past!

Cogent is open source.

Thanks to Toby Murray for feedback on this blog.

So, the FBI cracked the iPhone. With a zero-day and hardware?

The Washington Post reports that the FBI bought a zero-day exploit, and “then [used the] new information to create a piece of hardware that helped the FBI to crack the [San Bernardino] iPhone’s four-digit [PIN]”.

This is a bit intriguing, as there are known physical attack vectors that could be used, with some difficulty, to crack the PIN. If they used a physical attack, why did they need a software exploit?

It is also a bit ironic, as I had just finally caved in and bought an iPhone myself. A strong motivation was the security story, of which our L4 kernel is a significant component. Specifically, modern iOS devices have a security processor, which runs a modified L4 kernel. That kernel is in fact a fork of the L4-embedded kernel we developed more than ten years ago, and which later became OKL4. Apple’s iOS L4 is derived from the L4 kernel that supported our Darbat project.

So, what does this all mean for the security of iOS devices, and specifically, does it mean that L4 was broken?

First off, I find it rather comforting that a physical attack was needed, using some custom hardware. It was always clear that physical attacks would be possible. From the security PoV, such attacks are much less of a concern for the innocent individual (such as me), because they require a lot of effort for each individual device, and they require physical possession, i.e. cannot be done remotely. I don’t object to law enforcement being able to break an individual device if there is a concrete suspicion, as in the San Bernardino case. I do object against anything that enables mass surveillance of innocent citizens, which is what the FBI wanted Apple to enable (really just a mind-boggling stupid request). I’m more than glad that Apple refused to cooperate with this nonsense.

So, whatever the zero-day was the FBI purchased, it seems that on its own it wasn’t enough to break security. That’s a good outcome.

It’s also highly unlikely that the zero-day was against L4. The article makes it clear that the actual PIN was brute-forced (easy to do for a four-digit PIN). L4 cannot protect against this. And it also must be kept in mind that L4 only protects the security processor, which is in charge of crypto operations and, it seems, key management.

The PIN is, by definition, able to unlock the phone. It is stored somewhere on the device in encrypted form, and the security processor (and L4) is responsible for that. But the protection against brute-forcing (by counting unsuccessful attempts etc) require far more infrastructure than the security processor. Some of that is inevitably under control of iOS. So, almost certainly, the zero-day exploit targeted iOS, not L4.

That there are zero-days in iOS is hardly surprising: it is a large, monolithic OS with a trusted computing base (TCB) in the millions of lines of code. It’s a virtual certainty that there are many more exploits lurking in there.

Unlikely as it is, there might be exploits in L4 as well. It’s much less likely, as L4 is orders of magnitude smaller than iOS, and the number of exploits it might have can be expected to be corresponding orders of magnitude smaller. But it’s a legacy L4, not the verified seL4, and therefore there is no guarantee that there are no bugs. And, of course, we don’t know what Apple did with it.

To sum up, recent developments do not change my opinion that the security of iOS devices is strong, almost certainly among the best in mass-market consumer devices. Cracking it required a physical attack, which is practically impossible to defend against, certainly for a mass-market device. L4 likely made the thing harder to crack, and is unlikely to be the part that was compromised.

Update June’16

As people pointed out soon after I posted this blog, the San Bernadino iPhone was an older model that featured a pre-A7 SoC, and as such does not yet have the secure enclave that runs L4. Presumably this means newer phones are much harder to crack…

How to steal encryption keys: Your cloud is not as secure as you may think!

People increasingly depend on cloud services for business as well as private use. This raises concerns about the security of clouds, which is usually discussed in the context of trustworthiness of the cloud provider and security vulnerability of cloud infrastructure, especially the hypervisors and cryptographic libraries they employ.

We have just demonstrated that the risks go further. Specifically, we have shown that it is possible to steal encryption keys held inside a virtual machine (that may be hosting a cloud-based service) from a different virtual machine, without exploiting any specific weakness in the hypervisor, in fact, we demonstrated the attack on multiple hypervisors (Xen as well as VMware ESXi).

Cache-based Timing Channels

The performance of modern computing platforms is critically dependent on caches, which buffer recently used memory contents to make repeat accesses faster (exploiting spatial and temporal locality of programs). Caches are functionally transparent, i.e. they don’t affect the outcome of an operation (assuming certain hardware or software safeguards), only its timing. But the exact timing of a computation can reveal information about its algorithms or even the data on which it is operating – caches can create timing side channels.

In principle, this is well understood, and cloud providers routinely take steps to prevent timing channels, by not sharing processor cores between virtual machines belonging to different clients.

Preventing L1-based cache channels

Why does this help? Any exploitation of a cache-based channel inherently depends on shared access (between the attacker and the victim) of this hardware resource. An attacker has no direct access to the victim’s data in the cache, instead cache attacks are based on the attacker’s and victim’s data competing for space in the cache. Basically, the attacker puts (parts of) the cache into a defined state, and then observes how this changes through the competition for cache space by the victim.


Memory architecture in a modern x86 processor.

Modern processors have a hierarchy of caches, as shown in the diagram to the right. The top-level “L1” caches (there is usually a separate L1 for data and instructions) are fastest to access by programs executing on the processor core. If a core is shared between concurrently executing virtual machines (VMs), as is done through a hardware feature called hyperthreading, the timing channel through the cache is trivial to exploit.

So, hypervisor vendors recommend disabling hyperthreading, and we can assume that any responsible cloud provider follows that advice. Hyperthreading can provide a performance boost for certain classes of applications, but typical cloud-based services probably don’t benefit much, so the cost of disabling hyperthreading tends to be low.

In principle, it is possible to exploit the L1 timing channel without hyperthreading, if the core is multiplexed (time-sliced) between different VMs. However, this is very difficult in practice, as the L1 caches are small, and normally none of the attacker’s data survives long enough in the cache to allow the attacker to learn much about the victim’s operation. Small caches imply a small signal. (Attacks against time-sliced L1 caches generally requiring the attacker to force high context-switching rates, eg by forcing high interrupt rates, something the hypervisor/operator will consider abnormal behaviour and will try their best to prevent.)

In somewhat weaker form, what I have said about L1-based attacks also applies to other caches further down in the memory hierarchy, as long as they are private to a core. On contemporary x86 processors, this includes the L2 cache. L2 attacks tend to be harder than L1 attacks for a number of reasons (despite the larger size of the L2, compared to the L1), and the same defences work.

Attacking through the last-level cache (L3)

The situation is different for the last-level cache (nowadays the L3), which, on contemporary x86 processors, is shared between multiple cores (either all cores on the processor chip, or a subset, called a cluster).

What is different is that, on the one hand, the cache is further away from the core, meaning much longer access times, which increases the difficulty of exploiting the timing channel. (Technically speaking, the longer access time reduces the sampling rate of the signal the attacker is trying to measure.) On the other hand, the fact that multiple cores access the same L3 concurrently provides an efficiency boost to the exploit (by increasing the rate at which the signal can be sampled), similar to the case of the L1 cache with hyperthreading enabled.

Still, L3-based attacks are difficult to perform, to the point that to date people generally don’t worry about them. There have been attacks demonstrated in the past, but they either depended on weaknesses (aka bugs) in a specific hypervisor (especially its scheduler) or on unrealistic configurations (such as sharing memory between VMs), which obviously break inter-VM isolation – a security-conscious cloud provider (or even a provider concerned about accounting for resources accurately) isn’t going to allow those.

It can be done!

However, these difficulties can be overcome. As we just demonstrated in a paper that will be presented at next month’s IEEE Security and Privacy (“Oakland”) conference, one of the top security venues, the L3 timing channel can be exploited without unrealistic assumptions on the host installation. Specifically:

  • We demonstrated that we can steal encryption keys from a server running concurrently, in a separate VM, on the processor.
  • Stealing a key can be done in less than a minute in one of our attacks!
  • The attack does not require attacker and victim to share a core, they run on separate cores of the same processor.
  • The attack does not exploit specific hypervisor weaknesses, in fact we demonstrated it on two very different hypervisors (Xen as well as VMware ESXi).
  • The attack does not require attacker and victim to share memory, even read-only (in contrast to earlier attacks).
  • We successfully attack multiple versions of encryption (which use different algorithms).

What does this mean in practice?

The implications are somewhat scary. The security of everything in clouds depends on encryption. If you steal encryption keys, you can masquerade as a service, you can steal personal or commercial secrets, people’s identity, the lot.

Our attack targeted two specific implementations of a crypto algorithm in the widely used GnuPG crypto suite, it exploits some subtle data-dependencies in those algorithms, which affect cache usage, and thus can be analysed by a sophisticated attacker to determine the (secret!) encryption keys. Ethics required that we informed the maintainers to allow them to fix the problem (we provided workarounds) before it became public. The maintainers updated their code, so an installation running up-to-date software will not be susceptible to our specific attack.

However, the problem goes deeper. The fact that we could successfully attack two different implementation of encryption means that there are likely ways to attack other encryption implementations. The reason is that it is practically very difficult to implement those algorithms without any exploitable timing variations. So, one has to assume that there are other possible attacks out there, and if that’s the case, it is virtually guaranteed that someone will find them. Eventually such a “someone” will be someone nasty, rather than ethical researchers.

Why do these things happen?

Fundamentally, we are facing a combination of two facts:

  • Modern hardware contains a lot of shared resources (and the trend is increasing).
  • Mainstream operating systems and hypervisors, no matter what they are called, are notoriously poor at managing, and especially isolating, resources.

Note that we did not exploit anything in the hypervisor most people would consider a bug (although I do!), they are simply bad at isolating mutually-distrusting partitions (as VMs in a cloud are). And this lack of isolation runs very deep, to the degree that it is unlikely to be fixable at realistic cost. In other words, such attacks will continue to happen.

To put it bluntly: From the security point of view, all main-stream IT platforms are broken beyond repair.

What can be done about it?

Defending legacy systems

The obvious way to prevent such an attack is never to share any resources. In the cloud scenario, this would mean not running multiple VMs on the same processor (not even on different cores of that processor).

That’s easier said than done, as it is fundamentally at odds with the whole cloud business model.

Clouds make economic sense because they lead to high resource utilisation: the capital-intensive computing infrastructure is kept busy by serving many different customers concurrently, averaging out the load imposed by individual clients. Without high utilisation, the cloud model is dead.

And that’s the exact reason why cloud providers have to share processors between VMs belonging to different customers. Modern high-end processors, as they are deployed in clouds, have dozens of cores, and the number is increasing, we’ll see 100-core processors becoming mainstream within a few years. However, most client VMs can only use a small number of cores (frequently only a single one). Hence, if as a cloud provider you don’t share the processor between VMs, you’ll leave most cores idle, your utilisation goes down to a few percent, and you’ll be broke faster than you can say “last-level-cache timing-side-channel attack”.

So, that’s a no-go. How about making the infrastructure safe?

Good luck with that! As I argued above, systems out there are pretty broken in fundamental ways. I wouldn’t hold my breath.

Taking a principled approach to security

The alternative is to build systems for security from the ground up. That’s hard, but not impossible. In fact, our seL4 microkernel is exactly that: a system designed and implemented for security from the ground up – to the point of achieving provable security. seL4 can form the base of secure operating systems, and it can be used as secure hypervisor. Some investment is needed to make it suitable as a cloud platform (mostly to provide the necessary management infrastructure), but you can’t expect to get everything for free (although seL4 itself is free!)

While seL4 has already a strong isolation story, so far this extends to functional behaviour, not yet timing. But we’re working on exactly that, and should have a first solution this year.

So, why should you believe me that it is possible to add such features to seL4, but not to mainstream hypervisors? The answer is simple: seL4 is designed from ground up for high security, and none of the mainstream systems is. And second, seL4 is tiny, about 10,000 lines of code. In contrast, main-stream OSes and hypervisors consist of millions of lines of code. This not only means they are literally full of bugs, thousands and thousands of them, but also that it is practically impossible to change them as fundamentally as would be required to making them secure in any serious sense.

In security, “small is beautiful” is a truism. And another one is that retrofitting security into a system not designed for it from the beginning doesn’t work.

Update June’16

Researchers from Wochester Polytechnic have in the meantime demonstrated our attack on a real Amazon AC2 cloud. This shows beyond doubt that the threat is real. Which, of course, doesn’t stop could providers from sticking their heads in the sand.

Whither ERA?

The ARC released the composition of the ERA’15 Research Evaluation Committees (RECs) a few days ago. The one relevant to us is the Mathematics, Information and Computing Sciences (MIC) REC. So I was a bit surprised when I looked at it and recognised almost no names.

For those living outside Australia, outside academia, or have their head firmly burrowed in the sand, ERA is the Excellence in Research for Australia exercise the Australian Research Council (ARC, Oz equivalent of the NSF) has been running since 2010. It aims to evaluate the quality of research done at Australian universities. I was involved in the previous two rounds, 2010 as a member of the MIC panel, and 2012 as a peer reviewer.

The ERA exercise is considered extremely important, universities take it very seriously, and a lot of time and effort goes into it. The outcomes are very closely watched, universities use it to identify their strengths and weaknesses, and everyone expects that government funding for universities will increasingly be tied to ERA rankings.

The panel is really important, as it makes the assessment decisions. Assessment is done for “units of evaluation” – the cartesian product of universities and 4-digit field of research (FOR) codes. The 4-digit FORs relevant to computer science and information systems are the various sub-codes of the 2-digit (high-level) code 08 – Information and Computing Sciences.

For most other science and engineering disciplines, assessment is relatively straightforward: you look at journal citation data, which is a pretty clear indication of research impact, which in turn is a not unreasonable proxy for research quality. In CS, where some 80% of publications are in conferences, this doesn’t work (as I’ve clearly experienced in the ERA’10 round): the official citation providers don’t understand CS, they don’t (or only very randomly) index conferences, they don’t count citations of journal papers by conference papers, and the resulting impact factors are useless. As a result, the ARC moved to peer-review for CS in 2012 (as was used by Pure Maths and a few other disciplines in 2010 already).

Yes, the obvious (to any CS person) answer is to use Google Scholar. But for some reason or other, this doesn’t seem to work for the ARC.

Peer review works by institutions nominating 30% of their publications for peer review (the better ones, of course), and several peer reviewers are each reviewing a subset of those (I think the recommended subset is about 20%). The peer reviewer then writes a report, and the panel uses those to come up with a final assessment. (Panelists typically do a share of peer review themselves.)

Peer review is inevitably much more subjective than looking at impact data. You’d like to think that the people doing this are the leaders in the field, able to objectively assess the quality of the work of others. A mediocre researcher is likely to emphasise factors that would make themselves look good (although they are, of course, excluded from any discussion of their own university). Basically, I’d trust the judgment someone with an ordinary research track record much less than that of a star in the field.

So, how does the MIC panel fare? Half of it are mathematicians, and I’m going to ignore those, as I wouldn’t be qualified to say anything about their standing. But for CS folks, citation counts and h-factors as per google scholar, in the context of the number of years since their PhD, is a very good indication. So let’s look at the rest of the MIC panellists, i.e. the people from computer science, information systems or IT in general.

Name Institution years of PhD cites h-index
Leon Sterling (Chair) Swinburne ~25 5,800 28
Deborah Bunker USyd 15? max cite =45
David Green Monash ~30 3,400 30
Jane Hunter UQ 21 3,400 29
Michael Papazoglou Tilburg 32 13,200 49
Paul Roe QUT 24 <1,000 17
Markus Stumptner UniSA ~17 2,900 28
Yun Yang Swinburne ~20 3,800 30

[Note that Prof Bunker has no public Scholar profile, but according to Scholar, her highest-cited paper has 45 citations. Prof’s Sterling’s public Scholar profile includes as the top-cited publication (3.3k cites) a book written by someone else, subtracting this leads to the 5.8k cites I put in the table. Note that his most cited publication is actually a textbook, if you subtract this the number of cites is 3.2k.]

Without looking at the data, one notices that only three of the group are from the research-intensive Group of Eight (Go8) universities, plus one from overseas. That in itself seems a bit surprising.

Looking at the citation data, one person is clearly in the “star” category: the international member Michael Papazoglou. None of the others strike me as overly impressive, a h-index of around 30 is good but not great, similar with citations around the 3000 mark. And in two cases I can really only wonder how they could possibly been selected. Can we really not come up with a more impressive field of Australian CS researchers?

Given the importance of ERA, I’m honestly worried. Those folks have the power to do a lot of damage to Australian CS research, by not properly distinguishing between high- and low-quality research.

But maybe I’m missing something. Let me know if you spot what I’ve missed.

Peer Review: Anonymity should not be at the expense of transparency

We’ve all been through this: You do some work, you think it’s good, you send it to a top conference, and it gets rejected. Happens to the best of us, and, in fact, happens more often than many think. For example, 2013 was a very good year for me publications-wise, with 7 papers in top venues (SOSP, TOCS, EuroSys, SIGMOD, OOPSLA and RTAS). Yet, of my 19 submissions that year, 8 got rejected, so my acceptance rate wasn’t even 60%. And that was a good year. I had much worse.

It has just happened again, this time with EuroSys, by any standard a top conference. (I would say that, I was the PC chair a few years ago. 😉 )

This time I was a bit, say, surprised. The paper had four reviews: one was a “weak reject”, one was a “weak accept”, and the remaining two were clear “accept”, making it, as my colleague and co-author Kev remarked, our highest-rated reject ever.

Now this sort of stuff happens. While one would naively think that with two clear supporters and one weak critic, this should get up, the reality is different. Things may come out of the discussion at the PC meeting, the weak critic may become a stronger critic from reading other reviews, etc. I’ve seen a lot of this happen in the many PCs I served. And I’m certainly not saying that the PC meeting made a mistake. That isn’t my call, the whole point of peer review is that the judgement is made independently.

However, I do have a significant issue with the way this paper was handled by the PC, and that relates to transparency: The PC gave us, the authors, no halfway convincing justification for their decision. To the contrary, the points raised against the paper by the review could have easily been rebutted. In the interest of transparency I will not just make this claim, but back up my rant by going through the reviewer’s arguments below.

So, basically, the PC made a decision that is not understandable or reproducible from the data at hand. This is an issue in that it creates an impression of arbitrariness, that has the potential to damage the reputation of the conference. Hopefully it is a one-off case. However, the conference failed to use two simple mechanisms that are part of what I consider to be best practice in transparency.

  1. Rebuttals: Many conferences send out reviews to authors and give them a few days to respond to them before the PC meeting. This is a mandatory requirement for all SIGPLAN-sponsored conferences (including ASPLOS, which is also one of “ours”). It has been used by most if not all EuroSys PCs since I introduced it in 2011, and has been used by many other top conferences, including last year’s OSDI. EuroSys’15 didn’t do rebuttals, which, in my opinion, is a transparency regression.
  2. Summary of PC discussion: This has recently been adopted by many top conferences: Papers discussed at the PC (at least those which end up being rejected) are given a brief summary of the PC discussion, stating the reasons for rejection. This has been used by both major systems conferences on whose PCs I served during the last year (OSDI and ASPLOS) and many others. I don’t remember whether previous EuroSys PCs used it, but it’s become a common enough practice to consider it best practice.

By not doing either of these, the EuroSys PC didn’t actually do anything wrong, but clearly missed out an opportunity to be seen to be transparent. I think that’s highly regrettable for a conference I think highly of. In the particular instance of our paper, I would have been far less annoyed about the outcome if I was given a good justification.

So, let’s have a look at the information we did get. Below I reproduce uncensored (other than for the list of typos) Review A, the only negative one. And I’ll show how I can rebut all criticism from the paper (in line with normal rules for rebuttals, which aren’t allowed to introduce new material). For clarity, I’ll be more detailed and verbose than I could in an actual rebuttal, which is generally very space constrained. You can check my arguments against the submitted version of the paper. And, to show that I’m not hiding anything, here are the complete reviews for the paper.

Note that I do not claim that being given the opportunity for rebuttal would have changed the outcome. But it would have given us some assurance that the decision wasn’t made on the basis of misunderstandings (this is the precise reason for rebuttals), and a summary would have indicated to us that our comments were taken into account. Without that we are left wondering.

Ok, here comes the review.

Overall merit: 2. Weak reject
Reviewer expertise: 4. Expert

===== Paper summary =====

The paper explores a trade-off between coarse grained locking, fine grained locking, and transactional memory in the context of a microkernel. It claims that using a single global lock with microkernel on a small embedded system is advantageous in these settings, the main reason being it allows for static verification of the kernel. The differences between used synchronization techniques are evaluated on a quad-core x86 and ARM systems and seL4 microkernel. The evaluation, which uses a server workload, shows no significant performance difference between the techniques on a real-world sized benchmark.

===== Comments for author =====

Experimental results don’t quite back-up the claim that a single global lock is better than fine-grained locking. Fine-grained locking performs best on both x86 and ARM with 4 cores on the multi-core IPC micro-benchmark (Figure 7), which is the benchmark showing any difference between synchronization different.

This simply mis-represents the claims of the paper. What we claim (in the abstract) is that “coarse-grained locking [with a microkernel on closely-clustered cores] achieves scalability comparable to fine-grained locking.” Further, at the end of the Introduction, we state that “the choice of concurrency control in the kernel is clearly distinguishable for extreme synthetic benchmarks. For a real I/O (and thus system call intensive) work-load, only minor performance differences are visible, and coarse grain locking is preferred (or the equivalent elided with hardware transactional memory) over extra kernel complexity of fine-grained locking.” A similar statement is in the Discussion section. I claim that the statements in the paper are consistent with the results presented, and contradict the assertions made by the reviewer.

If fine-grained locking performs best, even if only on a micro-benchmark, already with only four cores, why not use it?

Again, we explain so in the paper: it is more complex, and as such very error-prone (as we did experience very painfully when implementing it). That concurrency is hard is well established, and if you can avoid it without degrading performance, then it’s a reasonable thing to do. And, as we also clearly state, we wouldn’t be able to verify such a highly-concurrent kernel. So I really don’t understand why the reviewer simply ignores our arguments.

I think that it is interesting to see that transactional memory outperforms fine-grained locking and is the best option overall, but it is not (yet) widely spread and the paper is making a claim about BKL anyway.

While our microbenchmarks show that transactional memory performs best, this is only visible in extreme microbenchmarks. Like microbenchmarks in general, these are an analysis tool, but don’t tell you anything about real performance, as they are not representative of real workloads. (You don’t do 1000s of back-to-back identical system calls with no actual work in between, except for analysis.) This is why we did macrobenchmarks (Figs 8–11). And they show no discernible performance differences between the configuration. This is the core data that back our argument that the details of locking don’t matter (for the particular part of the design space we’re looking at), and justify taking the simplest option.

It is not clear that using four-core systems to evaluate synchronization techniques is acceptable today, as even phones are starting to have eight cores. The findings of the study are not really valid for eight cores, as claimed in the paper, as contention will be much higher with twice as many cores.

Yes, 8-way ARM chips are starting to appear, and our data cannot be used to prove anything beyond 4 cores. However, given that on the macrobenchmark there is no discernible difference for 4 cores, it is a fairly safe assumption that if there is a difference at 8 cores, it isn’t going to be dramatic.

This claim is backed by Fig 2(b). It represents an extreme case of hammering the kernel, with “think time” (i.e. out-of-kernel time) being equal to in-kernel time. As shown in Fig 1, only about 20% of pause times in our macrobenchmark are so short, the median is about four times as big. Hence, Fig 2(b) represents an unreasonably high lock contention. Yet, the benchmark still shows very little difference between lock implementations. If we double the number of cores, contention would double (as the reviewer states, it could actually be a bit worse than double), but would still not even be half of that of Fig 2(b) which shows essentially no difference. That’s why I can claim with reasonable confidence that the 8-core case would still work.

Performance of locking primitives and IPC is well evaluated and results are very clearly presented.

Evaluation is limited as it uses a single benchmark. It is hard to draw solid conclusions based on one workload.

We cannot dispute the fact that we only have one macro-benchmark. However, I can argue that not more could be learned from having more than one (see below).

Moreover, the benchmark is a server workload (YCSB on Redis) and the paper is supposed to focus on lower-end systems.

For the point of the exercise, it is totally relevant what kind of workload it is. All the actual work is at user level, the kernel just acts as a context-switching engine. The only thing that matters is the distribution of in-kernel and think times, as this is what determines lock contention.

We took this particular benchmark as, of all the realistic workloads we could think off, it hammers the kernel most savagely, by producing a very high system-call load. In other words, it was the most pessimistic yet halfway-realistic workload we could come up with.

Server systems will have many more cores (typically 16 today) and using fine-grained synchronization is likely to be even more important for them.

With respect, that’s besides the point. Server platform or not, we specifically state that we’re aiming at cores that share a cache and have low inter-core communication latencies (see Introduction). Server platforms with high core counts tend not to match this requirement, and we argue that they would be best served with a “clustered mulitkernel” approach, where a BKL is used across a closely-coupled cluster, and a multikernel approach without shared state across clusters. All that is stated in the Introduction of the paper.

It is not clear why more than 32000 records would exceed memory limitations of the systems. If benchmark uses 1kB key-value pairs, this should only be 32MB total and ARM system has 1GB and x86 has 16GB of memory.

OK, we could have been clearer about those implementation limitations of our setup.  However, if you think about it, you’ll realise it doesn’t matter for the results. Using more records, and thus increasing the working set of the user-level code, would have resulted in an increase of cache misses, and thus overall reduced userland performance. The kernel, with its tiny working set, would be less affected. The net effect is that think times go up, and thus lock contention goes down, making the benchmark more BLK friendly.

End of review.

In summary, I think I can comprehensively rebut any of the arguments the reviewer makes against the paper. Where does that leave us?

Well, if that was all there was against the paper, then it is really hard to understand why it was rejected. There may have been other arguments, of course. But then the basic principle of transparency would have obliged the reviewers to tell us. After all, that’s the point of having written reviews that get released to authors, it’s to avoid the impression that random or biased decisions get made behind closed doors.

So, in my opinion, the PC failed the transparency test. I can only hope that we were the only such case.


PS: There was also a shadow PC, consisting of students and junior researchers. It operated under the same conditions as the real PC, but had no influence on the decisions of the real PC. Interestingly, I found the reviews of the shadow PC of much better quality than the “real” ones, more extensive and better argued.