
How to steal encryption keys: Your cloud is not as secure as you may think!

People increasingly depend on cloud services for business as well as private use. This raises concerns about the security of clouds, which is usually discussed in the context of trustworthiness of the cloud provider and security vulnerability of cloud infrastructure, especially the hypervisors and cryptographic libraries they employ.

We have just demonstrated that the risks go further. Specifically, we have shown that it is possible to steal encryption keys held inside a virtual machine (that may be hosting a cloud-based service) from a different virtual machine, without exploiting any specific weakness in the hypervisor; in fact, we demonstrated the attack on multiple hypervisors (Xen as well as VMware ESXi).

Cache-based Timing Channels

The performance of modern computing platforms is critically dependent on caches, which buffer recently used memory contents to make repeat accesses faster (exploiting spatial and temporal locality of programs). Caches are functionally transparent, i.e. they don’t affect the outcome of an operation (assuming certain hardware or software safeguards), only its timing. But the exact timing of a computation can reveal information about its algorithms or even the data on which it is operating – caches can create timing side channels.

In principle, this is well understood, and cloud providers routinely take steps to prevent timing channels, by not sharing processor cores between virtual machines belonging to different clients.

Preventing L1-based cache channels

Why does this help? Any exploitation of a cache-based channel inherently depends on shared access (by the attacker and the victim) to this hardware resource. An attacker has no direct access to the victim’s data in the cache; instead, cache attacks are based on the attacker’s and victim’s data competing for space in the cache. Basically, the attacker puts (parts of) the cache into a defined state, and then observes how this changes through the victim’s competition for cache space.
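This “prime then observe” pattern can be sketched in a toy simulation. Note that the direct-mapped cache model and all parameters below are illustrative assumptions for exposition, not the configuration of any real processor or the actual attack code:

```python
# Toy prime-and-probe simulation on a tiny direct-mapped cache.
# All parameters are made up for illustration.

NUM_SETS = 8  # toy cache: direct-mapped, 8 sets

def set_index(addr):
    return addr % NUM_SETS

class ToyCache:
    def __init__(self):
        self.lines = [None] * NUM_SETS  # one (addr, owner) line per set

    def access(self, addr, owner):
        """Return True on a hit; a miss evicts whatever was resident."""
        idx = set_index(addr)
        hit = self.lines[idx] == (addr, owner)
        self.lines[idx] = (addr, owner)
        return hit

def prime_probe(victim_addrs):
    cache = ToyCache()
    for s in range(NUM_SETS):       # PRIME: attacker fills every set
        cache.access(s, "attacker")
    for a in victim_addrs:          # victim runs, evicting some lines
        cache.access(a, "victim")
    # PROBE: a miss (a slow access, on real hardware) marks a set the
    # victim touched.
    return {s for s in range(NUM_SETS) if not cache.access(s, "attacker")}

print(sorted(prime_probe([2, 5, 13])))   # address 13 maps to set 5
# → [2, 5]
```

On real hardware the attacker cannot ask the cache for hits directly; it infers them from access latency, which is what makes this a timing channel.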


Memory architecture in a modern x86 processor.

Modern processors have a hierarchy of caches, as shown in the diagram to the right. The top-level “L1” caches (there is usually a separate L1 for data and instructions) are fastest to access by programs executing on the processor core. If a core is shared between concurrently executing virtual machines (VMs), as is done through a hardware feature called hyperthreading, the timing channel through the cache is trivial to exploit.

So, hypervisor vendors recommend disabling hyperthreading, and we can assume that any responsible cloud provider follows that advice. Hyperthreading can provide a performance boost for certain classes of applications, but typical cloud-based services probably don’t benefit much, so the cost of disabling hyperthreading tends to be low.

In principle, it is possible to exploit the L1 timing channel without hyperthreading, if the core is multiplexed (time-sliced) between different VMs. However, this is very difficult in practice, as the L1 caches are small, and normally none of the attacker’s data survives long enough in the cache to allow the attacker to learn much about the victim’s operation. Small caches imply a small signal. (Attacks against time-sliced L1 caches generally require the attacker to force high context-switching rates, e.g. by forcing high interrupt rates, something the hypervisor/operator will consider abnormal behaviour and will try their best to prevent.)

In somewhat weaker form, what I have said about L1-based attacks also applies to other caches further down in the memory hierarchy, as long as they are private to a core. On contemporary x86 processors, this includes the L2 cache. L2 attacks tend to be harder than L1 attacks for a number of reasons (despite the larger size of the L2, compared to the L1), and the same defences work.

Attacking through the last-level cache (L3)

The situation is different for the last-level cache (nowadays the L3), which, on contemporary x86 processors, is shared between multiple cores (either all cores on the processor chip, or a subset, called a cluster).

What is different is that, on the one hand, the cache is further away from the core, meaning much longer access times, which increases the difficulty of exploiting the timing channel. (Technically speaking, the longer access time reduces the sampling rate of the signal the attacker is trying to measure.) On the other hand, the fact that multiple cores access the same L3 concurrently provides an efficiency boost to the exploit (by increasing the rate at which the signal can be sampled), similar to the case of the L1 cache with hyperthreading enabled.
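Competition for cache space happens at the granularity of cache sets: addresses that map to the same set evict each other, and such colliding addresses are the raw material of an attack. A sketch of the mapping, using assumed “typical” parameters (64-byte lines, 2048 sets per L3 slice), not the geometry of any specific processor:

```python
# Sketch of how an address maps to a last-level-cache set.
# LINE_SIZE and NUM_SETS are assumptions for illustration only.

LINE_SIZE = 64     # bytes per cache line (assumed)
NUM_SETS = 2048    # sets per L3 slice (assumed)

def l3_set_index(addr):
    # Drop the offset-within-line bits, then take the set-index bits.
    return (addr // LINE_SIZE) % NUM_SETS

# Addresses exactly LINE_SIZE * NUM_SETS (here 128 KiB) apart collide
# in the same set, and therefore compete for the same cache lines.
a = 0x12345000
b = a + LINE_SIZE * NUM_SETS
print(l3_set_index(a), l3_set_index(b))   # → 320 320
```

Real last-level caches complicate this with physical addressing and (on recent Intel parts) a hash that selects the cache slice, which is part of what makes L3 attacks harder than L1 attacks.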

Still, L3-based attacks are difficult to perform, to the point that to date people generally don’t worry about them. There have been attacks demonstrated in the past, but they either depended on weaknesses (aka bugs) in a specific hypervisor (especially its scheduler) or on unrealistic configurations (such as sharing memory between VMs), which obviously break inter-VM isolation – a security-conscious cloud provider (or even a provider concerned about accounting for resources accurately) isn’t going to allow those.

It can be done!

However, these difficulties can be overcome. As we just demonstrated in a paper that will be presented at next month’s IEEE Security and Privacy (“Oakland”) conference, one of the top security venues, the L3 timing channel can be exploited without unrealistic assumptions on the host installation. Specifically:

  • We demonstrated that we can steal encryption keys from a server running concurrently, in a separate VM, on the processor.
  • Stealing a key can be done in less than a minute in one of our attacks!
  • The attack does not require attacker and victim to share a core; they run on separate cores of the same processor.
  • The attack does not exploit specific hypervisor weaknesses; in fact, we demonstrated it on two very different hypervisors (Xen as well as VMware ESXi).
  • The attack does not require attacker and victim to share memory, even read-only (in contrast to earlier attacks).
  • We successfully attack multiple versions of encryption (which use different algorithms).

What does this mean in practice?

The implications are somewhat scary. The security of everything in clouds depends on encryption. If you steal encryption keys, you can masquerade as a service, you can steal personal or commercial secrets, people’s identity, the lot.

Our attack targeted two specific implementations of a crypto algorithm in the widely used GnuPG crypto suite. It exploits some subtle data-dependencies in those algorithms, which affect cache usage and can thus be analysed by a sophisticated attacker to determine the (secret!) encryption keys. Ethics required that we inform the maintainers, to allow them to fix the problem (we provided workarounds) before it became public. The maintainers updated their code, so an installation running up-to-date software will not be susceptible to our specific attack.
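To illustrate the kind of data dependence involved, here is the textbook square-and-multiply exponentiation (a deliberately simplified sketch, not GnuPG’s actual code): the multiply step executes only for 1-bits of the secret exponent, so the sequence of operations, and with it the cache footprint, depends on the key:

```python
# Simplified square-and-multiply modular exponentiation. The multiply
# runs only for 1-bits of the secret exponent, so an attacker who can
# distinguish squares from multiplies (e.g. via the cache) reads off
# the key. Textbook sketch for illustration, not GnuPG's code.

def modexp_leaky(base, exponent, modulus, trace=None):
    result = 1
    for bit in bin(exponent)[2:]:        # scan exponent MSB to LSB
        result = (result * result) % modulus   # always: square ("S")
        if trace is not None:
            trace.append("S")
        if bit == "1":                         # key-dependent multiply ("M")
            result = (result * base) % modulus
            if trace is not None:
                trace.append("M")
    return result

trace = []
assert modexp_leaky(7, 0b1011, 1000003, trace) == pow(7, 0b1011, 1000003)
# Each "SM" pair is a 1-bit, each lone "S" a 0-bit: the operation
# sequence spells out the secret exponent.
print("".join(trace))   # → SMSSMSM
```

Constant-time implementations avoid exactly this: they perform the same operations, touching the same memory, regardless of the key bits.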

However, the problem goes deeper. The fact that we could successfully attack two different implementations of encryption means that there are likely ways to attack other encryption implementations. The reason is that it is practically very difficult to implement those algorithms without any exploitable timing variations. So, one has to assume that there are other possible attacks out there, and if that’s the case, it is virtually guaranteed that someone will find them. Eventually such a “someone” will be someone nasty, rather than an ethical researcher.

Why do these things happen?

Fundamentally, we are facing a combination of two facts:

  • Modern hardware contains a lot of shared resources (and the trend is increasing).
  • Mainstream operating systems and hypervisors, no matter what they are called, are notoriously poor at managing, and especially isolating, resources.

Note that we did not exploit anything in the hypervisors that most people would consider a bug (although I do!); they are simply bad at isolating mutually-distrusting partitions (as VMs in a cloud are). And this lack of isolation runs very deep, to the degree that it is unlikely to be fixable at realistic cost. In other words, such attacks will continue to happen.

To put it bluntly: From the security point of view, all mainstream IT platforms are broken beyond repair.

What can be done about it?

Defending legacy systems

The obvious way to prevent such an attack is never to share any resources. In the cloud scenario, this would mean not running multiple VMs on the same processor (not even on different cores of that processor).

That’s easier said than done, as it is fundamentally at odds with the whole cloud business model.

Clouds make economic sense because they lead to high resource utilisation: the capital-intensive computing infrastructure is kept busy by serving many different customers concurrently, averaging out the load imposed by individual clients. Without high utilisation, the cloud model is dead.

And that’s the exact reason why cloud providers have to share processors between VMs belonging to different customers. Modern high-end processors, as deployed in clouds, have dozens of cores, and the number is increasing; we’ll see 100-core processors becoming mainstream within a few years. However, most client VMs can only use a small number of cores (frequently only a single one). Hence, if as a cloud provider you don’t share the processor between VMs, you’ll leave most cores idle, your utilisation goes down to a few percent, and you’ll be broke faster than you can say “last-level-cache timing-side-channel attack”.

So, that’s a no-go. How about making the infrastructure safe?

Good luck with that! As I argued above, systems out there are pretty broken in fundamental ways. I wouldn’t hold my breath.

Taking a principled approach to security

The alternative is to build systems for security from the ground up. That’s hard, but not impossible. In fact, our seL4 microkernel is exactly that: a system designed and implemented for security from the ground up – to the point of achieving provable security. seL4 can form the base of secure operating systems, and it can be used as a secure hypervisor. Some investment is needed to make it suitable as a cloud platform (mostly to provide the necessary management infrastructure), but you can’t expect to get everything for free (although seL4 itself is free!).

While seL4 already has a strong isolation story, so far this extends to functional behaviour, not yet to timing. But we’re working on exactly that, and should have a first solution this year.

So, why should you believe me that it is possible to add such features to seL4, but not to mainstream hypervisors? The answer is simple: seL4 is designed from the ground up for high security, and none of the mainstream systems is. And second, seL4 is tiny, about 10,000 lines of code. In contrast, mainstream OSes and hypervisors consist of millions of lines of code. This not only means they are literally full of bugs, thousands and thousands of them, but also that it is practically impossible to change them as fundamentally as would be required to make them secure in any serious sense.

In security, “small is beautiful” is a truism. And another one is that retrofitting security into a system not designed for it from the beginning doesn’t work.

Whither ERA?

The ARC released the composition of the ERA’15 Research Evaluation Committees (RECs) a few days ago. The one relevant to us is the Mathematics, Information and Computing Sciences (MIC) REC. So I was a bit surprised when I looked at it and recognised almost no names.

For those living outside Australia, outside academia, or with their heads firmly burrowed in the sand: ERA is the Excellence in Research for Australia exercise the Australian Research Council (ARC, the Oz equivalent of the NSF) has been running since 2010. It aims to evaluate the quality of research done at Australian universities. I was involved in the previous two rounds, in 2010 as a member of the MIC panel, and in 2012 as a peer reviewer.

The ERA exercise is considered extremely important, universities take it very seriously, and a lot of time and effort goes into it. The outcomes are very closely watched, universities use it to identify their strengths and weaknesses, and everyone expects that government funding for universities will increasingly be tied to ERA rankings.

The panel is really important, as it makes the assessment decisions. Assessment is done for “units of evaluation” – the Cartesian product of universities and 4-digit field-of-research (FOR) codes. The 4-digit FORs relevant to computer science and information systems are the various sub-codes of the 2-digit (high-level) code 08 – Information and Computing Sciences.

For most other science and engineering disciplines, assessment is relatively straightforward: you look at journal citation data, which is a pretty clear indication of research impact, which in turn is a not unreasonable proxy for research quality. In CS, where some 80% of publications are in conferences, this doesn’t work (as I’ve clearly experienced in the ERA’10 round): the official citation providers don’t understand CS, they don’t (or only very randomly) index conferences, they don’t count citations of journal papers by conference papers, and the resulting impact factors are useless. As a result, the ARC moved to peer-review for CS in 2012 (as was used by Pure Maths and a few other disciplines in 2010 already).

Yes, the obvious (to any CS person) answer is to use Google Scholar. But for some reason or other, this doesn’t seem to work for the ARC.

Peer review works by institutions nominating 30% of their publications for peer review (the better ones, of course), and several peer reviewers are each reviewing a subset of those (I think the recommended subset is about 20%). The peer reviewer then writes a report, and the panel uses those to come up with a final assessment. (Panelists typically do a share of peer review themselves.)

Peer review is inevitably much more subjective than looking at impact data. You’d like to think that the people doing this are leaders in the field, able to objectively assess the quality of the work of others. A mediocre researcher is likely to emphasise factors that make themselves look good (although they are, of course, excluded from any discussion of their own university). Basically, I’d trust the judgment of someone with an ordinary research track record much less than that of a star in the field.

So, how does the MIC panel fare? Half of it are mathematicians, and I’m going to ignore those, as I wouldn’t be qualified to say anything about their standing. But for CS folks, citation counts and h-indices as per Google Scholar, in the context of the number of years since their PhD, are a very good indication. So let’s look at the rest of the MIC panellists, i.e. the people from computer science, information systems or IT in general.

Name                    Institution  Years since PhD  Cites          h-index
Leon Sterling (Chair)   Swinburne    ~25              5,800          28
Deborah Bunker          USyd         15?              max cite = 45  n/a
David Green             Monash       ~30              3,400          30
Jane Hunter             UQ           21               3,400          29
Michael Papazoglou      Tilburg      32               13,200         49
Paul Roe                QUT          24               <1,000         17
Markus Stumptner        UniSA        ~17              2,900          28
Yun Yang                Swinburne    ~20              3,800          30
[Note that Prof Bunker has no public Scholar profile, but according to Scholar, her highest-cited paper has 45 citations. Prof Sterling’s public Scholar profile includes as its top-cited publication (3.3k cites) a book written by someone else; subtracting this leads to the 5.8k cites I put in the table. Note also that his most-cited own publication is actually a textbook; if you subtract that as well, the number of cites is 3.2k.]
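For reference, the h-index quoted in the table is the largest h such that the author has at least h papers with at least h citations each; in code:

```python
# h-index: largest h such that at least h papers have >= h citations.

def h_index(citations):
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

# Five papers cited [45, 10, 6, 3, 1] times give an h-index of 3:
# three papers have >= 3 citations, but there are not four with >= 4.
print(h_index([45, 10, 6, 3, 1]))   # → 3
```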

Without looking at the data, one notices that only three of the group are from the research-intensive Group of Eight (Go8) universities, plus one from overseas. That in itself seems a bit surprising.

Looking at the citation data, one person is clearly in the “star” category: the international member Michael Papazoglou. None of the others strike me as overly impressive; an h-index of around 30 is good but not great, and similarly citations around the 3,000 mark. And in two cases I can really only wonder how they could possibly have been selected. Can we really not come up with a more impressive field of Australian CS researchers?

Given the importance of ERA, I’m honestly worried. Those folks have the power to do a lot of damage to Australian CS research, by not properly distinguishing between high- and low-quality research.

But maybe I’m missing something. Let me know if you spot what I’ve missed.

Peer Review: Anonymity should not be at the expense of transparency

We’ve all been through this: You do some work, you think it’s good, you send it to a top conference, and it gets rejected. Happens to the best of us, and, in fact, happens more often than many think. For example, 2013 was a very good year for me publications-wise, with 7 papers in top venues (SOSP, TOCS, EuroSys, SIGMOD, OOPSLA and RTAS). Yet, of my 19 submissions that year, 8 got rejected, so my acceptance rate wasn’t even 60%. And that was a good year. I had much worse.

It has just happened again, this time with EuroSys, by any standard a top conference. (I would say that, I was the PC chair a few years ago. ;-) )

This time I was a bit, say, surprised. The paper had four reviews: one was a “weak reject”, one was a “weak accept”, and the remaining two were clear “accept”, making it, as my colleague and co-author Kev remarked, our highest-rated reject ever.

Now this sort of stuff happens. While one would naively think that with two clear supporters and one weak critic this should get up, the reality is different. Things may come out of the discussion at the PC meeting, the weak critic may become a stronger critic from reading other reviews, etc. I’ve seen a lot of this happen in the many PCs I have served on. And I’m certainly not saying that the PC meeting made a mistake. That isn’t my call; the whole point of peer review is that the judgement is made independently.

However, I do have a significant issue with the way this paper was handled by the PC, and that relates to transparency: The PC gave us, the authors, no halfway convincing justification for their decision. To the contrary, the points raised against the paper by the review could have easily been rebutted. In the interest of transparency I will not just make this claim, but back up my rant by going through the reviewer’s arguments below.

So, basically, the PC made a decision that is not understandable or reproducible from the data at hand. This is an issue in that it creates an impression of arbitrariness that has the potential to damage the reputation of the conference. Hopefully it is a one-off case. However, the conference failed to use two simple mechanisms that are part of what I consider best practice in transparency.

  1. Rebuttals: Many conferences send out reviews to authors and give them a few days to respond to them before the PC meeting. This is a mandatory requirement for all SIGPLAN-sponsored conferences (including ASPLOS, which is also one of “ours”). It has been used by most if not all EuroSys PCs since I introduced it in 2011, and has been used by many other top conferences, including last year’s OSDI. EuroSys’15 didn’t do rebuttals, which, in my opinion, is a transparency regression.
  2. Summary of PC discussion: This has recently been adopted by many top conferences: Papers discussed at the PC (at least those which end up being rejected) are given a brief summary of the PC discussion, stating the reasons for rejection. This has been used by both major systems conferences on whose PCs I served during the last year (OSDI and ASPLOS) and many others. I don’t remember whether previous EuroSys PCs used it, but it’s become a common enough practice to consider it best practice.

By not doing either of these, the EuroSys PC didn’t actually do anything wrong, but clearly missed out an opportunity to be seen to be transparent. I think that’s highly regrettable for a conference I think highly of. In the particular instance of our paper, I would have been far less annoyed about the outcome if I was given a good justification.

So, let’s have a look at the information we did get. Below I reproduce uncensored (other than for the list of typos) Review A, the only negative one. And I’ll show how I can rebut all criticism from the paper (in line with normal rules for rebuttals, which aren’t allowed to introduce new material). For clarity, I’ll be more detailed and verbose than I could in an actual rebuttal, which is generally very space constrained. You can check my arguments against the submitted version of the paper. And, to show that I’m not hiding anything, here are the complete reviews for the paper.

Note that I do not claim that being given the opportunity for rebuttal would have changed the outcome. But it would have given us some assurance that the decision wasn’t made on the basis of misunderstandings (this is the precise reason for rebuttals), and a summary would have indicated to us that our comments were taken into account. Without that we are left wondering.

Ok, here comes the review.

Overall merit: 2. Weak reject
Reviewer expertise: 4. Expert

===== Paper summary =====

The paper explores a trade-off between coarse grained locking, fine grained locking, and transactional memory in the context of a microkernel. It claims that using a single global lock with microkernel on a small embedded system is advantageous in these settings, the main reason being it allows for static verification of the kernel. The differences between used synchronization techniques are evaluated on a quad-core x86 and ARM systems and seL4 microkernel. The evaluation, which uses a server workload, shows no significant performance difference between the techniques on a real-world sized benchmark.

===== Comments for author =====

Experimental results don’t quite back up the claim that a single global lock is better than fine-grained locking. Fine-grained locking performs best on both x86 and ARM with 4 cores on the multi-core IPC micro-benchmark (Figure 7), which is the only benchmark showing any difference between synchronization techniques.

This simply misrepresents the claims of the paper. What we claim (in the abstract) is that “coarse-grained locking [with a microkernel on closely-clustered cores] achieves scalability comparable to fine-grained locking.” Further, at the end of the Introduction, we state that “the choice of concurrency control in the kernel is clearly distinguishable for extreme synthetic benchmarks. For a real I/O (and thus system call intensive) work-load, only minor performance differences are visible, and coarse grain locking is preferred (or the equivalent elided with hardware transactional memory) over extra kernel complexity of fine-grained locking.” A similar statement is in the Discussion section. I claim that the statements in the paper are consistent with the results presented, and contradict the assertions made by the reviewer.

If fine-grained locking performs best, even if only on a micro-benchmark, already with only four cores, why not use it?

Again, we explain this in the paper: it is more complex, and as such very error-prone (as we experienced very painfully when implementing it). That concurrency is hard is well established, and if you can avoid it without degrading performance, then it’s a reasonable thing to do. And, as we also clearly state, we wouldn’t be able to verify such a highly-concurrent kernel. So I really don’t understand why the reviewer simply ignores our arguments.

I think that it is interesting to see that transactional memory outperforms fine-grained locking and is the best option overall, but it is not (yet) widely spread and the paper is making a claim about BKL anyway.

While our microbenchmarks show that transactional memory performs best, this is only visible in extreme microbenchmarks. Like microbenchmarks in general, these are an analysis tool, but don’t tell you anything about real performance, as they are not representative of real workloads. (You don’t do 1000s of back-to-back identical system calls with no actual work in between, except for analysis.) This is why we did macrobenchmarks (Figs 8–11). And they show no discernible performance differences between the configurations. This is the core data that backs our argument that the details of locking don’t matter (for the particular part of the design space we’re looking at), and justifies taking the simplest option.

It is not clear that using four-core systems to evaluate synchronization techniques is acceptable today, as even phones are starting to have eight cores. The findings of the study are not really valid for eight cores, as claimed in the paper, as contention will be much higher with twice as many cores.

Yes, 8-way ARM chips are starting to appear, and our data cannot be used to prove anything beyond 4 cores. However, given that on the macrobenchmark there is no discernible difference for 4 cores, it is a fairly safe assumption that if there is a difference at 8 cores, it isn’t going to be dramatic.

This claim is backed by Fig 2(b). It represents an extreme case of hammering the kernel, with “think time” (i.e. out-of-kernel time) being equal to in-kernel time. As shown in Fig 1, only about 20% of pause times in our macrobenchmark are so short; the median is about four times as big. Hence, Fig 2(b) represents an unreasonably high lock contention. Yet, the benchmark still shows very little difference between lock implementations. If we double the number of cores, contention would double (as the reviewer states, it could actually be a bit worse than double), but would still not even be half of that of Fig 2(b), which shows essentially no difference. That’s why I can claim with reasonable confidence that the 8-core case would still work.

Performance of locking primitives and IPC is well evaluated and results are very clearly presented.

Evaluation is limited as it uses a single benchmark. It is hard to draw solid conclusions based on one workload.

We cannot dispute the fact that we only have one macro-benchmark. However, I can argue that no more could be learned from having more than one (see below).

Moreover, the benchmark is a server workload (YCSB on Redis) and the paper is supposed to focus on lower-end systems.

For the point of the exercise, it is totally irrelevant what kind of workload it is. All the actual work is at user level; the kernel just acts as a context-switching engine. The only thing that matters is the distribution of in-kernel and think times, as this is what determines lock contention.

We took this particular benchmark as, of all the realistic workloads we could think of, it hammers the kernel most savagely, by producing a very high system-call load. In other words, it was the most pessimistic yet halfway-realistic workload we could come up with.

Server systems will have many more cores (typically 16 today) and using fine-grained synchronization is likely to be even more important for them.

With respect, that’s beside the point. Server platform or not, we specifically state that we’re aiming at cores that share a cache and have low inter-core communication latencies (see Introduction). Server platforms with high core counts tend not to match this requirement, and we argue that they would be best served with a “clustered multikernel” approach, where a BKL is used across a closely-coupled cluster, and a multikernel approach without shared state across clusters. All that is stated in the Introduction of the paper.

It is not clear why more than 32000 records would exceed memory limitations of the systems. If benchmark uses 1kB key-value pairs, this should only be 32MB total and ARM system has 1GB and x86 has 16GB of memory.

OK, we could have been clearer about those implementation limitations of our setup. However, if you think about it, you’ll realise it doesn’t matter for the results. Using more records, and thus increasing the working set of the user-level code, would have resulted in an increase of cache misses, and thus overall reduced userland performance. The kernel, with its tiny working set, would be less affected. The net effect is that think times go up, and thus lock contention goes down, making the benchmark more BKL-friendly.
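The contention argument above can be put in back-of-envelope form. The model and the numbers below are illustrative assumptions of mine, not the paper’s measured data:

```python
# Crude model of big-lock contention: if each of n cores spends
# t_kernel time units holding the lock per t_kernel + t_think units of
# work, demand on the lock is roughly proportional to
# n * t_kernel / (t_kernel + t_think). Numbers are illustrative only.

def lock_demand(n_cores, t_kernel, t_think):
    return n_cores * t_kernel / (t_kernel + t_think)

# Fig 2(b)-style extreme: think time equal to in-kernel time, 4 cores.
extreme = lock_demand(4, 1.0, 1.0)

# Median macrobenchmark behaviour: think time about 4x in-kernel time.
# Even doubling the core count to 8 keeps lock demand below the
# extreme case, which already showed little difference between locks.
doubled = lock_demand(8, 1.0, 4.0)

print(extreme, doubled)   # → 2.0 1.6
```

A real analysis would need queueing effects and the full think-time distribution, but the direction of the argument is the same: longer think times take pressure off the lock.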

End of review.

In summary, I think I can comprehensively rebut all of the arguments the reviewer makes against the paper. Where does that leave us?

Well, if that was all there was against the paper, then it is really hard to understand why it was rejected. There may have been other arguments, of course. But then the basic principle of transparency would have obliged the reviewers to tell us. After all, that’s the point of having written reviews that get released to authors: to avoid the impression that random or biased decisions get made behind closed doors.

So, in my opinion, the PC failed the transparency test. I can only hope that we were the only such case.

Gernot

PS: There was also a shadow PC, consisting of students and junior researchers. It operated under the same conditions as the real PC, but had no influence on the decisions of the real PC. Interestingly, I found the reviews of the shadow PC of much better quality than the “real” ones, more extensive and better argued.

Security Fiasco: Why small is necessary but not sufficient for security

“Small is beautiful” is nowhere more true than it is in security. The smaller a system’s trusted computing base (TCB), the more feasible it is to achieve a reasonable degree of security. The TCB is the part of the system that must be trusted to perform correctly in order to have any security in the system. It includes at least parts of the hardware (processor, memory, some devices) and the most privileged part of the software, i.e. the operating system (OS) kernel, and, for virtualised systems, the hypervisor.

This is one of the strongest justifications for microkernels: you want to absolutely minimise the OS part of the TCB, by using a kernel (the part of the OS running in privileged mode) that is kept as small as possible. Kernel size is minimised by only including the minimum functionality, i.e. fundamental mechanisms, and leaving implementation of other functionality, including all policy, to (unprivileged) user-mode software.

We have built the seL4 microkernel on this insight, and taken it to its logical conclusion, by mathematically proving the correctness (“bug freedom”) of the implementation, as well as general security properties. These proofs are critically enabled by seL4’s small size (about 10k SLOC).

But while (barring dramatic technological advances which are many years away) security requires small size, this doesn’t mean that a kernel is secure just because it’s a microkernel!

A case in point here is the Fiasco.OC microkernel (also called the L4Re Microkernel – Fiasco.OC is the kernel and L4Re is the userland programming environment built on top). Some promote Fiasco.OC as the core of a national security competency and advocate it as the base for the development of a national security infrastructure. Sounds like a good idea; after all, it’s a microkernel, right?

Turns out, it’s not. Turns out, Fiasco.OC (besides being not all that “micro”, weighing in at about 20–35k SLOC depending on configuration) isn’t secure after all, and most likely can’t be made secure without a comprehensive re-design.

Details are provided in a paper by researchers from the TU Berlin (and interestingly the lead author is a former member of the Fiasco.OC team). It shows that Fiasco.OC’s memory management functionality provides covert storage channels of significant bandwidth.

The mechanisms underlying those channels are the result of design flaws of Fiasco.OC’s memory management, which, on the surface, is driven by security needs. Specifically, Fiasco.OC provides an abstraction of per-partition memory pools. However, these are part of a global management structure that breaks isolation. In short: Fiasco.OC’s “partitioned” memory pools ain’t.
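The general pattern of such a channel can be sketched in a few lines. The following is a toy simulation of the *class* of attack (all names and the pool model are hypothetical, not Fiasco.OC’s actual data structures): a sender and a receiver sit in nominally isolated partitions, but both allocate from one global pool. The sender modulates how much of the pool it consumes; the receiver decodes bits by observing whether its own allocations succeed.

```python
# Toy simulation of a covert storage channel through a shared memory pool.
# Hypothetical model for illustration only -- not Fiasco.OC's actual code.

class SharedPool:
    """A global allocator shared by all 'partitions' -- the broken isolation."""
    def __init__(self, total_blocks):
        self.total = total_blocks
        self.used = 0

    def alloc(self, n):
        if self.used + n > self.total:
            return False          # allocation fails: pool exhausted
        self.used += n
        return True

    def free(self, n):
        self.used -= n

def send_bit(pool, bit, hog=8):
    """Sender encodes a 1 by hogging blocks, a 0 by allocating nothing."""
    if bit:
        pool.alloc(hog)

def recv_bit(pool, probe=8):
    """Receiver decodes the bit by probing whether its allocation succeeds."""
    ok = pool.alloc(probe)
    if ok:
        pool.free(probe)
    return 0 if ok else 1

pool = SharedPool(total_blocks=10)
message = [1, 0, 1, 1, 0]
received = []
for bit in message:
    send_bit(pool, bit)
    received.append(recv_bit(pool))
    if bit:
        pool.free(8)              # sender releases its blocks for the next round

print(received)                   # bits leak across the "partition" boundary
```

No data is ever copied between the partitions; the information flows purely through the shared allocator’s state, which is exactly what makes such channels invisible to access-control mechanisms.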

The deeper cause for this breakdown of security is that Fiasco.OC violates a core microkernel principle: that of policy-mechanism separation. The microkernel is supposed to provide policy-free mechanisms, with which policies are implemented at user level. In contrast, Fiasco.OC’s memory management encapsulates a fair amount of policy, and it is exactly those policies which the TUB researchers exploited.

This is a nice example that demonstrates that you’ll pay eventually when deviating from the core principle. Usually the cost comes in the form of restricting the generality of the system. Here it is in the form of a security breakdown. (Ok, for purists, that’s also a form of restricting generality: Fiasco.OC is restricted to use in cases where security doesn’t matter much. Not very assuring for those propagating it as a national security kernel!)

Given the size of Fiasco.OC (2–3 times that of seL4) and the lack of understanding of security issues that seems to have affected the design of its memory management, one must suspect that there are more skeletons hiding in the security closet.

This all is in stark contrast to seL4, which has been designed for isolation from the ground up, including radically eliminating all policy from memory management. This approach has enabled a mathematical proof of isolation in seL4: In a so-called non-interference proof that applies to the executable binary (not just a model of the kernel), folks in my group have shown that a properly-configured seL4 system enforces isolation. The proof completely rules out attacks of the sort the TUB researchers have levelled against Fiasco.OC!

To be fair, seL4’s security proofs only rule out storage channels, while leaving the potential for timing channels. This is, to a degree, unavoidable: while it is, in principle, possible to prevent storage channels completely (as has been done with seL4), it is considered impossible to completely prevent timing channels. At least seL4 has a (still incomplete) story there (see our recent paper on assessing and mitigating timing channels), with more work in progress, while Fiasco.OC has known unmitigated timing channels (in addition to the storage channels discussed above).
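The distinction matters: storage channels leak through persistent state (as in the pool example above), while timing channels leak through execution time alone, which no functional proof can see. A toy illustration of the timing principle (a deliberately simplified early-exit comparison, nothing to do with either kernel’s code or with cache channels specifically):

```python
# Toy timing channel: an early-exit comparison whose running time depends
# on secret data. We count loop iterations as a stand-in for wall-clock time.

def insecure_compare(secret, guess):
    """Compare strings, bailing out at the first mismatch.
    Returns (equal?, iterations) -- iterations model observable time."""
    steps = 0
    for s, g in zip(secret, guess):
        steps += 1
        if s != g:
            break
    return secret == guess, steps

secret = "k3y"
_, t1 = insecure_compare(secret, "a00")   # wrong first char: exits after 1 step
_, t2 = insecure_compare(secret, "k00")   # first char right: exits after 2 steps
print(t1, t2)
```

An attacker who can measure the time (here, `steps`) learns how many leading characters of a guess are correct and can recover the secret one character at a time; cache-based timing channels exploit the same principle, with cache hit/miss latency as the clock.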

Building security solutions on secure microkernel technology is good, and I am advocating it constantly. But it will only lead to a false sense of security if the underlying microkernel is inherently insecure. seL4, in contrast, is the only OS kernel with provable security. Use it as the base and you have at least a chance to succeed!

Cyber-security: We must and can do better!

Security, especially of embedded/cyber-physical systems, including cars, aeroplanes, communication devices, and industrial control, has become a hot topic this year. For example, a report on the state of IT security recently published by the German BSI (Federal Office for IT Security) lists, among others, a targeted attack on a German steel mill that led to massive damage of the facility (p 31 of the report).

Such attacks are only going to become more frequent, and people are looking for solutions. Increasingly, people are starting to realise that the existing cyber-security “solutions”, such as software patches and malware scanners, are just treating symptoms. (Very lucrative business for the “solution” providers: they know that there are lots of bad guys out there who find new compromises all the time, forcing their customers to keep buying the latest security “solution”.) The “catch, patch, match” approach advertised by the Australian Signals Directorate (ASD) is in the same vein, and scary in its naïveté! (The description of threats is reasonable, but the proposed “solution” is amazing. And they mean it: I attended a talk given by an ASD general at a cyber-security conference, and the message was essentially “catch, patch, match and you’ll be fine”!)

In contrast, ASD’s colleagues at the BSI take a more active role, including working with industry to provide secure core technologies (Sect 4 of the above report). They note that there is a fair bit of indigenous cyber-security capability, which needs to be coordinated and supported to provide more comprehensive security solutions for German industry.

It would be nice if there was a similar realisation in the Australian government. With the seL4 microkernel we have developed at NICTA, we have unique expertise in the world’s most secure operating-system kernel and the associated verification technology. There is a significant number of NICTA alumni out there who are familiar with the technology, including some who are running their own businesses (e.g. Cog Systems). Together with them we’d be in the perfect position to develop really strong cyber-security solutions, and build a local, export-focussed cyber-security ecosystem.

But it seems others may beat us to it. For example, DARPA has just issued an SBIR call (SBIRs are research grants for small business) aiming at developing a security ecosystem around seL4. Other governments may follow (e.g. Germany is a prime candidate). Much of this development will be open-source, and thus re-usable locally. But without government support for the local ecosystem, we’ll lose the massive head start we’re enjoying at the moment.

OK Labs Story Epilogue

What did I learn from this all?

Clearly, OK Labs was, first off, a huge missed opportunity. Missed by a combination of greed, arrogance and stupidity, aided by my initial naiveté and general lack of ability to effectively counter-steer.

That arrogance and greed are a bad way of dealing with customers and partners was always clear to me, although a few initial successes made this seem less obvious for a while. In the end it came back to haunt us.

Strategically it was a mistake to look for investment without a clear idea of what market to address and what product to build. We didn’t need the investment to survive; we had a steady stream of revenue that would have allowed us to grow organically for a while, until we had a clearer idea of what to grow into. Instead we went looking for investment with barely half-baked ideas, a more than fuzzy value proposition and a business plan that was a joke. As a result, we ended up with 2nd- or 3rd-tier investors and a board that was more of a problem than a solution. But, of course, Steve, being the get-rich-fast type, wanted a quick exit.

A fast and big success isn’t generally going to happen with the type of technology we had (a very few extreme counter-examples notwithstanding). Our kind of technology will eventually take over, but it will take a while. It isn’t like selling something that helps you improve your business, or provides extra functionality for your existing product – something that can be trialled with low risk.

Our low-level technology requires the adopter to completely re-design the critical guts of their own products. This not only takes time, it also carries a high risk. Customers embarking on that route cannot risk failure. Accordingly, they will take a lot of time to evaluate the technology and the risks and upsides associated with it, trial it on a few less critical products, etc. All this takes time, and it is unrealistic to expect significant penetration of even a single customer in less than about five years.

So, it’s the wrong place for a get-rich-fast play. But understanding that takes more technical insight than Steve will ever have, and executing it requires more patience than he could possibly muster. And, at the time, I didn’t have enough understanding of business realities to see this.

What this also implies is that for a company such as OK Labs, it is essential that the person in control has a good understanding of technology (ours as well as the customers’). This is not the place for an MBA CEO (and Steve is a far cry from an MBA anyway). Business sense is important (and I don’t claim I’ve got enough of it), but in a high-tech company, the technologist must be in control. This is probably the most important general lesson.

Personally I learnt that I should have trusted my instincts more. While not trained for the business environment, the sceptical-by-default approach that is at the core of success in research (and which I actively instil in my students) serves one well beyond direct technical issues. The challenge is, of course, that people not used to it can easily mistake it for negativity – particularly people who are not trained for, or incapable of, analysing a situation logically.

I’ve learnt that when everything a person says about stuff you understand well is clearly bullshit, there’s a high probability that everything else they say is bullshit too. I should have assumed that, but I gave the benefit of the doubt for far too long.

As mentioned earlier, I also learnt that it was a big mistake to be a part-timer at the company. This could have worked (and could potentially be quite powerful) if the CEO was highly capable, and there was a high degree of mutual trust and respect and we were aligned on the goals. But I wasn’t in that situation, unfortunately, and my part-time status allowed me to be sidelined.

Finally, I always believed that ethics are important, in research as well as in business. The OK experience confirms that to me. Screwing staff, partners and customers may produce some short-term success, but will bite you in the end.


© 2014 by Gernot Heiser. All rights reserved. Permission granted for verbatim reproduction, provided the reproduction is of the complete, unmodified text, is not made for commercial gain, and this copyright note is included in full. Fair-use abstracting permitted.


OK Labs Story (9): The End Game

My October ’09 report to the Board was basically ignored. The more critical one of the investors was initially interested in following up, but in the end nothing happened. Steve obviously worked on him, but we also got, at a perfectly inopportune time, two deals which seemed to prove me wrong.

The AMSS disaster

One had started earlier, an Asian phone manufacturer wanting to build a “mass-market smartphone”, i.e. Android and modem stack on the same low-end (ARM9) processor core. I never understood why, and I was kept away from the customer or any real information from them, so I can only suspect that they were after an iPhone lookalike on the cheap (where looks would be more important than functionality).

So, we actually built a new product specifically for outdated hardware (the successor core, the ARM11, an ARMv6 implementation, was already in widespread use then). Technically it was a fun challenge. The project, code-named Marvin (the paranoid Android – paranoid because he’d never run on an ARM9 before), back-ported Android to ARMv5. The interesting bit was getting performance out of the processor, which features a virtually-indexed, virtually-tagged cache. Windows as well as Linux deal with this cache by flushing it on every context switch, a performance killer. However, we had many years earlier invented an approach to fast context switching on ARMv4/5, which actually made Linux run faster when virtualised; we used the same approach on the original virtualised phone, the Motorola Evoke. So we could run Android on that processor pretty much as fast as it could possibly be, and our engineering team made it work, of course.

The catch was that the customer was using Qualcomm chips, and was therefore running Qualcomm’s AMSS modem stack. So we had to port AMSS to the Microvisor. Technically this was not a big problem for our engineering team (and they did it just fine), but there was a legal issue: we did not have a source license to AMSS. The customer claimed that they had the right to sub-license for the purpose. I voiced concerns at the board, resulting in a supposedly thorough analysis of the legal situation, which supposedly showed it was all right.

Obviously, the clean thing would have been to simply talk to Qualcomm, as they were bound to find out eventually. But for some reason Steve thought he could get away without telling them. (This is consistent with Steve’s 90-day horizon, he routinely goes for short-term gains even if this seriously damages long-term prospects, while my attitude is exactly the opposite.) It beats me why the rest of the Board followed.

Of course, Qualcomm eventually found out, and sent us a cease-and-desist letter, with which we complied. That (which happened after the presentation of my market analysis) was the end of this deal, but also of our once-great relationship with Qualcomm.

The poisonous Dash deal

There was another project which seemed to undermine my analysis, the only one that ever came through one of our investors. One of the directors had long-running high-level links with a major US network operator, let’s call them “Dash Mobile”. He got them to contract us for a prototype “mass-market smartphone”, running Android on an ARM9-based, single-core phone provided by another Asian manufacturer (not involving Qualcomm IP). Again, I never got enough information to understand why Dash was interested in this and whether their reasons added up. But, given the small size of the contract, it might be them just using some spare cash to investigate something which could be seen as having the potential for some market disruption, which Dash urgently needed (they weren’t going so great at the time).

Needless to say, the project went nowhere (although that didn’t become obvious until after my departure). OK engineering delivered, as usual, and our overheads were low enough to be in the noise margin. But running a smartphone OS on the low-end processor simply did not result in a satisfying user experience (even with zero virtualisation overhead). Yet for a long time it seemed like a great project totally in line with Steve’s PR, contrary to what I was saying, and thus undermined my credibility. By the time I was proven right, it was too late. The timing of the Dash deal was disastrous.

In the end, the Marvin project was damaging in several ways. On the one hand we wasted a year or so, and millions of dollars, on developing a product for which there was no market; it was a complete waste. On the other hand it was a serious hit to morale in the engineering team: people had worked their asses off for many months, including regular work on weekends, to meet the schedule. And in the end they saw that, while the work was technically good, the resulting product was useless. Several of the top people left after that.

Blood in the Board Room

Eventually I realised that I was powerless to do anything constructive in the company. Furthermore, 90% of the frustration in my life originated from OK – in the mornings I dreaded looking at my email, which was full of shit that had come in from Chicago overnight. At the same time, most of my fun came from research and working with bright students, and I was doing less and less of that. I decided I was too old to play such a masochistic game and had to get out.

However, I thought I owed it to the many great OK engineers to at least make one more attempt to change things, unlikely as it was to succeed. So, in May 2010 I forced a bloodbath in the boardroom, and, predictably, the blood was mine, as the investors backed the CEO. Note that up to that date we had had a single quarter where company revenue and booking performance had matched the goals, and that was only because a large Motorola deal had come in late and spilled into a quarter that had low expectations; the sum of the two quarters was still well below expectations. But, of course, salvation was just around the corner, and we would soon be profitable. I have no idea why the VCs kept buying this…

The exit

I was given a face-saving 12-month consulting deal for leaving at the end of June ’10, and remained on the Board.

Incidentally, Benno left at the same time, voluntarily. The fundamental reason was the same: he couldn’t stand the bullshit and daily frustration any longer. (Benno and I are great friends again, I was invited to his wedding the following year, and his new startup Breakaway Consulting is a close partner of NICTA. We’re jointly developing eChronos, a verified unprotected RTOS which is already being licensed for medical implants. And he founded Cog Systems, the moral successor of OK Labs.)

The internal announcement of the departures was telling. I was asked to draft my own “to ensure it’s correct”. I kept it brief and focussed on simple, public facts, and, of course, didn’t praise myself, that would be the job of others. Steve mailed it out as is, with no attempt to mention my contributions to the company. In contrast, what Steve wrote about Benno was full of praise.

Nevertheless, we were also both asked to pack up our desks after hours to avoid “unsettling” people. Steve has always under-estimated the intelligence of our engineers, and generally has no clue how they tick. People were already fairly unsettled, and the way I was farewelled clearly looked like a sacking to everyone. Given that the majority were there because of me, this obviously didn’t help morale at all.

While following Steve’s instructions to the letter, I did my best to counteract the effects on morale in engineering. I met up with them for lunch, putting on a cheerful face and talking positively about the whole development. Not sure how much this helped in the end, but it would certainly have been better than just vanishing the way Steve wanted me to.

The sale

With Benno and myself gone, the company had no-one left with a technical vision. All they could do from then on was implement the designs Benno had made before leaving, and do whatever Chicago dreamt up. Needless to say, the company’s performance didn’t improve. The investors lost patience and wanted an exit: OK Labs was put up for sale.

The search for a buyer dragged on and on. Hardly surprising, of course, given that the “business plan” was a joke that didn’t hold up to scrutiny by anyone with technical insight. It was built around our “enterprise strategy”, which basically meant “we’ll sell to enterprises”, without any clue as to why they would buy from us; there was no value-add anyone could formulate in a way I could understand. Any real opportunities were outside that “strategy”.

Ironically, although maybe not surprisingly, by the time the company was finally sold, the only royalty deals we ever made were exactly in the two areas I had championed: automotive (where we may be in cars by now) and security (where there is at least one product in use by the military of a NATO country, probably more). But never a cent in royalties in consumer/enterprise mobile!

The strategy for the sale could only be to find someone interested in buying the engineering team (no longer the A team, a few individuals notwithstanding, but still very good) or the existing customer relationships.

What amazed me (despite all I had seen in the past) was the slide deck which was used to shop the company around. As part of the second investment round, OK Labs had acquired all rights to seL4 and its verification, and NICTA had delivered it sometime in 2010. Yet the sales deck didn’t mention seL4 a single time! The only obscure reference was the highly misleading bullet point “World’s only fully verified Microvisor” (it was seL4 that was verified, not the OKL4 “Microvisor”).

This is truly stunning! In 2011, MIT Technology Review listed seL4 in its “TR10”, its list of what it judges the ten most important breakthrough technologies. So OK sat on this truly unique technology, and the CEO had no clue what to do with it! Steve simply never understood seL4 and its potential. Which must make OK Labs the only company ever that owned a TR10 technology but didn’t know what to do with it!

The way he came to his assessment that seL4 wasn’t of any use was a “market validation” exercise we had done earlier. It used the process Rob had used in the past at Wind River: you’ve got some ideas on how to evolve your product, so you go to all your customers and ask them what they think about it. Which is a perfectly fine approach for a company with an established product it wants to improve, and an established, large customer base. In our case (where we had no relevant customers) we went to potential customers, all established players in their field, and effectively asked them “hey, we’ve got this great technology that will completely disrupt your business, are you interested?” Can anyone guess the answer? But this is how strategy was decided at OK Labs.

The spoils

In the end, the company was sold to General Dynamics in August 2012, for a price that gave our investors more or less their money back. None of the common shareholders saw a cent. In the dying days of the company, the investors had doubled as loan sharks, bridging the company at incredible interest rates, making doubly sure that there was nothing left for anyone else. (Actually, I was offered $1000 for signing away some irrelevant rights, which I declined as being below my dignity.)

The only person who made a profit was the one who comprehensively fucked it up. Steve got about $900k in a combination of completion and retention bonuses (and on top of that may have retained the full-year separation pay he had written into his own employment contract). Isn’t the world of business wonderful?



