
What does seL4’s license imply?

This question comes up again and again, and from talking to people, I realise that there are a lot of misconceptions about seL4 licensing and what it implies, in particular in terms of the infectiveness of the GPL. I tried to address this at my talk at the recent seL4 Summit, and will repeat and expand on what I said there.

Spoiler: The GPL license of seL4 has even less impact on you than the license of Linux would. It does not force any licensing conditions on anything that is of value to you, and most likely not on anything you should be touching at all.

Licenses

The seL4 kernel is licensed under GPLv2, version 2 of the GNU General Public License – exactly the same license that is used for the Linux kernel. The core implication is the same as for Linux: work derived from the kernel (be it additions, modifications or other enhancements) automatically becomes GPLed, but anything running on top of the kernel (i.e. in user mode) is unaffected and can be under any license.

We specifically license any usermode code we developed from scratch (libraries and components such as drivers and file systems) under the permissive 2-clause BSD license, to allow people to build arbitrary systems on top of seL4 and license them as they see fit. We do license some generally useful tools under GPL to maximise the benefit to the community, but no-one who uses seL4 is forced to use those tools.

A similar story holds for proofs: Our proofs about the kernel itself are under GPLv2, while proofs about user-level code are a mixture of GPLv2 and BSD licenses.

Implications

So, while in an abstract sense things are exactly as in Linux, in practice there is a huge difference: GPL infectiveness in seL4 is limited to code you most likely should not touch at all. This is a result of seL4 being a microkernel: the kernel only contains code that must run in the privileged mode of the hardware, everything else is left to user mode. This means that everything that does anything “useful”, such as file systems, device drivers, the network protocol stack, resource management, definition of security policies, etc., is left in user mode, and can therefore be under any license. The kernel (after booting) is just a fast context-switching engine that happens to enforce isolation with the strength of mathematical proof.

This is in stark contrast to Linux, where all of the things listed above are inside the kernel (and thus subject to the GPL). As the figure shows, with seL4 all the “interesting” code, in particular anything that could be valuable IP of the adopter of seL4, is at user level, and can be under any license. This specifically includes device drivers, which might reveal valuable hardware IP – with seL4 these can be kept closed, while with Linux you would be forced to open-source them. Similarly, if you add a new file system or improve an existing one, provide improved network protocols, etc., you can do this without open-sourcing with seL4, but not with Linux.

In short, with seL4, any IP of real value can be licensed under conditions chosen by the owner, while in Linux, much IP would have to be open-sourced. This includes everything in the “your system services” box in the diagram, as well as the driver part of the platform-port box below. Compared to the size of the seL4 kernel (order of 10kLOC), this is huge (typically 100s of kLOC), much bigger than it might appear from the simple diagram.

In fact, you shouldn’t ever modify any of the GPLed kernel code, unless you’re doing a new platform port. The reason is simple: if you modify seL4, you invalidate verification. And the verification is the real value of seL4 (besides minor details such as seL4 being the fastest of its kind, and also the most advanced 😉) so why would you want to throw it away? With broken verification, it’s no longer seL4!

What actually must be open-sourced with seL4 is some minimal platform support. But this is simple and pretty boring boilerplate stuff:

  • You’ll need a simple timer driver (so the kernel can perform time-slice preemption). A typical example consists of less than 20 LOC (split between header and implementation body). You only need this if your platform’s timer device differs from those of all already-supported platforms, and there is a good chance a suitable driver exists already. Typically you’d lift that code out of Linux anyway.
  • You’ll need a serial port driver (used for kernel debugging, but not in the production kernel). A typical example is less than 10 LOC! Again, it may already exist, or you can borrow it from Linux. No valuable IP anywhere near.
  • Then there’s an interrupt controller. That’s another 500 LOC or so spread over multiple files. Much of it is addresses, more addresses and register layouts, and some generic code that probably exists already.
  • You may also need some code for handling the System MMU (SMMU) and cache management functions, but those are becoming increasingly standardised.

As I said, none of this is IP of any value, and chances are that most of it exists already and you won’t have to do much more than define the memory layout of your platform. A concrete example is the patch needed to support the i.MX8: about 20 lines of definitions in total. Scared of “giving it away”? Why would you be?
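To give a flavour of how little is involved, below is a minimal sketch of such a platform timer driver. It is purely illustrative: the MMIO base address, register offsets, bit definitions and tick rate are hypothetical placeholders, and the two functions merely mirror the kind of init/reset hooks the kernel expects rather than its exact interface.

```c
/* Illustrative sketch only: roughly what a kernel timer driver for a new
 * platform amounts to.  TIMER_BASE, the register offsets and the bit
 * definitions are hypothetical; a real port takes them from the datasheet. */
#define TIMER_BASE          0xF1000000UL   /* hypothetical MMIO base        */
#define TIMER_LOAD          (*(volatile unsigned int *)(TIMER_BASE + 0x00))
#define TIMER_CTRL          (*(volatile unsigned int *)(TIMER_BASE + 0x08))
#define TIMER_INTCLR        (*(volatile unsigned int *)(TIMER_BASE + 0x0C))

#define TIMER_CTRL_EN       (1u << 7)      /* enable the counter            */
#define TIMER_CTRL_PERIODIC (1u << 6)      /* auto-reload on expiry         */
#define TIMER_CTRL_IE       (1u << 5)      /* raise an interrupt on expiry  */
#define TICKS_PER_MS        24000u         /* depends on the input clock    */

/* Called once at boot: program a periodic tick for time-slice preemption. */
void initTimer(void)
{
    TIMER_LOAD = 2 * TICKS_PER_MS;         /* e.g. a 2 ms tick              */
    TIMER_CTRL = TIMER_CTRL_EN | TIMER_CTRL_PERIODIC | TIMER_CTRL_IE;
}

/* Called on every timer interrupt: acknowledge it so the next one fires. */
void resetTimer(void)
{
    TIMER_INTCLR = 1;
}
```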

[Thanks Kent McLeod for digging out the code examples.]

Why are we doing it this way?

People sometimes suggest we should just change the license to BSD. This would be a really bad idea! To demonstrate what I mean here I asked the Summit participants who had done a project involving Linux before. Unsurprisingly, almost everyone put their hand up. I then asked who had done a project with BSD Unix. Two academics raised their hand.

This says a lot about the power of licensing. Experts generally agree that BSD Unix was technically superior to Linux. Yet Linux dominates the world, while BSD Unix survives in academia and some niche applications; the only significant usage is its descendant Darwin, which forms the core of macOS and iOS (though not many know this).

Why did this happen? The reason is that, lacking strong leadership and using a permissive license, BSD forked into oblivion.

We will not allow the same to happen to seL4. We want seL4 to grow and become the one and only choice of OS for security- and safety-critical systems. The GPL is part of that strategy (and the forthcoming seL4 Foundation is the other).

Can seL4 reduce the cost of satellites?

Space satellites are expensive. Part of that is the launch cost, but that cost is dropping dramatically to tens of k$/kg. This, combined with the growing demand for micro-/nano-satellites, is increasing the sensitivity to other cost factors.

One of the big cost factors is the computer system that’s controlling the satellite. This may seem surprising, given that modern processor chips cost a few dollars. However, space is a very hostile environment, not only for people, but also for electronics. High-energy ionising radiation hitting a CMOS chip creates clouds of charge that can flip bits in memory, caches or CPU registers, so-called single-event upsets (SEUs). Satellites therefore typically use radiation-hardened processors that are not susceptible to SEUs. The drawback of these processors is that they are very expensive, typically 100s of k$ for a single one! And they run at low clock rates (a few hundred MHz) and are based on old, power-hungry silicon technology.

What does this have to do with seL4?

What can an operating system (OS) do to address hardware cost? More than one might think at first. The core of the problem is that software relies on hardware operating correctly, i.e. according to a specification, namely the instruction-set architecture (ISA). SEUs result in hardware behaviour deviating from spec (unpredictable bit flips).

Traditional fault-tolerant systems


Traditional triple-modular redundancy (TMR) fault-tolerance.

The traditional way of protecting against such random faults is redundancy: the system is replicated, and the replicas check each other to detect (and ideally correct) faults. Such redundant hardware configurations have been used for decades in critical systems (e.g. airplanes or nuclear power plants). They typically use 2- or 3-way redundant hardware, called dual- or triple-modular redundancy (DMR or TMR, respectively). DMR tends to be cheaper but can only detect divergence (and then pull the plug), while TMR can use a majority vote to detect the faulty replica and correct by either turning the faulty one off (i.e. downgrading to DMR) or restarting it.

They are expensive not only because of the replicated hardware, but also because they require special voting hardware that checks for divergence among the replicas. Also, replicas typically have to operate in lock-step, which introduces high overheads.

Enter cheap multicores

The abundance of cheap multicore processors, driven by the mobile-phone market, is a game-changer here. We now have redundant (at least in terms of processor cores) hardware, that is cheap, high-performance and very energy-efficient. Can these be leveraged for making systems fault tolerant?


Multicore-based TMR configuration with non-replicated OS.

A number of recent proposals have explored this, typically along the lines of the diagram on the left: A shared operating system (or hypervisor) presents the multicore system as a logical single core that transparently replicates application software, and does the voting. This approach can effectively protect against faults in the application software. The problem is that the non-replicated OS is still vulnerable to SEUs – any bit-flip in OS data or instruction memory can cause the system to fail. Rather than real protection, this reduces the vulnerability to a relatively small memory region (assuming a slim OS kernel or hypervisor).

The advantage is that this is much cheaper (and more energy-efficient) than either traditional hardware replication or using rad-hardened processors.

Redundant co-execution with seL4


RCoE even replicates the OS kernel, and thus all software (except the minimal shim to non-replicated devices).

We have recently pioneered a new approach that combines the benefits of the above, which we call redundant co-execution (RCoE), and have implemented it in seL4. The idea is shown in the diagram (for simplicity as a DMR configuration, but we can do an arbitrary number of replicas, including TMR): we also replicate the OS kernel, which is now aware of running in a replicated configuration. It compares (votes on) the inputs and outputs of all kernel replicas whenever the kernel is entered or exited, as well as at configurable points within the kernel. This extends the sphere of replication to practically the whole system, minimising vulnerabilities.

Applications get transparently replicated with RCoE (just as with the simplified model above). As seL4 is an extreme case of a minimal microkernel, where all traditional OS services, including device drivers, run as usermode programs, these get automatically replicated.

The one exception is peripherals that are (inherently) non-replicated, such as network interfaces. Such a peripheral has a single set of hardware device registers which the device driver uses for controlling I/O through the device. This represents a residual vulnerability if bit-flips happen while accessing those registers. We keep exposure to the absolute minimum: when reading from a device register, a low-level shim immediately copies the data to replicated buffers and then hands control to the (replicated) drivers, which each operate on their own buffer; any divergence can be detected at that point. Similarly, when writing to a device register, the shim compares the replicated output buffers and then immediately copies the agreed value to the device. The few instructions required for this are the only place where a bit-flip could go undetected.
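As a concrete illustration (a minimal sketch, not the actual RCoE implementation), the shim’s read and write paths could look like the following; NUM_REPLICAS, the buffer layout and the error signalling are placeholder assumptions.

```c
/* Minimal sketch of the non-replicated device-register shim described above.
 * All names are illustrative; the real mechanism lives inside the modified
 * seL4 kernel and driver framework. */
#include <stdint.h>

#define NUM_REPLICAS 3

/* Read path: copy the single hardware value into per-replica buffers
 * immediately, so each (replicated) driver then works on its own copy. */
void shim_read(volatile uint32_t *dev_reg, uint32_t buf[NUM_REPLICAS])
{
    uint32_t v = *dev_reg;              /* the only unreplicated load   */
    for (int i = 0; i < NUM_REPLICAS; i++)
        buf[i] = v;
}

/* Write path: vote over the per-replica output buffers and only touch the
 * device register if all replicas agree; otherwise report divergence. */
int shim_write(volatile uint32_t *dev_reg, const uint32_t buf[NUM_REPLICAS])
{
    for (int i = 1; i < NUM_REPLICAS; i++)
        if (buf[i] != buf[0])
            return -1;                  /* divergence detected          */
    *dev_reg = buf[0];                  /* the only unreplicated store  */
    return 0;
}
```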

Virtual machines

While seL4 is a microkernel, it works pretty well as a hypervisor. So while the RCoE approach transparently replicates applications, these do not have to be native seL4 apps; they can be virtual machines (VMs). Such a VM can run a complete Linux system as a guest, in which case the Linux system is also transparently replicated. Obviously, in such a configuration, replica voting is very coarse-grained, as the seL4 kernel is rarely invoked during VM execution.

A typical configuration has the more critical code running natively on seL4 (with frequent voting), while less critical code can run inside the Linux VM.

Two variants

We implemented RCoE on Intel x86 as well as on Arm processors (although we presently support virtual machines only on x86). We have also designed and implemented two different versions of RCoE: a closely-coupled one, where for voting the replicas are synchronised to the exact instruction (while executing asynchronously between votes), and a loosely-coupled version, where replicas of user-mode components are unsynchronised when voting. These represent different trade-offs between performance, vulnerability and restrictions on application characteristics.

Does it work?

It works pretty well. We ran a large number of experiments where we introduced random faults in physical memory. We injected enough faults to observe 1,000 errors (which took hundreds of thousands of fault injections). Our default configuration (where voting happens on kernel entry and exit, as well as on device communication) handled all but a handful gracefully, i.e. it detected the error before the system could produce erroneous output. We could also clearly see a performance-vs-dependability trade-off at work, where making the voting more frequent would increase dependability at the cost of reduced performance. With enough effort (i.e. checking), faults can be made close to negligible.

We could also show that the TMR configuration can effectively downgrade to DMR and continue operating with the faulty replica taken off-line. The cost of this transition is only of the order of a millisecond.

At what cost?

Performance overhead of the default configuration (which eliminates almost all uncontrolled failures) is of the order of 30–100%, compared to unprotected single-core operation. In other words, a redundant system takes 1.3–2 times the time of the unprotected single-core system to execute a workload. In terms of energy, this means that the TMR configuration will consume 3×1.3–2, or 4–6 times the energy of unprotected execution.

To put this in context: The rad-hardened RAD750 processor (costing 100s of k$) delivers 240 DMIPS at 133 MHz, or at most 40 DMIPS/W. Our SabreLite board produces 2,000 DMIPS per core at 800 MHz. Using three cores (for TMR) with a 2-times performance overhead it still delivers five times the DMIPS/W of the RAD750. In other words, being able to use a cheap, low-power multicore processor comes out far in front of the rad-hardened processor despite the overhead.

And in terms of cost, it ends up >10,000 times ahead. So there really is an opportunity to reduce cost of satellites with seL4!

Find out all the details

Full details of this work are in a paper published last June in DSN, the top publication venue for dependability and fault tolerance. It contains the complete technical details and complete evaluation. If you want even more gory detail, it’s all in Yanyan Shen’s PhD thesis.

Can I try it out?

We’ll make the software available for download from the project page soon.

But keep in mind, this is a research prototype so far, and would need some more engineering for real deployment. Ideally the kernel with redundancy support should be verified, to achieve a degree of dependability approaching that of (expensive and slow) reliable hardware.

10 Years seL4: Still the Best, Still Getting Better

A week ago, on 29 July, we at Trustworthy Systems celebrated seL4 Freedom Day, the 5th anniversary of the open-sourcing of seL4. Furthermore, it was seL4’s 10th birthday – the anniversary of the completion of the first functional-correctness proof of seL4, the first of any operating system, showing that the implementation was bug-free.

This anniversary prompts me to reflect on what happened with seL4 in those 10 years, what we have achieved with it, how it evolved, and what the on-going challenges and future directions are. I hope I’ll find the time to follow this up with more in-depth examinations of some of those developments.

seL4 was built from the beginning with three objectives in mind, all of equal importance:

  1. to have its implementation formally proved correct (which is what we mean when we say “verified”)
  2. to solve long-standing problems with resource management in microkernels (and OSes in general)
  3. to be suitable for real-world use.

I’ll look at how seL4 addresses all three.

The Proofs

The defining feature of seL4 is clearly its verification story, starting with the functional correctness proof. And, amazingly, it is still almost unique in this respect, much more so than I had assumed when we first created it. I had estimated we’d have a window of 3–5 years before someone created another verified OS kernel. It turned out I totally over-estimated the competition 😉.

While there are now a number of other OSes around with some sort of a code verification story, they are almost all toys (notwithstanding what their authors might be claiming). The only verified OS I have heard of that seems to have a defensible claim of being suitable for real-world use is the recently verified PROVENCORE system from French company Prove&Run. And even this system has a much narrower target domain: it’s designed for use in TrustZone-based trusted execution environments (TEEs), with a static architecture, and gets away with a very simple protection model. seL4, in contrast, is fully dynamic, and can support arbitrary secure system architectures, thanks to its flexible capability-based access-control model.

More proofs

The original proof showed that seL4’s C code was a correct implementation of its formal specification (mathematically speaking, the code is a “refinement” of the spec). This was done for an ARM11 processor (32-bit Arm v6 architecture). There were a few missing bits and pieces, most of which have since been completed.

The main remaining hole is that the kernel’s boot code is still not verified. This means that our proofs say that if the kernel boots into a safe state, it will remain in a safe state. The boot-code verification needs to show that the kernel actually gets into a safe state initially. Clearly this proof is overdue (and we are working on it as a background activity).

Beyond that, much progress has happened on the verification side:


seL4 proof chain.

  • We proved (in 2011) that the kernel is able to enforce integrity, i.e. prevent a subject from modifying data without explicit authorisation. This shows not only that the spec is correctly implemented, but that it has the desired property (i.e. integrity enforcement).
  • We proved (in 2013) that the separation-kernel configuration of seL4 enforces confidentiality in the (very strong) information-flow sense, meaning it can prevent unauthorised information leakage, including through covert storage channels. This is the complement of the integrity property, showing that (direct or indirect) reads only succeed if authorised. Note that the formalism has no notion of time, so this infoflow property cannot preclude timing channels (timing-channel prevention is one of our present foci).
  • We developed (in 2013) a translation-validation toolchain which proves that the binary (produced by the compiler and linker) is a correct translation of the verified C code. This means we do not have to trust the compiler (we use gcc), nor our assumptions on C semantics, drastically reducing the trusted computing base (TCB) of our verification.

Together, these proofs show not only that the kernel’s implementation is free of bugs and that the compiler did not introduce bugs of its own, but also that the kernel enforces security in a very strong sense, and that this property applies to the binary executing on the silicon.

We finally performed (in 2011) a complete and sound worst-case execution-time (WCET) analysis of seL4. To my surprise, this was the first such analysis performed of a protected-mode OS kernel in the literature, and is a core ingredient making seL4 suitable (and superior to any other system) for use in safety-critical real-time systems where not all code is trustworthy (so-called mixed-criticality systems, MCS). We have meanwhile improved this analysis, using the translation-validation framework to link properties proved about the source code to the binary. This improves the assurance of the whole process, and allows us to leverage our correctness proofs to bound loops and eliminate infeasible paths in the binary.

More architectures

The original proofs were done for Arm v6, and have since been ported to Arm v7. We added verification of the Arm v7a virtualisation support (“hyp mode”) in April’17. Arm v7a remains the architecture with the most complete verification story. This made seL4 a verified hypervisor.

In July’18 we completed the functional correctness proofs for x64, making this the first verified 64-bit version of seL4.

And before the end of this year we expect to complete the functional-correctness proofs for the 64-bit RISC-V architecture, as well as the translation-validation toolchain that carries those proofs through to the binary.

Still to do

There are a number of things still to be completed. I mentioned kernel initialisation earlier. There are also a number of parts that were not verified to the same degree as the rest of the kernel, in particular MMU management. A PhD thesis on formalising the MMU has just been completed, so this is mostly done.

We are also working on verifying the multicore version of seL4. This is challenging because of seL4’s uncompromising design for performance, which means that the implementation is racy. Removing races would make the problem far easier to solve, but we won’t accept the performance compromises this would imply. After all, our motto is “security is no excuse for poor performance”, a core differentiator to other high-assurance systems.

Resource Management

We developed seL4, starting in 2004, on the back of 10 years of experience with high-performance L4 microkernels, including the OKL4 kernel from Open Kernel Labs, which developed out of our L4-embedded kernel and ended up on billions of Qualcomm modem chips. Another fork of L4-embedded is what now runs on the Secure Enclave of all Apple iOS devices.

Issues of earlier L4 kernels

During that time we learnt about a number of shortcomings in the original L4 design. Several of those were already fixed in our L4-embedded kernel (a fork of the Karlsruhe Pistachio kernel), including overcoming the restrictive nature of the overloaded synchronous IPC, and complementing it with an asynchronous notification mechanism. The IPC model kept evolving with seL4, until it finally settled into a protected procedure call, complemented by Notifications which are binary semaphores.

The removal of a number of non-minimal features, including “long” IPC, and IPC timeouts, had already begun with L4-embedded and was adopted in seL4. We further eliminated a number of implementation tricks that had outlived their usefulness, including the virtual TCB array, and the process-kernel design.

By the time we started with seL4, the L4 community had reached a consensus that the model of addressing IPC to threads needed to be replaced by capability-based addressing and port-like IPC endpoints. (This was helped along by Jon Shapiro’s observation that the traditional L4 model had covert storage channels.)

Capabilities also cleanly solved another issue with original L4, that of limiting communication. The original model relied on an (inflexible) process hierarchy and redirection to a monitor process (“chief”) to limit data flow. Capabilities provide a cleaner, simpler and low-overhead model: Having a privilege does not in itself imply the ability to share that privilege, an additional grant right is needed to pass on capabilities.

Our 2016 paper discusses these issues in detail.

Untypeds

An issue that remained unresolved pre-seL4 was principled management of kernel memory. Prior L4 kernels had a kernel heap for allocating kernel data structures, which led to poor isolation and the potential for denial-of-service (DoS) attacks; some kernels introduced kernel memory pools and quotas as an unconvincing fix. Kevin, who had experimented with various models for years, developed the present seL4 model: the kernel has no heap whatsoever, and a strictly bounded stack; it never allocates memory after boot. Instead, user-level code is required to explicitly provide memory to the kernel for any operation that requires meta-data allocation. The mechanism for doing this is seL4’s Untyped memory, and retyping it into kernel-object types.

This model of retyping Untyped memory is extremely powerful. It is a very simple abstraction for managing (spatial) resources: A security domain can only create kernel objects out of whatever amount of Untypeds it is given. In particular, this makes it impossible for an agent to interfere with a domain’s ability to create kernel objects, thus avoiding DoS attacks by construction.
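To make this concrete, here is a hedged sketch of the retype operation as seen from user level, using the libsel4 invocation wrapper; the capability slots (untyped_cap, root_cnode, dest_slot) are placeholders that a real system would have received or set up at boot time.

```c
#include <sel4/sel4.h>

/* Turn part of an Untyped into a single Endpoint object, placing the new
 * capability into dest_slot of the caller's root CNode.  (Sketch only:
 * error handling and slot bookkeeping are omitted.) */
seL4_Error make_endpoint(seL4_Untyped untyped_cap,
                         seL4_CNode   root_cnode,
                         seL4_Word    dest_slot)
{
    return seL4_Untyped_Retype(untyped_cap,
                               seL4_EndpointObject, 0,  /* type, size_bits (ignored) */
                               root_cnode, 0, 0,        /* dest CNode, index, depth  */
                               dest_slot, 1);           /* offset, number of objects */
}
```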


seL4’s memory management model extends user partitions into the kernel

Furthermore, the need for usermode to provide working memory to the kernel means that partitioning userland will implicitly lead to partitioning of kernel memory, a very powerful property. This has been a major enabler of our isolation (integrity and confidentiality) proofs.

The downside is that the model is tricky to use and makes it easy to shoot yourself in the foot (as many students will attest). However, it is not the job of the microkernel to protect your foot from yourself, it is to protect you from malicious code by providing strong isolation. This is exactly what seL4 does extremely well.

Time

The biggest remaining hole in the resource management story of the original seL4 had to do with managing time (as stated in the 2016 paper). L4 traditionally had a weak (if any) temporal isolation story, and at the time of the design of seL4, this was left in the too-hard basket, and we used a fairly naive approach to scheduling. This was always meant to be addressed later.

Temporal integrity

We fixed some minor issues when performing the WCET analysis of the kernel, including replacing the (on average fast but worst-case extremely slow) “lazy scheduling” with Benno scheduling (which is just as fast on average, without the pathological worst case).

But the main problem remained: no accurate accounting for time, especially with shared servers. This meant that seL4’s support for hard real-time (RT) systems was limited: only in a narrow set of scenarios could it provide the required integrity guarantee, namely that a critical thread will be able to meet its deadline.


“Passive’ servers execute on a client’s donated scheduling context.

We finally addressed this cleanly by evolving the model of scheduling contexts (developed earlier by the Dresden group for a pre-capability kernel) and integrating it with seL4’s capability model. This finally made time just another resource authorised by capabilities, and thus a first-class resource managed in a principled fashion. Scheduling contexts are now first-class kernel objects; they represent the right to access a share of the CPU.

This model is implemented in the MCS kernel, so called because it provides the right mechanisms for supporting MCS, including the ability to guarantee time to critical threads in the presence of untrusted high-priority threads. It is the first capability-based model of managing time in a way that is suitable for hard real-time systems (together with the concurrent work on the Composite OS; KeyKOS “meters” were also time capabilities but not suitable for hard RT).

The MCS kernel is, for now, in a branch. The reason is our commitment that changes to the mainline kernel must not break verification. The MCS model requires very invasive changes to the kernel, which means re-verifying is a lot of work. However, this work is nearing completion, and we expect MCS to be merged into mainline by the end of this year or early next.

As it is the future of seL4, and has a number of other improvements that simplify many use cases, we strongly recommend new developments to be based on MCS.

Timing channels

The MCS kernel solves temporal integrity, but confidentiality remains unresolved: the kernel cannot prevent information leakage through timing channels.

Of course, in this respect seL4 is no worse than other general-purpose OSes, but in seL4’s case it’s the last remaining security problem, while others have bigger issues to deal with. In fact, given our experience with trying to stop leakage on commercial hardware, I strongly suspect that all “high-assurance” separation kernels (which are based on strict space and time partitioning, SATP) leak as well.


Time protection partitions a system, including the kernel.

Our recent work on time protection indicates that we can solve this problem in seL4: by temporally or spatially partitioning all shared hardware resources, we partition even the kernel itself. This assumes that the hardware manufacturers get their act together and provide mechanisms for resetting all time-shared resources (and I’m working on the RISC-V Foundation to ensure that RISC-V does).

However, so far our time protection is a set of primitive mechanisms. They require proper integration with the seL4 model, which we are working on right now. This will then create a new verification challenge, which we think is solvable.

Design for Real-World Use

Suitability for real-world use really means two things: generality and performance.

Generality

For a microkernel, this means keeping it as much as possible policy-free, and instead providing primitive mechanisms that are general enough to support the construction of virtually arbitrary systems on top.

Core to this is seL4’s capability-based access-control model (strongly influenced by KeyKOS and EROS), with its support for efficient delegation. And it required solving the resource-management problems that had plagued L4 (and other) microkernels in the past. Details of this model have evolved (as discussed above), but the principles and “look-and-feel” have remained.

Whether we hit the right design point is still undecided. For years we were so busy supporting various security- and safety-critical deployments (and evolving the system to improve this support) that we did not focus very much on truly general systems, including flexible multi-server architectures. This is on-going work, which has recently accelerated.

Performance

When we started the seL4 project, my message to the team was unambiguous: “I will not consider this project a success if we lose more than 10% in IPC performance against the fastest kernel we’ve built to date.”

At the time of the first publication on seL4 (SOSP, Nov’09), the verified kernel was right at that 10% limit (the unverified version was faster and almost on par with our previous ARM kernel). Further work on verification-friendly micro-optimisation of the IPC fastpath resulted in the verified seL4 outperforming all our previous kernels. And that means verified seL4 outperforms any microkernel. In almost all cases that’s by about a factor of 10 in IPC latency. The closest in performance is the Fiasco.OC (L4Re) microkernel, which is only about a factor two slower than seL4. As I said earlier: Security is no excuse for poor performance!

Our uncompromising performance focus means that we cannot take shortcuts others regularly take, and is, for example, the reason our (high-performance) multicore version is not yet verified. But it’ll happen!

Real-world deployments

That our design-for-real-use works is shown by the fact that seL4 uptake is accelerating. One might think that ten years is a long time for this, but keep in mind that for the first half of this period, seL4 was locked up in IP arrangements that made it essentially useless for anything but an internal research and teaching vehicle. Furthermore, deploying seL4 for enforcing security or safety means that (unless you are developing a system from scratch) you have to re-architect your system thoroughly. This is not something people enter into lightly, and it also takes time.


Boeing ULB flying on seL4.

The DARPA HACMS program was a great success in this regard. Not only did HACMS show successful use of seL4 in real-world systems: Boeing’s Unmanned Little Bird (ULB) autonomous helicopter and US Army autonomous trucks.


Cyber-retrofit of ULB.

More importantly, it demonstrated the feasibility of using seL4 to retrofit security into an existing real-world system,
a process dubbed incremental cyber-retrofit by then DARPA I2O Director John Launchbury. This is an approach we (and others) are presently taking to harden a number of real-world systems.

It is in the nature of open source that we only know about a fraction of the projects building on seL4, and it is not unusual that I find out about one of them by accident. And some projects we are involved in are commercial-in-confidence. But there are some I can talk about.

  • A USB-based secure communication device, which uses military-strength encryption to communicate via Bluetooth or Wifi, has been evaluated and approved for defence use. To my knowledge this is the first time that software (i.e. seL4) enforced isolation was considered equivalent to air gapping by a certification authority. It is already in use in several defence forces.


    CDDC displaying windows of different classifications.

  • The Cross-Domain Desktop Compositor (CDDC) is a device that allows securely connecting a single keyboard, mouse and display to multiple networks of different security classifications or domains. Technically it is a drop-in replacement for a KVM switch, but more powerful, as it supports the concurrent rendering of windows from multiple domains on the same screen. Hardware ensures that each window is decorated according to its classification, and there is a reserved region on the screen that indicates where inputs go to.  If so configured, the system allows the operator to cut-and-paste between windows. Previous designs (not using seL4) left certification authorities unconvinced of their security. The CDDC is about to enter a productisation phase.
  • HENSOLDT Cyber is developing a secure hardware-software system, based on their own secure RISC-V chip, and an seL4-based secure operating system. [Disclosure: I’m chief scientist (software) of HENSOLDT Cyber.] It targets critical systems in defence, avionics, industrial automation and automotive.
  • A US-based autonomous driving startup is using seL4 to protect the safety of its systems.
  • There are various seL4-based TrustZone TEEs in development. And the Keystone TEE for RISC-V is based on seL4.
  • Third parties are running seL4 training courses.
  • A US DoD-funded organisation has run an seL4 Summit last year and is running another one this year.

This is just a selection; there is much more going on, and deployments are definitely accelerating.

The Future

Above I listed a number of current engineering projects, including full, verified RISC-V support, and verifying and mainlining the MCS model. On top of that we’re looking for funding to verify the 64-bit Arm v8 port. The differences in the privileged ISA between Arm v7 and v8 are big enough to make this essentially an architecture port, comparable to RISC-V. But the RISC-V verification project resulted in improved abstractions in the proof architecture, which should reduce cost.

Then there is the verification of the multicore kernel, which I said is on-going, although slowly. I’m hopeful we’ll be able to accelerate this significantly with external funding. This has significant research challenges to address, besides the proof engineering.

And then there is time protection, where we have developed the core mechanisms but not yet a proper security model integrated into the seL4 API; this is on-going research, as is the verification of the mechanisms.

In this context I’d like to come up with a better (verifiable) confidentiality notion than the sledge-hammer information flow, which is super-restrictive and only applicable to a narrow set of realistic scenarios. We want something that is much more practicable.

Besides that we are working on strongly growing the seL4 community, especially fostering the seL4 open-source community. We have taken a number of steps there, including an RFC process. Much more is in the pipeline, and I hope we’ll be able to make an announcement in the near future.

Stay tuned!

How to (and how not to) use seL4 IPC

IPC is a message passing mechanism, right? After all, it’s an acronym for inter-process communication?

That’s historically true, but seL4 IPC has come a long way, and its IPC primitive is quite a different beast from message-passing primitives offered by other OSes, even quite different from the original L4 IPC.

A little bit of history

Jochen Liedtke’s L4 microkernel, developed in the mid-90s, was the first that showed that IPC could be fast, an incredible factor of 10–20 over all contemporaries. And its IPC was a highly overloaded mechanism, providing:

  • by-value data transfer, including large and unaligned buffers
  • page mappings in power-of-two multiples of the base page size
  • synchronisation.

The last was a by-product of the performance-driven rendezvous design that implicitly synchronised sender and receiver, but was the only way to synchronise threads that did not share memory. L4 IPC also had timeouts to prevent denial of service.

Over the years we removed much of the L4 IPC functionality, mostly to simplify the kernel, but also the programming model. For example, L4 forced a multi-threaded design on many applications even on a single core. This is bad policy that forces concurrency control onto code that is not meant to be concurrent. In seL4 we simplified things to the bare minimum. As a consequence, the nature of IPC changed dramatically from the original L4.

What seL4 IPC is not

When dealing with seL4, you probably need to forget just about everything you may have learned about IPC from other systems. In particular, you should forget that it is an acronym. Think of it as a term like IBM, ARM or ACM; these names have far outlived their original meanings. The same is true for IPC in seL4. In particular:

  • seL4 IPC is not a mechanism for shipping data
  • seL4 IPC is not a mechanism for synchronising activities.

So, what is it then?

A great, time-honoured definition (that actually predates seL4) was given by my then student Chuck Gray:

IPC is a user-controlled context switch with benefits.

You switch to a different thread (usually in a different address space), without involving the scheduler. The benefit is that you get to carry along a bit of data. That definition captures the bare-bonedness of seL4 IPC, which is in line with the general seL4 philosophy that the microkernel is a minimal software veneer for securely multiplexing hardware.

A more utilitarian definition is this:

IPC is the seL4 mechanism for implementing cross-domain function calls.

You should really think of IPC in those terms, and only those. Something like an RPC mechanism, except you don’t go across networks, only protection-domain boundaries.

One of the implications of this definition is that there are really two main APIs for IPC:

  1. result = call (function, arguments)
  2. client = reply&wait (client, result, &arguments)

where function is the capability of the (server) function you invoke. Present mainline has the clients represented implicitly, but this is being fixed with the introduction of explicit reply capabilities in the new mixed-criticality (MCS) kernel, which is making its way through verification and will eventually replace mainline. The MCS kernel makes invocation more like a function call, by providing passive servers, effectively resulting in a migrating-threads model. More on that below.

Any other IPC APIs are only for initiating communication protocols or exception handling.
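To make the pattern concrete, here is a hedged sketch of both sides using the (pre-MCS) libsel4 wrappers; ep_cap and the one-word message layout are placeholder assumptions, not a prescribed protocol.

```c
#include <sel4/sel4.h>

/* Client side: result = call(function, argument). */
seL4_Word client_call(seL4_CPtr ep_cap, seL4_Word arg)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 1);
    seL4_SetMR(0, arg);                        /* the function argument      */
    seL4_Call(ep_cap, info);                   /* context switch to server   */
    return seL4_GetMR(0);                      /* the function result        */
}

/* Server side: a reply&wait loop implementing the "function". */
void server_loop(seL4_CPtr ep_cap)
{
    seL4_Word badge;
    seL4_Recv(ep_cap, &badge);                 /* wait for the first request */
    for (;;) {
        seL4_Word result = seL4_GetMR(0) + 1;  /* the "function" body        */
        seL4_SetMR(0, result);
        seL4_MessageInfo_t reply = seL4_MessageInfo_new(0, 0, 0, 1);
        seL4_ReplyRecv(ep_cap, reply, &badge); /* reply, then wait for next  */
    }
}
```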

IPC no-no’s

With the above definition in mind, it should be clear that you should never use it for a number of things, especially:

  • sending bulk data
    You don’t pass by-value arrays to a function. Don’t do it with IPC either. The message arguments are function arguments, i.e. small data items that form the function inputs
  • using send-only or receive-only IPC
    … except for protocol initialisation or exception handling
  • using IPC for synchronisation
    that’s what Notification objects are for (see the sketch after this list)!
  • using cross-core IPC
    that’s forcing synchronisation of activities that should be separate, and forces extra scheduler invocations. Don’t do it!
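For the synchronisation case, here is a minimal sketch of the right tool: a Notification object used as a binary semaphore guarding a shared-memory buffer. ntfn_cap and the buffer are placeholders.

```c
#include <sel4/sel4.h>

/* Producer: fill a shared buffer, then signal the Notification. */
void producer(seL4_CPtr ntfn_cap /* , shared buffer ... */)
{
    /* ... write data into the shared-memory buffer ... */
    seL4_Signal(ntfn_cap);          /* wake the consumer; no message body */
}

/* Consumer: block until signalled, then read the shared buffer. */
void consumer(seL4_CPtr ntfn_cap /* , shared buffer ... */)
{
    seL4_Word badge;
    seL4_Wait(ntfn_cap, &badge);    /* blocks until the producer signals  */
    /* ... read data from the shared-memory buffer ... */
}
```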

So, why are those things supported at all?

Good question, and the answer is historical in some cases. And I certainly think we should reduce the size of the IPC buffer to no more than 64 bytes. You should never need more, and reducing this size will reduce the kernel’s worst-case execution time (WCET), which is important for real-time systems.

Obviously, send-only and receive-only IPC is still needed for the listed exceptions.

Cross-core IPC might actually still have valid use cases in the mainline kernel (that kernel is legacy for me, so I won’t bother thinking this through). The new MCS kernel introduces scheduling contexts that are passed along with IPC, and passive servers, which can only run at the expense of their client. This kernel definitely does away with the need for cross-core IPC, and we should seriously consider removing it.

Some interesting consequences

When IPC is used correctly, endpoint queues are shallow.

In particular, a shared server (i.e. a process providing functionality to multiple clients via an RPC interface) should always run at a priority at least as high as that of any of its clients. This prevents the server from being preempted by its own clients. (Not following this rule would result in high-priority clients impeding their own progress, including possible deadlocks.) If this rule is followed, then intra-core IPC can never lead to more than one thread blocked on a particular endpoint.

In other words, the maximum queue depth for an endpoint in a properly-designed single-core system should be one. This means the simplest implementation of the queue, i.e. a linked list, should be just fine.

On multicore, following the no-cross-core-IPC rule, it should be the same, as inter-core servers should be invoked by a protocol using shared buffers synchronised with semaphores (Notification objects).

However, with the MCS kernel and passive servers, the story is a bit different. A passive server always executes on its client’s core (so the local-only rule still applies). However, it can be invoked from any core – in fact, that’s the elegant way to implement mutual exclusion with priority ceiling, a fast and deadlock-free way to protect critical sections. This means that a passive server’s request endpoint can queue up to n-1 clients on an n-core system.

seL4 is not meant to run as a single kernel image on large number of cores, it is meant to scale as far as cache-line migration costs are small (say 10–20 cycles). For larger machines, clustering should be used. This means the linked-list implementation of endpoint queues should serve the multicore case as well.

Summary

Use IPC correctly: for RPC-like invocation of functions in a different address space. Period.

Microkernels Really Do Improve Security

Many of us operating systems researchers, going back at least to the US DoD’s famous 1983 Orange Book, have been saying that a secure/safe system needs to keep its trusted computing base (TCB) as small as possible. For systems of significant complexity, this argues for a design where components are protected in their own address spaces by an underlying operating system (OS). This OS is inherently part of the TCB, and as such should itself be kept as small as possible, i.e. be based on a microkernel, which comprises only some 10,000 lines of code. This is why I have spent about a quarter century on improving microkernels, culminating in seL4, and getting them into real-world use.


Monolithic vs microkernel-based OS structure.

It is intuitive (although not universally accepted) that a microkernel-based system has security and safety advantages over a large, monolithic OS, such as Linux, Windows or macOS, with their million-lines-of-code kernels. Surprisingly, we lacked quantitative evidence backing this intuition, beyond extrapolating defect density statistics to the huge TCBs of monolithic OSes.

Finally, the evidence is here, and it is compelling

Together with two undergraduate students, I performed an analysis of Linux vulnerabilities listed as critical in the Common Vulnerabilities and Exposures (CVE) database. A vulnerability is tagged critical if it is easy to exploit and leads to full system compromise, including full access to sensitive data and full control over the system.

For each of those documented vulnerabilities we analysed how it would be affected if the attack were performed against a feature-compatible, componentised OS, based on the seL4 microkernel, that minimised the TCB. In other words, an application running on this OS should only be dependent on a minimum of services required to do its job, and no others.

We assume that the application requires network services (and thus depends on the network stack and a NIC driver), persistent storage (a file system and a storage device driver) and console I/O. Any OS services not needed by the app should not be able to impact its confidentiality (C), integrity (I) or availability (A). For example, an attack on a USB device should not impact our app. Such a minimal-TCB architecture is exactly what microkernels are designed to support.

The facts

The complete results are in a peer-reviewed paper; I’m summarising them here. Of the 115 critical Linux CVEs, we could analyse 112; the remaining 3 did not provide enough information to understand how they could be mitigated. Of those 112:

  • 33 (29%) were eliminated simply by implementing the OS as a componentised system on top of a microkernel! These are attacks against functionality Linux implements in the kernel, while a microkernel-based OS implements them as separate processes. Examples are file systems and device drivers. If a Linux USB device driver is compromised, the attacker gains control over the whole system, because the driver runs with kernel privileges. In a well-designed microkernel-based system, only the driver process is compromised, but since our application does not require USB, it remains completely unaffected.
  • A further 12 exploits (11%) are eliminated if the underlying microkernel is formally verified (proved correct), as we did with seL4. These are exploits against functionality, such as page-table management, that must be in the kernel, even if it’s a microkernel. A microkernel could be affected by such an attack, but in a verified kernel, such as seL4, the flaws which these attacks target are ruled out by the mathematical proofs.
    Taken together, we see that 45 (40%) of the exploits would be completely eliminated by an OS designed based on seL4!
  • Another 19 (17%) of the exploits are strongly mitigated, i.e. reduced to a relatively harmless level, affecting only the availability of the system, i.e. the ability of the application to make progress. These are attacks where a required component, such as the NIC driver or network stack, is compromised; on Linux this compromises the whole system, while on the microkernel it might at worst crash the network service (making it unavailable) without being able to compromise any data. So, in total, 57% of attacks are either completely eliminated or reduced to low severity!
  • 43 exploits (38%) are weakly mitigated with a microkernel design, still posing a serious security threat but no longer qualifying as “critical”. Most of these were attacks that manage to control the GPU, implying the ability to manipulate the frame buffer. This could be used to trick the human user into entering sensitive information that could be captured by the attacker.

Only 5 compromises (4%) were not affected by OS structure. These are attacks that can compromise the system even before the OS assumes control, e.g. by compromising the boot loader or re-flashing the firmware; even seL4 cannot defend against attacks that happen before it is running.


Effect of (verified) microkernel-based design on critical Linux exploits.

What can we learn from this?

So you might ask, if verification prevents compromise (as in seL4), why don’t we verify all operating systems, in particular Linux? The answer is that this is not feasible. The original verification of seL4 cost about 12 person years, with many further person years invested since. While this is tiny compared to the tens of billions of dollars worth of developer time invested in Linux, it is about a factor of 2–3 more than it would cost to develop a similar system using traditional quality assurance (as done for Linux). Furthermore, verification effort grows quadratically with the size of the code base. Verifying Linux is completely out of the question.

The conclusion seems inevitable: The monolithic OS design model, used by Linux, Windows, macOS, is fundamentally and irreparably broken from the security standpoint. Security is only achievable with a (verified) microkernel design. Furthermore, using systems like Linux in security- or safety-critical applications is at best grossly negligent and must be considered professional malpractice. It must stop.

Insecure by design – lessons from the Meltdown and Spectre debacle

The disclosure of the Meltdown and Spectre computer vulnerabilities on January 2, 2018 was in many ways unprecedented. It shocked – and scared – even the experts.

The vulnerabilities bypass traditional security measures in the computer and affect billions of devices, from mobile phones to massive cloud servers.

We have, unfortunately, grown used to attacks on computer systems that exploit the inevitable flaws resulting from vast conceptual complexity. Our computer systems are the most complex artefacts humans have ever built, and the growth of complexity has far outstripped our ability to manage it.

A new kind of vulnerability

Meltdown and Spectre are qualitatively different from previous computer vulnerabilities. Not only are they effective across a wide class of computer hardware and operating systems from competing vendors. And not only were the vulnerabilities hiding in plain sight for more than a decade. The really shocking realisation is that Meltdown and Spectre do not exploit flaws in the computer hardware or software.

As Intel stated in its press release, these attacks:

…gather sensitive data from computing devices that are operating as designed.

The ingenuity of the attacks lies in combining seemingly unrelated design features that were thought to be well understood – stuff we teach undergraduate computer science students. The vulnerability is not in any of the individual features, but in the complex interaction between them.

It turns out that computer systems are insecure not because of mistakes made in the implementation, but because of ill-conceived design.

As a community of computer systems experts, we have to ask ourselves how such a debacle is possible, and how a recurrence can be prevented.

We have known for a while that the established “wait for something to happen and then try to fix it” approach – better known as “patch and pray” – does not work even for more common implementation flaws, as witnessed by the proliferation of exploits. It works even less well for such insecure-by-design situations.

Automated evaluation of designs

The fundamental problem is that humans are unable to fully understand the conceptual complexity of modern computer systems and how their seemingly unrelated features might interact. There is no hope that this will change.

But solving complex problems is what machines are increasingly good at. So, the only real solution can be the automated evaluation of designs, with the aim of mathematically proving that under all circumstances a design will behave in a way that is considered secure – in particular by not leaking secret data.

In other words, a design must be considered insecure unless there is a rigorous mathematical proof to the contrary.

This is not an easy ask by any definition, and much more work across many areas of computer science and engineering is needed to make it a reality. But we need to start somewhere, and we need to start now.

We will reap benefits of embarking on such a program long before we achieve the goal of rigorous end-to-end proof. Significant improvements will be achieved through partial results, both in the form of proving weaker properties, and by establishing desired properties in a less rigorous fashion.

For example, an incomplete evaluation may be more feasible than a complete one, and produce a probabilistic result, such as a greatly reduced likelihood of exploits.

Rewriting the hardware-software contract

A necessary, and overdue, first step is a new and improved hardware-software contract.

Computer systems are a combination of hardware and software. The people and companies that develop hardware are largely separate from those developing the software. Given the vastly different skills and experience required, this is inevitable.

To make development practical, both sides work to an interface, called the instruction-set architecture (ISA), which presents the contract between hardware and software functionality.

The problem, clearly exhibited by the Meltdown and Spectre attacks, is that the ISA is under-specified for security, or safety for that matter. It simply does not provide ways to isolate the speed of progress of a computation from other system activities.

The ISA is a functional specification, meaning it defines how the visible state of the machine will (eventually) change if an operation is triggered. It intentionally abstracts away anything to do with time. In particular, it hides how long operations take and how this time depends on the internal state of the machine. The problem is that this internal state depends on potentially confidential data processed by previous operations.

This means that by observing the exact timing of particular sequences of operations, it is possible to infer data that is supposed to be kept secret. This is exactly what happened with Meltdown and Spectre.
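To illustrate the mechanism, here is a minimal, self-contained sketch of the timing measurement such attacks rely on: a load from a flushed (uncached) address takes measurably longer than a load from a cached one, so latency reveals the cache state left behind by earlier computation. This is only the measurement primitive, not an attack, and it assumes an x86 processor and a compiler providing the __rdtscp and _mm_clflush intrinsics.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static uint8_t probe[4096];

/* Time a single load; a short latency means the cache line was present. */
static uint64_t time_load(volatile uint8_t *addr)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                        /* the load being measured */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

int main(void)
{
    _mm_clflush(probe);                 /* evict the line: expect a slow load */
    uint64_t cold = time_load(probe);
    uint64_t warm = time_load(probe);   /* now cached: expect a fast load     */
    printf("cold: %llu cycles, warm: %llu cycles\n",
           (unsigned long long)cold, (unsigned long long)warm);
    return 0;
}
```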

The abstraction is there for a good reason: It allows hardware designers to change things “under the hood”, usually in order to improve performance. Consequently, there will be resistance from hardware manufacturers to a tighter contract. But we believe that the refined specifications can be kept abstract enough to retain manufacturers’ ability to innovate, and to avoid exposing confidential IP.

The recent debacle has shown that the ISA is too abstract, making it impossible to tell whether a system is secure or whether it will leak secrets. This must change, urgently.

Co-authored with Yuval Yarom, Researcher, CSIRO’s Data61; Senior Lecturer, University of Adelaide

This article was originally published on The Conversation. Read the original article.

Verified software can (and will) be cheaper than buggy stuff

Software is expensive. Verified software is more expensive (although not as much as people think). But we’re working on closing that gap, with the aim of making verified (highest-assurance) software no more expensive than traditionally engineered (no-assurance) code. Our Cogent approach is a major step in that direction.

The cost of verified software

Our paper describing the complete seL4 verification story analysed the cost of designing, implementing and proving the implementation correctness of seL4. We found that the total cost (not counting one-off investments into tools and proof libraries) came to less than $400 per line of code (LOC).

This may look expensive at first, but less so when taken in context. The Pistachio microkernel, developed a few years earlier in very similar circumstances (a university environment with people who knew what they were doing), was only a factor of 2–3 less expensive. Pistachio makes no assurance claims whatsoever (but is generally well-engineered). If you take into account that Pistachio experimented far less with kernel design, and that its cost does not include the full life-cycle cost (especially later bug-hunting, which doesn’t happen for seL4), you’ll see that we aren’t all that far away.

Another data point is a number quoted by Green Hills some ten years ago: They estimated the full (design, implementation, validation, evaluation and certification) cost of their high-assurance microkernel to be in the order of $1k/LOC.

In other words, we are already much cheaper than traditional high-assurance software, and a factor of 2-3 away from low-assurance software. If we could close the latter gap, then there would be no more excuse for not verifying software.

How to reduce verification cost

The main reason for the high cost of verification is that it’s very labour-intensive. Almost all of the (by now) quarter million lines of seL4 proof code are hand-written, by real humans. And as those numbers show, there is a lot of proof code to write. Just for the seL4 correctness proof (we now have many additional proofs of high-level properties), there were over 20 lines of proof per line of C.

This means there are two obvious avenues to reducing cost:

  1. reducing the ratio of proof lines to code lines, and
  2. generating proofs automatically.

Reducing the amount of proof is possible when the code is written in a more verification-friendly way. Not much can be done there when writing in C, the implementation language of seL4 (chosen for good reason). However, if we use a high-level language that is strongly typed and memory safe, many of the properties we had to prove for seL4 would be automatically enforced by the language. Furthermore, functional languages have many advantages, such as freedom from side effects, and verifying code written in such a language is easier (requires fewer proofs) than verifying imperative-language code.
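
As a trivial illustration (Haskell serving as the stand-in high-level language; the example is mine, not from seL4): in code like the following there is simply nothing to prove about out-of-bounds accesses, dangling pointers or hidden side effects. The type system and purity take care of all that, and the remaining proof obligation is only about the function’s logic.

    -- Illustrative only. A bounds-checked lookup: invalid indices are ruled
    -- out by the Maybe type, not by a separate proof about pointer
    -- arithmetic, and purity means the result depends only on the inputs.
    lookupSlot :: Int -> [a] -> Maybe a
    lookupSlot _ []       = Nothing
    lookupSlot 0 (x : _)  = Just x
    lookupSlot n (_ : xs)
      | n < 0             = Nothing
      | otherwise         = lookupSlot (n - 1) xs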

We got a glimpse of this in the original seL4 verification. It was done in two steps: first a refinement from the formal kernel spec to an “executable spec”, which was essentially Haskell code, and then a second refinement from there to the C code. Had we stopped at the Haskell level, we would have saved about a third of the total proof effort. But the Haskell code was intentionally written in an imperative style, and at a lower level than you would normally write Haskell. This is because the verifiers preferred to work on the first refinement, and tried to get as much as possible out of it by pushing the Haskell level as low as possible. Refining to more “normal” Haskell code would have made the first refinement step much simpler and cheaper (at the cost of disproportionately more effort going into the second refinement).
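
The difference in style is roughly the following (a made-up miniature example, not the actual seL4 Haskell spec): the executable spec looked like the first version, threading explicit kernel state through a state monad much as the C code manipulates global state, whereas “normal” Haskell would look more like the second, which is closer to the abstract spec and correspondingly cheaper to verify against it.

    import Control.Monad.State

    -- A stand-in for a (tiny part of) the kernel state.
    data Thread = Thread { priority :: Int, timeSlice :: Int }

    -- Imperative, C-like style: read the state, modify a field, write it
    -- back. This mirrors the structure of the C implementation, which is
    -- why the second refinement (Haskell to C) was comparatively cheap.
    setTimeSliceImp :: Int -> State Thread ()
    setTimeSliceImp budget = do
      t <- get
      put t { timeSlice = budget }

    -- "Normal" Haskell: the same operation as a pure function on the
    -- state, much closer to the abstract specification.
    setTimeSlicePure :: Int -> Thread -> Thread
    setTimeSlicePure budget t = t { timeSlice = budget }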

Of course, had we stopped at Haskell, we would not have had a complete verification story. We would have had to trust the Haskell compiler, which is an order of magnitude bigger than seL4, as well as a sizeable Haskell runtime. We therefore did the real implementation in C, and proved its correctness (and later also that it was correctly translated into machine code).

However, the translation from Haskell to C, while done by a human, was for the most part fairly mechanical (the main exception being the performance-critical fast paths). In fact, it should be possible to automate this translation. And if you automate the translation (i.e. build a compiler), then it should also be possible to prove the correctness of that translation, as was done for the CompCert verified C compiler.
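
To make the idea of a proven-correct translation concrete, here is a toy sketch (nothing to do with the actual Cogent or CompCert proofs, and written in Haskell rather than in a theorem prover): a minimal expression language, a “compiler” to a little stack machine, and the correctness statement relating the two. In a verified compiler, that statement is proved once and for all, so no per-program verification of the generated code is needed.

    -- A toy source language and its semantics.
    data Expr = Lit Int | Add Expr Expr

    evalExpr :: Expr -> Int
    evalExpr (Lit n)   = n
    evalExpr (Add a b) = evalExpr a + evalExpr b

    -- A toy target: a stack machine, and a compiler targeting it.
    data Instr = Push Int | AddTop

    compile :: Expr -> [Instr]
    compile (Lit n)   = [Push n]
    compile (Add a b) = compile a ++ compile b ++ [AddTop]

    run :: [Instr] -> [Int] -> [Int]
    run []              stack           = stack
    run (Push n : rest) stack           = run rest (n : stack)
    run (AddTop : rest) (x : y : stack) = run rest (x + y : stack)
    run (AddTop : _)    stack           = stack   -- cannot happen for compiled code

    -- The compiler-correctness statement one would prove (here it can at
    -- least be tested): running the compiled code yields exactly the value
    -- of the source expression.
    --   run (compile e) [] == [evalExpr e]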

Cogent: Co-generation of code and proof

Our Cogent framework combines the above two observations in order to reduce verification cost:

  • we use a high-level, strongly typed, memory-safe functional
    implementation language — Cogent;
  • we automatically translate Cogent into C, and with the C also
    generate a formal spec of the code, plus a proof that the generated
    C is correct against this spec.

The generated spec is expressed in the logic of the Isabelle theorem prover, and is at the same abstraction level as the Cogent code (the two look essentially identical).

Importantly, Cogent compiles into straight C, without the need for a run-time system, so verifying that code is all that’s needed. Cogent code does require some abstract data types (ADTs) that are implemented in C and must be verified manually, but these are called explicitly by the Cogent code, so there is no library code that gets pulled in under the hood. And there is certainly no such ugly stuff as a garbage collector.

The combination of a nice, clean functional spec and the ability to compile into straight, efficient C without garbage collection is made possible by the use of linear types: a linearly-typed value must be used exactly once, which lets the compiler safely turn functional updates into destructive, in-place updates in the generated C. The separate ADT library is the cost we pay for this, as the resulting language is not Turing complete (it has no recursion or unbounded loops; iteration is provided by the ADT library).
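
The flavour of linear typing can be seen with GHC’s LinearTypes extension (shown here purely as an illustration; Cogent’s own syntax and type system differ). Because a value of linear type is consumed exactly once, the compiler always knows when the old version of an object is dead, which is what makes in-place updates safe.

    {-# LANGUAGE LinearTypes #-}

    -- Accepted: each component of the pair is consumed exactly once.
    swapLinear :: (a, b) %1 -> (b, a)
    swapLinear (x, y) = (y, x)

    -- Rejected by the type checker: 'x' would be used twice, so the
    -- compiler could not treat the argument as uniquely owned ...
    -- dupLinear :: a %1 -> (a, a)
    -- dupLinear x = (x, x)

    -- ... and this is rejected too, because 'x' is not used at all.
    -- dropLinear :: a %1 -> ()
    -- dropLinear _ = ()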

With this, all we have to do is implement our system in Cogent and then verify that the Cogent code (as represented by the Isabelle spec) is correct against the top-level formal spec. In other words, had we written seL4 in Cogent, we would have been done after the first refinement (plus verifying the ADTs). Furthermore, Cogent code is higher-level than our Haskell implementation was. We would easily have saved between a third and a half of the overall verification effort.

It works!

We didn’t try to use Cogent to (re)implement seL4. After all, seL4 is done, and who wants to redo something that’s already been done? (Note that the linear type system would have forced much of the kernel into external libraries, thus defeating the point of using Cogent.) Instead we picked other important systems code as a case study: file systems.

We implemented two file systems in Cogent. One is a re-implementation of the Linux ext2fs. It’s a complete implementation of the original ext2fs and passes the test suite, but lacks some more recent features (symlinks and ACLs).

The second file system is BilbyFS, a custom flash file system, which in complexity and efficiency sits somewhere in between the Linux standard flash file systems JFFS2 and UBIFS.

Our paper recently published at ASPLOS describes the experience. We found that we’re not quite there yet in terms of performance: while our file systems achieve essentially the same throughput as the hand-written C implementations, CPU load shows that there is presently about a factor-two overhead. According to our motto that security is no excuse for bad performance, we have more work to do. One issue is that we overestimated the ability of the C compiler to optimise struct arguments, so we end up with a lot of redundant data copies. But this is fixable by putting more smarts into the Cogent compiler.

But our experience so far shows that we’re on the right track: We can implement real-world systems code in Cogent, and integrate with real systems. For evaluation we loaded our file systems into Linux, but we will use them natively in seL4-based systems as well.

What’s the cost of verification now?

An evaluation of the verification cost is extremely encouraging. We did not completely verify our file systems, but we did completely verify the implementation of two external APIs of BilbyFS: sync() and iget(). And the result is stunning: the implementation of all functionality required for sync() is 300 lines of Cogent, iget() is 200 lines. The verification of that functionality required 5,700 lines of proof for sync() and 1,800 lines for iget(), i.e. roughly 19 and 9 lines of proof per line of code respectively, compared to more than 20 for seL4. So the first take-away is that the ratio of proof to code lines seems to be reduced, which is what we hoped to see, and which by itself promises some cost reduction.

We see a similar picture when looking at proof effort. For seL4, the cost was 1.4 person years (py) per kLOC; for BilbyFS it is down to 0.6 py/kLOC, about a 60% reduction. This is due to a combination of two factors: the functional abstraction provided by Cogent, but also the fact that BilbyFS is much more modular than is possible for a high-performance microkernel.

We cannot yet read too much into those numbers, as they are for small pieces of code, and experience shows that effort grows roughly with the square of the size. But the trend is good. And it is roughly in line with what we expect.

But the important part is that we totally eliminate the manual effort of verifying down to C. That is roughly the equivalent of seL4’s second refinement step, except that with Cogent we can make this automated step bigger (and correspondingly shrink the manual step). As argued above, this alone should result in a reduction of effort by a third to a half. And this is what we already have.

Summary

More work remains to be done, especially on eliminating the performance gap. But we have taken a big step towards closing the factor 2–3 cost gap of formal verification: we estimate that we have at least halved it.

The grand goal remains the elimination of this gap. I’m confident we’ll get there in a few years. And imagine the implications: verified, highest-assurance software at the cost of traditional, no-assurance software. We really may be able to make software bugs a thing of the past!

Cogent is open source.

Thanks to Toby Murray for feedback on this blog.