
Can seL4 reduce the cost of satellites?

2019/11/19

Space satellites are expensive. Part of that is the launch cost, but launch costs are dropping dramatically, to tens of k$/kg. This, combined with the growing demand for micro- and nano-satellites, increases the sensitivity to other cost factors.

One of the big cost factors is the computer system that controls the satellite. This may seem surprising, given that modern processor chips cost a few dollars. However, space is a very hostile environment, not only for people, but also for electronics. High-energy ionising radiation hitting a CMOS chip creates clouds of charge that can flip bits in memory, caches or CPU registers, so-called single-event upsets (SEUs). Satellites typically use radiation-hardened processors that are not susceptible to SEUs. The drawback of these processors is that they are very expensive, typically 100s of k$ for a single one! And they run at low clock rates (a few hundred MHz) and are based on old, power-hungry silicon technology.

What does this have to do with seL4?

What can an operating system (OS) do to address hardware cost? More than one might think at first. The core of the problem is that software relies on hardware operating correctly, i.e. according to a specification, namely the instruction-set architecture (ISA). SEUs result in hardware behaviour deviating from spec (unpredictable bit flips).

Traditional fault-tolerant systems


Traditional triple-modular redundancy (TMR) fault-tolerance.

The traditional way of protecting against such random faults is redundancy: the system is replicated, and the replicas check each other to detect (and ideally correct) faults. Such redundant hardware configurations have been used for decades in critical systems (e.g. airplanes or nuclear power plants). They typically use 2- or 3-way redundant hardware, called dual- or triple-modular redundancy (DMR or TMR, respectively). DMR tends to be cheaper but can only detect divergence (and then pull the plug), while TMR can use a majority vote to detect the faulty replica and correct by either turning the faulty one off (i.e. downgrading to DMR) or restarting it.

Such redundant configurations are expensive not only because of the replicated hardware, but also because they require special voting hardware that checks for divergence among the replicas. Moreover, the replicas typically have to operate in lock-step, which introduces high overheads.
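To make the majority vote concrete, here is a minimal sketch of TMR voting logic in C. It is purely illustrative: real voters are typically implemented in hardware, and none of these names come from an actual system. Given the outputs of three replicas, it returns the majority value and identifies the dissenting replica (if any), which can then be turned off (downgrading to DMR) or restarted.

```c
#include <stdint.h>

typedef struct {
    uint64_t value;   /* the agreed-upon output                      */
    int      faulty;  /* index of the dissenting replica, or -1      */
    int      ok;      /* 0 if all three replicas disagree            */
} vote_result_t;

static vote_result_t tmr_vote(uint64_t a, uint64_t b, uint64_t c)
{
    vote_result_t r = { .value = a, .faulty = -1, .ok = 1 };

    if (a == b && b == c) {                  /* unanimous: no fault  */
        return r;
    }
    if (a == b) { r.faulty = 2; return r; }  /* replica 2 is odd one out */
    if (a == c) { r.faulty = 1; return r; }  /* replica 1 is odd one out */
    if (b == c) {                            /* replica 0 is faulty  */
        r.value = b; r.faulty = 0; return r;
    }

    r.ok = 0;                                /* all differ: uncorrectable */
    return r;
}
```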

Enter cheap multicores

The abundance of cheap multicore processors, driven by the mobile-phone market, is a game-changer here. We now have redundant (at least in terms of processor cores) hardware that is cheap, high-performance and very energy-efficient. Can this be leveraged for making systems fault tolerant?


Multicore-based TMR configuration with non-replicated OS.

A number of recent proposals have explored this, typically along the lines of the diagram above: a shared operating system (or hypervisor) presents the multicore system as a logical single core, transparently replicates application software, and does the voting. This approach can effectively protect against faults in the application software. The problem is that the non-replicated OS is still vulnerable to SEUs – any bit-flip in OS data or instruction memory can cause the system to fail. Rather than providing real protection, this merely reduces the vulnerability to a relatively small memory region (assuming a slim OS kernel or hypervisor).

The advantage is that this is much cheaper (and more energy-efficient) than either traditional hardware replication or using rad-hardened processors.

Redundant co-execution with seL4


RCoE even replicates the OS kernel, and thus all software (except the minimal shim to non-replicated devices).

We have recently pioneered a new approach that combines the benefits of the above, which we call redundant co-execution (RCoE), and have implemented it in seL4. The idea is shown in the diagram above (for simplicity as a DMR configuration, but we can do an arbitrary number of replicas, including TMR): we also replicate the OS kernel, which is now aware of running in a replicated configuration. It compares (votes) the inputs and outputs of all kernel replicas whenever the kernel is entered or exited, as well as at configurable points within the kernel. This extends the sphere of replication to practically the whole system, minimising vulnerabilities.
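The following is a highly simplified sketch of what such a voting point might look like; all names are hypothetical, and this is not the actual seL4 implementation. Each kernel replica summarises its inputs (or outputs) into a checksum, publishes it, waits for the other replicas at a barrier, and then compares all checksums.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_REPLICAS 3          /* TMR; 2 would give DMR                    */

/* Assumed to be provided elsewhere in this sketch. */
extern void barrier(void);      /* wait until all replicas reach this point */
extern int  my_replica_id(void);

static volatile uint64_t checksum[NUM_REPLICAS];

/* Called by every replica on kernel entry/exit and at configurable
 * points within the kernel.  Returns true if all replicas agree. */
static bool rcoe_vote(uint64_t my_checksum)
{
    checksum[my_replica_id()] = my_checksum;
    barrier();                  /* wait until every replica has voted       */

    bool agree = true;
    for (int i = 1; i < NUM_REPLICAS; i++) {
        if (checksum[i] != checksum[0]) {
            agree = false;      /* divergence: trigger error handling       */
        }
    }

    barrier();                  /* nobody starts the next vote before all
                                   replicas have finished comparing         */
    return agree;
}
```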

Applications get transparently replicated with RCoE (just as with the simplified model above). As seL4 is an extreme case of a minimal microkernel, where all traditional OS services, including device drivers, run as usermode programs, these services get replicated automatically as well.

The one exception is peripherals that are inherently non-replicated, such as network interfaces. Such a peripheral has a single set of hardware device registers, which the device driver uses to control I/O through the device. This represents a residual vulnerability if bit-flips happen while accessing those registers. We keep the exposure to an absolute minimum: when reading from a device register, a low-level shim immediately copies the data into replicated buffers and then hands control to the (replicated) drivers, which each operate on their own buffer; any divergence can be detected at that point. Similarly, when writing to a device register, the shim compares the replicated output buffers and then immediately copies the agreed value to the device. The few instructions required for this are the only part where a bit-flip could happen undetected.
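The sketch below illustrates the idea of such a shim, using hypothetical names rather than the actual implementation: a single unprotected access per register, with everything else operating on per-replica buffers that can be compared for divergence.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_REPLICAS 3

/* Device-register read: copy the value into one buffer per replica as
 * soon as it is read, so the replicated drivers never touch the device
 * registers directly. */
static void shim_read_reg(volatile uint32_t *reg, uint32_t buf[NUM_REPLICAS])
{
    uint32_t v = *reg;                   /* single, unprotected access      */
    for (int i = 0; i < NUM_REPLICAS; i++) {
        buf[i] = v;                      /* each replica gets its own copy  */
    }
}

/* Device-register write: compare the replicated output buffers first;
 * only if they agree is the value written to the real register. */
static bool shim_write_reg(volatile uint32_t *reg, const uint32_t buf[NUM_REPLICAS])
{
    for (int i = 1; i < NUM_REPLICAS; i++) {
        if (buf[i] != buf[0]) {
            return false;                /* divergence detected             */
        }
    }
    *reg = buf[0];                       /* single, unprotected access      */
    return true;
}
```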

Virtual machines

While seL4 is a microkernel, it also works well as a hypervisor. So, while the RCoE approach transparently replicates applications, these do not have to be native seL4 apps; they can be virtual machines (VMs). Such a VM can run a complete Linux system as a guest, in which case the Linux system is also transparently replicated. Obviously, in such a configuration, replica voting is very coarse-grained, as the seL4 kernel is rarely invoked during VM execution.

A typical configuration has the more critical code running natively on seL4 (with frequent voting), while less critical code can run inside the Linux VM.

Two variants

We implemented RCoE on Intel x86 as well as Arm processors (although we presently support virtual machines only on x86). We have also designed and implemented two different versions of RCoE: a closely-coupled one, where the replicas are synchronised to the exact instruction for voting (while executing asynchronously between votes), and a loosely-coupled one, where replicas of user-mode components remain unsynchronised when voting. These represent different trade-offs between performance, vulnerability and restrictions on application characteristics.

Does it work?

It works pretty well. We ran a large number of experiments in which we introduced random faults in physical memory. We injected enough faults to observe 1,000 errors (which took hundreds of thousands of fault injections). Our default configuration (where voting happens on kernel entry and exit, as well as on device communication) handled all but a handful gracefully, i.e. it detected an error before the system could produce erroneous outputs. We could also clearly see a performance-vs-dependability trade-off at work: making the voting more frequent increases dependability at the cost of reduced performance. With enough effort (i.e. checking), undetected failures can be made close to negligible.
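For illustration, fault injection of this kind boils down to flipping a single random bit in a chosen memory region to emulate an SEU, along the lines of the following sketch (not the actual experimental harness used in the paper), and then observing whether the error is detected before erroneous output is produced.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Flip one randomly chosen bit in the given memory region. */
static void inject_bit_flip(uint8_t *region, size_t size)
{
    size_t byte = (size_t)rand() % size;      /* random byte in the region */
    int    bit  = rand() % 8;                 /* random bit in that byte   */
    region[byte] ^= (uint8_t)(1u << bit);     /* flip it                   */
}
```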

We could also show that the TMR configuration can effectively downgrade to DMR and continue operating with the faulty replica taken off-line. The cost of this transition is only of the order of a millisecond.

At what cost?

Performance overhead of the default configuration (which eliminates almost all uncontrolled failures) is of the order of 30–100%, compared to unprotected single-core operation. In other words, a redundant system takes 1.3–2 times as long as the unprotected single-core system to execute a workload. In terms of energy, this means that the TMR configuration consumes 3 × (1.3–2), i.e. 4–6 times, the energy of unprotected execution.

To put this in context: the rad-hardened RAD750 processor (costing 100s of k$) delivers 240 DMIPS at 133 MHz, or at most 40 DMIPS/W. Our SabreLite board delivers 2,000 DMIPS per core at 800 MHz. Using three cores (for TMR) with a 2-times performance overhead, it still delivers five times the DMIPS/W of the RAD750. In other words, being able to use a cheap, low-power multicore processor comes out far in front of the rad-hardened processor, despite the overhead.

And in terms of cost, it ends up >10,000 times ahead. So there really is an opportunity to reduce cost of satellites with seL4!

Find out all the details

Full details of this work are in a paper published last June at DSN, the top publication venue for dependability and fault tolerance. It contains the complete technical details and the full evaluation. If you want even more gory detail, it’s all in Yanyan Shen’s PhD thesis.

Can I try it out?

We’ll make the software available for download from the project page soon.

But keep in mind, this is a research prototype so far, and would need some more engineering for real deployment. Ideally the kernel with redundancy support should be verified, to achieve a degree of dependability approaching that of (expensive and slow) reliable hardware.
