Green computing through formal verification?
This may seem like a long bow to draw, and I admit there’s a bit of marketing in the title, but there’s substance behind it: Verification can save energy in data centres, by significantly speeding up transaction processing.
How does that work?
The key is the defensive way many systems are built to cope with failure. Specifically, database systems, which are behind e-commerce and many other web sites, are designed to protect against data loss in the case of system crashes. This is important: if you do a transaction over the web, you want to be sure that you aren’t charged twice, and that if they do charge you, they will actually deliver, etc. That wouldn’t be ensured (even if the vendor is trustworthy) if the server crashed in the middle of the transaction.
Databases avoid losses by ensuring that transactions (which is now a technical term) have the so-called ACID properties of atomicity, consistency, isolation and durability. The ones relevant to this discussion are atomicity (meaning that a transaction is always either completed in totality or not at all) and durability (meaning that once a transaction is completed successfully, its results cannot be lost). Databases achieve atomicity by operating (at least logically) on a temporary copy of the data and making all changes visible at once, without ever exposing intermediate state.
Durability is generally ensured by write-ahead logging: The database appends the intended effect of the transaction to a log, which is kept in persistent storage (disk), before letting the transaction complete and making its results visible to other transactions. If the system (the database itself or the underlying operating system) crashes halfway through, the database goes through a recovery process at restart: it replays from the log the operations of every transaction the log records as complete, and discards the log entries of incomplete transactions. This ensures that only the effects of complete transactions become visible (atomicity).
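The idea can be sketched in a few lines of Python (a toy illustration, not real database code; the log format and names are made up): a transaction’s operations are appended to the log followed by a commit marker, and recovery replays only transactions whose marker made it to disk.

```python
import os

LOG_PATH = "wal_demo.log"  # hypothetical log file for this sketch

def log_transaction(txn_id, operations):
    # Append all of a transaction's operations, then a COMMIT marker
    # recording that the transaction completed.
    with open(LOG_PATH, "a") as log:
        for op in operations:
            log.write(f"{txn_id} {op}\n")
        log.write(f"{txn_id} COMMIT\n")

def recover():
    # Replay only transactions whose COMMIT marker is in the log;
    # operations of uncommitted transactions are discarded (atomicity).
    if not os.path.exists(LOG_PATH):
        return []
    committed, ops_by_txn = set(), {}
    with open(LOG_PATH) as log:
        for line in log:
            txn_id, op = line.rstrip("\n").split(" ", 1)
            if op == "COMMIT":
                committed.add(txn_id)
            else:
                ops_by_txn.setdefault(txn_id, []).append(op)
    return [op for txn_id, ops in ops_by_txn.items()
            if txn_id in committed for op in ops]
```

If the system crashes after a transaction’s operations are logged but before its COMMIT marker, `recover()` simply never replays them, which is exactly the atomicity guarantee described above.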
Durability is only ensured if it can be guaranteed that the complete log entry of any completed transaction can be recovered. This requires that the database only continues transaction processing once it is certain that the complete transaction log is actually safely recorded on the disk.
This certainty does not normally exist for write operations to disk. Because disks are very slow (compared to RAM), the operating system does its best to hide the latency of writes. It caches writes to files in buffers in RAM (the so-called block cache), and performs the physical writes asynchronously, that is, it overlaps them with continuing execution. This is fine, except if you want to ensure durability of your transactions!
Databases therefore perform log writes synchronously. They use an OS system call for syncing files (on Unix/Linux systems it’s “fsync()”, which operates on a single file). This call simply blocks the caller until any pending writes on the respective file have finished. The database then knows that the data are on disk, and can proceed with durability maintained.
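In Python this pattern looks as follows (a minimal sketch; the function name is made up, but `os.fsync()` is the standard wrapper around the Unix call):

```python
import os

def durable_append(path, record):
    # Append a log record and only return once it is physically on disk.
    with open(path, "a") as f:
        f.write(record + "\n")
        f.flush()             # push Python's user-space buffer to the OS
        os.fsync(f.fileno())  # block until the OS flushes its block cache
```

Note that both steps are needed: `flush()` only hands the data to the operating system’s block cache, while `fsync()` forces it out to the disk itself.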
What’s the problem with this?
There’s a significant cost, in terms of latency. Disk writes are slow, and the database has to wait until completion. In some cases, it can spend the majority of the time waiting and twiddling thumbs.
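To get a feel for that cost, here is a small micro-benchmark sketch (file names and record counts are arbitrary) that appends records with and without an fsync() after each one:

```python
import os
import time

def timed_appends(path, n, sync):
    # Append n small records; if sync is True, fsync after each one,
    # as a database must for its transaction log.
    start = time.perf_counter()
    with open(path, "a") as f:
        for i in range(n):
            f.write(f"record {i}\n")
            f.flush()
            if sync:
                os.fsync(f.fileno())
    return time.perf_counter() - start
```

On a spinning disk, each fsync() costs on the order of milliseconds (seek plus rotational latency), so the synchronous variant can easily be orders of magnitude slower; the exact gap depends on the storage hardware.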
What does formal verification have to do with this?
If the underlying operating system is guaranteed not to crash, then we don’t need to block. The database can continue processing, and the OS does the write asynchronously.
This is essentially the thought I had while sitting in a database seminar a few years ago. I was thinking “you guys go through such a lot of trouble to ensure you don’t lose data. We could do that so much easier!”
In fact, that’s what we’re doing now, in a system we call RapiLog. We’re using the stability guarantees of our formally verified seL4 microkernel to get the best of both worlds: the performance advantage of asynchronous I/O, and the durability guarantees of synchronous writes. And does it ever work! We’re observing a doubling of database throughput under the right circumstances!
And increased throughput means that you can process more requests with the same hardware. This means that your data centre can be smaller, which not only reduces capital cost, but also running cost, most of which is the cost of electricity for powering the computers and for air conditioning. That’s the link to green computing.
And the cool thing is that we can do this with unmodified legacy systems: We run an unmodified database binary (including commercial databases where we have no source) on an unmodified Linux system. We just put it into a virtual machine and let an seL4-based “virtual disk” do all the magic.
How about power blackouts?
Good question! What happens if mains power suddenly goes down?
For the traditional database setup that isn’t a problem. Either the transaction is recorded in a log entry that is safely on disk when power goes, in which case the transaction is durable and will be recovered at restart time. Or there is no (complete) log entry, in which case the transaction was never completed, and is discarded at restart. But with seL4?
Turns out we can deal with that too. Computer power supplies have capacitors which store some electrical energy. When mains power is cut, that energy can run the computer for a bit. Not a long time, only some 100 milliseconds. But that is plenty of time to flush out any remaining log data to the disk. It turns out that all we need is about 20 milliseconds, which is very comfortably inside the window given by the capacitors. All we need is an immediate warning (in the form of an interrupt raised by the hardware).
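The post doesn’t give RapiLog’s internals, so here is a minimal sketch of the idea (class and method names are hypothetical, not the actual RapiLog code): a buffering layer that is assumed crash-free acknowledges log records immediately and writes them out in the background, and on a power-failure interrupt it drains whatever is still buffered.

```python
import os
import queue
import threading

class AsyncDurableLog:
    # Sketch only: because this layer is assumed never to crash, a
    # record counts as durable as soon as it is buffered; the physical
    # write happens asynchronously in a background thread.

    def __init__(self, path):
        self.path = path
        self.pending = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def append(self, record):
        # Returns immediately: the database never waits for the disk.
        self.pending.put(record)

    def _writer(self):
        with open(self.path, "a") as f:
            while True:
                record = self.pending.get()
                f.write(record + "\n")
                f.flush()
                os.fsync(f.fileno())
                self.pending.task_done()

    def on_power_warning(self):
        # Power-failure interrupt: block until every buffered record
        # has been flushed to disk.
        self.pending.join()
```

The guarantee rests entirely on the buffering layer not crashing, which is what the formal verification of seL4 underwrites; on an ordinary OS, buffered-but-unwritten records would be lost in a crash.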
Implications for system complexity
Maybe even more important than the performance improvement is the potential for reducing system complexity. Complex systems are expensive (in terms of development and maintenance). They are also error-prone, meaning they are more likely to fail. Reducing complexity is always a good thing. And we have shown that some of the complexity (in this case in database systems) can be avoided if we can provide guarantees about the reliability of the software. Which is what our Trustworthy Systems work is all about.