Shielding applications from an untrusted cloud with Haven Baumann et al. OSDI 2014
Our objective is to run existing server applications in the cloud with a level of trust and security roughly equivalent to a user operating their own hardware in a locked cage at a colocation facility.
We’re all familiar with the idea of a sandbox execution environment designed to protect a host from untrusted code that executes on it. Baumann et al. introduce the concept of shielded execution which does the opposite: it protects the confidentially and integrity of code and data from an untrusted host. Their safe Haven implements this idea for Windows applications using Intel software guard extensions (SGX).
SGX allows a process to instantiate a secure region of address space known as an enclave; it then protects execution of code within the enclave, even from malicious privileged code or hardware attacks such as memory probes. While SGX was designed to enable new trustworthy applications to protect specific secrets by placing portions of their code and data inside enclaves, Haven aims to shield entire unmodified legacy applications written without any knowledge of SGX.
SGX is the first commodity hardware that permits efficient multiplexing among multiple isolated programs without relying on trusted software. Shielded execution builds on top of the SGX isolation mechanism to provide two higher-level security properties: confidentiality and integrity. Only the inputs and outputs of a shielded program are visible, no intermediate states are observable. Regarding integrity, the system can deny service, but otherwise if the program completes its output is the same as a correct execution on a reference platform. The threat model assumes that the processor itself is implemented correctly and not compromised, but beyond this the adversary can have full control of hardware and network, and may also control the cloud provider’s entire software stack.
Building blocks: SGX and Drawbridge
SGX protects the confidentiality and integrity of pages in an enclave, a region of a user-mode address space. While cache-resident, enclave data is protected by CPU access controls (the translation look-aside buffer, TLB). However, it is encrypted and integrity protected when written to memory, and if the data in memory is modified, a subsequent load will signal a fault.
Enclaves are created using ECREATE, and pages of memory can be added to an enclave with EADD. All such pages of memory much occupy a specify region of physical memory known as the enclave page cache (EPC).
On each enclave page access, after walking the page table, SGX ensures that the processor is in enclave mode, the page belongs to the EPC and is correctly typed, the current enclave maps the page at the accessed virtual address, and the access agrees with the page permissions.
A process known as attestation enables a remote system to verify what has been loaded within an enclave, and that the enclave is running on a genuine SGX implementation.
During enclave creation, a secure hash known as a measurement is established of the enclave’s initial state. The enclave may later retrieve a report signed by the processor that proves its identity to, and communicates a unique value (such as a public key) with, another local enclave. Using a trusted quoting enclave, this mechanism can be leveraged to obtain an attestation known as a quote which proves to a remote system that the report comes from an enclave running on a genuine SGX implementation. Ultimately, the processor manufacturer (e.g., Intel) is the root of trust for attestation.
Drawbridge provides for low-overhead sandboxing of Windows applications via two core mechanisms: the picoprocess and a library OS.
The picoprocess is a secure isolation container constructed from a hardware address space, but with no access to traditional OS services or system calls [15]; instead, a narrow ABI of OS primitives is provided, implemented using a security monitor. The ABI consists of 40 downcalls and three upcalls.
Haven uses the picoprocess mechanism as a traditional sandbox. I.e., to protect the cloud provider’s host from a potentially malicious guest. Drawbridge LibOS is a version of Windows 8 refactored to run as a set of libraries within picoprocess.
LibOS helps Haven to defend against ‘Iago’ attacks. An Iago attack is when a malicious OS attempts to subvert an isolated application by exploiting its assumption of correct OS behaviour. For example, an attacker could change the results of system calls. To narrow down the surface area for Iago attacks, Haven includes LibOS within the enclave. LibOS implements the full OS API using a much narrower set of core primitives. Established techniques such as careful defensive coding, exhaustive validation of untrusted inputs, and encryption and integrity protection of any private data exposed to untrusted code are then used around those core primitives.
Haven’s architecture
SGX, the picoprocess container, and LibOS are combined in Haven. An enclave is created inside a Drawbridge picocontainer process, and this enclave in turn contains the entire application as well as LibOS. Haven adds two additional layers: a shield module inside the enclave, below LibOS, and an untrusted runtime outside of the enclave.
In Haven, an application performs most of its execution in the enclave, calling out to the untrusted host only for system services. This is the opposite of the typical SGX usage model of untrusted code calling into an enclave.
It is the shield module that implements the defences against Iago attacks, as described previously. “At its most basic, this validation consists of ensuring that the parameters of upcalls and results of downcalls are consistent with their specification.” The shield also implements a form of user-level scheduling to prevent a malicious host from e.g., allowing two threads to concurrently acquire a mutex. The shied handles calls for entropy generation too. Process creation is not supported. (“One advantage of the Windows OS is that surprisingly few applications use child processes.”).
The untrusted runtime outside of the enclave is a Drawbridge ABI subset primarily used as bootstrap and glue code. It is trusted by neither the enclave nor the host kernel. It is used for creating the enclave, loading the shield, and forwarding calls between the enclave and the host OS. The interface is constructed such that the guest controls policy for virtual resources, while the host manages policy only for physical resources (e.g. memory and CPU time).
In general, this prevents operations that give any implementation freedom to the host beyond physical resource allocation, makes verification efficient, and limits the scope of attacks.
SGX provides confidentiality and integrity protection fo data in memory, but Haven also needs to provide for secure persistent storage. Encrypting file contents risks leaking file metadata, so the prototype implementation uses a filesystem inside an encrypted virtual hard disk (VHD) image.
Application deployment and attestation
There are many more details in the paper, including limitations in the original versions of the SGX specification that were removed in a subsequent version based on the author’s feedback. These aren’t necessary to understand the big idea though. The deployment process for applications, including attestation as part of it, is worth a closer look.
- The shield binary (the shield is not encrypted) is sent to the cloud provider.
- The cloud provider creates a picoprocess and loads the untrusted runtime which in turn creates an enclave and loads the shield module.
- The SGX hardware mechanism is used to measure (compute a secure hash of) the code and initial state of the shield.
- The shield generates a public/private key pair, and establishes a network connection with the user (a machine physically under the control of the user, or another enclave in the cloud).
- The shield uses the SGX attestation mechanism to produce a quote containing its public key, which it sends to the user. This proves that it has been correctly loaded and is running in an SGX enclave.
- If the enclave’s measurement is as expected the user sends encrypts a VHD containing the application and LibOS using the shield’s public key and sends it to the shield.
- The shield decrypts the VHD using its private key, which allows it to continue to load the LibOS and application.
From this point onward, communication with the outside world, and therefore access to any secrets contained in the VHD, is under application control. Typical server applications supporting SSL-encrypted connections may be configured using certificates and keys stored directly in the VHD, and accessed over the cloud provider’s untrusted network. For future work, we are planning to add support for encrypted virtual private networks between a user’s enclaves (or trusted hosts), providing a secure network to applications that require one.
Performance overheads
Haven was developed and tested using a functional emulator for SGX provided by Intel. The authors developed a performance model to allow them to estimate Haven’s overheads. The benchmark applications were SQL Server running the TCP-E benchmark, and Apache HTTP server. Figure 4 below shows how Haven performs for these two different workloads.
For SQL Server, the extra runtime layers and TLB flush on enclave crossings give Haven a 13% slowdown vs. Drawbridge. Furthermore, Haven’s unoptimised FAT filesystem is a bottleneck for the I/O-intensive SQL workload. Besides a further 25% slowdown (with encryption), it shows significant drops in throughput when limited I/O bandwidth causes the server to periodically delay transaction processing to allow checkpoint writes to complete. Drawbridge and Haven exhibit relatively poor networking performance with Apache, because all socket operations traverse a security monitor in a separate process. Moreover, Haven performs substantially (40%) worse than Drawbridge with the host filesystem, because of many small file operations that flush the TLB. The private filesystem avoids this and even outperforms Drawbridge, since the workload is read-intensive and served almost entirely from the buffer cache inside the enclave
Overall, Haven’s performance penalty vs a VM is 31-54%. “We suspect that significant classes of users will readily accept such overheads, in return for not needing to trust the cloud.”
Haven brings us one step closer to a true “utility computing” model for the cloud, where the utility provides resources (processor cores, storage, and networking) but has no access to user data.
A final note on SGX
For an in-depth look at SGX, see Intel SGX Explained by Victor Costan and Srinivas Devadas at MIT CSAIL. Questions have been raised around the attestation implementation, as reported in The Register.