Oblix: an efficient oblivious search index

July 6, 2018

Oblix: an efficient oblivious search index Mishra et al., IEEE Security & Privacy 2018

Unfortunately, many known schemes that enable search queries on encrypted data achieve efficiency at the expense of security, as they reveal access patterns to the encrypted data. In this paper we present Oblix, a search index for encrypted data that is oblivious (provably hides access patterns), is dynamic (supports inserts and deletes), and has good efficiency.

There’s a lot to this paper! Starting with a recap of existing work on Path ORAM (Oblivious RAM) and Oblivious Data Structures (ODS), Mishra et al. introduce an extension for an Oblivious Sorted Multimap (OSM) (such that you can look up a key and find a set of associated values, handy for building indexes!). Then because their design runs a client-proxy process inside an enclave at the server, and enclaves still leak some information, they also design “doubly-oblivious” versions of all of the above that hide the client access patterns in addition to those at the server process. It’s all topped off with an implementation in Rust (nice to see Rust being used for systems research), and an evaluation with three major use cases: private contact discovery in Signal, private public key lookup based on Google’s Key Transparency, and a search index built over the entire Enron email dataset.

We show that Oblix can scale to databases of tens of millions of records while providing practical performance. For example, retrieving the top 10 values mapped to a key takes only 12.7ms for a database containing ~ 16M records.

It’s a good reminder that some of the things we often accept as ‘obviously true,’ such as that an app which wants to tell you what’s nearby must track your location, aren’t necessarily so.

It’s impossible to cover it all in my target post length, but I’ll do my best to give you a feel for the work.

Motivation and high-level design

A server holds an encrypted database, and a client wants to retrieve matching records by key from that database, without leaking information about their access patterns. For example, in the Signal messaging system a user can query to determine which of their phone contacts also use Signal, and this needs to be done in such a way that Signal never learns the user’s contact list. Beyond access patterns, it is also important to hide the result size.

In Path ORAM the client is responsible for maintaining a position map (index) and a stash of temporary values. When the index gets too big for the client to store (e.g., for all of Signal’s contacts) it is possible to add one or more intermediate ORAM servers. With one intermediary, for example, the client maintains a smaller index of positions in the intermediary, which in turn knows the true positions in the server. Each index lookup therefore requires a logarithmic (in index size) number of requests.

Using hardware enclaves (e.g. SGX), Oblix resolves this multiple roundtrip problem by placing ORAM clients in enclaves on the same server as the ORAM server. On its own, just using an enclave isn’t good enough for the privacy standards we’re aiming at here:

Recent attacks have shown that hardware enclaves like Intel SGX leak access patterns in several ways… For Path ORAM, this means that if the attacker sees accesses to the client’s position map or stash, it can infer access patterns to the ORAM server, defeating the purpose of using ORAM.

To get around these leaks, Oblix uses ORAM techniques inside the client proxy process as well, resulting in doubly oblivious schemes. Hiding result set sizes requires, in the basic case, that every returned result set is padded to the size of the maximum possible result set. That clearly introduces a lot of overhead! Oblix is designed to work with result pages (e.g., top 20 results), which caps the amount of padding at the page size. This in turn means that the scoring information must also be contained within the index data structure, for which we need an efficient oblivious structure for ordered list values. That’s where the doubly-oblivious sorted multimap (DOSM) comes in.

(Singly) Oblivious Data Structures

Path ORAM essentials

Path ORAM is an ORAM protocol allowing a client to perform oblivious reads and writes to external server memory with low bandwidth and latency. The server maintains a binary tree containing C buckets of B bits each. The client maintains a position map from block id to leaf node, such that the associated block can be found on a path from the root node to that leaf. The client also maintains a stash – think of it like a small cache of blocks – that the server populates and the client may modify. For example, a read request from the client contains a block id, leaf id, and the stash. ORAM.ReadBlock will fetch all blocks on the path to the given leaf id and put them in the stash, outputting the block in the path with the given id. The block is also assigned a new random leaf in the position map for future accesses.
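Roughly, in TypeScript – all names and types below are illustrative only, and eviction, encryption, and dummy accesses are omitted:

```typescript
type BlockId = number;
type LeafId = number;
interface Block { id: BlockId; data: Uint8Array; }

class PathOramClient {
  // position map: block id -> leaf whose root-to-leaf path contains the block
  private position = new Map<BlockId, LeafId>();
  // stash: a small client-side holding area for blocks
  private stash: Block[] = [];

  constructor(
    private numLeaves: number,
    // stand-in for the server round trip returning all blocks on a path
    private fetchPath: (leaf: LeafId) => Block[],
  ) {}

  readBlock(id: BlockId): Block | undefined {
    const leaf = this.position.get(id);
    if (leaf === undefined) return undefined;
    // pull every block on the root-to-leaf path into the stash
    this.stash.push(...this.fetchPath(leaf));
    const block = this.stash.find(b => b.id === id);
    // remap the block to a fresh random leaf for its next access
    this.position.set(id, Math.floor(Math.random() * this.numLeaves));
    return block; // eviction of stash blocks back into the tree omitted
  }
}
```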

The Oblivious Data Structures (ODS) framework

The Oblivious Data Structures framework lets you design oblivious data structures that can be expressed as trees of bounded degree. So long as you can express the desired functionality via such an abstraction you’re good to go. Oblivious data structures work with an underlying ORAM scheme, such as Path ORAM. There are three rules:

  1. Every node must have a unique predecessor (a node pointing to it) [Excepting the root node I presume!]
  2. Every operation accesses a unique root node before any other node
  3. Every operation accesses a non-root node’s predecessor before it accesses the node.

In short: there’s a tree of nodes, and every operation walks the tree starting at the root.

Oblivious sorted multimaps

Our oblivious sorted multimap (OSM) enables a client to outsource a certain type of key-value map to an untrusted server so that the map remains encrypted and, yet, the client can still perform search, insert, and delete operations with small cost (in latency and bandwidth) without revealing which key-value pairs of the map were accessed. This notion extends previous notions such as oblivious maps.

An OSM scheme contains init, find, insert, delete, and server protocol algorithms. At a high level it supports a mapping from a key to a possibly empty list of sorted and distinct values. The authors realise their scheme using the ODS framework, which involves first constructing a plaintext data structure with tree-like memory accesses, and then replacing those accesses with the oblivious counterparts defined by the ODS framework.

The solution to this challenge combines ideas from AVL search trees and order statistic trees. In an order statistic binary search tree, each node also stores the number of nodes in each (left and right) child subtree, and this is used to find the node with the i-th smallest key. The OSM scheme modifies this idea such that a node with key k also stores the number of nodes in each of its child subtrees that also have key k. Using this we can efficiently find a key’s i-th value, and also therefore any sublist of values.
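A plaintext sketch of that lookup (names are illustrative; the real structure is an AVL-style tree whose accesses then go through the ODS framework):

```typescript
// Tree ordered by (key, value) pairs; values for one key are sorted and
// distinct. Each node counts how many same-key nodes sit in each subtree.
interface OsmNode {
  key: string;
  value: string;
  leftSameKey: number;   // nodes with this key in the left subtree
  rightSameKey: number;  // nodes with this key in the right subtree
  left?: OsmNode;
  right?: OsmNode;
}

// Find the i-th (1-based) value mapped to `key` without visiting
// every match.
function findIth(node: OsmNode | undefined, key: string, i: number): string | undefined {
  if (!node) return undefined;
  if (key < node.key) return findIth(node.left, key, i);
  if (key > node.key) return findIth(node.right, key, i);
  // same key: the left subtree holds the leftSameKey smallest values
  if (i <= node.leftSameKey) return findIth(node.left, key, i);
  if (i === node.leftSameKey + 1) return node.value;
  return findIth(node.right, key, i - node.leftSameKey - 1);
}
```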

We remark that our OSM construction above (coupled with padding as discussed below) already provides a search index that does not leak access patterns without relying on hardware enclaves: the client stores the OSM state locally and interacts with the remote server over the network for each OSM operation. Leveraging hardware enclaves will enable better performance and support for multiple users.

Doubly Oblivious Data Structures

When the ORAM client runs at a server, its memory accesses are visible to the server (even when inside an enclave), and that leaks information. The authors introduce doubly oblivious RAM (DORAM) client algorithms for Path ORAM, ODS, and OSM schemes. We pay a time-complexity penalty for this additional obliviousness, as summarised in the following table.


[Table: time complexities of the singly vs. doubly oblivious schemes – see the paper.]

Path DORAM

A simple but inefficient client scheme for Path ORAM would be to replace all suitable sub-routines (such as binary search) with linear scans. Splitting eviction into two steps (first assigning blocks, and then writing blocks) reduces the naive time by a factor of C (number of buckets in the tree). Details are in section V.A.
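To give the flavour of those linear scans, here is a sketch of an access-pattern-independent lookup. (Real doubly-oblivious code must also avoid the data-dependent branch hidden in the ternary, e.g. via constant-time conditional moves; this only illustrates the touch-every-entry access pattern.)

```typescript
// Read positions[target] while touching every entry, so the sequence of
// memory locations accessed is independent of `target`.
function obliviousLookup(positions: Int32Array, target: number): number {
  let result = 0;
  for (let i = 0; i < positions.length; i++) {
    const select = i === target ? 1 : 0;
    // branchless-style select: same accesses regardless of target
    result = select * positions[i] + (1 - select) * result;
  }
  return result;
}
```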

Doubly-oblivious data structures (DODS)

Replacing the underlying ORAM scheme for ODS with a doubly-oblivious one is not sufficient to make a doubly-oblivious ODS (DODS) since an adversary can still learn some information by observing whether a returned node is fetched from the cache or from external memory. Always doing a (possibly) dummy external read solves this issue at the expense of performance. DODS mitigates this cost somewhat by only doing the additional read when an observer could not already know for certain whether or not a node is in the cache.
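A sketch of that fine-grained decision, with illustrative names (the DUMMY_ID sentinel and the observer-knowledge flag are assumptions for exposition, not the paper’s API):

```typescript
type NodeId = number;
interface OdsNode { id: NodeId; payload: Uint8Array; }
const DUMMY_ID: NodeId = -1; // assumed sentinel for a dummy ORAM access

// Fetch a node so that an observer watching external memory traffic cannot
// distinguish cache hits from misses: on a hit we still issue a dummy read,
// except in cases where the observer could already predict the hit.
function fetchNode(
  id: NodeId,
  cache: Map<NodeId, OdsNode>,
  oramRead: (id: NodeId) => OdsNode, // external (ORAM) read
  observerKnowsCached: boolean,      // can the adversary predict a hit?
): OdsNode {
  const cached = cache.get(id);
  if (cached !== undefined) {
    if (!observerKnowsCached) oramRead(DUMMY_ID); // dummy external read
    return cached;
  }
  return oramRead(id);
}
```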

Doubly-oblivious sorted multimaps

The doubly-oblivious sorted multimap (DOSM) can now be built on top of DODS. This on its own is not sufficient, since accesses to the OSM client’s internal state (outside of the ODS framework) can still leak information.

To eliminate such leakage, we identify data-dependent sub-procedures of our algorithms, and appropriately pad out the number of accesses made in these procedures to worst-case bounds that depend only on the number of key-value pairs in the map… Next we design our algorithms so that we can always predict whether or not a given dummy access needs to return a cached node. We can then take advantage of the fine-grained DODS interface to avoid unnecessary dummy operations.

Evaluation

It takes about 10K lines of Rust code to implement singly and doubly oblivious versions of Path ORAM, ODS, and OSM.

For the Signal contacts use case, Oblix can enable private contact discovery with latency O(m \log N) for m contacts and N overall users. Signal itself currently uses a form of linear scan over all N users, giving latency O(m^2 + N).

The Oblix approach really shines with incremental contact discovery where smaller values of m (e.g. 10 or less) are much more common.

Google’s Key Transparency (KT) today does not provide anonymity – the server knows the identity of the user whose key it returns. The authors use Oblix to anonymise KT, with look-up latency O(d \log N) (where d is the depth of the Merkle tree). “We store all Merkle tree nodes in an oblivious index in which keys are node identifiers and each key is mapped to a hash.”

The final case study creates an oblivious searchable encryption system over the entire Enron email dataset (~528K emails). On average it takes 20.1ms to find the ten highest-ranking results for a keyword.

EnclaveDB: a secure database using SGX

July 5, 2018

EnclaveDB: A secure database using SGX Priebe et al., IEEE Security & Privacy 2018

This is a really interesting paper (if you’re into this kind of thing I guess!) bringing together the security properties of Intel’s SGX enclaves with the Hekaton SQL Server database engine. The result is a secure database environment with impressive runtime performance. (In the read-mostly TATP benchmarks, overheads are down around 15%, which is amazing for this level of encryption and security). The paper does a great job showing us all of the things that needed to be considered to make EnclaveDB work so well in this environment.

One of my favourite takeaways is that we don’t always have to think of performance and security as trade-offs:

In this paper, we show that the principles behind the design of a high performance database engine are aligned with security. Specifically, in-memory tables and indexes are ideal data structures for securely hosting and querying sensitive data in enclaves.

Motivation and threat model

We host databases in all sorts of untrusted environments, potentially with unknown database administrators, server administrators, OSes and hypervisors. How can we guarantee data security and integrity in such a world? Or how can we at least minimise the attack surface even when we do think we can trust some of these components?

Semantically secure encryption can provide strong and efficient protection for data at rest and in transit, but this is not sufficient because data processing systems decrypt sensitive data in memory during query processing.

With machines with over 1TB memory already commonplace, many OLTP workloads can fit entirely in memory. Good for performance, but bad for security if that’s the one place your data is unprotected.

EnclaveDB assumes an adversary can control everything in the environment bar the code inside enclaves. They can access and tamper with any server-side state in memory, on disk, or over the network. They can mount replay attacks by shutting down the database and attempting to recover from a stale state, and they can attempt to fork the database, sending requests from different clients to different instances (forking attacks).

Denial of service attacks are out of scope, as are side-channel attacks. “Side channels are a serious concern with trusted hardware, and building efficient side channel protection for high performance systems like EnclaveDB remains an open problem.”

High level design

The starting point for EnclaveDB is Hekaton, the in-memory database engine for SQL Server. Hekaton supports in-memory tables and indices made durable by writing to a persistent log shared with SQL Server. (It would be interesting to see what Hekaton could do with NVM…). Hekaton also supports an efficient form of stored procedures, in which SQL queries over in-memory tables are compiled to efficient machine code. This design turns out to bring several security benefits:

  • In-memory tables and indexes are ideal structures for secure hosting in an enclave
  • The overheads of software encryption and integrity checking for disk-based tables are eliminated
  • Query processing on in-memory data minimises leakage of sensitive information and the number of transitions between the enclave and the host
  • Pre-compiled queries reduce the attack surface available to an adversary

In EnclaveDB an untrusted database server hosts public data, and an enclave hosts sensitive data. The enclave combines a modified Hekaton query engine, natively compiled stored procedures, and a trusted kernel providing the runtime environment for the database engine, and security primitives such as attestation and sealing. Database administration tasks such as backups and troubleshooting are supported by the untrusted server. This is an important separation of concerns in cloud hosting environments where the administrators / operators are not part of the end-user trust model.

As always with enclave-based design, one of the key questions is what goes inside the enclave (larger TCB), and what goes outside (more traffic crossing between secure and insecure environments).

In EnclaveDB, we adopt the principle of least privilege – we introduce a thin layer called the trusted kernel that provides the Hekaton engine with the minimal set of services it requires. The trusted kernel implements some enclave-specific services such as management of enclave memory and enclave threads, and delegates other services such as storage to the host operating system with additional logic for ensuring confidentiality and integrity.

For traditional databases, the entire query processing pipeline is part of the attack surface. With EnclaveDB however, all queries are first compiled to native code and then packaged along with the query engine and the trusted kernel. I.e., they are sealed inside the enclave and can’t be tampered with. This does mean you can only run the set of queries built into the database, and currently implies that any change in schema requires taking the database offline and redeploying the package. Online schema changes via a trusted loader are deferred to future work.

Security considerations

Whereas existing systems require users to associate and manage encryption keys for each column containing sensitive data, EnclaveDB takes advantage of the encryption and integrity protection provided by the SGX memory encryption engine, which kicks in whenever data is evicted from the processor cache. Thus only a single database encryption key is needed. This key is provisioned to a trusted key management service (KMS) along with a policy specifying the enclave that the key can be provisioned to. When an EnclaveDB instance starts it remotely attests with the KMS and receives the key.

Clients connect to EnclaveDB by creating a secure channel with the enclave and establishing a shared session key. The enclave authenticates clients using embedded certificates. All traffic between clients and EnclaveDB is encrypted using the session key.

Once client requests have been validated, the stored procedure executes entirely within the enclave, on tables hosted in the enclave. Return values are encrypted before being written to buffers allocated by the host.

Care must be taken with error conditions, which may reveal sensitive information (think e.g. database integrity constraints such as uniqueness). EnclaveDB translates ‘secret dependent’ errors into a generic error, and then packages the actual error code and message into a single message that is encrypted and delivered to the client. Care must also be taken with profiling and statistics, some of which are sensitive because they reveal properties of sensitive data. EnclaveDB maintains all profiling information in enclave memory for this reason. There is an API to export it in encrypted form, from where it can be imported into a trusted client database for analysis.

This takes care of the in-memory (in-enclave) parts of EnclaveDB. But there still remains the important question of the logs, which are kept on-disk and outside of the enclave…

Logging and recovery

Since the host cannot be trusted, EnclaveDB must ensure that a malicious host cannot observe or tamper with the contents of the log.

Simply using an encrypted file system leads to high overheads in a database system where the logs can have highly concurrent and write-intensive workloads, with writes concentrated at the tail of the log. This plays badly with Merkle trees. So the authors introduce a new protocol for checking log integrity and freshness.

Encryption is based on an AEAD scheme (Authenticated Encryption with Associated Data), which encrypts data and authenticates it together with unencrypted associated data. (The implementation uses AES-GCM). A cryptographic hash is maintained for each log data and delta file, and these hashes are verified during recovery. A state-continuity protocol based on Ariadne is used to save and restore the system table within the root file while guaranteeing integrity, freshness, and liveness.
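The AEAD pattern looks roughly like this (EnclaveDB itself is native code inside an enclave; this Node-flavoured TypeScript sketch just illustrates binding metadata such as a log sequence number as associated data, so a record cannot be reordered or substituted without failing authentication):

```typescript
import { createCipheriv, randomBytes } from 'crypto';

// Seal one log record with AES-GCM. Field names are illustrative.
function sealLogRecord(key: Buffer, seqNo: bigint, payload: Buffer) {
  const iv = randomBytes(12);                  // unique nonce per record
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const aad = Buffer.alloc(8);
  aad.writeBigUInt64BE(seqNo);                 // authenticated, not encrypted
  cipher.setAAD(aad);
  const ciphertext = Buffer.concat([cipher.update(payload), cipher.final()]);
  return { seqNo, iv, ciphertext, tag: cipher.getAuthTag() };
}
```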

The protocol depends on monotonic counters. The SGX built-in counters are too slow to meet the latency and throughput requirements of EnclaveDB, so EnclaveDB uses a dedicated monotonic counter service implemented using replicated enclaves.

EnclaveDB also carefully tracks log records to identify those that must exist in the log on recovery. Full details of the logging and recovery system, including a proof sketch, can be found in section V of the paper. That’s worthy of a detailed study in its own right.

Evaluation

Evaluation is a little tricky since the current generation of Intel Skylake CPUs top out at 128MB for the Enclave Page Cache. So small databases can be tested for real, and the results for larger database sizes are obtained from a carefully calibrated performance model. The authors run both TPC-C and TATP benchmarks (TATP simulates a typical location register database as used by mobile carriers to store subscriber information).

The TCB (trusted computing base) for both of these benchmarks (remember all the queries are pre-compiled and included) comes in at around 360Kloc (vs. e.g. 10Mloc for the SQL server OLTP engine).

In the charts that follow, the most interesting comparison is between BASE (Hekaton running outside enclaves) and CRYPT-CALL-MEM which includes all the enclave and encryption overheads.

For TPC-C there’s a drop in throughput of around 40% compared to the BASEline (that’s still about 2 orders of magnitude better than prior work). For TATP the throughput overhead is around 15%, with latency increased by around 22%.

Based on these experiments, we can conclude that EnclaveDB achieves a very desirable combination of strong security (confidentiality and integrity) and high performance, a combination we believe should be acceptable to most users.

What next?

There are many ways EnclaveDB can be improved, such as support for online schema changes, dynamically changing the set of authorized users, and further reducing the TCB. But we believe that EnclaveDB lays a strong foundation for the next generation of secure databases. (Emphasis mine)

Grand Pwning Unit: Accelerating microarchitectural attacks with the GPU

July 4, 2018

Grand Pwning Unit: Accelerating microarchitectural attacks with the GPU Frigo et al., IEEE Security & Privacy 2018

The general awareness of microarchitectural attacks has greatly increased since Meltdown and Spectre earlier this year. A lot of time and energy has been spent in defending against such attacks, with a threat model that assumes attacks originate from the CPU. Frigo et al. open up an entirely new can of worms – modern SoCs contain a variety of special purpose accelerator units, chief among which is the GPU. GPUs are everywhere these days.

Unfortunately, the inclusion of these special-purpose units in the processor today appears to be guided by a basic security model that mainly governs access control, while entirely ignoring the threat of more advanced microarchitectural attacks.

I’m sure you know where this is heading…

It turns out the accelerators can also be used to “accelerate” microarchitectural attacks. Once more we find ourselves in a situation with widespread vulnerabilities. The demonstration target in the paper is a mobile phone running on the ARM platform, with all known defences, including any applicable advanced research defences, employed. Using WebGL from JavaScript, Frigo et al. show how to go from e.g. an advert on a web page to a fully compromised browser in under two minutes.

Our end-to-end attack, named GLitch, uses all these GPU primitives in orchestration to reliably compromise the browser on a mobile device using only microarchitectural attacks in under two minutes. In comparison, even on PCs, all previous Rowhammer attacks from JavaScript require non-default configurations (such as reduced DRAM refresh rates or huge pages) and often take such a long time that some researchers have questioned their practicality.

If only I could flip a bit…

In Firefox, values stored in JavaScript ArrayObjects are 64 bits. The first 32 bits are used as a tag identifying the type of the object. When the tag value is below 0xffffff80 the whole 64-bit word is treated as an IEEE-754 double, otherwise the last 32 bits are treated as a pointer to an object. (This strategy is known as NaN-boxing, encoding object pointers in IEEE-754 doubles as NaN values).

So… if only we could flip bits within the first 25 bits of the tag, we could turn pointers into doubles, and vice-versa.
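To see why the tag matters, here is an illustrative view of the 64-bit layout using typed-array aliasing. Little-endian byte order is assumed (so word index 1 is the high/tag word), and the real SpiderMonkey value representation differs in detail:

```typescript
// View the same 8 bytes as a double and as two 32-bit words.
const buf = new ArrayBuffer(8);
const asDouble = new Float64Array(buf);
const asWords = new Uint32Array(buf);

asDouble[0] = 3.14;                    // stored as a plain IEEE-754 double
console.log(asWords[1] < 0xffffff80);  // true: tag word below the threshold

asWords[1] |= 0xffffff80;              // flip tag bits past the threshold...
console.log(asWords[1] >= 0xffffff80); // ...now the engine would treat the
                                       // low word as an object pointer
```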

The goal of the [GLitch] exploit is to obtain an arbitrary read/write primitive which can eventually lead to remote code execution. ArrayBuffers are the best fit to gain such a primitive since they provide the attacker with full control over their content. As a consequence, we want to create a reference to a fake ArrayBuffer whose data pointer we control.

  1. The GLitch exploit tool starts by storing a pointer to an inlined ArrayBuffer (header adjacent to data) in a 1-to-0 bit-flip-vulnerable location. Triggering a bit flip then turns this into a double that can be read, breaking ASLR (address space layout randomisation).
  2. Store a double in a 0-to-1 vulnerable cell in the ArrayBuffer, constructed in such a way that when a bit is flipped it becomes a pointer to a JSString, in turn pointing at the header (address obtained in step 1) for its immutable data. Read the value of the string to extract the content of the ArrayBuffer’s header.
  3. Create a header for a fake ArrayBuffer (using the header information obtained in step 2) within the leaked ArrayBuffer and craft a reference to it using the same double-to-pointer bit flip technique that we used for the JSString. Now we have the desired arbitrary read/write primitive.

Here’s how long it takes for GLitch to break ASLR and compromise the browser on a Nexus 5:

On average, GLitch can break ASLR in only 27 seconds and fully compromise the browser remotely in 116s, making it the fastest known remote Rowhammer attack.

So how does the GPU help us to flip bits (or just steal data in general using side-channel attacks)?

Four attack primitives

We need to be able to either leak data (side-channel attacks) or corrupt data (e.g. Rowhammer attacks).

A primary mechanism for leaking data using microarchitectural attacks is to time operations over resources shared with a victim process.

The first attack primitive then is access to a high-resolution timer. There has been a bit of an arms race in CPU land, with clever new ways of creating timers being devised and then blocked as best as possible. But the defences don’t take into account GPUs. There are two explicit timer sources within the OpenGL / WebGL world that will do the job, available when the EXT_DISJOINT_TIMER_QUERY extension is present. Both GPU and CPU operations can be timed directly using the primitives it provides. It’s also possible to craft your own timers using only standard (i.e., always available) WebGL2 functions: clientWaitSync and getSyncParameter. WebGL2 itself is not yet as widely supported as WebGL1 though.

…in order to comply with the WebGL2 specification none of these functions can be disabled. Also, due to the synchronous nature of these timers, we can use them to measure both CPU and GPU operations.
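A sketch of such a homemade clock built from WebGL2 sync objects, along the lines the paper describes: count how many polls elapse before a fence signals (simplified; real attack code calibrates the polling loop much more carefully):

```typescript
function timeGpuOp(gl: WebGL2RenderingContext, op: () => void): number {
  op(); // the GPU work we want to time
  const fence = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0)!;
  gl.flush();
  let ticks = 0;
  // each non-blocking poll of the sync object is one 'tick' of our clock
  while (gl.getSyncParameter(fence, gl.SYNC_STATUS) === gl.UNSIGNALED) {
    ticks++;
  }
  gl.deleteSync(fence);
  return ticks;
}
```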

The paper shows, for example, clear timing differences between cached and uncached data measured using the EXT_DISJOINT_TIMER_QUERY extension.

A second attack primitive is access to resources shared with other processes. By reverse engineering the caching structure of their GPU (an Adreno 330), the authors were able to determine the sequence of operations needed to effectively bypass the GPU caches and measure memory page accesses for memory pages shared with the rest of the system.

Internally the GPU has two levels of caching, and two ways of accessing memory (by inputting vertices to vertex shaders, or by fetching textures within shaders). Texture fetching turned out to be the easiest to control, and section IV.B of the paper describes in detail how the authors deduced an efficient strategy to evict cache sets from the GPU.

The third attack primitive is knowledge of the physical location of allocated memory addresses: a requirement in order to understand which rows to hammer in a Rowhammer attack. When a row of memory is accessed we can tell whether it was already in the row buffer by measuring the time the operation takes (buffer hits are faster). To carry out a reliable Rowhammer attack, three adjacent rows within a DRAM bank are required. Distinguishing between row buffer hits and misses enables us to determine whether allocations are contiguous or non-contiguous. (Details are in section VII.D, and the appendix covers the relationship between adjacency and contiguity.)

The fourth and final attack primitive is fast memory access needed to trigger bit flips with Rowhammer attacks. Using the knowledge of the GPU cache hierarchy gained via probing the authors derive efficient access patterns to perform double-sided Rowhammering attacks…

Rowhammering

DRAM rows are composed of cells which store the value of a bit in a capacitor.

The charge of a capacitor is transient, and therefore DRAM needs to be recharged within a precise interval (usually 64ms). Rowhammer is a software-based fault injection attack that can be considered a fallout of this DRAM property. By frequently activating specific rows an attacker can influence the charge in the capacitors of adjacent rows, making it possible to induce bit flips in a victim row without having access to its data.

In a double-sided Rowhammer attack, quick accesses to rows n-1 and n+1 impose high pressure on the capacitors in victim row n, triggering bit flips. “…our novel GPU-based side-channel attack provides us with information about contiguous physical memory regions in JavaScript, allowing us to perform double-sided Rowhammer on ARM devices in the browser.”

Mitigations

You could combine these primitives in a number of imaginative ways to construct different attacks, GLitch is but one end-to-end example. Eliminating known timers (e.g., disabling the EXT_DISJOINT_TIMER_QUERY) is still the best line of defence (though more timing strategies will likely be discovered). The WebGL2 getSyncParameter function can be disabled, and the clientWaitSync function could be replaced by a callback design. (This requires changes to the WebGL2 spec.). Stricter policies for memory reuse may also make it harder for an attacker to hammer valuable data.

We showed that it is possible to perform advanced microarchitectural attacks directly from integrated GPUs found in almost all mobile devices… more alarming, these attacks can be launched from the browser. For example, we showed for the first time that with microarchitectural attacks from the GPU, an attacker can fully compromise a browser running on a mobile phone in less than 2 minutes… we hope our efforts make processor vendors more careful when embedding the next specialized unit into our commodity processors.

Privacy risks with Facebook’s PII-based targeting: auditing a data broker’s advertising interface

July 3, 2018

Privacy risks with Facebook’s PII-based targeting: auditing a data broker’s advertising interface Venkatadri et al., IEEE Security and Privacy 2018

This is one of those jaw-hits-the-floor, can’t quite believe what I’m reading papers. The authors describe an attack exploiting Facebook’s custom audience feature, that can leak your PII.

Specifically, we show how the adversary can infer users’ full phone numbers knowing just their email address, determine whether a particular user visited a website, and de-anonymize all the visitors to a website by inferring their phone numbers en masse. These attacks can be conducted without any interaction with the victim(s), cannot be detected by the victim(s), and do not require the adversary to spend money or actually place an ad.

Following responsible disclosure of the attack vectors to Facebook, Facebook acknowledged the vulnerability and have put in place a fix (not giving audience size estimates under certain scenarios). The experiments conducted by the authors were performed between January and March 2017, and presumably the disclosure happened around that time or shortly afterwards. That probably means your PII on Facebook was vulnerable from when the custom audiences feature was first introduced, until early 2017. Someone with more time could probably put an exact date on it by looking to see when the change was rolled out, presuming it was announced in some manner.

Custom audiences

We explored custom audiences a little earlier this year when we looked at Facebook’s transparency mechanism. In short: whereas old-school targeted advertising let you build an audience by specifying desired attributes, custom audiences allow you to upload personally identifying information (PII) about specific users, and the platform searches for matches and builds an audience out of those.

The custom audience feature has proven popular with advertisers: it allows them to directly select the users to whom their ad is shown, as opposed to only selecting the attributes of the users.

An advertiser including a Facebook code (traditionally referred to as a tracking pixel) on their site can also build a tracking pixel audience based on Facebook users that visit the site.

Custom audience creation at Facebook is done using a web interface where you can upload up to 15 different types of user information for matching. Any one of email address, phone number, mobile advertiser ID, Facebook app user ID, and Facebook page user ID can be used on its own to create a custom audience. Other attributes may be used in combination with one of these. A few hours after uploading the PII, your audience will be available. Facebook requires at least 20 users in a custom audience; other platforms offering similar functionality (see table below) often require hundreds of users or more.

[Table: information available for custom audiences at Facebook – see the paper.]

Once the audience has been created, Facebook will give you an approximate indication of the audience size (the number of matched records). Also of interest is the potential reach: this is a measure of the number of active users in the audience (as opposed to dormant Facebook accounts). Active users within the audience are called targetable users in this paper.

Things get more interesting with the ability to combine audiences using include (union) and exclude (set minus) operators. The result is called a combined audience. You can include or exclude audiences that have fewer than 20 users.

On the audience comparison page advertisers can measure the overlap (audience intersection size) between different PII-based audiences they have created. This feature requires audience sizes of at least 1,000, and shows the intersection of audience sizes (not of potential reach).

None of the sizes above are given as exact numbers (that would make for trivial attacks), instead they are all given as approximate values. Through a series of controlled experiments the authors deduce the following:

  • Potential reach values have a minimum size of 20. Up to 1,000 the values are always a multiple of 10; between 1,000 and 10,000 they are multiples of 100; between 10,000 and 100,000 they are multiples of 1,000; and above 100,000 they are multiples of 10,000. Potential reach values are stable in the face of repeated queries in the short term, and largely consistent over longer timescales. Furthermore, potential reach numbers always increase monotonically when adding more users to the uploaded list.
  • Audience sizes are multiples of 10 up to 100, and multiples of 100 thereafter. Audience size numbers are stable and have similar monotonicity to potential reach.
  • Audience size intersection is also stable and monotonic, but the rounding works differently: the intersection size is rounded in steps of 5% of the smaller audience size. E.g. if we intersect an audience of size 1000 and an audience of size 4,500, we get an answer that is a multiple of 50.

Facebook also applies de-duplication when giving counts.

We found that when Facebook has an audience – or combination of audiences – that was created with different PII that both refer to the same user, Facebook only “counts” that user once when reporting the potential reach, audience size, or audience intersection size.

This de-duplication happens both within a single audience and across combinations or intersections of audiences.

Threshold audiences

Suppose for a moment Facebook gave you exact counts. You could learn whether an individual was in a given audience or not (and recall that with includes and excludes you can create audiences down to a single user) by finding the size of an audience, then adding the user in question and getting the size again. If it goes up by one, the user was not in the audience, and vice-versa. Using this same idea, we can also explore audience sizes that lie on rounding boundaries…

The insight behind the method is to create a threshold audience: an audience or combination of audiences where the size statistic of interest – potential reach, audience size, or audience intersection size – lies right before or right after Facebook’s rounding threshold. We call an audience with a size right before the threshold a lower threshold audience, and an audience with a size right after the rounding threshold an upper threshold audience.

E.g., if Facebook is rounding to the nearest 10, then 84 is a lower threshold audience, and adding one member makes the reported size jump from 80 to 90. In other words, we’re back in the situation where an individual user’s results can be discerned in the outputs.

How do you find a lower threshold audience? Upload a sequence of lists, each with one more record than the previous one. At the point where the reported size jumps up from that of the list with one less member (e.g., from 810 to 820), you have yourself a lower threshold audience.
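In sketch form (the reportedSize function below is a hypothetical stand-in for the upload-and-estimate advertiser interface):

```typescript
// Probe for a lower threshold audience: upload lists that grow by one
// record until the reported (rounded) size jumps.
async function findLowerThreshold(
  records: string[],                                  // PII records we control
  reportedSize: (list: string[]) => Promise<number>,  // rounded size estimate
): Promise<string[]> {
  let n = 20; // Facebook's minimum custom audience size
  let prev = await reportedSize(records.slice(0, n));
  for (n = n + 1; n <= records.length; n++) {
    const size = await reportedSize(records.slice(0, n));
    if (size > prev) {
      // the (n-1)-record list sat just below a rounding boundary
      return records.slice(0, n - 1);
    }
    prev = size;
  }
  throw new Error('no threshold found within the supplied records');
}
```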

Have you visited my site?

The warm-up example shows how to de-anonymise website visitors. Let’s make things a bit more entertaining and say you run an adult-oriented site, and want to know if a certain celebrity frequents it.

Start with your lower threshold audience, combine it with an audience containing just your target victim, and if the potential reach changes, that person has an active Facebook account. Assume that you use a tracking pixel on your website so that you have a tracking pixel audience available to you. Find an upper threshold audience (using the same ‘lists that differ by one’ technique, trying different combinations of your tracking pixel audience plus a list you control). Now ask for the audience size of your upper threshold audience combined with an exclusion list containing just the victim. If the audience size drops, the victim was in your tracking pixel audience. You just discovered whether a certain celebrity has frequented your adult site.

What’s your phone number?

Things are progressing nicely. Maybe we want to blackmail our celebrity, and to do that it would be jolly handy if we could phone them up and tell them what we know. How can we get their phone number though? The details of how to do this are in section V.B of the paper. At the core it relies on the same threshold audience ideas, combined with the audience size intersection capability. With a minimum set size of 1,000 for intersection, and granularity based on 5% of the smaller set, we have to contend with jumps of 50 users.

Take 1,999 users with known PII, and divide them into two sets R and L, where R has 1,949 members, and L has the other 50. Upload two custom audiences: C1 is the union of R and L, and C2 is the union of R, L, and the victim’s PII. (There’s a twist on this basic idea that accounts for the fact that we don’t necessarily know whether all 1,999 initial users have active Facebook accounts; it uses a similar probing-in-sequence approach to find a lower threshold intersection.)

How does that help us get a phone number? A phone number is one of the types of PII we can match on. So we create subset lists L_{ij} of all possible phone numbers, where L_{ij} contains all phone numbers whose i-th digit has value j. Now for each i, we should find exactly one j such that the victim is in L_{ij}.
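A sketch of the list construction and the digit-by-digit recovery (names are illustrative; the membership test stands in for the threshold-audience trick above):

```typescript
// Build the lists L_ij: for digit position i and digit value j, L_ij holds
// every candidate phone number whose i-th digit is j.
function buildDigitLists(candidates: string[]): Map<string, string[]> {
  const lists = new Map<string, string[]>();
  for (const phone of candidates) {
    for (let i = 0; i < phone.length; i++) {
      const key = `${i}:${phone[i]}`; // identifies list L_ij
      if (!lists.has(key)) lists.set(key, []);
      lists.get(key)!.push(phone);
    }
  }
  return lists;
}

// If membership(key) tells us whether the victim falls in list L_ij,
// the number can be read off one digit at a time.
function recoverNumber(numDigits: number,
                       membership: (key: string) => boolean): string {
  let result = '';
  for (let i = 0; i < numDigits; i++) {
    for (let j = 0; j <= 9; j++) {
      if (membership(`${i}:${j}`)) { result += j; break; }
    }
  }
  return result;
}
```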



Using this technique it took 140 lists to cover all of Boston, and 82 lists to detect French mobile numbers. It took about 20 minutes to successfully recover the phone numbers of study participants from France (8/8) and Boston (11/14). For the three cases that failed in Boston, one had never provided their phone number to Facebook, and the other two had provided multiple phone numbers which confused the matching. The appendix shows how to handle this case.

Just tell me everyone’s phone numbers!

The final flourish is to recover the phone numbers of every visitor to a website (i.e., in a tracking pixel audience). It combines the upper threshold trick we saw before to de-anonymise website visitors, with the multiple phone-list audiences approach. Full details are in section V.C of the paper.

An experiment successfully determined the phone numbers of nine volunteer study participants who visited a website.

Our method took around one hour to do the entire search and output the phone numbers by sending queries to Facebook in a serial fashion; this time could be cut down significantly by parallelizing the queries sent to Facebook.

You can use your imagination to figure out how to build custom audience lists to discover other kinds of PII about your victims. The potential also exists for similar attacks on other platforms (e.g. Instagram, Google, Twitter, Pinterest, LinkedIn), “but due to space constraints we leave exploring them to future work.”

Closing the loophole

The authors propose a change to the way Facebook deals with audience de-duplication when reporting audience information, which balances advertiser utility (wanting to know about sizes) with privacy preservation. Facebook actually went with something simpler:

Facebook has acknowledged the vulnerabilities reported in this paper and has confirmed to us that it has deployed a stricter version of our defense: no size statistics will be provided for audiences created using multiple PII attributes. Additionally, no size estimates will be provided when combining audiences that were created using different PII attributes.

So it’s good that this is now fixed, but it’s bad that such a vulnerability existed in the wild for quite some time. We’ll probably never know how much it was exploited if at all. Platforms whose business is all about the collection and monetisation of PII have a duty to protect that information. Their privacy teams really should know that an interface that produces different results based on the presence or absence of a single record is a bad idea for privacy.

While we have proposed a solution to the attacks we uncovered, our work shows that platforms need to carefully audit their interfaces when introducing PII-based targeting.

The rise of the citizen developer: assessing the security impact of online app generators

July 2, 2018

The rise of the citizen developer: assessing the security impact of online app generators Oltrogge et al., IEEE Security & Privacy 2018

“Low code”, “no code”, “citizen developers”, call it what you will, there’s been a big rise in platforms that seek to make it easy for non-expert developers to build applications. Today’s paper choice studies the online application generator (OAG) market for Android applications. What used to be a web site (and there are many successful web site templating and building options around) is now often also, or instead, a mobile app, so it makes sense that the same kind of templating and building approach should exist there too. For a brief period at the end of last year, Apple flirted with banning such apps from their app store, before back-tracking just a couple of weeks after the initial announcement. After reading today’s paper I can’t help but feel that perhaps they were on to something. Not that templated apps are bad per se, but when the generated apps contain widespread vulnerabilities and privacy issues, then that is bad.

With the increasing use of OAGs the duty of generating secure code shifts away from the app developer to the generator service. This leaves the question of whether OAGs can provide safe and privacy-preserving default implementations of common tasks to generate more secure apps at an unprecedented scale.

Being an optimist by nature, my hope was that such app generation services would improve the state of security, because spending time and effort getting it right once would pay back across all of the generated apps. In theory that could still happen, but in practice it seems the opposite is occurring. It doesn’t seem to make a lot of difference whether you use a free online app generator (what are their incentives, and where does their revenue come from? Always good questions to ask), or a paid service, the situation is not good. The re-configuration attacks that the authors discover are particularly devastating.

Online app generation services and penetration

Online application generators enable app development using wizard-like, point-and-click web interfaces in which developers only need to add and suitably interconnect UI elements that represent application components… There is no need and typically no option to write custom code.

The authors started by searching the web for advertised online app generation platforms, resulting in the set of services shown in the table below (the Como the authors refer to is I believe now called ‘swiftic’ – https://www.swiftic.com/). For each of these application generators, it is possible to identify fingerprints in the generated apps that reveal whether or not an app was created using that generator. Such give-aways include application package names, Java package names, shared signing keys and certificates, certain files included in the distribution, and so on. Using a corpus of 2.2M free apps from Google Play (collected from Aug 2015 to May 2017) the authors found that at least 255,216 of the apps were generated using online app generators. The five most popular OAGs account for 73% of all generated apps.


[Table: identified online app generators and their market penetration – see the paper.]

As is the nature of such apps, many of them have comparatively few downloads. A small number have been more successful though, and the total number of downloads across all of the generated apps is still meaningful.

What’s in a generated app?

A typical OAG offers a palette of components that can be configured together to deliver app services. The OAGs in this study offered between 12 and 128 different components. There seem to be two major app generation techniques: monolithic boilerplate and custom assembly.

All but two online services generate apps with the exact same application bytecode [monolithic boilerplate], i.e. apps include code for all supported modules with additional logic for layout transitions independent of what the app developer has selected. Apps only differ in a configuration file that is either statically included in the apk file or dynamically loaded at runtime.

Andromo and Appinventor both include only the modules actually needed by the app, with the generated code tailored to the configured modules.

The authors built and studied three simple apps with each of the 13 most popular services according to the market penetration data gathered above:

  • A minimal ‘hello-world’ app.
  • The minimal hello-world app extended to also make web requests to a server controlled by the authors. “We perform both HTTP and HTTPS requests to analyze the transferred plaintext data and to test whether we can inject malicious code and emulate web-to-app attacks.”
  • An app with a user login form or other form to submit user data to a server controlled by the authors. (Appinventor, Biznessapps, and Como did not offer such modules).

New attack vectors

Of the 11 (out of 13) services that use a monolithic strategy, 5 use either static or dynamic loading of configuration files exclusively, while the others combine both (e.g., a static file included in the app, with the option to download updates later on). Needless to say these configuration files are all-powerful: the monolithic app bytecode can do ‘anything,’ and it’s just the configuration that determines what a particular instance does.

For the nine generators that use static files, seven of them store these files in plain text such that they can be easily read and modified. AppYet does encrypt the file, but the passphrase is hard-coded in the bytecode and the same for every generated app. There are no runtime integrity checks to determine if config files have been tampered with. Cloning or repackaging apps is therefore easy.

Where things get really interesting though, is the 8 out of 11 OAGs that load config files dynamically at runtime.

In five cases, the config is requested via HTTP and transmitted in plaintext without any protection.

Game over!

Mobincube uses SSL by default, but the request can be downgraded to HTTP.

None of them use public key pinning to prevent MITM attacks.

Similar to static configs, the complete absence of integrity checks and content encryption allows tampering with the app’s business logic and data. This is fatal, since in our tests we could on-the-fly re-configure app modules, compromise app data, or modify the application’s sync URL.

Example attacks include phishing attacks to steal user data/credentials, or replacement of API keys for advertising to steal ad revenue.

Beyond the configuration files, generated apps are also often bound to web services provided by the app generators.

Consequently, when considering that a single service’s infrastructure can serve many hundreds or thousands of generated apps, it is paramount that the app generator service not only follows best practices, such as correctly verifying certificates, but also that the service’s infrastructure maintains the highest security standards for their web services.

You know where this is going by now… “the results of analyzing the communication with the server backend are alarming”. Only two of the providers use HTTPS consistently (Apps Geyser uses HTTP exclusively!). Several send sensitive data from user input forms in plain text. Only three use a valid and trusted certificate. All of them are running outdated SSL libraries.

In addition to putting their customers at risk, these OAGs are also leaking their own credentials in the form of API keys to e.g. Google Maps and Twitter, that are shared across all generated apps.

All identified keys were exactly the same across all analyzed apps, underlining the security impact of boilerplate apps. We could not find a single attempt to obfuscate or protect these keys/secrets.

Tried and trusted attack vectors

In addition to the OAG-specific attacks outlined above, the authors also analysed the generated code for other violations of security best practices on Android. One obvious-with-hindsight finding is that all the OAGs using the monolithic strategy generate over-privileged apps by design. Overall the situation is not good, as summarised in a table in the paper; see section VI for the full details behind it.

Privacy leaks and tracking

The authors checked outgoing traffic from generated apps and compared it with the privacy policies provided by the online services. None of them contacted sites known for distributing malware, but four of them (Como, Mobincube, Biznessapps, and Appy Pie) “clearly exhibited questionable tracking behavior”.

  • Apps generated with Mobincube sent more than 250 tracking requests within the first minute of execution
  • Mobincube also includes the BeaconsInSpace library to perform user tracking via Bluetooth beacons
  • AppyPie tracks using Google Analytics, Facebook, and its own backend
  • Como registers with multiple tracking sites including Google Analytics and its own servers
  • Biznessapps is tracking your location

While such extensive tracking behavior is already questionable for the free services of Appy Pie and Mobincube, one would certainly not expect this for paid services like Como and Biznessapps.

Where are the gatekeepers?

Perhaps Apple was onto something when it flirted with banning such apps. Not that I think templated apps in general should be banned, but if it’s as easy to identify app generators as the authors demonstrate, and an app generator with known security issues has been used to generate an app, then in order to protect users of their platforms I do think such apps should be blocked. It’s probably the only way to force the generator platforms to up their game and start taking security seriously.

Debugging data flows in reactive programs

June 29, 2018

Debugging data flows in reactive programs Banken et al., ICSE’18

To round off our look at papers from ICSE, here’s a really interesting look at the challenges of debugging reactive applications (with a certain Erik Meijer credited among the authors).

… in recent years the use of Reactive Programming (RP) has exploded. Languages such as Elm and libraries such as Reactor, Akka, and Rx are being used by companies such as Netflix, Microsoft, and Google, to build highly responsive and scalable systems.

The rise of reactive programming fits well with the increasing need to process streams of data. In a reactive program, you set up a data processing pipeline and then wait for input to arrive.

Many RP implementations share a notion of a collection that abstracts over time, in contrast to space like standard collections. This collection comes in different flavors, such as Observable (Rx)… the implementations differ in the precise semantics of their collections, their execution model (push/pull), and the set of available operators.

While in theory events and threads are duals, in practice the RP abstraction works very well for expressing streaming pipelines. If you have an abstraction over time though, then one consequence is that the program control flow is also abstracted. That brings us to the central problem statement:

RP borrows from functional programming (FP) for its abstractions, its laziness, and its use of “pure” functions. Those features contribute to a control flow that is hidden inside the RP implementation library and leads to non-linear execution of user code. This results in non-useful stack traces, while breakpoints do not help either, as relevant variables are frequently out of scope. Furthermore, using a low-level debugger makes it harder to interact with the high-level abstractions that RP provides. (Emphasis mine).

Banken et al. set out to find out how experienced RP programmers deal with these issues in practice. Then having understood the state-of-the-art (spoiler alert: printf), they set out to see if it’s possible to build a debugging environment better suited to RP applications. The result is RxFiddle, which you can experiment with online. The source code is available on GitHub.

The Rx model and Marble diagrams

Observables define the data flow and produce data, while Observers receive data. In the following example, of() creates a simple Observable, and map and filter create dependent Observers. The call to subscribe starts the flow of data.
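(A reconstruction in modern RxJS syntax; the paper’s exact listing may differ.)

```typescript
import { of } from 'rxjs';
import { map, filter } from 'rxjs/operators';

const source = of(1, 2, 3, 4);        // assembly: a simple source Observable
const flow = source.pipe(
  map(x => x * 2),                     // dependent operator
  filter(x => x > 4),                  // dependent operator
);

flow.subscribe(x => console.log(x));   // subscription starts the data flowing
// runtime phase: logs 6, then 8, then the flow completes
```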

Until that call to subscribe, we are just in the assembly phase, building up the pipeline. When subscribe is called on an Observable, the subscription propagates back upstream. This is the subscription phase.

In the runtime phase Observables with subscriptions start emitting data. Rx defines three types of events that can occur during the runtime phase: next, error, and complete.

More complex programs feature operators that merge Observables, split Observables, or handle higher-order Observables, resulting in more complex graphs.

Marble diagrams (the name comes from the shape of the glyphs used in the Rx documentation) contain one or more timelines containing events that enter and leave Observables. By convention next events are represented by a circle, error events with a cross, and complete events with a vertical line. See examples at RxMarbles.com, which visualises single Rx operators for the purposes of learning and comprehension. RxViz, an “animated playground for Rx Observables”, is similar but uses a code editor instead of prepared inputs, and visualises the output of the stream.

The diagrams help developers understand how the various operators work. “Mapping time on the x-axis provides insight that is missing when inspecting only a single time slice.”

Debugging in practice today

The authors interviewed five developers with between 4 and 12 years each of professional programming experience. The first four developers work at a company A which builds reactive systems; the fifth is building a large-scale distributed server application at company B, using Rx to handle asynchronous events. Experience with Rx ranges from a month to over a year across the group.

At company A there are no explicit Rx tests (they do test the business logic of course). Company B tests extensively using Rx’s ‘marble tests’ library and TestScheduler.

Subjects mostly use printf() statements to debug. Printing values through the flow allows them to “quickly reason about what happens,” whereas using the Node.js debugger is “painful.” There are frequent trips to the documentation to find an operator for the task in hand, and to comprehend how existing operators work. When comprehending code, subjects first look at which operators are used, and then they reason about the types and values that might flow through the stream. Traditional IDE support turns out to be much less helpful when building reactive applications:

  • “Rx is more about timing than about types”, and “you miss some sort of indication that the output is what you expect.”
  • It is not always clear what happens when you execute a piece of code, “mostly due to Observables sometimes being lazy.”
  • Flows are clear and comprehensible in the scope of a single class or function, but for application-wide flows it becomes unclear.
  • You need to know a lot as a starting Rx developer.

The team also scoured official documentation, books, online courses, and blog posts searching for debugging best practices and recommendations. There were slim pickings, with this article by Staltz on “How to debug RxJS code” being a rare exception. Staltz recommends (a) tracing to the console, (b) manually drawing the dependency graph, and (c) manually drawing Marble diagrams.

Overall, we identified four overarching practices when debugging Rx code: (i) gaining high-level overview of the reactive structure, (ii) understanding dependencies between Observables, (iii) finding bugs and issues in reactive behavior, (iv) comprehending the behavior of operators in existing code.

(Salvaneschi and Mezini also looked into some of the issues in their 2016 paper, ‘Debugging for Reactive Programming’.)

Introducing RxFiddle

RxFiddle is designed to help with these objectives by providing a visual overview of Observable flows, and also detailed views inside the data flows at an event level. RxFiddle therefore contains a data flow graph visualiser, and a dynamic Marble diagram generator. It looks like this:

The Data Flow Graph bundles together multiple subscriptions on the same Observable, helping developers find related flows and identify possible performance optimisations. The layout engine is based on StoryFlow, originally designed to visually describe storylines involving multiple characters and the interactions between them. The StoryFlow elements are positioned to align with the Marble diagram. Node colours are used to identify the same Observable used at multiple places in the graph.

The Dynamic Marble Diagrams update live when new events occur, and are stacked to show the data in a complete flow.

This allows developers to trace a value back through a flow, an operation which is impossible using a classic debugger. Handcrafted marble diagrams can use custom shapes and colors to represent events, but for the generic debugger we use only three shapes: next-events are a green dot, errors are a black cross, and completes are a vertical line.

Does RxFiddle help with debugging?

RxFiddle was evaluated in an offline experiment at a Dutch software engineering company, with developers whose RP experience ranged from very little to several years. There was also a Twitter campaign within the RP community to get people to try the online version. 111 people participated in total. The following charts show the time taken to complete four tasks. T1 and T2 are synthetic examples of two simple data flows, and the participants have to understand how the program works and report findings. T3 and T4 are programs that need to be debugged. In T3 an error in an external service propagates through a stream, and in T4 concurrent requests lead to out-of-order processing of responses.

Across all participants, the following chart shows the task completion times when using the console, and when using RxFiddle.

Only T3 shows clearly better results for RxFiddle. When controlling for more experienced programmers only, RxFiddle does slightly better.

While the results of our evaluation could be observed as a negative, RxFiddle is a new tool, where subjects have only just been exposed to the tool and received only a short training. We expect that by designing a debugger model so close to the actual abstractions, our debugger works especially well for users with some knowledge of these abstractions…

Users experimenting with the tool online in their own projects reported an improved understanding of what was happening in their projects.

There are several promising directions for improving our design. Specifically scalability could be improved and different edge visualizations could be explored, to improve the usability of the tool. Furthermore, by leveraging already captured meta data about timing of events, even more insight could be provided. At the implementation level, we plan to extend RxFiddle to other members of the Rx-family of libraries.

Of note in the related work section is RP Debugging, embodied in the [Reactive Inspector](http://guidosalva.github.io/reactive-inspector/), a recently created (? seems to be 2015/16 from what I can tell) tool for REScala delivered as an Eclipse plugin.

How not to structure your database-backed web applications: a study of performance bugs in the wild

June 28, 2018

How not to structure your database-backed web applications: a study of performance bugs in the wild Yang et al., ICSE’18

This is a fascinating study of the problems people get into when using ORMs to handle persistence concerns in their web applications. The authors study real-world applications and distil a catalogue of common performance anti-patterns. There are a bunch of familiar things in the list, and a few that surprised me with the amount of difference they can make. By fixing many of the issues that they find, Yang et al. are able to quantify how many lines of code it takes to address each issue, and what performance improvement the fix delivers.

To prove our point, we manually fix 64 performance issues in [the latest versions of the applications under study] and obtain a median speed-up of 2x (and up to 39x max) with fewer than 5 lines of code change in most cases.

The Hyperloop website provides access to a tool you can use to identify and solve some of the common performance issues in your own (Rails) apps.

I’m going to skip the intro parts about what ORMs do and how a typical web app is structured, on the assumption that you probably have a good handle on that already. Note that fundamentally a lot of the issues stem from the fact that the ‘O’ in ORM could just as easily stand for ‘Opaque.’

On one hand, it is difficult for application compilers or developers to optimize the interaction between the application and the underlying DBMS, as they are unaware of how their code would translate to queries by the ORM. On the other hand, ORM framework and the underlying DBMS are unaware of the high-level application semantics and hence cannot generate efficient plans to execute queries.

I remember in the early days of Spring (when Hibernate was young too, and JPA didn’t exist) a framework called iBATIS was quite popular amongst people I considered to be good developers. It turns out iBATIS is still going strong, though now it’s called MyBatis. The key selling point is that you retain control over the SQL, without the overheads of directly using an API like JDBC.

Finding and profiling real-world applications

The study focuses on Ruby on Rails applications, for which many large open source codebases are available: “compared to other popular ORM frameworks such as Django and Hibernate, Rails has 2x more applications on GitHub with 400 or more stars than Django and Hibernate combined.” Six popular application categories (covering 90% of all Rails apps with more than 100 stars on GitHub) are selected, and then the two most popular applications in each category, resulting in the following twelve apps:

They have been developed for 5-12 years, are all in active use, and range from 7Kloc to 145Kloc (Gitlab). To generate realistic data for these apps, the team collected real-world statistics based on either the app in question where available, or similar apps otherwise, and implemented a crawler that fills out forms on the application’s websites following the real-world stats. For each application, three sets of data were created: with 200, 2000, and 20,000 records in the main database table.

When we discuss an application’s scalability we compare its performance among the three above settings. When we discuss an application’s performance we focus on the 20,000-record setting, which is a realistic setting for all the applications under study. In fact, based on the statistics we collect, the number of main table records of every application under study is usually larger than 20,000 in public deployments…

(E.g., more than 1 million records).

With databases in hand, the applications are then profiled by visiting links randomly for two hours and the resulting logs produced by Chrome and the Rails Active Support Instrumentation API are processed to obtain average end-to-end loading time for every page, the detailed performance breakdown, and the issued database queries. For each app, the 10 pages with the worst loading time are plotted in the figure below.

11 out of 12 applications have pages whose average end-to-end loading time (i.e. from browser sending the URL request to page finishing loading) exceeds 2 seconds; 6 out of 12 applications have pages that take more than 3 seconds to load…. Note that our workload is smaller, or for some applications, much smaller, than today’s real-world workload.

Looking at where the time goes for these pages, server time (app server + DBMS) contributes at least 40% of this latency in at least 5 out of the top 10 pages for 11 of the 12 apps.

In total, 40 problematic actions are identified from among the top 10 most time-consuming actions of every application. 34 of these have scalability problems, and 28 take more than 1 second of server time. 64 performance issues were found in these 40 actions, as summarised in the table below.

In addition to studying these actions, the authors also look at performance issues reported in the application’s bug trackers.

The causes of inefficiency are categorised into three main areas: ORM API misuse, database design, and application design.

ORM API Misuse

Inefficient computation (IC) occurs when the same operation on persistent data can be implemented via different ORM calls. These often look similar on the surface, but can have very different performance implications. For example, the difference between any? (which scans all records in the absence of an index) and exists?, as shown here. Another example is the use of Object.where(c).first instead of Object.find_by(c).
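(A minimal Ruby sketch of these two IC patterns, assuming a hypothetical ActiveRecord User model:)

```ruby
# Inefficient: any? may fetch/scan matching records just to answer yes/no
has_admin = User.where(role: 'admin').any?
# Better: exists? issues a cheap `SELECT 1 ... LIMIT 1`
has_admin = User.where(role: 'admin').exists?

# Inefficient: .first adds an ORDER BY on the primary key before taking a row
user = User.where(email: 'a@example.com').first
# Better: find_by takes any matching row without the ordering work
user = User.find_by(email: 'a@example.com')
```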

There are also opportunities to move computation to the DBMS, for example, replacing pluck(:total).sum with sum(:total). Sometimes we can also take advantage of data already in memory (e.g. replace Object.count, a db query, with Object.size, which will count in-memory objects if they have already been loaded).
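(Again as a hedged sketch, with a hypothetical Order model:)

```ruby
# Inefficient: pluck(:total) transfers every value, then sums in Ruby
grand_total = Order.pluck(:total).sum
# Better: let the DBMS compute the aggregate
grand_total = Order.sum(:total)

# When the association is already loaded, size counts in memory:
orders = customer.orders.load
orders.size    # no query: counts the loaded objects
orders.count   # always issues SELECT COUNT(*), even with data in memory
```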

Unnecessary Computation (UC) occurs when redundant or useless queries are issued. A classic example is a query inside a loop body that could be computed just once, outside the loop.
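(For instance, hoisting a loop-invariant query, sketched with a hypothetical Setting model and enforce_quota helper:)

```ruby
# Unnecessary Computation: the same query runs on every iteration
users.each do |user|
  quota = Setting.find_by(name: 'max_quota')   # N identical queries
  enforce_quota(user, quota)
end

# Better: hoist the loop-invariant query outside the loop
quota = Setting.find_by(name: 'max_quota')     # one query
users.each { |user| enforce_quota(user, quota) }
```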

Inefficient data accessing (ID) results in slow data transfers by either not getting enough data (resulting in the familiar N+1 selects problem), or getting too much data. This can be due to inefficient lazy loading (still prevalent in this study), or too-eager eager loading. Here’s an example of a patch in Lobsters that replaces 51 database queries with one:
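(Not the Lobsters patch itself, but a generic sketch of the same kind of N+1 fix, assuming hypothetical Post/author models:)

```ruby
# N+1 selects: one query for the posts, then one more per post for its author
Post.limit(50).each { |p| puts p.author.name }                    # 1 + 50 queries

# Better: eager-load the association up front
Post.includes(:author).limit(50).each { |p| puts p.author.name }  # 2 queries
```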

A similar inefficient updating issue occurs when N queries are issued to update N records separately, instead of using update_all.
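(Sketched the same way:)

```ruby
# Inefficient updating: N separate UPDATE statements
Post.where(author_id: author.id).each { |p| p.update(flagged: true) }

# Better: a single UPDATE (note: update_all skips model validations/callbacks)
Post.where(author_id: author.id).update_all(flagged: true)
```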

Unnecessary data retrieval (UD) occurs when an application retrieves persistent data that it doesn’t then use.

Inefficient Rendering (IR). This one was a surprise to me in the measured impact. It’s very common in Rails to do something like this, which calls link_to within a loop to generate an anchor:

Replacing it with a single call to link_to outside of the loop, and the use of gsub within the loop, is faster.
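(Roughly, as a sketch rather than the paper’s exact patch, with hypothetical placeholder names:)

```ruby
# Inefficient Rendering: the link_to helper (and route generation) runs per row
rows = users.map { |u| link_to(u.name, user_path(u)) }

# Faster: call link_to once with placeholders, then substitute with gsub
template = link_to('NAME_HOLDER', user_path('ID_HOLDER'))
rows = users.map do |u|
  template.gsub('NAME_HOLDER', u.name).gsub('ID_HOLDER', u.id.to_s)
end
```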

There’s a readability trade-off here, but with complex rendering functions it is clearly worth it. 5 problematic actions in the study dropped their time by over half as a result.

Database design issues

Missing Fields (MF) occurs when an application repeatedly computes a value that it could just store.

Missing Database Indexes (MI). Need I say more?

… missing index is the most common performance problem reported in ORM application’s bug tracking systems. However, it only appears in three out of the 40 problematic actions in latest versions.
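(The fix is usually a one-line migration; a generic Rails sketch, assuming a hypothetical posts table queried by author_id:)

```ruby
class AddIndexToPostsOnAuthorId < ActiveRecord::Migration[5.2]
  def change
    # Index the column used in frequent WHERE clauses
    add_index :posts, :author_id
  end
end
```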

Application design issues

Content Display Trade-offs (DT). Typically returning all records instead of using pagination. This becomes a problem as the number of records in the database grows.
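(In plain ActiveRecord terms, with a hypothetical Post model, the fix looks something like this; pagination gems such as kaminari or will_paginate wrap the same LIMIT/OFFSET idea.)

```ruby
# Content display trade-off: load and render every record
@posts = Post.all

# Paginated: bounded work per page
page = params[:page].to_i
@posts = Post.order(created_at: :desc).limit(20).offset(page * 20)
```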

Application Functionality Trade-offs (FT). Sometimes an application has side information on a page that is actually quite expensive to compute. Removing this when it is not essential to the user’s task in hand can make things go faster.

Fixing inefficiencies

The authors study all 40 of the problematic actions and manually fix 39 of them (the remaining one spends its time on file-system actions). This leads to 64 fixes being applied.

Many fixes are very effective. About a quarter of them achieve more than 5x speedup, and more than 60% of them achieve more than 2x speedup. Every type of fix has at least one case where it achieves more than 2x speedup.

40 of the 64 fixes alter neither the display nor the functionality of the original application, achieving an average speed-up of 2.2x (maximum 9.2x).

Per a recent comment from Steve Powell, it’s actually quite confusing to talk about a ‘2x speed-up’ when you’re talking about the time something takes! I guess we could interpret this as either 2x requests-per-second (that’s a speed metric), or that an action takes on average half the time it used to (a ‘/2’ latency reduction?). With a single thread of execution, they both amount to the same thing.

The average server time is reduced from 3.57 seconds to 0.49 seconds, and the end-to-end page load times from 4.17 seconds to 0.69 seconds. More than 78% of fixes require fewer than 5 lines of code.

In other words, by writing code that contains the anti-patterns discussed earlier, developers degrade the performance of their applications by about 6x.

The authors wrote a static analyser (you can find it here) that looks for some of the simple API misuse patterns and ran it on the latest versions of the 12 ORM applications. There are still plenty of instances! (Some of these are in actions that weren’t identified as among the worst-performing during profiling, of course.)

By profiling the latest versions of 12 representative ORM applications and studying their bug-tracking systems, we find 9 types of ORM performance anti-patterns and many performance problems in the latest versions of these applications. Our findings open up new research opportunities to develop techniques that can help developers solve performance issues in ORM applications.

Now if you’ll excuse me, I’ve got a Ruby codebase I just need to go and look at…