Application crash consistency and performance with CCFS – Pillai et al., FAST 2017
I know I tend to get over-excited about some of the research I cover, but this is truly a fabulous piece of work. We looked at “All file systems are not created equal” in a previous edition of The Morning Paper, which showed that the re-ordering and weak atomicity behaviour of modern file systems causes havoc with crash recovery in applications. For anyone building an application on top of a file system, “All file systems are not created equal” is on my required reading list. Today’s paper is the follow-up, which looks at what we can do about the problem. CCFS (the Crash-Consistent File System) is an extension to ext4 that restores ordering and weak atomicity guarantees for applications, while at the same time delivering much improved performance. That’s not supposed to happen! Normally when we give stronger consistency guarantees, performance degrades. The short version of the story is that ccfs introduces a form of causal consistency in which file system operations are grouped into streams. Within a stream, ordering is preserved; across streams, by default, it is not (there are some important details here that we’ll cover later on). In the simplest mapping, each application is given its own stream.
I’m going to reverse the normal order of things and start with some of the evaluation results, then we’ll take a step back and see how ccfs pulls them off. Before we do that, just one other small observation – ccfs ends up being just 4,500 lines of changed/added code in the 50Kloc ext4 codebase. Which just goes to reinforce the point that it’s not the amount of code you can churn out that counts, it’s the careful design and analysis that goes into figuring out what code to write in the first place (and that often leads to there being less of it as well).
The situation addressed by ccfs is related to, but different from, the corruption and file system crash problems that we looked at last week. Here we assume a system crash, and the file system is behaving ‘correctly’ – just not as the application designer expected. On the former front, many thanks to @drsm79 and @danfairs who pointed me to some fabulous data from CERN on the prevalence of file system corruptions in the wild (about 6/day on average that are observed!). Steve Loughran’s talk on “Do you really want that data?” might also be of interest.
So let’s start out with a reminder of the current situation, and then compare it against ccfs. Using Alice and Bob (the testing tools from “All file systems are not created equal”), the authors test LevelDB, SQLite, Git, Mercurial, and ZooKeeper. Alice records the system call trace for an application workload, and then reproduces the possible set of file system states that can occur if there is a system crash.
Here are the Alice results:
Ext4 results in multiple inconsistencies: LevelDB fails to maintain the order in which key-value pairs are inserted, Git and Mercurial can result in repository corruption, and ZooKeeper may become unavailable. With ccfs, the only inconsistencies were with Mercurial. These inconsistencies are exposed on a process crash with any file system, and therefore also occur during system crashes in ccfs.
Bob is the ‘block order breaker’ – it records the block-level trace for an application workload and reproduces the set of disk images possible if a crash occurs.
We used Git and LevelDB to test our implementation and compare it with ext4; both have crash vulnerabilities exposed easily on a re-ordering file system… LevelDB on ext4 is consistent on only 158 of the 465 images reproduced; a system crash can result in being unable to open the datastore after reboot, or violate the order in which users inserted key-value pairs. Git will not recover properly on ext4 if a crash happens during 30.05 seconds of the 40 second runtime of the workload. With ccfs, we were unable to reproduce any disk state in which LevelDB or Git are inconsistent.
So ccfs clearly improves application consistency – but does it harm performance to do so? The evaluation first studies each of Git, LevelDB, and SQLite in isolation, then later on looks at mixed workload results. Single application results are shown below. The ccfs rows show what happens when the application is run as-is on ccfs; ccfs+ represents a minimal change to the application to remove fsync calls not needed with ccfs.
For the Git workload, ccfs is 80x faster, for the SQLite workload it is 685x faster, and for LevelDB it is 5x faster. Not bad eh?
Achieving correctness atop ccfs (while maintaining performance) required negligible developer overhead: we added one setstream() call to the beginning of each application, without examining the applications any further.
Here’s what happens when you run a multiple application workload (each application with its own stream).
Once more, in the ccfs+ case, we’re looking at almost 100x!
Now that you’re sufficiently motivated to dive into the details of how this feat is possible, let’s take a step back and look at the root causes of crash inconsistencies and poor performance to set the scene.
Many applications use a specialized write protocol to maintain crash consistency of their persistent data structures. The protocols, by design or accident, frequently require all writes to commit in their issued order.
The problem is, with modern file systems, they don’t. Constraining ordering leads to performance costs, and so “many file systems (including ext4, xfs, btrfs and the 4.4 BSD fast file system) re-order application writes, and some file systems commit directory operations out of order (e.g., btrfs). Lower levels of the storage stack also re-order aggressively, to reduce seeks and obtain grouping benefits.”
The hypothesis leading to ccfs (and that seems to be borne out by the results) is that application code is depending on two specific file system guarantees that aren’t actually guaranteed:
- The effect of system calls should be persisted on disk in the order they were issued by applications, and a system crash should not produce a state where system calls appear re-ordered.
- Weak atomicity guarantees when operating on directories, and when extending the size of a file during appends (i.e., the operation either takes effect completely or not at all).
Among 60 vulnerabilities found in “All file systems are not created equal,” 50 are addressed by maintaining order and weak atomicity, and the remaining 10 have minor consequences that are readily masked or fixed. The amount of re-ordering done in modern file systems seems to increase with every release, so the problem in the wild is only getting worse.
Ordering is bad for file system performance because it requires coordination, and because it prevents write avoidance optimisations (writes that never actually need to reach the disk because of future truncates or overwrites). When ordering constraints force such writes to be performed anyway – for example when fsync calls or cache eviction flush them – this is called write dependence.
Apart from removing the chance for write avoidance, write dependence also worsens application performance in surprising ways. For example, an fsync(f2) call becomes a high-latency operation, as it must wait for all previous writes to commit, not just the writes to f2. The overheads associated with write dependence can be further exacerbated by various optimizations found in modern file systems. For example, the ext4 file system uses a technique known as delayed allocation, wherein it batches together multiple file writes and then subsequently allocates blocks to files. This important optimization is defeated by forced write ordering.
Streams in CCFS
Ccfs says that with proper care in the design and implementation, an ordered and weakly atomic file system can be as efficient (or better, as we saw!) as a file system that does not provide these properties.
The key idea in ccfs is to separate each application into a stream, and maintain order only within each stream; writes from different streams are re-ordered for performance. This idea has two challenges: metadata structures and the journaling mechanism need to be separated between streams, and order needs to be maintained within each stream efficiently.
There is a simple setstream API which allows an application developer to further subdivide an application’s writes into multiple streams should they so wish.
Ccfs is based on ext4, which maintains an in-memory data structure for the running transaction. Periodically this transaction is committed, and when the commit process starts, a new running transaction is created. Ext4 therefore always has one running transaction, and at most one committing transaction. In contrast, ccfs has one running transaction per stream. When a synchronization system call (e.g. fsync) is made, only the corresponding stream’s transaction is committed. This is part of the secret to ccfs’ performance gains.
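To make this concrete, here’s a toy Python model of per-stream running transactions (all names are mine, not ccfs’s – the real mechanism lives inside the ext4 journaling layer): fsync commits only the calling stream’s transaction, leaving other streams’ pending writes free to be re-ordered.

```python
# Toy model of ccfs-style per-stream transactions (illustrative only).

class StreamFS:
    def __init__(self):
        self.running = {}      # stream -> list of pending operations
        self.disk = []         # committed operations, in commit order

    def write(self, stream, op):
        self.running.setdefault(stream, []).append(op)

    def fsync(self, stream):
        # Commit only this stream's transaction; other streams'
        # pending writes are untouched (and may commit in any order).
        self.disk.extend(self.running.pop(stream, []))

fs = StreamFS()
fs.write("A", "A:w1")
fs.write("B", "B:w1")
fs.write("A", "A:w2")
fs.fsync("A")                  # flushes only A's writes, in order
print(fs.disk)                 # ['A:w1', 'A:w2']
```

Contrast this with stock ext4, where the single running transaction means that fsync from any application drags all pending writes along with it.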
So far then we’ve talked about a model whereby causality is assumed for writes from a single application (stream), and these writes are preserved in order. There are no ordering guarantees for writes across streams though. But this isn’t quite right… suppose application A downloads a file, then application B updates it – this is an example where causality matters across streams.
The current version of ccfs considers the following operations as logically related: modifying the same regular file, explicitly modifying the same inode attributes (such as the owner attribute), updating (creating, deleting or modifying) directory entries of the same name within a directory, and creating a directory and any directory entries within that directory.
These logically related operations across streams are tracked. If the stream that performed the first operation commits before the second operation happens, all is well. Otherwise, when a cross-stream dependency on an uncommitted update is detected, ccfs temporarily merges the streams together and treats them as one until commit.
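A minimal sketch of how such cross-stream dependency tracking and merging might look (purely illustrative – the names and structure are mine, not ccfs’s):

```python
# Toy cross-stream dependency tracker: if stream B touches an entity with
# an uncommitted update from stream A, the two streams are merged and
# from then on commit together.

class DepTracker:
    def __init__(self):
        self.dirty = {}        # entity -> stream with uncommitted update
        self.merged = {}       # stream -> stream it was merged into

    def canon(self, s):
        # Follow merge links to the canonical (merged) stream.
        while self.merged.get(s, s) != s:
            s = self.merged[s]
        return s

    def update(self, stream, entity):
        stream = self.canon(stream)
        prev = self.dirty.get(entity)
        if prev is not None and self.canon(prev) != stream:
            # Dependency on an uncommitted update: merge the streams.
            self.merged[self.canon(prev)] = stream
        self.dirty[entity] = stream

    def commit(self, stream):
        stream = self.canon(stream)
        self.dirty = {e: s for e, s in self.dirty.items()
                      if self.canon(s) != stream}

t = DepTracker()
t.update("A", "/shared/file")
t.update("B", "/shared/file")   # A had not committed, so A merges into B
print(t.canon("A"))             # B
```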
To support these stream-based operations, ccfs extends ext4’s block-granularity journaling of file system metadata with an additional level of byte-granularity journaling (since different streams might exist in the same block). The result is described as hybrid-granularity journaling. By allowing streams to be separated in this way, the delayed-logging optimisation of ext4 is unaffected. Commits and checkpoints are block-granular, which preserves delayed checkpointing. Say streams A and B are both active in a block, and B wants to commit while A is still running – what will be written to disk is just the initial content of the block, with B’s changes superimposed.
Hybrid-granularity journaling works for independent metadata changes within the same block, but we also need to consider the case of multiple concurrent updates to the same file system metadata structure. For this, ccfs uses delta journaling, which works by storing command logs (add 1, subtract 5, etc.) instead of changed values. So long as the operations are commutative this gives the needed flexibility.
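Here’s a toy illustration of the delta journaling idea: because the logged commands are commutative, a shared counter (think of something like a free-block count) converges to the same value whichever stream commits first.

```python
# Toy delta journal for a shared counter: each stream logs commutative
# commands (deltas) instead of the resulting value, so the streams can
# commit in either order and still reach the same on-disk state.

def apply_deltas(initial, committed_logs):
    value = initial
    for log in committed_logs:        # one log per committed stream
        for delta in log:
            value += delta
    return value

log_a = [+1, -5]                      # stream A's commands
log_b = [+3]                          # stream B's commands

# Same final value regardless of commit order:
print(apply_deltas(100, [log_a, log_b]))   # 99
print(apply_deltas(100, [log_b, log_a]))   # 99
```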
To finish up, we still have the question of how ccfs maintains order within a stream. Ext4 uses metadata-only journaling and can re-order file appends and overwrites. Adding data journaling preserves application order for both metadata and file data – but it is slow because data is often written twice. To address this, ccfs uses a technique called selective data journaling (SDJ) that preserves the order of both data and metadata by journaling only overwritten file data.
Since modern workloads are mostly composed of appends, SDJ is significantly more efficient than journaling all updates.
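A tiny sketch of the selective-data-journaling decision, under my own simplified model of a write (not ccfs code): bytes that overwrite existing file content must be journaled, while bytes that extend the file need not be.

```python
# Toy classification of a file write for selective data journaling:
# overwritten ranges are journaled (written twice); appended ranges
# beyond the current file size go through the cheaper ordered path.

def classify_write(file_size, offset, length):
    """Return (mode, offset, length) ranges for a write."""
    if offset < file_size:
        overwrite_end = min(offset + length, file_size)
    else:
        overwrite_end = offset        # pure append: nothing overwritten
    ranges = [("journal", offset, overwrite_end - offset),
              ("ordered", overwrite_end, offset + length - overwrite_end)]
    return [r for r in ranges if r[2] > 0]

# A 4 KB write at the tail of a 6 KB file: 2 KB overwrite + 2 KB append.
print(classify_write(file_size=6144, offset=4096, length=4096))
# [('journal', 4096, 2048), ('ordered', 6144, 2048)]
```

Since append-dominated workloads mostly hit the cheap path, this matches the paper’s claim that SDJ is far more efficient than journaling all updates.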
On its own, SDJ would disable a meaningful optimisation called delayed allocation (see section 3.4); ccfs restores this ability using an order-preserving delayed allocation scheme.
Panoply: Low-TCB Linux applications with SGX enclaves – Shinde et al., NDSS 2017
Intel’s Software Guard Extensions (SGX) supports a kind of reverse sandbox. With the normal sandbox model you’re probably used to, we download untrusted code and run it in a trusted environment that we control. SGX supports running trusted code that you wrote, but in an environment you don’t control (and therefore don’t trust) – for example, a public cloud environment. SGX protects the application and its data from a compromised system. For a more detailed introduction, see “Shielding applications from an untrusted cloud with Haven.” We also recently looked at “SCONE: Securing Linux containers with Intel SGX.” Both of those papers consider the security of a single enclave process. Today’s paper, Panoply, brings in a very interesting new dimension by considering what happens when you decompose an application into multiple smaller services (called micro-containers or microns in this work), each running in its own enclave, and they need to communicate with each other.
The key design goals of Panoply are to support rich OS abstractions with a low TCB (Trusted Compute Base), and to secure inter-enclave communications. Supporting a rich subset of the POSIX interface clearly helps in porting applications to run within an enclave. And given that you’re clearly concerned about security to be considering SGX in the first place, a lower TCB makes absolute sense too (providing that you can put up with the performance overheads induced by the trade-off). It’s worth looking a little closer at the need for secure inter-enclave communications though.
The goal is to be able to divide a system up into a number of components, and run each in its own enclave. Panoply introduces an abstraction called a micro-container (micron) to support this. When two enclaves communicate with each other, their communication necessarily goes through an adversarial channel under the OS control (for example, in an IPC call).
Consider the following code snippet from the FreeTDS (Tabular Data Stream) application, allowing communication with SQLServer and Sybase.
The FreeTDS application and a trusted certificate manager are each deployed in their own separate SGX enclaves, and need to communicate:
- The OS could drop the call on line 5, meaning that the callback never gets registered (and hence the security checks on line 22 never execute)
- The call to priority_set_direct on line 22 could also be dropped, which opens up an opportunity for a session downgrade attack
- The OS could replay a message from an earlier session and also cause a downgrade (“Specifically, the OS can record the inter-enclave message transcript from a different session consisting of an invalid certificate with weaker parameters. The previous session would have failed, however the OS can replay one recorded message from that transcript in a session with a strong certificate to cause it to downgrade.“)
- The OS could drop a message on line 22 where the certificate manager returns false (signaling an invalid certificate), and return true instead.
Basically, once you allow an untrusted intermediary to intercept communications, all bets are off unless you take precautions.
These attacks highlight that applications that aren’t designed with the objective of running on an enclave abstraction will be susceptible to subtle vulnerabilities. Further, there is a gap between the SGX-native low-level guarantees (of remote attestation and memory isolation) and those needed to ensure safe end-to-end execution of applications.
Panoply consists of a set of runtime libraries (including a shim library) and a build toolchain that helps developers to prepare microns. A programmer adds per-function annotations to specify which micron should execute a function – to partition an application into multiple microns simply use multiple micron identifiers. Unmarked functions will be assigned to one specially designated ‘default’ micron. Panoply will then instrument the application and create multiple micron binaries. It will also add code for inter-micron flow integrity during this stage.
One of the key design points when working with enclaves is what should go inside the enclave (increasing the size of the TCB) and what should go outside (higher overheads for invocation). Panoply optimises for low TCB, while still allowing applications access to a rich subset of the POSIX API. Calls to glibc are intercepted, meaning that (unlike in Scone, for example) libc is outside of the enclave. Panoply performs custom checks on all system call return values inside the enclave.
… we show that the total TCB of Panoply can be about 20Kloc (excluding the original application logic), which is 2 orders of magnitude smaller than previous LibraryOS systems. We believe such a TCB is within the realm of automated verification in the near future.
Unlike many other SGX-based designs, Panoply also supports unrestricted threading as well as forking.
When microns start up Panoply assigns each a micron-identity which is used for all further inter-micron interactions. “Before starting any interactions, the Panoply shim code attests other microns by asking the processor for an attestation quote and verifying the measurement from the public attestation service.” Panoply then protects microns from attacks such as silent aborts, tampering, or replay.
- Panoply provides authenticated encryption of every incoming and outgoing message
- Panoply prevents impersonation of sender or receiver microns by verifying against the secure mapping of micron identities to instances established during the micron initialization phase. Messages from unauthorized microns are discarded and execution is aborted (so you can conduct a denial of service attack, but silent failures aren’t possible via this mode)
- Panoply uses unique sequence numbers for messages to prevent replay attacks
- Panoply adds an acknowledgement message for every inter-enclave function call. The sender micron aborts its execution if the ACK is either not received or contains an invalid session id.
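Putting those bullet points together, here is a toy Python model of sequence-numbered, authenticated inter-micron messages (my framing, not Panoply’s actual wire format – an HMAC stands in for full authenticated encryption, and the key is assumed to come from the attestation phase):

```python
import hashlib
import hmac

# Hypothetical shared secret, established during micron attestation.
KEY = b"shared-secret-established-at-attestation"

def seal(session, seq, payload):
    """Produce a message tagged with session id, sequence number and MAC."""
    mac = hmac.new(KEY, b"%d|%d|%s" % (session, seq, payload),
                   hashlib.sha256).hexdigest()
    return (session, seq, payload, mac)

def open_msg(session, expected_seq, msg):
    """Verify a received message; any failure aborts (no silent errors)."""
    s, seq, payload, mac = msg
    good = hmac.new(KEY, b"%d|%d|%s" % (s, seq, payload),
                    hashlib.sha256).hexdigest()
    if s != session or not hmac.compare_digest(mac, good):
        raise ValueError("tampered or wrong session")
    if seq != expected_seq:
        raise ValueError("replayed or dropped message")
    return payload

m = seal(session=7, seq=0, payload=b"cert-check:false")
print(open_msg(7, 0, m))                # b'cert-check:false'

# Replaying the recorded message later (when seq 1 is expected) is caught:
try:
    open_msg(7, 1, m)
except ValueError as e:
    print(e)                            # replayed or dropped message
```

This is exactly the class of check that defeats the FreeTDS replay and message-drop attacks described earlier: the receiver can tell a dropped, replayed, or substituted message apart from the genuine next message in the session.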
Panoply supports unlimited multi-threading. Most other SGX designs have a hard limit on thread pool size, because in the SGX SDK an application has to statically determine how many threads it wants. Panoply multiplexes virtual threads onto the underlying enclave threads. If a micron reaches its maximum concurrent thread limit, then Panoply launches a new micron and all shared memory operations are performed securely via a thread control manager.
Even though the SGX SDK only supports spin locks and mutexes, Panoply provides the full set of higher-level synchronization primitives (semaphores, condition variables, barriers, read/write locks) by implementing them on top of SGX’s mutex support.
For all synchronization operations which are local to a micron’s code, PANOPLY keeps the synchronization objects inside the micron. For global synchronizations across microns, PANOPLY creates a notion of inter-micron locks. The thread control manager micron holds and manages all such shared objects and all global changes are synchronized via this micron.
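As an illustration of building higher-level primitives from a mutex alone, here is a minimal counting semaphore where waiters poll the protected counter. This is a sketch of the general approach (in Python, with `threading.Lock` standing in for an SGX mutex), not Panoply’s code.

```python
import threading
import time

class MutexSemaphore:
    """Counting semaphore built from a mutex alone (no condition variable)."""

    def __init__(self, value):
        self.lock = threading.Lock()
        self.value = value

    def acquire(self):
        while True:
            with self.lock:
                if self.value > 0:
                    self.value -= 1
                    return
            time.sleep(0)      # yield; a real enclave would spin/back off

    def release(self):
        with self.lock:
            self.value += 1

sem = MutexSemaphore(2)
sem.acquire()
sem.acquire()
print(sem.value)               # 0
sem.release()
print(sem.value)               # 1
```

The polling loop is the cost of having only mutexes available; a condition variable would let waiters sleep instead.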
Forking is supported by taking a full replica of the parent micron’s data and communicating it to the child over a secure communication channel. This can be expensive, so there is also an option for a developer to explicitly annotate a subset of the data to be shared between parent and child. Shared memory communication between microns is supported using a public page as the shared memory resource and encryption based on a shared secret established using a secure inter-micron protocol. Finally, Panoply supports event listeners, callbacks, dispatch handlers and so on via hooks into the OS signals and interrupts.
Panoply was implemented on top of the Intel SDK 1.5 beta for Linux Kernel v3.13. Support for fork and exec required patching the SGX SDK and the corresponding SGX Linux driver.
Panoply in action
Panoply was used in Tor to secure the Tor Directory Authority (DA) service against a malicious OS on the DA servers. It was also used to create a privilege-separated H2O webserver with RSA operations moved to a separate micron, and to run OpenSSL inside a single micron and access it from a FreeTDS micron. On average it took 905 lines of code to port each application to Panoply.
Panoply achieves 2 orders of magnitude lower TCB as compared to state-of-the-art approaches such as Graphene-SGX and Haven. The size of compiled microns (in MB) is an order of magnitude smaller than with Graphene-SGX.
As expected though, Panoply pays for this in performance, adding a 24% overhead to an application’s execution time on average.
We can see that creating and destroying enclaves is expensive, as is copying enclave data to-and-from non-enclave memory. This effect can be seen in the chart below as the size of the served web page is increased in H2O:
A third source of the overhead though is that the hardware AES support is not correctly detected from inside the enclave, forcing a fallback to a software-only solution when using SSL. This issue is expected to be fixed.
Panoply bridges the gap between the expressiveness of the SGX-native abstractions and the requirements of feature-rich Linux applications. Panoply offers a new design point, prioritizing TCB over performance, without sacrificing compatibility. It achieves 2 orders of magnitude lower TCB than previous systems.
Enlightening the I/O Path: A holistic approach for application performance – Kim et al., FAST ’17
Lots of applications contain a mix of foreground and background tasks. Since we’re at the file system level here, for ‘application’ think Redis, MongoDB, PostgreSQL, and so on. Typically user requests are considered foreground tasks, and tasks such as housekeeping, checkpointing, etc. are run as background tasks. This leads to the challenge of providing consistently low-latency responses to user requests in the presence of mixed background workloads.
For example, background checkpointing in relational databases has been known to hinder delivering low and predictable transaction latency, but the database and operating system communities have no reasonable solution despite their collaborative efforts.
(As another example, consider what happens to your JVM when the garbage collector runs).
What about priorities you say? Yes, the solution does involve giving higher priority to critical I/O. What the authors show however is that priority information needs to flow through all layers of the storage stack (caching, file system, block, and device), and that priority inversions (where a high priority I/O ends up waiting on a low priority I/O) also need to be addressed. The request-centric I/O prioritization (RCP) scheme that they introduce does both.
This somewhat messy chart illustrates the point well. Here you’re looking at MongoDB running an update-heavy YCSB workload. The grey line is the default Linux I/O scheduler (CFQ – Completely Fair Queuing), and the green line (CFQ-IDLE) is CFQ with low priority (idle-priority) assigned to the checkpoint task. SPLIT refers to variations of the Split-level scheduler (another cross-layer I/O scheduling framework we looked at previously, although it does not consider priority inversions). QASIO does consider priority inversions, but is more limited in scope. Compared to all the other approaches, the throughput with RCP (the black line) is much more consistent, avoiding the large peaks and troughs in the CFQ line for example.
The 3,100 lines of code needed to implement RCP in Linux make a big difference: tested with PostgreSQL, MongoDB, and Redis, RCP delivered up to 53% higher request throughput and 42x better 99%-ile request latency compared to the default configuration in Linux.
Causes of inconsistent performance for requests
The layering in a typical modern storage stack is shown in the figure below (centre and left). Priority agnostic admission control in the caching and block layers can degrade application performance. In the file system layer fate sharing of buffered writes in a single global compound transaction can leave foreground tasks waiting on background task I/O. Shared block level queues in the block layer also mean that a burst of background I/Os can delay foreground I/O by filling the queue. There is further re-ordering at the device level due to the limited size of device internal queues and firmware reordering. What we really need instead is an end-to-end view of I/O priority that can be taken into account at each layer.
Priority inversions (right-hand side of the figure above) come in two flavours: task dependencies and I/O dependencies. A task dependency arises when two tasks interact with each other via synchronisation primitives such as locks and condition variables.
The dependency caused by a lock complicates effective I/O prioritization because a background task can be blocked waiting for an I/O within a critical section that a foreground task needs to enter. For instance, a foreground task attempting to write a file can be blocked on an inode mutex if the mutex is already held by a background task concurrently writing to a different part of that file.
I/O dependencies occur when a task needs to wait for the completion of an existing I/O in order to ensure correctness and/or durability. For example, a foreground fsync call may be blocked waiting for completion of a write from a kernel thread cleaning buffer caches.
Once the task and the I/O induced priority inversions occur, the foreground task could wait for a long time because each layer in the I/O path can arbitrarily prolong the processing of low-priority background I/Os.
Dealing with I/O priority inversions is made more challenging for two key reasons: (i) dependency relationships may depend on runtime conditions, such that they can’t be statically determined, and (ii) dependency occurs in a transitive manner involving multiple concurrent tasks blocked at either synchronization primitives or I/O stages in multiple layers.
How enlightenment works
Whereas typical approaches to prioritisation tend to choose either an I/O centric approach, favouring synchronous I/O over asynchronous, or a task centric approach favouring foreground I/O over background I/O, RCP instead takes a request centric approach.
It’s closest in spirit to a task-centric approach, but also allows background tasks of lower priority to become high priority when priority inversions are detected.
RCP classifies I/Os as critical and non-critical, finding that these two priorities are sufficient to obtain good performance. A critical I/O is one that is in the critical path of request handling (i.e., it can directly affect the response time). The process begins by tracking the set of tasks involved in request handling; the simple solution to this is minimal application instrumentation. It took just 15, 14, and 2 lines of code to add the necessary tagging to PostgreSQL, MongoDB, and Redis, respectively.
With foreground tasks tagged at initiation, the next step is to handle priority I/O inversions caused by runtime dependencies.
Inspired by previous work on CPU scheduling, we resolve the lock-induced dependency by inheriting critical I/O priority in a background task when it blocks a foreground task until it leaves a critical section.
Tasks that signal condition variables are labelled as helper tasks. A helper task will inherit the priority of any critical task blocked to wait for the condition to become true. In the kernel this is viable because there are only a few condition variables and associated threads.
Unlike the condition variable-induced dependency at kernel level, handling such dependency at user level is difficult because it is hard to pinpoint helper tasks for condition variables. Modern applications extensively use user-level condition variables for various purposes. For instance, we found over a hundred user-level condition variables in the source code of MongoDB. Therefore, properly identifying all helper tasks is not trivial even for an application developer.
The pragmatic solution is based on the assumption that if a task signals a condition once, it is likely to do so again in the future, and can therefore be tagged as a helper task.
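A toy model of that heuristic (the names and data structures are mine, not the paper’s kernel implementation): a task that has signalled a condition variable once is remembered as a helper, and helpers inherit critical priority whenever a critical task blocks on a condition they serve.

```python
# Toy helper-task tagging with priority inheritance, RCP-style.

helpers = set()                 # task ids ever observed signalling a condition
priority = {}                   # task id -> "critical" | "non-critical"

def on_signal(task):
    # Heuristic: a task that signalled once is likely to signal again.
    helpers.add(task)

def on_block(waiter, cond_helpers):
    # A blocked critical waiter boosts every known helper for the condition.
    if priority.get(waiter) == "critical":
        for h in cond_helpers & helpers:
            priority[h] = "critical"

priority["fg"] = "critical"
priority["bg"] = "non-critical"
on_signal("bg")                 # bg signalled once, so it is tagged a helper
on_block("fg", {"bg"})          # fg blocks on the condition: bg is boosted
print(priority["bg"])           # critical
```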
Resolving dependencies to outstanding I/Os is achieved by tracking all non-critical I/Os in the block level queue. If a dependency occurs, it is bumped to critical.
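Sketched minimally (illustrative only, not kernel code): the block-level queue tracks outstanding non-critical I/Os, and promotes one to critical when a critical task turns out to be waiting on it.

```python
# Toy block-level queue: non-critical I/Os are tracked and bumped to
# critical when a dependency on them appears.

queue = [{"id": 1, "critical": False},
         {"id": 2, "critical": True},
         {"id": 3, "critical": False}]

def wait_on(io_id):
    # A critical task (e.g. inside fsync) blocks on this outstanding I/O:
    for io in queue:
        if io["id"] == io_id and not io["critical"]:
            io["critical"] = True   # promote past other non-critical I/Os

wait_on(3)
print([io["critical"] for io in queue])   # [False, True, True]
```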
With an understanding of critical and non-critical I/Os in place through a combination of initial tagging and priority inversion detection, the final piece of the puzzle is adapting each layer to make it understand and enforce the I/O criticality.
- In the caching layer, separate dirty ratios are applied to tasks issuing critical and non-critical writes (much lower for non-critical writes). This prevents background tasks from filling all the available space when a burst of background writes arrives.
- In the block layer, admission control is also modified to have separate slots for critical and non-critical I/Os. A similar priority-based I/O scheduler is also employed at this layer.
- In order to minimise interference at the device level, the number of non-critical I/Os dispatched to a storage device is conservatively limited.
See sections 4.3 and 4.4 in the paper for the full details.
The scheme was implemented in 3100 lines of code and tested with PostgreSQL, MongoDB, and Redis.
RCP improves throughput in all three systems. For example, with PostgreSQL RCP outperforms existing schemes by 15-37% (10GB dataset), 8-28% (60GB dataset), and 6-31% (200GB dataset).
In order to show the effectiveness of RCP in terms of request latency, we ran a rate-controlled TPC-C workload (i.e., fixed number of transactions) with 100 scale factor. Figure 6 demonstrates a complementary cumulative distribution function (CCDF), and so the point (x,y) indicates that y is the fraction of requests that experience a latency of at least x ms. This form of representation helps visualizing latency tails, as y-axis labels correspond to the 0th, 90th, 99th (and so on) percentile latencies.
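For reference, computing a CCDF of this form from a list of request latencies is straightforward:

```python
# Complementary CDF as used in the paper's latency figures: a point (x, y)
# means a fraction y of requests experienced a latency of at least x ms.

def ccdf(latencies_ms):
    xs = sorted(latencies_ms)
    n = len(xs)
    return [(x, (n - i) / n) for i, x in enumerate(xs)]

lat = [1, 1, 2, 5, 50]          # one 50 ms straggler among five requests
for x, y in ccdf(lat):
    print(x, y)                 # tail point: (50, 0.2) -> 20% saw >= 50 ms
```

Plotting y on a log scale is what makes the axis labels line up with the 0th, 90th, 99th (and so on) percentiles.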
You can clearly see in the figure above that RCP does a much better job of curtailing tail latencies.
Here’s a similar chart for MongoDB using a YCSB workload:
The authors also conduct an interesting experiment in which they selectively disable one layer’s components in the scheme at a time. The results fully support the end-to-end argument: disabling the optimisation at any one layer degrades overall application throughput by 7-33% and 6-45% for PostgreSQL and MongoDB respectively.
The results all look great, but there’s just one snag:
Our prototype implementation involves modifications to all the layers and the synchronization methods in the I/O path, thereby hindering our scheme from wide adoption. The most promising direction to be practically viable is exploiting the Split framework. It provides a collection of handlers for an I/O scheduler to operate across all layers in the I/O path. We believe our scheme can be cleanly implemented based on the framework by controlling non-critical writes at the system-call level, before the caching and file system layer generates complex dependencies, and non-critical reads at the block-level.
We covered the Split framework on The Morning Paper at the tail end of 2015.
Chronix: Long term storage and retrieval technology for anomaly detection in operational data – Lautenschlager et al., FAST 2017
Chronix (http://www.chronix.io/) is a time-series database optimised to support anomaly detection. It supports a multi-dimensional generic time series data model and has built-in high-level functions for time series operations. Chronix also uses a scheme called “Date-Delta-Compaction” (DDC) for timestamps, as well as data compression.
On real-world operational data from industry and on queries that analyze these data, Chronix outperforms general-purpose time series databases by far. With a small memory footprint, it saves 20%-68% of the storage space, and it saves 80%-92% on data retrieval time and 73%-97% of the runtime of analyzing functions.
Time series requirements for anomaly detection
To support anomaly detection over time series, the authors cite three key requirements:
- a generic data model that allows exploratory analysis over all types of operational data
- built-in analysis functions
- time and space efficient lossless storage.
Chronix supports all three (surprise!), whereas other popular time series databases are lacking in one or more dimensions:
… while the traditional time series databases support explorative and correlating analyses on scalar values, they often fail to do so efficiently for generic time series data (e.g. logs, traces, etc.). InfluxDB also supports strings and booleans, KairosDB is extensible with custom types. Both lack operators and functions and hence only partly fulfil the requirements.
In table 2 below you can see that in addition to basic query functions supported by many of the databases, Chronix also has built-in support for higher level functions such as SAX, Dynamic Time Warping, outlier detection and so on. Without built-in support, these functions have to be hand-built on top of the database in the other stores. This gives Chronix a significant speed advantage.
Chronix stores ordered chunks of time series data (timestamp and value pairs) of arbitrary type in blobs. Type information defines the available functions over the data. Each such blob is called a record and also has two timestamps for the start and end of the time range covered by the chunk, together with a set of user defined attributes (for example, for recording the origin of the data).
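As a concrete sketch of the record model just described (the field names here are my own guesses for illustration, not Chronix’s actual schema):

```python
# Illustrative sketch only: field names are my own, not Chronix's schema.
from dataclasses import dataclass, field

@dataclass
class Record:
    data: bytes            # blob: ordered chunk of (timestamp, value) pairs
    series_type: str       # e.g. "metric" or "log"; determines available functions
    start: int             # first timestamp covered by the chunk
    end: int               # last timestamp covered by the chunk
    attributes: dict = field(default_factory=dict)  # user-defined, e.g. data origin
```

The start/end timestamps let queries skip whole records without decompressing the blob, and the attributes give a place to hang dimensions such as host or process.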
Chronix builds on top of Apache Solr, and queries are Solr queries which may also reference Chronix functions.
The storage model is described as “functionally-lossless.” By this the authors mean that anything useful for anomaly detection is preserved. A user can specify for example that logging statements from an application framework (vs from the application code itself) will never be of interest, in which case these are not kept. A second example of functional-lossless storage is the timestamp approximation through Date-Delta-Compaction that we’ll look at shortly.
The overall flow of Chronix’s pipeline has four main stages (plus a Chronix Ingester that performs input buffering):
The Optional Transformation can enhance or extend what is actually stored in addition to the raw data, or instead of it. The goal of this optional phase is an optimized format that better supports use-case specific analyses. For example, it is often easier to identify patterns in time series when a symbolic representation is used to store the numerical values.
Compression is in three stages: first the time series values can be compressed as there are often only small changes between subsequent data points; secondly the timestamps themselves can be compressed, and then finally the resulting binary data fields can be run through a standard compression algorithm. The first two forms of compression seem to be a (simpler?) variation on the compression that we looked at for Facebook’s Gorilla (now available in open source as Beringei). See also the delta-delta coding used by BTrDB.
Chronix’ Date-Delta-Compaction (DDC) for timestamps exploits the fact that in its domain the exact timestamps do not matter that much, at least if the difference between the expected timestamp and the actual timestamp is not too large.
Chronix will drop timestamps if it can almost exactly reproduce them. A timestamp is considered reproducible if the expected next timestamp and actual next timestamp differ by less than some configurable threshold. DDC also keeps track of the accumulated drift and stores a correcting delta if ever it gets too large.
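A minimal sketch of the idea, as I understand it from the description above (this is my own reconstruction, not Chronix’s actual implementation; `threshold` plays the role of the configurable DDC threshold):

```python
# My own reconstruction of Date-Delta-Compaction from the paper's description.
# A timestamp is dropped when it can be reproduced (to within `threshold`)
# from the previous reconstructed timestamp plus the expected delta; otherwise
# an exact correcting delta is stored, which also resets the accumulated drift.

def ddc_compress(timestamps, threshold=200):
    start, expected_delta = timestamps[0], timestamps[1] - timestamps[0]
    corrections = []  # (index, exact delta) pairs for non-reproducible points
    reconstructed = start
    for i, t in enumerate(timestamps[1:], start=1):
        drift = t - (reconstructed + expected_delta)  # accumulated deviation
        if abs(drift) < threshold:
            reconstructed += expected_delta           # drop this timestamp
        else:
            corrections.append((i, t - reconstructed))
            reconstructed = t
    return (start, expected_delta), corrections

def ddc_decompress(header, corrections, n):
    start, expected_delta = header
    corr, out = dict(corrections), [start]
    for i in range(1, n):
        out.append(out[-1] + corr.get(i, expected_delta))
    return out
```

Under this scheme every reconstructed timestamp differs from the original by less than the threshold, and a regularly spaced series needs almost no stored timestamps at all.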
Binary data fields are compressed using a configurable compression algorithm (because different algorithms work better with different data sets). The available choices are bzip2, gzip, LZ4, Snappy, and XZ. You can also plug in your own should you so wish.
Chronix is based on Apache Solr as both the document-oriented storage format of the underlying reverse-index Apache Lucene and the query opportunities match Chronix’ requirements. Furthermore Lucene applies a lossless compression (LZ4) to all stored fields in order to reduce the index size and Chronix inherits the scalability of Solr as it runs in a distributed mode called SolrCloud that implements load balancing, data distribution, fault tolerance, etc.
Configuration tuning (commissioning)
Deploying an instance of Chronix means choosing three key parameter values: the DDC threshold, the chunk size, and the compression algorithm. Chronix supports the administrator in finding suitable values for these parameters.
DDC threshold selection is based on a plot that compares inaccuracy rates against average deviations of timestamps for different threshold values. The inaccuracy rate is the percentage of inaccurately reconstructed timestamps, and the average deviation shows how far those inaccurate timestamps are from the actual values, on average.
The default DDC threshold is set to 200ms by experimentation.
Chunk size and compression technique are selected in tandem. First a minimal chunk size is found by looking at the chunk size where compression rate saturates across a range of algorithms:
Then for a range of chunk sizes above the found minimum, tests are run to see how long it takes to find a record, ship it to the client, and reconstruct the raw data. Compression techniques with a slow average access to single records are dropped at this stage (e.g. XZ and bzip2 in the plot below):
Finally the system considers all combinations of the remaining compression techniques and chunk sizes to select the best one.
The evaluation compares Chronix to InfluxDB, OpenTSDB, and KairosDB using 108 GB of operational data from two real-world applications. You can see that Chronix has a much smaller memory footprint, as well as significantly reduced storage requirements:
Chronix only needs 8.7GB to store the 108.2GB of raw time series data. Compared to the general purpose time series databases Chronix saves 20%-68% of the storage demand.
Table 8 below shows the benefits of Chronix’ built-in functions.
MaMaDroid: Detecting Android malware by building Markov chains of behavioral models, Mariconti et al., NDSS 2017
Pick any security conference of your choosing, and you’re sure to find plenty of papers examining the security of Android. It can paint a pretty bleak picture, but at the same time the Android ecosystem also seems to have the most active research when it comes to detecting and preventing malware. Today’s choice comes from that side of the fence. I must admit, I was surprised at how well the technique works: it seems on the surface that a lot of important detailed information is abstracted away, but it turns out that doesn’t matter and actually helps to make the solution more robust. One to file away in the memory banks, as I’m sure the ‘less is more’ realisation must have applicability in other circumstances too.
MaMaDroid is a system for detecting malware on Android.
We evaluate MaMaDroid’s accuracy on a dataset of 8.5K benign and 35.5K malicious apps collected over a period of six years, showing that it not only effectively detects malware (with up to 99% F-measure), but also that the model built by the system keeps its detection capabilities for long periods of time (on average, 87% and 73% F-measure, respectively, one and two years after training).
The problem: there’s a lot of Android malware out there
There’s also a lot of Android out there! “In the first quarter of 2016, 85% of smartphone sales were devices running Android.” That ubiquity makes it a popular attack target. Detecting malware on-device is battery draining, so Android malware detection is typically performed by Google in a centralized fashion by analysing apps that are submitted to the Play store.
However, many malicious apps manage to avoid detection (see e.g., ‘Google Play has hundreds of Android apps that contain malware‘), and anyway Android’s openness enables manufacturers and users to install apps that come from third-party market places, which might not perform any malware checks at all, or anyway not as accurately.
Malware detection needs to be relatively efficient, and it also needs to be robust over time – as new APIs are added into the platform, and older ones deprecated and removed, we’d like the detection techniques to continue working well.
Analysing call sequences
MaMaDroid works in four phases: first the byte code is examined to recover the call graph, then sequences of API calls are extracted from the graph. These sequences are modelled as Markov chains representing the probability that a call will be made to a certain destination when in a given state. The probabilities in the Markov model then become the feature vector for a classifier that finally pronounces the app as malware or not.
The static analysis to extract the call graph is done using the Soot framework, in conjunction with FlowDroid. Starting with the nodes in the call graph that have no incoming edges (entry nodes), the paths reachable from each entry node are traced (which may mean exploring multiple branches of conditionals etc.).
The set of all paths identified during this phase constitute the sequences of API calls which will be used to build a Markov chain behavioural model and to extract features.
Here comes the part that initially surprised me: instead of retaining the raw API call details (e.g. to class or even method), the nodes are abstracted either to package or even just to family. In package mode, a node in the graph is abstracted to just its package name (there are 243 Android packages, plus 95 from the Google API, plus the special package names ‘self-defined’ for any user-defined code, and ‘obfuscated’ for obfuscated package names). In family mode, every single call node is reduced down to just one of 11 possible values: android, google, java, javax, xml, apache, junit, json, dom, self-defined, and obfuscated (i.e., the root of the package name). Even in family mode, the malware detection works surprisingly well!
This allows the system to be resilient to API changes and achieve scalability. In fact, our experiments, presented in section III, show that, from a dataset of 44K apps, we extract more than 10 million unique API calls, which would result in a very large number of nodes, with the corresponding graphs (and feature vectors) being quite sparse.
Now that we have these call sequences we can build a Markov chain where each package/family is a state node, and transitions represent the probability (based on simple occurrence counts) of moving from one state to another.
Next, we use the probabilities of transitioning from one state (abstracted call) to another in the Markov chain as the feature vector of each app. States that are not present in a chain are represented as 0 in the feature vector. Also note that the vector derived from the Markov chain depends on the operational mode of MAMADROID. With families, there are 11 possible states, thus 121 possible transitions in each chain, while, when abstracting to packages, there are 340 states and 115,600 possible transitions.
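To make the family-mode pipeline concrete, here is a toy sketch (entirely my own illustrative code, not the authors’): abstract each call to its family, count transitions along the call sequences, normalise each row of the transition matrix into probabilities, and flatten the result into the 11 × 11 = 121-element feature vector.

```python
# Toy sketch of MaMaDroid's family-mode feature extraction (my own code,
# not the authors'). Obfuscation detection is omitted for simplicity:
# any unrecognised package root is treated as 'self-defined' here.

FAMILIES = ["android", "google", "java", "javax", "xml", "apache",
            "junit", "json", "dom", "self-defined", "obfuscated"]

def family_of(call, known=frozenset(FAMILIES[:9])):
    root = call.split(".")[0]
    return root if root in known else "self-defined"

def feature_vector(call_sequences):
    n = len(FAMILIES)
    idx = {f: i for i, f in enumerate(FAMILIES)}
    counts = [[0.0] * n for _ in range(n)]
    for seq in call_sequences:
        fams = [family_of(c) for c in seq]
        for a, b in zip(fams, fams[1:]):       # consecutive calls = transition
            counts[idx[a]][idx[b]] += 1
    # normalise each row into transition probabilities; flatten to 121 features
    vec = []
    for row in counts:
        total = sum(row)
        vec.extend([x / total if total else 0.0 for x in row])
    return vec
```

The resulting vector (one probability per possible transition, zero for absent states) is what gets fed to the classifier.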
Given this feature vector, a classifier is trained: Random Forests, 1-Nearest Neighbour, 3-Nearest Neighbour, and SVMs are all experimented with. SVMs performed the worst and are not shown in the results. The authors also experiment with applying PCA (Principal Component Analysis) to reduce the feature space.
The authors gathered a collection of 43,940 Android apps, 8,447 benign and 35,493 malware apps. This included a mix of apps from October 2010 to May 2016, enabling the robustness of classification over time to be explored. All experiments used 10-fold cross-validation (do 10 times: shuffle the dataset, divide it into 10 parts, train using 9 of them, and test on the other one). Accuracy is presented using the F-score (2 × precision × recall / (precision + recall)).
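Written out in code with the precedence made explicit (the denominator is the sum of precision and recall), the F-score is just the harmonic mean of the two:

```python
def f_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

So the reported "up to 99% F-measure" requires both precision and recall to be high: either one dropping drags the harmonic mean down.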
Here are the resulting scores when using ‘family’ mode:
and the more accurate ‘package’ mode:
As Android evolves over the years, so do the characteristics of both benign and malicious apps. Such evolution must be taken into account when evaluating Android malware detection systems, since their accuracy might significantly be affected as newer APIs are released and/or as malicious developers modify their strategies in order to avoid detection. Evaluating this aspect constitutes one of our research questions, and one of the reasons why our datasets span across multiple years (2010–2016).
To see how well MaMaDroid does in this scenario, the team used older samples as training sets, and tested on newer samples. Here are the results for year gaps of up to four years (for both family mode and package mode):
We also need to test the reverse direction – can older malware be detected by a system training using newer samples? The answer is yes, with similar F-measure scores across the years (tested up to four year gaps) ranging from 95-97% in package mode.
In family mode, it takes 11 (malware) and 23 (benign) seconds on average to classify an app from start to finish. In package mode, this goes up to 14 and 34 seconds respectively. Most of the time is spent on call graph extraction. With an estimate of up to 10,000 apps a day being submitted to Google Play, that’s a combined execution time of about one-and-a-half hours to scan them all.
Our work yields important insights around the use of API calls in malicious apps, showing that, by modeling the sequence of API calls made by an app as a Markov chain, we can successfully capture the behavioral model of that app. This allows MAMADROID to obtain high accuracy overall, as well as to retain it over the years, which is crucial due to the continuous evolution of the Android ecosystem.
Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions Ganesan et al., FAST 2017
It’s a tough life being the developer of a distributed datastore. Thanks to the wonderful work of Kyle Kingsbury (aka, @aphyr) and his efforts on Jepsen.io, awareness of data loss and related issues in the face of network delays and partitions is way higher than it used to be. Under the Jepsen tests, every single system tested other than ZooKeeper exhibited problems of some kind. But many teams are stepping up, commissioning tests, and generally working hard to fix the issues even if they’re not all there yet.
So I’m sure they’ll all be delighted to know that there’s a new horror in town. Network partitions aren’t the only kind of nasty that distributed datastores can be subjected to. File systems / storage devices can exhibit corruptions and errors. And just like network partitions, these aren’t as rare as the designers of datastores would have you believe. So what happens if you take eight popular distributed stores, and subject them not to a file-system related stress test, but just to one single itsy-bitsy local filesystem fault? Carnage!
The most important overarching lesson from our study is this: a single file-system fault can induce catastrophic outcomes in most modern distributed storage systems. Despite the presence of checksums, redundancy, and other resiliency methods prevalent in distributed storage, a single untimely file-system fault can lead to data loss, corruption, unavailability, and, in some cases, the spread of corruption to other intact replicas.
And this time around, not even ZooKeeper escaped unscathed.
File system faults
File systems can encounter faults for a variety of underlying causes including media errors, mechanical and electrical problems in the disk, bugs in firmware, and problems in the bus controller. Sometimes, corruptions can arise due to software bugs in other parts of the operating system, device drivers, and sometimes even due to bugs in file systems themselves. Due to these reasons, two problems arise for file systems: block errors, where certain blocks are inaccessible (also called latent sector errors) and block corruptions, where certain blocks do not contain the expected data.
A study of 1 million disk drives over a period of 32 months showed that 8.5% of near-line disks, and 1.9% of enterprise class disks developed one or more latent sector errors. Flash-based SSDs show similar error rates. Block corruptions (for example caused by bit-rot that goes undetected by the in-disk ECC) may corrupt blocks in ways not detectable by the disk itself. File systems in many cases silently return these corrupted blocks to applications. A study of 1.53 million drives over 41 months showed more than 400,000 blocks with checksum mismatches. Anecdotal evidence has also shown the prevalence of storage errors and corruptions (see e.g. ‘Data corruption is worse than you know‘).
Given the frequency of storage corruptions and errors, there is a non-negligible probability for file systems to encounter such faults.
In many cases file systems simply pass faults (errors or corrupted data) onto applications as-is, in a few other cases the file system may transform the fault into a different one. In either case, throughout the paper these are referred to as file-system faults.
Given that local file systems can return corrupted data or errors, the responsibility of data integrity and proper error handling falls to applications, as they care about safely storing and managing critical user data…. The behaviour of modern distributed storage systems in response to file-system faults is critical and strongly affects cloud-based services.
Errfs – the test harness that strikes fear in the heart of distributed storage systems
CORDS is a fault-injection system consisting of errfs, a FUSE file system, and errbench, a set of workloads and a behaviour inference script for each system under test.
Once a system has been initialized to a known state, the application (distributed datastore) is configured to run on top of errfs by specifying its mount point as the data directory of the application. All reads and writes then flow through errfs, which can inject faults. Errfs can inject two different types of corruptions (zeros or junk) and three different types of errors (EIO on reads, EIO on writes, and ENOSPC/EDQUOT on writes that require additional space).
The fault injection is about as kind as it’s possible to be:
… our model considers injecting exactly a single fault to a single file-system block in a single node at a time. While correlated file-system faults are interesting, we focus on the most basic case of injecting a single fault in a single node because our fault model intends to give maximum recovery leeway for applications.
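Errfs itself is a FUSE file system; purely to illustrate the fault model (this is my own toy code, not CORDS or errfs), here is a read-path wrapper that injects a single fault into a single block of a single file:

```python
# Toy illustration of the errfs fault model (my own code, not CORDS/errfs):
# inject exactly one fault type into one block of one file on the read path.
import errno
import os

BLOCK = 4096

class FaultyReader:
    def __init__(self, path, fault=None, target_block=0):
        self.path, self.fault, self.target = path, fault, target_block

    def read_block(self, block_no):
        with open(self.path, "rb") as f:
            f.seek(block_no * BLOCK)
            data = f.read(BLOCK)
        if block_no != self.target or self.fault is None:
            return data                              # all other blocks are intact
        if self.fault == "read-error":
            raise OSError(errno.EIO, os.strerror(errno.EIO))
        if self.fault == "corrupt-zero":
            return b"\x00" * len(data)               # block replaced with zeros
        if self.fault == "corrupt-junk":
            return bytes(b ^ 0xFF for b in data)     # block replaced with junk
        return data
```

The real errfs additionally covers the write path (EIO and ENOSPC/EDQUOT on writes), but the shape is the same: one fault, one block, one node.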
Is errfs realistic?
All the bugs that we find can occur on XFS and all ext file systems including ext4, the default Linux file system. Given that these file systems are commonly used as local file systems in replicas of large distributed storage deployments and recommended by developers, our findings have important implications for such real-world deployments.
Given the single fault, the authors test for the following expected behaviours:
- committed data should not be lost
- queries should not silently return corrupted data
- the cluster should be available for reads and writes
- queries should not fail after retries
Using CORDS, the authors tested Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MongoDB, LogCabin, and CockroachDB. All systems were configured with a cluster of three nodes, and a replication factor of 3.
OK, so this isn’t good….
We find that these systems can silently return corrupted data to users, lose data, propagate corrupted data to intact replicas, become unavailable, or return an unexpected error on queries. For example, a single write error during log initialization can cause write unavailability in ZooKeeper. Similarly, corrupted data in one node in Redis and Cassandra can be propagated to other intact replicas. In Kafka and RethinkDB, corruption in one node can cause a user-visible data loss.
Section 4 of the paper contains a system-by-system breakdown of the problems that the testing uncovered. The results are summarised in this monster table, but you’ll be better off reading the detailed description for any particular system of interest.
Here are some examples of problems found.
- Corruption propagation in Redis, and write unavailability in ZooKeeper:
- Corruption propagation in Cassandra:
- Data loss in Kafka and RethinkDB
Five key observations
Taking a step back from the individual system problems the authors present a series of five observations with respect to data integrity and error handling across all eight systems.
- Even though the systems employ diverse data integrity strategies, all of them exhibit undesired behaviours.
Sometimes, seemingly unrelated configuration settings affect data integrity. For example, in Cassandra, checksums are verified only as a side effect of enabling compression. Due to this behavior, corruptions are not detected or fixed when compression is turned off, leading to user-visible silent corruption. We also find that a few systems use inappropriate checksum algorithms. For example, ZooKeeper uses Adler32 which is suited only for error detection after decompression and can have collisions for very short strings.
- On the local node, faults are often undetected, and even if detected crashing is the most common local reaction. Sometimes this leads to an immediate harmful global effect. After crashing, simple restarting may not help if the fault is sticky – nodes repeatedly crash until manual intervention fixes the underlying problem.
For instance, in Redis, corruptions in the appendonly file of the leader are undetected, leading to global silent corruption… Likewise, RethinkDB does not detect corruptions in the transaction head on the leader which leads to a global user-visible data loss.
- Redundancy is under-utilised, and a single fault can have disastrous cluster-wide effects… Even if only a small amount of data becomes lost or corrupted, an inordinate amount of data can be affected.
Contrary to the widespread expectation that redundancy in distributed systems can help recover from single faults, we observe that even a single error or corruption can cause adverse cluster-wide problems such as total unavailability, silent corruption, and loss or inaccessibility of inordinate amount of data. Almost all systems in many cases do not use redundancy as a source of recovery and miss opportunities of using other intact replicas for recovering. Notice that all the bugs and undesirable behaviors that we discover in our study are due to injecting only a single fault in a single node at a time. Given that the data and functionality are replicated, ideally, none of the undesirable behaviors should manifest.
- Crash and corruption handling are entangled – the detection and recovery code of many systems does not distinguish between crashes and corruptions. (They should!)
- Nuances in commonly used distributed protocols can spread corruption or data loss… “we find that subtleties in the implementation of commonly used distributed protocols such as leader-election, read repair and resynchronization can propagate corruption or data loss.“
How can we improve the situation?
- “As more deployments move to the cloud where reliable storage hardware, firmware, and software might not be the reality, storage systems need to start employing end-to-end integrity strategies.“
- “…we believe that recovery code in distributed systems is not rigorously tested, contributing to undesirable behaviors. Although many systems employ checksums and other techniques, recovery code that exercises such machinery is not carefully tested. We advocate future distributed systems need to rigorously test failure recovery code using fault injection frameworks such as ours.“
- The body of research work on enterprise storage systems provides guidance on how to tackle partial faults, “such wisdom has not filtered down to commodity distributed storage systems.” (and it needs to!)
- We’re going to need a fairly fundamental redesign in some cases:
…although redundancy is effectively used to provide improved availability, it remains underutilized as a source of recovery from file-system and other partial faults. To effectively use redundancy, first, the on-disk data structures have to be carefully designed so that corrupted or inaccessible parts of data can be identified. Next, corruption recovery has to be decoupled from crash recovery to fix only the corrupted or inaccessible portions of data. Sometimes, recovering the corrupted data might be impossible if the intact replicas are not reachable. In such cases, the outcome should be defined by design rather than left as an implementation detail.
Developers of all the systems under test were contacted regarding the problems uncovered by this testing. RethinkDB are changing their design to include application level checksums. The ZooKeeper write unavailability bug was also discovered in the wild and has recently been fixed.
Related to this paper, anyone building systems that interact with a file system API should also read “All file systems are not created equal.”
Thou shalt not depend on me: analysing the use of outdated JavaScript libraries on the web Lauinger et al., NDSS 2017
Let’s briefly look at how the authors collected their data before diving deeper into what the results themselves tell us.
Data gathering methodology
The team crawled the Alexa Top 75K websites (ALEXA) and also a random sample of 75K websites drawn from the .com zone (COM). Their methodology involved:
- Manually constructing a catalogue of all released versions of the 72 most popular open source libraries (using popularity statistics from Bower and Wappalyzer).
- Using static and dynamic analysis techniques to cope with the fact that developers often reformat, restructure, or append code, making it difficult to detect library usage in the wild.
- Implementing an in-browser causality tracker to understand why specific libraries are loaded by a given site.
To figure out why a certain library is being loaded, the authors develop a causality tree Chrome extension. Nodes in the tree are snapshots of elements in the DOM at a specific point in time, and edges denote “created by” relationships.
The median causality tree in ALEXA contains 133 nodes, and the median depth is 4 inclusions.
jQuery remains by far the most popular library, found on 84.5% of ALEXA sites. Note also SWFObject (Adobe Flash) still used on 10.7% of ALEXA sites despite being discontinued in 2013.
When externally loaded, scripts are mostly loaded from CDNs (note also the domain parking sites popping up in the long tail of the COM sites):
Overall though, there seems to be a pretty even split between internally hosted and CDN-delivered script libraries:
Distribution of vulnerable libraries
37.8% of ALEXA sites use at least one library version known to the authors to be vulnerable.
Highly-ranked websites tend to be less likely to include vulnerable libraries, but they are also less likely to include any detected library at all. Towards the lower ranks, both curves increase at a similar pace until they stabilise. While only 21 % of the Top 100 websites use a known vulnerable library, this percentage increases to 32.2 % in the Top 1 k before it stabilises in the Top 5 k and remains around the overall average of 37.8 % for all 75 k websites.
37.4% of the COM sites use at least one vulnerable library. Within the ALEXA grouping, financial and government sites are the worst, with 52% and 50% of sites containing vulnerable libraries respectively.
The following table shows the percentage of vulnerable copies in the wild for jQuery, jQ-UI, Angular, Handlebars, and YUI 3.
In ALEXA, 36.7% of jQuery inclusions are known vulnerable, when at most one inclusion of a specific library version is counted per site. Angular has 40.1% vulnerable inclusions, Handlebars has 86.6%, and YUI 3 has 87.3% (it is not maintained any more). These numbers illustrate that inclusions of known vulnerable versions can make up even a majority of all inclusions of a library.
Many libraries it turns out are not directly included by the site, but are pulled in by other libraries that are. “Library inclusions by ad, widget, or tracker code appear to be more vulnerable than unrelated inclusions.”
Another interesting analysis is the age of the included libraries – the data clearly shows that the majority of web sites use library versions released a long time ago, suggesting that developers rarely update their library dependencies once they have deployed a site. 61.7% of ALEXA sites are at least one patch version behind on one of their included libraries, and the median ALEXA site uses a version released 1,177 days before the newest release of the library. Literally years out of date.
If you like a little non-determinism in your web app (I find it always makes debugging much more exciting 😉 ), then another interesting find is that many sites include the same libraries (and multiple versions thereof) many times over!
We discuss some examples using jQuery as a case study. About 20.7 % of the websites including jQuery in ALEXA (17.2 % in COM) do so two or more times. While it may be necessary to include a library multiple times within different documents from different origins, 4.2 % of websites using jQuery in ALEXA include the same version of the library two or more times into the same document (5.1 % in COM), and 10.9 % (5.7 %) include two or more different versions of jQuery into the same document. Since jQuery registers itself as a window-global variable, unless special steps are taken only the last loaded and executed instance can be used by client code. For asynchronously included instances, it may even be difficult to predict which version will prevail in the end.
What can be done?
So where does all this leave us?
From a remediation perspective, the picture painted by our data is bleak. We observe that only a very small fraction of potentially vulnerable sites (2.8 % in ALEXA, 1.6 % in COM) could become free of vulnerabilities by applying patch-level updates, i.e., an update of the least significant version component, such as from 1.2.3 to 1.2.4, which would generally be expected to be backwards compatible. The vast majority of sites would need to install at least one library with a more recent major or minor version, which might necessitate additional code changes due to incompatibilities.
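To make the notion of “fixable by patch-level updates” concrete, here is a sketch (my own names and structure, not the authors’ analysis code): a site qualifies only if every vulnerable library it includes has a safe later release within the same major.minor line.

```python
# Illustrative sketch (my own code): can a site escape all known-vulnerable
# library versions using only patch-level (same major.minor) updates?

def parse(version):
    return tuple(int(x) for x in version.split("."))

def fixable_by_patch(included, vulnerable, releases):
    """included: {lib: version in use}; vulnerable: {lib: set of bad versions};
    releases: {lib: list of all released versions}."""
    for lib, ver in included.items():
        if ver not in vulnerable.get(lib, set()):
            continue                          # this inclusion is already safe
        maj, mnr, pat = parse(ver)
        # is there any same-major.minor, higher-patch release that is safe?
        safe_patch = any(
            parse(r)[:2] == (maj, mnr) and parse(r)[2] > pat
            and r not in vulnerable[lib]
            for r in releases.get(lib, [])
        )
        if not safe_patch:
            return False                      # needs a minor/major upgrade
    return True
```

By the paper’s numbers, only 2.8% of potentially vulnerable ALEXA sites pass a check like this; the rest need minor or major upgrades, with the attendant risk of breaking changes.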
Version aliasing could potentially help (specifying only a library prefix, and allowing the CDN to return the latest version), but only a tiny percentage of sites use it (would you trust the developers of those libraries not to break your site, completely outside of your control?). Note that:
Google recently discontinued this service, citing caching issues and “lack of compatibility between even minor versions.”
We need proper dependency management which makes it clear which versions of libraries are being used, coupled with knowledge within the supply chain of vulnerabilities. “This functionality would ideally be integrated into the dependency management system of the platform so that a warning can be shown each time a developer includes a known vulnerable component from the central repository.”
Of course, that can only work if we have some way of figuring out which libraries are vulnerable in the first place. The state of the practice here is pretty damning:
Consider jQuery, one of the most widely used libraries:
Since we also know that many libraries are only indirectly loaded by web sites, and are brought in through third-party components such as advertising, tracking, and social media code, even web developers trying to stay on top of the situation may be unaware that they are indirectly introducing vulnerable code into their websites.