
Replex: A scalable, highly available multi-index data store

October 27, 2016

Replex: A scalable, highly available multi-index data store Tai et al. USENIX 2016

Today’s choice won a best paper award at USENIX this year. Replex addresses the problem of key-value stores in which you also want to have an efficient query capability by values other than the primary key.

… NoSQL databases achieve scalability by supporting a much simpler query model, typically by a single primary key. This simplification made scaling NoSQL databases easy: by using the key to divide the data into partitions or “shards”, the datastore could be efficiently mapped onto multiple nodes. Unfortunately, this model is inconvenient for programmers, who often still need to query data by a value other than the primary key.

How do you do that? Typically by sending the query to all partitions and gathering the results. This has poor performance as the number of partitions grows. If we had secondary indices, we could go straight to the appropriate partitions. But maintaining secondary indices brings other problems – to maintain data integrity constraints distributed transactions are often used for update operations. Doing this for every update can cripple system performance though.

Tai et al. address this conundrum in the following way: first, they introduce the notion of a replex, which combines the functions of a replica and an index in one; and secondly they use chain replication to validate inserts across all replexes. To improve recovery times, and throughput during recovery, they also introduce a concept of hybrid replexes. We’ll get to those later.

Replex reconsiders multi-index data stores from the bottom-up, showing that implementing secondary indexes can be inexpensive if treated as a first-order concern. Central to achieving negligible overhead is a novel replication scheme which considers fault-tolerance, availability, and indexing simultaneously. We have described this scheme and its parameters and have shown through our experimental results that we outperform HyperDex and Cassandra, state-of-the-art NoSQL systems, by as much as 10×.

I ended up having to work through an example to get all the ideas straight in my head, and I’m going to introduce the main ideas in the paper through that example.

Consider a people table with columns id, name, city, weight, and height. For some reason, alongside the primary key (id), we also want to be able to do efficient lookups by city and by height.

Replex supports a particular type of predicate query, lookup(R), where R is a row with one entry per column, each entry being either nil (matches any value in the corresponding column) or a predicate on that column’s value. For example:

    lookup((nil, nil, ='London', nil, >195))

is the Replex equivalent of

select * from people 
where city='London' and height > 195

Thus Replex as described only supports conjunctions. Furthermore, you can only query (specify predicates) on columns which are indexed. So we’ll need secondary indices on city and height. Now in Replex, every index stores a full copy of each row – thus it also serves as an independent replica of the data, just with the data sorted in a different order. This insight is what allows Replex to combine the functions of index and replica. If you elected to have, say, 3 replicas of your data for fault-tolerance, then with Replex you can essentially get 2 secondary indices for free!

Suppose we have four partitions of the data. In our worked example we will have three replexes (the Id replex, the City replex, and the Height replex), each with four partitions, like so:

To answer a query, a routing function h(R) returns, for each replex, the partitions that should be consulted. For the example query above, this would be the H-M partition of the City replex, and the ‘> 190’ partition of the Height replex. (Yes, that’s a terrible partitioning for heights, but let’s just roll with it…).
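As a rough sketch of this routing (the function names and partition boundaries below are just the worked example’s, not Replex’s actual API), each replex has its own sharding function and a lookup only consults the partitions that can match:

```python
def h_city(city):
    """Map a city name to one of four City-replex partitions (A-G, H-M, N-R, S-Z)."""
    first = city[0].upper()
    return 0 if first <= 'G' else 1 if first <= 'M' else 2 if first <= 'R' else 3

def h_height(height):
    """Map a height in cm to one of four Height-replex partitions."""
    return 0 if height < 150 else 1 if height <= 170 else 2 if height <= 190 else 3

def partitions_for_lookup(city=None, min_height=None):
    """Return the (replex, partition) pairs to consult for a conjunctive lookup.

    Simplification: we assume a range predicate falls within a single partition;
    in general it could span several.
    """
    targets = []
    if city is not None:
        targets.append(('city', h_city(city)))
    if min_height is not None:
        targets.append(('height', h_height(min_height)))
    return targets

# The example query: city = 'London' and height > 195
print(partitions_for_lookup(city='London', min_height=195))  # [('city', 1), ('height', 3)]
```

Only two of the eight replex partitions need to be touched, rather than broadcasting the query to every partition.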

Replacing replicas with replexes requires a modified replication protocol. The difficulty arises because individual replexes can have requirements, such as uniqueness constraints, that cause the same operation to be both valid and invalid depending on the replex. Hence before an operation can be replicated, a consensus decision must be made among the replexes to agree on the validity of an operation.

Replex reaches this consensus using chain replication:

In the consensus phase the operation is propagated along the chain of replexes, collecting valid or invalid decisions. The decision is made by the partition within the replex that will ultimately accept the insert. When the end of the chain is reached, the decisions are ANDed, and the result is sent back along the chain in a replication phase. As each replex discovers the decision, it commits the insert (assuming the decision is to commit) and then passes the result back up the chain.

It is guaranteed that when the client sees the result of the operation, all partitions will agree on the outcome of the operation, and if the operation is valid, all partitions will have made the decision durable.
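A minimal sketch of the two-phase chain protocol (this is my illustration, not the authors’ implementation; the uniqueness constraint and class names are hypothetical):

```python
class ReplexPartition:
    """One partition of a replex. May carry a per-replex validity constraint."""
    def __init__(self, name, unique_on=None):
        self.name = name
        self.rows = []
        self.unique_on = unique_on  # column index with a uniqueness constraint, if any

    def is_valid(self, row):
        if self.unique_on is None:
            return True
        return all(r[self.unique_on] != row[self.unique_on] for r in self.rows)

    def commit(self, row):
        self.rows.append(row)

def chain_insert(chain, row):
    # Consensus phase: propagate down the chain, collecting decisions.
    decisions = [p.is_valid(row) for p in chain]
    outcome = all(decisions)  # decisions are ANDed at the end of the chain
    # Replication phase: walk back up the chain, committing if valid.
    if outcome:
        for p in reversed(chain):
            p.commit(row)
    return outcome

id_replex = ReplexPartition('id', unique_on=0)     # primary key must be unique
city_replex = ReplexPartition('city')
chain = [id_replex, city_replex]

assert chain_insert(chain, (1, 'Ada', 'London')) is True
assert chain_insert(chain, (1, 'Bob', 'Hull')) is False  # duplicate id: rejected everywhere
```

Because the commit only happens once the ANDed decision flows back, all partitions agree on the outcome by the time the client sees a response.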

A consequence of this approach is that writes are not allowed during failure.

So far so good, but now we need to consider what happens when things do start failing…

Failed partitions bring up two concerns: how to reconstruct the failed partition and how to respond to queries that would have been serviced by the failed partition.

Both of these problems can be solved as long as the system knows how to find data stored on the failed partition. The problem is that even though two replexes contain the same data, they have different sharding functions, so the replicated data is scattered differently.

Different solutions have different failure amplifications – the overhead of finding data when a partition is unavailable. Consider the straightforward scheme of keeping more than one copy of a given replex partition (replicas):

In the example above, this improves failure amplification for A (queries don’t have to route to all partitions of B under failure), but not for B. To eliminate failure amplification of a single failure on both replexes, we’d need to create exact replicas of both, doubling all storage and network overheads. And that’s just with two replexes…

The solution to this problem is hybrid replexes, which provide a degree of shared failure tolerance and amplification across multiple replexes. Let’s see how it works with our City and Height replexes. Each has four partitions, numbered 0..3. Let’s assume a crude partitioning by first letter of cities, and height windows for heights, just to make it easier to follow along:

We introduce a hybrid replex which sits alongside the City and Height ones. This will have the same number of partitions (4), but we use a hybrid partitioning strategy to decide which rows will go in each partition.

We have a hash function h<sub>city</sub> to decide which partition a row belongs to in the City replex (A-G -> 0, H-M -> 1, N-R -> 2, S-Z -> 3), and a hash function h<sub>height</sub> to decide which partition a row belongs to in the Height replex (< 150 -> 0, 150-170 -> 1, 170-190 -> 2, > 190 -> 3). To decide which partition to place a row in within the hybrid replex we use the formula:

h<sub>hybrid</sub>(r) = 2(h<sub>city</sub>(r) mod 2) + (h<sub>height</sub>(r) mod 2)

Consider all rows where the city name starts with A-G. Regardless of the height, these will end up in either partition 0 or partition 1.

Repeat the process for all combinations and you end up with the following:

The first hybrid partition for example will contain all rows where

(City in A-G || City in N-R) && (Height < 150 || Height in 170-190)

The result of such a scheme is that should we lose any one City or Height partition from a replex, we only have to look in two partitions of the hybrid replex to find a matching row, rather than in all 4 as was the case with standard replication.
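The formula and the failure-lookup claim can be checked with a small sketch (repeating the example sharding functions from above; all names here are illustrative):

```python
def h_city(city):
    """A-G -> 0, H-M -> 1, N-R -> 2, S-Z -> 3."""
    first = city[0].upper()
    return 0 if first <= 'G' else 1 if first <= 'M' else 2 if first <= 'R' else 3

def h_height(height):
    """< 150 -> 0, 150-170 -> 1, 170-190 -> 2, > 190 -> 3."""
    return 0 if height < 150 else 1 if height <= 170 else 2 if height <= 190 else 3

def h_hybrid(city, height):
    """The 2-sharing hybrid partitioning from the text."""
    return 2 * (h_city(city) % 2) + (h_height(height) % 2)

# Rows with cities in A-G (h_city = 0) always land in hybrid partition 0 or 1:
assert h_hybrid('Geneva', 140) == 0   # h_height = 0
assert h_hybrid('Geneva', 165) == 1   # h_height = 1
assert h_hybrid('Geneva', 200) == 1   # h_height = 3, 3 mod 2 = 1

# If City partition i fails, only hybrid partitions {2*(i%2), 2*(i%2)+1}
# can hold its rows -- 2 partitions to consult, rather than all 4:
def hybrid_partitions_for_failed_city(i):
    return [2 * (i % 2), 2 * (i % 2) + 1]

print(hybrid_partitions_for_failed_city(1))  # [2, 3]
```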

You can generalize this ‘2-sharing’ scheme and parameterize a hybrid replex by n<sub>1</sub> and n<sub>2</sub>, where n<sub>1</sub>·n<sub>2</sub> = p and p is the number of partitions, using:

h<sub>hybrid</sub>(r) = n<sub>2</sub>(h<sub>1</sub>(r) mod n<sub>1</sub>) + (h<sub>2</sub>(r) mod n<sub>2</sub>)

This lets you tune which replex is most important from an availability / throughput under recovery perspective. The hybrid layer enables only O(√p) amplification under failure rather than O(p). Of course there are extensions for more than 2 replexes, and even configurations with multiple layers of hybrid replexes, see the full paper for details.
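A hedged sketch of the generalized scheme (assuming the hybrid placement n<sub>2</sub>(h<sub>1</sub> mod n<sub>1</sub>) + (h<sub>2</sub> mod n<sub>2</sub>), which generalizes the 2-sharing formula above; the function names are mine, not the paper’s):

```python
import math

def hybrid_partitions_for_failed(i, n1, n2, failed_replex):
    """Hybrid partitions that can hold rows of failed partition i.

    failed_replex=1: rows share the same (h1 mod n1) prefix -> n2 candidates.
    failed_replex=2: rows share the same (h2 mod n2) suffix -> n1 candidates.
    """
    if failed_replex == 1:
        prefix = i % n1
        return [n2 * prefix + j for j in range(n2)]
    else:
        suffix = i % n2
        return [n2 * k + suffix for k in range(n1)]

# With p = 4 and n1 = n2 = 2, any single failure touches only 2 hybrid partitions:
assert hybrid_partitions_for_failed(1, 2, 2, failed_replex=1) == [2, 3]
assert hybrid_partitions_for_failed(3, 2, 2, failed_replex=2) == [1, 3]

# In general, choosing n1 = n2 = sqrt(p) gives O(sqrt(p)) amplification:
p, n = 16, 4
assert len(hybrid_partitions_for_failed(5, n, n, failed_replex=1)) == int(math.sqrt(p))
```

Skewing n<sub>1</sub> versus n<sub>2</sub> trades amplification between the two replexes, which is the tuning knob mentioned above.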

Replex was implemented on top of HyperDex, and gave read latency equivalent to HyperDex, and insert latency less than half of HyperDex’s (when configured with the same number of secondary indices).

During recovery, a Replex-2 system (no hybrid replex layer) recovered the fastest, but had poor throughput during recovery. A Replex-3 system (2 secondary indices and one hybrid layer) gives the best balance of recovery time and throughput:

There are many other experiments in the paper evaluating relative performance in which Replex performs favourably.

The authors conclude:

… we have demonstrated not only that a multi-index scalable, high-availability NoSQL datastore is possible, it is the better choice.

Preemptive intrusion detection: theoretical framework and real-world measurements

October 26, 2016

Preemptive intrusion detection: theoretical framework and real-world measurements Cao et al, HotSoS 2015

Phuong Cao (the first author of this paper) got in touch following my review of DeepDive to say “Thanks for the review on DeepDive. I was inspired by that paper to apply factor graph on detecting intrusions at an early stage…” Preemptive intrusion detection details the results.

AttackTagger is a factor-graph based framework for preemptive detection of system attacks (i.e., before any system misuse has actually occurred). The evaluation was performed against the logs from real attacks on NCSA over a six-year period. AttackTagger detected 74% of the attacks, the majority before the point at which system misuse occurred.

The problem

Attacks on a computer system take place in a number of stages. For example, in 2008 attackers logged onto a gateway node at NCSA using compromised account credentials, they downloaded an executable, compiled it and escalated to root privileges by exploiting a kernel bug on the compromised node. They then injected credential-collecting code into SSHd and forced it to restart. The credentials of anyone logging into the gateway node were subsequently leaked to the attackers.

Not all of these activities were captured in the system logs, but some were:

Many Intrusion Detection Systems (IDSs) use per-event classifiers, which use rules or signatures to identify malicious users based on individual events. AttackTagger reasons about the stages of an attack collectively, which gives it greater predictive power.  For example, in the scenario above, a remote login by itself is not enough to tag the corresponding user state as malicious, but we might tag it as ‘suspicious.’ Later activities can then be tagged differently for ‘suspicious’ users than for ‘benign’ users.

The goal is to identify malicious users as soon as possible, and ideally before any damage is done.

A quick factor graph recap

The family of probabilistic graphical model techniques for modelling graph-based dependencies among random variables includes Bayesian Networks, Markov Random Fields, and Factor Graphs. In section 3 of the paper Cao et al. compare these three approaches and show that Factor Graphs are both simpler and more expressive for their use case.

A Factor Graph has two types of nodes: variable nodes, and factor nodes. The factor nodes represent functions identifying functional relationships between variables. An edge connects a factor node and a variable node if that variable is used by the factor function.

The factor functions can be used to represent causal or non-causal relations. AttackTagger uses factor functions to capture the relations between a user state and events, between a user state and earlier events or states, and between a user state and a user profile:

  • event-state factor nodes f<sub>e</sub>(s,e) capture the relation between an event e and the hidden state variables s
  • state-state factor nodes f<sub>s</sub>(e<sup>t-1</sup>,e<sup>t</sup>,s<sup>t-1</sup>,s<sup>t</sup>) capture the relation between hidden states now (t) and earlier (t-1), and previous events (t-1) and an event now (t).
  • user-state factor nodes f<sub>u</sub>(U, e<sup>t-1</sup>, s<sup>t-1</sup>) capture the relation between a user profile U, an event, and a hidden state.

A user state may be benign, suspicious, or malicious.

A factor graph model of an attack looks something like this:

Variable nodes (circles) correspond to either observed variables (U, e) or hidden variables (s).

Examples of user profile attributes and events include:

How AttackTagger works

We’ll talk about how AttackTagger builds factor graphs, and what factor graphs it builds in just a moment. For the moment, let’s just assume we have one.

To identify malicious users, AttackTagger infers the most probable values of the user state in the sequence S<sup>t</sup> using the constructed factor graph. Specifically, if the user state s<sup>t</sup> = suspicious, then the user is allowed to operate in the target system under close monitoring (e.g. logging network traffic of the user or logging user commands); if the user state s<sup>t</sup> = malicious, the user is identified as an attacker and actions are taken to disconnect the user from the target system (e.g. terminating the user’s active network connections or disabling the user account). Our inference is based on the joint probability distribution on the Factor Graph.

You can use a brute-force approach to calculate the values of the hidden variables (their probabilities) but this is costly. Instead, as was done in DeepDive, AttackTagger uses Gibbs sampling, which can produce results in near real-time.

A Gibbs sampler runs over N iterations. It starts with a random user state sequence at iteration 0. At iteration 1, it samples a new user state, starting at user state s<sup>0</sup>. That sampling process is conditioned on the value of the previous user state sequence and the Factor Graph. In the next step, this sampling process is repeated for the next user state s<sup>i</sup> and so forth, until it reaches the last user state s<sup>n</sup>. That concludes the sampling process for a user state sequence at iteration 1. The Gibbs sampler repeats the iteration process and stops when it reaches one of two termination conditions: i) N iterations, or ii) the user state sequence converges.
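A toy sketch of this sampling loop (my illustration, not AttackTagger’s code; the events, states, and factor scores below are hypothetical):

```python
import random

STATES = ['benign', 'suspicious', 'malicious']

def f_event(state, event):
    """Event-state factor: hypothetical unnormalised scores for how well a
    state explains an observed event."""
    scores = {
        ('login', 'benign'): 2.0, ('login', 'suspicious'): 1.0,
        ('login', 'malicious'): 0.5,
        ('download_exploit', 'benign'): 0.1, ('download_exploit', 'suspicious'): 2.0,
        ('download_exploit', 'malicious'): 2.0,
        ('escalate_privilege', 'benign'): 0.1, ('escalate_privilege', 'suspicious'): 1.0,
        ('escalate_privilege', 'malicious'): 3.0,
    }
    return scores[(event, state)]

def f_state(prev_state, state):
    """State-state factor: a user's state tends not to regress towards benign."""
    order = {s: i for i, s in enumerate(STATES)}
    return 2.0 if order[state] >= order[prev_state] else 0.2

def gibbs(events, n_iters=200, seed=0):
    rng = random.Random(seed)
    states = [rng.choice(STATES) for _ in events]  # iteration 0: random sequence
    for _ in range(n_iters):
        for t, event in enumerate(events):
            # Resample state t conditioned on its neighbours and the event.
            weights = []
            for s in STATES:
                w = f_event(s, event)
                if t > 0:
                    w *= f_state(states[t - 1], s)
                if t + 1 < len(events):
                    w *= f_state(s, states[t + 1])
                weights.append(w)
            states[t] = rng.choices(STATES, weights=weights)[0]
    return states

print(gibbs(['login', 'download_exploit', 'escalate_privilege']))
```

With factors like these, the sampler typically settles on a sequence that escalates towards ‘malicious’ as the incriminating events accumulate.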

So that’s the heart of the system. Now let’s look at the overall flow.

Events are extracted from user logs, and then experts define the set of factor functions to be used for the system in question. For example, we might define event-state factor functions of the form “if e happens then s”, such as “if a user downloads a file with a sensitive extension then the user is suspicious.” That would look like this:

Given a sequence of user events E and a defined set of factor functions F, a Factor Graph is automatically constructed for the user, namely a per-user factor graph. Each factor connects its corresponding user events and user states.
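The expert-defined “if e happens then s” rule described above might be sketched as a factor function like this (the extension list, event shape, and weights are all hypothetical, not from the paper):

```python
SENSITIVE_EXTENSIONS = {'.c', '.sh', '.so'}  # hypothetical sensitive extensions

def f_sensitive_download(event, state):
    """Event-state factor: scores how well `state` explains `event`.

    Fires with a high score when a download of a sensitively-named file
    is paired with the 'suspicious' state.
    """
    is_sensitive = (event['type'] == 'download' and
                    any(event['file'].endswith(ext) for ext in SENSITIVE_EXTENSIONS))
    if is_sensitive and state == 'suspicious':
        return 1.0   # factor fires: evidence for 'suspicious'
    return 0.1       # weak default score otherwise

event = {'type': 'download', 'file': 'exploit.c'}
assert f_sensitive_download(event, 'suspicious') == 1.0
assert f_sensitive_download(event, 'benign') == 0.1
```

A library of such functions, one per expert rule, is what gets wired into the per-user factor graph as events arrive.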

As events are observed, the factor graph evolves over time:

Given a per-user Factor Graph, a possible sequence of user states S is automatically evaluated through a summation of the weighted factor functions F… The sequence of user states that has the highest score represents the most probable sequence corresponding to the event sequence observed until that point.

Compromised users are determined when the user state at the time of an observation is malicious.

Using our framework, security analysts can quickly examine user states to identify the transition of a user from benign to suspicious and malicious, without having to manually examine a large amount of raw logs. As a result, security analysts have more time to respond to security incidents or to increase additional monitoring of suspicious users to uncover potentially unauthorized activities.

Detecting attacks on the NCSA

AttackTagger was evaluated using data from 116 real-world security incidents observed at NCSA between 2008 and 2013. The incidents included tampering with system services to steal credentials, misusing systems to build botnets, send spam etc., and remote exploitation to gain system shells. Most attacks were multi-staged and spanned 24 to 72 hours. The constructed factor graphs were dense with many edges, since the factor functions link all of the events in the user event sequence. The weights for all factor functions were just set to 1, no training was performed to tune them. Despite this use of equal factor weights, the model still performed well:

Our model was able to detect most of the malicious users (74.2%) relatively early (i.e. before system misuse). More importantly, our model uncovered a set of six hidden malicious users, which had not been discovered by NCSA security analysts.

Here’s an example of how user states progressed during detection of an attack:

16 out of 62 malicious users were not detected by AttackTagger – mostly due to a lack of events being observed in the logs.  13 out of 1,253 users were wrongly recorded as malicious (false positives).  AttackTagger took 530ms on average to classify an event.

Purposes, concepts, misfits, and a redesign of Git

October 25, 2016

Purposes, concepts, misfits, and a redesign of Git De Rosso & Jackson OOPSLA ’16

It’s OOPSLA in just a few weeks’ time, and ‘Purposes, concepts, misfits, and a redesign of Git’ will be presented there, bringing the story we looked at yesterday up to date. The subject matter is the same of course – looking at the extent to which Git’s conceptual model causes confusion in its interface – but the thinking framework around concepts and concept modelling has matured, and the coverage of Git’s model is more comprehensive. There is more validation too – an analysis of 2400 Stack Overflow questions relating to Git (as of July 18th 2016), and a user study with existing Git users.

I have mixed feelings about that study. The tasks that the users were asked to do, and which are documented in Appendix A, are really quite interesting in the way they reveal some of Git’s corner cases (worth a read). But the total user study involved only 11 people, and they each interacted with the tools for one hour. 11 user tests lasting one hour each seems a very small sample indeed for a project that has been running for 3 years now. That’s enough to draw some qualitative conclusions on the version of the tool used at the time, but not enough to produce any quantitative results (‘How many test users in a usability study?’). I think we can say at this point that the analysis of the difficulties with Git and how they tie back to concepts seems sound. Yet the question we really want answered – does a redesign based around a simpler, cleaner conceptual model produce a tool that is substantially easier to use – remains open. We have an intuition that it does, and some early results to back that up, but nothing of statistical significance.

The authors found a significant polarisation of reactions to their work, which group are you in?

In sharing our research with colleagues… we have discovered a significant polarization. Experts, who are deeply familiar with the product, have learned its many intricacies, developed complex, customized workflows, and regularly exploit its most elaborate features, are often defensive and resistant to the suggestion that the design has flaws. In contrast, less intensive users, who have given up on understanding the product, and rely on only a handful of memorised commands, are so frustrated by their experience that an analysis like ours seems to them belabouring the obvious.

A Conceptual design framework

Conceptual design is concerned with the selection and shaping of the essential concepts in a system. This is in contrast with representation design which concerns how to embody those concepts in the code.  The authors’ theory of conceptual design has evolved to include a study of motivating purposes and operational principles.

A concept is something you need to understand in order to use an application… and is invented to solve a particular problem which is called the motivating purpose.  A concept is defined by an operational principle, which is a scenario that illustrates how the concept fulfills its motivating purpose. The operational principle is thus a very partial description of the behavior associated with the concept, but focused on the particular aspect of behavior that motivated the introduction of the concept in the first place.

My interpretation: there are a number of goals/tasks the end user has, and the purpose of the system is to support them in achieving those goals. If a concept is necessary to fulfil a given purpose, then that purpose is its motivating purpose – the reason we need the concept. Each concept should be accompanied by a scenario that illustrates how that concept is necessary to fulfil its motivating purpose.

For example, Mac OS has the concept of a trash can. The motivating purpose of the trash can is enabling users to recover deleted files. The operational principle is that any time a file is deleted it is not permanently removed, but instead placed in a special trash folder from which it can be restored until the trash is emptied.

A concept may not be entirely fit for purpose. In that case, one or more operational misfits are used to explain why. The operational misfit usually does not contradict the operational principle, but presents a different scenario in which the prescribed behavior does not meet a desired goal.

In other words, operational misfits serve to highlight weaknesses in the current conceptual design. You often don’t find them until later in the process. “A good design process, however, should aim to identify misfits as early as possible.”

How do you find misfits early? The authors recommend analysing a design along five dimensions:

  • Concept Motivation – each concept should be motivated by at least one purpose. Prune concepts which only arise from implementation concerns.
  • Concept Coherence – each concept should be motivated by at most one purpose. Concepts motivated by multiple purposes struggle to fulfil any one purpose well.
  • Purpose Fulfilment – each purpose should motivate at least one concept. If not, a purpose representing a real need will not be fulfilled.
  • Non-division of purposes  – each purpose should motivate at most one concept. When the same purpose motivates different concepts in different contexts, confusion tends to arise.
  • Decoupling – concepts should not interfere with one another’s fulfilment of purpose.

Which seems to boil down to: there is a one-to-one mapping between concepts and purposes, and concepts should be orthogonal. I’d like to see a few worked examples to explore e.g. the rule that a purpose cannot motivate more than one concept.

Let’s take a look at these ideas in the context of Git.

Operational Misfits in Git

We looked at some of these yesterday. See §3 for more details.

  • Saving changes in the middle of a long task without creating an incomplete commit.
  • Switching branches when you have uncommitted changes
  • Inadvertently ending up with a ‘detached head’ when trying to go back to an old commit
  • Renaming files that also contain significant other changes
  • Creating and adding a new file, working on it, and then committing it – this will commit the original version you added, not the latest
  • Removing a previously tracked file from tracking
  • Creating an empty directory (requires the creation of a token file)

Concepts and Purposes in Git

Why do we have version control? The authors identify six top-level purposes:

  1. To make a set of changes persistent
  2. To group logically related changes
  3. To record coherent points in the development
  4. To synchronize changes among collaborators
  5. To support parallel development
  6. To work in disconnected mode

According to their own rules therefore, we should have exactly six concepts as well.

A good concept should have a compelling purpose: that purpose is what allows users to grasp the concept, and allows developers to focus on designing the concept to fulfill that purpose. In contrast, the conventional view is that the concepts of an application fulfill one or more purposes only in aggregate; there need be no rationale for the design of any single concept except that it plays some role in the larger whole. Our view is that this amorphous relationship between concepts and purposes is what has hindered the kind of design analysis we are attempting here, and that the approach of assigning purposes to concepts not only immediately highlights some discrepancies, but also provides a decomposition that makes deeper analysis possible. In particular, when concepts seem to have no direct mapping to a given purpose, their motivation is questionable, and one might wonder whether they are needed at all

What set of concepts does Git have?

  • Repositories
  • Commits
  • Working Directory
  • Staging Area
  • Stash
  • References
  • File classifications

Which we can map to purposes as follows:


The misfits can be explained in terms of that mapping:

  • The concept of commit is connected to two purposes (P1 and P2), and this tangling causes the ‘saving changes’ misfit.
  • The working directory, staging area, and branch concepts are not decoupled, this causes the branching misfit.
  • The stash concept turns out not to be motivated by any purpose.
  • The staging area and file classification concepts are coupled, leading to problems creating, adding, and removing files.
  • The difficulty in removing a previously untracked file from tracking stems from a single purpose (prevent committing of untracked files – yes, that wasn’t in their list of purposes…) motivating two concepts: ignored files, and untracked files.
  • The file rename and empty directory misfits are violations of the fulfilment requirement – they seem to point to a missing concept.

Gitless revisited

As we saw yesterday, Gitless has a smaller number of concepts. I really wish the authors had provided the new purpose -> concept mapping for Gitless, but as it is we’re going to have to try to recover that for ourselves. The purposes remain unchanged of course. The concepts in Gitless are:

  • Repository
  • Commit
  • Working Directory
  • Branch
  • File Classification

The stash and staging area are gone, there are a reduced number of file classifications, the concept of branch is redefined and  the notion of a ‘current branch’ is introduced.  See yesterday’s write-up and the Gitless website for more details.

Misfits ‘file tracking’ and ‘untracking a file’ are prevented by the elimination of the ‘assume unchanged’ classification and the staging area. The switching branches misfit is prevented by the redefinition of the branch concept and the elimination of stashes. The current branch notion and the redefinition of head prevent the detached head misfit.

How do we complete this diagram though?

To be honest, I’m not sure. And yet this would seem to be crucial to the authors’ methodology.

What’s wrong with Git? A conceptual design analysis

October 24, 2016

What’s wrong with Git? A conceptual design analysis De Rosso & Jackson Onward! 2013

We finished up last week talking about how to find good concepts / abstractions in a software design and what good modularization looks like. Today’s paper jumps 40+ years to look at some of those issues in a modern context and a tool that many readers of this blog will be very familiar with: Git. With many thanks to Glyn Normington for the recommendation.

The best software designers already know how important it is to discover or invent the right concepts, and that rough edges in these concepts and their relationships will lead to rough edges in the delivered product… Despite this there have been few attempts to study the design of concepts and their impact in software.

The authors chose to use Git to explore the role of concepts in design – it is widely known and used, and at the same time known to be confusing and difficult to learn. Are the usability problems of git just on the surface, in the expression of the commands, or do they run deeper?

We believe that the usability problems of Git run deeper, and can only be addressed by a more fundamental reworking that considers the underlying concepts. This is not to say that such a reworking cannot be achieved in a lightweight fashion. Indeed, our own reworking hides Git behind a veneer just as these other tools do. The difference however, is that we are willing to completely reconsider the core concepts that are presented to the user.

The results of the reworking are made available in a tool called gitless, which I’ve installed on my system to try out for a few days. (Note: if you use oh-my-zsh with the git plugin then this defines an alias for gl which you’ll need to unalias). As of this paper (2013), Gitless was only just beginning as a project, but it continues to this day and tomorrow we’ll look at the 2016 paper that brings the story up to date.

The kinds of concepts the authors are interested in are those which are essential to the design, to an understanding of the workings of the system, and hence will be apparent in the external interface of the system, as well as in the implementation.

In our view, concepts have psychological content; they correspond to how users think about the application. Conceptual design for us, therefore, is about the design of user-visible behavior, and not the design of internal software structure, and we use the term ‘conceptual model’ for a specification that focuses on concepts rather than the details of behavior.

Fred Brooks described conceptual integrity as “the most important consideration in system design” and gave three principles upon which it is based:

  • orthogonality – individual functions (concepts) should be independent of one another
  • propriety – a product should have only the functions (concepts) essential to its purpose and no more, and
  • generality – a single function (concept) should be usable in many ways.

The authors add a fourth: consistency – requiring for example that actions behave in a similar way irrespective of the arguments they are presented with or the states in which they are invoked.

The paper then proceeds as follows: first the authors build a conceptual model for git as it is today, and then they show how the conceptual model leads to a number of difficulties experienced by git users. Finally they introduce gitless, which is based on a simpler conceptual model.

The concepts behind git

Ideally, a description of a conceptual design should be implementation-independent; it should be easy to understand; it should be precise enough to support objective analysis; and it should be lightweight, presenting little inertia to the exploration of different points in the design space.

The authors’ tool of choice for this is a kind of concept-relationship diagram. Here’s the model they built for Git:

It’s kind of complicated to explain – which is really the authors’ point. Though reading through their description did remind me of some of the subtleties in Git. As I imagine many people do, I’ve evolved my own basic way of using git and try not to stray too far from that. Much of the complexity in Git’s conceptual model comes from the variety of different states files can be in, and the roles they can play. Files may be tracked or untracked, marked as ‘assumed unchanged’, ignored, staged, stashed and more. Distinguishing between the working version of a file and the staged version of a file is a common cause of confusion. For example, if I follow the sequence:

    git add foo.rb
    # continue working in foo.rb
    git commit foo.rb

Then the version of foo.rb that gets committed is the version as of the time of the add command (because add copies foo.rb to staging, and commit works from the copies in staging). Here are a few of the key git commands and how they relate to the concepts of staging and working copies:

  • git add _f_ causes the working version of a file f to be copied to staging, and if the file was untracked, it also becomes tracked.
  • git commit causes all versions of files that are staged to become committed
  • git reset restores a file to a previous state. E.g. git reset HEAD f for a file f removes its association with the staged version, makes it untracked if it was untracked before being staged, and makes it tracked if it was previously an assumed unchanged file. (The authors don’t model the variations of reset that work on history).
  • git checkout _f_ reverts local changes to file f by replacing the working version with the staged version if there is one, or with the committed version otherwise.  The same command is also used to switch branches…
  • git stash takes the working version of each file (with modifications) and makes it a stashed version in a new stash….
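The staging behaviour described above can be made concrete with a toy model (all names here are mine – this mimics the conceptual model, not git’s implementation):

```python
# A toy model of git's file states: each tracked file has a working
# version (on disk), an optional staged version, and a committed version.
class Repo:
    def __init__(self):
        self.working = {}    # filename -> contents on disk
        self.staged = {}     # filename -> contents snapshotted by `add`
        self.committed = {}  # filename -> contents at last commit

    def edit(self, f, contents):
        self.working[f] = contents

    def add(self, f):
        # `git add` copies the working version into staging
        self.staged[f] = self.working[f]

    def commit(self):
        # `git commit` records the staged copies, not the working ones
        self.committed.update(self.staged)
        self.staged.clear()

repo = Repo()
repo.edit("foo.rb", "v1")
repo.add("foo.rb")          # staging now holds "v1"
repo.edit("foo.rb", "v2")   # keep working after the add
repo.commit()
print(repo.committed["foo.rb"])  # -> v1, not v2
```

Because commit reads from staging, the edits made after the add are silently left out of the commit.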

The very complexity of the conceptual model we have described suggests to us that something is amiss in the design of Git. Can a version control system really require so many, and such intricate concepts? And before we have even considered synchronizing multiple repositories?

How git’s ‘surprises’ are rooted in violations of conceptual integrity

Here’s one example from each of Brooks’s categories:

Orthogonality

The staged and working versions of a file are coupled in complex ways, and Git commands that one might expect to affect only one often affect the other too. One way to get into trouble is to add a file, then continue working on it before committing. If you then decide to undo the add with a reset command, then depending on the arguments not just the staged version but also the working version will be replaced (wiping out the work done since the add).

Why this happens: the concepts of staged and working versions are not orthogonal. Many commands that are primarily intended to modify one of the two modify the other too. Worse, whether or not these ripple effects occur depends on which arguments are presented to the commands.

Propriety

Committing files is non-trivial in git, in large part because of the intrusion of the concept of staging.

The git commit command takes versions from staging, not the working version. The git commit -a variation almost gets around this (it performs an implicit add prior to commit), but it won’t pick up untracked files and can only be used to commit all tracked files with modifications. In short, it’s too blunt a tool.

Some Git enthusiasts make arguments for the value of the staging concept, but for most users it offers inessential functionality, and its presence complicates the use of Git considerably.

Generality

Switching branches… and those painful conflicts which can occur when there are uncommitted changes.

Branches are intended to support independent lines of development. A line of development comprises both the working versions of files and committed versions. And yet, in Git’s conceptual model, only the committed versions are organized by branch; while there are potentially multiple committed versions of a file (one per branch), there can only be one working version. There is thus a lack of generality, with the branching feature essentially available only in one area and not the other.

Gitless – a better Git?

Gitless is the OSS project maintained by the authors that first cleans up the conceptual model, and then puts a user interface on top.

The key differences between Git and Gitless are: (a) the elimination of the concept of Assumed Unchanged File, and the generalization of Untracked File to subsume it; (b) the elimination of staged versions; (c) the elimination of the concept of Stash, and stashed versions; (d) indexing of working versions by branch. Switching branches in Gitless is equivalent to always creating a stash (more precisely, of executing git stash --all) before switching to a different branch in Git, and then retrieving this stash when the user switches back to the original branch. Thus, it is as if there are multiple working directories (one for each branch), or in other words, one can think of it as a file potentially having several working versions accessible via a branch name b (noted as working[b] in the diagram). This means that the user can freely switch from branch to branch without having to stash or commit unfinished changes. We believe this lives up to the expectation of a branch being an independent line of development.
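The working[b] idea can be sketched in a few lines (hypothetical names; this models the concept, not Gitless’s actual implementation):

```python
# Working versions indexed by branch name: switching branches never
# clobbers unfinished work and never demands a stash or a commit.
class GitlessRepo:
    def __init__(self):
        self.branch = "master"
        self.working = {"master": {}}  # branch -> {filename: contents}

    def edit(self, f, contents):
        self.working[self.branch][f] = contents

    def switch(self, branch):
        # each branch keeps its own working directory snapshot
        self.working.setdefault(branch, {})
        self.branch = branch

repo = GitlessRepo()
repo.edit("foo.rb", "wip on master")
repo.switch("feature")                 # unfinished work is untouched
repo.edit("foo.rb", "wip on feature")
repo.switch("master")
print(repo.working["master"]["foo.rb"])  # -> wip on master
```

Each branch really is an independent line of development: the working version travels with the branch.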

If not all of the above makes sense to you, then Gitless is probably a good choice! The conceptual model of Gitless looks like this:

At the point this paper was released, only very early evaluation studies had been done of the Gitless model, but “the elimination of the staging area was enthusiastically received as a major reduction in complexity.”

It’s an interesting project since it relates to a tool many of us use on a daily basis. But beyond the specifics of Git, one of the takeaways for me is the reminder that good usability is deeply rooted in the conceptual design, and is far from just being about the presentation layer.

Most usability experts seem to recognize the importance of a product’s conceptual model, and the problems that arise when the user’s mental model and the correct conceptual model of the product diverge.

The classic work on this last point is of course Donald Norman’s “The Design of Everyday Things.”

A design methodology for reliable software systems

October 21, 2016

A design methodology for reliable software systems Liskov 1972

We’ve come to the end of Liskov’s list. The final paper is by Barbara Liskov herself, on the question of how best to go about designing software systems so that we can have some confidence they will work.

The unfortunate fact is that the standard approach to building systems, involving extensive debugging, has not proved successful in producing reliable software, and there is no reason to suppose it ever will.

So we’re going to need some testing, and for high levels of confidence we’ll need good coverage via:

  • a complete but minimal set of test cases, and
  • a system in which the set of relevant test cases is small, such that it is possible to generate every case

It is the system design which determines how many test cases there are and how easily they can be identified, so the problems can be solved most effectively during the design process.

And with that short introduction, the rest of the paper focuses on the questions of ‘What is a good system design?’ and ‘What process will help to ensure we produce one?’

A good system design is one where complexity is tamed by dividing it into modules (called ‘partitions’ in the paper, because the term module had already become very overloaded). As we’ve looked at previously, just dividing a system into modules isn’t enough though – it matters very much how you make those divisions. In fact,

…the division of a system into modules may introduce additional complexity… if modularity is viewed only as an aid to management, then any ad hoc modularization of the system is acceptable. However, the success of modularity depends directly on how well the modules are chosen.

A good modularity is based on levels of abstraction, and uses structured programming within modules.

Levels of abstraction were first defined by Dijkstra. They provide a conceptual framework for achieving a clear and logical design for the system. The entire system is conceived as a hierarchy of levels, the lowest levels being those closest to the machine.

There are two important rules given for levels of abstraction:

  1. Each level has resources which it owns exclusively and which other levels are not permitted to access.
  2. Lower levels are not aware of the existence of higher levels and therefore may not refer to them in any way.

With good modularity, the system is broken into a hierarchy of partitions (modules), with each partition representing one level of abstraction and consisting of one or more functions which share common resources. The connections between partitions are limited as follows:

  • Control connections are limited by the rules about the hierarchy of levels of abstraction
  • Connections in data passed between partitions are limited to the explicit arguments passed from the functions of one partition to the (external) functions of another partition. Implicit interaction on common data may only occur among functions within a partition.
  • The combined activity of the functions in a partition support its abstraction and nothing more.

The definition of connections in the above follows Parnas: “The connections between modules are the assumptions which the modules make about each other.”
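These connection rules can be sketched in a few lines (the partition names are mine): the lower partition owns its resource exclusively, and the upper partition interacts with it only through explicit calls and results:

```python
# Two partitions in one sketch. The lower level owns a resource
# exclusively and never refers to the level above; the upper level
# sees only explicit arguments and results, never the resource itself.

# --- lower partition: a line store; _lines is its exclusive resource ---
_lines = []

def put_line(text):
    _lines.append(text)

def get_line(i):
    return _lines[i]

def line_count():
    return len(_lines)

# --- upper partition: knows nothing of how lines are stored ---
def shout_all():
    # interaction is only via the explicit functions above
    return [get_line(i).upper() for i in range(line_count())]

put_line("hello")
put_line("world")
print(shout_all())  # -> ['HELLO', 'WORLD']
```

If the representation of the line store changes (say, to a file on disk), only the lower partition’s functions change; the upper partition’s assumptions about it are limited to the call signatures.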

We know what good modularity looks like when we see it now. But how do you arrive at good modularity in the first place?

The traditional technique for modularization is to analyze the execution-time flow of the system and organize the system structure around each major sequential task. This technique leads to a structure which has very simple connections in control, but the connections in data tend to be complex.

(See Parnas again).

Select modules to support abstractions or concepts which you find helpful in thinking about the system….

Abstraction is a very valuable aid to ordering complexity. Abstractions are introduced in order to make what the system is doing clearer and more understandable; an abstraction is a conceptual simplification because it expresses what is being done without specifying how it is done.

What kinds of abstractions should we be on the lookout for?

  • Abstractions of resources – modules that map the characteristics of an abstract resource into the real underlying resource or resources
  • Abstractions that hide data storage representations
  • Abstractions that limit information:

According to the third requirement for good modularization, the functions comprising a partition support only one abstraction and nothing more. Sometimes it is difficult to see that this restriction is being violated, or to recognize that the possibility for identification of another abstraction exists. One technique for simplification is to limit the amount of information which the functions in the partition need to know (or even have access to).

One way to limit information is to introduce modules at a lower level, on which the higher-level module depends, which hide that knowledge.

  • Abstractions that generalize a function or group of functions. “Separating such groups is a common technique in system implementation and is also useful for error avoidance, minimization of work, and standardization.”
  • Abstractions that encapsulate areas likely to change

The design process proceeds iteratively as follows. First determine an initial set of abstractions which represent the eventual system behaviour in a very general way. Then establish the data and flow of control connections among the partitions.

The second phase occurs concurrently with the first; as abstractions are proposed, their utility and practicality are immediately investigated… A partition has been adequately investigated when its connections with the rest of the system are known and when the designers are confident that they understand exactly what its effect on the system will be. Varying depths of analysis will be necessary to achieve this confidence.

When do you start programming modules? There is a tendency to think of this as the era of the strict waterfall, but that’s not what Liskov proposes:

It is not clear exactly how early structured programming of the system should begin… The best rule is probably to keep trying to write structured programs; failure will indicate that the system abstractions are not yet sufficiently understood and perhaps this exercise will shed some light on where more effort is needed or where other abstractions are required.

Finally, a design can be considered ‘finished’ when the following criteria are met:

  • All major abstractions have been identified and partitions defined for them; the system resources have been distributed among the partitions and their positions in the hierarchy established.
  • The system exists as a structured program… this consists of several components, but no component is likely to be completely defined. Rather each component is likely to use the names of lower-level components which are not yet defined.
  • Sufficient information is available so that a skeleton of a user’s guide to the system could be written. (This was an era of much simpler user interfaces remember).

Programming with Abstract Data Types

October 20, 2016

Programming with Abstract Data Types Liskov & Zilles, ACM SIGPLAN Notices, 1974

This is paper 6/7 from Liskov’s list, and it’s by Barbara Liskov herself. What an enormous impact this paper has had on how we program today. Has there been any single advance in programming languages since this work (42 years ago) which has made such a big impact on mainstream programming? I can’t think of one.

It’s amazing how much is in here; bar inheritance, you’ve got pretty much all of the core features of a modern object-oriented language, including strong type checking and support for generic types.

Structured programming attempts to impose a discipline on the programming task so that the resulting programs are ‘well-structured.’ In this discipline, a problem is solved by means of a process of successive decomposition…

You can start out with high-level procedures, and gradually fill in the details at lower levels of abstraction to complete a program. (Program development by stepwise refinement, Hierarchical program structures). As Liskov notes, when working at a certain level you are concerned with the way a program makes use of abstractions provided at lower levels, but not with the details of how they are realized. Higher-level programming languages could provide increasing numbers of abstractions through their built-in types, but could never hope to provide all of the useful abstractions for building programs.

A structured language… contains no preconceived notions about the particular set of useful abstractions, but instead must provide a mechanism whereby the language can be extended to contain the abstractions which the user requires. A language containing such a mechanism can be viewed as a general-purpose indefinitely-high-level language.

What do we want from an abstraction? A mechanism which allows us to express all relevant details for use of the abstraction, but hides all irrelevant details – the way in which the abstraction is implemented.

An abstract data type defines a class of abstract objects which is completely characterized by the operations available on those objects. This means that an abstract data type can be defined by defining the characterising operations for that type….

When a programmer uses a built-in data type, they make use of a concept or abstraction realized at a lower level of detail – the programming language itself (and its compiler). Likewise, abstract data types are used at one level, and realized at another level – but the lower level is not part of the language itself…

… an abstract data type is realized by writing a special kind of program, called an operation cluster, or cluster for short, which defines the type in terms of the operations which can be performed on it.

I guess operation-cluster-oriented programming was always going to be a bit of a mouthful!

Once you have abstract data types, “most of the abstract operations in a program will belong to the sets of operations characterising abstract types.” For the remaining operations not associated with an ADT, Liskov and Zilles use the term “functional abstraction.” By programming in terms of ADTs and the operations they expose it is possible to defer the implementation representation decision until a later point. And thus it is possible to program according to one of Dijkstra’s programming principles: build the program one decision at a time.

The new-fangled support for ADTs (and abstract objects) is explained by way of an example – writing a program to convert infix expressions to polish notation. Given modern familiarity with the concepts, I’m going to cut straight to a description of the impressive set of features the language had.

  • The same syntax is used to declare variables of abstract types as to declare variables of built-in types (a very powerful principle)
  • A variable declaration must simply declare the type of the variable, or it may also specify that a new object is to be created at the point of declaration. Object creation is indicated by parentheses following the type name, and any object construction arguments inside the parentheses (a constructor invocation as we would now call it):
    s: stack        // s is of type stack
    s: stack(token) // s is of type stack, 
                    // and initialized to a new stack object
  • The language is strongly typed: an abstract object can be used in only one of three ways – by calling the operations defined for it, by passing it as a parameter to a procedure, or by assigning it to a variable of the corresponding type.
  • You invoke operations on an abstract object using a type_name$operation(args) syntax. The first parameter to such an invocation is always the object to be operated on (the object.operation sugar did not exist at this time). For example: stack$push(s, t)

There are several reasons why the type name is included in the operation call. First, since an operation call may have several parameters of different abstract types, the absence of the type-name may lead to an ambiguity as to which object is actually being operated on. Second, the use of the compound name permits different data types to use the same names for operations without any clash of identifiers existing.

  • Objects may also be created outside of variable declarations simply by following the type name by parentheses: s = stack(t)
  • Operation clusters (classes) are defined in a module supporting independent compilation. A cluster definition contains three parts: the object representation (its internal state), the code used to create objects (constructor), and the operations themselves. Here’s the stack example:
    stack: cluster(element_type: type)
      is push, pop, top, erasetop, empty;

      rep(type_param: type) = (
        tp: integer;
        e_type: type;
        stk: array[1..] of type_param);

      create
        s: rep(element_type);
        s.tp := 0;
        s.e_type := element_type;
        return s;
      end

      push: operation(s: rep, v: s.e_type);
        s.tp := s.tp + 1;
        s.stk[s.tp] := v;
        return;
      end

      pop: operation(s: rep) returns s.e_type;
        ...
  • Type parameters are supported (as seen in the above example)

The rep description defines a new type, denoted by the reserved word rep, which is accessible only within the cluster and describes how objects are viewed there… The optional (“{}”) rep-parameters make it possible to delay specifying some aspects of the type definition until an instance of rep is created.
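Mapped onto a modern language, a cluster is essentially a generic class: the rep becomes hidden instance state, create becomes the constructor, and the operations become methods. A rough Python rendering of the stack cluster (modern syntax and names, not the paper’s; erasetop omitted for brevity):

```python
# The class characterizes the abstract type purely by its operations;
# the representation (here a list) plays the role of the hidden "rep".
from typing import Generic, TypeVar

T = TypeVar("T")  # the element_type parameter of the cluster

class Stack(Generic[T]):
    def __init__(self) -> None:
        self._items: list[T] = []   # the hidden rep

    def push(self, v: T) -> None:
        self._items.append(v)

    def pop(self) -> T:
        return self._items.pop()

    def top(self) -> T:
        return self._items[-1]

    def empty(self) -> bool:
        return not self._items

s: Stack[str] = Stack()   # cf. s: stack(token)
s.push("token")           # cf. stack$push(s, t)
print(s.top())  # -> token
```

The `type_name$operation(args)` call form survives today as `object.operation(args)` sugar, with the object itself supplying the type.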

  • Type checking is extended to incorporate type parameters
  • The language completely hides all implementation details of the abstraction from its user.
  • Program efficiency need not be sacrificed:

We believe it is helpful to associate two structures with a program: its logical structure and its physical structure. The primary business of a programmer is to build a program with a good logical structure — one which is understandable and leads to ease in modification and maintenance. However, a good logical structure does not necessarily imply a good physical structure — one which is efficient to execute… We believe it is the business of the compiler to map good logical structure into good physical structure. The fact that the two structures may diverge is acceptable provided that the compiler is verified, and that all programming tools (for example, the debugging aids) are defined to hide the divergence.

And finally, through experimentation with the language, Liskov and Zilles discovered something important about the ability to create new abstractions directly in the programming language:

Of course, program quality is most dependent on good program design. Although a language can never teach a programmer what constitutes a well-designed program, it can guide him or her into thinking about the right things. We believe that abstraction is the key to good design, and we have discovered in our experiments in using the language that it encourages the programmer to consciously search for abstractions, especially data abstractions, and to think very hard about their use and definition.

Protection in programming languages

October 19, 2016

Protection in programming languages Morris Jr., CACM 1973

This is paper 5/7 on Liskov’s list.

Experienced programmers will attest that hostility is not a necessary precondition for catastrophic interference between programs.

So what can we do to ensure that program modules are appropriately protected and isolated? We still need to allow modules to cooperate and communicate, but ideally we’d like to be able to prove that a given module has various properties without considering the rest of the system. Morris explores a set of mechanisms that provide different kinds of guarantees:

  • Procedures as objects (or as we would say today, “functions as first class citizens”).
  • Local objects (scoping)
  • Memory protection
  • Type-based protections
  • Seals and Trademarks
  • Access keys

Procedures as objects

A procedure already hides details of its implementation from a caller, so why is it necessary that we can also pass procedures as arguments?

…in order to exploit this device to its fullest extent it is useful to make procedures full-fledged objects in the sense that they can be passed as parameters or values of other procedures and assigned to variables. Then, to make a particular object accessible to someone in a restricted way (e.g. read only) we make the object local to a procedure and pass the procedure to them in its stead.

I buy the argument when the procedure offers a limited view on an encapsulated object, but in general if the procedure is just wrapping an object it must give out either a value-object or a reference-object. If it gives a reference object, then we have no more protection than if we passed this directly. If it gives a value-object, then the procedure seems to offer no value over just passing a value-object directly. See ‘Why functional programming matters’ for some more powerful (imho!) arguments on the value of functions as first-class citizens.

Local objects

It’s desirable to know that an object is local to a particular part of a program – that is, it cannot be accessed by other parts. “Object” here is used to mean for example the address of a variable, not an object as in the OOP sense. To validate such locality we need to check:

  • that it is not passed as a parameter to a procedure not defined within the module
  • that it is not returned as a result of a procedure called from outside the module
  • that it is not assigned to a reference which is not local to the module

Pony and its deny capabilities is a modern theme along these lines.

Memory protection

We’d like to encapsulate a memory reference (e.g. use of a variable) within a module such that only statements inside the module can assign to it or take its contents. Which is very similar to the ‘local objects’ requirement, but I guess implied is the added protection that the memory itself is inaccessible from outside of the module as well.

Given such protection, we can reason about the state of the module using pre-conditions and post-conditions on all of the operations exposed by it.

Type-based protections

Now things start to get a bit more interesting…

One of the fundamental ways in which one programmer provides services to another is to implement a representation of a new kind of object… Generally speaking the first programmer writes some subroutines which create, alter, and otherwise deal with the new kinds of objects. These objects are represented by some older kinds.

In 1973 the representations were typically just primitive types with some conventions, for example an “interval” may simply be represented as a pair of integers. Of course this still happens today, especially when programming in a functional style. We hope the recipient of such representations treats them as the module designer intended, but at the same time we open ourselves up to three kinds of ‘mishap’:

  1. Alteration – the state could be changed without the use of the primitive functions provided for the purpose
  2. Discovery – the properties of the data might be explored without using the primitive functions
  3. Impersonation – data created outside of the module might be presented to a primitive function expecting it to have certain properties

There are two parties involved in such mishaps: the programmer of the primitive procedures and the user (misuser, actually) of them.

Through the lens of this paper, discovery presents no direct threat, but alteration and impersonation can both expose bugs by invalidating assumed pre-conditions.

Morris’ solution is to allow data to be tagged with a type-tag, “familiar to any implementor of a language with dynamic data types… but being made available to the general user.” Passing data not tagged with the appropriate type-tag should result in a program violation.

These tags have a couple of interesting properties vs static typing:

First, they are ‘cascadable’ : a tagged item may be retained. This feature reflects our belief that representation can be a multilevel affair. Second, they are completely dynamic in the sense that a running program could generate an arbitrarily large number of different type tags. The practical virtues of this generality are not entirely clear (!).

Nowadays we’d probably use domain types.
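As a sketch of what a domain type buys you here (the Interval type is my example, echoing the interval-as-pair-of-integers representation above):

```python
# A "domain type" standing in for Morris's type-tags: an Interval is a
# tagged pair, so impersonating one with a bare tuple is detected.
from dataclasses import dataclass

@dataclass(frozen=True)   # frozen also rules out alteration
class Interval:
    lo: int
    hi: int

def width(iv: Interval) -> int:
    if not isinstance(iv, Interval):   # the "type-tag" check
        raise TypeError("not an Interval")
    return iv.hi - iv.lo

print(width(Interval(3, 7)))  # -> 4
# width((3, 7)) would raise TypeError: impersonation detected
```

The tag rides along with the data, so a primitive function can verify its pre-conditions before trusting what it was handed.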

Seals and Trademarks

A seal provides a way of wrapping an object such that it cannot be seen or tampered with until it is unsealed. A sealed object may safely be passed as a parameter, but its contents cannot be inspected by anyone not in possession of the unseal key. (Morris just uses an integer, ‘which changes every time create_seal is called.’) The seal mechanism provides for authentication (you know it’s an object you created, if you can unseal it), and access control. Local objects that are sealed remain local even if the seal escapes the module.

Instead of an integer, today I immediately think of digests (it hasn’t been changed), signatures (I created it), and encryption keys (or key-pairs) (it can’t be seen). Though these are more heavyweight constructs than Morris intends I believe.
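Morris’s scheme can be sketched as a pair of closures over a serial number (a toy model of the interface only – in Python the sealed representation is still inspectable, unlike in Morris’s scheme):

```python
# create_seal hands back a matched (seal, unseal) pair keyed by an
# ever-changing serial number, as in Morris's description.
import itertools

_serial = itertools.count()

def create_seal():
    key = next(_serial)
    def seal(obj):
        return ("sealed", key, obj)
    def unseal(box):
        tag, k, obj = box
        if tag != "sealed" or k != key:
            raise ValueError("wrong seal")
        return obj
    return seal, unseal

seal, unseal = create_seal()
box = seal({"secret": 42})
print(unseal(box)["secret"])  # -> 42
# a different seal's unseal raises ValueError on this box
```

Making seal public while keeping unseal private then gives exactly the access-key arrangement of the final section.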

Trademarks are a more general form of seals that wrap objects in something that testifies to their authenticity (again, implemented using ever-changing serial-numbers in Morris’ world). This is an early nod to the value of signatures:

If the reader doubts the utility of this device, stripped of any access limiting power, we urge him or her to consider how trademarks and other authenticating marks are used in the commercial/bureaucratic sphere. A mark from the Underwriters’ Laboratories attests that a particular device may be used without risk, a fact that a user might discover only by extensive experience. A letter saying “You have been drafted,” is much more likely to be believed if it bears the mark of a draft board….

Morris’ seals and trademarks are lighter-weight constructs than the cryptographic operations I’ve associated them with though. In usage, he seems to envision them as something very similar to type-tags:

We can invent trademarks to attest to these various properties and attach them to objects in any order we like. For example, a complex number represented in the cartesian system might bear four marks: pair, point-in-plane, cartesian, complex. Programs may examine these marks or not as they please: e.g. A routine that prints pairs is unconcerned with all but the pair tag.

Access keys

Now let us examine the implications of making the seal operation public and leaving unseal private…. if one puts such a seal on an object and leaves it in a public place, he/she is guaranteed that only the restricted group of programs holding unseal can access it…

Sounds very much like a public/private key pair!