Learning to protect communications with adversarial neural cryptography Abadi & Anderson, arXiv 2016

This paper manages to be both tremendous fun and quite thought-provoking at the same time. If I tell you that the central cast contains Alice, Bob, and Eve, you can probably already guess that we’re going to be talking about cryptography (that or reading the paper title 😉 ). But this isn’t cryptography as you know it, and nor is it cryptography intended to actually be used to protect any information – to criticise the paper on that front would be to miss the point in my view. Instead what we get is a really interesting twist on adversarial network training and a further demonstration of the kinds of things that such networks are able to learn.

Here’s the basic setup: there are three neural networks, let’s call them Alice, Bob, and Eve. Alice is given an n-bit message (encoded as a vector of -1.0 or 1.0 values to represent 0 and 1 respectively), and a similarly encoded n-bit key. The job of the Alice network is to produce an n-bit output, which we will call the ciphertext.

Alice’s network has a fully connected layer that is able to mix the message and key together, followed by four 1-dimensional convolutional layers which together reduce the inputs down to an n-bit ciphertext output.

The FC layer is followed by a sequence of convolutional layers, the last of which produces an output of a size suitable for a plaintext or ciphertext. These convolutional layers learn to apply some function to groups of the bits mixed by the previous layer, without an a priori specification of what that function should be. Notably, the opposite order (convolutional followed by FC) is much more common in image-processing applications. Neural networks developed for those applications frequently use convolutions to take advantage of spatial locality. For neural cryptography, we specifically wanted locality—i.e., which bits to combine—to be a learned property, instead of a pre-specified one.

The Bob network receives as input the ciphertext, and the same key as was given to Alice. The job of the Bob network is to recover the original message. (Alice and Bob don’t really know that their job is to encode and decode the message, but we can connect the two networks together and use a loss function incorporating the reconstruction error.) The structure of Bob’s network is identical to Alice’s:

If this is all we did, Alice and Bob would quickly learn to ignore the key inputs and just pass the original message through. But now we get to introduce the adversarial member of the cast: Eve. Eve the eavesdropper is also a neural network, but Eve only gets the ciphertext as input, and not the key.

Eve’s loss function is set up as the L1 distance from the original message – in other words, Eve is being trained to reconstruct the message from the ciphertext.

The loss function for Alice and Bob is updated so that it also takes into account Eve’s reconstruction score (the better the job Eve does in recovering the message, the larger Alice and Bob’s loss).
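A minimal numpy sketch of this loss shape (the exact weighting in the paper may differ; `blind_err` – the expected L1 error of blind guessing – and the quadratic penalty form are assumptions here):

```python
import numpy as np

def l1(msg, guess):
    # Summed per-bit L1 distance; messages are vectors of +/-1 values.
    return np.abs(msg - guess).sum(axis=-1).mean()

def eve_loss(msg, eve_out):
    # Eve is trained simply to reconstruct the message.
    return l1(msg, eve_out)

def alice_bob_loss(msg, bob_out, eve_out, blind_err):
    # Bob should reconstruct perfectly (first term); Eve should do no
    # better than blind guessing, whose expected L1 error is blind_err.
    # The quadratic term penalises Eve doing better OR worse than
    # chance (doing much worse would itself leak information).
    eve_term = ((blind_err - l1(msg, eve_out)) ** 2) / blind_err ** 2
    return l1(msg, bob_out) + eve_term
```

Note how Alice and Bob's loss goes down as Bob's reconstruction improves, and up as Eve's reconstruction improves – exactly the adversarial coupling described above.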

The whole thing put together looks like this:

Training begins with the networks initialised randomly.

We do not compute the “optimal Eve” for a given value of θA, but simply approximate it, alternating the training of Eve with that of Alice and Bob. Intuitively, the training may for example proceed roughly as follows. Alice may initially produce ciphertexts that neither Bob nor Eve understand at all. By training for a few steps, Alice and Bob may discover a way to communicate that allows Bob to decrypt Alice’s ciphertexts at least partly, but which is not understood by (the present version of) Eve… After a bit of training, however, Eve may start to break this code. With some more training, Alice and Bob may discover refinements, in particular codes that exploit the key material better. Eve eventually finds it impossible to adjust to those codes.

Once training of Alice and Bob has stopped, many instances of Eve are further trained to try and break the cryptosystem.

Now isn’t this a beautiful chart:

The red line traces Bob’s reconstruction error, and the green line is Eve’s. You can see that somewhere around 8000 training steps both Bob and Eve start to be able to reconstruct the original message (though Bob does better than Eve). Somewhere around 10,000 training steps the Alice and Bob networks seem to figure this out and Eve’s error rate climbs again. By 15,000 training steps the training goals are effectively achieved.

This training graph does not look like a typical result for neural networks, where monotonicity in the number of steps is generally expected. Instead, the dynamics of this adversarial training appear somewhat more reminiscent of evolutionary processes. These dynamics appear to depend on somewhat random-seeming change to cause the bits to mix slightly, but once there is some mixing, the gradient descent can rapidly drive it farther.

Manual inspection of one of the trained Alice networks showed that the effect of changing a key bit or a plaintext bit was spread across multiple elements in the ciphertext (and not constrained to a single bit as it would be with an XOR). Typically a one-bit change significantly impacted 3-6 elements in the output, with smaller changes in other elements.

A refinement: selective protection

We consider the question of whether neural networks can learn what information to protect, given confidentiality objectives described in terms of an adversary.

A dataset is artificially constructed with four-tuples (A,B,C,D). It is loosely inspired by real-world tasks such as having attributes of a user A,B,C, and using them to make a prediction of D. The objective is to make this prediction as accurate as possible, without revealing any information that would allow C to be recovered (i.e., C is the confidential information we want to protect).

The adversary Eve is trained to minimise loss when recovering C from the ciphertext and public prediction of D that are the outputs of Alice. Bob has the ciphertext, the public prediction of D, and of course the key shared with Alice; it is Bob’s job to recover the true value of D as closely as possible. The loss function for Alice and Bob is based on how close to the true D the outputs of both Alice and Bob are, as well as how close to the true C Eve is able to get.
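Mirroring the first experiment, here is a hedged sketch of what such a combined loss could look like (the equal weighting and the `blind_err` constant are assumptions, not the paper's exact formula):

```python
import numpy as np

def mse(x, y):
    return float(np.mean((x - y) ** 2))

def alice_bob_selective_loss(d_true, d_public, d_bob,
                             c_true, c_eve, blind_err):
    # Reward an accurate public prediction of D and an accurate
    # key-holder (Bob) prediction of D, while pushing Eve's recovery
    # of C toward blind guessing (error blind_err from the prior).
    eve_err = mse(c_true, c_eve)
    eve_term = ((blind_err - eve_err) ** 2) / blind_err ** 2
    return mse(d_true, d_public) + mse(d_true, d_bob) + eve_term
```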

Update: in the diagram below the ‘Alice’ in red on the right-hand side should say ‘Eve’. Thanks to Ron for the catch.

… our goal is to understand whether a neural network can learn to hide information selectively, given only a high-level specification (“do not leak information about C, but let people with the right key learn as much as possible about D”).

Here are the results:

The red line is the privacy-preserving prediction of D (D-public), and the green line is Bob’s attempt to recover the true value of D. The public error becomes low, but not as low as Bob’s. The blue dots show Eve’s improvement in recovering C vs blind guessing. After sufficient training, Eve can gain no more information about C than she would have by simply knowing the distribution of values of C (though that of course is still some potentially meaningful information).

Neural networks are generally not meant to be great at cryptography… Nevertheless, as we demonstrate, neural networks can learn to protect the confidentiality of their data from other neural networks: they discover forms of encryption and decryption, without being taught specific algorithms for these purposes. Knowing how to encrypt is seldom enough for security and privacy. Interestingly, neural networks can also learn what to encrypt in order to achieve a desired secrecy property while maximizing utility.

The kind of network setup shown here is a general pattern for learning goals of the form $A \wedge \neg B$ in which we want to maximise performance in task A without permitting task B to be accomplished.

In researching this work, I also found a very nice write-up with an implementation in Theano, and another implementation in TensorFlow, which makes for quite a nice comparison of the two.

Value Iteration Networks Tamar et al., NIPS 2016

‘Value Iteration Networks’ won a best paper award at NIPS 2016. It tackles two of the hot issues in reinforcement learning at the moment: incorporating longer range planning into the learned strategies, and improving transfer learning from one problem to another. It’s two for the price of one, as both of these challenges are addressed by an architecture that learns to plan.

In the grid-world domain shown below, a standard reinforcement learning network, trained on several instances of the world, may still have trouble generalizing to a new unseen domain (right-hand image).

(This setup is very similar to the maze replanning challenge in ‘Strategic attentive writer for learning macro actions’ from the Google DeepMind team that we looked at earlier this year. Both papers were published at the same time).

… as we show in our experiments, while standard CNN-based networks can be easily trained to solve a set of such maps, they do not generalize well to new tasks outside this set, because they do not understand the goal-directed nature of the behavior. This observation suggests that the computation learned by reactive policies is different from planning, which is required to solve a new task.

Planning is not a new problem – the value iteration algorithm based on Markov decision processes (MDP) has been known since 1957! What Tamar et al. do in this work though, is embed a value iteration (VI) planning component inside the overall neural network architecture. And the breakthrough insight is that the VI algorithm itself can be encoded by a specific type of CNN, which means it is differentiable.

By embedding such a VI network module inside a standard feed-forward classification network, we obtain an NN model that can learn the parameters of a planning computation that yields useful predictions. The VI block is differentiable, and the whole network can be trained using standard backpropagation.

It really is pretty cool – you give the network the machinery that can be used for planning, and it figures out all by itself the best way to use it.

Using the approach, Tamar et al. show that value iteration networks (VINs) generalize better to new grid-world scenarios than either CNNs following the DQN architecture, or fully convolutional networks (FCNs):

(Note there is no comparison to the contemporary STRAW architecture from the DeepMind team that also extends DQNs with planning).

Importantly, note that the prediction loss for the reactive policies is comparable to the VINs, although their success rate is significantly worse. This shows that this is not a standard case of overfitting/underfitting of the reactive policies. Rather, VIN policies, by their VI structure, focus prediction errors on less important parts of the trajectory, while reactive policies do not make this distinction, and learn the easily predictable parts of the trajectory yet fail on the complete task.

They also demonstrated planning success using Mars landscape images for Mars Rover navigation, planning in a physical simulation setting, and planning in the WebNav setting which requires navigating links of a web site towards a goal page.

What I’d love to see is how well the VIN architecture performs on the Frostbite Challenge.

Let’s take a closer look at how it all works, starting with the value iteration algorithm itself, then how to encode that in a NN, before finally putting it all together in a complete architecture.

Standard value iteration

“A standard model for sequential decision making and planning is the Markov Decision Process (MDP).”

You have a set of states $s \in S$, a set of actions $a \in A$, a reward function $R(s,a)$ that gives the anticipated reward for taking action $a$ in state $s$, and a transition kernel, $P(s'|s,a)$ that encodes the probability of the next state given the current state and action. A policy $\pi(a|s)$ prescribes the action distribution for each state.

(Note the similarity between this structure and the action matrix of STRAW).

The goal in an MDP is to find a policy that obtains high rewards in the long term.

You can consider the value of a state under some policy as the expected discounted sum of rewards when starting from that state and following the policy. An optimal policy will find the maximal long-term return possible from a given state. Value iteration computes this optimal value function by repeatedly applying a backup over all states and actions ($\gamma \in (0,1)$ is a discount factor):
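The update itself (equation (1) in the paper) appears as an image in the original post; reconstructed here in its standard form:

$$Q_n(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V_n(s'), \qquad V_{n+1}(s) = \max_{a} Q_n(s,a)$$

Iterating this backup propagates reward information one step further through the state space each time, and as $n \rightarrow \infty$, $V_n$ converges to the optimal value function.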

Encoding value iteration in a neural network

Our starting point is the VI algorithm (1). Our main observation is that each iteration of VI may be seen as passing the previous value function Vn and reward function R through a convolution layer and max-pooling layer. In this analogy, each channel in the convolution layer corresponds to the Q-function for a specific action, and convolution kernel weights correspond to the discounted transition probabilities. Thus by recurrently applying a convolution layer K times, K iterations of VI are effectively performed.

This idea leads to the following network structure:

A reward ‘image’ $\bar{R}$ (to follow the more normal CNN formulation of working with images) is fed into a convolutional layer $\bar{Q}$ with $\bar{A}$ channels. Each channel corresponds to $\bar{Q}(\bar{s},\bar{a})$ for action $\bar{a}$. The layer is max-pooled along the actions channel to produce the next-iteration value function layer. This is stacked with the reward $\bar{R}$ and fed back in K times, to perform K iterations of value iteration.
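As a rough illustration (not the authors' code – the kernel shapes and zero padding here are assumptions), the recurrence can be sketched in numpy:

```python
import numpy as np

def vi_module(reward_img, kernels, K):
    """One VI block: K recurrent conv + channel-max-pool iterations.

    reward_img : (H, W) reward 'image' R-bar.
    kernels    : (A, 2, 3, 3) array, one 3x3 kernel per action over the
                 two input channels (reward, value); these learned
                 weights stand in for discounted transition probabilities.
    """
    H, W = reward_img.shape
    V = np.zeros((H, W))
    Q = np.zeros((kernels.shape[0], H, W))
    for _ in range(K):
        x = np.stack([reward_img, V])               # (2, H, W)
        xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))    # zero-pad the borders
        for a, k in enumerate(kernels):             # one Q channel per action
            for i in range(H):
                for j in range(W):
                    Q[a, i, j] = np.sum(k * xp[:, i:i + 3, j:j + 3])
        V = Q.max(axis=0)                           # max-pool over actions
    return V, Q
```

A real implementation would of course use a framework's batched conv ops rather than explicit loops; the point is just that each iteration is convolution plus channel-wise max, so K iterations of VI is a K-step recurrent CNN.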

The full Value Iteration Network model

The value-iteration module we just described can now be embedded into a full value iteration network as follows:

In many systems, if you’re in a given state and you take a given action, the set of possible next states is much smaller than the overall universe of states. More precisely, the states for which $\bar{P}(\bar{s'}|\bar{s},\bar{a}) > 0$ form a small subset of $\bar{S}$.

In NN terminology, this is a form of attention, in the sense that for a given label prediction (action), only a subset of the input features (value function) is relevant. Attention is known to improve learning performance by reducing the effective number of network parameters during learning.
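For a grid-world with local moves, such an attention function might simply crop the value map around the agent's position (a hypothetical sketch, not the paper's exact attention module):

```python
import numpy as np

def attention(value_map, state):
    # Crop a 3x3 patch of the value map centred on the agent: with
    # local transitions, distant values are irrelevant to the next
    # action, so the policy only needs this local slice as features.
    i, j = state
    padded = np.pad(value_map, 1)   # zero-pad so edge states work too
    return padded[i:i + 3, j:j + 3].ravel()
```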

This is the purpose of the attention module added into the feedback loop in the diagram above. With the inclusion of the CNN-based value iteration module, everything in the value iteration network is differentiable:

This allows us to treat the planning module as just another NN, and by back-propagating through it, we can train the whole policy end-to-end.

To implement a VIN, you need to specify the state and action spaces for the planning module ($\bar{S}$ and $\bar{A}$), the reward and transition functions $f_R$ and $f_P$, and the attention function. The authors call this the process of VIN design.

Once a VIN design is chosen, implementing the VIN is straightforward, as it is simply a form of CNN. The networks in our experiments all required only several lines of Theano code.

Does the online card payment landscape unwittingly facilitate fraud? Ali et al., IEEE Security & Privacy 2017

The headlines from this report caused a stir on the internet when the story broke in December of last year: there’s an easy way to obtain all of the details from your Visa card needed to make online purchases in seconds (4 seconds to be precise). Using the discovered card details to make an international money transfer took just 27 minutes from creating the transfer account to cash in hand (in this case in India, from funds initiating in the UK). That’s fast enough that there’s very little time for a bank to detect fraud and reverse the payment.

Digging a little deeper though, there are also some interesting lessons to be learned about unintended emergent behaviours in complex systems, misaligned incentives, and the state of card payment security in general.

How the attack works

There are two principles at play here: leaking one piece of information can often quickly be escalated into leaking everything, and if you allow enough guesses (or queries), most systems will fall.

An online credit card payment uses up to four fields for validation: the credit card number (the PAN, Primary Account Number), the expiry date, the CVV2 verification number on the back of the card, and the card holder’s address (strictly, just the digits from postcode and house number). At a bare minimum, the PAN and expiry date are required. Note that expiry dates and CVV2 numbers come from relatively small domains: if we assume cards are issued with a lifetime of no more than 5 years, that’s 60 different month/year expiry dates; a three-digit CVV2 number has 1000 different possible values.

Assume you have a credit card number (there are many ways to get hold of one!). We can use the fact that some merchants only require PAN + expiry date to first learn the expiry date in at most 60 attempts. Then we can use merchants that require PAN + expiry date + CVV2 to learn the CVV2 number in at most 1000 attempts. With both of those in hand, and a little investigation surrounding the location where the card number was obtained / used, we can crack the address digits (if indeed we need to). Ignoring the address for the moment, note that expiry date + CVV2 gives us 60,000 combinations to guess, but by cracking them in stages we reduce it to only 1,060.
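The staged attack can be sketched as follows (the `check_*` callbacks are hypothetical stand-ins for live merchant responses):

```python
def staged_crack(check_expiry, check_cvv2):
    """Stage 1: learn the expiry date via merchants that validate only
    PAN + expiry; stage 2: learn the CVV2 via merchants that also
    validate CVV2. Worst case: 60 + 1000 = 1,060 guesses, versus
    60 * 1000 = 60,000 if both had to be guessed jointly."""
    expiry = next(e for e in range(60) if check_expiry(e))
    cvv2 = next(c for c in range(1000) if check_cvv2(expiry, c))
    return expiry, cvv2
```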

All we need now is a way to make lots of guesses. The good news is, there are lots of merchant sites! Start with a list of websites that only require PAN + expiry date (60 such websites would be a handy number). Submit a trial transaction to each of them (you might as well fire all these off in parallel to save time). The one that succeeds tells you the expiry date. Now go to your list of websites that require PAN + expiry date + CVV2. Submit a bunch of requests in parallel again to find out the verification code.

We implemented a set of software tools to carry out the distributed guessing attack, using the research team’s own cards to verify that it is indeed possible and practical to obtain all the information of the card. Included are seven Visa cards with a spread of PAN, expiry date, and CVV2 values. We selected 400 Alexa top rated commercial websites for our investigation.

You only need to obtain PAN + expiry date to use the card on some websites, but more detail is better. On the dark web, you can purchase lists of credit card details. Credit card numbers on their own are one price, those with accompanying expiry dates a little more, those with CVV2 values more again, and so on.

The experiments run by the authors showed that it is possible to run multiple bots at the same time on hundreds of payment sites without triggering any alarms in the payment system. With a bot configured to use just 30 sites, it took just 4 seconds to obtain all the information for a card.

There is clearly huge potential for abuse here – at scale botnets could be used to e.g. purchase plain credit card numbers on the darknet, enrich them with additional information, and sell them back pocketing the profit. Alternatively, by using them for international money transfers, cash can be in hand in less than half-an-hour. Or using NFC-skimming in busy transport and retail locations your card can be skimmed, details recovered, and cash extracted just by the attacker standing close by.

Each individual merchant could lock-down the number of guesses allowed (it turns out many don’t!), but that doesn’t help as it’s the ability to spread the guesses across multiple sites that does the damage.

State of the practice

Of 389 of the top 400 Alexa websites, 26 sites use only two fields for card payment (and so can be used for cracking the expiry date), 291 use three fields (for subsequent cracking of CVV2), and 25 sites use four fields. Most commonly these sites allow 6-10 guesses per transaction before locking the user out, but a significant number (33) allow up to 50 guesses, and 6 allow unlimited guesses. Among these, one of the top-ten most visited websites used only PAN + expiry date, and another of the top-ten sites allowed unlimited attempts to guess the CVV2!

47 of the 389 sites had implemented the 3D Secure payment system. Under the 3D Secure scheme the issuing bank has visibility of all transactions for a card, even when distributed across many websites. This enables the distributed guessing attack to be detected and prevented.
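The benefit of that centralised view can be sketched as a simple per-card attempt counter on the issuer side (illustrative only – real issuer logic is far richer):

```python
from collections import defaultdict

MAX_FAILED = 10
failed = defaultdict(int)   # failed validation attempts per PAN

def authorise(pan, fields_ok):
    """Issuer-side check with visibility across ALL merchants: guesses
    distributed over many websites still increment one counter for the
    card, so the distributed attack trips the limit just like a
    single-site attack would."""
    if failed[pan] >= MAX_FAILED:
        return "blocked"
    if not fields_ok:
        failed[pan] += 1
        return "declined"
    failed[pan] = 0
    return "approved"
```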

The authors disclosed their results to the top 36 affected (in terms of website traffic) websites. The story of their responses also makes sad reading – 28 of them made no change at all. 6 added delay or velocity filters to make automated repeated guessing harder, and 2 added a requirement for address information.

Perhaps surprisingly, none of the sites reacted by simply putting a hard limit on the number of allowed attempts.

Why don’t merchants tighten security?

There are two problems here: one is that it is not obviously in their interest to tighten security, and the other is that the actions of a single merchant can’t solve the issue.

Here’s one thing a merchant could do: refuse to take Visa cards and only accept MasterCard payments (the MasterCard network has centralised processing that defeats the attack). I pick that as an extreme example to bring out the trade-offs: clearly a merchant that refuses to accept Visa cards is going to lose a lot of business!

Implementing tighter security such as 3D Secure has similar consequences though not as severe: in one study in the USA up to 43% of consumers abandon transactions when the 3D Secure screen is presented, and in China up to 55%. If your systems are being used to help crack card details, but no fraudulent purchases are made on your own site, implementing 3D Secure and taking such a high abandon rate looks like a tough business decision that has high costs to you without any obvious benefit. Even if you are falling victim to fraudsters some percentage of the time, is the loss through fraud greater than the loss through increased cart abandon rates?

Suppose a responsible merchant does add 3D Secure (or another mechanism that similarly makes checkout more cumbersome, and hence increases the abandon rate); this still doesn’t prevent the guessing attacks unless every other merchant does it too. So again the business decision is to hurt your own bottom line, presumably drive traffic to more lax competitors, and still be vulnerable to fraud using the stolen card once the full details have been obtained elsewhere anyway.

We can drop down a level and try to address the problem in payment gateways. Payment gateways can, for example, use IP address velocity filters to detect repeated invalid attempts made within a certain time span from the same IP address. (Such a filter would not protect against a distributed botnet trying to crack cards.) The problem remains that although there are fewer payment gateways than merchants, there are still enough of them to make circumventing velocity filters relatively easy.

How could the problem be fixed?

To prevent the attack, either standardisation or centralisation can be pursued (some card payment networks already provide this). Standardisation would imply that all merchants need to offer the same payment interface, that is, the same number of fields. Then the attack does not scale anymore. Centralisation can be achieved by payment gateways or card payment networks possessing a full view over all payment attempts associated with their network. Neither standardisation nor centralisation naturally fit the flexibility and freedom of choice one associates with the Internet or successful commercial activity, but they will provide the required protection. It is up to the various stakeholders to determine the case for and timing of such solutions.

Finding security bugs in web applications using a catalog of access control patterns Near & Jackson, ICSE 2016

If you had a formal specification of the desired security attributes of your web application, and could map that to the source code, you’d be able to verify that it did indeed satisfy the specification. But let’s face it, not many developers of web apps have the time and inclination to go and build a formal spec. In today’s paper from ICSE 2016 Near and Jackson explore a really interesting compromise. They do all the hard work of building formal models of seven common security patterns, and all the app developer has to do is provide a lightweight mapping from their application code to those patterns. The system has been implemented for Ruby on Rails apps, and while it won’t be able to capture every kind of security bug, it looks like a pretty useful return on investment from the developer’s perspective: of the top 50 most watched Rails apps on GitHub, 30 include some kind of access control; of those 30, eight of them had security bugs (23 in total) found by the tool (called SPACE – Security PAttern CheckEr) – that’s over 25%! One caveat here is that those top 50 apps seem to include a lot of starter projects and samples, and fewer real-world apps. You can find SPACE at http://www.cs.berkeley.edu/~jnear/space.

From my perspective, there are two interesting angles in this paper. One is simply the security pattern catalog, which though high level is broadly applicable to web apps in general. Then there’s the specific tool chain that automates the process for Rails apps.

A Security Pattern Catalog

Each pattern in our catalog models a common access control use case in web applications. Our catalog is based on the assumption that developers usually intend for a particular pattern to be applied uniformly to all uses of a given resource type – an assumption supported by our experience with real-world applications.

The overall approach is based on whitelisting, so the patterns say what is allowed, and everything else is denied.

1. Ownership. This pattern models resources created by a user that ‘belong’ to that user. The owning user can read and write objects they own.
2. Public objects are objects that can be read by anyone (e.g. blog posts in a blogging app).
3. Authentication models a subset of users that are currently logged in, and provides read access to authenticated objects only for logged in users.
4. Explicit permission – this pattern is used to model situations where the application explicitly represents permissions for a resource. Permit relations model giving permission to a user to perform a specific operation on a specific object.
5. User profiles. A special case of ownership. Programmers frequently forget checks requiring that the user updating a user profile must be the owner of that profile. Now they won’t.
6. Administrators. A special class of users that have full permissions on all objects.
7. Explicit Roles. This pattern captures the specification of roles that can be assigned to users, and that allow or deny permission to perform operations.
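To make the whitelisting idea concrete, here is a hypothetical Python sketch of the read check for patterns 1, 2, 3, and 6 (SPACE itself expresses the patterns in Alloy; the names here are invented):

```python
def can_read(user, obj):
    """Whitelisting: a read is allowed only if some pattern permits it;
    everything not explicitly allowed is denied by default."""
    return (
        obj.get("public", False)                        # 2. public objects
        or (obj.get("owner") is not None
            and obj.get("owner") == user.get("id"))     # 1. ownership
        or user.get("admin", False)                     # 6. administrators
        or (obj.get("auth_read", False)
            and user.get("logged_in", False))           # 3. authentication
    )
```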

Checking Rails apps using the catalog

The user needs to help define the mapping between their application resources and the security patterns, which are based on a Role-based access control (RBAC) model under the covers. Resource types themselves, and session management can be inferred automatically.

The map relation can be used to define public and permission objects, but we must define a separate mapping from field names of resources to the corresponding relations they represent in our security patterns.

Consider the MediumClone Rails app (which is a clone of Medium). The populated mapping looks like this, where FieldNames specifies application-defined field names and mapfields maps these to pattern roles:

(The user actually specifies the mapping in Ruby code, see an example in the next section).

SPACE extracts the data exposures from an application using symbolic execution, specializes the constraints on those exposures to the types of role-based access control using the mapping provided by the user, and exports the specialized constraints to an Alloy specification. Then, SPACE uses the Alloy Analyzer – an automatic bounded verifier for the Alloy language – to compare the specialized constraints to our pattern catalog (which is also specified in Alloy).

Rather than building a standalone symbolic executor, the authors coerce Ruby’s standard interpreter into performing symbolic execution. This involves defining Ruby classes for symbolic objects and expressions and wiring them in using good old method_missing. SPACE also provides an implementation of the ActiveRecord API that ignores the real database and instead returns symbolic values. Finally, it transparently wraps the Rails rendering API to record the symbolic objects referenced when evaluating view templates.
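The same trick can be illustrated in Python, where `__getattr__` plays the role of Ruby's method_missing (a loose analog for illustration, not SPACE's actual code):

```python
class Sym:
    """A symbolic value: records operations instead of computing them."""

    def __init__(self, expr):
        self.expr = expr

    def __getattr__(self, name):
        # Any unknown attribute access yields a new symbolic expression
        # (the Python analog of Ruby's method_missing hook).
        return Sym(f"{self.expr}.{name}")

    def __eq__(self, other):
        # Comparisons build constraints rather than returning booleans.
        rhs = other.expr if isinstance(other, Sym) else repr(other)
        return Sym(f"({self.expr} == {rhs})")

    def __repr__(self):
        return self.expr
```

Running ordinary application code over `Sym` inputs then records the access-control constraints the code imposes, instead of executing real database queries.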

A MediumClone example

Here’s the code for the user controller taken from the MediumClone app:

Notice that the correct_user filter is not applied to the update operation, so any logged in user can update any other user’s profile (violating pattern 5).

In our experience, this kind of mistake is common: the developer assumes the user will use the interface provided by the site (in this case, the edit page), and will not craft malicious requests to constructed URLs. Developers often fail to consider alternative paths not accessible through the interface, and therefore omit vital security checks.

The User type in the MediumClone app represents the User type in the RBAC pattern catalog. Posts are owned RBAC objects. The mapping the user needs to provide is specified simply as:

Space.analyze do
  mapping User: RBACUser,
          Post: OwnedObject(user: owns)
end


SPACE requires the mapping above and MediumClone’s source code – it needs no further input or guidance from the user.

SPACE will find the missing check and produce a counterexample to demonstrate the security vulnerability – see (a) below:

SPACE also finds a second bug whereby the PostController forgets to apply the signed_in_user before filter on the update action. A counterexample is shown in (b) above.

Password managers: Attacks and defenses Silver et al. USENIX 2014

TL;DR – don’t use autofill.

The evil coffee shop attacker

The attacker is assumed to be able to enact an active man-in-the-middle network attack – i.e., to interpose and modify arbitrary network traffic originating from or to a user’s machine. However, there is no requirement that the user explicitly visit or login to any particular site in order to steal the credentials for that site.

A rogue wifi router in a coffee shop (for example) is all that is needed – connect to it and your passwords could be gone.

We call this type of attacker the evil coffee shop attacker. These attacks require only temporary control of a network router and are much easier and thus more likely to happen in practice… In many of our attacks the user need not interact with the victim web site and is unaware that password extraction is taking place.

Sweep attacks

The basic sweep attack works against any password manager that supports autofill of password fields. The target user connects to the WiFi hotspot controlled by the attacker.

When the user launches the browser, the browser is redirected to a standard hotspot landing page asking for user consent to standard terms of use. This is common behavior for public hotspots. Unbeknownst to the user however, the landing page contains invisible elements that implement the attack.

In other words, by the time you’re looking at the fully loaded landing page, most of your credentials could already be gone – in tests, about ten passwords can be extracted per second.

There are three basic ways the attacker can use the landing page to sweep passwords:

1. In an iFrame sweep attack the landing page contains invisible iFrames pointing to arbitrary pages at multiple target sites. When the browser loads the frames, the attacker injects a login form and javascript into each of them (we’ll look at ways of doing that next). An autofilling password manager kindly auto-populates the corresponding password field with the user’s password.
2. Instead of using iFrames, the attacker can use multiple windows. By requiring a window to open before the user can gain access to the wifi network, the user can be encouraged to disable any popup blocker. Multiple windows will be more noticeable than invisible iFrames, but injected javascript can e.g., minimise them, or move them to the edge of the screen. They can be closed as soon as the password has been stolen.
3. The third approach is to use a chain of redirects. When the user requests some page, the attacker responds with a redirect to a site for which the attacker wishes to learn the password. The injected javascript adds a login form, and hides the page details. As soon as the password manager autofills the password and it has been exfiltrated, the browser is redirected to the next target site, and so on in a chain, eventually loading the user’s originally requested page. The user sees a slow wifi connection…

As of 2014, the following table shows the tested password managers and which were vulnerable to these sweep attacks:

Injection

Sweep attacks rely on the attacker’s ability to modify a page on the victim site by tampering with network traffic. The attacks are simplest when the vulnerable page is the login page itself. However, any page that is same-origin with the login page is sufficient, as all password managers associate saved passwords with domains and ignore the login page’s path.

One easy setup to attack is sites that serve a login form over HTTP (bad practice), and only use HTTPS for the submission. As of October 2013, 17% of Alexa Top 500 sites with login forms did this. I’d like to think the number is less today but I don’t have the data.

Any HTTPS webpage with active content fetched over HTTP is also vulnerable (most browsers block this). Any XSS vulnerability on any page of the victim site will also work (even if the login page is served over HTTPS). In fact, an XSS vulnerability anywhere on the site enables the attack without even needing a rogue WiFi – so long as the web attacker can lure the victim into visiting a site the attacker controls.

Broken HTTPS connections (e.g., bad certificates) also lead to vulnerabilities, as an attacker can serve the modified login page using a self-signed certificate. The browser will complain, but the user will often click through the warnings (especially when they occur as part of logging onto a WiFi network – or so they think).

A special prize goes to embedded devices which serve login pages over HTTP expecting to be on a private network protected by WiFi encryption, or (e.g.,) home routers that serve login pages over HTTPS but use self-signed certificates.

Exfiltration

Once the javascript in the attacker’s page has the desired password, exfiltration is pretty straightforward. One approach is to load an invisible iFrame and pass the credentials as parameters, another is to modify the action of a login form to submit to an attacker-controlled site.

All of the attacks described thus far take advantage of automatic autofill password managers to work when the user does not interact with the login form. However, the exfiltration techniques we described work regardless of how the login form was filled. If the user’s password manager requires user input to fill password and an attacker can trick the user to interact with the login form without them realizing it, the same exfiltration techniques can be used to steal the password as soon as the password form is filled.

The authors describe a clickjacking attack that can work in this scenario – although of course we are now limited to stealing only one password at a time.

A number of password manager behaviours beyond simple autofilling help the attacker; these mostly seem to fall into the camp of password managers trying to be robust to changes in site implementation details. The following table provides a short summary, see section 2 in the paper for the longer explanation regarding each column.

Defenses

The main proposed defence is secure filling, which requires a modified browser (and modified password managers to work with the modified browser).

Secure filling requires:

1. When a login form is autofilled by a password manager, it becomes unreadable by JavaScript (hence the requirement for a modified browser).
2. If username or password fields are modified (by the user or JavaScript) while an autofill is in progress, the autofill aborts, clearing the password from the password field and making the field readable again.
3. Once a form with autofill is submitted, and after all JavaScript code that is going to run has run, the browser checks that the form’s action matches the stored one and only submits if so.

Note this would also mean, for example, that we need to treat initial registration pages specially as they often include client-side validation of password strength (requiring js access).

We disclosed our results to the password manager vendors, prompting several changes to autofill policies. Due to our findings, LastPass will no longer automatically fill password fields in iFrames, and 1Password will no longer offer to fill passwords from HTTPS pages on HTTP pages.

There’s a recent and relevant post from Khad Young at 1Password explaining why 1Password doesn’t offer autofill capabilities that also discusses sweep attacks.

Dynamics on expanding spaces: modeling the emergence of novelties Loreto et al., ArXiv 2017

Something a little bit left field today to close out the week. I was drawn into this paper by an MIT Technology Review article entitled “Mathematical model reveals the patterns of how innovations arise.” Who wouldn’t want to read about that!? The article (and the expectations set by the introduction to the paper itself) promise a little more than they deliver in my view – but what we do concretely get is a description of a generative process that can produce distributions like those seen in the real world, with new / novel items appearing at the observed rates and following observed distributions. Previous models have all fallen short in one way or another, so the model does indeed seem to teach us something about the process of generating the new.

Novelties are part of our daily lives. We constantly adopt new technologies, conceive new ideas, meet new people, experiment with new situations. Occasionally, we as individuals, in a complicated cognitive and sometimes fortuitous process, come up with something that is not only new to us, but to our entire society so that what is a personal novelty can turn into an innovation at a global level. Innovations occur throughout social, biological and technological systems and, though we perceive them as a very natural ingredient of our human experience, little is known about the processes determining their emergence. Still the statistical occurrence of innovations shows striking regularities that represent a starting point to get a deeper insight in the whole phenomenology.

The plan for today’s post is a little bit different to normal: we’ll start by looking at some of the laws that real world data sets seem to follow under certain conditions, then we’ll jump straight to the part of today’s paper that explains the generative model (skipping the 10+ pages of descriptions of previous models that didn’t quite cut it for one reason or another) before closing out with a brief look at the related (in my mind at least) Social Physics model of Andy Pentland et al. which explains how ideas spread once conceived. The post will therefore be a little bit longer than usual, but I think you’ll find the tour quite interesting!

Benford’s Law

The most counter-intuitive of the laws is Benford’s law, which says that if you look at a real-world distribution of numerical data (for example, population of cities) then you’ll observe the following phenomenon: numbers beginning with 1 are the most common (about 30% of the time), and numbers beginning with 9 are the least common (about 5% of the time). The likelihood of a number beginning with the digit $d$ is $\log_{10} \frac{d+1}{d}$. Yes, that’s just weird!

(Source: wikipedia)

A dataset should satisfy four conditions for the law to hold:

1. Values are positive numbers
2. Values range over many different orders of magnitude
3. Values arise from a complicated combination of largely independent factors
4. Values have not been rounded, truncated or otherwise constrained in size

The law has been shown to work in many different scenarios – e.g., city populations, heights of the world’s tallest structures, lengths of rivers, figures in accounts, and so on.

The phenomenon was again noted in 1938 by the physicist Frank Benford, who tested it on data from 20 different domains and was credited for it. His data set included the surface areas of 335 rivers, the sizes of 3259 US populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an issue of Reader’s Digest, the street addresses of the first 342 persons listed in American Men of Science and 418 death rates. The total number of observations used in the paper was 20,229. – Wikipedia

It is independent of the units used for the values (e.g., km vs miles) and even of the base (i.e., we don’t have to be using base 10). There’s a good New Scientist article on the law from 1999 entitled ‘The power of one.’

Why it works is complicated. But if the values range over several orders of magnitude, then we could consider that we are drawing random samples from a log scale. Look what happens when you take e.g., values from 1 to 2 inclusive with 0.1 increments and treat them as base 10 logs:

Note that we get numbers that start with the digit ‘1’ all the way up to 1.3 – i.e., about 30% of the time.

Zipf’s Law

Zipf’s law essentially tells us that things which grow large are comparatively rare. In text corpora, it’s the famous result that the frequency of any word is inversely proportional to its rank in the frequency table (so e.g., the 2nd most frequent word appears 1/2 as often as the most frequent, and so on). In the more general form, the nth largest value should be approximately $Cn^{-\alpha}$ where $C$ is the size of the largest value and $\alpha$ is a tuneable parameter, often close to 1 in real-world datasets.

The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel, and so on. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913. – Wikipedia

Heaps’ Law

Heaps’ law concerns the rate at which we discover new things. The initial formulation is again in the context of words in text documents. Let the number of distinct words in a text of length $n$ be $V_{R}(n)$, then

$V_{R}(n) = Kn^{\beta}$

For English text, K is typically between 10 and 100, and β is between 0.4 and 0.6.

You can think of Heaps’ law as telling us that the more of a given space we have explored, the less likely it is that we’ll encounter something new. “Under mild assumptions, the law is asymptotically equivalent to Zipf’s law…” (Wikipedia). Think of the very long tail of very rare things in a Zipfian distribution – we have to take larger and larger samples in the hope of ‘catching’ one of them…

Heaps’ law also applies to situations in which the “vocabulary” is just some set of distinct types which are attributes of some collection of objects. For example, the objects could be people, and the types could be country of origin of the person. If persons are selected randomly (that is, we are not selecting based on country of origin), then Heaps’ law says we will quickly have representatives from most countries (in proportion to their population) but it will become increasingly difficult to cover the entire set of countries by continuing this method of sampling. – Wikipedia

Pareto distributions

The 80/20 rule (e.g., 20% of the population earn 80% of the total income, and 20% of that 20% earn 80% of that 80% and so on) is a special case of a Pareto distribution.

“The Pareto distribution, named after the Italian civil engineer, economist, and sociologist Vilfredo Pareto, is a power law probability distribution that is used in description of social, scientific, geophysical, actuarial, and many other types of observable phenomena” – Wikipedia

If X is a random variable with a Pareto distribution, then $\Pr(X > x) = \left(\frac{x_m}{x}\right)^{\alpha}$ for $x \geq x_m$, where $x_m$ is the (positive) minimum possible value of X, and $\alpha$ is a tuneable positive parameter.

(source: wikipedia)

In the 80/20 rule, α is approximately 1.161.

Once again, the Pareto distribution tells us something about the relative distribution of large and small entities. Some of the many places it shows up, as listed in Wikipedia, include: the sizes of human settlements, file sizes, hard drive error rates, the values of oil reserves, sizes of sand particles, numbers of species per genus, and so on.

An alternative way of looking at the Pareto distribution is as follows:

The proportion of X with at least m digits (before the decimal point), where m is above the median number of digits, should obey an approximate exponential law, i.e., be approximately of the form $c \cdot 10^{-m/\alpha}$ for some $c, \alpha > 0$. In many cases, α is close to 1.

The adjacent possible and four examples

Now we can turn our attention back to the paper! The authors introduce a ‘mathematical model of the dynamics of novelties’ that is based on an idea called ‘the adjacent possible.’

Originally introduced in the framework of biology, the adjacent possible metaphor includes all those things, ideas, linguistic structures, concepts, molecules, genomes, technological artifacts, etc., that are one step away from what actually exists, and hence, can arise from incremental modifications and/or recombinations of existing material.

Here’s my intuition – imagine you’re wandering around the ‘land of the known.’ Most of the time you’re somewhere in the interior of the territory (and the larger the territory, the more likely it is that this will be so), but occasionally you find yourself at a border. There are no border signs, so you won’t necessarily know this is a border, but walking in a random direction from where you now are, you have a chance of venturing outside of the land of the known.

The model predicts the statistical laws for the rate at which novelties happen (Heaps’ law) and for the frequency distribution of the explored regions of the space (Zipf’s law), as well as the signatures of the correlation process by which one novelty sets the stage for another. The predictions of this model were tested on four data sets of human activity: the edit events of Wikipedia pages, the emergence of tags in social annotation systems, the sequence of words in texts, and listening to new songs in on-line music catalogs.

The model itself is based on ‘Polya’s Urns’…

Polya’s Urns

Consider an urn filled with $N_0$ balls, each of a different colour…

These elements represent songs we have listened to, web pages we have visited, inventions, ideas, or any other human experiences or products of human creativity.

Let $S$ be the sequence of things (ball colours) we have drawn from the urn so far. Initially $S$ is empty.

At each time step proceed as follows:

1. Draw a ball $b$ from the urn with uniform probability and add it to $S$
2. Put the ball $b$ back into the urn, together with $\rho$ additional balls of the same colour. This part models the ‘rich get richer’ phenomenon making it more likely that we return to ball colours we have previously seen.
3. If $b$ is a colour that we have never seen before, then add $\nu+1$ new balls to the urn, each of a brand new colour. (Assume we have as many colours as we like, and can distinguish among them all!).

A variant of this process only adds the additional copies in step 2 if $b$ is a colour we have already seen at least once.

The authors show that both variants of the urn model follow Heaps’ law for the number of distinct elements after t time steps, and a fat-tailed frequency rank distribution. It’s interesting because it shows how we can grow such a distribution step by step and thus potentially models how things grow in the real world. If you just wanted to generate a data set of a certain size (all at once) following the laws of real-world datasets I can think of easier ways.

By providing the first quantitative characterization of the dynamics of correlated novelties, these results provide a starting point for a deeper understanding of the adjacent possible and the different nature of triggering events (timeliness, spreading, individual vs. collective properties) that are likely to be important in the investigation of biological, linguistic, cultural, and technological evolution.

Social Physics

Alex Pentland runs a ‘Social Physics’ group at MIT, based at its core on a model of how ideas spread between people. Thus this model also offers some explanation for how a few things can become very popular, and others languish relatively unknown. Here’s the gist of the idea, taken from the 2014 Social Physics book.

Imagine a universe with $C$ people. Each $c \in C$ is an independent actor, and their observable behaviour at time $t$, $O_{t}^{(c)}$ is presumed to be based upon some hidden ideas in their head, $h_{t}^{(c)}$.

The likelihood of a given observable action by person $c$ given a particular hidden state can be expressed as $P(O_{t}^{(c)}|h_{t}^{(c)})$. Furthermore, let’s assume that the beliefs of a person at time $t$ are influenced by their own belief as well as the beliefs of everyone else in the population, at time $t-1$.

$P(h_{t}^{(c')}|h_{t-1}^{(1)},...,h_{t-1}^{(C)}) = \sum_{c=1}^{C} R^{c',c} \times P(h_{t}^{(c')}|h_{t-1}^{(c)})$

$R^{c',c}$ is the influence matrix entry that captures the strength of person $c$’s influence over $c'$. A good way to estimate the influence strength is simply to measure the amount of interaction between the two people.
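In code, one step of this update is just an R-weighted mixture of the previous step’s beliefs. Here’s a heavily simplified Go sketch (beliefs collapsed to a single number per person rather than a full distribution, and the R and h values invented for illustration):

```go
package main

import "fmt"

// nextBeliefs performs one step of the influence update: person c's new
// belief is an R-weighted mixture of everyone's beliefs at the previous
// step, with R[c][j] the influence of person j on person c (rows sum to 1).
func nextBeliefs(R [][]float64, h []float64) []float64 {
	out := make([]float64, len(h))
	for c := range R {
		for j, w := range R[c] {
			out[c] += w * h[j]
		}
	}
	return out
}

func main() {
	// Toy two-person network: person 0 weighs their own belief and
	// person 1's equally; person 1 listens only to themselves.
	R := [][]float64{{0.5, 0.5}, {0, 1}}
	h := []float64{0, 1} // initial belief strength in some idea
	for t := 1; t <= 3; t++ {
		h = nextBeliefs(R, h)
		fmt.Printf("t=%d: %v\n", t, h)
	}
}
```

Iterating the update shows person 0’s belief being steadily pulled towards person 1’s – the “idea flow” the model is designed to quantify.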

One of the most important consequences of this model is that it lets us take raw observations of behavior and gives us the social network parameters we need to get a numerical estimate of idea flow, which is the proportion of users who are likely to adopt a new idea introduced into the social network. (Social Physics, p83).

Fencing off Go: Liveness and safety for channel-based programming, Lange et al. POPL 2017

In the true spirit of POPL (Principles of Programming Languages), I present today’s summary of ‘Fencing off Go’:

What more do you need to know?

Let’s try again 🙂

Fencing off Go: Liveness and safety for channel-based programming, Lange et al. POPL 2017

POPL papers can be very intimidating for those not steeped in type theory – and I’m sure they’re still pretty hard work even for those who are! In Fencing off Go though, Lange et al. have done something which should be of wide and practical interest to Go programmers.

This work develops a static verification framework for liveness and safety in Go programs, able to detect communication errors and partial deadlocks in a general class of realistic concurrent programs, including those with dynamic channel creation, unbounded thread creation and recursion.

And it’s more than just a pretty paper – the tool chain is also available for you to try on your own programs. The test programs used by the authors are all pretty small though (max 112 loc!):

Background: concurrency, liveness, and safety in Go

Go is a statically typed language that uses channels and goroutines (lightweight threads) for concurrency. Instead of chains of asynchronous callbacks, concurrent Go programs use logically structured flows of messages on channels. This avoids typical problems with locks and callback-hell…

On the other hand, Go inherits most problems commonly found in concurrent message-passing programming such as communication mismatches and deadlocks, offering very little in terms of compile-time assurances of correct structuring of communication.

The Go runtime does include a global deadlock detector, but this can’t detect partial deadlocks (involving only a strict subset of a program’s goroutines), and is ‘ultimately inadequate for complex, large scale applications that may easily be undermined by trivial mistakes or benign changes to the program structure.’

The Go type system can ensure that the values sent and received on channels are of appropriate types, but it cannot give any static guarantees about liveness or safety. Consider the following Go implementation of a concurrent prime sieve. It has several elements making it hard to reason about, including unbounded iteration (L3, 6, 13), dynamic channel creation (L11, 15), and spawning of concurrent threads (L12, 16).
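The paper’s listing isn’t reproduced here, but it is essentially the classic concurrent prime sieve from the Go documentation (the line numbers above refer to the paper’s figure). A version along those lines, restructured slightly so that the first n primes can be collected and returned:

```go
package main

import "fmt"

// generate sends the integers 2, 3, 4, ... on ch (unbounded iteration).
func generate(ch chan<- int) {
	for i := 2; ; i++ {
		ch <- i
	}
}

// filter forwards values from in to out, dropping multiples of prime.
func filter(in <-chan int, out chan<- int, prime int) {
	for {
		if i := <-in; i%prime != 0 {
			out <- i
		}
	}
}

// primes returns the first n primes by dynamically creating channels
// and spawning a chain of filter goroutines, one per prime found.
func primes(n int) []int {
	ch := make(chan int)
	go generate(ch)
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		p := <-ch
		out = append(out, p)
		next := make(chan int) // dynamic channel creation
		go filter(ch, next, p) // dynamic goroutine spawning
		ch = next
	}
	return out
}

func main() {
	fmt.Println(primes(10)) // [2 3 5 7 11 13 17 19 23 29]
}
```

All three of the troublesome features called out above are on display: the loops in generate and filter never terminate, and each new prime dynamically creates a channel and spawns a goroutine.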

GoInfer and Gong combine to provide static verification of liveness and the absence of communication errors in such programs. GoInfer extracts concurrent behavioural types from a Go program, and Gong then performs analysis on those types for verification.

Type inference with GoInfer

GoInfer is written in Go, using the go/ssa (Static Single Assignment) package. The SSA intermediate representation is transformed into a system of type equations by converting each SSA block into an individual type equation.

To give a flavour, for the prime sieve program above we end up with the following type equations:

• The overall program has the type $(new a)(\mathbf{g}\langle a \rangle | \mathbf{r} \langle a \rangle)$. Interpret this as ‘create a new channel ‘a’, and then behave as the generator process ‘g’ and a recursive process ‘r’ in parallel’.
• The generator process has type $\mathbf{g}(x) \hat{=} \bar{x};\mathbf{g}\langle x \rangle$. (Output a value on channel ‘x’, then behave as g(x) again).
• The recursive process ‘r’ has type $\mathbf{r}(x) \hat{=} x;(new b)(\mathbf{f} \langle x, b \rangle | \mathbf{r} \langle b\rangle)$. (Receive a value on channel ‘x’, then create a new channel ‘b’ and behave as a filter process ‘f’ in parallel with ‘r’, both using the new channel ‘b’).
• The filter process ‘f’ has type $\mathbf{f}(x,y) \hat{=} x;(\hat{y};\mathbf{f}\langle x,y \rangle \bigoplus \mathbf{f} \langle x,y \rangle)$. (Receive a value on channel ‘x’ and then either behave as a process that outputs a value on channel ‘y’, followed by filtering, or as a filtering process (without outputting a value on ‘y’ first)).

Underpinning these type equations is a language called MiGo (mini-Go) which models the channel-based concurrency features of Go itself. MiGo uses the familiar ‘!’ and ‘?’ for message send and receive and has the following compact syntax which I present without further explanation in the interests of space:

Expressed in MiGo, the prime sieve program looks like this:

Given MiGo, the authors define an operational semantics and a behavioural typing system for it.

Go’s channel types are related to those of the π-calculus, where the type of a channel carries the type of the objects that threads can send and receive along the channel. Our typing system augments Go’s channel types by also serving as a behavioural abstraction of a valid MiGo program, where types take the form of CCS processes with name creation.

(CCS = Calculus of Communicating Systems)

Verification with Gong

[Gong] checks for liveness and detects communication errors even in the presence of highly dynamic and unconstrained communication topologies, containing infinitely many recursive processes and channels, which can often be out of scope for existing behavioural type-based analyses.

Gong is written in Haskell (of course it is 😉 ), and inputs a system of type equations representing a Go program’s concurrent behaviour (as produced by GoInfer). First it checks that the system is fenced, and if so it generates and checks all of the (k) reachable terms for liveness and channel safety.

If (the type of) a program is fenced, then we know that even if it spawns infinitely many processes, the program will actually consist of a finite number of communication patterns (that may themselves be repeated infinitely many times). The coloured regions in the diagram below show the fenced regions for the concurrent prime sieve example:

Given that a program is fenced, a symbolic semantics can be defined in terms of a labelled transition system (LTS) which we know will have finite state. Liveness is shown for three classes of programs:

1. Those with a path to termination. Programs in this class that are typeable with a live type can always satisfy liveness.
2. Those that do not contain infinitely occurring conditional branches – if such a process is assigned a live type, then it is itself live.
3. Infinitely running programs which contain recursive variables in conditional branches. At the type level, the choice as to which branch to take is abstracted away, so we don’t know what branch will really be taken based on the evaluation of the expression presented to the conditional. If the program is live when non-deterministically reducing to either branch of the conditional, then it must itself be live.

Where next?

In future work we plan to extend our approach to account for channel passing, and also lock-based concurrency control, enabling us to verify all forms of concurrency present in Go. The results of §4 suggest that it should be possible to encode our analysis as a model checking problem, allowing us to: (1) exploit the performance enhancements of state of the art model checking techniques; (2) study more fine-grained variants of liveness; (3) integrate model checking into the analysis of conditionals to, in some scenarios, decide the program class (viz. § 5.3).

Type checking has come a long way… automated liveness and safety checking for all forms of concurrency in Go would be an incredibly useful tool, no doubt saving hours of frustration trying to diagnose and debug errant concurrent behaviours.