Same-different problems strain convolutional neural networks Ricci et al., arXiv 2018

Since we’ve been looking at the idea of adding structured representations and relational reasoning to deep learning systems, I thought it would be interesting to finish off the week with an example of a problem that seems to require it: detecting whether objects in a scene are the same or different.

This image containing a flute was correctly classified by a CNN trained on millions of photographs. On ImageNet the network even surpassed the accuracy of a human observer.

This image contains two shapes that are the same, a relationship that is immediately obvious to a human observer. “Yet, the CNN failed to learn this relation even after seeing millions of training examples.

The above is an example of a same-different (SD) visual relation problem (output whether the objects in the scene are the same, or different). Spatial relation (SR) problems ask whether objects follow a certain spatial relation, e.g. in a line, horizontally stacked, vertically stacked, and so on. For example:

The synthetic visual reasoning test (SVRT) contains a collection of 23 binary classification problems along these lines. In each case opposing classes differ based on whether their stimuli obey an abstract rule. If you train a bunch of CNNs (with different depths, filter sizes, etc.) on these tasks an interesting pattern pops out. The CNNs really struggled on problems where the abstract rule required detecting whether things were the same or different (congruent up to some transformation), whereas they achieved good accuracy on spatial relation problems.

The resulting dichotomy across the SVRT problems is striking. CNNs fare uniformly worse on SD problems than they do on SR problems. Many SR problems were learned satisfactorily, whereas some SD problems (e.g. problems 20 and 7) resulted in accuracy not substantially above chance.

For SR problems, all the CNNs did pretty well, regardless of network configuration. But for SD problems larger networks performed noticeably better than smaller ones. This suggests that something about the SD problems is straining the capacity of the CNNs.

### Probing further

To dig deeper into this apparent difference between same-different and spatial-relation problems the authors construct a new visual-relation benchmark called PSVRT. The dataset is parameterised so that the size of the items in the scene, the number of scene items, and the size of the whole image can all be controlled. Scene items are just binary bit patterns placed on a blank background. For any given configuration of parameters, the resulting scene can be used for both an SD problem and an SR problem, simply based on labelling.

Our goal was to examine how hard it is for a CNN architecture to learn relations for visually different but conceptually equivalent problems. If CNNs can truly learn the “rule” underlying these problems, then one would expect the models to learn all problems with more-or-less equal ease. However, if the CNN only memorize the distinguishing features of the two image classes, then learning should be affected by the variability of the example images in each category.

A baseline architecture was established with four convolutional layers, that was able to easily learn both the same-different and spatial-relation PSVRT problems with item size 4, image size 60, and two items in the image. This baseline CNN was then trained from scratch on a variety of PSVRT problems, each time using 20 million training images and a batch size of 50. There were three sub-experiments:

• Fixing item size (m) at 4, number of items (k) at 2, and varying image size (n) between 30 and 180.
• Fixing image size at 60, number of items at 2, and varying item size between 3 and 7.
• Fixing image size at 60, item size at 4, and varying the number of items between 2 and 6.

In all conditions, we found a strong dichotomy between SD and SR conditions. In SR, across all image parameters and in all trials, the model immediately learned at the start of training and quickly approached 100% accuracy, producing consistently high and flat mean ALC curves. In SD, however, we found that the overall ALC was significantly lower than SR.

Digging deeper, when learning did occur in SD, increasing item size never strained performance. But increasing the overall image size, or increasing the number of items did. (Gray bars in the above figures indicate the number of trials in which learning failed). The results suggest that straining is not simply a direct outcome of an increase in image variability. Using CNNs with more than twice the number of kernels (wide), or twice as many layers (deep) did not change the observed trend.

### What’s going on?

The authors hypothesise that the CNNs learn ‘subtraction templates’ when tackling SD problems: filters with one positive region and one negative region. Each relative arrangement of items requires a different subtraction template since each item must lie in on of the template’s two regions. If identical items lie in opposing regions, they are subtracted by the synaptic weights. The difference is used to choose the appropriate same/different label.

A strategy like this doesn’t require memorizing specific items, so item size doesn’t make much of a difference. However, image size (the biggest straining factor) exponentially increases the possible number of arrangements of items.

Our results indicate that visual-relation problems can quickly exceed the representational capacity of feedforward networks. While learning templates for individual objects appears to be tractable for today’s deep networks, learning templates for arrangements of objects becomes rapidly intractable because of the combinatorial explosion in the number of features to be stored… Given the vast superiority of humans over modern computers in their ability to detect visual relations, we see the exploration of attentional and grouping mechanisms as an important next step in our computational understanding of visual reasoning.

Relational inductive biases, deep learning, and graph networks Battaglia et al., arXiv’18

Earlier this week we saw the argument that causal reasoning (where most of the interesting questions lie!) requires more than just associational machine learning. Structural causal models have at their core a graph of entities and relationships between them. Today we’ll be looking at a position paper with a wide team of authors from DeepMind, Google Brain, MIT, and the University of Edinburgh, which also makes the case for graph networks as a foundational building block of the next generation of AI. In other words, bringing back and re-integrating some of the techniques from the AI toolbox that were prevalent when resources were more limited.

We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representation and computations are key to realizing this objective… We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and the rules for composing them.

### Relational reasoning and structured approaches

Human’s represent complex systems as compositions of entities and their interactions. We use hierarchies to abstract away fine-grained differences, manage part-whole associations and other more general relationships, and can solve problems by composing familiar skills and routines.

… the world is compositional, or at least, we understand it in compositional terms. When learning, we either fit new knowledge into our existing structured representations, or adjust the structure itself to better accommodate (and make use of ) the new and the old.

Modern deep learning methods have done incredibly well following an end-to-end design philosophy that avoids explicit structure and ‘hand-engineering’. Such an approach seems to be finding its limits when it comes to challenges such as complex language and scene understanding, reasoning about structured data, transferring learning, and learning from small amounts of experience.

Instead of choosing between nature (end-to-end) and nurture (hand-engineering), can we use them jointly just as nature does, to build wholes greater that the sum of their parts?

A structured representation captures a composition of known building blocks, with structured computations operating over the elements and their composition as a whole. Relational reasoning operates over structure representations combining entities and relations, using rules for how they can be composed.

Recently, a class of models has arisen at the intersection of deep learning and structured approaches, which focuses on approaches for reasoning about explicitly structured data, in particular graphs…. What these approaches all have in common is a capacity for performing computation over discrete entities and the relations between them. What sets them apart from classical approaches is how the representations and the structure of the entities and relations — and the corresponding computations — can be learned, relieving the burden of needing to specify them in advance.

### Relational inductive biases and graphs

We need some way of guiding a learning algorithm towards good solutions. The fancy term for this is inductive bias, the bias part being how an algorithm prioritises one solution over another, independent of the observed data. A regularization term is a form of inductive bias for example.

A relational inductive bias imposes constraints on relationships and interactions among entities in a learning process. Viewed through a relational lens, we can see relational inductive biases at work in several of the standard deep learning building blocks. We just need to uncover in each what are the entities, what are the relations, and what are the rules for composing them.

As an example, in a fully connected layer the entities are the units in the network, the relations are all-to-all, and the rules are specified by the weights and biases. “The implicit relational inductive bias in a fully connected layer is thus very weak: all input units can interact to determine any output unit’s value, independently across outputs.

Within the building blocks we also see aspects of sharing (e.g. using a local kernel function across all inputs in a convolutional layer) and locality.

While the standard deep learning toolkit contains methods with various forms of relational inductive biases, there is no “default” deep learning component which operates on arbitrary relational structure. We need models with explicit representations of entities and relations, and learning algorithms which find rules for computing their interactions, as well as ways of grounding them in data.

A natural structure for this representation is a graph. Graphs support arbitrary (pairwise) relational structure, and computations over graphs afford a strong relational inductive bias. Many problems are easily modelled using a graph representation. For example:

### Introducing graph networks

There is a rich body of work on graph neural networks (see e.g. Bronstein et al. 2017) for a recent survey. Gilmer et al. introduced the message-passing neural network (MPNN) which unifies various graph neural network and graph convolutional approaches. Wang et al. introduced non-local neural networks (NLNN) unifying various ‘self-attention’ style methods. The graph network (GN) framework outlined in this paper generalises and extends graph neural networks, MPNNs, and NLNNs.

The main unit of computation in the GN framework is the GN block, a “graph-to-graph” module which takes a graph as input, performs computations over the structure, and returns a graph as output… entities are represented by the graph’s nodes, relations by the edges, and system-level properties by global attributes.

A graph is thus a three-tuple $G = (\mathbf{u}, V, E)$ where $\mathbf{u}$ is a global attribute, V is a set of nodes (each represented by attributes), and E is a set of edges (also represented by attributes).

GN blocks have three update functions $\phi$ and three aggregation functions $\rho$.

$\phi^{e}$ is mapped across all edges to compute per-edge updates, and then the $\rho^{e \mapsto v}$ function reduces the edge states to a single value $\mathbf{\bar{e}'}$ for each node. $\phi^{v}$ is then mapped across all nodes and $\rho^{e \mapsto u}$ reduces the node states to a single value. Finally the global attributed is update using $\phi^{u}$ and $\rho^{v \mapsto u}$. The $\rho$ functions must be invariant to permutations of their inputs, and should take variable numbers of arguments (e.g., elementwise summation, mean, max etc.). Computations proceed from the edge, to the node, to the global level.

The full update algorithm is given below.

Consider an example graph where nodes represent rubber balls and edges are springs connecting them. We want to predict the movements of the balls in an arbitrary gravitational field.

• The edge update step might compute the forces or potential energies between two connected balls.
• The edge reduction step might sum all the forces or potential energies acting on the ith ball.
• The node update step might compute the update position, velocity, and kinetic energy (attributes) of each ball.
• The node reduction step aggregates all of the edge updates into a node, for example computing the summed forces (which should be zero in this case).
• The global reduction step might then compute the total kinetic energy of the system
• The global update step might compute something analogous to the net forces and total energy of the physical system.

Other update orders are possible (e.g. from global to node to edges).

Graph networks have (at least) three properties of interest:

1. The nodes and the edges between provide strong relational inductive biases (e.g. the absence of an edge between two nodes means that they should not directly influence each other).
2. Entities and relations are represented as sets, and thus are invariant to ordering and permutations. This is a desirable property in many contexts (e.g. objects in a scene do not have a natural ordering).
3. Since the per-edge and per-node functions are reused across all edges and nodes respectively, GNs automatically support a form of combinatorial generalisation. A single GN can operate on graphs of different sizes and shapes.

### Graph network architectures

So far it seems like we’ve just been re-inventing a general graph computation framework. And indeed the framework is agnostic to specific representations and functional forms. But the fun stuff starts when we focus on deep learning architectures allowing GNs to act as learnable graph-to-graph function approximators. The three design principles behind graph networks which support this are:

• flexible representations
• configurable within-block structure, and;
• composable multi-block architectures

Flexible representations refers to the choice of representation for the attributes of edges, nodes, and global state, as well as the form of the outputs. For deep learning implementations, real-valued vectors and tensors would be a likely choice. Edge and node outputs would typically correspond to lists of vectors or tensors, and global outputs to a single vector or tensor. Thus the output of a GN block can be passed to other deep learning building blocks. The relational structure (the graph) can be explicitly specified in the inputs, or it can be inferred or assumed.

Configurable within-block structure refers to flexibility in how the internal structure and functions of a GN block are configured. The $\phi$ functions could themselves be implemented as neural networks for example (e.g. MLPs for vectors, or CNNs for tensors). There are GN configurations that correspond to many existing works in the literature. The following figure shows some examples.

Composable multi-block architectures refers to the fact that the output of one GN block can be fed into another GN block as an input. In general arbitrary numbers of GN blocks can be composed.

One common design is the encode-process-decode configuration in which an input graph is transformed into a latent representation by an encoder GN, a shared core block is applied multiple times to process the encoding, and then a decoder GN produces the final output.

### What’s next

Realizing the full potential of graph networks will likely be far more challenging than organizing their behaviour under one framework, and indeed, there are a number of unanswered questions regarding the best ways to use graph networks. One pressing question is: where do the graphs come from that the graph networks operate over?

For example, how should we convert raw sensory data into structured graph representations?

And beyond just graphs, “one takeaway from this paper is less about graphs themselves and more about the approach of blending powerful deep learning approaches with structured representations.

… learnable models which operate on graphs are only a stepping stone on the path toward human-like intelligence. We are optimistic about a number of other relevant, and perhaps underappreciated, research directions, including marrying learning-based approaches with programs, developing model-based approaches with an emphasis on abstraction, investing more heavily in meta-learning, and exploring multi-agent learning and interaction as a key catalyst for advanced intelligence.

The authors state an intention to release an open-source implementation for graph networks later this year.

With thanks to @osmandros for sending me a link to this paper on twitter.

In this technical report Judea Pearl reflects on some of the limitations of machine learning systems that are based solely on statistical interpretation of data. To understand why? and to answer what if? questions, we need some kind of a causal model. In the social sciences and especially epidemiology, a transformative mathematical framework called ‘Structural Causal Models’ (SCM) has seen widespread adoption. Pearl presents seven example tasks which the model can handle, but which are out of reach for associational machine learning systems.

### The three layer causal hierarchy

A useful insight unveiled by the theory of causal models is the classification of causal information in terms of the kind of questions that each class is capable of answering. This classification forms a 3-level hierarchy in the sense that questions at level i (i = 1, 2 ,3 ) can only be answered if information from level j (j &geq; i) is available.

The lowest (first) layer is called Association and it involves purely statistical relationships defined by the naked data. This is the layer at which most machine learning systems operate.

Level two, Intervention involves not just seeing what is, but reasoning about the effects of actions you might take (interventions). I would argue that reinforcement learning systems operate at this level (e.g., ‘what will happen if I move my knight to this square?’). RL systems tend to operate in very well defined environments though, whereas the intervention layer envisioned here encompasses much more open challenges too. As an example of an intervention, Pearl provides the question “What will happen if we double the price?” (of an item we are selling).

Such questions cannot be answered from sales data alone, because they involve a change in customers behaviour, in reaction to the new pricing.

Personally, I would have thought that if we had sales data showing the effects of previous price increases (on the same or similar items) then it might well be possible to build a predictive model from that data alone. Pearl’s counter-argument is that unless we replicate precisely the market conditions that existed the last time the price reached double its current value, we don’t really know how customers will react).

The highest level of causal reasoning is called counterfactuals and addresses what if? questions requiring retrospective reasoning. On a small scale, this is what I see sequence-to-sequence generative models being capable of doing. We can ‘replay’ the start of a sequence, change the next data values, and see what happens to the output.

The levels form a hierarchy in the sense that interventional questions cannot be answered from purely observational information, and counterfactual questions cannot be answered from purely interventional information (e.g., we can’t re-run an experiment on subjects who were given a drug to see what would have happened had they not been given it). The ability to answer questions at level j though implies that we can also answer questions at level i &leq; j.

This hierarchy, and the formal restrictions it entails, explains why machine learning systems, based only on associations, are prevented from reasoning about actions, experiments, and causal explanations.

### Structural Causal Models

Structural Causal Models (SCM) combine graphical modeling, structural equations, and counterfactual and interventional logic.

These tools permit us to express causal questions formally, codify our existing knowledge in both diagrammatic and algebraic forms, and then leverage our data to estimate the answers. Moreover, the theory warns us when the state of existing knowledge or the available data are insufficient to answer our questions; and then suggests additional sources of knowledge or data to make the questions answerable.

An SCM ‘inference engine’ takes as input assumptions (in the form of a graphical model), data, and a query.

For example, the following graph shows that X (e.g. taking a drug) has a causal effect on Y (e.g. recovery), and a third variable Z (e.g. gender) affects both X and Y.

It reminds me of models I have seen in works on Bayesian decision making.

There are three outputs:

• The Estimand is a mathematical formula providing a recipe for answering the Query from any hypothetical data whenever available, given the assumptions.
• The Estimate is the answer to the query, along with statistical estimates of confidence.
• A set of Fit Indices measure how compatible the data are with the assumptions conveyed by the model. If the encoded assumptions do not have any testable implications then the set will be empty.

When the query cannot be answered given the model assumptions it is declared ‘unidentifiable’.

Fortunately, efficient and complete algorithms have been developed to decide identifiability and to produce estimands for a variety of counterfactual queries and a variety of data types.

### What can SCM do for us?

Rather than dive deeper into the details of how SCM works, Pearl instead enumerates seven tools of causal reasoning enabled by the SCM framework.

#### 1. Transparency and testability

Transparency enables analysts to discern whether encoded assumptions are plausible, and stems from the compact graphical representation.

Testability is facilitated through a graphical criterion called d-separation, which provides the fundamental connection between causes and probabilities. It tells us, for any given pattern of paths in the model, what pattern of dependencies we should expect to find in the data.

#### 2. Do-calculus and the control of confounding

Confounding here seems to refer to the presence of latent variables which are the unobserved causes of two or more observed variables. How to select covariates to control for confounding was determined back in 1993, with the later do-calculus predicting the effects of policy interventions whenever feasible, and exiting with failure when the assumptions don’t permit predictions.

#### 3. Counterfactuals

One of the crowning achievements of modern work on causality has been to formalize counterfactual reasoning within the graphical representation, the very representation researchers use to encode scientific knowledge. Every structural equation model determines the truth of every counterfactual sentence. Therefore, we can determine analytically if the probability of the sentence is estimable from experimental or observational studies, or combination thereof.

#### Mediation analysis

Mediation analysis concerns the discovery of intermediate mechanisms through which causes are transmitted into effects. We can ask queries such as “What fraction of the effect of X on Y is mediated by variable Z?”.

#### Adaptability, external validity, and sample selection bias

The problem of robustness [to change in environmental conditions], in its broadest form, requires a causal model of the environment, and cannot be handled at the level of Association… The do-calculus now offers a complete methodology for overcoming the bias due to environmental changes. It can be used both for re-adjusting learned policies to circumvent environmental changes and for controlling bias due to non-representative samples.

#### Recovering from missing data

Using SCM causal models it is possible to formalise the conditions under which causal and probabilistic relationships can be recovered from incomplete data and, whenever the conditions are satisfied, produce a consistent estimate of the desired relationship.

#### Causal discovery

The d-separation criterion let us detect and enumerate the testable implications of a given model. We can also infer the set of models compatible with the data. There are also methods for discovering causal directionality .

### In conclusion

On the one hand, the article reads like a sales pitch for SCM, teasing us with possibilities and encouraging us to go deeper. That ‘associational machine learning’ methods are tied to the association level in the hierarchy seems to follow from the definition! (Are there machine learning methods which have escaped the purely associational? I hinted at a couple of techniques that seem to have done so earlier in the piece). On the other hand, the rich theory of causal reasoning does indeed seem to have much to complement traditional approaches to machine learning. Pearl certainly seems to think so!

…given the transformative impact that causal modeling has had on the social and medical sciences, it is only natural to expect a similar transformation to sweep through the machine learning technology, once it is enriched with the guidance of a model of the data-generating process. I expect this symbiosis to yield systems that communicate with users in their native language of cause and effect and, leveraging this capability, to become the dominant paradigm of next generation AI.

For an introduction to the theory for the general reader, Pearl’s recent book “The book of why: the new science of cause and effect” may be of interest. (Disclaimer: I haven’t read it yet, but I’m going to add it to my list!).

An empirical analysis of anonymity in Zcash Kappos et al., USENIX Security’18

As we’ve seen before, in practice Bitcoin offers little in the way of anonymity. Zcash on the other hand was carefully designed with privacy in mind. It offers strong theoretical guarantees concerning privacy. So in theory users of Zcash can remain anonymous. In practice though it depends on the way those users interact with Zcash. Today’s paper choice, ‘An empirical analysis of anonymity in Zcash’ studies how identifiable transaction participants are in practice based on the 2,242,847 transactions in the blockchain at the time of the study.

We conclude that while it is possible to use Zcash in a private way, it is also possible to shrink its anonymity set considerably by developing simple heuristics based on identifiable patterns of usage.

The analysis also provides some interesting insights into who is using Zcash and for what as well. Founders and miners combined account for around 66% of the value drawn from the shielded pool.

The code for the analysis is available online at https://github.com/manganese/zcash-empirical-analysis

### Zcash guarantees and the shielded pool

Zcash is based on highly regarded research including a cryptographic proof of the main privacy feature of Zcash, the shielded pool. Not all transactions are required to go through the shielded pool though: Zcash also supports transparent transactions with similar properties to transactions in Bitcoin. Transparent transactions reveal the pseudonymous addresses of senders and recipients as well as the amount being sent.

All newly generated coins are required to pass through the shielded pool before being spent further. Based on this the Zcash developers concluded that the anonymity set for users spending shielded coins is all generated coins. This paper shows that in practice the anonymity set is much smaller.

To support transparent and shielded transactions Zcash has two types of addresses: transparent addresses (t-address) and shielded addresses (z-address). Addresses are supplied for the inputs and outputs of transactions, yielding four possible combinations:

• Transparent transactions move funds from t-addresses to t-addresses.
• Shielded transactions move funds from t-addresses to z-addresses
• Deshielded transactions move funds from z-addresses to t-addresses
• Private transactions move funds between z-addresses.

Their are four main types of actor in the Zcash ecosystem. Founders are onto a nice little number and receive 20% of all newly generated coins. Founder addresses are specified in the Zcash parameters. Miners maintain the ledger and receive block rewards and transaction fees. Services are entities that accept ZEC as a form of payment, for example exchanges and trading platforms. There are also individual participants who hold and transact in ZEC at a personal level. (Charities and other organisation accepting Zcash are included in this last category).

As of January 2018 258,472 blocks had been mined and 3,106,643 ZEC generated (621,182 ZEC of which went to the founders). Across all blocks there were 2,242,847 transactions, broken down as show in the following table.

As the following chart shows, transparent transaction usage is growing disproportionately over time to make a larger and larger percentage of the overall transaction volume.

As mentioned previously, these transactions offer essentially the same privacy as Bitcoin (i.e., not great), and can be de-anonymised using the same techniques as used for Bitcoin.

1,740,378 distinct t-addresses had been used, of which 8,727 had acted as inputs in at least one t-to-z transaction, and 330,780 have acted as outputs in at least one z-to-t transaction.

The overall value held in the shielded pool is increasing over time, with noticeable shielding and deshielding spikes that the analysis will show is due to the actions of miners and founders.

Only 25% of all t-addresses hold a non-zero balance, and the top 1% hold 78% of all ZEC. The richest address has a higher balance than the entire shielded pool.

### Heuristics

We can use similar heuristics to those used in de-anonymising Bitcoin transactions, but adapted to account for the differences between t- and z-addresses.

1. If two or more t-addresses are inputs in the same transaction (whether that transaction is transparent, shielded, or mixed), then they are controlled by the same entity.
2. If one (or more) address in an input t-address in a vJoinSplit transaction and a second address is an output t-address in the same vJoinSplit transaction, then if the size of zOut is 1 (i.e., this is the only transparent output address), the second address belongs to the same user who controls the input addresses.

(Heuristic 2 is the ‘change address’ heuristic).

Using just the first heuristic it is possible to discover clusters of addresses, and by finding a known entity associated with any one address in a cluster, assign all of the cluster addresses to that entity. Using this method, here are the top ten identified Zcash exchanges according to volume traded (the deposits and withdrawal columns indicated the number of transactions initiated by the authors to discover seed addresses):

Identifying exchanges is important, as it makes it possible to discover where individual users may have purchased their ZEC. Given existing and emerging regulations, they are also the one type of participant in the Zcash ecosystem that might know the real-world identify of users.

The publicised addresses of founders, and of mining pools, also act as seeds to discover larger clusters of addresses controlled by these entities. In this manner 123 founder addresses were uncovered, and 110,918 mining pool addresses.

### Who uses the shielded pool?

The previous section looked at t-addresses, where users should at least have less expectation of privacy. So what’s really interesting is what we can learn about usage of the shielded pool.

Deposits and withdrawals into the shielded pool closely mirror each other, with most users not only withdrawing the exact number of ZEC they deposit into the pool but doing so very quickly after making a deposit.

The main participants putting money into the pool are miners. The consensus rules for Zcash dictate that miners and founders must put their block rewards into the shielded pool before spending them further.

The intent of the shielded pool is to provide an anonymity set so that when users withdraw coins it is not clear whose coins they are. However, if a t-to-z transaction can be linked to a z-to-t transaction then those coins can be ruled out of the anonymity set.

Founders it turns out have predictable behaviour that enables many of their transactions to be linked. Known founder addresses identify deposits into the pool, and furthermore the deposits follow a predictable pattern of depositing 249.9999 ZEC – the reward for 100 blocks. That suggests that withdrawals might follow a predictable pattern too, and lo-and-behold there are 1,953 withdrawals of exactly 250.0001 ZEC. Both deposits and withdrawals happen with a period of 6-10 blocks, following a step-like pattern.

This leads to heuristic three:

• Any z-to-t transaction carrying 250.0001 ZEC in value is done by the founders

This heuristic leads to the identification of a further 48 founder addresses.

Miner deposits into the pool are also predictable since they immediately follow coin generation. Flypool and F2Pool are the biggest:

Miners don’t solely use t-addresses associated with deposits for withdrawals, but they use enough of them that output addresses can be linked using the heuristics.

• If a z-to-t transaction has over 100 output t-addresses, one of which belongs to a known mining pool, then we label the transaction as a mining withdrawal (associated with that pool), and label all non-pool output t-addresses as belonging to miners.

Using this heuristic 110,918 addresses were tagged as belonging to miners, allowing a signifiant portion of z-to-t transaction to be linked:

Outside of founders and miners, any time there is exactly one t-to-z transaction carrying value v, followed by exactly one z-to-t transaction for the exact same amount within a small number of blocks, then those transactions are linked too.

There were 6,934 private (z-to-z) transactions, with timing that suggests a smaller number of users make many transactions each.

### The Shadow Brokers

Using their heuristics, and looking at deposits matching the price of NSA tool dumps made by the hacker collective ‘The Shadow Brokers’ (TSB), the authors were able to 24 clusters of addresses potentially associated with TSB purchases.

### In conclusion

… our study has shown that most users are not taking advantage of the main privacy features of Zcash at all. Furthermore, the participants who do engage with the shielded pool do so in a way that is identifiable, which has the effect of significantly eroding the anonymity of other users by shrinking the overall anonymity set.

QSYM: a practical concolic execution engine tailored for hybrid fuzzing Yun et al., USENIX Security 2018

There are two main approaches to automated test case generated for uncovering bugs and vulnerabilities: fuzzing and concolic execution. Fuzzing is good at quickly exploring the input space, but can get stuck when trying to get past more complex conditional causes (i.e., when randomly generated inputs are unlikely to satisfy them). Concolic execution, which we saw in action earlier in the week, uses symbolic execution to uncover constraints and pass them to a solver. It can handle complex branch conditions, but it’s much slower. Hybrid fuzzers combine both coverage-guided fuzzing and concolic execution, bringing in the big guns (concolic) when the fuzzer gets stuck. In non-trivial real-world applications though, even the hybrid approach has been too slow. Until now.

For me, the attention grabbing paragraph in this paper is to be found on page 8 (752) in section 5.1. Google’s OSS-Fuzz was previously used to test a number of important real-world applications and libraries including libjpeg, libpng, libtiff, lepton, openjpge, tcpdump, file, libarchive, audiofile, ffmpeg, and binutils.

It is worth noting that Google’s OSS-Fuzz generated 10 trillion test inputs a day [using hundreds of servers] for a few months to fuzz these applications, but QYSM ran them for three hours using a single workstation.

QSYM is a hybrid fuzzer developed in this work which combines a state-of-the-art fuzzer, AFL, with a new concolic executor. QSYM found 13 previously unknown bugs in these eight programs and libraries, as shown in the table below. As noted, these programs had already been thoroughly tested by other state-of-the-art fuzzers including OSS-Fuzz and AFL.

That’s a dramatic improvement in efficiency and effectiveness. The secret is in a careful reconsideration of the received wisdom for designing concolic executors, taking advantage of the fact that the executor will be used in the context of hybrid fuzzing where it doesn’t always have to be perfect, just useful.

QSYM is available in open source at https://github.com/sslab-gatech/qsym.

### Why are existing concolic executors slow?

…the community believes that symbolic and concolic executions are slow due to path explosion and constraint solving…

But careful measurement shows that it’s actually the emulation itself which introduces most of the overhead. The table below shows the emulation overhead in the KLEE and angr symbolic executors. Compared to native execution they are 3-5 orders of magnitude slower.

To manage complexity, existing symbolic executors adopt an IR (intermediate representation). Since the IRs (e.g., the LLVM IR) have much smaller sets of instructions, simulating them is a much easier task than full native instruction simulation. The translation to IR is a source of overhead though. Because of this, existing symbolic executors try to avoid it whenever they can, making block level decisions about whether or not to execute a block in the emulator based on the symbolic variables involved. Analysis shows that even within a block though, only about 30% of instructions truly require symbolic execution. If switching between native and symbolic overhead had lower overhead, we could use symbolic execution at a finer-grained level and hence do less of it.

When exploring multiple paths, conventional concolic execution engines also make heavy use of snapshotting to reduce the overhead of re-execution. This is effective when exploring multiple paths from a branch. In a hybrid fuzzing context with randomly generated inputs the opportunity for re-use is much less. Snapshotting also causes problems with external environments the program interacts with that may also maintain state (e.g., the kernel). One solution is to model the external environment (likely to be imperfect), another is to use (slow) full system concolic execution.

Finally, existing concolic executors try to guarantee soundness by generating complete constraints. This can be expensive and may also end up over-constraining some paths, limiting the ability to find future paths.

### The design of QSYM

QSYM bites the bullet and does away with the IR layer altogether, performing symbolic execution on native instructions. Given the fast symbolic execution that results and the hybrid fuzzing context, QSYM is also able to eliminate the snapshot mechanism and use concrete execution to model external environments. Instead of generating complete constraints, QSYM collects an incomplete set and sometimes solves only a portion of the constraints if a path is overly-constrained.

… QSYM first instruments and then runs a target program utilizing Dynamic Binary Translation (DBT) along with an input test case provided by a coverage-guided fuzzer. The DBT produces basic blocks for native execution and prunes them for symbolic execution, allowing us to quickly switch between two execution models. Then, QSYM selectively emulates only the instructions necessary to generate symbolic constraints, unlike existing approaches that emulate all instructions in the tainted basic blocks. By doing this, QSYM reduced the number of symbolic emulations by a significant maginitude (5x) and hence achieves a faster execution speed.

### Fast symbolic execution

QSYM’s efficient DBT makes it possible to implement fine-grained instruction level taint-tracking. As show in the following figure, by emulating only the required instructions, and not whole blocks, QSYM can avoid expensive emulations of other instructions.

QSYM runs both native and symbolic executions in a single process making mode switches extremely lightweight. This plus the avoidance of snapshots makes it possible to use concrete execution for external environments, treating them as a black box.

Unlike existing symbolic executors, QSYM generates new test cases by applying the solved constraints to the original input (e.g., Driller can generate new test cases that look radically different to the input). Only constraints relevant to the target branch we want to flip are considered.

Combined, these techniques mean that QSYM symbolically executes only about 1/5th of instructions executed by Driller.

### Optimistic solving

Concolic execution is susceptible to over-constraint problems in which a target branch is associated with complicated constraints generated in the current execution path… existing solvers give up too early (i.e., timeout) without trying to utilize the generated constraints, which took most of their execution time.

In a hybrid fuzzing context, all we’re trying to go is get past conditional guards and go deeper into the program logic. So optimistically selecting and solving a portion of the constraints and passing the results to the fuzzer for evaluation is much better than timing out. When exploring a path using optimistic solving, QSYM chooses the last constraint in a path to solve for. Such test cases at least meet the local constraints when reaching the target branch.

In the evaluation, optimistic solving greatly increases the number of bugs QSYM can find in a given amount of time.

### Basic block pruning

We observed that constraints repetitively generated by the same code are not useful for finding new code coverage in real-world software. In particular, the constraints generated by compute-intensive operations are unlikely solvable (i.e, non-linear) at the end even if their constraints are formulated.

QSYM measures the frequency of execution for each basic block and prunes those that have been executed too frequently. Basic blocks are pruned using exponential back-off, only generating constraints when the execution count is a power of two. A configurable parameter enables multiple block executions to count as one execution group for the purposes of frequency counting. This helps to avoid QSYM pruning too hard in some situations.

Using basic block pruning QSYM finds more bugs in less time:

### Representative bugs

Here’s a simplified representation of a bug that QSYM found in ffmpeg. It’s nearly impossible for a fuzzer to get past all the constraints (lines 2-9), by QSYM was able to do so by modifying seven bytes of the input. This meant that the AFL fuzzer was able to pass the branch with the new test case and eventually reach the bug.

Here’s another example of a bug found in file. Line 5 is a tautology due to the incorrect use of the OR operator. To reach this point, QSYM had to learn how to generate a valid ELF file with a note section. For a fuzzer to do this and generate a note section starting with ‘GNU’ is highly improbable.

QSYM makes hybrid fuzzing scalable enough to test complex, real-world applications… QSYM found 13 previously unknown bugs in the eight non-trivial programs, such as ffmpeg and OpenJPEG, which have been heavily tested by the state-of-the-art fuzzer, OSS-Fuzz, on Google’s distributed fuzzing infrastructure.

tags:

NAVEX: Precise and scalable exploit generation for dynamic web applications Alhuzali et al., USENIX Security 2018

NAVEX (https://github.com/aalhuz/navex) is a very powerful tool for finding executable exploits in dynamic web applications. It combines static and dynamic analysis (to cope with dynamically generated web content) to find vulnerable points in web applications, determine whether inputs to those are appropriately sanitised, and then builds a navigation graph for the application and uses it to construct a series of HTTP requests that trigger the vulnerability.

It also works at real-world scale: NAVEX was used on 26 PHP applications with a total of 3.2M SLOC and 22.7K PHP files. It generated 204 concrete exploits across these applications in a total of 6.5 hours. While the current implementation of NAVEX targets PHP applications, the approach could be generalised to other languages and frameworks.

In this paper, our main contribution is a precise approach for vulnerability analysis of multi-tier web applications with dynamic features… our approach combines dynamic analysis of web applications with static analysis to automatically identify vulnerabilities and generate concrete exploits as proof of those vulnerabilities.

Here’s a example of what NAVEX can do. From the 64K SLOC in osCommerce 2.3.4, NAVEX is able to demonstrate a vulnerability in the code below:

Furthermore, it also generates a concrete exploit that successfully triggers the vulnerability:

### Challenges

To work its magic, NAVEX has to overcome three main challenges: finding reachable sinks (points of vulnerability), coping with dynamically generated content, and scaling to large code bases.

Finding potentially vulnerable individual statements is not that hard, but figuring out whether those statements can be reached with unsanitised inputs is much more challenging. The analysis has to take into account the data flow paths through the application, as well as the strength of any sanitisation applied to inputs along those paths. We can’t solely use static analysis for this since many applications include significant portions of dynamically generated content. Searching for and generating executable exploits that may span multiple modules, and many execution paths within modules could easily cause a state-space explosion. NAVEX is designed to support aggressive pruning of the search space to maintain scalability.

### Identifying vulnerable sinks

The first task is to identify vulnerable sinks in an application. Each module is analysed independently, and only those with potential vulnerabilities will be considered as sinks for further analysis.

NAVEX first builds a graph model of each module’s code, then it discovers the paths that contain data flows between sources and sinks. Finally, it uses symbolic execution to generate a model of the execution as a formula and constraint solving to determine which of those paths are potentially exploitable.

The process starts with an attack dictionary containing analysis templates for a wide range of vulnerabilities. The initial implementation contains templates for SQLI, XSS, file inclusion, command injection, code execution, and EAR (execution-after-redirect) attacks. The attack dictionary can be used to find candidate sinks within modules. The dictionary also includes sanitisation patterns and attack strings for each attack.

The next step is to build a graph model representing the possible execution paths through the application. This is based on Code Property Graphs, which combine ASTs, control-flow graphs, and data dependence graphs. Nodes in the CPG are augmented with sanitisation and database constraint tags. Santisation tags store information about the sanitisation status of each variable at a node. Attribute values are either unsan-X or san-X where X represents a particular vulnerability, e.g. san-sql. Database constraints (e.g. NOT NULL) are gathered from the schema and attached to the root node.

Given the graph and the candidate sink nodes, backwards traversal from a sink node can find vulnerable (unsanitised) paths leading back to a source node. For each node representing a sink, the data dependency edges are followed backwards for all variables used in that sink. If the path leads to a function, then all call sites leading to that function will be analysed. The full algorithm is given below:

In an EAR attack, execution continues after a redirect. To find EAR vulnerabilities NAVEX uses a forwards traversal from sources to sinks, where the sources are redirect instructions.

At this point we have a set of paths through the application that are potentially vulnerable. The final step in the static analysis is to generate exploit strings to use in an attack. Each vulnerable path is modelled as a logical formula together with any constraints derived from the database. This is augmented with information from the attack dictionary with constraints over variables at the sink representing values that can lead to an attack.

The augmented formula (i.e. $F_{path} \wedge F_{db} \wedge F_{attack})$ is next sent to a solver, which provides a solution (if it exists) over the values of the input variables, that is an exploit string. This solution contains the values of the input variables, which, after the path and sanitizations executions, cause the attack string to appear at the sink.

### Generating concrete exploits

Even if the solver is able to spit out an exploit string, it doesn’t necessarily mean that the attack is feasible. The final hurdle is to discover a set of HTTP requests that can be sent to the application to trigger the path and execute the attack described by the exploit string(s).

NAVEX combines crawling with dynamic discovery of server-side constraints to build a navigation graph . From the navigation graph, NAVEX can produce a series of HTTP requests that constitute a concrete exploit for the vulnerable sink.

The crawler starts with a set of seed URLs and explores HTML links, forms, and JavaScript code. For valid form submission the crawler extracts the form input fields, buttons, and action and method attributes to generate a set of implied constraints over the form values. JavaScript code is extracted and analysed using concrete symbolic execution:

The code is first executed concretely and when the execution reaches a conditional statement that has symbolic variables, the execution forks. Then, the execution resumes concretely. After the execution stops for all forks, a set of constraints that represent each execution path that returns true is generated.

The resulting formula for the form is sent to the solver to find a solution.

Even with all of this analysis on the client-side, there may be additional server-side constraints not captured by the process as described so far. An example in the paper is a conditional check in the form processing code on the server side on the length of a submitted string (which is not tested for in the client-side validations). Without an understanding of these constraints, we can’t build form submissions that will be accepted. NAVEX uses execution tracing on the server side to see if a form submission was successful, as indicated by a change of state or performance of sensitive operations such as database queries. If a form submission is deemed to have failed, then NAVEX uses the trace to perform concolic execution:

In particular, it first retrieves the executed statements including the conditional statements. Then, the collected conditional statements are transformed automatically to solver specifications and negated to uncover new execution paths. The newly created specifications are then sent to the solver to generate new form inputs. This process is continuously repeated until the form submission is successful.

The end result is a navigation graph such as this:

From the graph, NAVEX searches for paths from public endpoints to the exploitable modules. The search algorithm is as follows:

… the challenging problem of finding sequences of HTTP requests that execute an exploit is transformed into a simple graph search problem, which is efficient.

### NAVEX in action

NAVEX is evaluated on 26 real-world PHP apps with a combined codebase of 3.2M SLOC and 22.7K PHP files. The list of apps and their respective sizes is given in the table below.

NAVEX was able to find 204 exploits. 195 of these are injection exploits, and 9 relate to logic vulnerabilities. The longest exploit paths contained 6 HTPP requests. It takes from a few seconds to a few hours to complete the analysis of an application and build the navigation graph:

The following figure shows the total time to find exploitable sinks and generate concrete exploits, broken down by exploit type:

NAVEX compares very favourable to state-of-the-art vulnerability detection and exploit generation systems such as Chainsaw and RIPS (see section 5.3 in the paper for details).

We demonstrated that NAVEX significantly outperforms prior work on the precision, efficiency, and scalability of exploit generation.

Unveiling and quantifying Facebook exploitation of sensitive personal data for advertising purposes Cabañas et al., USENIX Security 2018

Earlier this week we saw how the determined can still bypass most browser and tracker-blocking extension protections to track users around the web. Today’s paper is a great example of why you should care about that. Cabañas et al. examine the extent to which the profile Facebook builds on its users includes sensitive personal data made available to advertisers for targeting. The work was done just prior to the GPDR coming into force, which makes for very interesting timing from a legal perspective. The headline result is that it looks like Facebook is holding sensitive data on about 40% of the EU population, and that this data can be used by third-parties to target individuals in sensitive demographics and even identify them at a cost of as little as €0.015 per user.

### What kinds of sensitive data are we talking about?

The GDPR definition of sensitive personal data is “data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life of sexual orientation.

As we’ve seen in previous editions of The Morning Paper, Facebook build profiles of their users based on their activities on Facebook and on third-party sites with Facebook trackers. Using the Facebook Ads Manager, you can target users based on their interests (aka ‘ad preferences’). The authors collected a total of 126,192 unique ad preferences which Facebook had assigned to real users. In amongst these 126K preferences are some that pertain to sensitive personal data. To find out how many, the authors selected five of the relevant GDPR categories and set out to match ad preferences against these categories.

1. Data revealing racial or ethnic origin
2. Data revealing political opinions
3. Data revealing religious or philosophical beliefs
4. Data concerning health
5. Data concerting sex life and sexual orientation

One reason for believing that Facebook is tracking such information is that they were previously fined in Spain in 2017 for violating the Spanish implementation of the EU data protection directive 95/46EC (a GDPR predecessor). The Spanish Data Protection Agency claimed that Facebook ‘collects, stores and processes sensitive personal data for advertising purposes without obtaining consent from users.’ The amount of the fine? A paltry €1.2M — compare that to Facebook’s 2017 revenues of \$40.6B!

### The Data Validation Tool for Facebook users (FDVT)

Data for the research is based on (consent-giving) users of the Data Validation Tool for Facebook Users (FDVT). This is a browser extension that shows users how much money Facebook is earning based on their activities on the Facebook site. The video on the linked page makes for very interesting viewing. We all know that advertising powers much of the internet, but it really brings it home to see the revenue you’re generating for Facebook ticking up in real-time as you use the site. The FDVT collects the ad preferences that Facebook has assigned to its users. The 126K unique ad preference categories were obtained from 4,577 users of FDVT between October 2016 and October 2017. Each FDVT user is assigned a median of 474 preferences.

### Assessing the extent of potentially sensitive ad preferences

Starting with the set of 126K ad preferences, the team applied a two-stage filter to find those that reveal sensitive personal data. The initial filter is done using the NPL techniques to find interests whose name or disambiguation category (provided by Facebook) suggests they might be sensitive. A dictionary of 264 sensitive keywords was built and then the spaCy python package was used to compute the semantic similarity between interest name and disambiguation category and the keywords. 4,452 ad preferences gave a high enough similarity score to pass this first filter.

A panel of twelve researchers with some knowledge in the area of privacy then manually classified these 4,452 remaining ad preferences. Each panelist classified the ad preferences into one of five politically sensitive categories (Politics, Health, Ethnicity, Religion, Sexuality) or ‘Other’. Based on majority voting, 2,092 ad preferences were finally labeled as sensitive.

58.3% of the sensitive preferences relate to politics, 20.8% to religion, 18.2% to health, 1.5% to sexuality, and 1.1% to ethnicity.

The Facebook Ads Manager API was then used to quantify how many Facebook users in the EU country have been assigned at least one of the potentially sensitive ad preferences.

### The Results

Results are reported as a percentage of Facebook users per country and in the EU overall (FFB(country, N) for the percentage of FB users assigned at least one of the top N sensitive categories), and also as a percentage of the overall population of each country (FC(country, N)). Here’s the data for every EU country with N=500.

We observe that 73% of EU FB users, which corresponds to 40% of EU citizens, are tagged with some of the top 500 potentially sensitive ad preferences in our dataset. If we focus on individual countries, FC(C,N=500) reveals that in 7 of them more than half of their citizens are tagged with at least one of the top 500 potentially sensitive ad preferences… These results suggest that a very significant part of the EU population can be targeted by ad campaigns based on potentially sensitive personal data.

### Expert verification

To further verify that the set of potentially sensitive ad preferences contained those likely to be relevant under the GDPR, an expert from the Spanish DPA reviewed a subset of 20 ad preferences that all panelists classified as sensitive. The expert confirmed the sensitive nature of the attribute in all cases. The following tables show the percentage of Facebook users, and overall citizens, tagged with at least one of these expert-verified sensitive attributes.

### Exploitation

So far we know that Facebook has sensitive data, but it doesn’t allow third-parties to identify individual users directly based on this. You can of course (that’s the whole point from Facebook’s perspective) conduct advertising campaigns using it though.

The authors give two examples of how the advertising mechanism could be abused. Firstly, an attacker could create hate campaigns, targeting specific sensitive social groups. Campaigns can reach thousands of users at a very low cost (e.g., the authors reached more than 26K FB users with a spend of only €35). Such a campaign would of course be a violation of Facebook’s advertising policies and you would hope that the review process would catch such campaigns before they are live. At least in the context of dark ads though some undesirable adverts do seem to slip through the net.

The second example is an advertiser using a legitimate looking advertising campaign targeting a sensitive group, which is actually a phishing-like attack. When users click on an advert from the campaign they can be taken to a phishing page persuading the user to give away some information that reveals their identity. A recent study cited by the authors saw phishing attack success rates of as high as 9%. At this level, it would have costs the authors around €0.015 per user to identify an arbitrary member of a group.

In summary, although Facebook does not allow third parties to identify individual users directly, ad preferences can be used as a very powerful proxy to perform identification attacks based on potentially sensitive personal data at low cost. Note that we have simply described this ad-based phishing attack but have not implemented it due to the ethical implications.

### FDVT and sensitive data

The FDVT has been extended to enable users to see the sensitive personal data attributes Facebook has associated them with, along the reason Facebook gives.

The results of our paper urge a quick reaction from Facebook to eliminate all ad preferences that can be used to infer the political orientation, sexual orientation, health conditions, religious beliefs or ethnic origin of a user…