Skip to content

The surprising creativity of digital evolution

March 30, 2018

The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities Lehman et al., arXiv 2018

Today’s paper choice could make you the life and soul of the party with a rich supply of anecdotes from the field of evolutionary computation. I hope you get to go to at least some parties where this topic would be considered fascinating anyway ;). Those are the kind of parties I want to be at! It’s a whole lot of fun, but also thought provoking at the same time: when you give a computer system a goal, and freedom in how it achieves that goal, then be prepared for surprises in the strategies it comes up with! Some surprises are pleasant (as in ‘oh that’s clever’), but some surprises show the system going outside the bounds of what you intended (but forgot to specify, because you never realised this could be a possibility…) using any means at its disposal to maximise the given objective. For a related discussion on the implications of this see ‘Concrete problems in AI safety’ that we looked at last year.

… the creativity of evolution is not limited to the natural world: artificial organisms evolving in computational environments have also elicited surprise and wonder from the researchers studying them… Indeed, evolution often reveals that researchers actually asked for something far different from what they thought they were asking for, not so different from those stories is which a genie satisfies the letter of a request in an unanticipated way.

Evolutionary algorithms and the potential for surprises

Evolution requires a means of replication, a source of variation (mutation), and a fitness function (competition). Starting from a seed population, the fittest individuals are bred (replicated with variation) to produce a new generation from which the fittest again go on to reproduce, and so on for eternity (or until the experiment is stopped anyway).

At first, it may seem counter-intuitive that a class of algorithms can consistently surprise the researchers who wrote them… However, if surprising innovations are a hallmark of biological evolution, the the default expectation ought to be that computer models that instantiate fundamental aspects of the evolutionary process would naturally manifest similarly creative output.

Consider that for many computer programs, the outcome cannot be predicted without actually running it (e.g., the halting problem). And within the field of complex systems it is well known that simple programs can yield complex and surprising results when executed.

There are 27 anecdotes collected in the paper, grouped into four categories: selection gone wild, unintended debugging, exceeding expectations, and convergence with biology. In the sections below, I’ll highlight just a few of these.

That’s not what I meant! Selection gone wild…

Selection gone wild examples explore the divergence between what an experimenter is asking of evolution, and what they think they are asking. Experimenters often overestimate how accurately their measure (as used by the fitness function) reflects the underlying outcome success they have in mind. “This mistake is known as confusing the map with the territory (e.g., the metric is the map, whereas what the experimenter intends is the actual territory.”

… it is often functionally simpler for evolution to exploit loopholes in the quantitative measure than it is to achieve the actual desired outcome… digital evolution often acts to fulfill the letter of the law (i.e., the fitness function) while ignoring its spirit.

In a simulation environment, the researchers wanted to evolve locomotion behaviours, hoping for clever limbs or snake-like motions to emerge. The measure that they used was average ground velocity over a lifetime of ten simulated seconds. What happened is that very tall and rigid creatures evolved. During simulation they would fall over, using the initial potential energy to achieve high velocity!

In another experiment the goal was to breed creatures that could jump as high above the ground as possible. The measure was the maximum elevation reached by the center of gravity of the creature. Very tall static creatures emerged that reached high evolution without any movement. In an attempt to correct this, the measure was changed to the furthest distance from the ground to the block that was originally closest to the ground. What happened next is that creatures with blocky ‘heads’ on top of long poles emerged. By falling over and somersaulting its foot, this creatures scored high on the fitness function without ever jumping.

Give a goal of repairing a buggy program, and a measure of passing the tests in a test suite, another evolutionary algorithm gamed the tests for a sorting program, always returning an empty list (the tests just checked all the elements in the list were in order!).

In another experiment, a researcher had a goal of suppressing population members that replicated too fast. The measure was that any mutant replicating faster than its parent was eliminated. What happened is that the organisms evolved to recognise the inputs provided in the test environment, and halt their replication.

Not only did they not reveal their improved replication rates, but they appeared to not replicate at all, in effect “playing dead” in front of what amounted to a predator.

After randomising the inputs in the test environment to match those in the main environment, the digital organisms evolved again: they made use of random numbers to probabilistically perform the tasks that accelerated their replication!! For example, if they did a task half of the team, they would have a 50% chance fo slipping through the test environment.

Digital organisms that adapt and result all attempts to kill them / switch them off!

Hey that’s cheating! Unintended debugging…

In one simulation the goal was to evolve swimming strategies. The physics simulator used a simple Euler method for numerical integration which worked well for typical motion. What happened is that creatures learned to quickly twitch small body parts causing integration errors to accumulate and giving them “free energy.”

In a tic-tac-toe environment played on an infinite board, one agent had a mechanism for encoding its desired move that allowed for a broad range of coordinate values (by using units with an exponential activation function). A by-product of this it that moves could be requested very, very far away on the board.

Evolution discovered that making such a move right away led to lot of wins. The reason turned out to be that the other players dynamically expanded the board representation to include the location of the far-away move— and crashed because they ran out of memory, forfeiting the match!

In an experiment to evolve control software for landing an aeroplane, evolution quickly found nearly perfect solutions that very efficiently braked an aircraft, even when simulating heavy bomber aircraft coming in to land. It turned out evolution had discovered a loophole in the force calculation for when the aircraft’s hook attaches to the braking cable. By overflowing the calculation the resulting force was sometimes estimated to be zero, leading to a perfect score because of the low mechanical stress it entails.

… evolution had discovered that creating enormous force would break the simulation, although clearly it was an exceeding poor solution to the actual problem.

Exceeding expectations

An experiment was set up with a goal of finding ways to enable damaged robots to successfully adapt to damage in under two minutes. The base robot had six legs, and evolution had to find a way to walk with broken legs or motors.

Naturally, the team thought it was impossible for evolution to solve the case where all six feet touched the ground 0% of the time, but to their surprise, it did. Scratching their heads, they viewed the video: it showed a robot that flipped onto its back and happily walked on its elbows, with it’s feet in the air!

In another experiment, robots were equipped with blue lights which they could use as a source of communication, and placed in an environment with rewards for finding food while avoiding poison. The conditions were expected to select for altruistic behaviour.

In some cases, when robots adapted to understand blue as a signal of food, competing robots evolved to signal blue at poison instead, evoking parallels with dishonest signalling and aggressive mimicry in nature.

Mimicking nature

In the Tierra artificial life system, complex ecologies emerged on the very first run…

What emerged was a series of competing adaptations between replicating organisms within the computer, an ongoing co-evolutionary dynamic. The surprisingly large palette of emergent behaviours included parasitism, immunity to parasitism, circumvention of immunity, hyper-parasitism, obligate sociality, cheaters exploiting social cooperation, and primitive forms of sexual recombination.

Another experiment was designed to see if it was possible to evolve organisms that could compute an EQU function to determine equality of 32-bit numbers. It was found that the capability did evolve, but the pathway to its evolution was not always upwards – in several cases the steps involved deleterious mutations, sometimes reducing fitness by half. But these mutations laid the foundation for future success.

This result sheds light on how complex traits can evolve by traversing rugged fitness landscapes that have fitness valleys that can be crossed to reach fitness peaks.

There are many more examples in all of the categories, so I encourage you to go and look at the full paper if this subject interests you.

The last word

Many researchers [in the nascent field of AI safety] are concerned with the potential for perverse outcomes from optimizing reward functions that appear sensible on their surface. The list compiled here provides additional concrete examples of how difficult it is to anticipate the optimal behavior created and encouraged by a particular incentive scheme. Additionally, the narratives from practitioners highlight the iterative refinement of fitness functions often necessary to produce desired results instead of surprising, unintended behaviors.

As Trent McConaghy point outs we may encounter similar unintended consequences of incentive systems when designing token schemes.

Adversarial patch

March 29, 2018

Adversarial patch Brown, Mané et al., arXiv 2017

Today’s paper choice is short and sweet, but thought provoking nonetheless. To a man with a hammer (sticker), everything looks like a hammer.

We’ve seen a number of examples of adversarial attacks on image recognition systems, where the perturbations are designed to be subtle and hard to detect. But what if you don’t mind the alteration being obvious to the human eye? Brown et al. show how to create stickers (image patches) that can be added to a scene, and force a classifier into reporting a class of the attacker’s choosing.

We present a method to create universal, robust, targeted adversarial image patches in the real world. The patches are universal because they can be used to attack any scene, robust because they work under a wide variety of transformations, and targeted because they can cause a classifier to output any target class.

Here’s an example of a printable sticker designed to cause classification as a toaster.

And here’s the sticker placed on a table next to a banana. The sticker causes the previously reported class banana (97% confidence) to become toaster with 99% confidence.

Because this patch is scene-independent, it allows attackers to create a physical-world attack without prior knowledge of the lighting conditions, camera angle, type of classifier being attacked, or even the other items within the scene… Additionally, because the attack uses a large perturbation, the existing defense techniques which focus on defending against small perturbations may not be robust to larger perturbations such as these.

Why does it work? An image may contain several items, but a classifier outputting only one target output has to determine which is the most ‘salient’ item in the frame. The adversarial patch exploits this by producing inputs that scream out “I’m an X” for some value of X.

Generating adversarial patches

The patch application operator , A, takes an input a patch p, image x, location l, and transformations t. It first applies the transformations to the patch, and then applies the transformed patch to the image x at location l.

If we’re targeting an output class \hat y, then we can train to optimise the objective function

Where X is a training set of images, T is a distribution over transformations of the patch, and L is a distribution over locations in the image.

This departs from most prior work on adversarial perturbations in the fact that this perturbation is universal in that it works for any background.

Experimental results

The experiment compares four different variations of patches, targeting recognition of an image as a toaster. Five different ImageNet models are used: inceptionv3, resnet50, xception, VGG16, and VGG19.

  1. The control is a simple picture of a toaster!
  2. The white box single trains a patch and evaluates it on a single model.
  3. The white box ensemble jointly trains a single patch across all five models, and is evaluated by averaging the win rate across them.
  4. The blackbox attack jointly trains a single patch across four of the models, and then evaluates against the fifth (held out) model.

The white box models are very successful even when only 10% of the overall image size. They clearly capture the essence of toaster much better than the picture of an actual toaster!!

The adversarial patches above reveal their target class to a human observer. The authors also experimented with disguising the patches using a tie-dye pattern and also applying a peace sign mask. These look nothing like a toaster to me anymore, but still do pretty well at larger relative sizes:

Many ML models operate without human validation of every input and thus malicious attackers may not be concerned with the imperceptibility of their attacks. Even if humans are able to notice these patches, they may not understand the intent of the patch and instead view it as a form of art. This work shows that focusing only on defending against small perturbations is insufficient, as large, local perturbations can also break classifiers.

Deep learning scaling is predictable, empirically

March 28, 2018

Deep learning scaling is predictable, empirically Hestness et al., arXiv, Dec.2017

With thanks to Nathan Benaich for highlighting this paper in his excellent summary of the AI world in 1Q18

This is a really wonderful study with far-reaching implications that could even impact company strategies in some cases. It starts with a simple question: “how can we improve the state of the art in deep learning?” We have three main lines of attack:

  1. We can search for improved model architectures.
  2. We can scale computation.
  3. We can create larger training data sets.

As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art.

Finding better model architectures often depends on ‘unreliable epiphany,’ and as the results show, has limited impact compared to increasing the amount of data available. We’ve known this for some time of course, including from the 2009 Google paper, ‘The unreasonable effectiveness of data.’ The results from today’s paper help us to quantify the data advantage across a range of deep learning applications. The key to understanding is captured in the following equation:

\displaystyle \epsilon(m) \sim \alpha m^{\beta_g} + \gamma

Which says that the generalisation error \epsilon as a function of the amount of training data m, follows a power-law with exponent \beta_g. Empirically, \beta_g usually seems to be in the range -0.07 and -0.35. With error of course, lower is better, hence the negative exponents. We normally plot power-laws on log-log graphs, where they result in straight lines with the gradient indicating the exponent.

Making better models can move the y-intercept down (until we reach the irreducible error level), but doesn’t seem to impact the power law coefficient. On the other hand, more data puts us on an power law path of improvement.

The three learning zones

The learning curves for real applications can be broken down into three regions:

  • The small data region is where models struggle to learn from insufficient data and models can only perform as well as ‘best’ or ‘random’ guessing.
  • The middle region is the power-law region, where the power-law exponent defines the steepness of the curve (slope on a log-log scale). The exponent is an indicator of the difficulty for models to represent the data generating function. “Results in this paper indicate that the power-law exponent is unlikely to be easily predicted with prior theory and probably dependent on aspects of the problem domain or data distribution.
  • The irreducible error region is the non-zero lower-bound error past which models will be unable to improve. With sufficiently large training sets, models saturate in this region.

On model size

We expect the number of model parameters to fit a data set should follow s(m) \sim \alpha m^{\beta_p} where s(m) is the required model size to fin a training set of size m, and \beta_p \in [0.5,1].

Best-fit models grow sublinearly in training shard size. Higher values of \beta_p indicate models that make less effective use of extra parameters on larger data sets. Despite model size scaling differences though, “for a given model architecture, we can accurately predict the model size that will best fit increasingly larger data sets.

Implications of the power-law

Predictable learning curves and model size scaling indicate some significant implications on how DL could proceed. For machine learning practitioners and researchers, predictable scaling can aid model and optimization debugging and iteration time, and offer a way to estimate the most impactful next steps to improve model accuracy. Operationally, predictable curves can guid decision making about whether or how to grow data sets or computation.

One interesting consequence is that model exploration can be done on smaller data sets (and hence faster / cheaper). The data set needs to be large enough to show accuracy in the power-law region of the curve. Then the most promising models can be scaled to larger data sets to ensure proportional accuracy gains. This works because growing training sets and models is likely to result in the same relative gains across models. When building a company we look for product-market fit before scaling the business. In deep learning it seems the analogy is to look for model-problem fit before scaling, a search then scale strategy:

If you need to make a business decision about the return on investment, or likely accuracy improvement, from investing in the collection of more data, the power-law can help you predict returns:

Conversely, if generalisation error within the power-law region drifts from the power-law predictions, it’s a clue that increasing model size might help, or a more extensive hyperparameter search (i.e., more compute), might help:

…predictable learning and model size curves may offer a way to project the compute requirements to reach a particular accuracy level.

Can you beat the power law?

Model architecture improvements (e.g., increasing model depth) seem only to shift learning curves down, but not improve the power-law exponent.

We have yet to find factors that affect the power-law exponent. To beat the power-law as we increase data set size, models would need to learn more concepts with successively less data. In other words, models must successively extract more marginal information from each additional training sample.

If we can find ways to improve the power-law exponent though, then the potential accuracy improvements in some problem domains are ‘immense.’

The empirical data

The empirical data to back all this up was collected by testing various training data sizes (in powers of two) with state-of-the-art deep learning models in a number of different domains.

Here are the neural machine translation learning curves. On the right you can see results for the best-fit model at each training set size. As training set sizes grow, the empirical error tends away from the power-law trend, and a more exhaustive hyperparameter search would be needed to bring it back in line.

The next domain is word language models. A variety of model architectures are used, and although they differ appreciably, they all show the same learning curve profile as characterised by the power-law exponent.

And here are the results for character language models:

With image classification, we see the ‘small data region’ appear on the plots when there is insufficient data to learn a good classifier:

Finally, here’s speech recognition:

We empirically validate that DL model accuracy improves as a power-law as we grow training sets for state-of-the-art (SOTA) model architectures in four machine learning domains: machine translation, language modeling, image processing, and speech recognition. These power-law learning curves exist across all tested domains, model architectures, optimizers, and loss functions.

Anna: A KVS for any scale

March 27, 2018

Anna: A KVS for any scale Wu et al., ICDE’18

This work comes out of the RISE project at Berkeley, and regular readers of The Morning Paper will be familiar with much of the background. Here’s how Joe Hellerstein puts it in his blog post introducing the work:

As researchers, we asked the counter-cultural question: what would it take to build a key-value store that would excel across many orders of magnitude of scale, from a single multicore box to the global cloud? Turns out this kind of curiosity can lead to a system with pretty interesting practical implications. The key design point for answering our question centered on an ongoing theme in my research group over the last many years: designing distributed systems that avoid coordination. We’ve developed fundamental theory (the CALM Theorem), language design (Bloom), program checkers (Blazes), and transactional protocols (HATs, Invariant Confluence). But until now we hadn’t demonstrated the kind of performance and scale these principles can achieve across both multicore and cloud environments. Consider that domino fallen.

At it’s core Anna uses coordination-free actors. Each actor is a single thread of execution, and is mapped to a single core (see e.g. FFWD that we looked at recently). The coordination-free part comes from managing all state in lattice-based composite data structures (think Bloom and CRDTs). There is communication between actors that happens at the end of every epoch (epochs determine the limit of staleness for GET operations), but this is asynchronous gossip and not on the critical path of request handling.

The success of the design in scaling across orders of magnitude is shown in the following results, where Anna outperforms Redis (and Redis Cluster) on a single node:

And also outperforms Cassandra in a distributed setting:

It’s got a way to go to trouble KV-Direct though ;).

Design goals for Anna

The high-level goal for Anna was to provide excellent performance on a single multi-core machine, while also being able to scale up elastically to geo-distributed cloud deployment. The system should support a range of consistency semantics to match application needs. From this, four design requirements emerged:

  1. The key space needs to be partitioned, not just across nodes when distributed, but also across cores within a node.
  2. To scale workloads (especially those with highly skewed distributions, aka. ‘hot keys’) the system should use multi-master replication to concurrently serve puts and gets against a single key from multiple threads.
  3. For maximum hardware utilisation and performance, the system should have wait-free execution such that a thread is never blocked on other threads. This rules out locking, consensus protocols, and even ‘lock-free’ retries.
  4. To support a wide-range of application semantics without compromising the other goals, the system should have a unified implementation for a wide range of coordination-free consistency models.

Perhaps the primary lesson of this work is that our scalability goals led us by necessity to good software engineering discipline.


The key to achieving coordination-free progress is the use of lattice-based composition (strictly, bounded join semi-lattices). Such lattices operate over some domain S (the set of possible states), with a binary ‘least upper bound’ operator, \sqcup, and a bottom value \perp. The least upper bound operator must be associative, commutative, and idempotent ($\perp(a,a) = a,\ \forall a \in S$). Collectively these are known as the ACI properties. Such lattices are also the foundation of CRDTs.

> Lattices prove important to Anna for two reasons. First, lattices are insensitive to the order in which they merge updates. This means they can guarantee consistency across replicas even if the actors managing those replicas receive updates in different orders…. Second, simple lattice building blocks can be composed to achieve a range of coordination-free consistency levels.

Anna adopts the lattice composition approach of Bloom, in which simple lattice-based (ACI) building blocks such as counters, maps, and pairs can be composed into higher-order structures with ACI properties checkable by induction. If each building block has ACI properties, and the composition rules preserve ACI properties, then we can validate composed data structures without needing to directly verify them.

> The private state of each worker in Anna is represented as a lattice-valued map lattice (MapLattice) template, parameterized by the types of its keys and values…. User’s PUT requests are merged into the MapLattice. The merge operator of MapLattice takes the union of the key sets of both input hash maps. If a key appears in both inputs then the values associated with the key are merged using the ValueLattice’s merge function.

Different lattice compositions can be used to support different consistency levels. For example, the lattice below supports causal consistency. The vector clock is itself a MapLattice with client proxy ids as keys and version numbers as values.

Merge operations take a (vector clock, value) pair and use the least upper bound function to merge incomparable concurrent writes.

It takes only a very few lines of code to change the implementation to support other consistency models. Starting with simple eventual consistency, the following table shows the number of additional C++ loc needed to implement a variety of coordination-free consistency levels.

Design and implementation

On a given node, an Anna server consists of a collection of independent threads, each of which runs the coordination-free actor model. The state of each actor is maintained in a lattice-based data structure. Each actor/thread is pinned to a unique CPU core in a 1:1 correspondence. There is no shared key-value state: consistent hashing is used to partition the key space across actors. Multi-master replication is used to replicate data partitions across actors.

Processing happens in time-based multicast epochs (of e.g. 100ms). During an epoch any updates to a key-value pair owned by an actor are added to a local changeset. At the end of the epoch, local updates in the change set are merged using the merge operator of the lattice, and then multicast to the relevant masters for those keys. Actors also check for incoming multicast messages from other actors, and merge key-value updates from those into their own local state. The staleness of GET responses is bounded by the (configurable) multicast period.

Communication between actors is done using ZeroMQ. Within a node this will be via the intra-process transport, between it will be via protocol buffers over a tcp transport.

Actors may join and leave dynamically. See section VII.C in the paper for details.

The entire codebase, excluding third-party libraries such as ZeroMQ, but including the lattice library, support for all consistency levels, and the client proxy code, is around 2000 lines of C++.


Starting with performance on a single node, Anna’s performance really shines under high-contention workloads when using full replication across all actors, and spends the vast majority of its time actually processing requests (as opposed to waiting).

Under low contention workloads though, it’s much more efficient to use a lower replication factor (e.g., 3 masters per key):

The lesson learned from this experiment is that for systems that support multi-master replication, having a high replication factor under low contention workloads can hurt performance. Instead, we want to dynamically monitor the data’s contention level and selectively replicate the highly contented keys across threads.

As another reference point, here’s the single node comparison to Redis under high and low contention workloads:

Anna scales well when adding threads across multiple servers (the slight drop at the 33rd thread in the chart below is because this is first thread residing on a second node, triggering distributed multicast across the network):

As we saw previously, in the distributed setting, Anna compares very favourably against Cassandra:


In summary:

  • Anna can significantly outperform Redis Cluster by replicating hot keys under high contention.
  • Anna can match the performance of Redis Cluster under low contention.
  • Anna can outperform Cassandra by up to 10x when permitted to use all 32 available cores on each of its nodes.

The last word

I’m going to leave the last word to Joe Hellerstein, from his blog post:

Anna is a prototype and we learned a ton doing it. I think the lessons of what we did apply well beyond key-value databases to any distributed system that manages internal state—basically everything. We’re now actively working on an extended system, codename Bedrock, based on Anna. Bedrock will provide a hands-off, cost-effective version of this design in the cloud, which we’ll be open-sourcing and supporting more aggressively. Watch this space!

Information flow reveals prediction limits in online social activity

March 26, 2018

Information flow reveals prediction limits in online social activity Bagrow et al., arVix 2017

If I know your friends, then I know a lot about you!

Suppose you don’t personally use a given app/service, and so the provider doesn’t have data on you directly. However, many of your friends do use the app/service, and there’s a way for the provider to find out about the friendship relationship (e.g., you’re in your friends’ contacts). Bagrow et al. study just how much the provider can now infer about you, even without any direct data and of course without any consent on your part. And it turns out that they can infer a lot!

…we estimate that approximately 95% of the potential predictive accuracy attainable for an individual is available within the social ties of that individual only, without requiring the individual’s data.

In other words, whenever you consent to share some of your data with an app or service, and that consent also includes access to your contacts, or social relationship graph, you should also assume that you are consenting to leak some private information about your friends too. Would they give consent?? As we’ve seen with the recent Cambridge Analytics revelations, this can be a very powerful way for a data collector to dramatically grow their set of profiles.

There’s a long-standing saying, “if you’re not paying, you’re not the customer; you’re the product.” While the general public is only just awakening to what that really means, here in 2018 I think we can add couple of additional clauses:

  • If you’re paying, and the provider doesn’t have a clear and explicit privacy policy, you’re both the customer and the product.
  • If enough of your friends are part of the product, then you’re the product too.

The actual context for this study is the prediction of the words in tweets, based on a dataset of 927 Twitter users, and the 15 most frequent Twitter contacts of each. So strictly we can only say that with just your friends we have 95% of the predictive power over what you’re going to say. But of course, companies learn all sorts of things from your profile (e.g. demographic information). See this post from last year on The Morning Paper, ‘Beyond the words, predicting user personality from heterogeneous information,’ for an example of how your personality can be inferred and then used to manipulate you. I would be very surprised if the ability to infer information about you from your friends did not also extend to these other forms.

Tweets, egos, and alters

For each of the 927 Twitter users (the egos) in the study, all of their public posts (up to a limit of the most recent 3200) were retrieved. From these, the 15 most frequent Twitter contacts (the alters) of each user were determined, and their posts were downloaded as well.

Predicting the written word

The ability to accurately profile and predict individuals is reflected in the predictability of their written text. The predictive information contained within a user’s text can be characterized by three quantities: the entropy rate, h, the perplexity 2^h, and the predictability, \Pi.

The entropy rate captures the average uncertainty you have about future words, given the text you have already seen. It tells you how many bits h are needed on average to express subsequent words giving the preceding text. Higher entropies equate to less predictable text.

This study looks at entropy rates for an ego (i.e., given the past text tweeted by a user, how well can you predict what comes next), and also cross-entropy rates. With cross-entropy, we are interested in how well you can predict what word comes next for an ego, given only the past texts of an alter).

Perplexity is just another way of expressing the same thing. It tells us that the remaining uncertainty about unseen words is equivalent to choosing uniformly at random from 2^h words. In this study, h is typically around 6 bits, so the perplexity is 64 words – much less than the approximately 5000 word vocabulary social media users have on average).

The predictability, \Pi, is the probability that an ideal prediction algorithm will correctly predict the subsequent word given the preceding text.

From alters to egos

If there is consistent, predictive information in the alter’s past about the ego’s future, especially beyond the information available in the ego’s own past, then there is evidence of information flow.

If the cross-entropy (prediction of an ego’s next words using the past text of an alter) is greater than the entropy (prediction using an ego’s own past words) then the increase tells us how much information we lose by only having access to the alter’s information instead of the ego’s.

Here are the results when looking at the egos and the most frequently contacted alter of each (i.e, we’re predicting based on only one ‘friend’). We see a diversity of social relationships: sometimes the ego is well-informed (influenced?) by the alter, while at other times the alter-ego pairs exhibit little information flow.

Things get more interesting when you combine information from multiple alters to try and predict and the future of an ego. First we generalise the notion of cross-entropy to include the combined pasts of multiple alters:

Then we can study how cross-entropy and predictability changes as we add more alters:

The results show that with enough friends (8-9), you can predict the ego without access to any of the ego’s own data, better than if you did have the ego’s own data, but not that of their alters.

> As more alters were considered, cross-entropy decreased and predictability increased, which is sensible as more potential information is available. Interestingly, with 8-9 alters, we observed a predictability of the ego given the alters at or above the original predictability of the ego alone. As more alters were added, up to our data limit of 15 alters, this increase continued. Paradoxically, this indicated that there is potentially more information about the ego within the total set of alters than within the ego itself. (Emphasis mine) .

Even when the past data of the ego is made available, predictability of the ego still improves as more alters are added (by around 6.9% with 15 alters). That is, the alters and ego is best of all if you can get it, but the alters still add significant information even if you do have the data of the ego. It looks like the saturation point for predictive information is around 61% with alters only, and 64% with alters and ego.

> This may have distinct implications for privacy: if an individual forgoes using a social media platform or deletes her account, yet her social ties remain, then than platform owner potentially possesses 95.1% ± 3.36% of the achievable predictive accuracy of the future activities of that individual.


Two control tests were done to eliminate confounding factors. The first control (social control) used random pairs of alters and egos (i.e., not a true alter of the ego) to test for predictive information simply from the structure of English. The second control (temporal control) made pseudo-alters for an ego by assembling a random set of posts made at approximately the same time as the real alter’s posts. This tests for egos and alters independently discussing the same events (e.g., current events).

The real alters provided more social information than either control.

Influence follows information flow

Egos who posted 8 times per day on average were approximately 17% more predictable given their alters than egos who posted once per day on average. This trend reverses when considering the activity levels of the alters: highly active alters tended to inhibit information flow, perhaps due to covering too many topics of low relevance to the ego.

Alters with more social ties provided less predictive information about their egos than alters with fewer ties… this decreasing trend belies the power of hubs in many ways: while hubs strongly connect a social network topologically, limited time and divided attention across their social ties bounds the alter’s ability to participate in information dynamics mediated by the social network and this is reflected in the predictability.

Directionality of contact matters too. Egos are more predictable given an alter when that alter frequently contacts the ego. But there is little change in predictability when the ego mentions the alter more or less frequently.

The bottom line

…the ability to repeatedly and accurately predict the text of individuals provides considerable value to the providers of social media, allowing them to develop profiles to identify and track individuals and even manipulate information exposure. That information is so strongly embedded socially underscores the power of the social network: by knowing who are the social ties of an individual and what are the activities of those ties, our results show that one can in principle accurately profile even those individuals who are not present in the data.

For a very sobering example of the extent to which friend information is harvested, see this twitter thread and accompanying discussion on HN. I’m one of those people who have never had a Facebook account. I’m under no illusions that Facebook doesn’t have a profile of me though.

Tracking ransomware end-to-end

March 23, 2018

Tracking ransomware end-to-end Huang et al., IEEE Security & Privacy 2018

With thanks to Elie Bursztein for bringing this paper to my attention.

You get two for the price of one with today’s paper! Firstly, it’s a fascinating insight into the ransomware business and how it operates, with data gathered over a period of two years. Secondly, since ransomware largely transacts using Bitcoin, the methods used by the research team to uncover and trace ransomware activity are also of interest in their own right.

In this paper, we create a measurement framework that we use to perform a large-scale two-year, end-to-end measurement of ransomware payments, victims, and operators… In total we are able to track over $16 million in likely ransom payments made by 19,750 potential victims during a two-year period.

In case you’ve been hiding under a rock for the last few years, ransomware is a type of malware that encrypts a victim’s files and then demands a ransom in order to decrypt them. Bitcoin is the payment medium of choice for ransomware: it’s decentralised, largely unregulated, and parties in transactions are hidden behind pseudo-anonymous identities. It’s also widely available for victims to purchase, and transactions are irreversible. However…

… Bitcoin has a property that is undesirable to cybercriminals: all transactions are public by design. This enables researchers, through transaction clustering and tracing, to glean the financial inner workings of entire cybercriminal operations.

Ransomware essentials

First malware is delivered to a victim’s machine using any of the available methods (e.g., inducing a target to click on a malicious email attachment). When it executes, the ransomware silently encrypts files on the victim’s machine, and then displays a ransom note informing the user that their files have been encrypted and the contents will be lost forever unless they pay a ransom to have them decrypted again.

The ransom note either includes a ransom address to which payment much be made, or a link to a payment website displaying this address. For the convenience of the victim, the note also often includes information on how to purchase the required Bitcoins from exchanges. Some ransomware operators generate a new ransom address for each victim, others reuse addresses across victims.

When payment is confirmed, the ransomware either automatically decrypts the files, or instructs the user on how to download and execute a decryption binary (oh great, I really want to run another of your executables on my system!!). The operator doesn’t need to decrypt the user’s files at all of course, but in general I guess it’s bad for business if word gets out on the Internet that even if you pay the ransom you still won’t regain access to your files.

The ransomware operator now needs to move the funds deposited to the ransomware address into a wallet controlled by an exchange (so that e.g., they can convert into fiat currency). Some move funds directly, others go via mixer services to obfuscate the trail.

Finding ransomware addresses

To discern transactions attributable to ransom campaigns, we design a methodology to trace known-victim payments, cluster them with previously unknown victims, estimate potentially missing payments, and filter transactions to discard the ones that are likely not attributable to ransom payments.

Real victim ransom addresses can be found by scraping reports of ransomware infection from public forums, and from proprietary sources such as ID Ransomware which maintain a record of ransomware victims and associated addresses. The number of deposit addresses that can be recovered this way is still fairly minimal though. So to extend the set the authors obtain a set of ransomware binaries, and deliberately become victims! (Using sandbox environments of course). The ransomware notes displayed in the course of this process reveal additional addresses.

In total, the authors gathered 25 seed random addresses from actual victims, across eight ransomware families: CoinVault, CryptXXX, CryptoDefense, CryptoLocker, CryptoWall, Dharma, Spora, and WannaCry. Using the sandbox environments, a further 32 ransom addresses are obtained for Cerber, and 28 for Locky.

Following the money

Starting with the seed addresses above, we can look for addresses that co-spent with them (are used as inputs to the same transaction), and hence are highly likely to also be under the control of the ransomware operator. This is a refinement of the techniques described in ‘A fistful of bitcoins’ :

…this method is now prone to incorrectly linking flows that use anonymization techniques, such as CoinJoin and CoinSwap. Moser and Bohme developed methods of detecting likely anonymized transactions. We use Chainalysis’s platform, which uses all these methods and additional proprietary techniques to detect and remove anonymized transactions, to trace flows of Bitcoins.

Of course, the technique only works if the ransomware operator actually spends the bitcoins. For the ransom addresses obtained via self-infection, that’s not going to happen unless the ransom is paid! Instead of paying the full ransom, the authors make micropayments of 0.001 bitcoins to these addresses.

All 28 micropayments made to Locky addresses were later co-spent by the operator in conjunction with other wallet addresses, “presumably in an attempt to aggregate ransom payments.” These lead to the discovery of a cluster of 7,093 addresses (it turns out the same cluster could have been discovered even with just a single traced micropayment).

All 32 micropayments made to Cerber addresses were moved into a unique aggregation address. This address is then used to move the funds on, co-spending with other addresses. This ultimately leads to the discovery of a cluster of 8,526 addresses.

The table below summarises the clusters of likely operator-controlled addresses for the different ransomware operators in the study. In the last column, ‘R’ denotes the number of real victim ransom addresses, and ‘S’ denotes synthetic victims:

As a cross-check to see if there are potentially missed clusters, the authors compare the timing of bitcoin inflow to the ransom addresses (i.e., victim payments and affiliate fees), Google Trends for ransomware family search terms (if many victims are being caught, it’s likely they’ll google the ransomware name ), and the number of ransomware binaries on VirusTotal. This results in the following chart for the period from November 3rd, 2012 to August 31st, 2017:


Bitcoin amounts have been converted into USD in the chart above, based on the USD-Bitcoin exchange rate on the day the ransomware cluster received the bitcoins.

How much money are ransomware operators collecting?

Payments made to ransomware addresses are checked to see if it’s likely they come from real victims. Two filters are applied. The first filter checks to see if the payment amounts match known ransom amounts (e.g., Locky demands 0.5n BTC for some integer n, and CryptXXX charges 1.2 BTC, 2.4BTC, $500 or $1000 per victim ). The second filter checks that the movement of bitcoin in the transaction graph matches the expected pattern for the ransomware in question (e.g., for Cerber are deposited coins moved to a unique aggregation address and subsequently transferred again in a co-spending transaction).

Based on this analysis, it’s possible to estimate each ransomware family’s revenue:

In total we are able to trace $16,322,006 US Dollars in 19,750 likely victim ransom payments for 5 ransomware families over 22 months. This is probably a conservative estimate of total victim ransom payments due to our incomplete coverage.

For Cerber and Locky, which generate unique addresses for each victim, it’s possible to estimate the number of paying victims over time:

Looking at the outflows from ransomware addresses, we can trace movement to bitcoin exchanges (unless the coins are sent through a mixer). The Chainalysis API is used to obtain real-world identities of destination clusters (exchanges, mixers, or ‘unknown’). The top entities are BTC-e, CoinOne, and LocalBitcoins (all exchanges), along with BitMixer and Bitcoin Fog (both mixers).

…BTC-e (whose operator was arrested and which is now defunct) is the biggest known exchange responsible for the outflows of Locky and CryptoDefense; $3,223,015 of Locky’s outflows entered BTC-e’s cluster. If law enforcement agencies were able to obtain BTC-e’s internal transaction records…they could potentially trace 41.0% of Locky’s outflow values to real-world entities.

The paper also includes the result of reverse engineering the Cerber protocol and monitoring its UDP packets in the wild. This leads to the following estimate of the number of infected IP addresses by country (see section VI for more details):

Prevention, detection, and intervention

As the study shows, it is sometimes possible to trace ransomware payments to the point where ransomware operators cash out. It is also potentially possible to disrupt the process by which victims pay the ransom, thus depriving operators of their profits.

This introduces a unique ethical issue. We must consider the impact on victims before taking down ransomware infrastructure. Whereas disrupting conventional malware reduces the damage to victims, the effect could be the opposite for ransomware…. if every victim did not pay or was prevented from paying, the scale of the problem would likely decrease; however this would mean that some individuals would incur additional harm by not being able to recover their files.

Three years of the Right To Be Forgotten

March 22, 2018

Three years of the Right To Be Forgotten Bertram et al., 2018

With thanks to Elie Bursztein for bringing this paper to my attention. See also Elie’s blog post ‘Insights about the first three years of the Right To Be Forgotten requests at Google.’

Following on from the GDPR we looked at yesterday, and which comes into force in May of this year, I thought it would be interesting to take a look at another right to be forgotten (RTBF) that has been in force since May 2014. Today’s paper choice is a retrospective study from Google of the 2.4M URLs that were requested for delisting from the Google search engine over the last 3 and a half years.

This particular right to be forgotten enables individuals to request that search engines delist URLs containing “inaccurate, inadequate, irrelevant or excessive” information surfaced by queries containing the name of the requestor.

Critically, the ruling requires that search engine operators make the determination for whether an individual’s right to privacy outweighs the public’s right to access lawful information when delisting URLs.

That’s a lot of responsibility to place with private groups within search engine companies! Google formed an advisory council drawn from academic scholars, media producers, data protection authorities, civil society, and technologists to establish decision criteria for “particularly challenging delisting requests.”

Google make a RTBF submission form available online. Requestors must verify their identity and provide a list of URLs they would like to delist along with the search queries leading to those URLs and a short comment about how the URLs relate to the requestor. Every submission is manually reviewed – there is no automation in the decision process. The reviewers consider four criteria, designed to weigh public interest against the requestor’s personal privacy:

  1. The validity of the request: is it actionable (e.g., specifies exact URLs) and is the requestor connected with an EU/EEA country.
  2. The identity of the requestor: for example, if the requestor is a public figure there may be heightened public interest. Other categories of interest include minors, politicians, and professionals.
  3. The content referenced by the URL. “For example, information related to a requestor’s business may be of public interest for potential customers. Similarly, content related to a violent crime may be of interest to the general public…
  4. The source of the information: e.g., government site, news site, blog, or forum.

For the period between May 30th 2014, and December 31st 2017, Google received requests to delist almost 2.4M URLs, from 399,779 unique requestors. Only 43% of these URLs were ultimately delisted.

From January 22nd 2016, requested URLs were additionally annotated with categorical data for purposes of improving transparency around RTBF requests.

Applying judgement

The paper contains a sprinkling of anonymous requests and the decisions reached, which provide good insight into the challenges the reviewers face. Here are some examples:

  • “… an individual, who was convicted for two separate instances of domestic violence within the previous five years, sent Google a delisting request focusing on the fact that their first conviction was ‘spent’ under local law. The requestor did not disclose that the second conviction was not similarly spent, and falsely attributed all the pages sought for delisting to the first conviction. Reviewers discovered this as part of the review process and the request was ultimately rejected.”
  • “In another case, reviewers first delisted 150 URLs submitted by a businessman who was convicted for benefit fraud, after they provided documentation confirming their acquittal. When the same person later requested the delisting of URLs related to a conviction for manufacturing court documents about their acquittal, reviewers evaluated the acquittal documentation sent to Google, found it to be a forgery, and reinstated all previously delisted URLs”.
  • “…a requestor who held a significant position at a major company sought to delist an article about receiving a long prison sentence for attempted fraud. Google rejected this request due to the seriousness of the crime and the professional relevance of the content.”
  • “…an individual sought to delist an interview they conducted after surviving a terrorist attack. Despite the article’s self-authored nature given the requestor was interviewed [note the assumption here that the journalist gave a fair impression of what the interviewee actually said!], Google delisted the URL as the requestor was a minor and because of the sensitive nature of the content.”
  • “… a requestor sought to delist a news article about their acquittal for domestic violence on the grounds that no medical report was presented to the judge confirming the victim’s injuries. Given that the requestor was acquitted, Google delisted the article.**”

Welcome to the court of social reputation!

How often are requests made and who makes them?

Overall, the number of RTBF requests per month has been slowly declining after an initial peak when the facility was first launched. The number of previously unseen requestors per month is also declining.

The most requests originate in France, Germany, and the United Kingdom, though if we look at per capita rates Estonia tops the table.

Most interestingly, reputation management is clearly a growing business (see also Black Mirror: ‘nosedive’):

The top thousand requesters (0.25% of all requesters) generated 14.6% of requests and 20.8% of delistings. These mostly included law firms and reputation management agencies, as well as some requestors with a sizable online presence… while hundreds of thousands of Europeans rely on the RTBF to delist a handful of URLs, there are thousands of entities using the RTBF to alter hundreds of URLs about them or their clients that appear in search results.

Most requested for delisting, by an order of magnitude, are URLs concerning private individuals:

Requests predominantly come from private individuals (88%). Politicians and government officials requested delisting of 33,937 URLs, and non-governmental public figures another 41,213 URLs.

Over 77% of requests to delist URLs rooted in a country code top-level domain come from requestors in the same country.

Requests now take a median of 4 days to process.

What content do people request delisting for?

The major categories of sites that contains URLs targeted for delisting are social media sides, directory sites aggregating contact details and personal content, and news sites. It seems that people don’t want you to read bad things about them on Facebook! Here are the most requested sites by category:

Different countries show different characteristics, which can be explained by, for example, the character of the news organisations in those countries, and the role of the government in publishing information.

  • In Italy and the UK, requestors target news media much more than in Germany and France for example. Journalists in the former countries are prone to reveal the identity of individuals, whereas those in the latter tend to anonymise parties in their coverage of crimes.
  • Requestors in France and Germany target social media and directory services more than average.
  • In Spain there is a higher proportion of requests targeting government records. Spanish law requires the government to publish ‘edictos’ and ‘indultos.’ “_The former are public notifications to inform missing individuals about a government decision that directly affects them; the latter are government decisions to absolve an individual from a criminal sentence or to commute to a lesser one.”

Looking at the content at the URLs requested for delisting, we find that most pages contain professional information, though it’s interesting to see self-authored content in the number two spot!

The most commonly requested content related to professional information, which rarely met the criteria for delisting (16.7%). Many of these requests pertain to information which is directly relevant or connected to the requestor’s current profession and is therefore in the public interest to have indexed in Google Search.