Twitter Heron: Towards extensible streaming engines Fu et al., ICDE 2017
We previously looked at the initial Twitter Heron paper that announced Heron to the world. In this ICDE 2017 paper, the team give us an update based on the work done since as a result of open-sourcing Heron.

… we discuss the challenges we faced when transforming Heron from a system tailored for Twitter’s applications and software stack to a system using an extensible, modular architecture which provides flexibility to adapt to various environments and applications.

We get a high level look at the architecture and an investigation into its performance (short version: having a modular architecture doesn’t mean you have to sacrifice performance). The part I find the most interesting though is the comparisons between Twitter Heron’s design and those of Storm and Spark Streaming. (Would be great to see Apache Flink included as well).

Twitter’s daily operations rely heavily on real-time processing of billions of event per day… Heron is now the de facto stream data processing system in Twitter and is used to support various types of applications such as spam detection, real time machine learning, and real time analytics, among others.

### Why a modular architecture?

Moving Heron from an internal project to an open source project meant that Heron could no longer be tied to any element of Twitter’s stack. To be externally useful to a broad audience, it needs to consider different environments (public/private cloud), a variety of software stacks, and diverse application workloads. A good example here is Twitter’s use of Aurora for scheduling, whereas many other organisations may already have an alternate scheduler (e.g., YARN).

Heron allows the application developer or system administrator to create a new implementation for a specific Heron module and plug it into the system without disrupting the remaining modules or the communication between them.

The structure also allows different applications built on top of Heron to use different module implementations while operating on the same underlying resources.

### The high-level architecture of Heron

Heron’s architecture is inspired by that of microkernel-based operating systems… due to the heterogeneity of today’s cloud environments and Big Data platforms, we decided to design Heron using extensible self-contained modules that operate on top of a kernel which provides the basic functionality needed to build a streaming engine.

There are seven main modules, as shown in the following figure:

At runtime, everything is container-based.

`Heron Instances` are essentially the spouts or bolts that run on their own JVM. The `Metrics Manager` collects process metrics, and the `Topology Master` manages the directed graph of spouts and bolts (a topology). Let’s now take a look at the remaining modules in more detail.

#### Resource Manager

The resource manager is invoked on demand to manage resource assignments (CPU, memory, disk) for a particular topology. It does this by assigning `Heron Instances` to containers through a packing plan. An initial packing plan is created when a topology is first submitted, and can be repacked in response to user requests to scale up and down.

Different resource management policies can be selected for the different topologies running on the same cluster.

A generated packing plan is passed to the `Scheduler`.

#### Scheduler

The `Scheduler` module interacts with an underlying scheduling framework such as YARN or Aurora to allocate the resources needed by a packing plan. Heron accommodates both stateful and stateless schedulers, depending on the level of support provided by the underlying scheduling framework.

A stateful `Scheduler` regularly communicates with the underlying scheduling framework to monitor the state of the containers of the topology. In case a container has failed, the stateful `Scheduler` takes the necessary actions to recover from failure… A stateless `Scheduler` on the other hand, is not aware of the state of the containers while the topology is running. More specifically, it relies on the underlying scheduling framework to detect container failures and take the necessary actions to resolve them.

Heron today has support for Aurora and YARN. A Mesos scheduler is being developed in the community.

#### State Manager

The `State Manager` is used for distributed coordination and storing topology metadata. It stores for example topology definitions, packing plans, host and port information for all containers, and the location of the underlying scheduling framework. Heron provides a ZooKeeper based implementation, as well as a local file system version for local development and testing.

#### Stream Manager

The `Stream Manager` handles all inter-process communication. It’s implemented in C++ to provide tighter control over the memory and CPU footprint, and to avoid copying data between the native and JVM heaps.

The Stream Manager uses several techniques to achieve high performance:

• Protocol Buffers are allocated in memory pools and reused, avoiding new/delete operations
• In-place updates of Protocol Buffers are performed
• Lazy deserialization is used whenever possible

The communication layer offers two important configuration parameters to tune its behaviour for a given deployment: max spout pending determines the maximum number of tuple that can be pending on a spout task at any point in time, and cache drain frequency determines how often the tuple cache is drained. The tuple cache store incoming and outgoing data tuples before routing them to the appropriate `Heron Instances`.

### Heron vs Storm

Heron is designed with the goal of operating in a cloud environment on top of a scheduling framework such as Aurora or YARN (although it can also run in local mode). As a result, it leverages the resource isolation mechanisms implemented by these frameworks. Storm, on the other hand implements parts of the functionality of the Heron `Resource Manager`, the Heron `Scheduler` and the underlying scheduling framework in the same abstraction.

In Storm, this creates confusion between Storm’s scheduling decisions and those of any underlying scheduler. Furthermore, in Storm all resources for a cluster must be obtained up front, whereas Heron acquires resources on demand. Thus Storm clusters will tend to be over-provisioned.

Heron provides resource isolation between topologies (through the underlying scheduling framework), and also between processes of the same topology. Storm can do neither – it packs multiple spout and bolt tasks into a single executor, and several executors share the same JVM.

Finally, Heron’s `Stream Manager` handles all data transfers separately from processing units, which helps to make the system scalable. Storm shares communication threads and processing threads in the same JVM. “As a result, it is much harder to isolate performance bottlenecks and thus optimize the overall performance.

### Heron vs Spark Streaming

Spark Streaming depends on Spark itself for extensibility. Because Spark supports a wide diversity of use cases, it is not easy to customize it particularly for streaming.

It is worth noting that Spark (and, as a result, Spark Streaming) has a similar architecture to Storm that limits the resource isolation guarantees it can provide… each executor process can run multiple tasks in different threads. As opposed to Heron, this model does not provide resource isolation among the tasks that are assigned to the same executor.

All Spark Streaming communication relies on Spark, and is not customisable.

### Performance evaluation

If we compare Heron and Storm on a workload chosen to expose framework overheads we see that Heron beats storm on both throughput and latency:

The above figures are with acks enabled.

Heron outperforms Storm by approximately 3-5x in terms of throughput and at the same time has 2-4x lower latency.

With acks disabled, the throughput of Heron is 2-3x higher than Storm’s.

The following two charts show the impact of the max spout pending configuration parameter. As you might expect, allowing more queuing tasks increases throughput up to a point, at the expense of latency.

See section V in the paper for a more detailed performance breakdown.

Despite the benefits of general-purpose architectures, such as Heron’s modular architecture, a common belief is that specialized solutions tend to outperform general-purpose ones because they are optimized for particular environments and applications. In this paper, we show that by carefully optimizing core components of the system, Heron’s general-purpose architecture can actually provide better performance than specialized solutions such as Storm.

An experimental security analysis of an industrial robot controller Quarta et al., IEEE Security and Privacy 2017

This is an industrial robot:

The International Federation of Robotics forecasts that, by 2018, approximately 1.3 million industrial robot units will be employed in factories globally, and the international market value for “robotized” systems is approximately 32 billion USD. In all of their forms, robots are complex automation devices that heavily interact with the physical world…

Most of these control systems were born in an era when they were assumed to be isolated from the network, but are now gaining new interconnections. And hey, guess what:

Unfortunately, even a simple Shodan query shows that sometimes industrial robots are exposed on the Internet without being properly secured.

In this paper, the authors undertake a systematic analysis of the attack surface and potential impacts of cyber attacks against industrial robots. Their findings are sadly not surprising, yet at the same time some of the things you’re about to read may leave you open-mouthed in disbelief. It’s a perfect case study in how not to do things!

### Welcome to Industry 4.0 and the world of connected robots

Industrial robots are “connected” primarily for programming and maintenance purposes – a use case specified by ISO standards. For instance, in a large car production plant developed by KUKA Robotics, all the 259 robots are connected to central control and monitoring systems.

The industrial robot ecosystem is blooming, there are “Robot Web Service” HTTP REST APIs, a Robot App Store (http://www.robotappstore.com/), and venues where users can exchange models, videos, add-ins and “apps” (https://robotapps.robotstudio.com). The authors make three high level observations about the ecosystem:

1. The increased connectivity of computer and robot systems is (and will be) exposing robots to cyberattacks. Indeed, nowadays, industrial robots – originally designed to be isolated – are exposed to corporate networks and to the Internet.
2. The safety systems governing robots are increasing implemented in software.
3. Awareness of security risks within the ecosystem is very low (confirmed by both a small scale survey undertaken by the authors, and the shocking state of security in practice as we’ll see).

In order to attack industrial robots, you need to know something about them of course. Thankfully that knowledge is easy to obtain:

Realistically [attackers] can rely on publicly available information (e.g., controller software and firmware available for download from the vendor’s website), and some reverse engineering. As a matter of fact, we learned most of the details described in this paper by reading freely available technical documentation. Therefore, an attacker can do the same.

If you want access to equipment to verify your attacks before deploying them you’ve got two routes open to you. Firstly, vendors distributed simulators – which at least in one case share most of the code (and thus most of the vulnerabilities) with the firmware of the controller’s computer. Secondly you can buy parts or even complete robots on the second hand market (for costs ranging from $13K-$36K).

Robot controllers may be accessible over the network in one of several ways:

• At the base level, they may just be connected to the factory LAN, which an attacker can gain access to through a variety of methods (out of scope for this paper).
• Sometimes controllers are directly accessible from the Internet – using Shodan and ZoomEye to look for string patterns in the FTP banners of top robot manufactures found several hits. (And as we’ll see, you can change the robot configuration by uploading files via that FTP service).
• Most commonly, robots embed proprietary remote access devices used by the vendor for remote monitoring and maintenance. In the industry jargon, these are known as industrial routers. We could call them ‘backdoors.’

Industrial routers provide a helpful attack surface to gain access to a robot controller. For example, an attacker could target a widespread vendor of such appliances, whose products are also resold by robotics OEMs as part of their support contracts. Among these vendors, eWON is quite representative. A simple Shodan query for the default banner of the embedded web server (Server: eWON) yielded 1,044 results, without accounting for customized banners. The web-based configuration console is easily “fingerprintable,” and attacker could exploit vulnerabilities or misconfigurations in the router

### Threat scenarios

Given the nature of the tasks that industrial robots undertake, they require accuracy, safety, and integrity.

• A robot should read precise values from sensors and should issue correct and accurate commands to the actuators so that movements are performed within acceptable error margins. “A violation of this requirement could translate into small defects in the outcome. For example, if a robot is used for welding, a minimal change in how the weld is carried out could structurally undermine the workpiece, which in the case of a car body could possibly mean tragic consequences for the end user safety.
• Safety requirements are governed by ISO standards, and require that operators always be able to take safe and informed decisions, and engage emergency procedures when needed.
• A robot controller also needs to minimise the risk of badly written control logic damaging its parts.

At a high level attackers can attempt to violate these requirements to a variety of ends. For example, injecting faults and micro-defects in production, which could lead to immediate or delayed financial loss, damaged reputation, and possibly even fatalities depending on what is being manufactured. An attacker could also damage machinery or cause injuries to factory workers by disabling or altering safety devices. Then there are ‘denial of production’ attacks which introduce downtime: “the vice president of product development at FANUC stated that ‘unplanned downtime can cost as much as $20,000 potential profit loss per minute, and$latex 2 million for a single incident.” (That’s a whole new kind of ransomware waiting to happen right there). Finally it might just be good old industrial espionage – leaking sensitive data about production schedules and volumes, source code etc..

### Attacks

To make things a bit more concrete, here are five attack vectors.

1. Configuration tampering. An attacker able to access a configuration file can modify parameters affecting robot movements. Closed loop control systems make a controlled variable follow a reference signal as closely as possible. The parameters affect how well they are able to do this. Sub-optimal tuning can result in the robot being slow to reach the desired position, violating accuracy requirements (think poor welds, milling too much material away). Bad parameter settings can read to controller instability, overshooting of set points, and violation of safety properties and ultimately potential for physical damage. There are also open loop control systems designed to smooth the signals generated by closed loop controls. Fiddling with their settings can amplify resonance effects leading to violations of integrity requirements and safety boundaries. The configuration also contains core settings specific to the type of manipulator (controller models are often shared across different robots). Change these and you can alter the amount of force applied (to exceed safety limits), or simply destroy either the workpiece or its surrounding environment. Finally, several kinds of safety limits are also specified in the configuration file, allowing them to be changed at will by the attacker.
2.  User-perceived robot state alteration. “The impact of a mere UI-modification attack is remarkable. Altering the UI could hide or change the true robot status, fooling operators into a wrong evaluation of risk, and , consequently, creating a substantial safety hazard.”
3.  Robot state alteration. Beyond altering the perceived state as above, an attacker can also alter the actual state of the robot. For example, silently switching between manual mode (which gives a set of safeguards for teaching the robot) and automatic mode when humans should not be nearby. Furthermore, “Some vendors implement safety features, such as emergency stop buttons in software. Worse modern teach pendants (which include an e-stop button) are wireless…”
4.  Production logic tampering. If the end to end integrity of the file defining the task the robot is to carry out is not protected, then it is possible for an attacker to arbitrarily alter production logic. “For example, the attacker could insert small defects, trojanize the workpiece, or fully compromise the manufacturing process.”
5.  Calibration parameters tampering. Calibration data is passed to the controller during system boot, but can also be manipulated later on. This can cause a servo motor to move erratically or unexpectedly.

### A case study

As an example, the authors evaluate one specific system and demonstrate attacks against. I have a suspicion they could have picked many other targets and obtained very similar findings. You can find the full details in section VI of the paper. I’m just going to highlight here some of the more jaw-dropping discoveries.

The robot uses FTP to share files and system information between the robot and the internal network and custom services. There is also a ‘RobAPI’ (TCP protocol) that offers a more comprehensive network attack surface, and a UDP-based discovery service to help discover robot controllers on the network (how thoughtful!). There is an optional User Authentication System, which when enable uses a simple role-based access control scheme.

#### The authentication is useless

Firstly, it’s disabled during system boot, when a set of hard-coded default static credentials can be used to access the shared file system. Secondly, there is a default user, without a password, than cannot be changed or removed. Then there is a another specific user that can ‘only’ access FTP paths relating to the`/command` device driver (see the next section), whose credentials are embedded in the firmware and cannot be changed (the same for every instance of course).

If using any of these sets of credentials is too much hassle, it’s also possible to completely bypass authentication via both the FTP and RobAPI access due to an implementation flaw.

#### The controller can be completely reconfigured via FTP upload

Any file written over FTP to the path `/command/command` or `/command/command.cmd` is interpreted as a script executed during booting. The remote service box for example using this functionality to automatically configure itself. If you include the command `shell uas_disable` in this file the user authentication system will be disabled.

#### The “cryptography” is a joke

encryption schemes are used to safeguard the integrity of some specific and safety-critical configuration files, such as the ones containing sensitive control loop parameters. We found such schemes to be weak obfuscation/integrity mechanisms rather than proper encryption: keys are derived from the file name and, in some cases, part of the file content. By reverse engineering the controller firmware, we found all the information needed to reconstruct the encryption keys: An attacker who can access a firmware update, or who has file system access, is able to read and modify safety- and accuracy-critical configuration files.

Here’s an example, the UAS configuration including the plaintext passwords of all users) is stored in an XML file obfuscated through a bit-wise XOR operation with a random key. In case you need it, the key is stored at the beginning of the obfuscated file itself. To quote the authors, “the obfuscation is completely useless.

#### There is no protection against memory errors

… it is far easier to exploit memory corruption vulnerabilities in the robot we analyzed than in mainstream OSs. Industrial robots, like many embedded systems, employ real-time operating systems (RTOSs) with poor or no hardening features.

Given that there is also no privilege separation between processes or between user space and kernel space, exploiting memory corruption is trivial. And did the authors find such exploitable memory errors in the code? Of course they did!

The fact that we found textbook vulnerabilities in one of the most widely deployed robot controllers is an indicator that not even basic static checking was in place.

#### Nothing is signed

The boot image that the system downloads is not signed, and can be easily modified by an attacker who can reverse engineer the file format (hint: it’s a zip-compressed binary). There are no code-signing mechanisms for vendor firmware, program task code, or custom software developed on top of the provided SDKs.

#### Security checks aren’t checked

The compliance tool that checks code doesn’t perform a set of restricted operations such as use of reflection and raw access to the filesystem only makes checks within the development environment. And even then the checks it does are too simplistic (e.g., an incomplete blacklist for file operations, that still permits full unrestricted access to the shared file system). But it gets worse:

Alas, the version of the compliance tool provided with the SDK does not enforce these restrictions at all. It is also evident that an attacker could bypass this by simply modifying the compliance tool itself: there is no way to perform these security checks on the programmer’s side in a completely safe manner.

#### Everything is out of date

For backwards compatibility reasons, the ‘FlexPendant’ (human interface) “will continue running an old, unsupported, and potentially vulnerable version of the .NET framework.”

#### And so…

It should come as no surprise to you that the authors were able to successfully demonstrate attacks taking complete control over the controller.

### Cyber security and safety standards

The authors give a set of guidelines for hardening industrial robot systems which you can find in section VII. Unfortunately, it looks like a long road ahead.

…none of the [existing industrial robots standards] account for cyber-security threats: although some of them have mild security implications, they do not explicitly account for adversarial control during risk assessment.

You may think some of the attacks described, such as manipulating welds so that they are deliberately weak, sound a little far-fetched. But I have reasons to believe that similar attacks have previously been demonstrated in the wild.

Hijacking Bitcoin: routing attacks on cryptocurrencies Apostolaki et al., IEEE Security and Privacy 2017

The Bitcoin network has more than 6,000 nodes, responsible for up to 300,000 daily transactions and 16 million bitcoins valued at roughly \$17B.

Given the amount of money at stake, Bitcoin is an obvious target for attackers.

This paper introduces a new class of routing attacks on the network. These aren’t supposed to be feasible since Bitcoin is a vast peer-to-peer network using random flooding. However, look a little closer and you’ll find:

1. The Internet infrastructure itself is vulnerable to routing manipulation (BGP hijacks), and
2. Bitcoin is really quite centralised when viewed from a routing perspective.

Suppose for a minute it was practical to intercept or otherwise manipulate connections (you’ll soon see that it is):

As Bitcoin transactions are routed over the Internet – in clear text and without integrity checks – any third-party on the forwarding path can eavesdrop, drop, modify, inject, or delay Bitcoin messages such as blocks or transactions.

Seventeen billion dollars, clear text, no integrity checks!

It turns out that if you can hijack less than a hundred BGP prefixes (feasible) you can isolate about 50% of the mining power in the network. Once a collection of nodes are partitioned from the network the network becomes more vulnerable to double spending attacks, transaction filtering, and selfish mining attacks.

Nodes representing merchants, exchanges, and other large entities are thus unable to secure their transactions, or may not be able to broadcast them to the the network to begin with. The resulting longer-term loss trust in Bitcoin security may trigger a loss of value for Bitcoin. Attackers may even short Bitcoin and gain from the resulting devaluation.

The authors also demonstrate delay attacks which are effective against individual targets, but not against the network as a whole as the partitioning attacks are.

### The core building block: BGP hijacking

BGP (Border Gateway Protocol) is the routing protocol that controls how packets are forwarded in the Internet. Routes are associated with IP prefixes, and are exchanged between neighbouring networks (Autonomous Systems, AS). The origin AS makes the original route announcement, and this then propagates through the network hop by hop.

In BGP, the validity of route announcements is not checked. In effect, this means that any AS can inject forged information on how to reach one or more IP prefixes, leading other ASes to send traffic to the wrong location. These rogue advertisements, known as BGP “hijacks”, are a very effective way for an attacker to intercept traffic en route to a legitimate destination.

If you want to hijack the traffic for a particular prefix, you can either advertise the same prefix (in which case you will attract traffic that is ‘closer’ to you than the genuine owner), or you can advertise a more specific (longer) prefix (up to /24) in which case longest match (i.e., yours) always wins. By leaving at least one path from the attacker to the destination untouched, a BGP hijack can be turned into an interception.

Is it really that simple to hijack Internet traffic??? I mean, does this really happen in practice? Sure it does! As evidence, the authors analysed six months of BGP updates (that’s 4 billion of them) to look for hijacks.

We see that there are hundreds of thousands of hijack events each month. While most of these hijacks involve a single IP prefix, large hijacks involving between 300 and 30,000 prefixes are also seen every month.

That’s all pretty alarming in general, but here we’re just focused on the Bitcoin specific nodes. Each month, at least 100 Bitcoin nodes are victims of hijacks. In November 2015 as an example, 7.8% of the network nodes were hijacked.

### The structure of the Bitcoin network

The vulnerability of the Bitcoin network overall to routing attacks depends on the routing characteristics of the Bitcoin network itself. The conduct a study (details in section VI) to uncover the Bitcoin network topology. The key findings are as follows:

• Unsurprisingly, there seems to be some kind of power law at play whereby a few ASes host most of the Bitcoin nodes. E.g., only 13 ASes hosts 30% of the entire Bitcoin network.
• Just a few ASes naturally intercept the majority of Bitcoin traffic. Hurricane Electric, Level3, and Telianet for example, can together intercept more than 60% of all possible Bitcoin connections.

• Over 90% of Bitcoin nodes are vulnerable to BGP hijacks. 93% of them have prefixes shorter than /24!
• Mining pools are multi-homed and all use at least two ASes to connect to the Bitcoin network. Multi-homing is one of the main protections against routing attacks.
• Bitcoin routing protocols are stable over time – the same IPs are present on average for 15.2 consecutive days.

### Partitioning attacks

The goal of a partitioning attack is to isolate some set of nodes P from the rest of the network, effectively partitioning the Bitcoin network into two disjoint components. It may be that not all of the nodes in P can be isolated (due to ‘leakage points’ from connections in P that can’t be intercepted – e.g., a peering agreement between mining pools using private connections). So ultimately we try to find the maximal subset of P that can be partitioned off.

The attacker first diverts the traffic destined to nodes in P by hijacking the most-specific prefixes hosting each of the IP address. Once on-path, the attacker intercepts the Bitcoin traffic (e.g., based on the TCP ports) and identifies whether the corresponding connections cross the partition she tries to create. If so, the attacker drops the packets. If not, meaning the connections are contained within P, she monitors the exchanged Bitcoin messages so as to detect “leakage points”. Leakage points are nodes currently within P, which maintain connections with nodes outside of P that the attacker cannot intercept…

These leakage points are dropped from P until the maximal set of nodes that can be isolated is left.

The authors demonstrate the practicality of the attack by hijacking their own nodes:

First, we performed a real BGP hijack against our Bitcoin nodes and show that it takes less than 2 minutes for an attacker to divert Bitcoin traffic. (Emphasis mine)

Second, we estimated the number of prefixes to hijack so as to isolate nodes with a given amount of mining power. We found that hijacking only 39 prefixes is enough to isolate a specific set of nodes which accounts for almost 50% of the overall mining power. (Emphasis mine)

### Delay attacks

Delay attacks slow down the propagation of new blocks sent to a set of Bitcoin nodes, without disrupting their connections. The attack can be targeted at selected nodes, or in theory conducted network wide. The authors’ analysis though shows that network wide attacks are not really feasible. To carry out a delay attack, the attacker first intercepts traffic using BGP hijacking with a forwarding route left open, and then tampers with the incoming and/or outgoing protocol messages. By tampering with outgoing GETDATA requests, a requesting node will never receive the blocks it is asking for. By tampering with incoming BLOCK messages the content can be corrupted so that the victim considers it invalid. Each of these actions triggers a 20 minute timeout.

We verified the practicality of delay attacks by implementing an interception software which we used against our own Bitcoin nodes. We show that intercepting 50% of a node’s connections is enough to keep the node uniformed for 63% of its uptime.

For 68% of the nodes in the Bitcoin network, there is at least one AS other than their provider that intercepts more than 50% of their connections.

### Countermeasures

The authors recommend the following short-term measures:

• Increase the diversity of node connections, for example by ensuring that all Bitcoin nodes are multi-homed.
• Select Bitcoin peers in a route-aware way, adding extra random connections if the same AS appears in all paths.
• Monitor RTT to detect sudden changes (increases) and establish extra random connections as a protection mechanism.
• Monitor additional statistics as early warning signals of an attack: distribution of connections, request-response intervals, simultaneous disconnection of peers, and so on.
• Embrace the natural churn of the network to continually refresh connections
• Use gateways (for mining pools) in different ASes
• Prefer peers hosted in the same AS, and in /24 prefixes

Longer term, Bitcoin can encrypt connections and/or use a Message Authentication Code to detect tampering. The authors also recommend separating control and data channels, negotiating a set of random ports on connection that will be used to exchange Bitcoin data. This forces an AS-level adversary to look at all traffic, not just the port 8333, which would be very costly. Other suggestions include adding UDP heartbeats to detect interception, and ensuring that any block requests are made on multiple connections in parallel. Ph

### The last word

Our work underscores the importance of proposed modifications which argue for encrypting Bitcoin traffic or traffic exchanged among miners. Yet, we stress that not all routing attacks will be solved by such measures since attackers can still disrupt connectivity and isolate nodes by dropping Bitcoin packets instead of modifying them.

SoK: Cryptographically proctected database search Fuller et al., IEEE Security and Privacy 2017

This is a survey paper (Systematization of Knowledge, SoK) reviewing the current state of protected database search (encrypted databases). As such, it packs a lot of information into a relatively small space.

As we’ve seen before, there are a wide-variety of cryptographic techniques in play (LIST), and we’re seeing the increasing emergence of commercial solutions:

In the commercial space, a number of established and startup companies offer products with protected search functionality, including Bitglass, Ciphercloud, CipherQuery, Crypteron, IQrypt, Kryptnostic, Google’s Encrypted BigQuery, Microsoft’s SQL Server 2016, Azure SQL Database, PreVeil, Skyhigh, StealthMine, and ZeroDB. While not all commercial systems have undergone thorough security analysis, their core ideas come from published work. For this reason, this paper focuses on systems with a published description.

I’m going to start by jumping to the heart of the paper (for me at least), which looks at the different mechanisms for protected base queries, and their strengths and weaknesses

### Approaches to protected search

The section provides systematic reviews of the different cryptographic approaches used across query types and an evaluation of known attacks against them. Due to length limitations, we focus on the Pareto frontier of schemes providing the currently best available combinations of functionality, performance, and security.

To give you an idea where we’re heading, here’s the big summary table, with schemes grouped by query type (equality, boolean, range, or other).

(Click for larger view)

After query type, the schemes are broken down by the primary approach that they take to protection: legacy, custom, oblivious, or other.

#### Legacy compliant schemes

Legacy schemes insert ciphertexts into a traditional database, without changing the indexing and querying mechanisms. Such schemes are based on property-preserving encryption which produces ciphertexts preserving some property (e.g., equality and/or ordering) of the underlying plaintexts. “As a result, Legacy schemes immediately inherit decades of advances and optimizations in database management systems.” As we saw a couple of weeks ago though, in practice implementations based on existing database engines may leak a lot more than the cryptographic theory taken on its own would lead you to believe.

Within the legacy bucket, deterministic encryption (DET) preserves only equality but applying a randomized but fixed permutation to all messages. Even without a querier making any queries though, the server learns everything about equalities between data items. Order-preserving encryption (OPE) preserves the relative order of the plaintexts, so that range queries can be performed over ciphertexts:

This approach requires no changes to a traditional database, but comes at the cost of quite significant leakage: roughly, in addition to revealing the order of data items, it also leaks the upper half of the bits of each message.

Mutable OPE only reveals the order of ciphertexts at the expense of added interactivity during insertion and query execution. Legacy approaches can be extended to perform boolean queries and joins by combining the results of equality or range queries.

Even the strongest legacy schemes reveal substantial information about queries and data to a dedicated attacker.

#### Custom schemes

Custom schemes include inverted indices and tree traversal. Inverted index schemes traditionally supported equality searches on single table databases via a reverse lookup mapping each keyword to a list of identifiers for database records containing the keywords. OSPIR-OXT extends this to support boolean queries using a second index. Other inverted index schemes have been introduced to support selection, projection, and Cartesian product at the expense of additional leakage.

Cash and Tessaro demonstrate that secure inverted indices must necessarily be slower than their insecure counterparts, requiring extra storage space, several non-local read requests, or large overall information transfer.

Tree traversal schemes execute queries by traversing a tree and returning the leaf nodes at which the query terminates (sort of). The main challenge is to hide the traversal pattern through the tree. Tree-based schemes have also been developed that can support range queries.

#### Oblivious schemes

Oblivious schemes aim to hide common results between queries. Oblivious RAM (ORAM) has been researched for decades, and the performance of ORAM schemes has steadily increased, with the Path ORAM scheme being a current favourite.

Applying ORAM techniques to protected search is still challenging.

Of interest here is 3PC-ORAM which shows that by adding a second non-colluding server you can build an ORAM scheme which is much simpler than previous constructions.

#### Other schemes

There have been several custom schemes supporting other query types such as ranking boolean query results, calculating inner products of fixed vectors, and computing the shortest distance on a graph.

These schemes mostly work by building encrypted indices out of specialized data structures for performing the specific query computation.

#### Oh, you want to update the data as well???

Not all secure schemes support updating the data, especially those in the custom and oblivious categories. Most legacy schemes allow immediate queries after insertion. For custom schemes that do support updates, new values are typically inserted into a less efficient side index. Periodically a refresh operation updates the main database structures to include these updates – this is expensive though so it can’t be done too frequently.

### Attacks

Now that we’ve got a basic map of the territory, let’s take a look at the leakage inference attacks the different schemes are vulnerable too. There’s another big table for those too:

(Click for larger view).

From the first ‘schemes’ table that we looked at, take the ‘S leakage’ (server leakage) column and then look up the values there in this table to find out which schemes are vulnerable to which attacks.

We stress that leakage inference is a new and rapidly evolving field. As a consequence, the attacks in Table III only cover a subset of leakage profiles included in Table II. Additionally, this section merely provides lower bounds on the impact of leakage because attacks only improve over time.

Attacks are divided into two broad groups depending on whether the attacker is trying to recover data stored at the server (data recovery), or to learn the set of queries asked by a user (query recovery).

The required leakage columns tell you what kind of information leakage the attack depends on. ‘Init’ here means that some data is leaked during the initial loading of the data, without any queries being made, and ‘Query’ refers to information leakage during querying.

Examples include the cardinality of a response set, the ordering of records in the database, and identifiers of the returned records. Some attacks require leakage on the entire dataset while others only require leakage on query responses

All current attacks require some prior knowledge – information about the stored data or the queries made. The prior knowledge column indicates how much an attacker needs to know. Some attacks also require the attacker to be able to insert records.

The attack efficacy columns give an idea of runtime required, amount of prior knowledge, and how large a keyword universe the different attacks can deal with.

Most of the attacks rely on the following two facts: 1) different keywords are associated with different numbers of records, and 2) most systems reveal keyword identifiers for a record either at rest (e.g, DET) or when it is returned across multiple queries.

The assumed attacker models for inference attacks are also fairly weak: there is the semi-honest attacker who follows protocols but tries to learn as much additional information as they can from what they observe, and an attacker capable of data injection (i.e., they can take some action that ultimately ends up in a record being updated in the database).

The authors identify the follow open problem, which seems to be saying “it’s an open problem to find real world use-cases where the known leakage attacks don’t matter!”

… it is important to understand what these leakage attacks mean in real-world application scenarios. Specifically, is it possible to identify appropriate real-world use-cases where the known leakage attacks do not disclose too much information? Understanding this will enable solutions that better span the security, performance, and functionality tradeoff space.

### Full database solutions

The following table summaries protected search systems that have made the transition to full database solutions.

• CryptDB enables most DBMS functionality with a performance overhead of under 30%.
• Arx is built on top of MongoDB and reports a performance overhead of approximately 10%.
• BlindSeer reports slowdowns of between 20% and 300% for most queries
• OSPIR-EXT can occasionally beat a MySQL system with a cold cache (a somewhat strange comparison!), but are an order of magnitude slower than MySQL with a warm cache.
• SisoSPIR reports a 500% slowdown compared to a baseline MySQL system on keyword equality and range queries.

The systems offer different performance/leakage trade-offs. Considering just those based on relational algebra for the moment:

CryptDB is the fastest and easiest to deploy. However, once a column is used in a query, CryptDB reveals statistics about the entire dataset’s value on this column. The security impact of this leakage should be evaluated for each application. Blind Seer and OSPIR-OXT also leak information to the server but primarily on data returned by the query. Thus, they are appropriate in settings where a small fraction of the database is queried. Finally, SisoSPIR is appropriate if a large fraction of the data is regularly queried. However, SisoSPIR does not support Boolean queries, which is limiting in practice.

It is an open problem to construct protected search systems that leverage distributed server architectures.

### In conclusion

There’s a lot of information in this paper that I didn’t have space to cover, but the overall message is that the market is early and adopters need to be careful to understand what guarantees they’re really getting and if those are sufficient for their use cases. I’ll leave you with this thought from the author’s own conclusions which I rather like, as it speaks to the value of not knowing something:

Governments and companies are finding value in lacking access to individual’s data. Proactively protecting data mitigates the (ever-increasing) risk of server compromise, reduces the insider threat, can be marketed as a feature, and frees developer’s time to work on other aspects of products and services.

I’m not quite so sure about how much time it frees up, but I agree with the other points.

Cloak and dagger: from two permissions to complete control of the UI feedback loop Fratantonio et al., IEEE Security and Privacy 2017

If you’re using Android, then ‘cloak and dagger’ is going to make for scary reading. It’s a perfect storm of an almost undetectable attack that can capture passwords, pins, and ultimately obtain all permissions (aka, a ‘god app’), that can leave behind almost no trace, and that may not require the user to view and (knowingly) accept even a single permissions dialog. Oh, and there’s no easy fix: “we responsibly disclosed our findings to Google; unfortunately, since these problems are related to design issues, these vulnerabilities are still unaddressed.

At the core of the attack are two permissions: `SYSTEM_ALERT_WINDOW` and `BIND_ACCESSIBILITY_SERVICE`. If an app can gain both of these permissions, it’s game on:

… we demonstrate the devastating capabilities the “cloak and dagger” attacks offer an adversary by showing how to obtain almost complete control over the victim’s device. In this paper, we will demonstrate how to quietly mount practical, context-aware clickjacking attacks, perform (unconstrained) keystroke recording, steal user’s credentials, security PINs, and two factor authentication tokens, and silently install a God-mode app with all permissions enabled.

The `SYSTEM_ALERT_WINDOW` permission lets an app draw overlays on top of other apps, and is widely used. For example, the Facebook, Twitter, LastPass, and Skype apps all use it, as do about 10% of the top apps on Google Play Store. Fortunately for our attacker, the permission is automatically granted to any app in the official Play Store, and by targeting SDK level 23 or higher, the user will not be shown the list of required permissions at installation time (the idea is to defer asking until runtime when the permission is actually required for the first time). So that’s one permission down, one to go – and the user has no way of knowing.

We note that this behaviour seems to appear to be a deliberate decision by Google, and not an oversight. To the best of our understanding, Google’s rationale behind this decision is that an explicit security prompt would interfere too much with the user experience, especially because it is requested by apps used by hundreds of millions of users.

The `BIND_ACCESSIBILITY_SERVICE` permission grants an app the ability to discover UI widgets displayed on the screen, query the content of those widgets, and interact with them programmatically. It is intended to help make Android devices more accessible to users with disabilities. Because of the security implications, this permission needs to be manually enabled through a dedicated menu in the settings app, it can be done in just three clicks, but they have to be the right three clicks of course. If you have the `SYSTEM_ALERT_WINDOW` permission though, getting the user to make those three clicks is fairly easy.

The example attack app built by the authors presents a simple tutorial with three buttons: “Start Tutorial”, then “Next” and a final “OK” button. When the OK button is clicked, the app shows a short video that lasts about 20 seconds. Except things are not what they seem. The tutorial app UI is actually an on-top opaque overlay and the “Start Tutorial” button click actually clicks on the accessibility service entry, the “Next” button click actually clicks on the ‘on/off’ switch, while the “OK” button click is clicking on the real OK button from the Settings app (the overlay UI has a carefully crafted hole at this exact point). The `SYSTEM_ALERT_WINDOW` settings are such that if a click is sent through to the underlying window from an overlay, then the overlay is not told the whereabouts of that click. Traditionally that makes click-jacking attacks difficult because you don’t know where the user clicked – the event reports coordinates (0,0) – and hence you don’t know when to update your overlay (fake) UI in response to a click so that the user remains unsuspecting.

Our technique works by creating a full screen opaque overlay that catches all a user’s clicks, except for the clicks in a very specific position on the screen (the point where the attacker wants the user to click). This is achieved by actually creating multiple overlays so as to form a hole: the overlays around the hole capture all the clicks, while the overlay on the hole is configured to let clicks pass through… since there is only one hole, there is only one way for a user’s click to not reach the malicious app.

In other words, if you see an event with coordinates (0,0), you know that the user clicked in the hole!

There are four fundamental attack primitives which can be combined in many creative ways, once the two permissions are obtained:

• You can modify what the user sees (as in the click-jacking example above)
• You can know what is currently displayed
• You can inject user input
• You can know what the user inputs and when

Here are 12 example attacks:

1. Context-aware clickjacking – we just saw this in action to fool the user into granting the `BIND_ACCESSIBILITY_SERVICE` permission, overlays are used to lure the user into clicking on target parts of the screen.
2. Context-hiding – using overlays that cover the full screen apart from carefully designed ‘holes’, so that the user is completely fooled concerning the true context of the button or other UI element that they click on.
3. Keystroke inference – it turns out that even an app that only has the `SYSTEM_ALERT_WINDOW` permission can learn what keys you touch on the onscreen keyboard, including any private messages, passwords etc.. This works by creating several small transparent overlays, one on top of each key on the keyboard. The overlays don’t intercept any clicks, so these go to the underlying keyboard – however, they do receive MotionEvents for touches outside of their area. The `FLAG_WINDOW_IS_OBSCURED` flag is set to true if a click event passed through a different overlay before reaching its final destination. Combine this with the MotionEvents, and a careful stacking of the overlays so that each has a different Z-level, and you can tell based on obscured flag which key was actually pressed.
4. Keyboard app hijacking – normally when an app attempts to get the text of a password field, the getText method just returns an empty string, but with the `BIND_ACCESSIBILITY_SERVICE` permission the keyboard app itself is treated as a normal unprivileged app, and each of the key widgets generates accessibility events through which all keystrokes can be recorded, including passwords.
5. Targeted password stealing Use the accessibility service to detect that the user just launched e.g. the Facebook app login activity. Now draw a visible overlay that looks like the username and password EditText widgets, and an overlay on top of the login button. When the user clicks on the login button overlay, record the credentials and then pass them through to the real Facebook app. At the end of the attack, the user is logged into their real Facebook account and is completely unsuspecting.
6. Security PIN stealing – if you use a security PIN, the security screen pad generates accessibility events which an accessibility app can see even when the screen is locked.
7. Phone screen unlocking – accessibility apps can also inject events while the phone is locked. If a secure lock screen is not used, an accessibility app can inject events to enter the PIN, unlock the phone, and then go on to do whatever it wants. All this can be done while the phone screen remains off. “It is simple to imagine a malware that would unlock the phone during the night, and it would go on the web and click ads…” If you use a swipe pattern though, there is no way for an accessibility app to inject swipes.
8. Silent app installation the initial malicious app with only the two permission can contain within itself an APK for another app which requests all permissions. It is possible to install this app while stealthily covering the screen with an on-top overlay. Remember the 20 second video that played as part of our tutorial earlier? That’s the overlay which gives the attacking app 20 seconds to go through the god-app installation process behind it, completely unbeknownst to the user. The installed app can be configured to not appear in the launcher, and to have device admin privileges so that its uninstall button is disabled. Disguising it as a legitimate system app (e.g.. “Google Parental Control”) should complete the deception. The full power is this is best understood when you see it in action:
9. Enabling all permissions – with the app installed as in the previous example, it is possible to automatically click and enable all the requested permissions while a full screen overlay is on top.
10. 2FA token stealer – “for SMS-based tokens, the accessibility service can be configured to be notified for an new notification, which contains the required token.” What if you’re using an authenticator app though? It turns out a malicious app can just open the activity displaying the tokens and read them off the screen – none of the views containing the tokens are protected in e.g., the Google Authenticator App. Why not combine this with #7 and grab a bunch of tokens at night with the screen off?
11. Ad hijacking – an accessibility app can figure out where and when ads are shown in an app, and draw an invisible on top overlay that sends clicks to an ad of the attackers choice, generating revenue.
12. Browse the web – the accessibility service has full access to the browser, so you can make Chrome visit any page, click on ‘likes,’ post content on behalf of the user, and so on. “Moreover, since all HTML elements are nicely exposed, we believe it would be simple to extend the password stealer attacks we described earlier to web-based forms.

A user study with 20 users showed that none of the users suspected anything at all while interacting with the tutorial in the app, and after those three clicks their devices were fully compromised and a God-mode app installed.

None of the users was able to even suspect that they were under attack. What is more worrisome is that none of the users actually managed to understand what happened even after we told them the app they played with was malicious and even after we gave complete and free access to the compromised device. We also stress that the subjects could not detect anything even when they had a chance to directly compare what happened against a “benign” baseline: In a real-world scenario, there would not be any baseline to use as reference.

Section IX in the paper outlines some possible defences that Google could implement to prevent such ‘cloak and dagger’ attacks, which centre around a set of system and API modifications. These would clearly take some time to roll out. In the short term, not automatically granting `SYSTEM_ALERT_WINDOW` permissions to Play Store apps, and more closely scrutinising apps that request both permissions (the authors had such a test app easily accepted) would both help.

We responsibly disclosed all our findings to the Android security team, which promptly acknowledged the security issues we reported. When available, the fixes will be distributed through an over-the-air (OTA) update as part of the monthly Android Security Bulletins. Unfortunately, even if [though?] we reported these problems several months ago, these vulnerabilities are still unpatched and potentially exploitable. This is due to the fact that the majority of presented attacks are possible due to inherent design issues outlined in Section V. Thus, it is challenging to develop and deploy security patches as they would require modifications of several core Android components.

If you want to see videos of some of the attacks in action, as well as a table indicating which versions of Android are affected (as of time of writing, the word ‘vulnerable’ appears in every cell!), then check out http://cloak-and-dagger.org/.

tags: ,

IoT goes nuclear: creating a ZigBee chain reaction Ronen et al., IEEE Security and Privacy 2017

You probably don’t need another reminder about the woeful state of security in IoT, but today’s paper choice may well give you further pause for thought about the implications. The opening paragraph sounds like something out of science fiction – except that it’s a demonstrated reality today:

Within the next few years, billions of IoT devices will densely populate our cities. In this paper, we describe a new type of threat in which adjacent IoT devices will infect each other with a worm that will rapidly spread over large areas, provided that the density of compatible IoT devices exceeds a certain critical mass.

The ZigBee protocol is the radio link of choice for many IoT devices since it is simple and widely available at low cost and with low power consumption. It has much lower bandwidth than WiFi, but that doesn’t matter for many IoT applications. The popular Philips Hue smart lamps use ZigBee for example. Suppose you could build a worm that jumps directly from one lamp to another using their ZigBee wireless connectivity and their physical proximity. If the install base of lamps in a city is sufficiently dense, you could take them all over in no time, with the worm spreading like a physical virus. The authors of today’s paper demonstrate exactly how to build such a worm:

… we developed and verified such an infection using the popular Philips Hue smart lamps as a platform… The attack can start by plugging in a single infected bulb anywhere in the city, and then catastrophically spread everywhere within minutes.

If plugging in an infected bulb is too much hassle, the authors also demonstrate how to take over bulbs by war-driving around in a car, or by war-flying a drone. (I get the impression they had fun with this one!). Based on percolation theory, we can estimate that if a city such as Paris has around 15,000 or more randomly located smart lamps then the attack will spread everywhere within the city.

What we demonstrate in this paper is that even IoT devices made by big companies with deep knowledge of security, which are protected by industry-standard cryptographic techniques, can be misused by hackers can rapidly cause city-wide disruptions which are very difficult to stop and investigate.

The whole paper is a wonderful tour de force that combines two main elements:

1. A Correlation Power Analysis (CPA) attack against the CCM encryption mode used to encrypt and verify firmware updates, allowing the authors to encrypt, sign, and upload malicious OTA updates to infect lamps.
2. A takeover attack that allows the authors to take control over lamps from long distances without using custom hardware.

Let’s look at these two components in turn, and then revisit the analysis that shows how easily such a worm can spread through a city given a relatively low density of installed bulbs. We’ll conclude by considering some of the attacks this enables, and another appeal for better IoT security.

ZigBee provides an OTA update feature that the authors exploit. They started out by finding several different older Hue models which had previously had software updates. By plugging these in and letting them update, a recording of the update process can be made for analysis, which enabled discovery of the hex code OTA requires to be sent over the air. The team then wrote a python script to impersonate a lightbulb and ask for an OTA update using the code.

The firmware image that we recorded was not dependent on our impersonated lightbulb MAC address. This brought us to the conclusion that the software is protected by a single key that is shared at least between all the lamps from a specific model.

On this assumption of symmetric cryptography, recovering the encryption and authentication keys from any one Philips Hue lightbulb will allow a software update to performed to any other lightbuild and upload malicious code of the attacker’s choosing. (At least once the OTA firmware image structure has been reverse engineered, which the authors also did).

To get those keys, the authors developed a side-channel power analysis attack to break the Philips bootloader:

Side channel power analysis measures the power consumed by a digital device as it performs operations. With this attack it is possible to infer something about the data being processed by the device, which was first demonstrated as a method of breaking cryptographic algorithms by Kocher et al..

Section 6 of the paper provides the full details of how differential power analysis (DPA) and correlation power analysis (CPA) were used to recover the key bits. We don’t really have space to cover it here (and besides, I’m not sure I could do it justice), but it’s well worth reading if you’re interested, and a great demonstration of the weakness of ‘security by obscurity’ even when you think you’re doing everything else right.

### Reaching bulbs from a distance

At this point, if we can communicate with a bulb, we can upload our firmware to it. Communication normally requires close physical proximity though:

[The ZibBee chips in the Hue lamps] will ignore any request to reset or to change their affiliation unless it is sent from a ZigBee transmitter which is only a few centimeters away from the lamp. Even though the attacker can try to spoof such a proximity test by using very high power transmitters, the fact that the received power decreases quadratically with the distance makes such brute force attacks very hard.

Section 7 in the paper details how the authors managed to bypass the proximity check in the Atmel ZigBee chip’s Touchlink implementation. It all boils down to an assumption in the code that a message with a zero-value transaction id has already been rejected as invalid. However, this check is actually only carried out for Scan Request messages, and not for any other message in the protocol.

This means that we can send any other message assuming zero value for the Transaction and Response Ids and it will be received and processed as a valid message by the light.

So the team send a unicast ‘Reset to Factory New Request’ command to the light with a zero Transaction Id. Upon receiving the message the light undergoes a factory reset and begins scanning for a new network to join by sending ZigBee Beacon request messages. “…we respond with a ZigBee Beacon message with the Association Permit flag set to true. This cause the light to start a ZigBee association process and join our network.”

For this part of the attack:

We install 3 Philips Hue lamps in offices at the first floor of our faculty building. We successfully tested our full attack from a car which was parked across the lawn at at distance of about 50 meters… We then successfully tested our attack while ‘wardriving’ the car at the far edge of the lawn.

Then they installed 5 smart bulbs in the third floor of an office building and mounted the attack kit onto a DJI drone. Successful takeover was achieved with a fly-by (warflying) and attack code that caused all the lamps to repeatedly signal SOS in Morse code while the drone hovered in front of the building.

Building a self-spreading worm is now possible by combining the ability to sign and encrypt arbitrary binaries with our knowledge of the long-rage takeover attack vulnerability.

### How many bulbs do you need before the infection takes over a whole city?

Once you can spread from bulb to bulb, all you need to do is infect one (or a small number) of seed bulb(s) and let nature take its course. Recall from some of the network theory we’ve looked at previously that in a random graph, once a certain number of edges have been added a phase change occurs in which a single giant connected component emerges. If we model this by assume N smart lamps (nodes) placed randomly within a city, and an edge between any two lamps that are less than distance D apart, then the field of Percolation theory can tell us the critical value of N (the Percolation threshold) at which this giant connected component should emerge.

For the city of Paris, with a total area of 105 square kilometers, and assuming D = 100 meters we find that the critical mass of installed lamps in the whole city of Paris needs to be about N = 15,000.

Since the Philips Hue smart lamps are very popular in Europe and especially in affluent areas such as Paris, there is a very good chance that this threshold has in fact already been exceeded, and thus the city is already vulnerable to massive infections via the ZigBee chain reaction described in this paper.

### You took over my bulb, so what?

• A Bricking attack makes the attack irreversible such that any effect caused by the worm (blackout, constant flickering etc.) is permanent. Once the worm is in place, it can determine what further OTA updates to accept if any.
• Wireless network jamming uses the ZigBee band, which overlaps with WiFi. By going into ‘test mode’ which transmits a continuous wave signal without first checking for a clear channel, it would be possible to mount DoS attacks disrupting all WiFi or other 2.4GHz traffic.
• Data infiltration and exfiltration using Philips Hue lamps demonstrated by Ronen and Shamir at a rate of about 10KB per day. Using infected lamps similar covert channels can be created but at much higher rates.
• Let’s get a bit more serious… Epileptic seizures – it is possible to use the Philips Hue to trigger epileptic seizures, or to drive LEDs at frequencies that increase long-term discomfort in humans. Imagine this happening simultaneously across a whole city!

### Can we please take IoT security a little more seriously now?

Our attacks exploit a specific implementation bug of the Touchlink commission protocol and a specific design for the OTA update process, but they are just an example of the way security in IoT is designed today. The Atmel code we reviewed was very well written and documented, but it is extremely difficult to implement complex state machines of such protocols without any bugs. The main problem is in the insecure design of the ZLL standard itself. We believe this will not be the last bug or attack found against ZLL commissioning… The sharp contrast between the open and inclusive manner in which the TLS 1.3 standard was designed and the secretive work on the ZigBee 3.0 specification that is still not open to the public, is a big part of the problem.

The attack is also another reminder that “security by obscurity has failed time after time,” and any reuse of keys across devices is a big security risk.

We should work together to use the knowledge we gained to protect IoT devices or we might face in the near future large scale attacks that will affect every part of our lives.

tags:

The password reset MitM attack Gelernter et al., IEEE Security and Privacy 2017

The Password Reset Man-in-the-Middle (PRMitM) attack is really very simple, but that doesn’t mean it’s not dangerous. It involves persuading the user to sign-up for an account for some service under the attacker’s control (maybe there’s an enticing free download for example), and then manipulating the registration flow such that the attacker is actually able to reset the password for the user’s account on some other system. An especially good target is the user’s email account. As we know, it’s all downhill from there. In addition to explaining the attack, and demonstrating it in user experiments with Google and Facebook users, the authors also provide a set of recommendations for implementing a password reset process to minimise the risks of PRMitM attacks.

We evaluated the PRMitM attacks on Google and Facebook users in several experiments, and found that their password reset process is vulnerable to the PRMitM attack. Other websites and some popular mobile applications are vulnerable as well… Since millions of accounts are currently vulnerable to the PRMitM attack, we also present a list of recommendations for implementing and auditing the password reset process.

The findings were reported to vulnerable vendors. Snapchat and Yahoo! fixed the vulnerability, Google, LinkedIn and Yandex indicated that they plan to fix it. Other less vulnerable websites (e.g. Facebook) indicated they would consider using the findings in the future but do not plan to apply fixes anytime soon.

Sometimes users genuinely do forget their passwords, and then there needs to be a way to reset them. You can have the strongest password / authentication mechanism in the world of course, but if the password is easy to reset none of that matters. So how do you make a password reset process that’s at least as strong and secure as the regular authentication process?

If you’re using one of the ten most popular websites, here are the challenges you may be presented with during the reset process:

(The authors don’t discuss if/how this changes when you have two-factor authentication enabled – do many sites then require you to provide a backup code or recovery key for example? I didn’t have time to investigate all of the sites listed above).

CAPTCHAs are really only there to make it harder for attackers to automate going through the password reset process; they don’t provide any fundamental security. Security questions may relate to answers you provided during account creation, or to information you may know about your account (for example: which of these contacts is in your address book…). Phone codes refers to either a code sent to a registered phone number via SMS, or an automated call that may be placed to that number which reads you a code. The most common defense is sending a password reset link in an email, unfortunately this mechanism is usually not relevant for email accounts themselves.

Let’s be honest, bar the email link, none of these mechanisms are particularly great: security questions are known to be problematic, and SMS is wide open – weaknesses in the SS7 protocol let anyone redirect your phone calls and text messages:

“It’s so simple, it’s almost embarrassing to call it a hack” – So hey, you should stop using texts for two-factor authentication, Wired 2016.

The authors of this paper also conducted a short experiment to see whether users would give ‘correct’ answers to security questions for low-value websites (e.g. “What is your mother’s maiden name?”). Almost 80% of users in the study provided the correct answer to the security question, indicating the problem of relying on security questions. You could make up answers of course, but then you’ll need a way of remembering those answers when you’ve forgotten your password…

The problems with security questions and the popularity of mobile phones has made the authentication using mobile devices a preferred option for password recovery. The most common way to authenticate a user via mobile phone is by sending a code to the device. The user then has to insert the received code into the website to reset the password.

Spear-phishing attacks have been targeting the SMS codes sent during password reset for at least a couple of years (see “Password recovery scam tricks users into handing over email account access“). If that spear-phishing attack isn’t quite subtle enough to dupe enough users, and hijacking SMS via SS7 seems like a bit of a hassle, then there’s always the PRMitM attack.

### The basic PRMitM attack

Phishing attacks exploit the users; there is no bug in the design of the attacked website and the attacker exploits unwary users who ignore indications given to them by the browsers. On the other hand, PRMitM attacks exploit bugs in the design of the password-reset process.

The fundamental insight behind a PRMitM attack is that many of the challenges and steps in a typical registration process are similar to those in a password reset process. The attacker sets up a website that encourages a user to create an account. When the user registers they provide their email address, which the attacker uses to initiate a password reset process with the email provider. Every challenge sent by the email provider (e.g., security questions), is forwarded to the user and appears to be part of the registration process. The answers supplied by the user are then passed on the email provider as part of the password reset flow.

It’s easy to see how that works with security questions, but what about SMS codes? As part of the fake registration process, the attacker asks the victim for their phone number, claiming that a code will be sent to it that must be entered to complete registration. Now as part of the password reset flow initiated by the attacker, the email provider sends a code via SMS to the user’s phone. The user enters this code on the attacker’s website, and the attacker in turn supplies it to the email provider. A similar approach works if the email provider reads a code to the user via a phone call.

### Why are users fooled??

Firstly, there’s the simple fact that users are expecting a code to be sent to them as part of a registration process they initiated, so when a code arrives it is not surprising. Ideally at least 3 pieces of information should appear in the SMS message sent by a provider in addition to the code (i) the name of the sending website, (ii) an explanation that the code is for password reset, and (iii) a warning to avoid using the code anywhere but the sending website. None of the evaluated website’s messages included all of these elements:

• Yandex, Google, and Yahoo in Russian send just a code
• LinkedIn, Google and Yahoo (other languages) send website name and code, but no indication it is for password reset
• None of the tested sites include a warning about not entering the code anywhere else.

Even if the SMS did have all of these elements, the authors found that many users don’t even read the text or even open the message, but just quickly grab the code from the notification that pops up on the phone.

In a series of experiments run by the authors using a PRMitM attack against their Facebook accounts, most ‘victims’ (they weren’t actually hacked, but the researchers demonstrated they could) remained completely unaware that they had been hacked. The same thing happened with phone call based verifications when running PRMitM attacks against Google accounts.

If you look at messaging applications, you find similar vulnerabilities in their password reset mechanisms too:

### Defences and guidelines

If using security questions, if these are directly related to the actions done by the user on the site (for example, Google asks questions about contacts, labels, and so on) these can’t be forwarded to users as legitimate security questions for other sites.

Nevertheless, it is desirable to avoid relying on security questions, as they can be bypassed by attackers, especially if the attacker is related to the victim.

Better is not to send a code in clear text over SMS (and even better in my mind is not to use SMS, due to its vulnerabilities, but leaving that aside…), but instead to send a link via SMS. Asking a victim to copy and paste a link from an SMS into their website would certainly look like an unnatural act. The authors call this scheme Link-via-SMS (LVS) based password reset.

A long link is better than just a long code. The natural user interaction with links is to press on them. On the other hand, there is always a chance that a user will just copy the code without reading the message. In our implementation of the LVS, the link refers the user to an interactive page that has an alert about the attempt to reset the user password.

In tests, using the LVS approach caused all participants in the study to detect and thwart the PRMitM attack.

Websites should also always send a notification after both password reset and psasword change – via both email and SMS. Of the websites tested, only Google sends an SMS notification on password change.

In section IX of the paper you’ll find a quick checklist that can be used to audit and secure password reset procedures. If you must use SMS etc., these are worth checking out. I’ll leave you with the following reminder though:

Finally, although the recommendations of this section are given mainly in the perspective of the PRMitM attack, it is important to note that according to the NIST Digital Authentication Guideliness, due to other security problems it is not recommended to rely only on SMS or phone calls for authentication.