
Dynamic control flow in large-scale machine learning

June 7, 2018

Dynamic control flow in large-scale machine learning Yu et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).

In 2016 the Google Brain team published a paper giving an overview of TensorFlow, “TensorFlow: a system for large-scale machine learning.” This paper is a follow-up, taking a much deeper look at how TensorFlow supports dynamic control flow, including extending automatic differentiation to control flow constructs.

Embedding control flow within the dataflow graph

With a wide range of machine learning models in use, and rapid exploration of new techniques, a machine learning system needs to be expressive and flexible to support both research and production use cases. Given the ever larger models and training sets, a machine learning system also needs to be scalable. This means both using individual devices efficiently (anything from phones to custom ASICs in datacenters), and also supporting parallel execution over multiple devices.

Both the building blocks of machine learning and the architectures built up using these blocks have been changing rapidly. This pace appears likely to continue. Therefore, rather than defining RNNs, MoEs (mixture of experts), and other features as primitives of a programming model, it is attractive to be able to implement them in terms of general control-flow constructs such as conditionals and loops. Thus, we advocated that machine learning systems should provide general facilities for dynamic control flow, and we address the challenge of making them work efficiently in heterogeneous distributed systems consisting of CPUs, GPUs, and TPUs.

The demand for dynamic control flow has been rising over the last few years. Examples include while-loops used within RNNs, gating functions in mixture-of-experts models, and sampling loops within reinforcement learning.

Instead of relying on programming languages outside of the graph, TensorFlow embeds control-flow as operations inside the dataflow graph. This makes whole program optimisation easier and keeps the whole computation inside the runtime system, avoiding the need to communicate with the client (which can be costly in some deployment scenarios). The implementation supports both parallelism and asynchrony, so e.g. control-flow logic on CPUs and compute kernels on GPUs can overlap.

The main control flow operators are a conditional cond(pred, true_fn, false_fn), and a while loop while_loop(pred, body, inits). There are other higher order constructs built on top of these (for example, map_fn, foldl, foldr, and scan).
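The semantics of these two constructs can be sketched in plain Python. (This is an illustration of what they compute, not the TensorFlow API itself — in TensorFlow the same calls build operations into a dataflow graph rather than executing eagerly.)

```python
# Plain-Python sketch of the semantics of cond and while_loop.
# Illustrative only: TensorFlow's versions construct graph nodes
# instead of executing the functions immediately.

def cond(pred, true_fn, false_fn):
    """Evaluate one of two branches based on a boolean predicate."""
    return true_fn() if pred else false_fn()

def while_loop(pred, body, inits):
    """Repeatedly apply `body` to the loop variables while `pred` holds."""
    state = inits
    while pred(*state):
        state = body(*state)
    return state

# Example: sum the integers 0..9.
i, total = while_loop(lambda i, total: i < 10,
                      lambda i, total: (i + 1, total + i),
                      (0, 0))
# i ends at 10, total at 45
```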

We analyzed more than 11.7 million (!) unique graphs for machine learning jobs at Google over the past year, and found that approximately 65% contain some kind of conditional computation, and approximately 5% contain one or more loops.

Control flow in TensorFlow

The basic design of TensorFlow is as follows: a central coordinator maps nodes in the dataflow graph to the given set of devices, and then partitions the graph into a set of subgraphs, one per device. Where the partitioning causes an edge to span two devices, the edge is replaced with a pair of send and receive communication operations using a shared rendezvous key.

When dynamic control flow is added into the mix, we can no longer assume that each operation in the graph is executed exactly once, and so unique names and rendezvous keys are generated dynamically. Conditional branches and loops may be arbitrarily partitioned across devices.

We rely on a small set of flexible, expressive primitives that serve as a compilation target for high-level control-flow constructs within a dataflow model of computation.

Those primitives are switch, merge, enter, exit, and nextIteration. Every execution of an operation takes place within a ‘frame’. Without control flow, each operation is executed exactly once. With control flow, each operation executes at most once per frame. The following figure shows how a while-loop can be translated into these primitives to give you the idea:

Tensors inside executors are represented by tuples (value, isDead, tag), where isDead is a boolean indicating whether the tensor is on an untaken branch of a switch, and the tag identifies a frame. The evaluation rules are shown in the following figure:

The rules allow multiple loop iterations to run in parallel, but left unchecked this will use a lot of memory. Empirically, a limit of 32 parallel executions at a time seems to work well.

When the subgraph of a conditional branch or loop body is partitioned across devices, the partitions are allowed to make progress independently. (There is no synchronisation after each loop iteration, and no central coordinator). The receive operation of a conditional is always ready and can be started unconditionally. If the corresponding send is never executed (because the branch is not taken), the receiver would block forever waiting for input. Therefore the system propagates an isDead signal across devices from send to receive to indicate that the branch has not been taken. This propagation may continue across multiple devices as needed.
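The dead-tensor mechanics can be sketched with the switch and merge primitives introduced earlier, operating on (value, isDead) tuples. (The names and representation below are my simplification, with frame tags omitted — not TensorFlow's actual implementation.)

```python
# Illustrative sketch of the Switch and Merge dataflow primitives
# operating on (value, is_dead) tuples (frame tags omitted).

def switch(data, pred):
    """Forward `data` to the taken branch; emit a dead tensor on the other."""
    value, is_dead = data
    dead = (None, True)
    if is_dead:                 # a dead input kills both outputs
        return dead, dead
    return ((value, False), dead) if pred else (dead, (value, False))

def merge(a, b):
    """Forward whichever input is alive (at most one can be)."""
    return a if not a[1] else b

# The untaken branch of a conditional receives a dead tensor,
# and merge forwards the live result:
true_out, false_out = switch((42, False), pred=True)
result = merge(true_out, false_out)
```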

For distributed execution of loops each partition needs to know whether to proceed or exit at each iteration. To handle this the graph is rewritten using simple control-loop state machines. Here’s an example partitioning a simple while-loop. The dotted lines represent the control edges.

The overhead for the distributed execution of a loop is that every participating device needs to receive a boolean at each iteration from the device that produces the loop predicate. However, the communication is asynchronous and computation of the loop predicate can often run ahead of the rest of the computation. Given typical neural network models, this overhead is minimal and largely hidden.


TensorFlow supports automatic differentiation. That is, given a graph representing a neural network, it will generate efficient code for the corresponding distributed gradient computations. In the base case this is back-propagation using the chain rule, and TensorFlow includes a library of gradient functions corresponding to most of its primitive operations.

Tensors used in the gradient function (e.g., x and y in the above example) are kept until the gradient computation is performed. That can consume a lot of memory in deep neural networks, and it gets worse when we add loops. To support back-propagation through control flow constructs:

Each operation in the graph is associated with a ‘control flow context’ that identifies the innermost control-flow construct of which that operation is a member. When the backpropagation traversal first encounters a new control-flow context, it generates a corresponding control-flow construct in the gradient graph.

For a conditional tf.cond(pred, true_fn, false_fn) with output gradients g_z this is simply tf.cond(pred, true_fn_grad(g_z), false_fn_grad(g_z)). For while loops:

  • The gradient of a while loop is another loop that executes the gradient of the loop body for the same number of iterations as the forward loop, but in reverse.
  • The gradient of each differentiable loop variable becomes a loop variable in the gradient loop.
  • The gradient of each differentiable tensor that is constant in the loop is the sum of the gradients for that tensor at each iteration.

The overall performance is heavily dependent on how intermediate values are treated. To avoid recomputing these values they are pushed onto a stack during loop execution, and popped during gradient computation. Stack operations are asynchronous so they can run in parallel with actual computation.
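The push/pop scheme can be sketched as follows: the forward loop saves each iteration's intermediates on a stack, and the gradient loop pops them in reverse order while applying the chain rule. (Illustrative only; in TensorFlow the pushes and pops are asynchronous graph operations.)

```python
# Sketch of saving intermediate loop values for the gradient loop.
# Forward loop: repeated squaring, pushing each iteration's input.

def forward(x, n):
    stack = []
    for _ in range(n):
        stack.append(x)   # save the input of each squaring for the gradient
        x = x * x
    return x, stack

def backward(g, stack):
    # d(x*x)/dx = 2x, applied in reverse iteration order via the chain rule
    while stack:
        x = stack.pop()
        g = g * 2 * x
    return g

y, saved = forward(3.0, 2)      # (3^2)^2 = 81
grad = backward(1.0, saved)     # d/dx x^4 = 4x^3 = 108 at x = 3
```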

Memory management

Especially on GPUs, where memory is more limited, memory management is crucial. When tensors are pushed onto stacks they are moved from GPU to CPU memory. Separate GPU streams are used for compute and I/O operations to improve their overlap. Each stream is a sequence of sequentially executed GPU kernels. A combination of TensorFlow control edges and GPU hardware events are used to synchronise dependent operations executed on different streams.

Future directions

Dynamic control flow is an important part of bigger trends that we have begun to see in machine learning systems. Control-flow constructs contribute to the programmability of these systems, and enlarge the set of models that are practical to train using distributed resources. Going further, we envision that additional programming language facilities will be beneficial. For instance, these may include abstraction mechanisms and support for user-defined data structures. The resulting design and implementation challenges are starting to become clear. New compilers and runtime systems, such as XLA (Accelerated Linear Algebra), will undoubtedly play a role.

Reducing DRAM footprint with NVM in Facebook

June 6, 2018

Reducing DRAM footprint with NVM in Facebook Eisenman et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).

…to the best of our knowledge, this is the first study on the usage of NVM devices in a commercial data center environment.

We’ve been watching NVM coming for some time now, so it’s exciting to see a paper describing its adoption within Facebook. MyRocks is Facebook’s primary MySQL database, and is used to store petabytes of data and to serve real-time user activities. MyRocks uses RocksDB as the storage engine, and a typical server consumes 128GB of DRAM and 3 TB of flash. It all seems to work well, so what’s the problem? Spiralling costs!

As DRAM faces major scaling challenges, its bit supply growth rate has experienced a historic low. Together with the growing demand for DRAM, these trends have led to problems in global supply, increasing total cost of ownership (TCO) for data center providers. Over the last year, for example, the average DRAM DDR4 price has increased by 2.3x.

Just using less DRAM per server isn’t a great option as performance drops accordingly. So the big idea is to introduce non-volatile memory (NVM) to pick up some of the slack. NVM is about 10x faster than flash, but still over 100x slower than DRAM. We can make up for the reduction in performance over DRAM by being able to use much more NVM due to the lower cost. So we move from a two-layer hierarchy with DRAM cache (e.g. 96GB) and flash to a three-layer hierarchy with a smaller amount of DRAM (e.g. 16GB), a larger NVM cache layer, and then flash. As of October 23rd, 2017, 16GB of NVM could be picked up on Amazon for $39, whereas 16GB of DRAM cost $170.

We present MyNVM, a system built on top of MyRocks, that significantly reduces the DRAM cache using a second-layer NVM cache. Our contributions include several novel design choices that address the problems arising from adopting NVM in a data center setting…

The following chart shows the results achieved in production when replacing RocksDB with MyNVM as the storage engine inside MyRocks. Comparing the first and third data points, you can see that MyRocks/MyNVM has slightly higher latency (20% at P99) than MyRocks/RocksDB with 96GB storage, however it uses only 1/6th of the DRAM. MyRocks/MyNVM with 16GB of DRAM is 45% faster than MyRocks/RocksDB also with 16GB DRAM. So the NVM layer is doing a good job of reducing costs while closing the performance gap.

NVM comes in two form factors: byte-addressable DIMM, and also as a block device. Much of the prior work focuses on the DIMM use case. But Facebook use the block device form factor due to its lower cost. As we’ll see shortly, loading data in blocks has knock-on effects all the way through the design.

Tricky things about NVM

NVM is less expensive than DRAM on a per-byte basis, and is an order-of-magnitude faster than flash, which makes it attractive as a second level storage tier. However, there are several attributes of NVM that make it challenging when used as a drop-in replacement for DRAM, namely its higher latency, lower bandwidth, and endurance.

Facebook swapped 80GB of DRAM for 140GB of NVM. The higher latency of NVM is therefore compensated for by the higher hit rate from having a larger cache. However, bandwidth becomes a bottleneck. Peak read bandwidth with NVM is about 2.2GB/s, which is 35x lower than DRAM read bandwidth. This issue is made worse by the fact that Facebook chose to use NVM as a block device, so data can only be read at a granularity of 4KB pages. For small objects, this can result in large read amplification.
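A quick back-of-envelope calculation shows how severe the read amplification gets for small objects when every read must fetch at least one full 4KB page:

```python
# Read amplification on a block device: bytes actually read from the
# device divided by bytes requested. Every read fetches whole 4KB pages.

PAGE = 4096

def read_amplification(object_size, page_size=PAGE):
    pages = -(-object_size // page_size)   # ceiling division
    return pages * page_size / object_size

# A 100-byte value costs a full 4KB page: ~41x amplification.
# A 4096-byte value is read with no amplification at all.
```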

… we found the NVM’s limited read bandwidth to be the most important inhibiting factor in the adoption of NVM for key-value stores.

NVM also has endurance considerations: if cells are written to more than a certain number of times they wear out and the device lifetime is shortened. In the cache use case with frequent evictions this can easily become a problem, so NVM can’t be used as a straight drop-in replacement for DRAM without some care taken to avoid excessive writes.

Finally, given the very low latency of NVM compared to other block devices, the operating system interrupt overhead itself becomes significant (about 2µs out of a 10µs average end-to-end read latency).

The design of MyNVM

MyNVM uses NVM as a 2nd level write-through cache. The objective is to significantly reduce costs while maintaining latency and qps.

The default block size is 16KB, which means fetching at least 16KB every time we want to read an object. This ends up requiring more than double the available read bandwidth. Reducing the block size to 4KB actually makes things worse! This is due to the increased footprint of the DRAM index blocks (4x), which in turn lowers the available DRAM for caching, and hence increases the read traffic to the 2nd level NVM cache.

To address this problem, the index is itself partitioned into smaller blocks, with an additional top-level index. Only the top-level index and the relevant index partitions then need to be read and cached in DRAM for any given lookup. This brings the required read bandwidth down, but as shown in the figure above, we’re still right up against the device limits.

We can further reduce the read bandwidth requirement by carefully aligning blocks with physical pages. RocksDB data blocks are compressed by default, so that a 4KB block consumes less than that when compressed and written to NVM. Now we end up in a situation where data blocks span pages, and we may need to read two pages for a single block. MyNVM elects to use 6KB blocks, which compress on average to about 4KB. (As a nice side-effect, this also reduces the size of the index). Since the compressed blocks still don’t align perfectly with pages, MyNVM zero pads the end of a page if the next compressed block cannot fully fit into the same page. This reduces the number of blocks spread over two pages by about 5x.
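The padding policy can be sketched as a simple layout routine: compressed blocks are appended to 4KB pages, and a page is zero-padded whenever the next block would otherwise straddle a page boundary. (The sizes and function names here are illustrative, not MyNVM's implementation.)

```python
# Sketch of page-aligned block layout with zero padding.

PAGE = 4096

def lay_out(block_sizes, page_size=PAGE):
    """Return (start_offset, size) for each compressed block."""
    layout, offset = [], 0
    for size in block_sizes:
        room = page_size - (offset % page_size)
        if size > room and size <= page_size:
            offset += room          # zero-pad the rest of this page
        layout.append((offset, size))
        offset += size
    return layout

def spans_two_pages(start, size, page_size=PAGE):
    return start // page_size != (start + size - 1) // page_size

# Without padding, a 3000-byte block following a 3000-byte block would
# straddle pages; with padding it starts cleanly on the next page.
layout = lay_out([3000, 3000])
```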

The improved block alignment buys a lot more read bandwidth headroom:

It also reduces the P99 latency since we’re reading less data.

By default RocksDB applies compression using a per-block dictionary. With smaller block sizes this makes the compression less effective (11% overhead at 6KB blocks). To combat this, MyNVM uses a preloaded dictionary based on data uniformly sampled across multiple blocks. This reduces the overhead to 1.5%.

Addressing the endurance constraint

To avoid wearing out the NVM, MyNVM uses an admission control policy to only store blocks in the 2nd level cache that are not likely to be quickly evicted. An LRU list sized to match the NVM cache is kept in DRAM as a simulated cache. When a block is read from flash it is only cached in NVM if it has been recently accessed and is therefore present in the simulated cache’s LRU list. For MyNVM, using a simulated cache size of 40GB gives sufficiently accurate prediction to accommodate the endurance limitation of the device.
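The admission-control idea can be sketched in a few lines: track block ids in a DRAM-resident LRU, and admit a block to NVM only if it was accessed recently enough to still be in that LRU. (This is an illustration of the policy, not Facebook's implementation; the class and method names are my own.)

```python
# Sketch of LRU-based admission control for the NVM cache layer.

from collections import OrderedDict

class AdmissionFilter:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lru = OrderedDict()     # block id -> None, most recent last

    def should_admit(self, block_id):
        """Record the access; admit only blocks still in the simulated cache."""
        seen = block_id in self.lru
        if seen:
            self.lru.move_to_end(block_id)
        else:
            self.lru[block_id] = None
            if len(self.lru) > self.capacity:
                self.lru.popitem(last=False)   # evict least recently used
        return seen

f = AdmissionFilter(capacity=2)
f.should_admit("a")              # first access: not admitted, now tracked
admitted = f.should_admit("a")   # re-accessed while tracked: admit to NVM
```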

Interrupt latency

To lower the operating system interrupt overhead, the team explored switching to a polling model. Continuous polling quickly took CPU usage of a core to 100%. A hybrid polling strategy, sleeping for a time threshold after an I/O is issued before starting to poll, significantly reduced the CPU usage again. With 8 threads or more though, the benefits of polling diminish.

An improved polling mechanism in the kernel could remove many of these limitations. Until that is available, we decided to currently not integrate polling in our production implementation of MyNVM, but plan to incorporate it in future work.


We saw the headline production results at the top of this post. The following figures show the mean and P99 latencies achieved over a 24 hour period, with intervals of 5M queries.

And here we can see the queries per second comparison:

This is just the beginning

NVM can be utilized in many other data center use cases beyond the one described in this paper. For example, since key-value caches, such as memcached and Redis, are typically accessed over the network, their data can be stored on NVM rather than DRAM without incurring a large performance cost. Furthermore, since NVM is persistent, a node does not need to be warmed up in case it reboots. In addition, NVM can be deployed to augment DRAM in a variety of other databases.

ServiceFabric: a distributed platform for building microservices in the cloud

June 5, 2018

ServiceFabric: a distributed platform for building microservices in the cloud Kakivaya et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).

Microsoft’s Service Fabric powers many of Azure’s critical services. It’s been in development for around 15 years, in production for 10, and was made available for external use in 2015.

ServiceFabric (SF) enables application lifecycle management of scalable and reliable applications composed of microservices running at very high density on a shared pool of machines, from development to deployment to management.

Some interesting systems running on top of SF include:

  • Azure SQL DB (100K machines, 1.82M DBs containing 3.48PB of data)
  • Azure Cosmos DB (2 million cores and 100K machines)
  • Skype
  • Azure Event Hub
  • Intune
  • Azure IoT suite
  • Cortana

SF runs in multiple clusters each with 100s to many 100s of machines, totalling over 160K machines with over 2.5M cores.

Positioning & Goals

Service Fabric defies easy categorisation, but the authors describe it as “Microsoft’s platform to support microservice applications in cloud settings.” What particularly makes it stand out from the crowd is that it is built on foundations of strong consistency, and includes support for stateful services through reliable collections: reliable, persistent, efficient and transactional higher-level data structures.

Existing systems provide varying levels of support for microservices, the most prominent being Nirmata, Akka, Bluemix, Kubernetes, Mesos, and AWS Lambda [a mixed bag!]. SF is more powerful: it is the only data-aware orchestration system today for stateful microservices. In particular, our need to support state and consistency in low-level architectural components drives us to solve hard distributed computing problems related to failure detection, failover, election, consistency, scalability, and manageability. Unlike these systems, SF has no external dependencies and is a standalone framework.

Every layer in SF supports strong consistency. That doesn’t mean you can’t build weakly consistent services on top if you want to, but this is an easier challenge than building a strongly consistent service on top of inconsistent components. “Based on our use case studies, we found that a majority of teams needing SF had strong consistency requirements, e.g., Microsoft Azure DB, Microsoft Business Analytics Tools, etc., all rely on SF while executing transactions.”

High level design

SF applications are collections of independently versioned and upgradeable microservices, each of which performs a standalone function and is composed of code, configuration, and data.

SF itself is composed of multiple subsystems, with the major ones shown in the figure below.

At the core of SF is the Federation Subsytem, which handles failure detection, routing, and leader election. Built on top of the federation subsystem is the Reliability Subsystem providing replication and high availability. The meat of the paper describes these two subsystems in more detail.

Federation subsystem

The ring

At the core of the federation subsystem you’ll find a virtual ring with 2^m points, called the SF-Ring. It was internally developed at Microsoft starting in the early 2000’s, and bears similarity to Chord and Kademlia. Nodes and keys are mapped to points in the ring, with each key owned by the node closest to it (ties won by the predecessor). Each node keeps track of its immediate successor and predecessor nodes in the ring, which comprise its neighborhood set.

Routing table entries are bidirectional and symmetric. Routing partners are maintained at exponentially increasing distances in the ring, in both clockwise and anti-clockwise directions. Due to bidirectionality, most routing partners end up being symmetric. This speeds up routing, the spread of failure information, and the updating of routing tables after node churn.

When forwarding a message for a key, a node searches its routing table for the node closest to the key in either a clockwise or anti-clockwise direction. Compared to clockwise-only routing we get faster routing, more routing options in the face of stale or empty tables, better load spread across nodes, and avoidance of routing loops.
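Bidirectional greedy routing can be sketched as follows: measure distance both ways around the ring, and forward to the known node closest to the key in either direction. (Illustrative only; the constants and routing-table contents are made up, not the SF-Ring implementation.)

```python
# Sketch of bidirectional greedy routing on a ring with 2**m points.

M = 16
RING = 2 ** M

def ring_distance(a, b, ring=RING):
    """Shortest distance between two ring points, in either direction."""
    d = abs(a - b) % ring
    return min(d, ring - d)

def next_hop(routing_table, key):
    """Pick the known node closest to the key, clockwise or anti-clockwise."""
    return min(routing_table, key=lambda n: ring_distance(n, key))

# A node at point 0 with partners at exponentially increasing distances
# in both directions:
table = [1, 2, 4, 8, 16, 256, 4096, RING - 4096, RING - 256]
hop = next_hop(table, key=300)    # 256 is only 44 points away
```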

Routing tables are eventually convergent. A chatter protocol exchanges routing table information between routing partners ensuring eventual consistency for long distance neighbours.

A key result from the SF effort is that strongly consistent applications can be supported at scale by combining strong membership in the neighbourhood with weakly consistent membership across the ring. Literature often equates strongly consistent membership with virtual synchrony, but this approach has scalability limits.

Nodes in the ring own routing tokens which represent the portion of the ring whose keys they are responsible for. The SF-Ring protocol ensures that there is never any overlap between tokens (always safe), and that every token range is eventually owned by at least one node (eventually live). When a node joins, its two immediate neighbours each split their ring segment with the new node at exactly the half-way point. When a node leaves, its successor and predecessor split its range between them halfway.

As we’ll see when we look at the reliability subsystem, nodes and objects (services) are placed into the ring rather than simply relying on hashing. This enables preferential placement taking into account failure domains and load-balancing.

Consistent membership and failure detection

Membership and failure detection takes place within neighbourhood sets. There are two key design principles:

  1. Strongly consistent membership: all nodes responsible for monitoring a node X must agree on whether it is up or down. In the SF-Ring, this means that all nodes in X’s neighbourhood set must agree on its status.
  2. Decoupling failure detection from failure decision: failure detection protocols (heartbeats) detect a possible failure, a separate arbitrator group decides on what to do about that. This helps to catch and stop cascading failure detections.

A node X periodically sends lease renewal requests to each of its neighbours (its monitors). The leasing period is adjusted dynamically but is typically around 30 seconds. X must obtain acks (leases) from all of its monitors; this property is what provides the strong consistency. If X fails to obtain all of its leases, it considers removing itself from the group. If a monitor misses a lease renewal heartbeat from X, it considers marking X as failed. In both cases, the evidence is submitted to the arbitrator group.

The arbitrator acts as a referee for failure detections and for detection conflicts. For speed and fault-tolerance, the arbitrator is implemented as a decentralized group of nodes that operate independently of each other. When any node in the system detects a failure, before taking actions relevant to the failure, it needs to obtain confirmation from a majority (quorum) of nodes in the arbitrator group.

The arbitrator protocol details can be found in section 4.2.2 of the paper. Using lightweight arbitrator groups allows membership, and hence the ring, to scale to whole datacenters.

Leader election

Given we have a well-maintained ring, SF has a nice pragmatic solution to leader election:

For any key k in the SF-Ring, there is a unique leader: the node whose token range contains k (this is unique due to the safety and liveness of routing tokens). Any node can contact the leader by routing to key k. Leader election is thus implicit and entails no extra messages. In cases where a leader is needed for the entire ring we use k=0.
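The implicit election can be sketched as a simple lookup: the leader for a key is the node whose token range contains it, which is unique because tokens never overlap and eventually cover the whole ring. (A minimal sketch with non-wrapping ranges for brevity; names are my own.)

```python
# Sketch of implicit leader election via token-range ownership.

def leader_for(key, tokens):
    """tokens: sorted (start, end, node) ranges, end exclusive, covering
    the ring with no overlap (wrap-around omitted for brevity)."""
    for start, end, node in tokens:
        if start <= key < end:
            return node

tokens = [(0, 100, "n1"), (100, 200, "n2"), (200, 256, "n3")]
ring_leader = leader_for(0, tokens)   # k = 0 elects a leader for the whole ring
```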

Reliability subsystem

In the interests of space, I’m going to concentrate on the placement and load balancer (PLB) component of the reliability subsystem. Its job is to place microservice instances at nodes in such a way as to ensure balanced load.

Unlike traditional DHTs, where object IDs are hashed to the ring, the PLB explicitly assigns each service’s replicas (primary and secondaries) to nodes in SF-Ring.

The placement considers available resources at nodes, outstanding requests, and the parameters of typical requests. It also continually moves services from heavily loaded nodes to under-utilised ones. The PLB also migrates services away from a node that is about to be upgraded.

The PLB may be dealing with tens of thousands of objects in a constantly changing environment, so decisions taken at one moment may not be optimal in the next. The PLB therefore favours making quick and nimble decisions, continuously making small improvements. Simulated annealing is used for this. The simulated annealing algorithm sets a timer (10s in fast mode, 120s in slow mode) and explores the state space until convergence or until the timer expires. Each state has an energy. The energy function is user-definable, but a common case is the average standard deviation of all metrics in the cluster (lower is better).

Each step generates a random move, considers the energy of the new prospective state due to this move, and decides whether to jump. If the new state has lower energy the annealing process jumps with probability 1; otherwise if the new state has d more energy than the current and the current temperature is T, the jump happens with probability e^{-d/T}. This temperature T is high in initial steps (allowing jumps away from local minima) but falls linearly across iterations to allow convergence later.
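The acceptance rule described above can be sketched directly (the parameter values and the linear temperature schedule shape here are illustrative, not Service Fabric's actual tuning):

```python
# Sketch of the simulated annealing acceptance rule: always jump downhill;
# jump uphill by energy delta d with probability e^(-d/T).

import math
import random

def accept(current_energy, new_energy, temperature):
    if new_energy <= current_energy:
        return True                     # downhill: always jump
    d = new_energy - current_energy
    return random.random() < math.exp(-d / temperature)

def linear_temperature(step, steps, t_start=10.0, t_end=0.1):
    """Temperature falls linearly from t_start to t_end across the run,
    permitting uphill jumps early and forcing convergence later."""
    return t_start + (t_end - t_start) * step / (steps - 1)
```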

Considered moves are fine-grained. For example, swapping a secondary replica to another node, or swapping primary and secondary replica.

Reliable collections

SF’s reliable collections provide data structures such as dictionaries and queues that are persistent, available and fault-tolerant, efficient, and transactional. State is kept locally in the service instance while also being made highly available, so reads are local. Writes are relayed from primary to secondaries via passive replication and considered complete once a quorum has acknowledged.

Reliable collections build on the services of the federation and reliability subsystems: replicas are organised in an SF-Ring, failures are detected and a primary kept elected. PLB (in conjunction with the failover manager ) keeps replicas fault-tolerant and load-balanced.

SF is the only self-sufficient microservice system that can be used to build a transactionally consistent database which is reliable, self-*, and upgradable.

Lessons learned

Section 7 of the paper contains an interesting discussion of lessons learned during the development of SF. Since I’m already over my target write-up length, I will just give the headlines here and refer you to the paper for full details:

  • Distributed systems are more than just nodes and a network. Grey failures are common.
  • Application/platform responsibilities need to be well isolated (you can’t trust developers to always do the right thing).
  • Capacity planning is the application’s responsibility (but developers need help)
  • Different subsystems require different levels of investment

What’s next?

Much of our ongoing work addresses the problem of reducing the friction of managing the clusters. One effort towards that is to move to a service where the customer never sees individual servers… other interesting and longer term models revolve around having customers owning servers, but also being able to run microservice management as a service where those servers join in. Also in the short term we are looking at enabling different consistency levels in our Reliable Collections, automatically scaling in and out Reliable Collection partitions, and imbuing the ability to geo-distribute replica sets. Slightly longer term, we are looking at best utilizing non-volatile memory as a store for ServiceFabric’s Reliable Collections. This requires tackling many interesting problems ranging from logging bytes vs. block oriented storage, efficient encryption, and transaction-aware memory allocations.

Hyperledger fabric: a distributed operating system for permissioned blockchains

June 4, 2018

Hyperledger fabric: a distributed operating system for permissioned blockchains Androulaki et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).

This very well written paper outlines the design of HyperLedger Fabric and the rationales for many of the key design decisions. It’s a great introduction and overview. Fabric is a permissioned blockchain system with the following key features:

  • A modular design allows many components to be pluggable, including the consensus algorithm
  • Instead of the order-execute architecture used by virtually all existing blockchain systems, Fabric uses an execute-order-validate paradigm which enables a combination of passive and active replication. (We’ll be getting into this in much more detail shortly).
  • Smart contracts can be written in any language.

…in popular deployment configurations, Fabric achieves throughput of more than 3500 tps, achieving finality with latency of a few hundred ms and scaling well to over 100 peers.

Examples of use cases powered by Fabric include foreign exchange netting in which a blockchain is used to resolve trades that aren’t settling; enterprise asset management tracking hardware assets as they move from manufacturing to deployment and eventually to disposal; and a global cross-currency payments system processing transaction among partners in the APFII organisation in the Pacific region.

The big picture

Fabric is a distributed operating system for permissioned blockchains that executes distributed applications written in general purpose programming languages (e.g., Go, Java, Node.js). It securely tracks its execution history in an append-only replicated ledger data structure and has no cryptocurrency built in.

A Fabric blockchain consists of a set of permissioned nodes, with identities provided by a modular membership service provider (MSP). Nodes in the network play one of three roles: client, peer, or ordering service.

  • Clients submit transaction proposals for execution, help orchestrate the execution phase, and finally broadcast transactions for ordering.
  • Peers execute transaction proposals and validate transactions. All peers maintain the blockchain ledger. Not all peers execute all transaction proposals, only a subset of nodes called endorsing peers (or endorsers) do so, as specified by the policy of the chaincode (smart contract) to which the transaction pertains.
  • Ordering service nodes (aka orderers) collectively form an ordering service that establishes a total order across all transactions.

It’s possible to construct Fabric networks with multiple blockchains connected to the same ordering service; each such blockchain is called a channel.

Channels can be used to partition the state of the blockchain network, but consensus across channels is not coordinated and the total order of transactions in each channel is separate from the others.

From order-execute to execute-order-validate

Everything in Fabric revolves around the execute-order-validate processing pipeline. This is a departure from the traditional blockchain model.

All previous blockchain systems, permissioned or not, follow the order-execute architecture. This means that the blockchain network orders transactions first, using a consensus protocol, and then executes them in the same order on all peers sequentially.

Fabric versions up to 0.6 also used the order-execute approach. The weaknesses of order-execute are that it forces sequential execution of transactions, cannot cope with non-deterministic code, and requires all smart contracts to run on all peers, which may introduce confidentiality concerns. Feedback across many proof-of-concept applications highlighted some of the practical issues with order-execute too:

  • Users would report a bug in the consensus protocol which, on investigation, in every case turned out to be caused by non-deterministic transaction code.
  • Users would complain of poor performance, e.g., only five transactions per second, and then on investigation it turned out that the average transaction for the user took 200ms to execute.

We have learned that the key properties of a blockchain system, namely consistency, security, and performance, must not depend on the knowledge and goodwill of its users, in particular since the blockchain should run in an untrusted environment.

Fabric rejects order-execute and instead uses a three-phase execute-order-validate architecture. A distributed application in Fabric consists of two parts: its smart contract, or chaincode, and an endorsement policy, configured by system administrators, which indicates the permissible endorsers of a transaction.

The execution phase

Clients sign and send a transaction proposal to one or more endorsers for execution. Endorsers simulate the proposal, executing the operation on the specified chaincode. Chaincode runs in an isolated container. As a result of the simulation, the endorser produces a writeset (modified keys along with their new values) and a readset of keys read during the simulation, along with their version numbers. The endorser then cryptographically signs an endorsement which includes the readset and writeset, and sends this to the client.

The client collects endorsements until they satisfy the endorsement policy of the chaincode (e.g. x of N). All endorsers of the policy are required to produce the same result (i.e., identical readset and writeset). The client then creates a transaction and passes it to the ordering service.
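As a rough sketch of what this collection step involves (the `Endorsement` type and `satisfies_policy` function below are illustrative, not Fabric’s actual API), an x-of-N policy check can group endorsements by their read/write sets and look for a large enough set of distinct endorsers that agree:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Endorsement:
    endorser: str    # identity of the endorsing peer
    readset: tuple   # ((key, version), ...) read during simulation
    writeset: tuple  # ((key, new_value), ...) produced by simulation

def satisfies_policy(endorsements, required, endorsers):
    """x-of-N policy: at least `required` distinct peers from the
    endorser set must produce identical readsets and writesets."""
    groups = defaultdict(set)
    for e in endorsements:
        if e.endorser in endorsers:
            groups[(e.readset, e.writeset)].add(e.endorser)
    return any(len(peers) >= required for peers in groups.values())
```

Note that an endorsement with a diverging readset or writeset never counts towards the threshold, which is exactly why high contention can cause a proposal to fail.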

Note that under high contention for certain keys, it is possible for endorsers to return different results and the proposal will fail. “We consciously adopted this design, as it considerably simplifies the architecture and is adequate for typical blockchain applications.” In the future CRDTs may be supported to enhance the liveness of Fabric under contention.

Executing a transaction before the ordering phase is critical to tolerating non-deterministic chaincodes. A non-deterministic chaincode in Fabric can only endanger the liveness of its own operations, because, for instance, a client might not be able to gather a sufficient number of matching endorsements.

The ordering phase

When a client has assembled enough endorsements it submits a transaction to the ordering service.

The ordering phase establishes a total order on all submitted transactions per channel. In other words, ordering atomically broadcasts endorsements and thereby establishes consensus on transactions, despite faulty orderers. Moreover, the ordering service batches multiple transactions into blocks and outputs a hash-chained sequence of blocks containing transactions.
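The hash-chained block structure can be sketched in a few lines of Python (the field names here are illustrative; Fabric’s actual block format is considerably richer):

```python
import hashlib
import json

def block_hash(prev_hash, txs):
    """Hash a block over its predecessor's hash and its transactions,
    making the sequence of blocks tamper-evident."""
    payload = json.dumps({"prev": prev_hash, "txs": txs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def build_chain(batches, genesis="0" * 64):
    """Batch transactions into blocks, chaining each block to the last."""
    chain, prev = [], genesis
    for txs in batches:
        h = block_hash(prev, txs)
        chain.append({"prev": prev, "txs": txs, "hash": h})
        prev = h
    return chain

def verify_chain(chain, genesis="0" * 64):
    """Recompute every hash; any modified block breaks the chain."""
    prev = genesis
    for block in chain:
        if block["prev"] != prev or block_hash(prev, block["txs"]) != block["hash"]:
            return False
        prev = block["hash"]
    return True
```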

There may be a large number of peers in the blockchain network, but only relatively few are expected to implement the ordering service. Fabric can be configured to use a built-in gossip service to disseminate delivered blocks from the ordering service to all peers.

The ordering service is not involved in maintaining any blockchain state, and does not validate or execute transactions. Thus the consensus mechanism is completely separated from execution and validation and can be made pluggable (for example using crash-fault tolerant – CFT – or Byzantine fault tolerant – BFT – consensus algorithms).

The validation phase

Blocks are delivered to peers either directly by the ordering service or via gossip. Validation then consists of three sequential steps:

  1. Endorsement policy validation happens in parallel for all transactions in the block. If endorsement fails the transaction is marked as invalid.
  2. A read-write conflict check is done for all transactions in the block sequentially. If versions don’t match the transaction is marked as invalid.
  3. The ledger update phase appends the block to the locally stored ledger and updates the blockchain state.
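Step 2 is essentially a multi-version concurrency control (MVCC) check. A minimal Python sketch, assuming the state maps each key to a (value, version) pair (absent keys are treated as version 0):

```python
def validate_block(block_txs, state):
    """Sequentially validate a block's transactions: a transaction is
    valid only if every key it read is still at the version it saw
    during simulation. Valid writes bump each key's version."""
    results = []
    for tx in block_txs:
        valid = all(state.get(k, (None, 0))[1] == ver
                    for k, ver in tx["readset"])
        if valid:
            for k, v in tx["writeset"]:
                _, ver = state.get(k, (None, 0))
                state[k] = (v, ver + 1)
        results.append(valid)
    return results
```

Because validation is sequential within a block, the first of two conflicting transactions wins and the second is marked invalid.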

The ledger of Fabric contains all transactions, including those that are deemed invalid. This follows from the overall design, because the ordering service, which is agnostic to chaincode state, produces the chain of blocks, and because validation is done by peers post-consensus.

A nice property that comes from persisting even invalid transactions is that they can be audited, and clients that try to mount a DoS attack by flooding the network with invalid transactions can easily be detected.

Selected component details

Section 4 in the paper contains a number of interesting implementation details for the various components. In the interests of space, I’m going to focus here on the ledger itself and on chaincode execution.

The ledger component consists of a block store and a peer transaction manager. The block store persists transaction blocks in append-only files. It also maintains indices to support random access to blocks and transactions within blocks. The peer transaction manager holds the latest state in a versioned key-value store. A local key-value store is used to implement this, and there are implementations available based on LevelDB and on Apache CouchDB.

Chaincode is executed in a container which isolates chaincodes from each other and from the peer, and simplifies chaincode lifecycle. Go, Java, and Node.js chaincodes are currently supported. Chaincode and the peer communicate using gRPC. Special system chaincodes which implement parts of the Fabric itself run directly in the peer process.

Through its modularity, Fabric is well-suited for many further improvements and investigations. Future work will address (1) performance by exploring benchmarks and optimizations, (2) scalability to large deployments, (3) consistency guarantees and more general data models, (4) other resilience guarantees through different consensus protocols, (5) privacy and confidentiality for transactions and ledger data through cryptographic techniques, and much more.

ForkBase: an efficient storage engine for blockchain and forkable applications

June 1, 2018

ForkBase: an efficient storage engine for blockchain and forkable applications Wang et al., arXiv’18

ForkBase is a data storage system designed to support applications that need a combination of data versioning, forking, and tamper proofing. The prime example is blockchain systems, but collaborative applications such as Google Docs also qualify. Today, for example, Ethereum and Hyperledger build their data structures directly on top of a key-value store. ForkBase seeks to push these properties down into the storage layer instead:

One direct benefit is that it reduces development efforts for applications requiring any combination of these features. Another benefit is that it helps applications generalize better by providing additional features, such as efficient historical queries, at no extra cost. Finally, the storage engine can exploit performance optimization that is hard to achieve at the application layer.

Essentially what we end up with is a key-value store with native support for versioning, forking, and tamper evidence, built on top of an underlying object storage system. At the core of ForkBase is a novel index structure called a POS-Tree (pattern-oriented-split tree).

The ForkBase stack

From the bottom-up, ForkBase comprises a chunk storage layer that performs chunking and deduplication, a representation layer that manages versions, branches, and tamper-proofing, and a collection of data access APIs that combine structured data types and fork semantics. Higher level application services such as access control and custom merge functions can be implemented on top of the API.

ForkBase is a key-value store, where the stored objects are instances of FObject.

Data versioning

The main challenge with data versioning (keeping the full history of every data item, including any branches and merges) is managing storage consumption. Clearly there is an opportunity for deduplication, on the assumption that content does not completely change from one version to the next.

Delta-based deduplication stores just the differences (deltas) between versions, and reconstructs a given version by following a chain of deltas. You can play with the storage/reconstruction cost trade-off in such schemes.

Content-based deduplication splits data into chunks, each of which is uniquely identified by its content (i.e., a hash of the content). Identical data chunks can then be detected and redundant copies eliminated.

ForkBase opts for content-based deduplication at the chunk level. Compared to similar techniques used in file systems, ForkBase uses smaller chunks, and a data-structure aware chunking strategy. For example, a list will only be split at an element boundary so that a list item never needs to be reconstructed from multiple chunks. ForkBase recognises a number of different chunk types, each uniquely identified by its cid, which is simply a hash of the chunk contents.
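A toy Python sketch of content-addressed, deduplicating chunk storage (the fixed-size element split here is a placeholder for ForkBase’s pattern-based splitting, covered in the POS-tree section; the function names are illustrative):

```python
import hashlib

def chunk_list(items, target=4):
    """Data-structure-aware chunking for a List: split only at element
    boundaries, here simply every `target` elements for illustration."""
    return [tuple(items[i:i + target]) for i in range(0, len(items), target)]

def cid(chunk):
    """A chunk id is simply a hash of the chunk's contents, so identical
    chunks across versions map to the same id."""
    return hashlib.sha256(repr(chunk).encode()).hexdigest()

def store(chunks, storage):
    """Deduplicating store: a chunk already present under its cid is
    not stored again."""
    for c in chunks:
        storage.setdefault(cid(c), c)
    return storage
```

Two versions of a list that differ only in their tail share all their earlier chunks, so the second version costs only the chunks it changed.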

The chunkable object types (Blob, List, Set, and Map) are stored as POS-Trees, which we’ll look at shortly.

A FObject’s uid is simply an alias for the chunk id of the Meta chunk for the object.

Fork semantics

Support for forking is based on two key operations: forking and conflict resolution. Fork operations create a new branch, which evolves independently with local modifications isolated from other branches. ForkBase supports both on-demand and on-conflict forking.

On-demand forks are explicitly requested via the API and are tagged with a user-supplied name. An on-conflict fork is implicitly created upon concurrent modification of the same data item. A branch created as a result of a Fork-on-Conflict is untagged, and is identified simply by its uid.

Tagged branches can be merged with another branch, identified either by tag or by version. When conflicts are detected during a merge a conflict list is returned and the application layer is asked to provide a resolution. There are built-in resolution functions for simple strategies such as append, aggregate, and choose-one.

Tamper evidence

The uid of an FObject uniquely identifies both the object’s value and its derivation history. Logical equivalence therefore requires objects have not only the same value, but also the same history. Versions are linked in a cryptographic hash chain to ensure any attempt at tampering can be detected. Each FObject stores the hashes of the previous versions it derives from in the bases field.

The Pattern-Oriented-Split Tree

Large structured objects are not usually accessed in their entirety. Instead, they require fine-grained access, such as element look-up, range query and update. These access patterns require index-structures e.g., B+-tree, to be efficient. However, existing index structures are not suitable in our context that has many versions and where versions can be merged.

The capacity-based splitting strategies of B+-trees and variants are sensitive to the values being indexed and their insertion order. This makes it harder to deduplicate across versions, and harder to find differences between two versions when merging. Using fixed-sized nodes gets around the insertion order issue, but introduces another issue known as the boundary-shifting problem due to the insertions in the middle of the structure.

The authors’ solution is the Pattern-Oriented-Split Tree which supports the following properties:

  • Fast lookup and update
  • Fast determination of differences between two trees, and subsequent merge
  • Efficient deduplication
  • Tamper evidence

Every node in the tree is a chunk (either an index chunk, or at the leaves, an object chunk). Lookups follow a path guided by the split keys. Child node cids are cryptographic hashes of their content, as in a Merkle tree. Two objects with the same data will have the same POS-tree, and tree comparison affords an efficient recursive solution. The real secret sauce here lies in how POS-Trees decide where to make splits.

The structure is inspired by content-based slicing and resembles a combination of a B+-tree and a Merkle tree. In a POS-tree, the node (i.e., chunk) boundary is defined as patterns detected from the object content. Specifically, to construct a node, we scan from the beginning until a pre-defined pattern occurs, and then create new node to hold the scanned content. Because the leaf nodes and internal nodes have different degrees of randomness, we define different patterns for them.

Leaf node splitting is done using a rolling hash function. Whenever the q least significant bits in the rolling hash are all zero a pattern match is said to occur. If a pattern match occurs in the middle of an element (e.g., a key-value pair in a Map) then the chunk boundary is extended to cover the whole element. Every leaf node except for the last node therefore ends with a pattern.

Index splitting uses a simpler strategy, looking for a cid pattern where cid && (2^r -1) == 0. The expected chunk size can be configured by choosing appropriate values for q and r. To ensure chunks cannot grow arbitrarily large, a chunk boundary can be forced at some threshold value. POS-tree is not designed for cases where the object content is simply a sequence of repeated items – without the pattern all nodes gravitate to the maximum chunk size and the boundary shift problem returns.
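A minimal Python sketch of pattern-based leaf splitting (the simple polynomial rolling hash here is only a stand-in for ForkBase’s actual hash function, and the parameter values are arbitrary):

```python
def boundaries(data, q=4, window=8, max_chunk=64):
    """Content-defined chunk boundaries: feed bytes into a rolling hash
    and cut wherever the q least significant bits are all zero, once a
    minimum window has been scanned. A max_chunk cap stops chunks from
    growing without bound on pattern-free input."""
    cuts, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) % (1 << 32)
        size = i - start + 1
        if (size >= window and (h & ((1 << q) - 1)) == 0) or size >= max_chunk:
            cuts.append(i + 1)
            start, h = i + 1, 0   # reset at each boundary
    if start < len(data):
        cuts.append(len(data))
    return cuts
```

Because the hash state resets at every boundary, chunking after a boundary depends only on the content from that point on, which is what makes boundaries stable under local edits and hence friendly to deduplication.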

ForkBase in action – Hyperledger

Section 5 of the paper looks at the construction of a blockchain platform, wiki engine, and collaborative analytics application on top of ForkBase. I’m just going to concentrate here on the blockchain use case, in which the authors port Hyperledger v0.6 to run on top of ForkBase.

Moving Hyperledger onto ForkBase takes 18 new lines of code, and eliminates 1,918 lines from the Hyperledger code base. (ForkBase itself is about 30K lines of code, mind you!)

Another benefit is that the data is now readily usable for analytics. For state scan query, we simply follow the version number stored in the latest block to get the latest Blob object for the requested key. From there, we follow base-version to retrieve the previous values. For block scan query, we follow the version number stored on the requested block to retrieve the second-level Map object for this block. We then iterate through the key-value tuples and retrieve the corresponding Blob objects.

Both state scans (returning the history of a given state) and block scans (returning the values of the states at a specific block) are slower in the original Hyperledger codebase, which is designed for fast access to the latest states. (Note: this seems to be referring to the peer transaction manager, or PTM, component of Hyperledger. Hyperledger also includes a block store which is indexed).

It’s in these scan operations that ForkBase shows the biggest performance benefits. If we look at latency and throughput, the ForkBase-based and RocksDB-based Hyperledger implementations are pretty close. (ForkBase-KV in the figure below is Hyperledger using ForkBase as a pure KV store, not taking advantage of any of the advanced features).


zkLedger: privacy-preserving auditing for distributed ledgers

May 31, 2018

zkLedger: privacy-preserving auditing for distributed ledgers Narula et al., NSDI’18

Somewhat similarly to Solidus that we looked at late last year, zkLedger (presumably this stands for zero-knowledge Ledger) provides transaction privacy for participants in a permissioned blockchain setting. zkLedger also has an extra trick up its sleeve: it provides rich and fully privacy-preserving auditing capabilities. Thus a number of financial institutions can collectively use a blockchain-based settlement ledger, and an auditor can measure properties such as financial leverage, asset illiquidity, counter-party risk exposures, and market concentration, either for the system as a whole, or for individual participants. It provides a cryptographically verified level of transparency that’s a step beyond anything we have today.

The goals of zkLedger are to hide the amounts, participants, and links between transactions while maintaining a verifiable transaction ledger, and for the Auditor to receive reliable answers to its queries. Specifically, zkLedger lets banks issue hidden transfer transactions which are still publicly verifiable by all other participants; every participant can confirm a transaction conserves assets and assets are only transferred with the spending bank’s authority.

Setting the stage

A zkLedger system comprises n banks and an auditor that verifies certain operational aspects of the transactions performed by the banks. A depositor or set of depositors can also issue and withdraw assets from the system. Issuance and withdrawal of assets are global public events.

The main action takes place when banks exchange assets by creating transfer transactions. A transfer moves v shares of some asset t to a given recipient bank (or banks). Agreements to transfer are arranged outside of the system, and settled on zkLedger. All transactions are submitted to a globally-ordered append-only ledger, which could be a blockchain.

Cryptographic building blocks

To protect their privacy, banks do not broadcast payment details in the clear. Instead, banks post commitments to the ledger, using Pedersen commitments. Pedersen commitments are perfectly hiding and computationally binding; they are also additively homomorphic, a fact which zkLedger makes extensive use of. (By additively homomorphic we mean that given commitments to values v1 and v2, there is an operation we can perform on those commitments to produce a commitment to the value v1 + v2.)
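The homomorphism is easy to see with a toy Pedersen commitment over a multiplicative group, C(v, r) = g^v · h^r mod p. The tiny parameters below are for illustration only and are nowhere near secure (and real Pedersen commitments require generators with an unknown discrete-log relation):

```python
# Toy, insecure parameters: a Mersenne prime modulus and two small
# "generators" assumed independent for the purposes of this sketch.
p = 2**127 - 1
g, h = 3, 5

def commit(v, r):
    """Pedersen commitment C(v, r) = g^v * h^r mod p. The blinding
    factor r makes the commitment hiding; g and h make it binding."""
    return (pow(g, v, p) * pow(h, r, p)) % p
```

Multiplying two commitments yields a commitment to the sum of the values (under the sum of the blinding factors), which is the property zkLedger exploits throughout.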

Every bank has a Schnorr signature keypair and distributes their public key to all other system participants.

Assertions about payment details are made based on non-interactive zero-knowledge proofs (NIZKs). In an NIZK scheme a prover can convince a verifier of some property about private data the prover holds, without revealing the private data itself. The binary string proof π can be appended to the ledger and verified by any party of the system without interaction between the prover and the verifier.

In theory, NIZK proof systems exist for all properties in NP whereas the practical feasibility of NIZKs is highly dependent on the complexity of the property at hand… The design of zkLedger is carefully structured so that all NIZK proofs have particularly efficient constructions.

The zkLedger

At a high level zkLedger looks like this:

Banks maintain their own private state, and for efficiency a commitment cache which holds a rolling product of commitments by row and by asset so that each bank can quickly produce proofs and answer questions from auditors. The ledger itself has one entry (row) per transaction, and every row contains one column for each participating bank. (Banks can be added or removed by appending special signed transactions to the ledger).

Suppose bank A wants to transfer 100 shares of an asset to bank B. The transaction row conceptually contains a -100 entry in A’s column, 100 in B’s column, and zero in every other column. The values are not posted in the clear though, instead the column entries are Pedersen commitments for the respective amounts. Since there is no way for an outsider to tell the difference between a commitment to zero and any other value, both the transaction amounts and participants are protected.

Keeping values and participants private is a good start, but we also need to maintain overall integrity via the following invariants:

  • Transfer transactions cannot create or destroy assets
  • The spending bank must give consent to the transfer, and must own enough of the particular asset to execute the transaction

The first invariant is upheld via a proof of balance, and the second invariant is upheld using a proof of assets.

For proof of balance it suffices to show that the values in a given row sum to zero. If the prover chooses the random inputs r to the commitments such that all of the r values also sum to zero, then a verifier can confirm that the committed values all sum to zero by checking that the product of the commitments is 1.
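Using a toy Pedersen commitment C(v, r) = g^v · h^r mod p (insecure, illustration-only parameters), the verifier’s balance check really does reduce to a single product:

```python
# Toy, insecure parameters for illustration only.
p = 2**127 - 1
g, h = 3, 5

def commit(v, r):
    """Pedersen commitment C(v, r) = g^v * h^r mod p."""
    return (pow(g, v, p) * pow(h, r, p)) % p

def row_balances(entries):
    """Verifier's proof-of-balance check: the product of all commitments
    in a row equals 1 exactly when both the committed values and the
    blinding factors sum to zero."""
    prod = 1
    for c in entries:
        prod = (prod * c) % p
    return prod == 1
```

Note that if the r values don’t also sum to zero, a stray factor of h^Σr remains and the product is not 1, even though the values balance.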

A common approach to showing proof-of-assets is to use Unspent Transaction Objects (UTXOs). In a system that doesn’t use zk-SNARKs though, this leaks the transaction graph. zk-SNARKs require a trusted third party for setup, which zkLedger wants to avoid: “the consequences of incorrect or compromised setup are potentially disastrous…”

In zkLedger, a bank proves it has assets by creating a commitment to the sum of the value for the asset in its column, including this transaction. If the sum is greater than or equal to 0, then the bank has the assets to transfer. Note that this is true since the bank’s column represents all the assets it has received or spent, and the Pedersen commitments can be homomorphically added in columns as well as in rows.

In addition, in its own entry (where the value is negative), a bank includes proof of the knowledge of its secret key as a proof of authorisation. Thus we have a disjunctive proof – either the committed value for an entry is greater than or equal to zero, or the creator of the transaction knows the secret key for the entry.

There’s one more issue we still need to consider: commitments rely on modular arithmetic. If we’re working modulo N, we need to make sure that committed values lie within 0..N-1. Range proofs are used to show that values are within range, and zkLedger supports asset value amounts up to a trillion. Now the only thing I can really tell you about range proofs is that they’re the most expensive part of generating the transaction, and if we’re not careful we need two of them: one for the commitment value and one for the sum of assets in the column. With a level of indirection zkLedger manages to get this back down to just one range proof per transaction.


The auditor can ask a query of a Bank, such as “How many Euros did you hold at time t?,” and the bank responds with an answer and a proof that the answer is consistent with the transactions on the ledger. The auditor can multiply commitments in the bank’s column for Euros, and verify the proof and answer with the total. Given the table construction, the auditor knows that they are seeing every asset transfer – there is no way for a bank to ‘hide’ assets on the ledger.

Given that every bank is affected by every transaction (because each row contains a commitment for every bank, even if to the value zero), each bank needs to be able to total and prove all of the commitments in its column. To do this, the bank needs to know the random input used for each of those commitments, otherwise it won’t be able to open up commitments to the auditor. To meet this requirement, the spending bank is required to include a publicly verifiable Token in every entry, which is based on a combination of the bank’s public key and the random input. The token construction enables the bank to show that the asset total is correct without actually needing to know the random input (details in §4.2 of the paper). Alongside the token, we also need a proof of consistency that the same random input was used both in construction of the token and in forming the value commitment.

Through the use of sums, means, ratios, variance, co-variance, and standard deviation, an auditor in zkLedger can determine the following, among other measurements: leverage ratios (how much of an asset a bank has on its books compared to other holdings); concentration (using a measure called the Herfindahl-Hirschman Index – HHI – to measure how competitive an industry is); real-time price indexes.

Sums are supported via the additive structure of Pedersen commitments. For everything else there is map/reduce. Take as an example an auditor that wants to calculate the mean transaction size for a given bank and asset. A commitment to the total value is obtained by summing the column, but we don’t know what the denominator should be, because we don’t know which entries are actually commitments to zero. Map/reduce solves this: in the map step the bank produces new commitments per row indicating whether or not the bank was involved in the transaction (1 if the bank is involved, zero otherwise). In the reduce step these commitments are summed and the result is sent to the auditor along with the corresponding proofs. More complex queries may require multiple map and reduce computations (see the example in §5 of the paper for computing variance).
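A sketch of the mean-denominator computation, reusing toy (insecure) commitment parameters; in zkLedger each flag commitment additionally carries a zero-knowledge proof that it opens to 0 or 1, which is omitted here:

```python
import random

# Toy, insecure parameters for illustration only.
p = 2**127 - 1
g, h = 3, 5

def commit(v, r):
    """Pedersen commitment C(v, r) = g^v * h^r mod p."""
    return (pow(g, v, p) * pow(h, r, p)) % p

def map_step(involvement):
    """Map: for each row, the bank commits to 1 if it took part in the
    transaction and 0 otherwise, with a fresh blinding factor each time."""
    return [(flag, random.randrange(p)) for flag in involvement]

def reduce_step(flag_commits):
    """Reduce: homomorphically multiply the flag commitments; the opened
    total is the denominator for the mean transaction size."""
    prod, total_f, total_r = 1, 0, 0
    for f, r in flag_commits:
        prod = (prod * commit(f, r)) % p
        total_f += f
        total_r += r
    return prod, total_f, total_r
```

The auditor checks the opened total against the product of the per-row commitments, so the bank cannot understate how many transactions it was involved in.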

Putting it all together

For a transfer transaction, each entry in the row contains:

  • A Pedersen commitment to the value being transferred
  • An audit token so that audit requests can be answered without knowing the random input to the commitment
  • A proof-of-balance
  • A proof-of-assets
  • A proof-of-consistency between tokens and commitments

Banks can also add additional metadata – either encrypted or in plaintext.

Performance evaluation

A Go prototype of zkLedger shows that it is possible to create the needed proofs in milliseconds.

However, the cost of verifying transactions increases quadratically with the number of banks, and all transactions must be strictly serialised. Banks can verify transactions in parallel, so the time to process transactions increases linearly.


With 10 banks, we’re already down to around 2 transactions per second. “We are optimistic that a faster range proof implementation will directly improve performance.” Realistically though, it looks like with the current state-of-the-art we’re limited to fairly low volume markets with limited numbers of participants.

Using the commitment cache (online auditor below), auditing time is roughly constant. Without it (offline auditor) audit time is linear in the number of transactions in the ledger.


zkLedger is the first distributed ledger system to provide strong transaction privacy, public verifiability, and complete, provably correct auditing. zkLedger supports a rich set of auditing queries which are useful to measure the financial health of a market.

Towards a design philosophy for interoperable blockchain systems

May 30, 2018

Towards a design philosophy for interoperable blockchain systems Hardjono et al., arXiv 2018

Once upon a time there were networks and inter-networking, which let carefully managed groups of computers talk to each other. Then with a capital “I” came the Internet, with design principles that ultimately enabled devices all over the world to interoperate. Like many other people, I have often thought about the parallels between networks and blockchains, between the Internet, and something we might call ‘the Blockchain’ (capital ‘B’). In today’s paper choice, Hardjono et al. explore this relationship, seeing what we can learn from the design principles of the Internet, and what it might take to create an interoperable blockchain infrastructure. Some of these lessons are embodied in the MIT Tradecoin project.

We argue that if blockchain technology seeks to be a fundamental component of the future global distributed network of commerce and value, then its architecture must also satisfy the same fundamental goals of the Internet architecture.

The design philosophy of the Internet

This section of the paper is a précis of ‘The design philosophy of the DARPA Internet protocols’ from SIGCOMM 1988. The top three fundamental goals for the Internet as conceived by DARPA at that time were:

  1. Survivability: Internet communications must continue even if individual networks or gateways were lost
  2. The ability to support multiple types of communication service (with differing speed, latency, and reliability requirements).
  3. The ability to accommodate and incorporate a variety of networks

In addition, the end-to-end principle was central in deciding where responsibility for functionality should lie: in the network versus in the applications at the network endpoints. A classic example is end-to-end encryption, which needs to be between the communicating parties and therefore places responsibility for this with the endpoints.

The Internet is structured as a collection of autonomous systems (routing domains), stitched together through peering agreements. Autonomous Systems (ASs) are owned and operated by legal entities. All routers and related devices are uniquely identified within a domain. Interaction across domains is via gateways (using e.g. BGP).

A design philosophy for the Blockchain

We believe the issue of survivability to be as important as that of privacy and security. As such, we believe that interoperability across blockchain systems will be a core requirement — both at the mechanical level and the value level — if blockchain systems and technologies are to become fundamental infrastructure components of future global commerce.

An interoperable blockchain architecture as defined by the authors has the following characteristics:

  • It is composed of distinguishable blockchain systems, each representing a distributed data ledger
  • Transaction execution may span multiple blockchain systems
  • Data recorded in one blockchain is reachable and verifiable by another (possibly foreign) transaction in a semantically compatible manner

Survivability is defined in terms of application level transactions: it should still be possible to complete a transaction even when parts of The Blockchain are damaged.

The application level transaction may be composed of multiple ledger-level transactions (sub-transactions), which may be intended for multiple distinct blockchain systems (e.g. a sub-transaction for asset transfer, simultaneously with a sub-transaction for payments and a sub-transaction for taxes).

(Are we reinventing XA all over again?)

Sub-transactions confirmed on a spread of blockchain systems are opaque to the user application, in the same way that the routing of packets through multiple domains is opaque to a communications application.

The notions of survivability and blockchain substitution in the event of failure raise a number of questions such as the degree to which an application needs to be aware of individual blockchain systems’ capabilities and constructs, and where responsibility for reliability (e.g. re-transmitting a transaction) should lie. What should we do about resident smart contracts that exist on a (possibly unreachable) blockchain system, and hence may not be invokable or able to complete? Can smart contracts be moved across chains? Should the current chain on which a contract resides be opaque to applications (i.e., give it an “IP” address which works across the whole Blockchain)? How do we know when to trigger the moving of a contract from one chain to another?

The Internet goal of supporting multiple types of service with differing requirements is reinterpreted as the need to support multiple types of chain with differing consensus, throughput, and latency characteristics. (And we might also add security and privacy to that list).

When it comes to accommodating multiple different blockchain systems, we want to be able to support transactions spanning blockchains operated (or owned) by different entities. In the Internet, the minimum assumption is that each network must be able to transport a datagram or packet as the lowest common denominator. What is the corresponding minimum assumption for blockchains? How can data be referenced across chains? What combinations of anonymity (for users and for nodes) can be supported?

The notion of value is at a layer above blockchain transactions (just as the Internet separates the mechanical transmission of packets from the value of the information contained in those packets). For families of applications that need to transfer value across chains, the Inter-Ledger Protocol offers a promising direction.


The MIT Tradecoin project has a number of objectives; a core goal is the development of a “blueprint” model for interoperable blockchain systems which can be applied to multiple use cases.

Ultimately there are two different levels of interoperability: mechanical level interoperability, and value level interoperability (encompassing constructs that accord value as perceived in the human world). “Humans, societies, real assets, fiat currencies, liquidity, legal regimes and regulations all contribute to form the notion of value as attached to (bound to) the constructs (e.g., coins, tokens) that circulate in the blockchain system….” The two-level view follows the end-to-end principle by placing the human semantics (value) at the ends of (outside) the mechanical systems.

Legal trust is the contract that binds the technical roots of trust at the mechanical level with legally enforceable obligations and warranties.

Legal trust is the bridge between the mechanical level and the value level. That is, technical-trust and legal-trust support business trust (at the value level) by supporting real-world participants in quantifying and managing risks associated with transactions occurring at the mechanical level. Standardization of technologies that implement technical trust promotes the standardization of legal contracts — also known as legal trust frameworks — which in turn reduces the overall business cost of operating autonomous systems.

(And not only that, it provides the trust required for businesses to trade value on blockchains).

Tradecoin views individual blockchain systems as fully autonomous, and connects them via gateways. Gateways provide value stability, reachability, and transaction mediation for cross-domain transactions.

To support reachability, gateways resolve identifiers and may provide a NAT-like function to translate between internal and external identifiers. When it comes to transaction mediation, the Tradecoin view seems to be that gateways will act as transaction coordinators, with individual blockchain systems acting as resource managers.
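The NAT analogy can be made concrete with a minimal sketch: a gateway mints externally visible identifiers for internal ledger addresses, so that cross-domain transactions never reference internal state directly. (The `Gateway` class and its method names are illustrative assumptions, not an API from the paper.)

```python
class Gateway:
    """Sketch of a gateway's NAT-like identifier translation."""

    def __init__(self, chain_id: str):
        self.chain_id = chain_id
        self._ext_to_int = {}  # externally visible id -> internal address
        self._int_to_ext = {}

    def expose(self, internal_addr: str) -> str:
        # Mint an external identifier that hides the internal address,
        # much as a NAT maps a private IP to a public one. Reuse the
        # existing mapping if the address is already exposed.
        if internal_addr in self._int_to_ext:
            return self._int_to_ext[internal_addr]
        ext_id = f"{self.chain_id}/{len(self._ext_to_int)}"
        self._ext_to_int[ext_id] = internal_addr
        self._int_to_ext[internal_addr] = ext_id
        return ext_id

    def resolve(self, ext_id: str) -> str:
        # Cross-domain transactions carry only the external identifier;
        # the gateway resolves it to the internal ledger address.
        return self._ext_to_int[ext_id]

gw = Gateway("BC1")
ext_id = gw.expose("internal-ledger-addr-7")
assert gw.resolve(ext_id) == "internal-ledger-addr-7"
```

A global resolution mechanism (desirable feature 4 below) would then only need to route an external identifier to the right gateway, not understand any chain's internal addressing.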

Consider two permissioned blockchains, BC1 and BC2: since one side cannot see the ledger at the other side, the gateways of each blockchain must “vouch” that the transaction has been confirmed on the respective ledgers. That is, the gateways must issue legally-binding signed assertions that make them liable for misreporting (intentionally or otherwise). The signature can be issued by one gateway only, or it can be a collective group signature of all gateways in the blockchain system.
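The vouching step amounts to a signed attestation over a confirmation claim. The sketch below uses an HMAC as a stand-in signature so it stays stdlib-only; a real gateway would use an asymmetric (or collective group) signature, and the field names are hypothetical.

```python
import hashlib
import hmac
import json

def vouch(gateway_key: bytes, gateway_id: str,
          tx_id: str, block_height: int) -> dict:
    """Produce a signed assertion that tx_id was confirmed on the
    gateway's home ledger at the given block height."""
    claim = {"gateway": gateway_id, "tx": tx_id,
             "confirmed_at": block_height}
    # Canonical serialization so signer and verifier hash the same bytes.
    msg = json.dumps(claim, sort_keys=True).encode()
    claim["sig"] = hmac.new(gateway_key, msg, hashlib.sha256).hexdigest()
    return claim

def verify(gateway_key: bytes, claim: dict) -> bool:
    """Check the assertion: any tampering with the claim invalidates it."""
    body = {k: v for k, v in claim.items() if k != "sig"}
    msg = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(gateway_key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(claim["sig"], expected)
```

The legally-binding part, of course, lives outside the cryptography: the signature pins the assertion to an identifiable gateway, and the legal trust framework makes that gateway liable for it.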

For all this to work smoothly, there are five ‘desirable features’:

  1. Both the transaction initiating and recipient applications must be able to independently verify that the transaction was confirmed on their respective blockchains.
  2. Gateway signatures must be binding, regardless of the gateway selection mechanism used.
  3. There should be multiple reliable ‘paths’ (sets of gateways) between any two blockchains.
  4. There must be a global resolution mechanism for identifiers such that they can always be resolved to the correct authoritative blockchain system.
  5. Gateways must all be identifiable (i.e., not anonymous), both within and across domains. “Gateways must be able to mutually authenticate each other without any ambiguity as to their identity, legal ownership, or the ‘home’ blockchain autonomous system which they exclusively represent.”

Gateways are connected together via the equivalent of peering agreements:

For the interoperability of blockchain systems, a notion similar to peering and peering-agreements must be developed that (i) defines the semantic compatibility required for two blockchains to exchange cross-domain transactions; (ii) specifies the cross-domain protocols required; (iii) specifies the delegation and technical-trust mechanisms to be used; and (iv) defines the legal agreements (e.g. service levels, fees, penalties, liabilities, warranties) for peering. It is important to note that in the Tradecoin interoperability model, the gateways of a blockchain system represent the peering-points of the blockchain.

Requirement (iv) above seems problematic in cases where there is no well-defined legal entity associated with a blockchain.

Interoperability also forces a deeper re-thinking of how permissioned and permissionless blockchain systems can interoperate without a third party (such as an exchange).