When coding style survives compilation: de-anonymizing programmers from executable binaries Caliskan et al., NDSS’18

As a programmer you have a unique style, and stylometry techniques can be used to fingerprint your style and determine with high probability whether or not a piece of code was written by you. That makes a degree of intuitive sense when considering source code. But suppose we don’t have source code? Suppose all we have is an executable binary? Caliskan et al. show us that it’s possible to de-anonymise programmers even under these conditions. Amazingly, their technique still works even when debugging symbols are removed, aggressive compiler optimisations are enabled, and traditional binary obfuscation techniques are applied! Anonymous authorship of binaries is consequently hard to achieve.

One of the findings along the way that I found particularly interesting is that more skilled/experienced programmers are more fingerprintable. It makes sense that over time programmers acquire their own unique way of doing things, yet at the same time these results seem to suggest that experienced programmers do not converge on a strong set of stylistic conventions. That suggests to me a strong creative element in program authorship, just as experienced authors of written works develop their own unique writing styles.

If we encounter an executable binary sample in the wild, what can we learn from it? In this work, we show that the programmer’s stylistic fingerprint, or coding style, is preserved in the compilation process and can be extracted from the executable binary. This means that it may be possible to infer the programmer’s identity if we have a set of known potential candidate programmers, along with executable binary samples (or source code) known to be authored by these candidates.

Out of a pool of 100 candidate programmers, Caliskan et al. are able to attribute authorship with up to 96% accuracy, and with a pool of 600 candidate programmers, they reach 83% accuracy. These results assume that the compiler and optimisation level used for compilation of the binary are known. Fortunately, previous work has shown that toolchain provenance, including the compiler family, version, optimisation level, and source language, can be identified using a linear Conditional Random Field (CRF) with accuracy of up to 99% for language, compiler family, and optimisation level, and 92% for compiler version.

One of the potential uses for the technology is identifying authors of malware.

### Finding fingerprint features in executable binaries

So how is this seemingly impossible feat pulled off? The process for training the classifier given a corpus of works by authors in a candidate pool has four main steps, as illustrated below:

1. Disassembly: first the program is disassembled to obtain features based on machine code instructions, referenced strings, symbol information, and control flow graphs.
2. Decompilation: the program is translated into C-like pseudo-code via decompilation, and this pseudo-code is passed to a fuzzy C parser to generate an AST. Syntactical features and n-grams are extracted from the AST.
3. Dimensionality reduction: standard feature selection techniques are used to select the candidate features from amongst those produced in steps 1 and 2.
4. Classification: a random forest classifier is trained on the corresponding feature vectors to yield a program that can be used for automatic executable binary authorship attribution.

#### Disassembly

The disassembly step runs the binary through two different disassemblers: the netwide disassembler (ndisasm), which does simple instruction decoding, and radare2, a state-of-the-art open source disassembler which also understands the executable binary format. Using radare2 it is possible to extract symbols, strings, functions, and control flow graphs.

Information provided by the two disassemblers is combined to obtain our disassembly feature set as follows: we tokenize the instruction traces of both disassemblers and extract token uni-grams, bi-grams, and tri-grams within a single line of assembly, and 6-grams, which span two consecutive lines of assembly… In addition, we extract single basic blocks of radare2’s control flow graphs, as well as pairs of basic blocks connected by control flow.
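To make the n-gram extraction concrete, here is a minimal sketch (my own illustration, not the authors’ tokenizer) of how per-line token uni-, bi-, and tri-grams, plus 6-grams spanning two consecutive assembly lines, could be counted:

```python
from collections import Counter

def line_ngrams(tokens, n):
    """All n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def disassembly_features(asm_lines):
    """Count uni/bi/tri-grams within each line, plus 6-grams over line pairs."""
    counts = Counter()
    tokenized = [line.replace(",", " ").split() for line in asm_lines]
    for tokens in tokenized:
        for n in (1, 2, 3):
            counts.update(line_ngrams(tokens, n))
    # 6-grams spanning two consecutive lines of assembly
    for a, b in zip(tokenized, tokenized[1:]):
        counts.update(line_ngrams(a + b, 6))
    return counts

feats = disassembly_features([
    "mov eax, dword [rbp-0x4]",
    "add eax, 0x1",
    "mov dword [rbp-0x4], eax",
])
```

Each feature is then a (token-tuple, count) pair; across a real corpus this produces the very large raw feature set that the dimensionality-reduction step later prunes.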

#### Decompilation

Decompilation is done using the Hex-Rays commercial state-of-the-art decompiler, which produces human readable C-like pseudo-code. This code may be much longer than the original source code (e.g. decompiling a program that was originally 70 lines long may produce on average 900 lines of decompiled code).

From the decompiled result, both lexical and syntactical features are extracted. Lexical features are word unigrams capturing integer types, library function names, and internal function names (when symbol table information is available). Syntactical features are obtained by passing the code to the joern fuzzy parser and deriving features from the resulting AST.

#### Dimensionality reduction

Following steps one and two, a large number of features can be generated (e.g., 705,000 features from 900 executable binary samples taken across 100 different programmers). A first level of dimensionality reduction is applied using WEKA’s information gain attribute selection criteria, and then a second level of reduction is applied using correlation based feature selection. The end result for the 900 binary samples is a set of 53 predictive features.
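As a rough illustration of the first reduction stage, here is a small, self-contained information-gain computation over discretized features (the paper uses WEKA’s implementation, followed by correlation-based feature selection, which is omitted here; the toy data is mine):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_col, labels):
    """IG(class; feature) = H(class) - H(class | feature) for a discrete feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_col):
        sub = [lab for f, lab in zip(feature_col, labels) if f == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

# Toy data: rows are binary samples, columns are (binarized) features.
X = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
]
y = ["alice", "alice", "bob", "bob"]

gains = [information_gain([row[j] for row in X], y) for j in range(3)]
```

Features with near-zero gain (like the third column above, which is independent of the author) are discarded before the second, correlation-based pass.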

#### Classification

Classification is done using random forests with 500 trees. Data is stratified by author and analysed using k-fold cross-validation, where k is equal to the number of available code samples per author.

### Evaluation results

The main evaluation is performed using submissions to the annual Google Code Jam competition, in which thousands of programmers take part each year. “We focus our analysis on compiled C++ code, the most popular programming language used in the competition. We collect the solutions from the years 2008 to 2014 along with author names and problem identifiers.”

Datasets are created using gcc and g++, using each of O1, O2, and O3 optimisation flags (so six datasets in all). The resulting datasets contain 900 executable binary samples from 100 different authors. As we saw before, the authors are able to reduce the feature set down to 53 predictive features.

To examine the potential for overfitting, we consider the ability of this feature set to generalize to a different set of programmers, and show that it does so, further supporting our belief that these features effectively capture programming style. Features that are highly predictive of authorial fingerprints include file and stream operations along with the formats and initializations of variables from the domain of ASTs, whereas arithmetic, logic, and stack operations are the most distinguishing ones among the assembly instructions.

Without optimisation enabled, the random forest is able to correctly classify 900 test instances with 95% accuracy. Furthermore, given just a single sample of code (for training) from a given author, the author can be identified out of a pool of 100 candidates with 65% accuracy.

The classifier also reaches a point of dramatically diminishing returns with as few as three training samples, and obtains a stable accuracy by training on 6 samples. Given the complexity of the task, this combination of high accuracy with extremely low requirement on training data is remarkable, and suggests the robustness of our features and method.

The technique continues to work well as the candidate pool size grows:

### Turning up the difficulty level

Programming style is preserved to a great extent even under the most aggressive level 3 optimisations:

…programmers of optimized executable binaries can be de-anonymized, and optimization is not a highly effective code anonymization method.

Fully stripping symbol information reduces classification accuracy by 24%, so even removing symbols is not an effective form of anonymisation.

For the pièce de résistance the authors use Obfuscator-LLVM and apply all three of its obfuscation techniques (instruction substitution, introducing bogus control flow, flattening control flow graphs). And the result? “Using the same features as before, we obtain an accuracy of 88% in correctly classifying authors.”

… while we show that our method is capable of dealing with simple binary obfuscation techniques, we do not consider binaries that are heavily obfuscated to hinder reverse engineering.

### So you want to stay anonymous?

If you really do want to remain anonymous, you’d better plan for that from the very beginning of your programming career, and even then it doesn’t look easy! Here are the conditions recommended by the authors:

• Do not have any public repositories
• Don’t release multiple programs using the same online identity
• Try to have a different coding style (!) in each piece of software you write, and try to code in different programming languages.
• Use different optimisations and obfuscations to avoid deterministic patterns

Another suggestion that comes to mind is to use an obfuscator deliberately designed to prevent reverse engineering, although since these weren’t tested, we don’t actually know how effective that would be.

A programmer who accomplishes randomness across all potential identifying factors would be very difficult to deanonymize. Nevertheless, even the most privacy savvy developer might be willing to contribute to open source software or build a reputation for her identity based on her set of products, which would be a challenge for maintaining anonymity.


### What could possibly go wrong?

We’ve studied some of the issues involved in writing smart contracts before. The authors of Zeus also provide a short summary of some of the ways that smart contracts can be incorrect or unfair. Implementation challenges leading to incorrect behaviour include:

• Reentrancy: multiple parallel external invocations are possible using the call family of constructs. If global state is not correctly managed, a contract can be vulnerable to reentrancy attacks. The related cross-function race condition can occur when two different functions operate on the same global state.

• Unchecked sends: upon a send call, a computation-heavy fallback function at the receiving contract can exhaust the available gas and cause the invoking send to fail. If this is not correctly handled it can lead to loss of Ether.

• Failed sends: best practices suggest executing a throw upon a failed send in order to revert the transaction. But this practice also has its risks. For example, in the following extract the throw causes loss of money to the DAO:

• Integer over or underflow: there are over 20 different scenarios that require careful handling of integer operations to avoid overflow or underflow. For example:

• Transaction state dependence: contract writers can utilise transaction state variables such as tx.origin for managing control flow within a contract. A crafty attacker can manipulate tx.origin (for example, using social engineering or phishing techniques) to their advantage.
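The reentrancy bullet is worth making concrete. Below is a toy Python model (not Solidity, and not any real contract) of the classic pattern: the contract makes the external call before updating its own state, so a malicious receiver can re-enter withdraw() and drain funds deposited by others:

```python
class NaiveBank:
    """Toy contract vulnerable to reentrancy: it pays out before zeroing
    the caller's balance, so a malicious receiver can re-enter withdraw()
    while its recorded balance is still unchanged."""
    def __init__(self):
        self.balances = {}
        self.vault = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.vault += amount

    def withdraw(self, who):
        amount = self.balances.get(who, 0)
        if amount > 0 and self.vault >= amount:
            self.vault -= amount
            who.receive(amount)      # external call happens first...
            self.balances[who] = 0   # ...state update happens last: the bug

class Attacker:
    def __init__(self, bank):
        self.bank, self.stolen = bank, 0

    def receive(self, amount):
        self.stolen += amount
        self.bank.withdraw(self)     # re-enter before the balance is zeroed

class Honest:
    def receive(self, amount):
        pass

bank = NaiveBank()
bank.deposit(Honest(), 90)
mallory = Attacker(bank)
bank.deposit(mallory, 10)
bank.withdraw(mallory)               # mallory drains the whole vault, not just 10
```

Swapping the two lines in withdraw() (update state, then make the external call) closes the hole; this is the checks-effects-interactions ordering that Solidity best practices recommend.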

Contracts may also end up being unfair due to logical design errors:

• Absence of required logic: for example, not guarding a call to selfdestruct with a check that only the owner of a contract is allowed to kill it.
• Incorrect logic: “there are many syntactically legal ways to achieve semantically unfair behaviour.” For example, the HackersGold contract had a bug where the transferFrom function used =+ in place of += and thus failed to increment a balance transferred to the recipient. (15 unique contracts in the data set had copies of the function with the same bug, and held over $35,000 worth of Ether between them).

• Logically correct but unfair designs: for example, an auction house contract that does not declare whether it is ‘with reserve,’ meaning that sellers can also bid or withdraw an item before it is sold. Unsuspecting bidders (with no expertise in examining the source code) could lose money due to artificially increased bids or forfeit their participation fee. (My personal view: when the code is the contract, and there is no other recourse, you’d really better be able to read the code if you want to participate). “This contract… indicates the subtleties involved in multi-party interactions, where fairness is subjective.”

Finally, miners also have some control over the environment in which contracts execute. Smart contract designers need to be aware of:

• Block state dependencies: many block state variables are determined from the block header, and are thus vulnerable to tampering by the block miner.

• Transaction order dependencies: miners can reorder transactions and hence potentially influence their outcome.

### A high-level description of Zeus

Of Zeus itself, we get only a relatively high-level description.

…Zeus takes as input a smart contract and a policy (written in a specification language) against which the smart contract must be verified. It performs static analysis atop the smart contract code and inserts policy predicates as assert statements at correct program points. Zeus then leverages its source code translator to faithfully convert the smart contract embedded with policy assertions to LLVM bitcode. Finally, Zeus invokes its verifier to determine assertion violations, which are indicative of policy violations.

Smart contracts are modelled using an abstract language. Programs are sequences of contract declarations, and each contract is viewed as a sequence of one or more method definitions, in addition to the declaration and initialisation of persistent storage private to the contract. Contracts are identified by an Id, and the invocation of a publicly visible method on a contract is viewed as a transaction. Execution semantics are given succinctly in a table in the paper. A source code translator translates Solidity code into the abstract contract language.

The policy language enables a user to specify fairness rules for a contract. These rules are specified as XACML-styled five-tuples with Subject, Object, Operation, Condition, and Result.

Our abstract language includes assertions for defining state reachability properties on the smart contract. Zeus leverages the policy tuple to extract: (a) the predicate (i.e., Condition) to be asserted, and (b) the correct control location for inserting the assert statements in the program source.

Given the policy-enhanced program, Zeus translates it into LLVM bitcode; the paper walks through a simple end-to-end example showing the various components. Finally, Zeus feeds the LLVM bitcode representation to an existing verification engine (Seahorn) that leverages constrained Horn clauses (CHCs) to quickly ascertain the safety of the smart contract. Zeus is not tied to Seahorn though: in theory it could also be used with other verifiers operating on LLVM bitcode, such as SMACK or DIVINE.

### Evaluating Solidity-based contracts

We periodically scraped Etherscan, Etherchain, and Ethercamp explorers over a period of three months and obtained source code for 22,493 contracts at distinct addresses. We discounted 45 contracts that had an assembly block in their source, and obtained 1524 unique contracts (as per their sha256 checksum). In the remainder of this section, we present results only for these unique contracts, unless otherwise specified. A breakdown of the collected contracts by category is given in the paper.

Zeus’s checks for all of the correctness and miner’s influence issues are run across all of these contracts. Results are manually validated to determine the set of false positives and negatives. Zeus is also compared against the results from Oyente (‘Making smart contracts smarter’). Here’s what Zeus finds:

• 21,281 out of 22,493 contracts (94.6%), containing more than $0.5 billion worth of Ether, are vulnerable to one or more bugs. Across the unique contracts, 1194 out of 1524 contracts were found to be vulnerable to one or more bugs.
• There were zero false negatives across all seven of the bug classes.
• Verification is fast, with only 44 out of 1524 contracts (2.89%) timing out in at least one bug class. The timeout threshold was set at one minute.

To evaluate fairness the authors had to write some policies. As an example, for the CrowdFundDao the authors implemented policies that (a) blacklisted developers cannot participate in the scheme, and (b) an investment must be more than a threshold limit. Zeus is able to determine that neither of these checks is encoded in the contract. As a general purpose policy, the team wrote a policy that selfdestruct should only be called by the contract owner. 284 of the 1524 contracts included a selfdestruct, with 5.6% of them violating the policy.
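The selfdestruct policy gives a feel for what Zeus checks. Here is a toy Python model of the idea (purely illustrative: Zeus actually inserts assert statements into the translated program and verifies them statically with Seahorn, rather than checking at runtime) — a policy predicate guarding a dangerous operation:

```python
class PolicyViolation(Exception):
    pass

def check(condition, rule):
    """Stand-in for an inserted policy assertion."""
    if not condition:
        raise PolicyViolation(rule)

class Contract:
    def __init__(self, owner):
        self.owner = owner
        self.alive = True

    def kill(self, caller):
        # Policy predicate inserted just before the dangerous operation,
        # mirroring "selfdestruct may only be invoked by the contract owner".
        check(caller == self.owner, "only the owner may selfdestruct")
        self.alive = False

c = Contract(owner="alice")
violated = False
try:
    c.kill(caller="mallory")   # non-owner attempt trips the policy check
except PolicyViolation:
    violated = True
```

A contract missing the ownership check corresponds to a program point where the inserted assertion can fail, which is exactly what the verifier reports as a policy violation.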