Skip to content

Training classifiers with natural language explanations

August 24, 2018

Training classifiers with natural language explanations Hancock et al., ACL’18

We looked at Snorkel earlier this week, which demonstrates that maybe AI isn’t going to take over all of our programming jobs. Instead, we’ll be writing labelling functions to feed the machine! Perhaps we could call this task label engineering. To me, it feels a bit like programming a quick-and-dirty expert system, where the downstream generative model deals with all the inaccuracies and inconsistencies so that we don’t have to be perfect, just useful. Given the success of the approach, a natural question to ask is how we can enable end users to more easily create useful labelling functions in their domain. This is where BabbleLabble comes in!

In this work, we propose BabbleLabble, a framework for training classifiers in which an annotator provides a natural language explanation for each labeling decision. A semantic parser converts these explanations into programmatic labeling functions that generate noisy labels for an arbitrary amount of unlabeled data, which is used to train a classifier.

So much for those programming jobs ! 😉

Working with BabbleLabble, it takes users about twice as long per example to provide a label plus an explanation as it does just to label an example. (The user has to do the work of figuring out the rationale in both cases). By automatically distilling a set of labelling functions from the explanations though (we’ll get to how that bit works shortly), the end result is that on three tasks users were able to train classifiers with comparable F1 scores from 5-100x faster.

If you combine BabbleLabble with Snorkel and the advances in machine learning models, you can start to see a new end-to-end approach for the rapid creation of intelligent systems. It starts with domain knowledge capture systems, of which simply labelling an example by hand is the simplest example. Snorkel and BabbleLabble show us that we can do so much more here to capture and exploit available domain knowledge. Domain knowledge capture systems assist in the task of label engineering, and their output is a set of labelling functions. A probabilistic label generator intelligently combines the labelling functions to create a set of probabilistic labels (data programming), and finally we bolt on the end the supervised machine learning model of our choice and train it using the generated labels.

What ties all this together is that via the wonders of data programming, the labelling functions don’t have to be perfect. It’s a much lower bar to write/generate a useful labelling function than it is to create one with very high accuracy and coverage.

Don’t lose the rationale

When we ask a human annotator to provide a label for an example, we up with limited information (one bit, in the case of binary classification).

This invites the question: how can we get more information per example, given that the annotator has already spent the effort reading and understanding an example? … In this work, we tap into the power of natural language and allow annotators to provide supervision to a classifier via natural language explanations.

For example:

The captured explanations are turned into candidate labeling functions, which go through a filtering process to produce a final set of labeling functions used to drive Snorkel.


(Enlarge)

What’s very neat is that the semantic parser used to turn explanations into candidate labeling functions is generic – i.e., it the same parser can be used across different tasks without any task-specific training.

One of our major findings is that in our setting, even a simple rule-based parser suffices…

This works because the filtering stage removes inconsistent and unhelpful (e.g., assigning the same label to every example) labelling functions, and Snorkel’s data programming copes with inconsistencies and inaccuracies in those that remain. Another very interesting finding is that labelling functions that are near to the perfect logical capture of the explanation often perform better than a perfect capture. In the evaluation, this is explained thus:

In other words, when users provide explanations, the signals they describe provide good starting points, but they are actually unlikely to be optimal.

Generating labeling functions

The first step in the process is the use of a semantic parser to parse explanations and generate candidate labelling functions. Explanations follow a loose grammar (see example predicates in appendix A of the paper), and the parser iterates over input subspans looking for rule matches:

We emphasize that the goal of this semantic parser is not to generate the single correct parse, but rather to have coverage over many potentially useful LFs.

Users of the system are provided with minimal examples of all of the predicates in the grammar in use. The construction of the grammar was guided by analysing 500 free-form user-provided explanations on a spouse task. The resulting predicates are summarised in the table below. There are 200 rule templates in total.

Distilling labeling functions

The set of candidate labelling functions (LFs) produced by parsing an explanation are then filtered. A semantic filter checks that a candidate LF is actually consistent with its own example.

For example, given an explanation “True, because the words ‘his wife’ are right before person 2”, there are two possible interpretations of ‘right’. It could be ‘right before’ or ‘to the right of’. A candidate LF generated from the latter interpretation can be excluded when the example in question is “Tom Brady and his wife Gisele Bundchen were spotted in New York City on Monday amid rumors…”. (Because the words ‘his wife’ are actually to the left of person 2).

A set of pragmatic filters remove LFs that are constant (label every example the same), redundant (labels the training set identically to some other LF), or correlated.

Finally, out of all LFs from the same explanation that pass all the other filters, we keep only the most specific (lowest coverage) LF. This prevents multiple correlated LFs from a single example dominating.

Over the three tasks in the evaluation, the filter bank removes over 95% of incorrect parses, and the incorrect ones that remain have average end-task accuracy within 2.5 points of the corresponding correct parses.

Experimental results

We already know from Snorkel what happens downstream of the labelling function selection process, so let’s jump straight to the evaluation section of the paper. BabbleLabble is used on three tasks, Spouse, Disease, and Protein.

In the spouse task annotators see a sentence with two highlighted names and are asked to label whether the two people are spouses (sentences taken from the Signal Media news articles dataset).

In the disease task annotators are shown a sentence with highlighted names of a chemical and a disease, and asked whether the sentences suggests the chemical causes the disease. Because the necessary domain expertise was not readily available, explanation sentences in this case were created by having someone unfamiliar with BabbleLabble translate from the labelling functions generated as part of the Snorkel publication evaluation.

The protein task was completed in conjunction with a neuroscience company called OccamzRazor. Annotators were shown a sentence from biomedical literature with highlighted names of a protein and kinase, and asked to label whether or not the kinase influences the protein in terms of a physical interaction or phosphorylation.

Example sentences and explanations from each of the three tasks are shown in the figure below.


(Enlarge)

On an F1 score basis, the a classifier trained with BabbleLabble achieves the same accuracy as a classifier trained with just end labels, using between 5x to 100x fewer examples. On the spouse task, 30 explanations are worth somewhere around 5,000 labels, on the disease task 30 explanations are worth around 1,500 labels, and on the protein task 30 explanations are worth around 175 labels.

Once the number of labeled examples is sufficiently large, traditional supervision once again dominates, since ground truth labels are preferable to noisy ones generated by labeling functions. However, in domains where there is much more unlabeled data available than labeled data (which in our experience is most domains), we can gain in supervision efficiency from using BabbleLabble.

The team compared the accuracy of BabbleLabble without the LF filtering process (i.e., using all candidate LFs), BabbleLabble a as is, and BabbleLabble using ‘perfect’ parsers generated by hand to accurately translate the explanation sentences. The results show that the simple rule-based semantic parser and the perfect parser have nearly identical average F1 scores.

Incorrect LFs still often provide useful signal.


(Enlarge)

Learn more

Snorkel: rapid training data creation with weak supervision

August 22, 2018

Snorkel: rapid training data creation with weak supervision Ratner et al., VLDB’18

Earlier this week we looked at Sparser, which comes from the Stanford Dawn project, “a five-year research project to democratize AI by making it dramatically easier to build AI-powered applications.” Today’s paper choice, Snorkel, is from the same stable. It tackles one of central questions in supervised machine learning: how do you get a large enough set of training data to power modern deep models?

…deep learning has a major upfront cost: these methods need massive training sets of labeled examples to learn from – often tens of thousands to millions to reach peak predictive performance. Such training sets are enormously expensive to create…

Snorkel lets you throw everything you’ve got at the problem. Heuristics, external knowledge bases, crowd-sourced workers, you name it. These are known as weak supervision sources because they may be limited in accuracy and coverage. All of these get combined in a principled manner to produce a set of probability-weighted labels. The authors call this process ‘data programming’. The end model is then trained on the generated labels.

Snorkel is the first system to implement our recent work on data programming… While programming weak supervision seems superficially similar to feature engineering, we observe that users approach the two processes very differently. Our vision – weak supervision as the sole port of interaction for machine learning – implies radically different workflows…

The big picture

There are three main stages in the Snorkel workflow:

  1. Instead of hand-labelling large quantities of training data, users write labelling functions which capture patterns and heuristics, connect with external knowledge bases (distant supervision), and so on. A labelling function is a Python method which given an input can either output a label or abstain. Snorkel also includes a number of declarative labelling functions that can be used out of the box.
  2. Snorkel learns a generative model over all of the labelling functions, so that it can estimated their accuracies and correlations. “This step uses no ground-truth data, learning instead from the agreements and disagreements of the labeling functions.”
  3. Snorkel outputs a set of probabilistic labels which can then be used to train a wide variety of machine learning models.


(Enlarge)

While the generative model is essentially a re-weighted combination of the user-provided labeling functions – which tend to be precise but low coverage – modern discriminative models can retain this precision while learning to generalize beyond the labelling functions, increasing coverage and robustness on unseen data.

Labelling functions

Say we’re interested in a binary classifier text-relation extraction task, in which a (chemical, disease) input tuple maps to true iff the chemical causes the disease. Snorkel breaks input documents (PubMed abstracts) down into a context hierarchy made up of context types. The set of context types that make sense will be data dependent. Here we might extract documents, sentences, spans, and entities. Tuples of relevant entities are then passed to labelling functions as candidates.

Writing in Python, a labelling function encoding the heuristic that the word ‘causes’ in-between a chemical and a disease indicates a causal relationship would look like this:

For simple cases, there are built-in declarative labelling functions. In this case, we could have used a pattern-based function instead of writing our own:

Labeling function generators create multiple labelling functions from a single resource. We could use the Comparative Toxicogenomics Database as a distant supervision source for example, and label candidates as ‘true’ if they appear in the “Causes” subset, and ‘false’ if they appear in the “Treats” subset.

One neat example in the evaluation is using a set of crowdworkers to crowdsource annotations, and then representing each crowdworker as a distinct labelling function. Snorkel will automatically learn to adapt to the different skill levels and accuracy of the workers.

The generative model

Once we have a collection of labelling functions, an obvious thing to do would be to ask each function to label a candidate and use majority voting to determine the resulting label. In fact, in situations where we don’t have many votes on an input (e.g., most of the labelling functions abstain), and in situations where we have lots of votes, then majority voting works really well. But in-between these two extremes, taking a weighted vote based on modelling labelling function accuracy works better.

Snorkel uses a heuristic based on the ratio of positive to negative labels for each data point to decide whether to use majority voting or to build a generative model of function accuracy in order to perform weighted voting.

Essentially, we are taking the expected counts of instances in which a weighted majority vote could possibly flip the incorrect predictions of unweighted majority vote under best case conditions, which is an upper bound for the expected advantage.

When a generative model is called for it is built as a factor graph, applying all labelling functions to the unlabelled data points and capturing the labelling propensity, accuracy, and pairwise correlations of the functions. The details of learning the model are given in an earlier paper, ‘Learning the structure of generative models without labeled data.’

Dealing with correlated labels

Often the provided labelling functions are not independent. For example functions could be simple variations of each other, or they could depend on a common source of distant supervision. If we don’t account for the dependencies between labelling functions, we can get into all sorts of trouble:

Getting users to somehow indicate dependencies by hand is difficult and error-prone.

We therefore turn to our method for automatically selecting which dependencies to model without access to ground truth (See ‘Learning the structure of generative models without labeled data.’ It uses a pseudo-likelihood estimator, which does not require any sampling or other approximations to compute the objective gradient exactly. It is much faster than maximum likelihood estimation, taking 15 seconds to select pairwise correlations to be modeled among 100 labeling functions with 10,000 data points.

The estimator does rely on a hyperparameter \epsilon though, which trades-off between predictive performance and computational cost. With large values of \epsilon no correlations are included and as we reduce the value progressively more correlations are added, starting with the strongest. The following plots show examples of the numbers of correlations added for different values of the correlation threshold \epsilon in three different tasks.

Generally, the number of correlations grows slowly at first, then hits an “elbow point” beyond which the number explodes… setting \epsilon to this elbow point is a safe tradeoff between predictive performance and computational cost.

Snorkel in action

Snorkel is evaluated across six different applications and in a user study to determine how quickly subject-matter experts could learn to write labelling functions.

In the user study, participants were given 4.5 hours of instruction on how to use and evaluate models developed using Snorkel, and then had 2.5 hours to write labelling functions, for a total time invested of 7 hours. (Workshop materials are available at https://github.com/HazyResearch/snorkel/tree/master/tutorials/workshop). Models trained with Snorkel using these labelling functions are compared to models based on seven hours of hand-labelling.

Our key finding is that labeling functions written in Snorkel, even by SME users, can match or exceed a traditional hand-labeling approach. The majority (8) of subjects matched or outperformed these hand-labeled data models. The average Snorkel user’s score was 30.4 F1, and the average hand-supervision score was 20.9 F1.

(It’s a slim majority though – there were 15 participants in the study!).

Outside of the user study, the six tasks used by the authors to evaluate Snorkel are summarised in the table below:

The findings show that Snorkel outperforms distant supervision baselines (by an average of 132%), and approaches hand supervision levels of accuracy with tens of labelling functions — even when the hand-labeled data has taken weeks or months to assemble. The authors also show that the discriminative models generalises beyond the heuristics encoded in the labeling functions.

Snorkel’s deployments in industry, research labs, and government agencies show that it has real-world impact, offering developers and improved way to build models.

You can read more about Snorkel, including links to the code, on the Snorkel project page.

Filter before you parse: faster analytics on raw data with Sparser

August 20, 2018
tags:

Filter before you parse: faster analytics on raw data with Sparser Palkar et al., VLDB’18

We’ve been parsing JSON for over 15 years. So it’s surprising and wonderful that with a fresh look at the problem the authors of this paper have been able to deliver an order-of-magnitude speed-up with Sparser in about 4Kloc.

The classic approach to JSON parsing is to use a state-machine based parsing algorithm. This is the approach used by e.g. RapidJSON. Such algorithms are sequential and can’t easily exploit the SIMD capabilities of modern CPUs. State of the art JSON parsers such as Mison are designed to match the capabilities of modern hardware. Mison uses SIMD instructions to find special characters such as brackets and colons and build a structural index over a raw json string.

… we found that Mison can parse highly nested in-memory data at over 2GMB/s per core, over 5x faster than RapidJSON, the fastest traditional state-machine based parser available.

How can we parse JSON even faster? The key lies in re-framing the question. The fastest way to parse a JSON file is not to parse it at all. Zero ms is a hard lower bound ;). In other words, if you can quickly determine that a JSON file (or Avro, or Parquet, …) can’t possible contain what you’re looking for then you can avoid parsing it in the first place. That’s similar to the way that we might use a bloom filter to rule out the presence of a certain key or value in a file (guaranteeing no false-negatives, though we might have false positives). Sparser is intended for use in situations where we are interacting directly with raw unstructured or semi-structured data though where a pre-computing index or similar data structure either isn’t available or is too expensive to compute given the anticipated access frequency.

In such a context we’re going to need a fast online test with no false negatives. Comparing state-of-the-art parsers to the raw hardware capabilities suggests there’s some headroom to work with:

Even with these new techniques, however, we still observe a large memory-compute performance gap: a single core can scan a raw bytestream of JSON data 10x faster than Mison parses it. Perhaps surprisingly, similar gaps can occur even when parsing binary formats that require byte-level processing, such as Avro and Parquet.

Imagine I have a big file containing tweet data. I want to find all tweets mentioning the hashtag ‘#themorningpaper’. Instead of feeding the file straight into a JSON parser, I could just do a simple grep first. If grep finds nothing, we don’t need to parse as there won’t be any matches. Sparser doesn’t work exactly like this, but it’s pretty close! In place of grep, it uses a collection of raw filters, designed with mechanical sympathy in mind to make them really efficient on modern hardware. A cost optimiser figures out the best performing combination of filters for a given query predicate and data set. When scanning through really big files, the cost optimiser is re-run whenever parsing throughput drops by some threshold (20% in the implementation).

Sparser can make a big difference to execution times. Across a variety of workloads in the evaluation, Sparser achieved up to a 22x speed-up compared to Mison. This is a big deal, because serialising (and de-serialising) is a significant contributor to overall execution times in big data analytic workloads. So much so, that when integrated into a Spark system, end-to-end application performance improved by up to 9x.

Efficient raw filters

Raw filters (RF) operate over raw bytestreams and can produce false positives but no false negatives. They are designed to be SIMD efficient. There are two raw filter types: substring search and key-value search.

Say we have a predicate like this: name = "Athena" AND text = "My submission to VLDB". Substring search just looks for records that contain a substring sequence from the target values. For efficiency reasons it considers 2, 4, and 8-byte wide strings. Sticking with 4-byte substrings, we have several we could potentially use for matching, e.g. ‘Athe’, ‘ubmi’ or ‘VLDB’. Using VLDB as an example, the string is repeated eight times in a 32-byte vector register. We need 4 one-byte shifts to cover all possible matching positions in an input sequence:

Note that the example in the figure above is actually a false positive for the input predicate. That’s ok. We don’t mind a few of those getting through.

The main advantage of the substring search RF is that, for a well-chosen sequence, the operator can reject inputs at nearly the speed of streaming data through a CPU core.

Key-value search looks for all co-occurrences of a key and a corresponding value within a record. The operator takes three parameters: a key, a value, and a set of one-byte delimiters (e.g. ,). After finding an occurrence of the key, the operator searches for the value and stops searching at the first occurring delimiter character. The key, value, and stopping point can all be searched for using the packed vector technique we looked at for substrings.

Whereas substring searches support both equality and LIKE, key-value filters do not support LIKE. This prevents false negatives from getting through.

Optimising filter cascades

Sticking with the example predicate name = "Athena" AND text = "My submission to VLDB", there are multiple raw filters we could consider, and multiple ways to order those filters. For example, if “VLDB” is highly selective it might be good to run a substring filter on VLDB first, and then feed the results into a key-value filter looking for name = "Athena". But if ‘VLDB’ occurs frequently in the dataset, we might be better off doing the key-value filtering first, and the substring search second. Or maybe we should try alternative substring searches in combination or instead, e.g. ‘submissi’. The optimum arrangement of filters in an RF cascade depends on the underlying data, the performance cost of running the individual raw filters, and their selectivity. We also have to contend with predicates such as(name = "Athena" AND text = "Greetings") OR name = "Jupiter", which are converted into DNF form before processing.

The first stage in the process is to compile a set of candidate RFs to consider based on clauses in the input query. Each simple predicate component of a predicate in DNF form is turned into substring and key-value RFs as appropriate. A substring RF is produced for each 4- and 8-byte substring of each token in the predicate expression, plus one searching for the token in its entirety . Key-value RFs will be generated for JSON, but for formats such as Avro and Parquet where the key name is unlikely to be present in the binary stream these are skipped. For the simple predicate name = "Athena" we end up with e.g.:

  • Athe
  • then
  • hena
  • Athena
  • key = name, value = Athena, delimeters = ,

Since these can only produce false positives, if any of these RFs fails, the record can’t match. For conjunctive clauses, we can simply take the union of all the simple predicate RFs in the clause. If any of them fail, the record can’t match. For disjunctions (DNF is the disjunction of conjunctions) then we require that an RF from each conjunction must fail in order to prevent false negatives.

Now Sparser draws a sample of records from the input and executes (independently) all of the RFs generated in the first step. It stores the passthrough rates of each RF in a compact matrix structure as well as recording the runtime costs of each RF and the runtime cost for the full parser.

After sampling, the optimizer has a populated matrix representing the records in the sample that passed for each RF, the average running time of each RF, and the average running time of the full parser.

Next up to 32 possible candidate RF cascades are generated. A cascade is a binary tree where non-leaf nodes are RFs and leaf nodes are decisions (parse or discard). Sparser generates trees up to depth D = 4. If there are more than 32 possible trees, then 32 are selected at random by picking a random RF generated from each token in round-robin fashion.

Now Sparser estimates the costs of the candidate cascades using the matrix it populated during the sampling step. Since the matrix stores a results of each pass/fail for an RF as single bit in the matrix, the passthrough rate of RF i is simply the number of 1’s in the ith row of the matrix. The joint passthrough rate of any two RFs is the bitwise and of their respective rows.

The key advantage to this approach is that these bitwise operations have SIMD support in modern hardware and complete in 1-3 cycles on 256-bit values on modern CPUs (roughly 1ns on a 3GHz processor).

Using this bit-matrix technique, the optimiser adds at most 1.2% overhead in the benchmark queries, including the time for sampling and scoring.

Periodic resampling

Sparser periodically recalibrates the cascade to account for data skew or sorting in the underlying input file. Consider an RF that filters by date and an input file sorted by date – it will either be highly selective or not selective at all depending on the portion of the file currently being processed.

Sparser maintains an exponentially weighted moving average of its own parsing throughput. In our implementation, we update this average on every 100MB block of input data. If the average throughput deviates significantly (e.g. 20% in our implementation), Sparser reruns its optimizer to select a new RF cascade.

Experimental results

Sparser is implemented in roughly 4000 lines of C, and supports mapping query predicates for text logs, JSON, Avro, Parquet, and PCAP. The team also integrated Sparser with Spark using the Data Sources API. Sparser is evaluated across a variety of workloads, datasets, and data formats.

Here you can see the end-to-end improvements when processing 68GB of JSON tweets using Spark:

Avro and Parquet formats get a big boost too:

I’m short on space to cover the evaluation in detail, but here are the highlights:

  • With raw filtering, Sparser improves on state-of-the-art JSON parsers by up to 22x. For distributed workloads it improves the end-to-end time by up to 9x.
  • Parsing of binary formats such as Avro and Parquet are accelerated by up to 5x. For queries over unstructured text logs, Sparser reduces the runtime by up to 4x.
  • Sparser selects RF cascades that are within 10% of the global optimum while incurring only a 1.2% runtime overhead.

In a periodic resampling just using a date based predicate, the resampling and re-optimisation process improved throughput by 25x compared to a sticking with the initially selected RF cascade for the whole job.

See the blog post from the authors and a link to the code here.

Fairness without demographics in repeated loss minimization

August 17, 2018

Fairness without demographics in repeated loss minimization Hashimoto et al., ICML’18

When we train machine learning models and optimise for average loss it is possible to obtain systems with very high overall accuracy, but which perform poorly on under-represented subsets of the input space. For example, a speech recognition system that performs poorly with minority accents.

We refer to this phenomenon of high overall accuracy but low minority accuracy as a representation disparity… This representation disparity forms our definition of unfairness, and has been observed in face recognition, language identification, dependency parsing, part-of-speech tagging, academic recommender systems, and automatic video captioning.

For systems that are continually trained and evolved based on data collected from their users, the poor performance for a minority group can set in place a vicious cycle in which members of such a group use the system less (because it doesn’t work as well for them), causing them to provide less data and hence to be further under-represented in the training set…

… this problem of disparity amplification is a possibility in any machine learning system that is retrained on user data.

An interesting twist in the problem is that the authors assume neither the number of groups (K), nor group membership, is known. An online speech recognition system for example, is unlikely to collect demographic information on its users. We simply assume that there will be groups, and then bound the worst-case risk for any group.

Our work can be seen as a direct instantiation of John Rawl’s theory on distributive justice and stability, where we view predictive accuracy as a resource to be allocated. Rawls argues that the difference principle, defined as maximizing the welfare of the worst-off group, is fair and stable over time since it ensures that minorities consent to and attempt to maintain the status quo.

After showing that representation disparity and disparity amplification are serious issues with the current status quo, the authors go on to introduce a method based on distributionally robust optimization (DRO) which can can control the worst-case risk.

We demonstrate that DRO provides a upper bound on the risk incurred by minority groups and performs well in practice. Our proposed algorithm is straightforward to implement, and induces distributional robustness, which can be viewed as a benefit in and of itself.

Representation disparity and disparity amplification

We have a population with K latent groups. Each group k \in K is a proportion \alpha_k of the overall population, and has a distribution P_k. We have some trained model with parameters \theta used to make predictions based on an input Z arising from one of the latent groups. When a user makes a query, they incur some expected loss, which is termed the risk R(\theta) in this paper.

Representation disparity occurs when there is a meaningful gap between the overall risk R(\theta) and the maximum (worst-case) group risk R_{k}(\theta). I.e., good overall performance, but poor performance (higher expected loss) for the worst-case group in K. In the evaluation example, the task is text prediction for autocomplete and the training data set contains a mix of African-American English (AAE) and Standard-American English (SAE) tweets. If the AAE data is in the minority, we would expect worse than average model performance when autocompleting African-American English. The measure of the representation disparity is given by R_{AAE} - R_{AAE+SAE}.

Disparity amplification is the process by which unbalanced group risk gets worse over time.

Consider a system which learns from its users, with periodic re-training. Say there are T rounds of such training. Within each round we might gain some new users and lose some existing users (aka churn). The authors make the reasonable assumption that user retention within a group is a strictly decreasing function of risk level (i.e., the worse the system performs for a group, the fewer users of that group are retained). The decrease in user retention for a minority group as compared to the average causes the group size to shrink, meaning that in the next round it will receive even higher losses due to having fewer samples from the group in the training data.

To keep a handle on the representation disparity, we want to control (minimize) the worst-case group risk over all time periods t in 1..T.

Roughly speaking, minimizing the worst-case risk should mitigate disparity amplification as long as lower losses sead to higher user retention.

For a system to be fair over the long term there needs to be a stable and fair fixed-point for the expected user counts \lambda^{(t)} over time. That is, risk minimisation needs to maintain the same population fraction over time. It’s easy to violate this condition when minimising overall risk (the standard empirical risk minimization, ERM, approach), even when we start out with two groups that have identical population size and initial risk.

In the following example we have two initially equal groups (one on the left, one on the right), and a classifier tasked with classifying points as red or black. Start out with a classifier with boundary x_2 = 0 (optimal for this data set). Now let optimisation play out over t = 500 rounds. At some point in the earlier rounds, due to random fluctuations in sampling accuracy, we took slightly fewer samples from the cluster on the right, meaning that ERM improves the loss for the left cluster at the expense of the right-hand one (see e.g. the t = 100 line). This increases churn in the right-hand cluster, reducing its representation in future rounds. Once the scales have tipped even slightly like this the disparity is amplified until at t = 500 there are nearly no samples from the right-hand cluster and it ends up suffering high loss.

Whenever decreasing the risk for one group increases the risk for others sufficiently the fixed point will be unstable and the model will converge to a different, possibly unfair, fixed point.

Distributionally robust optimization

Clearly we might not want to use ERM if we are concerned about representation disparity and its amplification. But what should we use instead? We need something that minimises the worst-case risk over time. If we can control the worst-case risk in one time step, then by an induction-like argument you can see that we can control it over all time steps.

The fundamental difficulty in controlling the worst-case risk over a single time step comes from not observing the group memberships from which the data was sampled… To achieve reasonable performance across different groups, we postulate a formulation that protects against all directions around the data generating distribution.

Distributionally robust optimization, DRO, has roots going back to 1958 (Ye, ‘Distributionally robust stochastic and online optimization’ ) and deals with optimisation in situations where you are not sure of (or an adversary controls, within some bounds), the true underlying distribution.

In this case, what the authors do is consider all possible distributions Q within a distance r of the overall population distribution P, where r is the chi-squared divergence measure between the two distributions. The result is a ‘chi-squared ball’. Let R_{dro}(\theta;r) be the worst-case loss over all the r-perturbations around P. The authors show that this bounds the risk of the group R_{k}(\theta) and hence the overall worst-case risk.

If the smallest minority group proportion we want to consider is \alpha_{min}, then we can control the worst-case group risk by minimizing the upper bound of R_{dro}(\theta;r_{max}) with r_{max} := (1/\alpha_{min} -1)^2. See §4.1 in the paper for details. Under the covers what happens is that small levels of loss get ignored, and large losses above a threshold are upweighted due to a square term. My statistics knowledge isn’t good enough to take you any deeper here!

Fortunately, if you just want to put this into practice without diving into all the underlying theory, you just need to need to use the following optimisation objective:

Where \eta is a hyperparameter.

Evaluation

Recall the experiment with two equal left and right groups and a red/black classifier that we looked at earlier. This was unstable under ERM, with a minority group emerging and classification accuracy for that minority group getting worse over time. If we run the same experiment with DRO, the disparity amplification effect disappears.

The authors also complete a real-world human evaluation of user retention and satisfaction on a text autocomplete task (with a mix of African-American Engish and Standard-American English training examples as described earlier).

For both ERM and DRO, we train a set of five maximum likelihood bigram language models on a corpus with 366,361 tweets total and a f \in {0.1, 0.4, 0.5, 0.6, 0.9} fraction of the tweets labelled as AAE. This results in 10 possible autocomplete systems a given Mechanical Turk user can be assigned to during a task.

The results show an improvement in both minority satisfaction and retention rate with DRO, with only slightly decreasing scores for the SAE group.


(Enlarge)

The code to generate results is available on CodaLab. (CodaLab itself looks pretty interesting btw. – ‘a collaborative platform for reproducible research’).

Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples

August 15, 2018

Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples Athalye et al., ICML’18

There has been a lot of back and forth in the research community on adversarial attacks and defences in machine learning. Today’s paper examines a number of recently proposed defences and shows that most of them rely on forms of gradient masking. The authors develop attack techniques to overcome such defences, and 9 analyse defences from ICLR 2018 claiming to protect against white-box attacks. 7 of these turn out to rely on obfuscated gradients, and 6 of these fall to the new attacks (and the other one partially succumbs). Athalye et al. won a best paper award at ICML’18 for this work.

One of the great things about work on adversarial attacks and defences, as we’ve looked at before, is that they illuminate the strengths and weaknesses of current technology. Depending on the threat model you choose, for my own part I’m currently of the opinion that we’re unlikely to find a robust adversarial defence without a more radical re-think of how we’re doing image classification. If we’re talking about the task of ‘find an image that doesn’t fool a human, but does fool a neural network’ then I think there will always be gaps to exploit until we incorporate more human-like (e.g., higher level semantic understanding) reasoning into our classifiers. Not that the human system itself can’t be fooled of course (e.g., optical illusions).

Breaking gradient descent

A defense is said to cause gradient masking if it “does not have useful gradients” for generating adversarial examples; gradient masking is known to be an incomplete defense to adversarial examples. Despite this, we observe that 7 of the ICLR 2018 defenses rely on this effect.

Some defenses break gradient descent deliberately, others may do it unintentionally. Here are five clues that something isn’t right:

  1. One-step attacks perform better than iterative attacks. Iterative attacks are strictly stronger, so this shouldn’t be the case. If single-step methods are working better, it’s a sign the iterative attack is becoming stuck at a local minimum.
  2. Black-box attacks work better than white-box attacks. The black-box threat model is a strict subset of white-box attacks, so white-box attacks should perform better. When a defense obfuscates gradients, then black-box attacks (which don’t use it) often perform better. (Since practical black-box attacks are possible and we can also find e.g. universal adversarial perturbations the utility of a defense that excludes such attack modes seems rather limited anyway to me).
  3. Unbounded attacks do not reach 100% success. With unbounded distortion, any classifier should eventually fail. An attack that doesn’t achieve this should be improved (i.e., it’s a weak attack, not necessarily a strong defense).
  4. Random sampling finds adversarial examples (where a gradient-based attack does not).
  5. Increasing the distortion bound does not increase success. We expect a monotonically increasing attack success rate with increasing distortion bound.

Three ways that a defense might break gradient descent are shattering, stochastic gradients, and exploding & vanishing gradients:

  • Shattered gradients are caused when a defense is “non-differentiable, introduces number instability, or otherwise causes a gradient to be nonexistent or incorrect.”
  • Stochastic gradients are a result of randomization – either introduced in the network itself, or in the inputs being fed to the network
  • Exploding and vanishing gradients are often caused by defenses that consist of multiple iterations of neural network evaluation, feeding the output of one computation into the next.

Overcoming masked gradients

Shattered gradients can be overcome using a technique the authors call ‘Backward Pass Differentiable Approximation’ (BPDA). Think of a secured neural network (i.e., one that has been hardened against adversarial attacks) as being a composition of some hardening function g(\dot) and a regular pre-trained classifier f(\dot) such that the hardened classifier \hat{f}(x) = f(g(x)). For example, g might be a denoising function designed to remove perturbations introduced by an adversary. If g itself is smooth and differentiable we can often compute gradients through the combined network \hat{f}. Even if it isn’t differentiable, we know that by construction in a local area it is close to the identify function (i.e. g(x) \approx x ). So we conduct an attack by performing forward propagation through the neural network as usual, but replacing g(\dot) with the identity function (or even simpler, just the evaluation of g(x) ) during backpropagation. The full (general) version of BPDA operates by finding and replacing any non-differentiable layer f^i(\dot) with a differentiable approximation g(x) such that g(x) \approx f^i(x). As before, we use the vanilla network on the forward pass, and substitute g on the backward pass.

To attack a network that relies on stochastic gradients, it is necessary to estimate the gradient of the stochastic function. ‘Expectation over Transformation’ (EOT) can be used to compute the gradient over an expected transformation to an input. Given a transformation function t(\dot) EOT optimises the expectation over f(t(x)). “The optimization problem can be solved through gradient descent… differentiating through the classifier and transformation, and approximating the expectation with samples at each gradient descent step.”

Vanishing or exploding gradients can be addressed through reparameterization. Suppose differentiating f(g(x)) leads to a problem. We can introduce a new differentiable function h(\dot) such that x = h(x) and g(h(z)) = h(z) for all z. We can now compute gradients through f(h(z)) and circumvent the defense. (§4.3).

How the ICLR 2018 defenses stack up

Section 5 of the paper works through nine defenses from ICRL 2018 that claim robustness in a white-box threat model. Seven of the nine turn out to rely on some form of obfuscated gradients. For these the authors use the techniques described above to successfully attack them.

Thermometer encoding is an example of a defense that relies on gradient shattering at its core. Image pixels are mapped to l-dimensional vectors such as ‘111110000’, where the number of leading 1’s indicates the temperature (colour value) of a pixel. A BPDA attack reduces model accuracy to 30% (worse than if thermometer encoding were not used at all!).

Examples of transformation based defenses include image cropping and rescaling, bit-depth reduction, JPEG compression, randomly dropping pixels and replacing them using total variance minimization, and image quilting (reconstructing images by replaces small patches with patches taken from ‘clean’ images).

It is possible to bypass each defense independently (and ensembles of defenses usually are not much stronger than the strongest sub-component)… with our attack, we achieve 100% targeted attack success rate and accuracy drops to 0% for the strongest defense under the smallest perturbation budget…

Image cropping and rescaling are circumvented using EOT. Bit depth reduction and JPEG compression are circumvented using BPDA and the identity function on the backward pass. Random pixel replacement and quilting are defeated using a combination of EOT and BPDA.

PixelDefend and Defense-GAN both use generators to project input images back onto a data manifold before feeding them into a classifier. Informally, this is intended to ‘purify’ them by avoiding low-probability regions in the data distribution.

PixelDefend can be defeated using BPDA, and DefenseGAN can also be evaded with BPDA, though only at a 45% success rate.

See section 5 in the paper for more details and discussion of the other defenses.

… we hope that future work will be able to avoid relying on obfuscated gradients (and other methods that only prevent gradient descent-based attacks) for perceived robustness, and use our evaluation approach to detect when this occurs.

Delayed impact of fair machine learning

August 13, 2018

Delayed impact of fair machine learning Liu et al., ICML’18

“Delayed impact of fair machine learning” won a best paper award at ICML this year. It’s not an easy read (at least it wasn’t for me), but fortunately it’s possible to appreciate the main results without following all of the proof details. The central question is how to ensure fair treatment across demographic groups in a population when using a score-based machine learning model to decide who gets an opportunity (e.g. is offered a loan) and who doesn’t. Most recently we looked at the equal opportunity and equalized odds models.

The underlying assumption of course for studied fairness models is that the fairness criteria promote the long-term well-being of those groups they aim to protect. The big result in this paper is that you can easily up end ‘killing them with kindness’ instead. The potential for this to happen exists when there is a feedback loop in place in the overall system. By overall system here, I mean the human system of which the machine learning model is just a small part. Using the loan/no-loan decision that is a popular study vehicle in fairness papers, we need to consider not just (for example) the opportunity that someone in a disadvantaged group has to qualify for a loan, but also what happens in the future as a result of that loan being made. If the borrower eventually defaults, then they will also see a decline in their credit score, which will make it harder for the borrower to obtain additional loans in the future. A successful lending event on the other hand may increase the credit score for the borrower.

Consequential decisions… reshape the population over time. Lending practices, for example, can shift the distribution of debt and wealth in the population. Job advertisements allocate opportunity. School admissions shape the level of education in a community.

If we want to manage outcomes in the presence of a feedback loop, my mind immediately jumps to control theory. Liu et al. don’t go all the way there in this paper, but they do introduce “a one-step feedback model of decision making that exposes how decisions change the underlying population over time.

We argue that without a careful model of delayed outcomes, we cannot foresee the impact a fairness criterion would have if enforced as a constraint on a classification system. However, if such an accurate outcome model is available, we show that there are more direct ways to optimize for positive outcomes than via existing fairness criteria.

Fairness criteria

Three main selection policies are used as the basis of the study, and to keep things simple we’ll consider a setting in which the population is divided into two groups, A and B. Group A is considered to be the disadvantaged group (“A” for Advantaged would have made a lot more sense!). An institution makes a binary decision for each individual in each group, called selection (e.g., should this individual be offered a loan).

  • In the unconstrained or max profit model a lender makes loans so long as they are profitable, without any other consideration.
  • In the demographic parity model the institution maximises its utility subject to the constraint that the institution selects from both groups at an equal rate.
  • In the equal opportunity model utility is maximised subject to the constraint that individuals with a successful outcome (e.g. repay their loan) are selected at an equal rate across both groups.

See ‘Equality of opportunity in supervised learning’ for a discussion of the equal opportunity model and why it is preferred to demographic parity.

The outcome curve

The authors study what happens to the average (mean) score of a group under different selection policies. A policy is said to cause active harm if it causes the average score to go down, stagnation if the average score does not change, and improvement if the average score goes up. It’s also useful to compare a policy to some baseline, for which we’ll use the maximum profit / utility model. Compared to this baseline, a policy does relative harm if the average score under the policy is less than it would be under the maximum utility model, and relative improvement if the average score under the policy is greater than it would be under the maximum utility model.

Given a policy, there is some threshold score above which members of a group will be selected. We can think of the threshold as determining a selection rate, the proportion of a group selected by the policy. As the selection rate varies, so does the impact on the average score over time. We can plot this in an outcome curve.

The outcome curve shows three regions of interest: relative improvement, relative harm, and active harm. If the selection rate for a group under the maximum utility model is \beta^{MaxUtil}, then reducing the selection rate below this does relative harm to the group – i.e., the average score goes down. Intuitively, there are less positive outcomes for the group to drive a positive feedback loop (e.g. people receiving loans and paying them back).

If we increase the selection rate above the maximum utility (from the perspective of the institution remember) point, then to start with the group sees a relative improvement in their average scores. This holds so long as the additionally selected individuals have more positive outcomes than negative. The selection rate at which relative improvement is maximised is denoted by \beta^{*}.

If we push the selection rate too high though, then negative outcomes (loan defaults) rise and the average score improves less. If this improvement falls below the maximum utility point, we’re back in a region of relative harm. The rate above which relative improvement switches back to relative harm is denoted by \bar{\beta}.

Push the selection rate even higher still (beyond \beta_0, and the impact of the additional negative outcomes can mean that not only does the group see relative harm, but also active harm, i.e., they are worse off in absolute terms.

Killing with kindness?

Under analysis, both demographic parity and equal opportunity fairness criteria can lead to all possible outcomes (improvement, stagnation, and decline). Interestingly, “under a mild assumption, the institution’s optimal unconstrained selected policy can never lead to decline“. So it’s therefore possible to have a fairness intervention with the unintended consequence of leaving the disadvantaged group worse off than they were before. There are also some settings where demographic parity causes decline, but equal opportunity does not. Furthermore, there are some settings where equal opportunity causes relative harm, but demographic parity does not.

If we assume that under each fairness criteria the institution otherwise maximises utility, then we can determine the selection rates under each policy from utility curves.

Whether the policies show relative harm, relative improvement, or active harm, depends on where the selection rates \beta^{EqOpt} and \beta^{DemParity} fall with respect to \bar{\beta} and \beta_0.

Outcome based fairness

…fairness criteria may actively harm disadvantaged groups. It is thus natural to consider a modified decision rule which involves the explicit maximization of [the average score of the disadvantaged group]… This objective shifts the focus to outcome modeling, highlighting the importance of domain specific knowledge.

For example, an institution may decide to set a selection rate for the disadvantaged group that maximises the improvement in their average score, subject to a constraint that the utility loss compared to the baseline is less than some threshold.

A FICO example

The authors look at the impact of fairness constraints on FICO scores (credit worthiness) using a sample of 301,536 scores from 2003. Individuals were labeled as defaulted if they failed to pay a debt for at least 90 days on at least one account in the ensuing 18-24 month period. With a loss/profit ratio of -10, no fairness criteria surpass the active harm rate. But at a loss/profit ratio of -4, demographic parity overloans and causes active harm.

This behavior stems from a discrepancy in the outcome and profit curves for each population. While incentives for the bank and positive results for individuals are somewhat aligned for the majority group, under fairness constraints, they are more heavily misaligned in the minority group, as seen in the graphs (left) above.

Bounding data races in space and time – part II

August 10, 2018

Bounding data races in space and time Dolan et al., PLDI’18

Yesterday we looked at the case for memory models supporting local data-race-freedom (local DRF). In today’s post we’ll push deeper into the paper and look at a memory model which does just that.

Consider a memory store S which maps locations to values. For non-atomic locations it’s not enough just to store a single memory value at the location. Instead non-atomic locations are mapped to histories H, where a history is a finite map from timestamps to values, and timestamps are totally ordered.

Every thread has a frontier, F, which is a map from non-atomic locations to timestamps.

Intuitively, each thread’s frontier records, for each nonatomic location, the latest write known to the thread. More recent writes may have occurred, but are not guaranteed to be visible.

For atomic locations, the memory store records a pair (F,x), with a single data value x rather than a history. F is a frontier, which will be merged with the frontiers of threads that operate on the location. This is expressed in the rules READ-AT and WRITE-AT in the operational semantics below. The READ-AT rule, for example, should be read as “when a thread with frontier F reads a memory location A with content (Frontier-A, x), then the memory location is unchanged and the new frontier of the thread is F union Frontier-A”.


(Enlarge)

The rules for reading and writing non-atomic locations are given by READ-NA and WRITE-NA. A thread can read an arbitrary element of the history of a non-atomic location, so long as the timestamp is greater than or equal to the timestamp in the thread’s frontier. A write to a non-atomic location by a thread must specify a timestamp later than that in the writing thread’s frontier.

Note a subtlety here: the timestamp need not be later than everything else in the history, merely later than any other write known to the writing thread.

The reduction rules for the small-step operational semantics operate on machine configurations consisting of a store S and a program P. A program is a finite set of threads, each with a frontier and an expression. Silent transitions are those that reduce expressions without involving memory operations. For transitions that involve memory (rule MEMORY), then the memory operation involved determines the thread’s new frontier and the new contents of the memory location.

A trace is a sequence of machine transitions starting from an initial state in which all locations have a timestamp 0. A trace is sequentially consistent if it contains no weak transitions. A weak transition is a machine step performing a memory operation at a non-atomic location where one of the following holds:

  • the thread does not read the latest write to the location (reads a value with timestamp greater than its frontier, but not the very latest)
  • the thread writes a value to the location with a timestamp greater than the thread’s frontier, but not greater than the largest timestamp in the location history (i.e., it is not the latest write to that location)

Two transitions conflict if they access the same atomic location and at least one is a write. If there is no happens-before relation between the two transitions then we have a data race. A happens-before relation can take one of two forms: either the transitions happen on the same thread, or the location involved is atomic and at least one of the transitions is a write.

If instead of considering the whole memory store we focus in on a subset of locations L, then a transition is L-sequential if it is not a weak transition, or if it is a weak transition but not on a location in L.

A machine is L-stable if all of the L-sequential transitions in its traces are data race free.

Intuitively, M is L-stable if there are no data races on L in progress when the program reaches state M. There may be data races before reaching M, and there may be data races after reaching M, but there are no data races between one operation before M and one operation afterwards.

From an L-stable machine M, the local DRF theorem tells us that we can reason locally over a set of transitions starting from M. If all of those transitions are L-sequential then the program is data race free through those transitions.

Mapping to hardware

To facilitate mapping the operational model defined above to hardware (x86 and ARMv8) the authors also define an axiomatic semantics in terms of read and write events. Events are generated during program execution and form an event graph G. Three binary relations are defined over events:

  • program order (po): E1 po E2 if E1 and E2 occur on the same thread and E1 is ordered before E2 on that thread.
  • reads from (rf): E2 rf E1 if E1 is a write event and E2 is a read event, and both events have the same location and value
  • coherance (co): E1 co E2 if E1 and E2 are write events to the same location

In addition, a from-reads (fr) relation is defined in terms of rf and co. E1 fr E2 if E1 reads a value which is later overwritten by E2.

The axiomatic semantics are mapped to the operational semantics to show that they define the same memory model.

We now show that our memory model can be compiled efficiently to the x86 and ARMv8 memory models, and define the optimisations that are valid for a compiler to do… In our memory model, not all of the po relation is relevant, which is an important property for both compiler and hardware optimisations. For instance, there is no constraint that two nonatomic reads must be done in program order.

There are some constraints on program (re-)order(ing) though:

  • operations must not be moved before prior atomic operations
  • operations must must be moved after subsequent atomic writes
  • prior reads must not be moved after subsequent writes
  • conflicting operations must not be reordered.

These constraints still permit a number of peephole optimisations including:

  • Sequentialisation: [P || Q] => [P ; Q] (which is surprisingly invalid under C++ and Java memory models)
  • Redundant load (elimination): [r1 = a; r2 = a;] => [r1 = a; r2 = r1]
  • Store forwarding: [a = x; r1 = a] => [a = x; r1 = x]
  • Dead store (elimination): [a = x; a = y] => [a = y]
  • Common subexpression elimination: [r1 = a*2; r2 = b; r3 = a*2] => [r1 = a*2; r3 = r1; r2 = b]
  • Loop-invariant code motion: [while(...) { a = b; r1 = c*c; ...}] => [r2 = c*c; while(...) { a = b; r1 = r2; ... }]
  • Dead store elimination through reordering: [a = 1; b = c; a = 2] => [b = c; a = 1; a = 2] => [b = c; a = 2]
  • Constant propagation via store forwarding: [a = 1; b = c; r = a] => [b = c; a = 1; r = a] => [b = c; a = 1; r = 1]

Furthermore, if a program fragment satisfies the local DRF property, then the compiler is free to apply any optimisations valid for sequential programs, not just the ones permitted by the memory model, to that program fragment.

Compilation to x86 looks like this (with details in section 7.3 of the paper):

And compilation to ARMv8 is as follows (details in section 7.3 of the paper):

In all cases compilation is sound: every behaviour that the hardware model allows of a compiled program is allowed by the software model of the original program.

What price do we pay for local DRF?

We focus on quantifying the cost of nonatomic accesses on ARM and POWER architectures, since nonatomics in our memory model are free on x86. We leave the evaluation of the performance of our atomics for future work. Our aim is to evaluate the suitability of the memory model to be the default one in Multicore OCaml, a parallel extension of OCaml with thread-local minor heaps and shared major heap.

A suite of OCaml benchmarks are run looking at the overheads of the two ARMv8 compilation schemes (one using branch-after-load, and one using fence-before-store). The comparison also includes a strong release/acquire model which is strictly stronger than both.

On average, branch-after-load is 2.5% slower than the (non-parallel) OCaml baseline, and fence-before-store is 0.6% slower.

Our memory model is simpler than prevalent mainstream models and enjoys strong reasoning principles even in the presence of data races, while remaining efficiently implementable even on relaxed architectures… In future work we plan to extend our currently spartan model with other types of atomics. In particular, release-acquire atomics would be a useful extension: they are strong enough to describe many parallel programming idioms, yet weak enough to be relatively cheaply implementable.