Boosted race trees for low energy classification, Tzimpragos et al., ASPLOS’19
We don’t talk about energy as often as we probably should on this blog, but it’s certainly true that our data centres and various IT systems consume an awful lot of it. So it’s interesting to see a paper using nano-Joules per prediction as an evaluation metric. The goal is to produce a low-energy hardware classifier for embedded applications doing local processing of sensor data. To get there, the authors question a whole bunch of received wisdom, beginning with this: do we really need to convert the analog sensor data into a digital signal?! Here’s another fun one: what if instead of being something you worked hard to avoid, you had to build your whole application based on the outcomes of data races??!
Typically, a sensor gathers analog information from the physical world and then converts it into a conventional digital signal… While this binary-represented integer is perfectly efficient for storage as bits in memory and for typical general purpose computing operations, it is unclear that this is the most efficient for our target application… One such possible representation is pure analog signalling.
Of course analog signalling comes with a whole bunch of challenges of its own, which is one of the reasons we tend to convert to digital. But here the authors seem to have found a perfect marriage of a class of logic called race logic, a natural encoding for sensor data, and the classification use case.
We argue that race logic fits nicely with temporally-coded sensing systems, such as 3D depth perception systems and dynamic vision sensors, where the time that a signal gets “excited” to a logic-level “1” depends on the magnitude of the sensor reading. Furthermore, we demonstrate that the structure of race logic makes it a natural encoding for decision tree-based approaches.
The resulting system can integrate seamlessly into a scikit-learn based development process, and dramatically reduces the total energy usage required for classification with very low latency.
Introducing race logic
Race logic encodes values by delaying signals. Signals can still be low or high (0 or 1), but it’s the timing of the transition between those states which is all important. For example, a value of ‘2’ can be represented by a signal that is low until time 2 and then high from then on. Race logic has four primary operations that are easy to implement in hardware: MAX, MIN, ADD-CONSTANT, and INHIBIT.
- For MAX the output of a gate should go high only when all of the inputs have arrived. A single AND gate between two wires is therefore all we need to implement MAX in the race domain. (That’s kind of cool, isn’t it? 🙂 )
- MIN is actually pretty straightforward as well – here we want the first signal to arrive, so a single OR gate does the job.
- Since values are encoded by the time of the rising edge from 0 to 1, adding a constant value to a signal (ADD-CONSTANT) amounts to introducing the proportionate amount of delay to the signal. One efficient way of doing that in analog hardware is the use of current-starved inverters.
- INHIBIT takes two inputs: an inhibiting signal and a data signal. If the inhibiting signal arrives first, the output is prevented from ever going high. If the data signal arrives at the same time as, or before, the inhibiting signal, it is allowed to pass through unchanged. An INHIBIT function can be encoded using a single PMOS pass gate.
Together, this set of four operations allows us to deliberately engineer “race conditions” in a circuit to perform computation at low energy.
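To make the semantics concrete, here’s a minimal software model of the four primitives (my own sketch, not code from the paper), where each signal is represented by the arrival time of its rising edge and infinity stands in for an edge that never arrives:

```python
import math

NEVER = math.inf  # a rising edge that never arrives

def race_max(a, b):
    # MAX: an AND gate -- the output goes high only once the later edge arrives
    return max(a, b)

def race_min(a, b):
    # MIN: an OR gate -- the output goes high with the earlier edge
    return min(a, b)

def add_constant(a, c):
    # ADD-CONSTANT: a delay element (e.g. current-starved inverters) adds c
    return a + c

def inhibit(inhibiting, data):
    # INHIBIT: if the inhibiting edge arrives strictly first, the output
    # never goes high; otherwise the data signal passes through unchanged
    return NEVER if inhibiting < data else data
```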
Encoding decision trees using race logic
Take a decision tree, and turn it upside down so that instead of computation starting at the root and proceeding to an ultimately selected leaf, we start at the leaves and work our way to the root.
One idea is to assign a unique delay-encoded label to each leaf, and then let the labels race each other on their journey to the root. The first one to arrive at the root is the winner, i.e., the correct label for this classification.
Another way of looking at the problem is that each path from the tree root to a leaf corresponds to a unique conjunction of attribute tests. We can execute those tests in parallel rather than sequentially.
When it comes to making a decision at a node in the tree, each node has a fixed threshold t, learned during the training phase, against which it compares the input value. So decision trees are essentially networks of threshold functions. This fits rather nicely with an INHIBIT gate: encode the threshold as a reference edge delayed by t, and let it block or pass the input signal depending on which edge arrives first.
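Continuing the software sketch from above (and glossing over how the hardware handles the complementary ‘>’ branch and tie-breaking), a root-to-leaf path then becomes a MAX over its threshold tests, and the race between leaves a MIN:

```python
def node_passes_le(x, t):
    # "x <= t" in the time domain: a reference edge delayed by t inhibits
    # the data path, blocking inputs that arrive after the threshold
    # (uses add_constant, inhibit, race_max from the sketch above)
    return inhibit(add_constant(0, t), x)

def leaf_arrival(sample, path):
    # a root-to-leaf path is a conjunction of tests; AND in the race
    # domain is MAX, so the leaf "fires" only if every test passes
    arrival = 0
    for feature, threshold in path:
        arrival = race_max(arrival, node_passes_le(sample[feature], threshold))
    return arrival

def classify(sample, paths_by_label):
    # the leaves race to the root; the first arrival (MIN) wins
    arrivals = {label: min(leaf_arrival(sample, p) for p in paths)
                for label, paths in paths_by_label.items()}
    return min(arrivals, key=arrivals.get)
```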
An end-to-end architecture
It’s now time to put our race-logic encoded decision trees in the context of an end-to-end system.
The first stage of analog-to-digital converters (ADCs) often involves converting the analog signal into the temporal domain. “Such an approach leverages the small delays and high temporal resolution of nano-meter scale devices to get superior performance over their voltage domain counterparts.” For our use case, we simply don’t need the rest of the conversion: the time-to-digital converter (TDC) stage.
Examples of systems providing directly time-encoded outputs without TDCs range from visual sensors, such as Dynamic Vision Sensors (DVS), Asynchronous Time-based Image Sensors (ATIS), Time to First Spike (TTFS) and Time-of-Flight (ToF) cameras, to sound sensors, e.g. the AER (Address Event Representation) Ear, which is a silicon-based cochlea that converts auditory signals from various frequency bands to precise temporally coded outputs… the presented architecture can work with any temporally-coded input provided to it.
A programmable race-logic accelerator is used for the decision trees, with threshold values encoded using a shift register. A decoder reads the race-tree output and turns it into a one-hot encoded memory address. For an ensemble learner, we have an ensemble of such trees, with a weighted voting scheme based on general binary adders.
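Functionally, the voting stage reduces to accumulating each tree’s quantized leaf votes and picking the class with the highest total. Here’s a sketch, with the per-tree vote format assumed rather than taken from the paper:

```python
def ensemble_predict(trees, sample, num_classes):
    # each race tree resolves to a single leaf; the decoder turns that
    # leaf into a one-hot address whose stored vote vector is then
    # accumulated by the adder network (modelled here as integer sums)
    totals = [0] * num_classes
    for tree in trees:
        leaf_votes = tree(sample)  # assumed: per-class quantized votes
        for c in range(num_classes):
            totals[c] += leaf_votes[c]
    return max(range(num_classes), key=totals.__getitem__)
```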
Integration with scikit-learn
The Race Trees accelerator is fully integrated into a scikit-learn based development flow.
Once the model is trained, our tool analyzes the importance of input features, explores the learner’s performance against lower resolution data, and proceeds with quantization of the votes (the contents of tree leaves). Finally, the user has to choose one of the following options: (a) the generation of a configuration file for the crossbar networks of a programmable Race Trees accelerator, or (b) the generation of a custom RTL-level design with hardwired units.
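As a rough illustration of the software side of such a flow (my own sketch: the dataset, model settings, and printed ‘export’ are stand-ins for the authors’ tool, not its actual interface):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier

# train a boosted ensemble of shallow trees (sklearn's digits dataset
# standing in for MNIST to keep the example small)
X, y = load_digits(return_X_y=True)
clf = GradientBoostingClassifier(n_estimators=200, max_depth=6).fit(X, y)

# walk one stage's tree and pull out the (feature, threshold) pairs that
# a configuration file would load into the accelerator's shift registers
tree = clf.estimators_[0][0].tree_
for node in range(tree.node_count):
    if tree.children_left[node] != -1:  # internal (non-leaf) node
        print(f"feature {tree.feature[node]}, threshold {tree.threshold[node]:.2f}")
```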
Performance evaluation
The evaluation is performed using the MNIST dataset, since that has the most results available in the literature for comparison. Using GradientBoosting, a classifier consisting of 1,000 race trees of depth 6 gets 97.45% accuracy at 31.35 nJ/prediction. Using just 200 race trees it is still possible to achieve 95.7% accuracy, but now at just 7.8 nJ/prediction.
A comparison of Race Trees (the green dots in the top left-hand corner) to other state-of-the-art low-power machine learning implementations in an accuracy vs energy-delay product scatter plot is presented in Figure 14 of the paper, and shows that Race Trees achieve both lower delay and lower energy per operation than their counterparts.
An open-source implementation of race-logic primitives and example Race Trees can be found on GitHub.
The paper’s conclusion lists a number of areas for future work which indicate there’s still plenty of room for further exploration and improvement, including for example the integration of other circuits such as vector-matrix multipliers that operate purely in the time domain. The quest for energy-efficient machine learning systems looks like it may take us to some very interesting places indeed!