Customized regression model for Airbnb dynamic pricing Ye et al., KDD’18

This paper details the methods that Airbnb use to suggest prices to listing hosts (hosts ultimately remain in control of pricing on the Airbnb platform).

The proposed strategy model has been deployed in production for more than 1 year at Airbnb. The launch of the first iteration of the strategy model yielded significant gains on bookings and booking values for hosts who have adopted our suggestions… multiple iterations of the strategy model have been experimented [with] and launched into production to further improve the quality of our price suggestions.

Figuring out the right price for a night in a given Airbnb listing is challenging because no two listings are the same. Even when we constrain to e.g. similar-sized properties in the same region, factors such as the number of five star reviews can influence price. Furthermore, demand is time-varying due to seasonality and regional events (with different seasonality patterns for different countries). And then of course, how far in advance a booking is being made also factors into the price (“as lead time reduces, there are less opportunities for this night to be booked, which leads to changes in the demand function“).

To help hosts maximise their revenues, Airbnb offers “Price Tips” and “Smart Pricing” tools. Price Tips presents a calendar view showing the predicted likelihood of bookings on a day-by-day basis, given the current pricing as set by the host. When clicking on a given day, more detail and an Airbnb suggested price are shown.

With Smart Pricing hosts can set a min and max price and then any new price suggestions generated by Airbnb that fall within these ranges will be automatically adopted for all available nights.

In the ideal world we’d estimate a demand curve F(P) giving an estimate of demand at a given price P, and then choose P so as to maximise $P \times F(P)$. We’ve already seen some reasons why things aren’t quite so straightforward in the Airbnb case (we at least need to factor in the listing itself, and the time, giving $F(P, t, id)$). Moreover, since Airbnb don’t directly control pricing but can only make suggestions (which hosts adopt only partially), it is not so easy to directly experiment with different pricing strategies and observe market responses.

The pricing system that Airbnb ultimately settled on has three components:

• First a booking probability binary classification model makes predictions of the likelihood a listing will be booked on each night.
• These predictions are then fed into a pricing strategy model which suggests prices for the available nights
• Additional personalisation logic is applied to the prices output by the strategy model to incorporate hosting goals, special events etc..

The main focus of this paper is the pricing strategy model, but we do get brief details on the booking probability model.

### The Booking probability model

Booking probability is predicted using Gradient Boosting Machines (GBM), with a separate model trained for each market. The sampling rate for training data varies based on market density:

Markets with a high density of listings benefit from the location-based models the most, which we sample at a rate higher than the global constant sampling rate.

The models take into account three different types of features:

• Listing features such as listing price per night, room type, person capacity, number of bedrooms/bathrooms, amenities, locations, reviews, historical occupancy rate, whether or not instant booking is enabled, and so on.
• Temporal features such as seasonality (day of year, day of week etc.), calendar availability (gap between check-in and check-out), how many days there are between now and the night in question, and so on.
• Supply and demand features such as number of available listings in the neighbourhood, listing views, search / contact rates, and so on.

By scoring the booking probability model at different price points in a range it’s possible to get an estimated demand curve. However, due to the challenges outlined above getting an accurate enough demand curve at listing-night level to use for price setting is extremely difficult.
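Although this direct strategy was ultimately abandoned (as described below), it’s worth seeing its shape. Here’s a minimal sketch in Python, assuming a hypothetical `booking_model` callable that maps a feature dict to a booking probability — all names here are illustrative, not from the paper:

```python
def estimate_demand_curve(booking_model, features, price_grid):
    """Score the booking probability model at a range of candidate
    prices to obtain an estimated demand curve F(P)."""
    return {p: booking_model(dict(features, price=p)) for p in price_grid}

def naive_revenue_maximizing_price(booking_model, features, price_grid):
    """Pick the price P that maximises expected revenue P * F(P)."""
    curve = estimate_demand_curve(booking_model, features, price_grid)
    return max(curve, key=lambda p: p * curve[p])
```

With a toy linear demand curve such as `F(P) = 1 - P/200`, expected revenue `P * F(P)` peaks at P = 100, and the sketch above recovers that price from a grid search.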

We have tried to directly apply revenue maximization strategies based on our estimated demand curve, but online A/B testing results showed that these methods often fail to optimize revenue for our hosts in practice. Therefore, we decide to pursue alternative solutions…

The alternative solution uses the output of the booking probability model as just one input into the pricing strategy model.

### The Pricing strategy model

Let’s start with this: in the absence of a ground truth for the optimal price, what should you use as an evaluation metric for training a pricing strategy model?

After some deliberation, the team settled on two evaluation metrics: price decrease recall (PDR) and booking regret (BR). We do have historical information on whether a given listing was actually booked on a given night, and the price it was booked at. Both PDR and BR tap into this information.

Let’s assume that if a listing wasn’t booked on a given night at price P, then it also wouldn’t have been booked at any suggested price ≥ P. But if the price had been lower than $P$, then there is at least some chance it might have snagged a booking. PDR is the percentage of non-booked nights where the price suggested by the strategy model is less than the actual price that was advertised. In the following figure, PDR will be 0.6 (3 out of 5 unbooked nights had lower suggested prices).
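A minimal sketch of the PDR computation (the function name and input shape are mine, not the paper’s):

```python
def price_decrease_recall(unbooked_nights):
    """PDR: the fraction of non-booked nights where the suggested price
    was strictly below the calendar price the host had set.
    `unbooked_nights` is a list of (calendar_price, suggested_price) pairs."""
    hits = sum(1 for calendar, suggested in unbooked_nights
               if suggested < calendar)
    return hits / len(unbooked_nights)
```

With five unbooked nights, three of which had lower suggested prices, this returns 0.6, matching the figure.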

If all we had was PDR though, we’d end up training a model to offer free accommodation every night! If a listing was booked on a given night at some price P, and we suggested a price below P, then our suggestion is leaving money on the table. Booking Regret captures this missed revenue. BR is calculated as follows: for all the booked nights, take the maximum of zero and the percentage by which the suggested price falls below the booked price. Now take the median of these values.

For example, given:

Then BR will be the median of (14, 5, 6, 0, 0) = 5%.
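BR can be sketched in the same style (the input values below are hypothetical, chosen to reproduce the 5% example):

```python
import statistics

def booking_regret(booked_nights):
    """BR: for each booked night take max(0, (booked - suggested)/booked),
    then take the median across booked nights.
    `booked_nights` is a list of (booked_price, suggested_price) pairs."""
    regrets = [max(0.0, (booked - suggested) / booked)
               for booked, suggested in booked_nights]
    return statistics.median(regrets)
```

Note the `max(0, …)`: suggesting a price above the booked price incurs no regret under this metric, which is exactly why BR on its own would push prices up and needs PDR as a counterweight.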

Now we need a way to combine these ideas into a single loss function. It looks like this:

$\underset{\theta}{\operatorname{argmin}} \sum_{i=1}^{N} \left[ \left( L(P_i, y_i) - f_{\theta}(\mathbf{x}_i) \right)^{+} + \left( f_{\theta}(\mathbf{x}_i) - U(P_i, y_i) \right)^{+} \right]$

where $(x)^{+}$ denotes $\max(x, 0)$ and $y_i$ indicates whether listing-night $i$ was booked.

It’s not as bad as it looks, honest! $f_{\theta}(\mathbf{x}_i)$ is the suggestion made by the price suggestion function given input parameters $\mathbf{x}_i$. L is a lower bound function for the optimal price range, and U is an upper bound function.

For booked listing nights, the lower bound is the booking price $P_i$, and for non-booked listing nights it is $c_1 P_i$ where $c_1$ is a constant between 0 and 1.

For non-booked listing nights, the upper bound is the calendar price $P_i$ at which the sample was not booked. For a booked night, the upper bound is $c_2 P_i$ where $c_2$ is a constant > 1.

When suggestions fall between the upper and lower bound, the loss is zero; otherwise the loss is the distance between the suggestion and the bound.
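Based on the bound definitions above, the per-sample loss can be sketched as follows (the values of the constants $c_1$ and $c_2$ here are illustrative placeholders, not the paper’s):

```python
def bounds(calendar_price, booked, c1=0.9, c2=1.1):
    """Lower/upper bounds on the optimal price range. For booked nights
    the range is [P, c2*P]; for non-booked nights it is [c1*P, P]."""
    if booked:
        return calendar_price, c2 * calendar_price
    return c1 * calendar_price, calendar_price

def strategy_loss(suggested, calendar_price, booked, c1=0.9, c2=1.1):
    """Zero when the suggestion falls inside [L, U]; otherwise the
    distance from the suggestion to the violated bound."""
    lo, hi = bounds(calendar_price, booked, c1, c2)
    return max(lo - suggested, 0.0) + max(suggested - hi, 0.0)
```

So for a night booked at 100, suggesting 105 costs nothing, while suggesting 90 incurs a loss of 10; for an unbooked night priced at 100, suggesting 95 costs nothing, while suggesting 120 incurs a loss of 20.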

The set of features $\mathbf{x}_i$ includes the calendar price $P_i$ set by the host, the guest booking probability for the night as output by the booking probability model, and a set of market demand signals (unspecified) that are not fully captured by the booking probability model. The pricing suggestion model itself is “an asymmetric exponential form model, which applies price increases/decreases upon the calendar price with magnitude learned from data.” The suggested price is given by $P \cdot V$ where the increase / decrease magnitude V is given by:

• D is a demand score computed from additional demand signals at the cluster level (a cluster is a group of similar listings).
• $\theta_1$ controls how fast the prices grow/shrink
• $\theta_2$ controls when the original calendar price is suggested
• q is the estimated booking probability
• $\varphi_H$ and $\varphi_L$ are constants between 1 and 2 which control the extent to which the suggestion curves bend.

We do not use the same constants for price increases and decreases since we would like the training system to learn the ratios asymmetrically. In this way, price suggestions can reflect the demand sensitivity more thoroughly by taking advantage of the non-linear manner in which markets perceive supply and demand.

The parameters $\theta_1$ and $\theta_2$ are trained at the listing level, for each of the 4 million+ active listings on Airbnb. Market level and global level fallback parameters are also prepared in case a listing has insufficient training examples (e.g., a new listing). Training puts most emphasis on the latest booking behaviours to better reflect seasonal signals.

### Evaluation

Offline and online evaluation results show that the proposed strategy model performs significantly better than a direct max-rev pricing strategy. We are also actively working on improving the demand curve estimation. With a more accurate demand curve, we may revisit the direct revenue maximization strategy in the future.

Compared to a naive strategy of pricing directly off of the demand estimation curves from the booking probability model, the pricing strategy model significantly improves PDR and BR (except for BR with dataset (a)).

As well as a quantitative evaluation (details in section 5.1 of the paper), the authors also inspected price suggestions generated on 2018-02-08 for 120 nights into the future. The following figures show the suggestions generated for Tokyo and for Tahoe respectively.

For both markets strong weekly patterns emerge, and in Tokyo there is also a strong spike from late March to early April corresponding with the cherry blossom season. “From these two examples, we see that our model can indeed capture the market dynamics in a timely fashion.”

Rosetta: large scale system for text detection and recognition in images Borisyuk et al., KDD’18

Rosetta is Facebook’s production system for extracting text (OCR) from uploaded images.

In the last several years, the volume of photos being uploaded to social media platforms has grown exponentially to the order of hundreds of millions every day, presenting technological challenges for processing increasing volumes of visual information… our problem can be stated as follows: to build a robust and accurate system for optical character recognition capable of processing hundreds of millions of images per day in realtime.

Images uploaded by clients are added to a distributed processing queue from which Rosetta inference machines pull jobs. Online image processing consists of the following steps:

1. The image is downloaded to a local machine in the Rosetta cluster and pre-processing steps such as resizing (to 800px in the larger dimension) and normalization are performed.
2. A text detection model is executed to obtain bounding box coordinates and scores for all the words in the image.
3. The word location information is passed to a text recognition model that extracts characters given each cropped word region from the image.
4. The extracted text along with the location of the text in the image is stored in TAO.
5. Downstream applications such as search can then access the extracted textual information corresponding to the image directly from TAO.

The most interesting part is of course the text extraction using the two-step process outlined above.

This two-step process has several benefits, including the ability to decouple training process and deployment updates to detection and recognition models, run recognition of words in parallel, and independently support text recognition for different languages.

### Text detection in Rosetta

Text detection is the most compute and latency sensitive component. After evaluating several different approaches the authors settled on the Faster-RCNN detection model. Amongst other reasons, Faster-RCNN was readily available to them as part of the Facebook Detectron platform. Since Detectron has been open-sourced by Facebook, that means it’s readily available to you too!

Faster-RCNN learns a fully convolutional CNN that can represent an image as a convolutional feature map. It also learns a region proposal network that takes the feature map as an input and produces a set of k proposal bounding boxes that contain text with high likelihood, together with their confidence score. There are a number of different choices for the convolutional body of Faster-RCNN. In tests, ShuffleNet proved the fastest (up to 4.5x faster than ResNet-50).

To train the text detection model the team initially used the COCO-Text dataset, but the wide variety of text in images uploaded to Facebook (including e.g. lots of images with overlaid words) didn’t match well to the COCO-Text training set. In the end, the team used three different datasets for training: first an artificially generated dataset with text overlaid images; then COCO-Text; and finally a human-rated dataset specifically collected for Facebook client applications. The following table shows how accuracy improved as the various datasets were introduced.

### Text recognition in Rosetta

Text recognition is done using a fully-convolutional model called CTC (because it uses a sequence-to-sequence CTC loss during training) that outputs a sequence of characters. The last convolutional layer predicts the most likely character at every image position of the input word.


… every column of the feature map corresponds to the probability distribution of all characters of the alphabet at that position in the image, and CTC finds the alignments between those predictions, which may contain duplicate characters or a blank character, and the ground truth label.

For example, given the input training word LEARNING, the model might produce the sequence of characters ‘L-EE-A-RR-N-I-NN-G,’ which includes blanks (‘-’) and duplicates.

Decoding greedily takes the most likely character at every position of the sequence, and then in post-processing contiguous duplicate characters not delimited by the blank character are removed.
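The greedy decode and collapsing step can be sketched as follows (an illustrative sketch, not Facebook’s implementation):

```python
def ctc_greedy_decode(frame_chars, blank="-"):
    """Collapse a per-frame character sequence the CTC way: drop
    contiguous duplicates, then strip the blank characters."""
    out = []
    prev = None
    for ch in frame_chars:
        # Keep a character only if it differs from the previous frame
        # (duplicates separated by a blank are kept, as intended).
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)
```

Applied to the example above, ‘L-EE-A-RR-N-I-NN-G’ decodes back to ‘LEARNING’, with the blank between the two Ns ensuring the genuine double letter survives the duplicate removal.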

… a pre-defined dictionary would be too limiting for many real-world applications, which require recognizing more than just simple words as in the case of URLs, emails, special symbols and different languages. Therefore, an important architectural decision and a natural choice was to use a character-based recognition model.

A fixed width is needed at training time to be able to efficiently train using batches of images. Word images are resized to 32×128 pixels, with right-zero padding if the original is less than 128 pixels wide. This minimises the amount of distortion introduced in the images. During testing images are resized to a height of 32 pixels, preserving their aspect ratio (regardless of resulting width).

The number of character probabilities emitted is dependent on the width of the word image. A stretching factor of 1.2 was found to lead to superior results compared to using the original aspect ratio (you get 20% more output probabilities that way).

Training the CTC model proved difficult with it either diverging after just a few iterations or training too slowly to be of practical use. The solution was to use curriculum learning, i.e., starting with a simpler problem and increasing the difficulty as the model improves.

Training started with words of three characters or less, with the maximum word length increasing at every epoch. The width of images was also reduced initially. Training started with a tiny learning rate and this was also gradually increased at every epoch.

The overall accuracy of the system was further improved by 1.54% by introducing random jittering in the training set — randomly moving the bounding box coordinates of ground truth to model the behaviour of noise from the detection model.

Our system is deployed to production and processes images uploaded to Facebook every day.

Columnstore and B+ tree – are hybrid physical designs important? Dziedzic et al., SIGMOD’18

Earlier this week we looked at the design of column stores and their advantages for analytic workloads. What should you do though if you have a mixed workload including transaction processing, decision support, and operational analytics? Microsoft SQL Server supports hybrid physical design combining both column store and B+ tree indexes in the same database.

It is generally understood that columnstores are crucial to achieving high performance for analytic queries and that B+ tree indexes are key to supporting transactional workloads efficiently. However, it is not well understood whether hybrid physical designs – both columnstore and B+ tree indices on the same database and potentially the same table – are important for any of the above workloads.

Through a series of benchmarks the authors show that hybrid physical designs can result in more than an order of magnitude lower execution costs for many workloads when compared to alternatives using B+ tree-only or columnstore-only. The Database Engine Tuning Advisor (DTA) for SQL Server is extended to analyze and recommend the appropriate indices for a given workload. Support for columnstore indices and the new DTA functionality was released in January 2017 as part of the Community Technology Preview release for Microsoft SQL Server 2017.

### Physical design options in SQL Server

RDBMSs have supported B+ trees and heap files for several decades. With the advent of columnstores, which significantly outperform B+ trees for data analysis workloads, many commercial RDBMS vendors have added support for columnstore indexes (CSI)…

In SQL Server columnstores are treated as indexes. They can be primary (the main storage for all columns of the table) or secondary (e.g., just a subset of columns). You can have any combination of primary and secondary indexes on the same table. The primary can be a heap file, B+ tree, or a columnstore, and secondaries can be B+ trees or columnstores. At most one columnstore index is supported per table though.

SQL Server columnstores support vectorised operations and compression using run-length encoding and dictionary encoding. Within a columnstore, column data is held in column segments, each with data for 100K-1M rows. Inserts are handled using delta stores implemented as B+ trees. Secondary columnstores, optimised for operational analytics, use a B+ tree delete buffer for logical deletion of rows. This is periodically compressed into a delete bitmap storing the physical identifiers of the deleted rows. Primary columnstores don’t support delete buffers and work directly with delete bitmaps instead.

### Microbenchmarks

The authors conducted a series of microbenchmarks as follows:

• scans with single predicates with varying selectivity to study the trade-off between the range scan of a B+ tree vs a columnstore scan
• sort and group-by queries to study the benefit of the sort order supported by B+ trees (columnstores in SQL Server are not sorted).
• update statements with varying numbers of updated rows to analyze the cost of updating the different index types

The key findings are summarised in the table below.

In a nutshell, B+ tree indexes are suitable for short range scans where the index allows efficient point and short range lookups. B+ trees are also the cheapest to update. On the other hand, primary CSIs are most suitable for large scans and bulk updates typical in data warehousing and analysis workloads. Secondary CSIs can provide significant speed-up for operational analytics on the same database where the OLTP application generating the data also runs. The basic workload axes can be combined in a variety of ways where a mix of the basic physical design axes are needed for optimal performance.

Well, surprise! B+ trees are good for OLTP, and columnstores are good for analytics. Where things get interesting though, is when we combine the two for certain workloads….

### Recommending hybrid designs

The Database Engine Tuning Advisor (DTA) recommends B+ tree indexes (primary and/or secondary), materialized views, and partitioning.

We extended DTA to analyze the combined space of B+ tree and columnstore indexes. By analyzing the workload, DTA is now capable of recommending B+ tree indexes only, columnstore indexes only, or a combination.

The high level architecture of DTA is shown in the figure below.

DTA begins with a local per-query analysis stage called candidate selection, which determines the optimal set of indexes for each query. Then it proceeds to global analysis, starting by considering the potential to merge indexes on the same table. Once this is done, DTA selects a final set of indexes to minimise the total cost of the workload subject to specified constraints.

DTA uses a cost-based search, which means it needs to estimate the costs of indexes it hasn’t actually built yet. The “what-if” API is used to simulate such hypothetical indexes.

During the candidate selection phase DTA considers columnstore indexes on the tables referenced in the query. Given the constraint in SQL Server of only one columnstore index per table DTA chooses to include all columns with data types suitable for columnstore indexes. (And if the table includes a column with an unsupported data type, then a primary columnstore index is ruled out).

During merging there’s not much extra that can be done – only one columnstore index is supported per table, and we can’t merge a columnstore index with a B+ tree index. During the final selection phase it is necessary to estimate per-column sizes for cost estimation. For hypothetical indexes, we need to do this without actually building the index. DTA uses block-level sampling coupled with techniques to estimate the impact of compression.

The effectiveness of run-length encoding depends on the number of runs in the column and the length of each run… SQL Server uses a greedy strategy that picks the next column to sort by based on the column with the fewest runs; we mimic this approach in our [estimation] technique.

The GEE estimator is used to estimate the number of distinct values for a set of columns.

### Evaluation

The paper closes with an evaluation of both industry-standard benchmarks and real-world customer workloads to see how well the hybrid physical designs suggested by DTA improve query performance.

Read-only workloads are based on the TPC-DS benchmark and five real customer workloads. For mixed workloads, the CH benchmark (an extension of TPC-C) is used.

Figure 9 below shows the results for the read-only workloads. In each chart the blue bars show the speed-up obtained by the hybrid design vs CSI-only, and the green bars show the speed-up versus a B+ tree only design. For example, on TPC-DS, 46 queries were sped up by a factor of 1.2x when comparing the hybrid design to a CSI only design.


… hybrid leverages the best of columnstores and B+ tree across several workloads. For each workload, there are several queries for which a hybrid physical design results in more than an order of magnitude improvement in execution cost. In some cases, the improvement is 2-3 orders of magnitude.

An example of a TPC-DS query that really benefits from the hybrid design is query #54. It references several large fact tables as well as many dimension tables. The predicates on the dimension tables are selective enough that B+ trees have a significant advantage. Other tables have columnstore indexes. A similar pattern emerges with the workload of customer four, where the optimiser uses an index seek on the fact table(s) followed by a scan of the columnstore on the dimensions, joining the tables with a hash join.

Here are the results for the CH workload:

The hybrid design significantly speeds up the H (analytic) queries, while resulting in a moderate slow-down for the C (OLTP) queries – mostly the write transactions NewOrder and Payment.

The design and implementation of modern column-oriented database systems Abadi et al., Foundations and trends in databases, 2012

I came here by following the references in the Smoke paper we looked at earlier this week. “The design and implementation of modern column-oriented database systems” is a longer piece at 87 pages, but it’s good value-for-time. What we have here is a very readable overview of the key techniques behind column stores.

### What is a column store?

Column stores are relational databases that store data by column rather than by row. Whereas a traditional row-based store stores all attributes of one row together, followed by the attributes of the next row, and so on, a column-based store uses one logical file per attribute (column). The column-oriented layout makes it efficient to read just the columns you need for a query, without pulling in lots of redundant data.

Data for a column may be stored in an array with implicit ids (a), or in some format with explicit ids (b).

Since data transfer costs from storage (or through a storage hierarchy) are often the major performance bottlenecks in database systems, while at the same time database schemas are becoming more and more complex with fat tables with hundreds of attributes being common, a column-store is likely to be much more efficient at executing queries… that touch only a subset of a table’s attributes.

Column stores are typically used in analytic applications, with queries that scan a large fraction of individual tables and compute aggregates or other statistics over them. (Whereas e.g. OLTP applications retrieving all attributes for a given row are better suited to row-based stores).

### Making column stores fast

Simply storing data in columns isn’t sufficient to get the full performance out of column-based stores. There are a number of techniques that have been developed over the years that also make a big impact. The figure below shows an unoptimised column store performing worse than a row store on a simplified TPC-H benchmark. But by the time you add in a number of the optimisations we’re about to discuss, it ends up about 5x faster than the row store.

One of the most important factors in achieving good performance is preserving I/O bandwidth (by e.g. using sequential access wherever possible and avoiding random accesses). Thus even when we look at techniques such as compression, the main motivation is that moving compressed data uses less bandwidth (improving performance), not that the reduced sizes save on storage costs.

Pioneering column-store systems include MonetDB, VectorWise, and C-Store.

The following table summarizes some of the key techniques used in column-stores, and their row-store counterparts.

Let’s dive into some of these in more detail…

### Block-oriented and vectorized processing

Traditional query execution uses a tuple-at-a-time pull-based iterator approach in which each operator gets the next input tuple by calling the next() method of the operators of its children in the operator tree.

MonetDB introduced the idea of performing simple operations one column at a time. To do this, it used lower level Binary Association Table (BAT) operators, with complex expressions in a query mapped into multiple BAT Algebra operations. “The philosophy behind the BAT Algebra can also be paraphrased as ‘the RISC approach to database query languages’.” VectorWise improved on MonetDB with a vectorised execution model striking a balance between the full materialization of intermediate results required by MonetDB and the high functional overhead of tuple-at-a-time iterators in traditional systems.

Essentially, VectorWise processes one block/vector of a column at a time as opposed to one column-at-a-time or one tuple-at-a-time.

The operators are similar to those used in tuple pipelining, except that the next() method returns a vector of N tuples instead of only one. Vectorised processing has a number of advantages including:

• Reduced function call overhead (N times fewer calls)
• Better cache locality with tuned vector sizes
• Opportunity for compiler optimisation and use of SIMD instructions
• Block-based algorithm optimisations
• Parallel memory access speed-ups through speculation
• Better support for runtime adaptation based on performance profiling data.
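As a rough illustration of the difference (a sketch, not VectorWise’s actual implementation), here’s a block-at-a-time selection in Python, where each `next()` on the block scanner yields a vector of values instead of a single tuple:

```python
def scan_blocks(column, vector_size=1024):
    """Yield the column one fixed-size vector (block) at a time,
    instead of one tuple at a time."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]

def select_gt(column, threshold, vector_size=1024):
    """Vectorized selection: one generator call per block of values,
    amortizing per-call overhead across vector_size tuples."""
    out = []
    for block in scan_blocks(column, vector_size):
        # The inner loop over a contiguous block is the part a real
        # engine would compile to tight, SIMD-friendly code.
        out.extend(v for v in block if v > threshold)
    return out
```

The function-call overhead is paid once per block rather than once per tuple, which is the essence of the trade-off between full materialization and tuple-at-a-time iteration.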

#### Column-specific compression

Intuitively, data stored in columns is more compressible than data stored in rows. Compression algorithms perform better on data with low information entropy (i.e., with high data value locality), and values from the same column tend to have more value locality than values from different columns.

Compression improves performance by reducing the amount of time spent in I/O. With performance as a goal, CPU-light compression schemes are preferred. Different columns may of course use different compression schemes as appropriate.

Frequency partitioning makes compression schemes even more efficient by partitioning a column such that each page has as low an information entropy as possible (e.g., storing frequent values together).

Encoding schemes include:

• Run length encoding (RLE) which uses (value, start position, run length) triples.
• Bit-vector encoding, useful when columns have a limited number of possible data values. Each possible data value has an associated bit-vector, with the corresponding bit in the vector set to one if the column attribute at that position has the chosen value.
• Dictionary encodings (work well for distributions with a few very frequent values)
• Frame-of-reference encodings (save space by exploiting value-locality : values are represented by a delta from a base reference value).
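As a concrete sketch of the first of these, run-length encoding with (value, start position, run length) triples might look like:

```python
def rle_encode(column):
    """Run-length encode a column as (value, start_position, run_length)
    triples."""
    runs = []
    for pos, value in enumerate(column):
        if runs and runs[-1][0] == value and runs[-1][1] + runs[-1][2] == pos:
            v, start, length = runs[-1]
            runs[-1] = (v, start, length + 1)   # extend the current run
        else:
            runs.append((value, pos, 1))        # start a new run
    return runs

def rle_decode(runs):
    """Expand the triples back into the original column."""
    out = []
    for value, _start, length in runs:
        out.extend([value] * length)
    return out
```

RLE pays off most on sorted (or mostly-sorted) columns, where runs are long and few.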

With dictionary and frame-of-reference encodings patching may also be used. Here we encode only the most frequent values, and allow ‘exception’ values which are not compressed. Typically a disk block might be split into two parts that grow towards each other: compressed codes at the start of the block growing forwards, and an error array at the end growing backwards. For tuples with exception values, their encoded value is a special escape value signalling the value should be looked up separately.

#### Direct operation on compressed data

The ultimate performance boost from compression comes when operators can act directly on compressed values without needing to decompress. Consider a sum operator over RLE-encoded values: it suffices to multiply each run’s value by its run length and add up the results.
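For instance, given (value, start position, run length) triples, a sum never needs to expand the runs at all:

```python
def sum_rle(runs):
    """Sum a numeric column directly from its RLE triples — one multiply
    and add per run, with no decompression of the column."""
    return sum(value * length for value, _start, length in runs)
```

A column of a million identical values is summed with a single multiplication, versus a million additions over the decompressed data.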

To avoid lots of compression-technique specific code blocks the general properties of compression algorithms can be abstracted so that query operators can act on compression blocks.

A compression block contains a buffer of column data in compressed format and provides an API that allows the buffer to be accessed by query operators in several ways…

Compression alone can give up to a 3x performance improvement. By adding in operation on compressed data it is possible to obtain more than an order of magnitude in performance improvements.

#### Late materialization

Many queries access more than one attribute from a particular entity, and most database output standards (e.g. ODBC, JDBC) access results using an entity-at-a-time interface, not a column-at-a-time. Therefore at some point, a column store probably needs to re-assemble row-based tuples (‘tuple construction’).

The best performance comes from assembling these tuples as late as possible, operating directly on columns for as long as possible.

To do so, intermediate ‘position’ lists often need to be constructed to match up operations that have been performed on different columns.

The following figure illustrates a late materialization query plan for a select-project-join query. The select operators filter each column independently, maximising the utilisation of memory bandwidth.

See section 4.4 in the paper for the full explanation of what’s going on here!
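A toy illustration of the position-list idea (not the paper’s actual operators): each column is filtered independently to produce a list of qualifying row positions, the lists are intersected, and tuples are assembled only at the very end.

```python
def select_positions(column, predicate):
    """Filter a single column independently, producing a position list."""
    return [pos for pos, value in enumerate(column) if predicate(value)]

def intersect_positions(a, b):
    """AND two position lists together, combining per-column selections."""
    b_set = set(b)
    return [pos for pos in a if pos in b_set]

def materialize(columns, positions):
    """Late materialization: build row tuples only for qualifying positions."""
    return [tuple(col[pos] for col in columns) for pos in positions]
```

Only the rows surviving both predicates are ever stitched into tuples; everything before that operates on tightly packed single columns.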

Late materialisation has four main advantages:

1. Due to selection and aggregation operations, it may be possible to avoid materialising some tuples altogether
2. It avoids decompression of data to reconstruct tuples, meaning we can still operate directly on compressed data where applicable
3. It improves cache performance when operating directly on column data
4. Vectorized optimisations have a higher impact on performance for fixed-length attributes. With columns, we can take advantage of this for any fixed-width columns. Once we move to row-based representation, any variable-width attribute in the row makes the whole tuple variable-width.

#### Efficient join implementations

The most straightforward way to implement a column-oriented join is for (only) the columns that compose the join predicate to be input to the join. In the case of hash joins (which is the typical join algorithm used) this results in much more compact hash tables, which in turn results in much better access patterns during probing; a smaller hash table leads to fewer cache misses.

After a join, at least one set of output positions will not be sorted (e.g. positions from the left input relation will emerge in sorted order, but those from the right input will not). That’s a pain when other columns from the joined tables are needed after the join. “Jive joins” add an additional ordering column to the list of positions that we want to extract. Say we know we want to extract records 2, 4, 2, and 1 in that order. We add an ordering column thus:

And then sort the output by the first column:

We can now scan the columns from the table efficiently to get the data.

We sort by the second column (the ‘index’ that we added) to put the data back in the right order to complete the join.
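The whole Jive-join fetch can be sketched in a few lines of Python (my own illustration of the two-sort trick, using 0-indexed positions):

```python
def jive_fetch(column, positions):
    """Fetch `positions` (possibly unsorted, with duplicates) from a column
    using two sorts and one sequential scan instead of random access."""
    # 1. tag each wanted position with its output slot (the added ordering column)
    tagged = list(enumerate(positions))        # (slot, position) pairs
    # 2. sort by position so the column can be scanned sequentially
    tagged.sort(key=lambda t: t[1])
    fetched = [(slot, column[pos]) for slot, pos in tagged]
    # 3. sort back by the added slot column to restore the join output order
    fetched.sort(key=lambda t: t[0])
    return [v for _, v in fetched]
```

For the example above (records 2, 4, 2, 1, i.e. positions `[1, 3, 1, 0]`), the column is scanned in position order and the results are then re-sorted back into join order.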

#### Redundant column representations

Columns that are sorted according to a particular attribute can be filtered much more quickly on that attribute. By storing several copies of each column sorted by attributes heavily used in an application’s query workload, substantial performance gains can be achieved. C-store calls groups of columns sorted on a particular attribute projections.

#### Database cracking and adaptive indexing

An alternative to sorting columns up front is to adaptively and incrementally sort columns as a side effect of query processing. “Each query partially reorganizes the columns it touches to allow future queries to access data faster.” For example, if a query has a predicate A ≥ 10, it only has to search and crack the part of the column that can contain values ≥ 10. In the following example, query Q1 cuts the column into three pieces and then query Q2 further refines the partitioning.

The terminology “cracking” reflects the fact that the database is partitioned (cracked) into smaller, more manageable pieces.
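A toy Python sketch of one cracking step (a simplification: a real cracker partitions in place and only touches the pieces the predicate overlaps, using an index of previous cut points):

```python
def crack_three(col, low, high):
    """Crack a column around the range [low, high): afterwards values < low,
    values in [low, high), and values >= high form three contiguous pieces.
    Returns the two cut points, which a cracker index would remember so that
    future range queries can jump straight to the relevant piece."""
    left  = [v for v in col if v < low]
    mid   = [v for v in col if low <= v < high]
    right = [v for v in col if v >= high]
    col[:] = left + mid + right    # physically reorganize the column in place
    return len(left), len(left) + len(mid)
```

After cracking, the original range query (and any future query over `[low, high)`) is answered by the contiguous slice `col[c1:c2]` with no further scanning.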

Cracking can bring big performance gains without paying the overhead of pre-sorting. Here’s an example from MonetDB:

A variation called stochastic cracking performs non-deterministic cracking actions by following query bounds less strictly. By doing so it can create a more even spread of the partitioning across a column.

As a final thought, with all of this careful management of compressed columns going on you’d be right in thinking that updates (and deletes) are a pain. For analytic use cases, we generally want to load lots of data at once. The incoming data is row-based, and it needs to be split into columns with each column being written separately. Optimised loaders are important. The C-Store system for example first loads all data into an uncompressed write-optimised buffer, and then flushes periodically in large compressed batches.

Smoke: fine-grained lineage at interactive speed Psallidas et al., VLDB’18

Data lineage connects the input and output data items of a computation. Given a set of output records, a backward lineage query selects a subset of the output records and asks “which input records contributed to these results?” A forward lineage query selects a subset of the input records and asks, “which output records depend on these inputs?”. Lineage-enabled systems capture record-level relationships throughout a workflow and support lineage queries.

Data lineage is useful in lots of different applications; this paper uses as its main example interactive visualisation systems. This domain requires fast answers to queries and is typically dominated by hand-written implementations. Consider the two views in the figure below. When the user selects a set of marks in $V_1$, marks derived from the same records are highlighted in $V_2$ (linked brushing).

A typical visualisation system implements this manually, but it can equally be viewed as a backward lineage query from the selection points in $V_1$, followed by a forward lineage query from the resulting input records to $V_2$.

(See ‘Explaining outputs in modern data analytics’ which we looked at last year for an introduction to lineage and provenance principles. Chothia et al. use a shadow dataflow graph mapping the original computation but flowing in the opposite direction…).

### Challenges with existing lineage capture systems

We have the usual space/time trade-offs to consider. We can slow down the base query in order to capture lineage information during query execution (and store that information somewhere). This speeds up answering lineage queries later on. Or we can keep base queries fast and lazily materialize lineage information later when lineage queries are asked (making them slower).

As data processing engines become faster, an important question —and the main focus of this paper— is whether it is possible to achieve the best of both worlds: negligible lineage capture overhead, as well as fast lineage query execution.

### Smoke’s four principles

Smoke employs four central design principles to try and pull off this trick.

1. Tight integration of lineage capture into query execution itself, using write-efficient data structures.
2. Where apriori knowledge of lineage queries is available (e.g., the set of explorations supported by a visualisation tool), this information is used to minimise the amount of lineage that needs to be materialized.
3. Again, if we know lineage queries in advance, and those queries are only interested in aggregated information then we can potentially materialize aggregate statistics as we process queries, and prune or re-partition lineage indices.
4. Wherever possible, data structures constructed during normal operation execution are augmented and reused for lineage purposes rather than introducing separate dedicated structures.

…Smoke is an in-memory query compilation database engine that tightly integrates the lineage capture logic within query execution and uses simple, write-efficient lineage indexes for low-overhead lineage capture. In addition, Smoke enables workload-aware optimizations that prune captured lineage and push the logic of lineage consuming queries down into the lineage capture phase.

### Lineage capture

Smoke uses read- and write-efficient index structures based on row ids to capture lineage information. 1-N relationships (between input and output tuples) are represented as inverted indexes. The index’s ith entry corresponds to the ith output group, and points to a row id array containing the ids of all input records that belong to the group. 1-1 relationships between input and output are represented as a single array.
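A simple Python sketch may help (my own illustration of the two index shapes, using a group-by operator; Smoke itself compiles this logic into the query plan):

```python
def group_by_with_lineage(rows, key):
    """Group-by that captures lineage as it runs. The backward index is an
    inverted index: its i-th entry is a row-id array holding the input rids
    of the i-th output group (a 1-N relationship). The forward index is a
    single array mapping each input rid to its output group id (1-1)."""
    groups = {}            # group key -> output group id
    backward = []          # output gid -> [input rids]
    forward = [0] * len(rows)   # input rid -> output gid
    for rid, row in enumerate(rows):
        k = key(row)
        if k not in groups:
            groups[k] = len(backward)
            backward.append([])
        gid = groups[k]
        backward[gid].append(rid)
        forward[rid] = gid
    return groups, backward, forward
```

A backward lineage query for output group *i* is then just the array lookup `backward[i]`; a forward query for input row *r* is `forward[r]`.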

These indices are populated through a tight integration of lineage capture and relation operator logic to avoid additional API calls, and to facilitate co-optimisation. There are two basic strategies depending on the operator and circumstances: defer and inject.

Inject strategies incur the full cost of index generation during base query execution, whereas defer strategies defer (portions of) the lineage capture until after the operation execution.

Consider selection: both forward and backward lineage capture use row-id arrays, with the forward one pre-allocated based on the cardinality of the input relation. While iterating over the relation evaluating the predicate, the inject strategy adds two counters to track the row ids of the current input and output and uses these to update the indices when an output row is emitted. There is no defer strategy for selection, because it is strictly inferior to inject.
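The inject strategy for selection might look like this in Python (a hand-written sketch of the idea; Smoke generates equivalent compiled code, and I use -1 here as an arbitrary "filtered out" marker):

```python
def select_inject(column, pred):
    """Selection with injected lineage capture: counters track the current
    input and output row ids, and the indexes are filled in as each output
    row is emitted, so capture costs one extra write per matching row."""
    out = []
    backward = []                    # output rid -> input rid (row-id array)
    forward = [-1] * len(column)     # input rid -> output rid; -1 = filtered out
    out_rid = 0
    for in_rid, v in enumerate(column):
        if pred(v):
            out.append(v)
            backward.append(in_rid)
            forward[in_rid] = out_rid
            out_rid += 1
    return out, backward, forward
```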

As another example consider hash joins. Smoke will generate both backward row id arrays, and forward row-id indexes. Under the inject strategy a build phase augments each hash table entry with a row id array containing the input row ids for that entry’s join key. The probe phase tracks the row id for each output record and populates the forward and backward indexes. One drawback of this strategy is that we might trigger multiple re-allocations (growing the size) of the forward indexes if an input record has many matches.

The defer strategy for hash joins takes advantage of the fact that we know the size of the forward indexes after the probe phase. The build phase maintains an additional output row id list, storing the first output record for each match (output records are emitted contiguously). After the probe phase, the forward and backward indexes can be pre-allocated and populated in a final scan of the hash table.
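For concreteness, here is a hand-written sketch of the inject strategy for a hash join, capturing backward lineage only (the forward indexes, which are what the defer strategy improves on, are omitted for brevity; this is my illustration, not Smoke’s generated code):

```python
def hash_join_inject(left, right, lkey, rkey):
    """Hash join with injected lineage capture. The build phase augments each
    hash table entry with a row-id array of left inputs for that join key;
    the probe phase records, for each emitted output row, the contributing
    input rid on each side (backward lineage)."""
    table = {}
    for lrid, lrow in enumerate(left):
        table.setdefault(lkey(lrow), []).append(lrid)   # rid array per key
    output, back_left, back_right = [], [], []
    for rrid, rrow in enumerate(right):
        for lrid in table.get(rkey(rrow), []):
            output.append((left[lrid], rrow))
            back_left.append(lrid)     # output rid -> left input rid
            back_right.append(rrid)    # output rid -> right input rid
    return output, back_left, back_right
```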

For multi-operator plans Smoke propagates lineage information through plan execution so that only a single set of lineage indexes connecting the input and final output relations are emitted. It also takes advantage of pipelines (that merge multiple operators into a single pipeline as part of normal query execution) to eliminate intermediate lineage materialization points where possible.

When the set of lineage queries is known in advance, Smoke can go further than the baseline lineage capture described above and also apply instrumentation pruning and optimisation push-down.

Instrumentation pruning disables lineage capture for lineage indexes that will not be used by the workload. Lineage queries that apply summarisation / aggregation before presenting results provide an opportunity to push-down optimisations. For example, selection push-down when using a static predicate, and partitioning of index arrays by predicate attributes for dynamic selections. Group-by aggregation can also be pushed down into lineage capture.

We observe that popular provenance semantics (e.g. which and why provenance) can be expressed as lineage consuming queries and pushed down using the above optimizations. In other words, Smoke can operate as a system with alternative provenance semantics depending on the given lineage consuming query.

### Evaluation

Smoke is compared to state-of-the-art logical and physical lineage capture and query approaches using a combination of microbenchmarks, TPC-H queries, and two real-world applications.

Smoke’s lineage capture techniques outperform both logical and physical approaches by up to two orders of magnitude. For example:

Smoke also sped up lineage query evaluation by multiple orders of magnitude, especially for low-selectivity lineage queries.

The two real-world applications used in the evaluation were Crossfilter visualisation and data profiling with UGuide. The results show that:

• Lineage can express many real-world tasks from visualisation and data profiling, that are currently implemented by hand in ad-hoc ways, and
• the lineage capture mechanism is fast enough to avoid sacrificing performance vs the hand implementations, and in many cases may even perform better.

Smoke illustrates that it is possible to both capture lineage with low overhead and enable fast lineage query performance… Our capture techniques and workload-aware optimization make Smoke well-suited for online, adaptive, and offline physical database design settings.

Same-different problems strain convolutional neural networks Ricci et al., arXiv 2018

Since we’ve been looking at the idea of adding structured representations and relational reasoning to deep learning systems, I thought it would be interesting to finish off the week with an example of a problem that seems to require it: detecting whether objects in a scene are the same or different.

This image containing a flute was correctly classified by a CNN trained on millions of photographs. On ImageNet the network even surpassed the accuracy of a human observer.

This image contains two shapes that are the same, a relationship that is immediately obvious to a human observer. “Yet, the CNN failed to learn this relation even after seeing millions of training examples.”

The above is an example of a same-different (SD) visual relation problem (output whether the objects in the scene are the same, or different). Spatial relation (SR) problems ask whether objects follow a certain spatial relation, e.g. in a line, horizontally stacked, vertically stacked, and so on. For example:

The synthetic visual reasoning test (SVRT) contains a collection of 23 binary classification problems along these lines. In each case opposing classes differ based on whether their stimuli obey an abstract rule. If you train a bunch of CNNs (with different depths, filter sizes, etc.) on these tasks an interesting pattern pops out. The CNNs really struggled on problems where the abstract rule required detecting whether things were the same or different (congruent up to some transformation), whereas they achieved good accuracy on spatial relation problems.

The resulting dichotomy across the SVRT problems is striking. CNNs fare uniformly worse on SD problems than they do on SR problems. Many SR problems were learned satisfactorily, whereas some SD problems (e.g. problems 20 and 7) resulted in accuracy not substantially above chance.

For SR problems, all the CNNs did pretty well, regardless of network configuration. But for SD problems larger networks performed noticeably better than smaller ones. This suggests that something about the SD problems is straining the capacity of the CNNs.

### Probing further

To dig deeper into this apparent difference between same-different and spatial-relation problems the authors construct a new visual-relation benchmark called PSVRT. The dataset is parameterised so that the size of the items in the scene, the number of scene items, and the size of the whole image can all be controlled. Scene items are just binary bit patterns placed on a blank background. For any given configuration of parameters, the resulting scene can be used for both an SD problem and an SR problem, simply based on labelling.

Our goal was to examine how hard it is for a CNN architecture to learn relations for visually different but conceptually equivalent problems. If CNNs can truly learn the “rule” underlying these problems, then one would expect the models to learn all problems with more-or-less equal ease. However, if the CNN only memorizes the distinguishing features of the two image classes, then learning should be affected by the variability of the example images in each category.

A baseline architecture was established with four convolutional layers that was able to easily learn both the same-different and spatial-relation PSVRT problems with item size 4, image size 60, and two items in the image. This baseline CNN was then trained from scratch on a variety of PSVRT problems, each time using 20 million training images and a batch size of 50. There were three sub-experiments:

• Fixing item size (m) at 4, number of items (k) at 2, and varying image size (n) between 30 and 180.
• Fixing image size at 60, number of items at 2, and varying item size between 3 and 7.
• Fixing image size at 60, item size at 4, and varying the number of items between 2 and 6.

In all conditions, we found a strong dichotomy between SD and SR conditions. In SR, across all image parameters and in all trials, the model immediately learned at the start of training and quickly approached 100% accuracy, producing consistently high and flat mean ALC curves. In SD, however, we found that the overall ALC was significantly lower than SR.

Digging deeper, when learning did occur in SD, increasing item size never strained performance. But increasing the overall image size, or increasing the number of items did. (Gray bars in the above figures indicate the number of trials in which learning failed). The results suggest that straining is not simply a direct outcome of an increase in image variability. Using CNNs with more than twice the number of kernels (wide), or twice as many layers (deep) did not change the observed trend.

### What’s going on?

The authors hypothesise that the CNNs learn ‘subtraction templates’ when tackling SD problems: filters with one positive region and one negative region. Each relative arrangement of items requires a different subtraction template since each item must lie in one of the template’s two regions. If identical items lie in opposing regions, they are subtracted by the synaptic weights. The difference is used to choose the appropriate same/different label.

A strategy like this doesn’t require memorizing specific items, so item size doesn’t make much of a difference. However, image size (the biggest straining factor) exponentially increases the possible number of arrangements of items.

Our results indicate that visual-relation problems can quickly exceed the representational capacity of feedforward networks. While learning templates for individual objects appears to be tractable for today’s deep networks, learning templates for arrangements of objects becomes rapidly intractable because of the combinatorial explosion in the number of features to be stored… Given the vast superiority of humans over modern computers in their ability to detect visual relations, we see the exploration of attentional and grouping mechanisms as an important next step in our computational understanding of visual reasoning.

Relational inductive biases, deep learning, and graph networks Battaglia et al., arXiv’18

Earlier this week we saw the argument that causal reasoning (where most of the interesting questions lie!) requires more than just associational machine learning. Structural causal models have at their core a graph of entities and relationships between them. Today we’ll be looking at a position paper with a wide team of authors from DeepMind, Google Brain, MIT, and the University of Edinburgh, which also makes the case for graph networks as a foundational building block of the next generation of AI. In other words, bringing back and re-integrating some of the techniques from the AI toolbox that were prevalent when resources were more limited.

We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representation and computations are key to realizing this objective… We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and the rules for composing them.

### Relational reasoning and structured approaches

Humans represent complex systems as compositions of entities and their interactions. We use hierarchies to abstract away fine-grained differences, manage part-whole associations and other more general relationships, and can solve problems by composing familiar skills and routines.

… the world is compositional, or at least, we understand it in compositional terms. When learning, we either fit new knowledge into our existing structured representations, or adjust the structure itself to better accommodate (and make use of) the new and the old.

Modern deep learning methods have done incredibly well following an end-to-end design philosophy that avoids explicit structure and ‘hand-engineering’. Such an approach seems to be finding its limits when it comes to challenges such as complex language and scene understanding, reasoning about structured data, transferring learning, and learning from small amounts of experience.

Instead of choosing between nature (end-to-end) and nurture (hand-engineering), can we use them jointly just as nature does, to build wholes greater than the sum of their parts?

A structured representation captures a composition of known building blocks, with structured computations operating over the elements and their composition as a whole. Relational reasoning operates over structured representations combining entities and relations, using rules for how they can be composed.

Recently, a class of models has arisen at the intersection of deep learning and structured approaches, which focuses on approaches for reasoning about explicitly structured data, in particular graphs…. What these approaches all have in common is a capacity for performing computation over discrete entities and the relations between them. What sets them apart from classical approaches is how the representations and the structure of the entities and relations — and the corresponding computations — can be learned, relieving the burden of needing to specify them in advance.

### Relational inductive biases and graphs

We need some way of guiding a learning algorithm towards good solutions. The fancy term for this is inductive bias, the bias part being how an algorithm prioritises one solution over another, independent of the observed data. A regularization term is a form of inductive bias for example.

A relational inductive bias imposes constraints on relationships and interactions among entities in a learning process. Viewed through a relational lens, we can see relational inductive biases at work in several of the standard deep learning building blocks. We just need to uncover in each what are the entities, what are the relations, and what are the rules for composing them.

As an example, in a fully connected layer the entities are the units in the network, the relations are all-to-all, and the rules are specified by the weights and biases. “The implicit relational inductive bias in a fully connected layer is thus very weak: all input units can interact to determine any output unit’s value, independently across outputs.”

Within the building blocks we also see aspects of sharing (e.g. using a local kernel function across all inputs in a convolutional layer) and locality.

While the standard deep learning toolkit contains methods with various forms of relational inductive biases, there is no “default” deep learning component which operates on arbitrary relational structure. We need models with explicit representations of entities and relations, and learning algorithms which find rules for computing their interactions, as well as ways of grounding them in data.

A natural structure for this representation is a graph. Graphs support arbitrary (pairwise) relational structure, and computations over graphs afford a strong relational inductive bias. Many problems are easily modelled using a graph representation. For example:

### Introducing graph networks

There is a rich body of work on graph neural networks (see e.g. Bronstein et al. 2017 for a recent survey). Gilmer et al. introduced the message-passing neural network (MPNN) which unifies various graph neural network and graph convolutional approaches. Wang et al. introduced non-local neural networks (NLNN) unifying various ‘self-attention’ style methods. The graph network (GN) framework outlined in this paper generalises and extends graph neural networks, MPNNs, and NLNNs.

The main unit of computation in the GN framework is the GN block, a “graph-to-graph” module which takes a graph as input, performs computations over the structure, and returns a graph as output… entities are represented by the graph’s nodes, relations by the edges, and system-level properties by global attributes.

A graph is thus a three-tuple $G = (\mathbf{u}, V, E)$ where $\mathbf{u}$ is a global attribute, V is a set of nodes (each represented by attributes), and E is a set of edges (also represented by attributes).

GN blocks have three update functions $\phi$ and three aggregation functions $\rho$.

$\phi^{e}$ is mapped across all edges to compute per-edge updates, and then the $\rho^{e \mapsto v}$ function reduces the updated edge states to a single value $\mathbf{\bar{e}'}_i$ for each receiving node. $\phi^{v}$ is then mapped across all nodes. Finally, $\rho^{e \mapsto u}$ reduces all edge states and $\rho^{v \mapsto u}$ all node states to single global values, and the global attribute is updated using $\phi^{u}$. The $\rho$ functions must be invariant to permutations of their inputs, and should take variable numbers of arguments (e.g., elementwise summation, mean, max etc.). Computations proceed from the edge, to the node, to the global level.

The full update algorithm is given below.
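The update can be sketched in a few lines of dependency-free Python (my own minimal version with scalar attributes and sum aggregations; in the paper the `phi` functions would be learned, e.g. MLPs, and attributes would be vectors):

```python
def gn_block(u, V, E, phi_e, phi_v, phi_u):
    """One GN block update. V is a list of node attributes; E is a list of
    (sender, receiver, edge_attr) triples; u is the global attribute.
    All three rho aggregations are elementwise sums (permutation-invariant)."""
    # 1. per-edge updates: phi_e sees the edge, both endpoint nodes, and u
    E2 = [(s, r, phi_e(e, V[s], V[r], u)) for (s, r, e) in E]
    # 2. rho e->v: aggregate updated edges per receiving node, then update nodes
    V2 = []
    for i, v in enumerate(V):
        e_bar_i = sum(e for (_, r, e) in E2 if r == i)
        V2.append(phi_v(e_bar_i, v, u))
    # 3. rho e->u and rho v->u: global aggregations, then the global update
    e_bar = sum(e for (_, _, e) in E2)
    v_bar = sum(V2)
    u2 = phi_u(e_bar, v_bar, u)
    return u2, V2, [e for (_, _, e) in E2]
```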

Consider an example graph where nodes represent rubber balls and edges are springs connecting them. We want to predict the movements of the balls in an arbitrary gravitational field.

• The edge update step might compute the forces or potential energies between two connected balls.
• The edge reduction step might sum all the forces or potential energies acting on the ith ball.
• The node update step might compute the update position, velocity, and kinetic energy (attributes) of each ball.
• The edge-to-global reduction step might aggregate all of the edge updates, for example computing the summed forces (which should be zero in this case, by Newton’s third law).
• The node-to-global reduction step might then compute the total kinetic energy of the system.
• The global update step might compute something analogous to the net forces and total energy of the physical system.

Other update orders are possible (e.g. from global to node to edges).

Graph networks have (at least) three properties of interest:

1. The nodes, and the edges between them, provide strong relational inductive biases (e.g. the absence of an edge between two nodes means that they should not directly influence each other).
2. Entities and relations are represented as sets, and thus are invariant to ordering and permutations. This is a desirable property in many contexts (e.g. objects in a scene do not have a natural ordering).
3. Since the per-edge and per-node functions are reused across all edges and nodes respectively, GNs automatically support a form of combinatorial generalisation. A single GN can operate on graphs of different sizes and shapes.

### Graph network architectures

So far it seems like we’ve just been re-inventing a general graph computation framework. And indeed the framework is agnostic to specific representations and functional forms. But the fun stuff starts when we focus on deep learning architectures allowing GNs to act as learnable graph-to-graph function approximators. The three design principles behind graph networks which support this are:

• flexible representations
• configurable within-block structure, and;
• composable multi-block architectures

Flexible representations refers to the choice of representation for the attributes of edges, nodes, and global state, as well as the form of the outputs. For deep learning implementations, real-valued vectors and tensors would be a likely choice. Edge and node outputs would typically correspond to lists of vectors or tensors, and global outputs to a single vector or tensor. Thus the output of a GN block can be passed to other deep learning building blocks. The relational structure (the graph) can be explicitly specified in the inputs, or it can be inferred or assumed.

Configurable within-block structure refers to flexibility in how the internal structure and functions of a GN block are configured. The $\phi$ functions could themselves be implemented as neural networks for example (e.g. MLPs for vectors, or CNNs for tensors). There are GN configurations that correspond to many existing works in the literature. The following figure shows some examples.

Composable multi-block architectures refers to the fact that the output of one GN block can be fed into another GN block as an input. In general arbitrary numbers of GN blocks can be composed.

One common design is the encode-process-decode configuration in which an input graph is transformed into a latent representation by an encoder GN, a shared core block is applied multiple times to process the encoding, and then a decoder GN produces the final output.
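The composition itself is simple to express (a sketch under my own simplifying assumptions: `encoder`, `core`, and `decoder` stand in for graph-to-graph GN blocks, and some variants also concatenate the latent input back onto the hidden state at each core step):

```python
def encode_process_decode(graph, encoder, core, decoder, steps):
    """Encode-process-decode composition of GN blocks: encode the input graph
    into a latent representation, apply a single shared core block `steps`
    times, then decode the final hidden state into the output graph."""
    hidden = encoder(graph)
    for _ in range(steps):
        hidden = core(hidden)   # the same (shared-weight) core each time
    return decoder(hidden)
```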

### What’s next

Realizing the full potential of graph networks will likely be far more challenging than organizing their behaviour under one framework, and indeed, there are a number of unanswered questions regarding the best ways to use graph networks. One pressing question is: where do the graphs come from that the graph networks operate over?

For example, how should we convert raw sensory data into structured graph representations?

And beyond just graphs, “one takeaway from this paper is less about graphs themselves and more about the approach of blending powerful deep learning approaches with structured representations.”

… learnable models which operate on graphs are only a stepping stone on the path toward human-like intelligence. We are optimistic about a number of other relevant, and perhaps underappreciated, research directions, including marrying learning-based approaches with programs, developing model-based approaches with an emphasis on abstraction, investing more heavily in meta-learning, and exploring multi-agent learning and interaction as a key catalyst for advanced intelligence.

The authors state an intention to release an open-source implementation for graph networks later this year.