## End of Term, and the power of compound interest

Schools where I live are now breaking up for summer, and it’s time for The Morning Paper summer recess too. Over the last term, we’ve covered 67 papers and a broad range of topics. InfoQ are kindly working on another “Quarterly Review” publication (see here for the previous edition). As ever it’s hard to choose favourites, but these are my five picks:

- The Amazing Power of Word Vectors
- Deep Learning in Neural Networks, an overview
- How to build static checking systems
- Gorilla: A fast scalable in-memory time series database
- A survey of available corpora for building data-driven dialogue systems

Writing The Morning Paper is a labour of love. It’s also been a tremendous discipline for continuous learning. I have a hunch that the basic process would work well whatever it is you’re trying to improve:

- Commit to learning something new every day (every weekday is a more sustainable goal). The frequency is important here: little and often beats intermittent binges.
- Keep a journal (I happen to use a public blog, which is good for accountability, but a private notebook could work too) and *write down what you learn each day as if you are explaining it to someone else (or a future you!)*. (The process of writing is important I think.)

(*As I write, I realise this is very close to repeated application of the Feynman technique.*)

That’s it. Now let the power of compound interest do its work. Here are three ways the learning compounds for my example of reading research papers:

- Daily focused concentration improves your ability to concentrate
- Reading lots of papers helps you get better at quickly understanding the essence of a paper
- The more concepts, examples, techniques, algorithms etc. that you’ve seen, the more of the material in a paper is familiar to you, and so the quicker you can understand it.

Put all three together and your ability to absorb new material should keep getting better over time. Plus as a bonus, you end up with a great set of notes to fall back on when you inevitably can’t remember all of the details but just know that “I read something related to this a few months ago…”

Oh, and every once in a while it’s ok to take a break so that you don’t burn out.

The Morning Paper will be back on the 5th September. In the meantime…

### If you’re a regular reader,

Thank You! The interactions, likes and retweets all help to make writing The Morning Paper a very rewarding experience; it wouldn’t be the same without you. I know many of you don’t have the time to read every single write-up, so now is a good time to catch up on your backlog! The monthly archive links (July, June, May, April) are a good way to scan through previous posts.

If you know someone you think would enjoy The Morning Paper please send them this way.

### If you’re new here,

Welcome! The monthly archive links (July, June, May, April) are a good way to scan through previous posts and get an idea of whether or not you think this is for you. If after doing that you’d like to join me on my semi-random tour through computer science research then two good options are to follow me on twitter @adriancolyer where I announce each day’s paper and post a link, and to subscribe to the email edition of The Morning Paper so that you never miss an issue.

See you on the 5th September,

Regards, Adrian.

## Time-adaptive sketches for summarizing data streams

Time-adaptive sketches (Ada Sketches) for Summarizing Data Streams Shrivastava et al. *SIGMOD 2016*

More algorithm fun today, and again in the context of data streams. It’s the 3 V’s of big data, but not as you know it: Volume, Velocity, and Var… *Volatility*. Volatility here refers to changing patterns in the data over time, and that can make life awkward if you’re trying to extract information from a stream. In particular, the authors study the *heavy hitters* problem, but with a twist: we want to give more weight to recent trends.

In most applications that involve temporal data, most recent trends tend to be most informative for predictive purposes…. For instance, most recent variations in credit history are much stronger indicators of a person’s ability to make loan payments compared to variations in credit history from the distant past.

*Time-adaptive sketches* generalize sketching algorithms and have the property that they retain counts of heavy hitters with good accuracy, while also providing provable time-adaptive guarantees. Coming in at 16 pages, the essence of the paper, especially if you’re familiar with count-min sketches, is this: instead of increasing counters by 1 every time you see an item, increase them by *f(t)*, where *f(t)* is a monotone function in time. When you want to extract count estimates for time *t*, divide by *f(t)*. The authors experiment with a linear function *f(t) = at* for fixed *a* (0.5), and also an exponential function *f(t) = a^t* for fixed *a* (1.0015). Both gave good results.

Finishing the write-up here though would be to short-change you. We’re interested in why this works, and what guarantees it gives. Plus the paper also gives an excellent tour through some of the prior approaches to solving the heavy hitters problem. Let’s start there, with a very quick recap on the basic Count-Min Sketch (CMS) algorithm.

### Count-Min Sketch

Create an integer array initialised to zeros that is *w* wide and *d* deep. Take *d* pairwise independent hash functions, *h_1, …, h_d*, and associate one with each row of the table; these functions should produce a value in the range *1..w*. When a new value is seen, for each row of the table, hash the value with the corresponding hash function, and increment the counter in the indicated array slot.

If you want to know the estimate of how many instances of a given value have been seen, hash the value as previously and look up the counter values that gives you in each row. Take the smallest of these as your estimate.
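To make the recap concrete, here’s a minimal Python sketch of CMS. This is my own toy illustration, not code from the paper: salted built-in hashes stand in for the pairwise independent hash functions the analysis assumes.

```python
import random

class CountMinSketch:
    """Toy Count-Min Sketch: d rows of w counters, one hash function per row."""
    def __init__(self, w, d, seed=42):
        self.w, self.d = w, d
        rng = random.Random(seed)
        # One salt per row stands in for d pairwise independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _slot(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def update(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.d):
            self.table[row][self._slot(row, item)] += count

    def estimate(self, item):
        # Every row over-estimates (collisions only ever add), so take the min.
        return min(self.table[row][self._slot(row, item)] for row in range(self.d))
```

Note that estimates can only ever be too high, never too low – the property the pre-emphasis trick relies on.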

### Hokusai – nearly but not quite

Hokusai-sketching (Matusevych et al. 2012) introduced an *item aggregation* algorithm for constructing time-adaptive sketches.

Hokusai uses a set of Count-Min sketches for different time intervals, to estimate the counts of any item for a given time or interval. To adapt the error rate temporally in limited space, the algorithm uses larger sketches for recent intervals and sketches of smaller size for older intervals.

At the end of a time interval (e.g T), a sketch needs to be moved into the next-sized-down sketch, (the one for T–1). Hokusai has a very elegant way of doing this: at each rung on the ladder, sketch widths are halved. You can therefore compress a larger sketch into a smaller one by simply adding one half of the sketch to the other, and also halving the hash function ranges using modulo 2 operations.
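Assuming *w* is a power of two and slots are assigned with *h(x) mod w*, the halving step can be sketched in a couple of lines of Python (my illustration, not the paper’s code):

```python
def compress(row):
    """Fold one sketch row in half: old slots i and i + w/2 merge into new
    slot i. Because w is a power of two, h(x) mod (w/2) sends both old
    slots to the same new slot, so the folded counts stay consistent with
    the halved hash range. No counts are lost, they just share slots more
    coarsely."""
    half = len(row) // 2
    return [row[i] + row[i + half] for i in range(half)]
```
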

Although this idea of having different-size sketches for different time intervals is reasonable and yields accuracies that are time-adaptive, it comes with several inherent shortcomings.

### Inspiration – Dolby noise reduction!

This might date some of The Morning Paper readers – do you remember Dolby B noise reduction? And then the exciting introduction of Dolby C? Some of us grew up with music on cassette tapes, and Dolby Noise Reduction was ever present.

When recording, Dolby systems employ *pre-emphasis* – artificially boosting certain parts of the input signal. On playback, the reverse *de-emphasis* translation restores the original signal levels. This process helps to improve the signal-to-noise ratio and combat tape hiss.

We exploit the fact that Count-Min Sketch (CMS)… has better accuracy for heavy-hitters as compared to the rest of the items. While updating the sketch we apply pre-emphasis and artificially inflate the counts of more recent items compared to older ones, i.e., we make them heavier with respect to the older items. This is done by multiplying updates *c_i^t* with *f(t)*, which is any monotonically increasing function of time *t*. Thus, instead of updating the sketch with *c_i^t* we update the sketch with *f(t) × c_i^t*. The tendency of the sketch is to preserve large values. This inflation thus preserves the accuracy of recent items, after artificial inflation, compared to the older ones.

On querying of course, the de-emphasis process must be applied, which means dividing the results by *f(t)* to obtain the estimate of item *i* at time *t*. In the absence of collisions, as with the base CMS, counts are estimated exactly. Consider a CMS with only one row, and the case when two independent items *i* and *j* collide. We see *c_i^t* instances of *i* at time *t*, and *c_j^t′* instances of *j* at time *t′*. With plain CMS, we would over-estimate the count for *i* by *c_j^t′*, whereas with the pre-emphasis process we overestimate by *(f(t′) × c_j^t′) / f(t)*. Therefore it is easy to see that more recent items suffer less compared to older items.

### Adaptive CMS

The Adaptive Count-Min Sketch algorithm (Ada-CMS) is just CMS but with the update and query mechanisms adapted to use the pre-emphasis and de-emphasis mechanism just described. Note that when *f(t) = 1* we obtain the original CMS algorithm.

By choosing appropriate *f(t)* functions, we can tailor the behaviour for different situations.
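Here’s a minimal Ada-CMS sketch in Python (my own illustration: salted built-in hashes stand in for proper pairwise independent hash functions, and counters become floats to hold the inflated updates):

```python
import random

class AdaCMS:
    """Toy Ada-CMS: pre-emphasise updates by f(t), de-emphasise on query."""
    def __init__(self, w, d, f, seed=7):
        self.w, self.d, self.f = w, d, f
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.table = [[0.0] * w for _ in range(d)]

    def _slot(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def update(self, item, t, count=1):
        # Pre-emphasis: inflate the increment by the monotone function f(t).
        for row in range(self.d):
            self.table[row][self._slot(row, item)] += self.f(t) * count

    def estimate(self, item, t):
        # De-emphasis: divide the usual min-counter estimate by f(t).
        return min(self.table[row][self._slot(row, item)]
                   for row in range(self.d)) / self.f(t)
```

For example, `AdaCMS(w=2**18, d=4, f=lambda t: 1.0015 ** t)` mirrors the exponential setting used in the paper’s experiments.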

One major question we are interested in is “Given a fixed space and current state of time T, what are the values of time *t ≤ T* where Ada-CMS is more accurate than vanilla CMS?”

For a given *w* and *d*, we can see as a start that the expected error of Ada-CMS will be less than that of CMS if:

*√(f(1)² + f(2)² + … + f(T)²) / f(t) ≤ √T*

For *t = T* this will always be true (due to the monotonicity requirement on *f(t)*: each term *f(t′)² ≤ f(T)²*, so the left-hand side is at most *√T*). The upper bound on the error with vanilla CMS is *√T*, so Ada-CMS wins when its error is less than this.

To illustrate a reasonable scenario, suppose we want the errors with Ada-CMS to be never off by a factor γ away from that of vanilla CMS ∀ t. This ensures that we guarantee accuracy within a factor γ of what the original CMS would achieve to even very old heavy hitters. In addition, we want to be more accurate than CMS on all recent time *t > K*, for some desirable choice of *K*.

With a couple of simple manoeuvres (see section 5.2), this turns into solving the following pair of simultaneous equations:

### Other applications

The pre-emphasis and de-emphasis technique can be used in a number of other scenarios. The authors show an example with the *Lossy Counting* algorithm, and also how it can be applied to range queries (see §6).

### Evaluation

Experimental evaluation is undertaken with two real-world streaming datasets from AOL (36M search queries with 3.8M unique terms) and Criteo (150K unique categorical terms). Comparison is undertaken between vanilla CMS, the Hokusai algorithm, Ada-CMS with a linear function (*f(t) = 0.5t*), and Ada-CMS with an exponential function (*f(t) = 1.0015^t*). In all cases *d = 4*, and *w* was varied from 2^10 to 2^23 to see the impact of varying range sizes. Here are the results for the AOL dataset:

Here’s the standard deviation of those errors with *w = 2^18*:

### The Last Word

The proposed integration of sketches with pre-emphasis and de-emphasis, as we demonstrate, possesses strong theoretical guarantees on errors over time. Experiments on real datasets support our theoretical findings and show significantly superior accuracy and runtime overhead compared to the recently proposed Hokusai algorithm. We hope that our proposal will be adopted in practice, and it will lead to further exploration of the pre-emphasis and de-emphasis idea for solving massive data stream problems.

## Range thresholding on streams

Range thresholding on streams Qiao et al. *SIGMOD 2016*

It’s another streaming paper today, also looking at how to efficiently handle a large volume of concurrent queries over a stream, and also claiming a significant performance breakthrough of several orders of magnitude. We’re looking at a different type of query though, known as a *range threshold* query. These are queries of the form “tell me when *x* events have been received with a value within this range.” For example: “*Alert me when 100,000 shares of AAPL have been sold in the price range [100,105) starting from now,*” and multi-dimensional variants such as “*Alert me when 100,000 shares of AAPL have been sold in the price range [100,105), and at the time of sale the NASDAQ index is at 4,600 or lower.*”

More formally we’re looking at a data stream of elements *e*, each with a *value v(e)* in some *d*-dimensional space, each dimension with a real-valued domain. Each element also has a *weight w(e)* – a positive integer which defaults to 1 (counting). A Range Thresholding on Streams (RTS) query *q* consists of a d-dimensional range specification *R_q* (that a point must fall within to be considered a match), and a threshold *τ_q ≥ 1*. When sufficient matching elements have been seen such that their combined weight ≥ *τ_q*, the query *matures* (and presumably the author of the query is notified). The *register* operation registers a new query, and *terminate* stops and eliminates a query.

RTS may look deceptively easy at first glance: whenever a new element *e* arrives, check whether *v(e)* is in *R_q*; if so, increase the weight of *R_q* by *w(e)* – clearly a constant time operation overall. This method, unfortunately, scales poorly to the number *m* of queries running at the same time: now it costs O(m) time to process *e*. The total cost of processing *n* stream elements thus becomes a prohibitive quadratic term O(nm). This implies that the method is computationally intractable when *m* is large. Surprisingly, in spite of the fundamental nature of RTS and the obvious defects of “conventional solutions” like the above, currently no progress has been made to break the quadratic barrier.

Game on! The authors provide the first algorithm to break the quadratic barrier, giving a solution whose running time is near-linear in n + m, up to polylogarithmic factors. That makes this kind of difference when addressing the problem at scale (DT here is *Distributed Tracking*, the name of the authors’ algorithm):

The most crucial discovery is an observation on the connection between RTS and another surprisingly remote problem called *distributed tracking*. The observation then leads to the development of a novel algorithmic paradigm which, we believe, serves as a powerful weapon for implementing real-time triggers such as RTS.

First therefore, we need to take a brief look at the distributed tracking problem. Then we’ll see how it can be applied to solve the RTS problem when each stream element contains a single value (d=1), and its weight is 1 (i.e. we simply need to count matching events). In addition, we’ll assume initially that all queries are known up-front. We will then add in turn: the ability to register and terminate queries dynamically; matching multi-dimensional elements; and varying weights.

### Distributed tracking background

In the distributed tracking problem there are *h* participants and one coordinator. Communication is only allowed between the coordinator and the participants (participants do not talk directly to each other). Each participant has an integer counter *c*, and at each time stamp, *at most one* of the counters is increased by one. The coordinator must report maturity when the sum of all the counter values reaches some threshold τ.

The simple solution is to have each participant send a message to the coordinator whenever its counter increases. This requires O(τ) messages. Whenever *τ ≤ 6h* we use this solution. For *τ > 6h* we can do better. First the coordinator calculates a slack value λ, where *λ = floor(τ/2h)*. The coordinator sends the slack value to each participant, and every participant sends a 1-bit signal to the coordinator each time its counter has increased by λ. Once the coordinator has received *h* signals, it collects the precise counter values from all participants and calculates τ′: τ reduced by the summed count. This completes a round. We keep going until τ′ ≤ 6h, at which point we resort to the simple solution to finish the process.

In the example above where τ is 80 and h = 10, we’ll start off with a slack value of floor(80/(2×10)) = 4. When 10 signals have been received the true count must be between 40 and 67 (the other 9 participants could each be up to 3 increments away from sending another signal: 9×3 = 27). The new value of τ will therefore be between 13 and 40, both below 6h, and so the algorithm would then resort to counting individual increments.

This solution is O(h log τ).
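The round structure is easy to simulate. Here’s a small Python sketch of the protocol as described above (my own illustration, with coordinator and participants collapsed into one loop):

```python
def track(increments, h, tau):
    """Simulate the round-based distributed tracking protocol: participants
    signal every lam increments; after h signals the coordinator collects
    exact counters and shrinks the threshold. `increments` is a sequence of
    participant ids (one +1 per time stamp). Returns the time stamp at
    which the coordinator reports maturity, or None if the stream ends
    first. Illustrative sketch only."""
    counters = [0] * h            # true per-participant counters
    since = [0] * h               # increments since each participant's last signal
    remaining = tau               # tau', shrunk at the end of each round
    lam = remaining // (2 * h) if remaining > 6 * h else 1
    signals = 0
    for i, p in enumerate(increments):
        counters[p] += 1
        since[p] += 1
        if since[p] < lam:
            continue              # slack not yet used up: no message sent
        since[p] = 0
        signals += 1
        if lam == 1:
            # simple solution (tau' <= 6h): every increment is reported
            if sum(counters) >= tau:
                return i
        elif signals >= h:
            # end of round: collect exact counters and shrink the threshold
            remaining = tau - sum(counters)
            if remaining <= 0:
                return i
            lam = remaining // (2 * h) if remaining > 6 * h else 1
            since = [0] * h
            signals = 0
    return None
```

Because λ = floor(τ/2h), the unreported slack across participants can never push the true sum past τ mid-round, so maturity is always reported on time.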

### A solution to the constrained RTS problem

Now we can turn our attention to the RTS problem with a fixed set of queries, d=1, and w=1. Recall that each query has an associated range, e.g. *v(e) ∈ [100,105)*. Create a binary search tree based on the *endpoints* of those range intervals. This tree will have height O(log m).

Each node in this endpoint tree will have responsibility for a certain portion of the overall range, called its “jurisdiction interval” as shown in the diagram above.

Along comes a stream element *e*… At each level in the tree, *v(e)* will fall into the range of exactly one node (unless it is less than the leftmost leaf value, in which case it can be discarded). Each node in the tree maintains a counter, and as *e* works its way down through the tree, the counter of every node it passes through is incremented. Once *e* has reached a leaf node it can be discarded (we don’t need to store elements).

The tree itself can be constructed in O(m log m) time, and processing each element takes O(log m) time.
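A toy Python version of the endpoint tree (my own illustration: nodes are identified by the index range of the leaf slots in their jurisdiction, and their counters live in a dict):

```python
import bisect

class EndpointTree:
    """Binary tree over query-range endpoints; each node counts the
    elements routed through its jurisdiction interval. Leaf slot i covers
    [pts[i], pts[i+1]), the last slot extending to +infinity. Toy sketch."""
    def __init__(self, endpoints):
        self.pts = sorted(set(endpoints))
        self.counts = {}          # (lo, hi) leaf-slot range -> counter

    def insert(self, v):
        i = bisect.bisect_right(self.pts, v) - 1   # leaf slot for v
        if i < 0:
            return                # left of every endpoint: discard
        lo, hi = 0, len(self.pts)                  # root covers all slots
        while True:               # O(log m) descent, bumping each counter
            key = (lo, hi)
            self.counts[key] = self.counts.get(key, 0) + 1
            if hi - lo == 1:
                break             # reached a leaf; element can be discarded
            mid = (lo + hi) // 2
            if i < mid:
                hi = mid
            else:
                lo = mid
```
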

For any given query, define its *canonical node set* *U_q* to be the minimum set of nodes with disjoint jurisdiction intervals that cover the range. This will be at most 2 nodes per level of the tree. Continuing our worked example, the canonical node set for a query with range [18,105) is shown below.

Now we can map the RTS problem onto the *distributed tracking* problem we saw earlier. Let the participants be the nodes in the canonical node set *U_q*. Let the counters of each node be the counters of the participants, and have the query itself play the role of coordinator. Whenever the counter of a node goes up as an element passes through the tree, treat this as if the counter of the participant has gone up by one.

Every query therefore has its own distributed tracking instance, so conceptually there are *m* such instances executing concurrently. The maintenance of node counters is common to every instance. When a node counter changes, we should inspect the condition for all queries that include that node in their canonical node set. If we did that though, we’d be back to essentially quadratic time again. Instead we maintain a *min-heap* per node of all the queries including that node in their canonical set, ordered by the future value of the node counter at which the next signal should be sent. Now it suffices only to check against the top entry on the heap, and if the counter has reached that value, send the signal and discard the entry from the heap.
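The per-node check can be sketched like this (a hypothetical representation of mine: each heap entry is `(trigger_value, query_id, slack)`, where the trigger value is the node counter at which that query’s next 1-bit signal is due):

```python
import heapq

def bump(node):
    """Increment a node's counter and emit a signal for every query whose
    next trigger value has been reached. `node` is a dict with a 'counter'
    and a 'heap' of (trigger_value, query_id, slack) entries. Only the top
    of the heap is inspected unless it fires. Illustrative sketch."""
    node['counter'] += 1
    fired = []
    heap = node['heap']
    while heap and heap[0][0] <= node['counter']:
        trigger, qid, lam = heapq.heappop(heap)
        fired.append(qid)                          # 1-bit signal to query qid
        heapq.heappush(heap, (trigger + lam, qid, lam))   # schedule next one
    return fired
```
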

When a query matures, its entry is removed from the heaps of all of its canonical nodes. As more and more queries mature, the space consumed by matured queries needs to be reclaimed at some point. Once the number of currently alive queries has decreased to m/2 (i.e. half of the queries have matured) then a global rebalancing happens and the endpoint query tree is rebuilt using just the currently alive queries (which will have their thresholds adjusted accordingly for the new tree).

See section 4 of the paper for an analysis of the time complexity of this algorithm.

### Dynamic queries

Coping with query termination is easy – simply follow the same process as above for when a query matures. When registering a new query, we must insert its endpoints into the tree, which may trigger tree rebalancing operations. This rebalancing can upset the canonical node sets of many queries.

We circumvent this problem by resorting to another, much more practical algorithmic technique called the *logarithmic method*. It is a generic framework that allows one to convert a semi-dynamic structure supporting only deletions to a fully-dynamic structure that is able to support both deletions and insertions.

In brief, this consists of maintaining *g* endpoint trees, where g = O(log m). We start out with one tree, and a new tree is created each time the number of alive queries reaches a new power of 2. The first tree has capacity for one query, the second for two queries, the third for 4 queries, and so on. When a new query comes along to be registered, we find the first tree, Τ_{j} with spare capacity (creating it if needed). Then the queries from all trees up to and including Τ_{j} are rebuilt into a new Τ_{j} (and all trees prior to j are reset to empty). The target thresholds for the pre-existing queries are adjusted appropriately when they are placed into the new tree.
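A sketch of that bookkeeping in Python (my illustration: each “tree” is just the list of queries it holds, and the actual endpoint-tree rebuild over the merged queries is elided):

```python
def register(trees, query):
    """Logarithmic method: tree j has capacity 2**j and is kept either full
    or empty. Find the first non-full tree, then rebuild it from the new
    query plus all queries in earlier trees (which exactly fill it), and
    empty those earlier trees. Returns the rebuilt tree's index. Sketch."""
    j = 0
    while j < len(trees) and len(trees[j]) == 2 ** j:
        j += 1
    if j == len(trees):
        trees.append([])                       # create tree j on demand
    merged = [q for t in trees[:j + 1] for q in t] + [query]
    for t in trees[:j]:
        t.clear()                              # earlier trees become empty
    trees[j] = merged                          # rebuild endpoint tree here
    return j
```
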

### Multi-dimensional elements

To deal with multi-dimensional elements (queries along more than one dimension), we can follow the principles set out so far:

- Partition the search region *R_q* (i.e., an interval when d = 1) into O(1) components, for each of which the number of elements within is kept precisely at a node.
- Many queries can share the same components such that the total number of distinct components is O(m_alive).

When d = 1 we used a binary tree to partition the search region. For d > 1 we can borrow ideas from *range trees*. If d = 2, then each query will specify a 2-dimensional region [x_1,x_2) × [y_1,y_2), meaning that we’re looking for elements where *x_1 ≤ v(e).x < x_2* and *y_1 ≤ v(e).y < y_2*.

Make a binary search tree using the endpoints in the first dimension (x) as before. Now for each node *u* in this tree, find all the queries that have *u* in their (x-) canonical node set. Associate *u* with a second binary tree built on the y-dimension query endpoints from those queries.

Here’s an example:

The counters are kept in the secondary trees, and everything else in the algorithm follows the steps previously outlined. If you have more than 2 dimensions, just keep on adding an additional dimension-tree for each tree using the same process.

### Varying weights

This relates to a variant of the distributed tracking problem we started with, but where in each time step the (at most one) counter that gets increased, is increased by any positive integer *w*. A simple approach is to transform this into the unweighted problem by increasing the associated counter by one, *w* times. The drawback is that we incur O(1) cost for each of those unit increments – O(w) per element.

We can revise the distributed tracking problem as follows to address this:

- For *τ > 6h*, compute slack in the same way (*λ = floor(τ/2h)*), and send it to each participant at the start of the round.
- Send a signal from a participant when its counter has increased by λ since the last signal (as per the previous algorithm), *unless the coordinator has announced the end of this round*.
- The coordinator collects all counter values when it has received *h* signals, and *also announces that the round is finished*.

To reach the final version of the DT-RTS algorithm, replace the original distributed tracking algorithm we started with, with this updated version.

### The last word

In this paper, we have developed the first algorithm whose running time successfully escapes the quadratic trap – even better, the algorithm promises computation cost that is near-linear to the input size, up to a polylogarithmic factor. Extensive experimentation has confirmed that the new algorithm, besides having rigorous theoretical guarantees, has excellent performance in practical scenarios, outperforming alternative methods by a factor up to orders of magnitude. It can therefore be immediately leveraged as a reliable fundamental tool for building sophisticated triggering mechanisms in real systems.

## Sharing-aware outlier analytics over high-volume data streams

Sharing-aware outlier analytics over high-volume data streams Cao et al. *SIGMOD 2016*

With yesterday’s preliminaries on skyline queries out of the way, it’s time to turn our attention to the Sharing-aware Outlier Processing (SOP) algorithm of Cao et al. The challenge that SOP addresses is that of building a stream-based outlier detection system that can handle large numbers of outlier detection queries *simultaneously*. It’s easy to imagine multiple analysts querying the same core data streams, and even a single analyst wanting to run multiple ongoing queries, perhaps with varying parameters.

Thus, a stream processing system must be able to accommodate a large outlier analytics workload composed of hundreds or more requests covering many, if not all, major parameter settings of an outlier query, and thus striving to capture the most valuable outliers in the stream.

We’re talking about a very particular kind of outlier detection here: detecting outlying *points* (not patterns), and doing so based on a *distance* measure. It’s easy to detect outliers based on fixed thresholds, but where do you set the threshold? Fixed thresholds are also very sensitive to concept drift.

For distance-based outlier methods,

… an outlier is an object *O* with fewer than *k* neighbours in the dataset *D*, where a neighbor is defined to be any other object in *D* that is within a distance range *r* from object *O*.

In a streaming context we also need to apply sliding window semantics to ensure that outliers are continuously detected based on the most recent portion of the input stream only. This gives us two additional parameters: the *window-size w* and the *slide-size s* (time or count based).

Consider for example monitoring for credit fraud in a banking scenario, where we have a stream of transaction data for individuals with similar income levels. The distance parameter *r* tells us how dissimilar to other transactions the transaction value must be, the *k* parameter tells us how far from the majority the transaction must be, and the window size tells us over how many days. “Tell me all transactions that have less than 5 peer transactions within £10,000 over the last 7 days.”
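For a single query, the semantics are easy to state as a naive, quadratic per-window baseline – this is a toy sketch of mine over 1-d values, i.e. the conventional approach SOP improves upon, not the SOP algorithm itself:

```python
def outliers(window, k, r):
    """Naive distance-based outlier check over one window: a point is an
    outlier if it has fewer than k neighbours within distance r. O(n^2)
    per window. Illustrative 1-d sketch only."""
    out = []
    for i, p in enumerate(window):
        neighbours = sum(1 for j, q in enumerate(window)
                         if j != i and abs(p - q) <= r)
        if neighbours < k:
            out.append(p)
    return out
```

Run once per window slide, and once per (w, s, k, r) combination, this is exactly the per-query cost that sharing across queries is meant to eliminate.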

When processing many such queries simultaneously a given point may of course be an inlier from the perspective of some queries, and an outlier for others. SOP handles this situation while still preserving the property that each point is processed only once.

As (the) foundation of our solution, we make the important observation that a workload composed of multiple outlier requests with arbitrary parameter settings can be correctly answered by utilizing *one single skyband query* – a generalization of the well-known skyline concept. Better yet we prove that given one data point, the skyband points discovered by the skyband query are the *minimal* yet *sufficient* evidence required to prove its outlier status with respect to *all* outlier queries in the workload.

Compared to the previous state-of-the-art (running multiple LEAP queries, or the MCOD algorithm) SOP uses *three orders of magnitude less CPU* and significantly less memory across a wide range of scenarios. “Furthermore, it is the only known method that scales to huge workloads composed of thousands of outlier requests.”

Let’s take a look at how it works…

### Turning distance-based outlier detection into a skyband query

Building up to the full solution, we start out with a group of queries that all use the same window size, slide size, and *k*, but differ in their distance value, *r*. Recall that a K-Skyband query reports all points that are dominated by no more than *K* points. To map the anomaly detection queries into a K-Skyband problem, we therefore have to come up with a suitable definition of the *domination relationship* between any two points in the current data window.

The key observation here is that given any two points p_i and p_j, two key factors, namely their relative arrival time and the distance to the point *p* under evaluation, determine whether p_i is more important than p_j in terms of evaluating the outlier status of *p*.

Take all of the queries and arrange them in order of increasing *r* value, such that *r_m* represents the distance parameter for the *m*-th query. Now consider two points *p_i* and *p*, where the distance between them *dist(p_i,p)* falls between *r_{m−1}* and *r_m*. You can see the possible queries for which *p_i* can be a neighbour of *p* highlighted in blue below (*q_m … q_n*).

Above we can also see the possible queries for which *p_j* can be a neighbour of *p*, when *dist(p_j,p)* falls between *r_m* and *r_{m+1}*. From this it is clear that *p_i* dominates *p_j* (is more important than *p_j*) when evaluating queries.

_{j}Based on this idea, the *normalized distance* between two data points *p* and *p _{i}* is defined as m+1, where

*dist(p,p*falls between

_{i})*r*and

_{m}*r*, with

_{m+1}*r*defined as -∞, and r

_{0}_{n+1}defined as ∞. Henceforth, whenever we refer to distance or ‘dist’ we will mean this normalized distance.
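In code, the normalized distance is just a binary search over the sorted radii (my sketch; I assume ties count as “within r”, i.e. dist in (r_m, r_{m+1}] maps to m+1):

```python
import bisect

def normalized_dist(dist, rs):
    """Normalized distance for a workload with radii rs sorted increasingly
    (r_1..r_n): return m+1 where dist falls in (r_m, r_{m+1}], treating
    r_0 = -inf and r_{n+1} = +inf. A value of n+1 means p_i is not a
    neighbour for any query. Illustrative sketch."""
    return bisect.bisect_left(rs, dist) + 1
```
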

In the time dimension, the *younger* a data point *p_i* is, the longer its relationships with *p* (if any) will persist into the future. Putting the time and distance criteria together gives us the rules for the domination relationship we are after: point *p_i* dominates point *p_j* with respect to point *p* if

- *p_i.time > p_j.time*
- *dist(p,p_i) ≤ dist(p,p_j)*
- *dist(p,p_i) ≤ n*

In other words, given a data point *p*, *p_i* dominates another point *p_j* only if *p_i* expires later than *p_j* in the current window, and it is not further away from *p* than *p_j*. The third condition in the domination rule filters out any data point *p_i* that is not a neighbour of *p* for any query in the query set.
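The domination rule is a tiny predicate (my sketch: each point carries an arrival time and a pre-computed normalized distance to the point *p* under evaluation, and *n* is the number of queries):

```python
def dominates(pi, pj, n):
    """p_i dominates p_j (w.r.t. the point p under evaluation) iff p_i is
    younger, at least as close to p in normalized distance, and a neighbour
    of p for at least one of the n queries. Each point is a (time, ndist)
    pair. Illustrative sketch."""
    ti, di = pi
    tj, dj = pj
    return ti > tj and di <= dj and di <= n
```
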

Using this definition of domination, a K-skyband query with *K* specified as *k−1* is both *sufficient* and *necessary* to continuously determine the outlier status of *p* with respect to all queries in the set. The skyband query results will always return the *k* nearest neighbours of *p* as part of the skyband points. The outlier status of *p* with respect to each query can then be determined by examining the distance between *p* and its k-th nearest neighbour. If this falls between *r_m* and *r_{m+1}* then *p* is guaranteed to be an outlier for queries {q_1, …, q_m} and an inlier for queries {q_{m+1}, …, q_n}.
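The final read-out is then a one-liner (my sketch: given the normalized distance of *p*’s k-th nearest neighbour, and the *n* queries sorted by increasing radius):

```python
def status_per_query(kth_ndist, n):
    """Given the normalized distance m+1 of p's k-th nearest neighbour,
    p is an outlier for q_1..q_m (their radii are below the k-th NN
    distance) and an inlier for q_{m+1}..q_n. Illustrative sketch."""
    m = kth_ndist - 1
    return ['outlier'] * m + ['inlier'] * (n - m)
```
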

### Optimising the skyband query with K-SKY

We could use a traditional K-skyband algorithm for this, such as BBS, but the authors introduce a new K-SKY algorithm that more efficiently supports multiple outlier detection queries. Assuming data points are sorted by arrival time on arrival, then K-SKY only needs to consider distance as later arrivals will never be dominated by earlier arrivals.

K-SKY always conducts the search (for a window) with a ‘later arriving data points first’ order. By this if one data point is not dominated by more than k points in the distance attribute, and thus considered to be a skyband point, then it is not necessary to evaluate it again.

When a window slides, assuming that the slide size is less than the window size (so that windows overlap) we can further reduce computation:

In the sliding window context, the K-SKY search is applied in two situations. First, any new point *p* that just arrived in the current window needs K-SKY to figure out its skyband points in the current window. Second, an existing point *p* needs K-SKY to update its skyband points when the stream slides to the current window. In the first situation, for a newly arriving point *p*, K-SKY has to be conducted from scratch to search for the needed information for *p*. Instead in the second situation the key observation here is that given the skyband points of the previous window, to acquire the skyband points of the new window only a small fraction of data points in the new window need to be evaluated, namely the new arrivals and the unexpired skyband points of the previous window.

The skyband is computed every time the window moves. Key to the effectiveness of K-SKY is the cost of evaluating each point to determine whether or not it is a skyband point. “Since the number of points examined by K-SKY has already been proven to be minimal, the reduction of the second cost per point is now critical for high performance.” The *LSky* layered data structure plays a key role in helping to make this determination efficient.

In LSky, skyband points are organized into a layered two dimensional structure that preserves the order among the skyband points in both the distance and the time dimensions. As shown in Fig. 2, the points in each layer have the same distance to point p based on the normalized distance function. The points in the upper layer always have a smaller distance to p than the points in lower layers. Furthermore, in each layer the points are ordered based on their arrival time with the earliest arrival being at the head. By this, skyband points can be quickly expired when the window slides forward in time.

A point *p _{i}* to be inserted into the LSky structure is guaranteed to be dominated by points falling in the same layer (since they have been inserted before *p _{i}*, and we process the window in reverse time order, they must be more recent) and by points in the layers above it. If in total there are fewer than *k* such points, then *p _{i}* will be a skyband point.

### Varying *k*

So far we’ve assumed all of our queries have the same *k*. To relax this constraint, find the largest *k* in the query set, and run a skyband query using K-SKY with this *k*. During LSky processing, it is possible to determine whether a point is an inlier for a given query without introducing any extra overhead. At the end of processing, *p* is reported as an outlier for all queries that do not mark it as an inlier.

### Varying window size

If window sizes vary across queries, but the slide size is constant then all queries will slide to a new window at the same time. We use a similar trick to dealing with varying *k*, and detect outliers with a single skyband query using the largest window in the query set. Skyband points discovered in this window can be used to answer all queries in the group. An additional outlier status evaluation step is needed for all queries with window size smaller than the maximum. Since I’m running out of space for this write-up, I’ll defer you to section 4.1 in the paper to see how this works. The important thing to note for now is that there is a way to make this extra step efficient.

### Varying slide size

A similar trick can be played if window size is held constant, but the slide size varies. Here the single skyband query is issued with the slide size set as the greatest common divisor of the slide sizes of all the anomaly queries.
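The shared slide computation is just a greatest common divisor over the group (the slide sizes below are made-up values):

```python
from functools import reduce
from math import gcd

# Hypothetical slide sizes (e.g. in seconds) for the queries in one group.
slide_sizes = [10, 15, 25]

# The single shared skyband query slides at the greatest common divisor,
# so every member query's window boundaries align with some shared slide.
shared_slide = reduce(gcd, slide_sizes)
```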

### Putting it all together

SOP first employs a query parser to divide the queries in a query group into sub-groups based on their k parameters. Queries with the same k parameter are grouped into the same sub-group. The queries in each sub-group are then sorted based on their r parameters. The queries with the same r parameters are further sorted based on their window sizes. Then the query parser will create one skyband query for each outlier sub-query group. Its window size is set as the largest window size among the member queries of the sub-group. Its slide size is then set as the greatest common divisor of the slide sizes of the member queries. After the query parser transforms the outlier detection queries into skyband queries, the K-SKY algorithm for multiple skyband queries will be applied to detect the skyband points. Then the outlier status evaluator determines the outlier status of each data point with respect to the outlier queries using the inlier rule.

SOP requires only a single pass through new data points, with each point collecting the minimum evidence needed to prove its outlier status with respect to all queries.

### Evaluation

In the most general case, we prepare four workloads composed of 100, 1,000, 10,000, and 50,000 queries respectively by varying all window-specific and pattern-specific parameters. We observe that similar to the cases of independently varying pattern-specific parameters and window-specific parameters, SOP achieves tremendous gain in CPU utilization compared to (augmented) MCOD and (the non-shared) LELP. Furthermore, SOP shows excellent scalability in the cardinality of the workload.

The memory usage of SOP also consistently outperforms the competition:

## Progressive skyline computation in database systems

Progressive skyline computation in database systems Papadias et al. *SIGMOD 2003*

I’m still working through some of the papers from SIGMOD 2016 (as some of you spotted, that was the unifying theme for last week). But today I’m jumping back to 2003 to provide some context for a streaming analytics paper we’ll be looking at tomorrow. The subject at hand is *skyline* computation.

Not that sort of skyline.

The skyline operator is important for several applications involving multi-criteria decision making. Given a set of objects p _{1}, p _{2}, …, p _{n}, the operator returns all objects p _{i} such that p _{i} is not dominated by another object p _{j}.

A couple of examples help to make the idea much clearer. The running example throughout the paper is a hotels dataset with two attributes per hotel: price and distance from the beach.

Given some preference or scoring function that is monotone on all attributes (e.g. *min*) we can compute a skyline. A point *p _{i}* dominates another point *p _{j}* if its coordinate on any axis is not larger than the corresponding coordinate of *p _{j}*.

```
dominates(h1, h2) :
    (h1.distance <= h2.distance) &&
    (h1.price <= h2.price) &&
    ((h1.distance < h2.distance) || (h1.price < h2.price))
```

So given two hotels an equal distance from the beach, we prefer the one that is cheaper. And given two hotels that are the same price, we prefer the one closer to the beach. We compute the skyline by finding the points where there is no point that is better on both (all) dimensions, and joining them. The points that define this skyline are *a, i,* and *k*.
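For contrast with the indexed algorithms discussed below, a brute-force skyline over (distance, price) pairs is only a few lines (the sample points in the test are invented, not the paper’s hotels):

```python
def dominates(p, q):
    """p dominates q: no worse on every dimension, strictly better on at
    least one (minimisation on all dimensions)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """O(n^2) baseline: keep every point not dominated by another."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```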

In a SQL style syntax we might express this query as

```
SELECT * FROM Hotels SKYLINE OF Price min, Distance min
```

Here’s another example. Suppose we have a set of data points telling us how well our application performs on a given AWS instance type (for some definition of performance that we can boil down to a single number), and how much that instance type costs. If our scoring function prefers cheaper instance types when performance is equivalent, and instance types that deliver better application performance when cost is equal, then we might draw a skyline that looks like this:

“Progressive skyline computation in database systems” shows us how to efficiently compute a skyline in a database context, as well as introducing a number of variants of the base skyline problem. We also get a review of existing (prior to 2003) algorithms, which I’m going to pass over in favour of explaining the *branch-and-bound skyline* (BBS) algorithm the authors introduce which outperforms them. A good progressive skyline computation algorithm should have the following qualities:

- _progressiveness_: first results should be reported to the user almost instantly, and the output size should gradually increase
- _no false misses_: eventually the algorithm should generate the entire skyline
- _no false hits_: a point should never be reported as a skyline point if it will eventually be replaced (dominated)
- _fairness_: the algorithm should not favour points that are particularly good in one dimension
- _incorporation of preferences_: the user should be able to specify the order in which skyline points are reported
- _universality_: the algorithm should be applicable to any dataset distribution and dimensionality, using some index structure.

### The BBS algorithm

BBS is based on a nearest-neighbour search approach, and uses R-trees for data partitioning (though alternative data partitioning data structures would also work).

BBS, similar to the previous algorithms for nearest neighbours and convex hulls, adopts the branch-and-bound paradigm…

First we need an R-tree for the data. Let’s build one for the hotels data set, with node capacity = 3:

And here are the bounding boxes of the nodes of the R-tree. First the leaves:

…and then we can draw on the bounding boxes for the intermediate nodes too:

The BBS algorithm starts from the root node of the R-tree, and inserts all of its entries (in this case *e _{6}* and *e _{7}*) into a sorted list, ordered by their *mindist*. That is, sorted by their distance from the origin.

Distances are computed according to the L _{1} norm, i.e. the *mindist* of a point equals the sum of its coordinates, and the *mindist* of a minimum bounding rectangle (i.e. intermediate entry) equals the *mindist* of its lower-left corner point.

If you’re following along at home, at this stage therefore we have two entries in the list:

```
e7,4
e6,6
```

The distance of e7 (which points to N7) is 4 (3+1), and the distance of e6 (which points to N6) is 6 (1+5). We also initialise the set of skyline points, *S* to the empty set.

While the list is not empty, we repeatedly process the first entry in the list. Starting with e7, therefore: e7 is removed from the list and replaced by its children e3, e4, and e5.

```
e3, 5
e6, 6
e5, 8
e4, 10
```

The next step is to process e3:

```
i, 5
e6, 6
h, 7
e5, 8
e4, 10
g, 11
```

The top of the list is now *i*. This is not dominated by any entry in *S* (which is empty at this point), so we add it to the set S of skyline points. That leaves e6 at the head of the list, and so we expand e6:

```
h, 7
e5, 8
e1, 9
e4, 10
g, 11
```

During this expansion, e2 is *not* added to the list, since it is dominated by an entry in S (namely, *i*).

At the next step, *h* is discarded since it is dominated by *i*. That leaves e5 at the head of the list, and it is also discarded since it is dominated by *i*. Now we expand e1…

```
a, 10
e4, 10
g, 11
b, 12
c, 12
```

The top of the list is *a*, which is not dominated by any member of *S*, so we add it to *S*. Expanding e4 we add only *k*, since *l* is dominated by *i*.

```
k,10
g, 11
b, 12
c, 12
```

The top of the list is now *k*, which is added to the skyline set S (S = {i,a,k}). Nodes *g*, *b,* and *c* are then pruned in turn since each is dominated by some member in S. At this point the list is empty and the algorithm is completed.

The pseudo-code for the algorithm is shown below (the paper uses ‘heap’ for what I called a sorted-list).

Note that an entry is checked for dominance twice: before it is inserted in the heap, and before it is expanded (processed). The second check is necessary because an entry in the heap may become dominated by some skyline point discovered after its insertion (therefore it does not need to be visited).
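Alongside the paper’s pseudo-code, here’s my own hedged Python sketch of that loop over a toy two-level tree. Entries are `('point', coords)` or `('node', lower_left, children)` tuples of my own invention; a real implementation would of course sit on an actual R-tree:

```python
import heapq
from itertools import count

def dominates(s, p):
    """s dominates p if s is no larger on every coordinate (and differs)."""
    return all(a <= b for a, b in zip(s, p)) and s != p

def corner(entry):
    """The point itself, or a node's lower-left corner."""
    return entry[1]

def mindist(entry):
    """L1 distance of the entry's corner from the origin."""
    return sum(corner(entry))

def bbs(root_entries):
    tie = count()  # tie-breaker so the heap never compares entry bodies
    heap = [(mindist(e), next(tie), e) for e in root_entries]
    heapq.heapify(heap)
    skyline = []
    while heap:
        _, _, entry = heapq.heappop(heap)
        # Second dominance check: a heap entry may have become dominated
        # by a skyline point discovered after its insertion.
        if any(dominates(s, corner(entry)) for s in skyline):
            continue
        if entry[0] == 'point':
            skyline.append(entry[1])
        else:
            for child in entry[2]:
                # First dominance check, before insertion into the heap.
                if not any(dominates(s, corner(child)) for s in skyline):
                    heapq.heappush(heap, (mindist(child), next(tie), child))
    return skyline
```

Note how both dominance checks from the text appear: once guarding insertion, once guarding expansion.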

The authors show that BBS satisfies all of the criteria given previously, and is I/O optimal, meaning that (i) it visits only the nodes that may contain skyline points, and (ii) it does not access the same node twice.

### Incremental maintenance

The skyline may change due to subsequent updates (i.e. insertions and deletions) to the database, and hence should be incrementally maintained to avoid recomputation.

If a new point is dominated by an existing skyline point it is simply discarded. If it is not dominated then it becomes part of the new skyline. BBS performs a window query (on the main-memory R-tree) using the dominance region of the new point *p*, to retrieve any skyline points that may now be obsolete.

The query may return nothing, in which case the skyline simply increases by one point:

If points are returned though (i and k in the example below), these are removed from the skyline:

Handling deletions is more complex. First, if the point removed is not in the skyline (which can be easily checked by the main-memory R-tree using the point’s coordinates), no further processing is necessary. Otherwise, part of the skyline must be reconstructed…

Which part? The part that is *exclusively dominated* by the skyline point being deleted (i.e. areas not dominated by other skyline points). This corresponds with the shaded area in the example below:

Recomputing this part of the skyline leaves us with:

### Variations

A *constrained skyline* query returns the most interesting data points in a space defined by constraints (e.g. price between x and y). BBS easily processes such queries by simply pruning any entries that do not intersect the constrained region. (This constrained processing method is used during incremental maintenance of a skyline when a skyline point is deleted).

A *ranked* (*top-K*) skyline query returns the top K skyline points according to some user-provided input function – which must be monotone on each attribute. BBS handles such queries by replacing the *mindist* function with the user-provided function, and terminating after exactly *K* points have been reported. *mindist* is essentially the special case where the preference function is `pref(x,y) = x + y`.

A *grouped-by skyline* query returns multiple skylines, one for each group (e.g., skylines for 3-star, 4-star, and 5-star hotels). BBS can be adapted for this by keeping a separate in-memory R-tree for each class, together with a single heap containing all the visited entries.

A *dynamic skyline* query provides dimension functions and returns a skyline in the data space defined by the outputs of those functions. An example makes this clearer: suppose we have the user’s current location, and a database with hotel locations. We could then compute a dimension which is ‘distance from user’ and produce a skyline based on this. BBS can be used for dynamic skylines simply by computing *mindist* in the dynamic space.

An *enumerating* query returns for each skyline point p, the number of points dominated by p. A *K-dominating* query retrieves the *K* points that dominate the largest number of other points (strictly speaking, this is not a skyline query, since the result does not necessarily contain skyline points).

A *K-skyband* query reports the set of points that are dominated by at most *K* points.

Conceptually, K represents the thickness of the skyline; the case that K=0 corresponds to a conventional skyline.

Using the hotels example, here’s what the result of a 2-skyband query would look like.

*Approximate skylines* can be created from a histogram on the dataset, or progressively following the root visit in BBS. See section 5 of the paper for details.

## Spheres of influence for more effective viral marketing

Spheres of influence for more effective viral marketing Mehmood et al. *SIGMOD ’16*

In viral marketing the idea is to spread awareness of a brand or campaign by exploiting pre-existing social networks. The received wisdom is that by targeting a few influential individuals, they will be able to spread your marketing message to a large portion of the network. The *Influence Maximization* problem involves finding the set of *k* nodes (the ‘seed set’) such that activating these nodes maximizes the eventual reach. But Duncan Watts challenged this hypothesis (“Challenging the influentials hypothesis,” Watts 2007), stating that influence processes are highly unreliable, and therefore it is better to target a large seed of ordinary individuals, each with a smaller but more reliable sphere of influence.

Inspired by this vision, in this paper we study how to compute the sphere of influence of each node s in the network, together with a measure of stability of such sphere of influence, representing how predictable the cascades generated from s are. We then devise an approach to influence maximization based on the spheres of influence and maximum coverage, which is shown to outperform in quality the theoretically optimal method for influence maximization when the number of seeds grows.

The standard algorithm for influence maximisation is a greedy one, and the solution outlined in this paper is the first (to the best of the authors’ knowledge) to show a consistent improvement over it.

### The Typical Cascade problem

Imagine a *probabilistic directed graph*, where each edge has an associated probability that it will participate in a contagion cascade. We’d like to know what set of nodes *C* are likely to be reached from some source node *s*. In addition to viral marketing applications, you can imagine this information being used in studying epidemics, failure propagation in financial and computer networks, and other related areas.

Given the probabilistic nature, what set *C* should be returned from such a query?

One could think to select the most probable cascade, but this would not be a good choice as explained next. If we have |V| = n nodes there are 2 ^{n} possible cascades and n is usually large. This means that we have a probability distribution over a very large discrete domain, with all the probabilities that are very small. As a consequence the most probable cascade still has a tiny probability, not much larger than many other cascades. Finally, the most probable cascade might be very different from many other equally probable cascades.

So instead, the authors are interested in the *typical cascade* : the set of nodes which is closest in expectation to all the possible cascades of *s*. If we want to talk about how “close” one set of nodes is to another set, we’ll need some way of defining the distance between them. For this purpose, the Jaccard Distance is used. The Jaccard Distance of two sets A and B is the number of elements that appear only in A or only in B, divided by the total number of elements in A and B:
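That definition transcribes directly into code:

```python
def jaccard_distance(a, b):
    """|A symmetric-difference B| / |A union B|; 0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0
```

Two identical sets are at distance 0, two disjoint sets at distance 1.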

The goal is to find the set of nodes that minimizes the summed expected cost to all of the random cascades from *s*. This set represents the *typical cascade* of the node *s*, or its *sphere of influence*. The smaller the distance the greater the *stability* (i.e. random cascades from *s* deviate less from the typical cascade).

So far so good. Unfortunately… “computing the expected cost of a set of nodes (a typical cascade) is #P-hard.” (The authors provide a proof, but here I’m prepared to take their word for it). The closely related problem of finding the *Jaccard Median* of the set of all cascades from *s* also turns out to be NP-hard, although there is a known polynomial time approximation scheme.

A natural approach to deal with some #P-hard problems is by means of Monte-Carlo sampling…

So here’s what we can do:

- Sample *l* random cascades from a source node *s*, and
- Compute the Jaccard median of this smaller set of cascades, as the typical cascade
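For the second step, here’s a crude stand-in that restricts the median candidates to the sampled cascades themselves and picks the one with minimum total Jaccard distance to the rest. The paper instead uses Chierichetti et al.’s approximation algorithm, so treat this purely as a sketch of the idea:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets."""
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0

def approx_typical_cascade(sampled_cascades):
    """A simple 1-median heuristic over the sample: the cascade with the
    minimum summed distance to all the others (not the paper's method)."""
    return min(sampled_cascades,
               key=lambda c: sum(jaccard_distance(c, other)
                                 for other in sampled_cascades))
```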

But how many random cascades do we need in our sample in order to have confidence in the results? The surprising result is that, independent of *n* (the number of nodes in the graph), a sample size of *log(1/α)/α ^{2}* is sufficient to obtain a (1+O(α))-approximate median. (“A 1-median in a finite metric space is a point with the minimum average distance to all other points,” – *Some results on approximate 1-median selection in metric spaces*). The proof of this is in section 3 of the paper.

### Practical algorithms for the Typical Cascade problem

We have two sub-problems to solve: firstly, efficiently creating a set of *l = O(log(1/α)/α ^{2})* samples; and secondly, computing their approximate Jaccard median once we have them. For the second problem, the authors defer to the work of Chierichetti et al. (“Finding the Jaccard median”) using the algorithm described in section 3.2 of their paper (a polynomial time approximation).

That leaves us with the problem of efficiently computing the set of cascades. This is addressed by sampling *l* ‘possible worlds’ G_{i}…G_{l} from the graph G, each of which implicitly defines a sample cascade from a vertex *v ∈ G*.

A key observation that we exploit to speed up this process is that all the vertices in the same strongly connected component (SCC) have the same reachability set: since any two vertices u, v in the same SCC are reachable from each other, any vertex reachable by u is also reachable by v, and vice versa. Therefore we can represent each sampled possible world G _{i} by its SCC structure. Representing G _{i} in terms of its SCCs yields savings in both space usage and computational runtime, because of the compactness of representation and because a single depth first search is sufficient to identify the reachability set of all vertices in the same component.

Based on this idea an index containing information about the links between SCCs, denoted *C _{i}* (obtained by contracting each *component* of the graph to a single vertex), and a mapping from vertex to connected component, is constructed. Then, …

Given a node v and i ∈ [l], the cascade of v in G _{i} can be obtained as follows: look at the identifier of the SCC of v in G _{i}; recursively follow the links from the associated condensed vertex in C _{i} to find all the reachable components; and output the union of the elements in the reachable components. The time to perform this computation is linear in the number of nodes of the output and the number of edges of the condensation C _{i}, which is typically much smaller than the number of edges of G _{i}.
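A sketch of that lookup, assuming the SCC index for one possible world has already been built (`scc_of`, `condensed_edges`, and `members` are my own hypothetical names for the three pieces of the index):

```python
def cascade(v, scc_of, condensed_edges, members):
    """scc_of: vertex -> component id; condensed_edges: component id ->
    successor component ids in the condensation C_i; members: component
    id -> set of vertices. Returns the union of all components reachable
    from v's component."""
    start = scc_of[v]
    seen, stack = {start}, [start]
    while stack:  # depth-first walk over the condensation, not over G_i
        c = stack.pop()
        for nxt in condensed_edges.get(c, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    out = set()
    for c in seen:
        out |= members[c]
    return out
```

The walk touches only condensed vertices and edges, which is where the claimed time saving comes from.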

### Application to influence maximisation

Given *typical cascades* (computed as above) for all vertices in a graph, we can use a standard greedy algorithm that runs for *k* iterations. Let the set of ‘covered nodes’ be the union of all the nodes in the typical cascades of the chosen seed nodes so far. At each iteration, the algorithm simply chooses the node which increases the size of the ‘covered nodes’ set the most.
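The greedy step just described is a standard maximum-coverage loop; a minimal sketch (the node names in the test are illustrative):

```python
def greedy_seeds(typical, k):
    """typical: node -> set of nodes in its typical cascade.
    Pick k seeds, each maximising the marginal gain in covered nodes."""
    covered, seeds = set(), []
    for _ in range(k):
        best = max((n for n in typical if n not in seeds),
                   key=lambda n: len(typical[n] - covered))
        seeds.append(best)
        covered |= typical[best]
    return seeds, covered
```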

… our main practical result: the fact that our method for influence maximization based on spheres of influence outperforms the standard influence maximization method for what concerns quality, i.e., the expected spread achieved.

(See figure 6 in the paper for a series of charts demonstrating this).

Why does the greedy algorithm based on typical cascades beat the standard greedy algorithm, which is based on choosing the node with the maximum expected spread at each stage? The answer is that the standard algorithm *saturates* earlier (that is, finds it harder to detect meaningful differences amongst the remaining choices). “… our method has more power to discriminate among interesting nodes, as it reaches the saturation point much later.”

This was a pretty meaty paper for a Friday, and I had to work hard to follow parts of it. I hope I have at least managed to convey the essence of the idea.

In reading around the topic a little bit, I also came across ‘Measuring influence in Twitter: the million follower fallacy’ (Cha et al. 2010), and ‘Who creates trends in online social media: the crowd or opinion leaders?’ (Zhang et al. 2014). Short version: you maximise your own influence by keeping your tweets to a well-defined topic area; and early participation of the crowd (not just a small number of influencers) is important for eventual large-scale coverage.

## DBSherlock: A performance diagnostic tool for transactional databases

DBSherlock: A performance diagnostic tool for transactional databases Yoon et al. *SIGMOD ’16*

…tens of thousands of concurrent transactions competing for the same resources (e.g. CPU, disk I/O, memory) can create highly non-linear and counter-intuitive effects on database performance.

If you’re a DBA responsible for figuring out what’s going on, this presents quite a challenge. You might be awash in stats and graphs (MySQL maintains 260 statistics and variables for example), but still sorely lacking the big picture… “as a consequence, highly-skilled and highly-paid DBAs (a scarce resource themselves) spend many hours diagnosing performance problems through different conjectures and manually inspecting various queries and log files, until the root cause is found.”

*DBSherlock* (available at http://dbseer.org) is a *performance explanation framework* that helps DBAs diagnose performance problems in a more principled manner. The core of the idea is to compare a region of interest (an anomaly) with normal behaviour by analyzing past statistics, to try and find the most likely causes of the anomaly. The result of this analysis is one or both of:

- A set of concise predicates describing the combination of system configurations or workload characteristics causing the performance anomaly. For example, when explaining an anomaly caused by a network slowdown:

network_send < 10KB ∧ network_recv < 10KB ∧ client_wait_times > 100ms ∧ cpu_usage < 5

- A high-level diagnosis based on existing causal models in the system. An example cause presented here might be “Rotation of the redo log file.”

Suppose in the first of these two cases the predicates are presented to the user, who diagnoses the root cause as a network slowdown based on these hints. This information is fed back to DBSherlock by accepting the predicates and labelling them with the actual cause.

This ‘cause’ and its corresponding predicates comprise a causal model, which will be utilized by Sherlock for future diagnoses.

There are a number of similarities here to works on time series discretization, anomaly detection, and correlation, but the emphasis in this work is on *explaining* anomalies (largely assumed to be already detected by the user because they’re of the obvious ‘performance is terrible’ kind, though automated anomaly detection can be integrated with the system) rather than *detecting* them.

The starting point is a graph like the following, where the user can highlight a region and ask “what’s going on here?”

There are three main layers to answering that question inside DBSherlock: firstly, predicate generation seeks to find a small number of predicates that have high separation power to divide the anomalous and normal region behaviours; secondly, the set of predicates can be further pruned using simple rules encoding limited domain knowledge; and finally the causal model layer seeks to learn and map sets of predicates to higher-level causes, capturing the knowledge and experience of the DBA over time.

DBSherlock is integrated as a module in DBSeer, an open-source suite of database administration tools for monitoring and predicting database performance. This means DBSherlock can rely on DBSeer’s API for collecting and visualizing performance statistics. At one-second intervals, the following data is collected:

- OS resource consumption statistics
- DBMS workload statistics
- Timestamped query logs
- Configuration parameters from the OS and DBMS

From the raw data, DBSeer computes aggregate statistics about transactions executed during each time interval, and aligns them with the OS and DBMS statistics according to their timestamps. It is these resulting time series that DBSherlock uses as the starting point for diagnosis.

### Finding predicates of interest

Starting from a given abnormal region, and a normal region, the aim is to find predicates that can separate them as cleanly as possible. The *separation power* of a predicate is defined as the difference between percentage of abnormal tuples that satisfy it, and the percentage of normal tuples that satisfy it.
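In code, separation power is just a difference of two satisfaction rates (the dict-per-tuple representation is my own assumption):

```python
def separation_power(pred, abnormal, normal):
    """Fraction of abnormal tuples satisfying pred, minus the fraction
    of normal tuples satisfying it (in [-1, 1]; higher is better)."""
    rate = lambda tuples: sum(1 for t in tuples if pred(t)) / len(tuples)
    return rate(abnormal) - rate(normal)
```

A predicate satisfied by all abnormal tuples and no normal ones scores 1, the ideal separator.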

Identifying predicates with high separation power is challenging. First, one cannot find a predicate of high separation power by simply comparing the values of an attribute in the raw dataset. This is because real-world datasets and OS logs are noisy and attribute values often fluctuate regardless of the anomaly. Second, due to human error, users may not specify the boundaries of the abnormal regions with perfect precision. The user may also overlook smaller areas of anomaly, misleading DBSherlock to treat them as normal regions. These sources of error compound the problem of noisy datasets. Third, one cannot easily conclude that predicates with high separation power are the actual cause of an anomaly. They may simply be correlated with, or be symptoms themselves of the anomaly, and hence, lead to incorrect diagnoses.

The first step is to discretize the time series (or each attribute within a multi-dimensional time series, if you prefer to think of it that way) within the ranges of interest. For numeric attributes, the maximum and minimum values within the ranges of interest are found, and the resulting range is divided into *R* equal sized buckets or *partitions*. For example, with a range 0–100 and R = 5 these would be [0,20), [20,40), [40,60), [60,80), and [80,100]. By default DBSherlock uses R = 1000. For categorical attributes, one partition is created for each unique attribute value found. The result of this step is a *partition space*. (*Why not use a standard time series algorithm such as SAX for this step?*)
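The bucketing for numeric attributes is straightforward; a sketch using the example’s parameters:

```python
def partition_index(value, lo, hi, R):
    """Map value in [lo, hi] to one of R equal-width partitions
    (DBSherlock defaults to R = 1000; R = 5 matches the example above)."""
    width = (hi - lo) / R
    # Clamp so that value == hi lands in the last partition.
    return min(int((value - lo) / width), R - 1)
```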

The next step is to *label* the partitions. For each partition, all the tuples that fall within that partition are examined. If all of these tuples are from the abnormal region, the partition is labeled as abnormal. If all of the tuples are instead from the normal region, then the partition is labeled as normal. Otherwise it is left empty (unlabeled).

At this stage, due to the noise in the data, there will probably be a mixed set of abnormal and normal partitions such as this:

To cope with this noise, a *filtering* and *filling* process now takes place. Filtering helps to find a sharper edge separating normal and abnormal. This is a two-phase process. In the first phase each non-empty (i.e. marked as abnormal or normal) partition is examined. If its nearest non-empty partition in either direction is not of the same type (e.g. an abnormal partition has a normal partition as its first non-empty neighbour) then that partition is marked. In the second phase, all marked partitions are switched to empty.

Our filtering strategy aims to separate the Normal and Abnormal partitions that are originally mixed across the partition space (e.g., due to the noise in the data or user errors). If there is a predicate on attribute Attr, that has high separation power, the Normal and Abnormal partitions are very likely to form well-separated clusters after the filtering step. This step mitigates some of the negative effects of noisy data or user’s error, which could otherwise exclude a predicate with high separation power from the output.

After filtering, we will be left with larger blocks of consecutive normal and abnormal partitions, separated by empty partitions. The empty partitions are now *filled* by marking them as either normal or abnormal:

- If the nearest adjacent partitions to an empty partition both have the same type (e.g. abnormal), then the partition is assigned that type.
- If the nearest adjacent partitions have different types, then the partition is assigned the label of the closer partition. The distance calculation includes an *anomaly distance multiplier* factor δ, by which the distance to an *abnormal* adjacent neighbour is multiplied before doing the comparison. If δ > 1 of course, this favours marking a partition as normal. DBSherlock uses δ = 10 by default.
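Putting the filtering and filling rules together for one attribute, a hedged sketch (the `'A'`/`'N'`/`None` encoding for abnormal/normal/empty partitions is my own):

```python
def filter_and_fill(labels, delta=10):
    """labels: list of 'A' (abnormal), 'N' (normal) or None (empty).
    Filtering: empty any labelled partition whose nearest labelled
    neighbour on either side carries the other label.
    Filling: give each empty partition the label of the closer labelled
    neighbour, multiplying the distance to an abnormal neighbour by delta
    (delta > 1 favours marking partitions as normal)."""
    n = len(labels)
    idx = [i for i, l in enumerate(labels) if l is not None]
    # --- filtering (two phases: mark, then empty) ---
    marked = set()
    for k, i in enumerate(idx):
        neighbours = []
        if k > 0:
            neighbours.append(labels[idx[k - 1]])
        if k + 1 < len(idx):
            neighbours.append(labels[idx[k + 1]])
        if any(l != labels[i] for l in neighbours):
            marked.add(i)
    filtered = [None if i in marked else labels[i] for i in range(n)]

    # --- filling ---
    def weighted(dist_label):
        if dist_label is None:
            return float('inf')
        dist, label = dist_label
        return dist * (delta if label == 'A' else 1)

    out = list(filtered)
    for i in range(n):
        if out[i] is not None:
            continue
        left = next(((i - j, filtered[j]) for j in range(i - 1, -1, -1)
                     if filtered[j] is not None), None)
        right = next(((j - i, filtered[j]) for j in range(i + 1, n)
                      if filtered[j] is not None), None)
        if left is None and right is None:
            continue
        out[i] = (left if weighted(left) <= weighted(right) else right)[1]
    return out
```

Note that the fill consults the *filtered* labels only, so newly filled partitions don’t cascade into their neighbours.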

Finally it comes time to extract a candidate predicate for the attribute. A candidate predicate is generated for a numeric attribute if the following two conditions hold:

- There is a single block of abnormal partitions
- After normalizing the attribute values so that they all fall on a spectrum from 0..1, the average value of the attribute in the normal range and the average value of the attribute in the abnormal range are separated by at least θ, where θ is a configurable *normalized difference threshold*.

If a numeric attribute passes these two tests, a predicate with one of the following forms will be generated:

- attr < x
- attr > x
- x < attr < y

Categorical attribute predicates are simpler, and DBSherlock simply generates the predicate *Attr ∈ {p | p.label = abnormal}*.

### Incorporating domain knowledge

Our algorithm extracts predicates that have a high diagnostic power. However, some of these predicates may be secondary symptoms of the root cause, which if removed, can make the diagnosis even easier.

The secondary symptom predicates can be pruned with a small amount of domain knowledge. This knowledge is expressed as a set of rules of the form *Attr _{i} → Attr _{j}*, with the meaning that if predicates for both Attr _{i} and Attr _{j} are generated, then the predicate for Attr _{j} is likely to be a secondary symptom. Cycles are not allowed in the rule set.

As an example, for MySQL on Linux, there are just four rules:

- DBMS CPU Usage → OS CPU Usage. (if DBMS CPU Usage is high, we probably don’t need to also flag OS CPU Usage)
- OS Allocated Pages → OS Free Pages; OS Used Swap Space → OS Free Swap Space; OS CPU Usage → OS CPU Idle. (For these three, one attribute is always completely determined by the other, and so is uninteresting).
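Applying these rules can be sketched in a few lines. This is an assumed representation (predicates as a dict keyed by attribute name), not the paper's implementation; the rule names follow the MySQL example above:

```python
RULES = [
    ("DBMS CPU Usage", "OS CPU Usage"),
    ("OS Allocated Pages", "OS Free Pages"),
    ("OS Used Swap Space", "OS Free Swap Space"),
    ("OS CPU Usage", "OS CPU Idle"),
]

def prune_secondary(predicates, rules=RULES):
    """predicates: dict mapping attribute name -> predicate string.
    Drops the right-hand side of any rule whose both sides are present."""
    pruned = dict(predicates)
    # the rule set is required to be acyclic, so repeated passes terminate
    changed = True
    while changed:
        changed = False
        for cause, symptom in rules:
            if cause in pruned and symptom in pruned:
                del pruned[symptom]
                changed = True
    return pruned
```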

### Causal models

Previous work on performance explanation has only focused on generating explanations in the form of predicates. DBSherlock improves on this functionality by generating substantially more accurate predicates (a 20–55% higher F1-measure). However, a primary objective of DBSherlock is to go beyond raw predicates, and offer explanations that are more human-readable and descriptive.

Suppose DBSherlock finds three predicates with high predictive power for an anomaly, and after investigation the user confirms with the help of these clues that the true cause was rotation of the redo log file. “Rotation of the redo log file” is then added to the system as a causal model and linked with these three predicates.

After that, for every diagnosis inquiry in the future, DBSherlock calculates the confidence of this causal model and ‘Log Rotation’ is reported as a possible cause if its confidence is higher than a threshold λ. By default, DBSherlock displays only those causes whose confidence is higher than λ=20%. However, the user can modify λ as a knob (i.e., using a sliding bar) to interactively view fewer or more causes.

When multiple causal models are created for the same root cause while analysing different anomalies over time, DBSherlock can merge these into one by keeping only the predicates for attributes that appear in both models, and by combining the boundary conditions (or categories) on attribute-matched predicates. In evaluation, it turns out that merged causal models are on average 30% more accurate than the original models.
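The merge step can be sketched as follows, under the simplifying assumption that each numeric predicate is represented as a `(low, high)` interval; the exact way DBSherlock combines boundary conditions may differ:

```python
def merge_models(model_a, model_b):
    """Merge two causal models for the same root cause.
    Each model: dict mapping attribute name -> (low, high) interval.
    Keeps only attributes present in both models, widening the
    interval to cover both boundary conditions."""
    merged = {}
    for attr in model_a.keys() & model_b.keys():
        lo_a, hi_a = model_a[attr]
        lo_b, hi_b = model_b[attr]
        merged[attr] = (min(lo_a, lo_b), max(hi_a, hi_b))
    return merged
```

Attributes that appear in only one of the models are dropped, which is what filters out predicates that were incidental to a particular anomaly instance.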

### DBSherlock in action

Section 8 of the paper contains an evaluation of DBSherlock in action looking for ten common types of database performance anomalies as shown below:


For details, I refer you to the full paper. The short version is:

> Our extensive experiments show that our algorithm is highly effective in identifying the correct explanations and is more accurate than the state-of-the-art algorithm. As a much needed tool for coping with the increasing complexity of today’s DBMS, DBSherlock is released as an open-source module in our workload management toolkit.

Note that the principles used to build DBSherlock are not tied to the domain of database performance explanation in any deep way, so it should be entirely possible to take these ideas and apply them in other contexts – for example: “why has the latency just shot up on requests to this microservice?”