Patience is a virtue: revisiting merge and sort on modern processors – Chandramouli and Goldstein 2014
This is a wonderful story. The seeds of an algorithm laid down over 50 years ago, rediscovered, brought up to date, and found to be highly relevant.
Our investigation has resulted in some surprising discoveries about a mostly ignored 50-year old sorting technique called Patience Sort.
Earlier in the series we saw that the cache management algorithm that comes first to mind for many of us, LRU, can be outperformed by a relatively straightforward scheme called ARC (Adaptive Replacement Cache). But who would have thought we could still improve on sorting, of all things, after all these years! Sorting algorithms are among the first things CS students learn – everyone has heard of quicksort. How many of us have thought to ourselves “I wonder if there’s a better way?” and then quickly ruled out any earnest attempt to find one, on the grounds that so many people have already studied the problem so extensively?
Chandramouli and Goldstein took up that challenge:
We revisit the problem of sorting and merging data in main memory, and show that a long-forgotten technique called Patience Sort can, with some key modifications, be made competitive with today’s best comparison-based sorting techniques for both random and almost sorted data.
The ‘sorting almost-sorted data’ problem is highly relevant in today’s distributed systems: it’s what we do every time we merge logs in time order, for example. (Let’s put the questions of distributed clocks and synchronized time to one side for the moment.)
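To make the core idea concrete, here’s a minimal Python sketch of Patience Sort in its run-generation form: deal each element onto an existing ascending run if it can extend one, otherwise start a new run, then k-way merge the runs. This is just the basic ‘card dealing’ technique – not the authors’ optimized P3-Sort – and the function name and structure are purely illustrative.

```python
import bisect
import heapq

def patience_sort(data):
    """Minimal sketch: Patience Sort as run generation plus a k-way merge.

    Dealing phase: each element is appended to the leftmost run whose tail
    is <= the element; if no such run exists, a new run is started.  The
    run tails stay in descending order, so the right run can be found with
    a binary search.
    """
    runs = []        # each run is an ascending list of elements
    neg_tails = []   # negated run tails, kept ascending so bisect applies
    for x in data:
        # leftmost run whose tail <= x  <=>  leftmost negated tail >= -x
        i = bisect.bisect_left(neg_tails, -x)
        if i == len(runs):
            runs.append([x])
            neg_tails.append(-x)
        else:
            runs[i].append(x)
            neg_tails[i] = -x
    # Merge phase: a standard k-way merge of the sorted runs.
    return list(heapq.merge(*runs))
```

Nearly sorted input produces only a handful of runs, so on the workloads that matter here most of the action is in how those runs get merged.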
The refined algorithm combines Patience Sort’s run generation with an easy-to-understand technique called Ping-Pong Merge (so named because it bounces between two arrays while merging sorted runs into the final overall order), giving the sorting algorithm P3-Sort (Ping-Pong Patience Sort).
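The basic shape of such a merge is easy to sketch: lay the runs out in one array, merge adjacent pairs of runs into a second array of the same size, swap the roles of the two arrays, and repeat until a single run remains. The sketch below shows only this general pattern, not the paper’s tuned implementation:

```python
def ping_pong_merge(runs):
    """Minimal sketch of a ping-pong style merge of sorted runs."""
    # Flatten the runs into one buffer and remember their boundaries.
    src = [x for run in runs for x in run]
    dst = [None] * len(src)
    bounds = []
    start = 0
    for run in runs:
        bounds.append((start, start + len(run)))
        start += len(run)

    while len(bounds) > 1:
        new_bounds = []
        i = 0
        while i < len(bounds):
            if i + 1 == len(bounds):           # odd run out: copy it across
                lo, hi = bounds[i]
                dst[lo:hi] = src[lo:hi]
                new_bounds.append((lo, hi))
            else:                               # merge runs i and i+1
                lo1, hi1 = bounds[i]
                lo2, hi2 = bounds[i + 1]
                a, b, out = lo1, lo2, lo1
                while a < hi1 and b < hi2:
                    if src[a] <= src[b]:
                        dst[out] = src[a]; a += 1
                    else:
                        dst[out] = src[b]; b += 1
                    out += 1
                while a < hi1:
                    dst[out] = src[a]; a += 1; out += 1
                while b < hi2:
                    dst[out] = src[b]; b += 1; out += 1
                new_bounds.append((lo1, hi2))
            i += 2
        src, dst = dst, src                    # bounce between the two arrays
        bounds = new_bounds
    return src
```

Each pass halves the number of runs, so r runs are merged in O(log r) passes over the data, and bouncing between two pre-allocated arrays avoids allocating a fresh output buffer on every pass.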
So how does it do? First, the case of almost-sorted data:
Looking to the literature on sorting almost sorted data, the best-known current technique is Timsort, which is the system sort in Python, and is used to sort arrays of non-primitive type in Java SE 7, on the Android platform, and in GNU Octave.
Fairly well entrenched then… but here comes P3-Sort:
In all cases but already sorted data, P3 Sort is faster than Timsort. For the case where 5% of the data is disordered by a large amount, P3 Sort is between 3x and 4x faster than Timsort.
3 to 4x faster than the best-known prior algorithm is astonishing! Even in the one case where it loses – already sorted data – P3-Sort is only 10% slower than Timsort.
What about sorting of truly random data?
P3-Sort is about 20% faster than GNU Quicksort on random data
A 20% gain on a problem that has been so well studied over the years is, once again, a massive leap forward.
Then of course there's the replacement selection use case – sorting datasets too large to fit into main memory.
The faster approach, which deeply integrates P3 Sort into replacement selection, improves CPU performance/throughput by 3x–20x over classical replacement selection.
The authors believe that this new version can reduce the number of passes needed to sort logs with bounded disorder from 2 passes down to 1. Thus interest in replacement selection may revive after 15 years in the doldrums.
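As a refresher, classical replacement selection – the baseline the paper improves on – generates long sorted runs from a stream using a fixed-size min-heap. The sketch below is the textbook version only, not the authors’ P3-Sort-integrated variant, and the names are illustrative:

```python
import heapq

_DONE = object()  # sentinel marking end of input

def replacement_selection_runs(stream, memory_size):
    """Generate sorted runs from a stream with classical replacement selection.

    A min-heap of up to `memory_size` elements is kept in memory.  The
    smallest element is repeatedly moved to the current run; each incoming
    element either replaces it in the heap (if it can still join the current
    run) or is set aside for the next run.
    """
    it = iter(stream)
    heap = [x for _, x in zip(range(memory_size), it)]  # initial fill
    heapq.heapify(heap)

    runs, current, deferred = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        incoming = next(it, _DONE)
        if incoming is not _DONE:
            if incoming >= smallest:
                heapq.heappush(heap, incoming)   # still fits in the current run
            else:
                deferred.append(incoming)        # must wait for the next run
        if not heap:                             # current run complete
            runs.append(current)
            current = []
            heap = deferred
            deferred = []
            heapq.heapify(heap)
    return runs
```

On random input the runs average roughly twice the memory size; on almost-sorted input they can be far longer, which is what makes a one-pass sort of bounded-disorder logs plausible.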
In the conclusions the authors also note the potential for the techniques to be applied inside a DBMS, and that many further optimisations to P3-Sort are anticipated.