Experiences with approximating queries in Microsoft’s production big-data clusters

Experiences with approximating queries in Microsoft’s production big-data clusters Kandula et al., VLDB'19 I’ve been excited about the potential for approximate query processing in analytic clusters for some time, and this paper describes its use at scale in production. Microsoft’s big data clusters have 10s of thousands of machines, and are used by thousands of … Continue reading Experiences with approximating queries in Microsoft’s production big-data clusters

Software engineering for machine learning: a case study

Software engineering for machine learning: a case study Amershi et al., ICSE'19 Previously on The Morning Paper we’ve looked at the spread of machine learning through Facebook and Google and some of the lessons learned together with processes and tools to address the challenges arising. Today it’s the turn of Microsoft. More specifically, we’ll be … Continue reading Software engineering for machine learning: a case study

RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

RobinHood: tail latency aware caching - dynamic reallocation from cache-rich to cache-poor Berger et al., OSDI'18 It’s time to rethink everything you thought you knew about caching! My mental model goes something like this: we have a set of items that probably follow a power-law of popularity. We have a certain finite cache capacity, and … Continue reading RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

Orca: differential bug localization in large-scale services

Orca: differential bug localization in large-scale services Bhagwan et al., OSDI'18 Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus … Continue reading Orca: differential bug localization in large-scale services

REPT: reverse debugging of failures in deployed software

REPT: reverse debugging of failures in deployed software Cui et al., OSDI'18 REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix … Continue reading REPT: reverse debugging of failures in deployed software

Columnstore and B+ tree – are hybrid physical designs important?

Columnstore and B+ tree - are hybrid physical designs important? Dziedzic et al., SIGMOD'18 Earlier this week we looked at the design of column stores and their advantages for analytic workloads. What should you do though if you have a mixed workload including transaction processing, decision support, and operational analytics? Microsoft SQL Server supports hybrid … Continue reading Columnstore and B+ tree – are hybrid physical designs important?

Medea: scheduling of long running applications in shared production clusters

Medea: scheduling of long running applications in shared production clusters Garefalakis et al., EuroSys'18 (If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site). We’re sticking with schedulers today, and a really interesting system called Medea which is designed … Continue reading Medea: scheduling of long running applications in shared production clusters