When correlation (or lack of it) can be causation

March 13, 2020 (updated May 25, 2020) ~ Adrian Colyer ~ 1 Comment

Rex: preventing bugs and misconfiguration in large services using correlated change analysis, Mehta et al., NSDI'20, and Check before you change: preventing correlated failures in service updates, Zhai et al., NSDI'20. Today's post is a double header. I've chosen two papers from NSDI'20 that are both about correlation. Rex is a tool widely deployed across ...

Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure

February 28, 2020 (updated May 25, 2020) ~ Adrian Colyer ~ 4 Comments

Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure, Li et al., NSDI'20. Modern software systems at scale are incredibly complex, ever-changing environments. Despite all the pre-deployment testing you might employ, this makes it really tough to change them with confidence. Thus it's common to use some form of phased rollout, ...

Experiences with approximating queries in Microsoft’s production big-data clusters

September 9, 2019 (updated May 25, 2020) ~ Adrian Colyer ~ 2 Comments

Experiences with approximating queries in Microsoft’s production big-data clusters, Kandula et al., VLDB'19. I’ve been excited about the potential for approximate query processing in analytic clusters for some time, and this paper describes its use at scale in production. Microsoft’s big data clusters have tens of thousands of machines, and are used by thousands of ...

Software engineering for machine learning: a case study

July 8, 2019 (updated May 25, 2020) ~ Adrian Colyer ~ 10 Comments

Software engineering for machine learning: a case study, Amershi et al., ICSE'19. Previously on The Morning Paper we’ve looked at the spread of machine learning through Facebook and Google, and at some of the lessons learned, together with the processes and tools used to address the challenges arising. Today it’s the turn of Microsoft. More specifically, we’ll be ...

RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

October 26, 2018 (updated May 25, 2020) ~ Adrian Colyer ~ 11 Comments

RobinHood: tail latency aware caching - dynamic reallocation from cache-rich to cache-poor, Berger et al., OSDI'18. It’s time to rethink everything you thought you knew about caching! My mental model goes something like this: we have a set of items that probably follow a power-law of popularity. We have a certain finite cache capacity, and ...

Orca: differential bug localization in large-scale services

October 19, 2018 (updated May 25, 2020) ~ Adrian Colyer ~ 8 Comments

Orca: differential bug localization in large-scale services, Bhagwan et al., OSDI'18. Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus ...

REPT: reverse debugging of failures in deployed software

October 17, 2018 (updated May 25, 2020) ~ Adrian Colyer ~ 9 Comments

REPT: reverse debugging of failures in deployed software, Cui et al., OSDI'18. REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix ...

Columnstore and B+ tree – are hybrid physical designs important?

September 28, 2018 ~ Adrian Colyer ~ 12 Comments

Columnstore and B+ tree - are hybrid physical designs important?, Dziedzic et al., SIGMOD'18. Earlier this week we looked at the design of column stores and their advantages for analytic workloads. What should you do though if you have a mixed workload including transaction processing, decision support, and operational analytics? Microsoft SQL Server supports hybrid ...

Medea: scheduling of long running applications in shared production clusters

June 13, 2018 ~ Adrian Colyer ~ Leave a comment

Medea: scheduling of long running applications in shared production clusters, Garefalakis et al., EuroSys'18. (If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site.) We’re sticking with schedulers today, and a really interesting system called Medea which is designed ...

ServiceFabric: a distributed platform for building microservices in the cloud

June 5, 2018 ~ Adrian Colyer ~ 19 Comments

ServiceFabric: a distributed platform for building microservices in the cloud, Kakivaya et al., EuroSys'18. (If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site.) Microsoft’s Service Fabric powers many of Azure’s critical services. It’s been in development for around ...

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Microsoft