RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

Berger et al., OSDI'18. It’s time to rethink everything you thought you knew about caching! My mental model goes something like this: we have a set of items that probably follow a power-law of popularity. We have a certain finite cache capacity, and …

Orca: differential bug localization in large-scale services

Bhagwan et al., OSDI'18. Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus …

REPT: reverse debugging of failures in deployed software

Cui et al., OSDI'18. REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix …

Columnstore and B+ tree – are hybrid physical designs important?

Dziedzic et al., SIGMOD'18. Earlier this week we looked at the design of column stores and their advantages for analytic workloads. What should you do, though, if you have a mixed workload including transaction processing, decision support, and operational analytics? Microsoft SQL Server supports hybrid …

Medea: scheduling of long running applications in shared production clusters

Garefalakis et al., EuroSys'18. (If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site.) We’re sticking with schedulers today, and a really interesting system called Medea which is designed …

ServiceFabric: a distributed platform for building microservices in the cloud

Kakivaya et al., EuroSys'18. (If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site.) Microsoft’s Service Fabric powers many of Azure’s critical services. It’s been in development for around …

Azure accelerated networking: SmartNICs in the public cloud

Firestone et al., NSDI'18. We’re still on the ‘beyond CPUs’ theme today, with a great paper from Microsoft detailing their use of FPGAs to accelerate networking in Azure. Microsoft have been doing this since 2015, and hence this paper also serves as a wonderful experience report documenting …