ServiceFabric: a distributed platform for building microservices in the cloud

ServiceFabric: a distributed platform for building microservices in the cloud Kakivaya et al., EuroSys'18 (If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site). Microsoft’s Service Fabric powers many of Azure’s critical services. It’s been in development for around ... Continue Reading

SmoothOperator: reducing power fragmentation and improving power utilization in large-scale datacenters

SmoothOperator: reducing power fragmentation and improving power utilization in large-scale datacenters Hsu et al., ASPLOS'18 What do you do when your theory of constraints analysis reveals that power has become your major limiting factor? That is, you can’t add more servers to your existing datacenter(s) without blowing your power budget, and you don’t want to ... Continue Reading

WSMeter: A performance evaluation methodology for Google’s production warehouse-scale computers

WSMeter: A performance evaluation methodology for Google’s production warehouse-scale computers Lee et al., ASPLOS'18 (The link above is to the ACM Digital Library, if you don’t have membership you should still be able to access the paper pdf by following the link from The Morning Paper blog post directly.) How do you know how well ... Continue Reading

Anna: A KVS for any scale

Anna: A KVS for any scale Wu et al., ICDE'18 This work comes out of the RISE project at Berkeley, and regular readers of The Morning Paper will be familiar with much of the background. Here’s how Joe Hellerstein puts it in his blog post introducing the work: As researchers, we asked the counter-cultural question: ... Continue Reading

Fail-slow at scale: evidence of hardware performance faults in large production systems

Fail-slow at scale: evidence of hardware performance faults in large production systems Gunawi et al., FAST’18 The first thing that strikes you about this paper is the long list of authors from multiple different establishments. That’s because it’s actually a study of 101 different fail-slow hardware incidents collected across large-scale cluster deployments in 12 different ... Continue Reading