Machine learning systems are stuck in a rut

June 28, 2019 ~ adriancolyer ~ 18 Comments

Machine learning systems are stuck in a rut, Barham & Isard, HotOS'19. In this paper we argue that systems for numerical computing are stuck in a local basin of performance and programmability. Systems researchers are doing an excellent job improving the performance of 5-year-old benchmarks, but gradually making it harder to explore innovative machine …

Designing far memory data structures: think outside the box

June 26, 2019 ~ adriancolyer ~ 14 Comments

Designing far memory data structures: think outside the box, Aguilera et al., HotOS'19. Last time out we looked at some of the trade-offs between RInKs and LInKs, and the advantages of local in-memory data structures. There’s another emerging option that we didn’t talk about there: the use of far memory, memory attached to the network that …

Fast key-value stores: an idea whose time has come and gone

June 24, 2019 ~ adriancolyer ~ 2 Comments

Fast key-value stores: an idea whose time has come and gone, Adya et al., HotOS'19. No controversy here! Adya et al. would like you to stop using Memcached and Redis, and start building 11-factor apps. Factor VI in the 12-factor app manifesto, "Execute the app as one or more stateless processes," is to be dropped and …

What bugs cause cloud production incidents?

June 21, 2019 ~ adriancolyer ~ 13 Comments

What bugs cause production cloud incidents? Liu et al., HotOS'19. Last time out we looked at SLOs for cloud platforms; today we're looking at what causes them to be broken! This is a study of every high-severity production incident at Microsoft Azure services over a span of six months, where the root cause of …

Nines are not enough: meaningful metrics for clouds

June 19, 2019 ~ adriancolyer ~ 16 Comments

Nines are not enough: meaningful metrics for clouds, Mogul & Wilkes, HotOS'19. It’s hard to define good SLOs, especially when outcomes aren’t fully under the control of any single party. The authors of today’s paper should know a thing or two about that: Jeffrey Mogul and John Wilkes at Google! John Wilkes was also one …

Towards multiverse databases

June 17, 2019 ~ adriancolyer ~ 16 Comments

Towards multiverse databases, Marzoev et al., HotOS'19. A typical backing store for a web application contains data for many users. The application makes queries on behalf of an authenticated user, but it is up to the application itself to make sure that the user only sees data they are entitled to see. Any frontend can …

A case for managed and model-less inference serving

June 14, 2019 ~ adriancolyer ~ 1 Comment

A case for managed and model-less inference serving, Yadwadkar et al., HotOS'19. HotOS’19 is presenting me with something of a problem: there are so many interesting-looking papers in the proceedings this year that it’s going to be hard to cover them all! As a transition from the SysML papers we’ve been looking at recently, …

Beyond data and model parallelism for deep neural networks

June 12, 2019 ~ adriancolyer ~ 12 Comments

Beyond data and model parallelism for deep neural networks, Jia et al., SysML'19. I’m guessing the authors of this paper were spared some of the XML excesses of the late nineties and early noughties, since they have no qualms putting SOAP at the core of their work! To me that means the "simple" object access …

PyTorch-BigGraph: a large-scale graph embedding system

June 10, 2019 ~ adriancolyer

PyTorch-BigGraph: a large-scale graph embedding system, Lerer et al., SysML'19. We looked at graph neural networks earlier this year, which operate directly over a graph structure. Another approach is to learn embeddings for the nodes in the graph, via graph autoencoders or other means, and then use these embeddings as inputs into a (regular) neural …

Towards federated learning at scale: system design

June 7, 2019 ~ adriancolyer ~ 1 Comment

Towards federated learning at scale: system design, Bonawitz et al., SysML'19. This is a high-level paper describing Google’s production system for federated learning. One of the most interesting things to me here is simply to know that Google are working on this, have a first version in production working with tens of millions …