Snap: a microkernel approach to host networking

Marty et al., SOSP'19. This paper describes the networking stack, Snap, that has been running in production at Google for the last three-plus years. It’s been clear for a while that software designed explicitly for the data center environment will increasingly want/need to make different design trade-offs to …

Procella: unifying serving and analytical data at YouTube

Chattopadhyay et al., VLDB'19. Academic papers aren’t usually set to music, but if they were, the chorus of Queen’s "I want it all (and I want it now...)" would seem appropriate here. Anchored in the primary use case of supporting Google’s YouTube business, what we’re looking at here …

Fast key-value stores: an idea whose time has come and gone

Adya et al., HotOS'19. No controversy here! Adya et al. would like you to stop using Memcached and Redis, and start building 11-factor apps. Factor VI in the 12-factor app manifesto, "Execute the app as one or more stateless processes," to be dropped and …
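Since the thesis is essentially architectural, a tiny sketch may help make it concrete. This is my caricature of the two styles, not code from the paper, and all the class and function names are invented: a stateless handler paying a remote round trip to a Memcached/Redis-style store on every request, versus a stateful process that keeps the same data in memory.

```python
# Hypothetical sketch only: contrasting a "stateless process + remote cache"
# design with a stateful in-process alternative. All names are invented.

class RemoteCache:
    """Stand-in for a Memcached/Redis client: each call models a network RPC."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        # In a real deployment this costs a round trip plus (de)serialisation.
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value


# 12-factor style: the process holds no state, so every lookup pays the RPC.
def fetch_profile_stateless(cache, user_id):
    return cache.get(f"profile:{user_id}")


# The alternative being argued for: state lives inside the serving process,
# so hot lookups are ordinary in-memory operations.
class StatefulProfileService:
    def __init__(self):
        self._profiles = {}

    def put(self, user_id, profile):
        self._profiles[user_id] = profile

    def fetch(self, user_id):
        return self._profiles.get(user_id)
```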

Nines are not enough: meaningful metrics for clouds

Mogul & Wilkes, HotOS'19. It’s hard to define good SLOs, especially when outcomes aren’t fully under the control of any single party. The authors of today’s paper should know a thing or two about that: Jeffrey Mogul and John Wilkes at Google! John Wilkes was also one …
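For readers who want a refresher on what those "nines" translate to in practice, here is a quick back-of-the-envelope calculation (my arithmetic, not the paper’s):

```python
# Illustrative arithmetic only: the downtime budget implied by an availability SLO.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines, availability in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    budget = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines} nines ({availability}): ~{budget:,.1f} minutes of downtime per year")
```

Two nines leave roughly 3.7 days of downtime per year, while five nines leave only about five minutes, which is part of why the paper argues that nines alone are a blunt instrument.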

Towards federated learning at scale: system design

Bonawitz et al., SysML 2019. This is a high-level paper describing Google’s production system for federated learning. One of the most interesting things to me here is simply to know that Google are working on this, have a first version in production working with tens of millions …
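The paper focuses on the system rather than the learning algorithm, but for orientation: the server-side heart of Federated Averaging (the algorithm such systems typically run) is just a data-weighted average of client model updates. A minimal sketch with invented names:

```python
# Minimal sketch of the server-side Federated Averaging step: combine model
# updates from a round of participating devices, weighted by how much local
# data each device trained on. Names and numbers are illustrative only.

def federated_average(client_updates):
    """client_updates: list of (num_examples, update) pairs, where update is a
    list of floats representing a flattened model update."""
    total_examples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    averaged = [0.0] * dim
    for num_examples, update in client_updates:
        for i, w in enumerate(update):
            averaged[i] += (num_examples / total_examples) * w
    return averaged


# Example round with three simulated devices:
updates = [(100, [0.1, 0.2]), (300, [0.4, 0.0]), (600, [0.2, 0.1])]
print(federated_average(updates))  # data-weighted combination of the updates
```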

Software-defined far memory in warehouse-scale computers

Lagar-Cavilla et al., ASPLOS'19. Memory (DRAM) remains comparatively expensive, while in-memory computing demands are growing rapidly. This makes memory a critical factor in the total cost of ownership (TCO) of large compute clusters, or as Google like to call them, "warehouse-scale computers (WSCs)." This paper describes a "far memory" …
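To see why this matters for TCO, a back-of-the-envelope model helps (the fractions and cost ratio below are assumptions of mine, not numbers from the paper): if a slice of memory is cold and can live in a tier that costs less per byte than DRAM, the memory bill shrinks in proportion.

```python
# Illustrative TCO arithmetic only; the parameters are made up, not figures
# taken from the paper.

def memory_cost_savings(cold_fraction, far_cost_ratio):
    """Fraction of the DRAM bill saved if `cold_fraction` of bytes move to a
    far-memory tier costing `far_cost_ratio` of DRAM per byte."""
    return cold_fraction * (1.0 - far_cost_ratio)

# e.g. if 20% of memory is cold and far memory costs a third as much per byte:
print(f"{memory_cost_savings(0.20, 1/3):.1%} of the memory bill saved")
```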

Dynamic control flow in large-scale machine learning

Yu et al., EuroSys'18. (If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site.) In 2016 the Google Brain team published a paper giving an overview of TensorFlow, "TensorFlow: a system for …
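If you haven’t met TensorFlow’s in-graph control-flow operators before, here is a toy example (mine, not one from the paper) of the kind of dynamic, data-dependent control flow the paper studies, built from the public tf.cond and tf.while_loop APIs:

```python
# Toy example of dynamic (data-dependent) control flow expressed with
# TensorFlow's in-graph operators rather than Python-level branching.
import tensorflow as tf

def sum_of_squares(n):
    """Sum i*i for i in [0, n) using an in-graph while loop."""
    i = tf.constant(0)
    acc = tf.constant(0)
    cond = lambda i, acc: i < n
    body = lambda i, acc: (i + 1, acc + i * i)
    _, acc = tf.while_loop(cond, body, (i, acc))
    return acc

n = tf.constant(5)
# tf.cond selects a branch based on a runtime tensor value, not a Python bool.
result = tf.cond(n > 0, lambda: sum_of_squares(n), lambda: tf.constant(0))
print(result)  # 0 + 1 + 4 + 9 + 16 = 30
```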