SmoothOperator: reducing power fragmentation and improving power utilization in large-scale datacenters

Hsu et al., ASPLOS'18. What do you do when your theory of constraints analysis reveals that power has become your major limiting factor? That is, you can’t add more servers to your existing datacenter(s) without blowing your power budget, and you don’t want to ...

Canopy: an end-to-end performance tracing and analysis system

Kaldor et al., SOSP’17. In 2014, Facebook published their work on ‘The Mystery Machine,’ describing an approach to end-to-end performance tracing and analysis when you can’t assume a perfectly instrumented homogeneous environment. Three years on, and a new system, Canopy, has risen to take its place. Whereas ...

Kraken: Leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services

Veeraraghavan et al. (Facebook), OSDI 2016. How do you know how well your systems can perform under stress? How can you identify resource utilization bottlenecks? And how do you know your tests match the conditions experienced with live ...

Realtime data processing at Facebook

Chen et al., SIGMOD 2016. ‘Realtime Data Processing at Facebook’ provides us with a great high-level overview of the systems Facebook have built to support real-time workloads. At the heart of the paper is a set of five key design decisions for building such systems, together with an explanation of ...

SocialHash: An assignment framework for optimizing distributed systems operations on social networks

Shalita et al., NSDI '16. Large scale systems frequently need to partition resources or load across multiple nodes. How you do that can make a big difference. A common approach is to use a random distribution (e.g. via consistent hashing), which usually ...

Memory Networks

Weston et al. 2015. As with the Neural Turing Machine that we looked at yesterday, this paper looks at extending machine learning models with a memory component. The Neural Turing Machine work was developed at Google by the DeepMind team; today's paper on Memory Networks was developed by the Facebook AI Research group. ...