Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

May 15, 2019 (updated May 25, 2020) ~ Adrian Colyer ~ 2 Comments

Gan et al., ASPLOS'19. Last time around we looked at the DeathStarBench suite of microservices-based benchmark applications and learned that microservices systems can be especially latency sensitive, and that hotspots can propagate through a microservices architecture in interesting ways. Seer is ...

Identifying impactful service system problems via log analysis

December 19, 2018 (updated May 25, 2020) ~ Adrian Colyer ~ 1 Comment

He et al., ESEC/FSE'18. If something is going wrong in your system, chances are you’ve got two main sources to help you detect and resolve the issue: logs and metrics. You’re unlikely to be able to get to the bottom of a problem using metrics alone (though ...

Maelstrom: mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently

October 24, 2018 (updated May 25, 2020) ~ Adrian Colyer ~ 4 Comments

Veeraraghavan et al., OSDI'18. Here’s a really valuable paper detailing four plus years of experience dealing with datacenter outages at Facebook. Maelstrom is the system Facebook use in production to mitigate and recover from datacenter-level disasters. The high level idea is simple: drain traffic ...

Orca: differential bug localization in large-scale services

October 19, 2018 (updated May 25, 2020) ~ Adrian Colyer ~ 8 Comments

Bhagwan et al., OSDI'18. Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus ...

REPT: reverse debugging of failures in deployed software

October 17, 2018 (updated May 25, 2020) ~ Adrian Colyer ~ 9 Comments

Cui et al., OSDI'18. REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix ...

Capturing and enhancing in situ system observability for failure detection

October 15, 2018 (updated May 25, 2020) ~ Adrian Colyer ~ 2 Comments

Huang et al., OSDI'18. The central idea in this paper is simple and brilliant. The place where we have the most relevant information about the health of a process or thread is in the clients that call it. Today the state of the practice is ...

Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding

October 10, 2018 ~ Adrian Colyer ~ 3 Comments

Hundman et al., KDD'18. How do you effectively monitor a spacecraft? That was the question facing NASA’s Jet Propulsion Laboratory as they looked forward towards exponentially increasing telemetry data rates for Earth Science satellites (e.g., around 85 terabytes/day for a Synthetic Aperture Radar satellite). Spacecraft are ...

Log20: Fully automated optimal placement of log printing statements under specified overhead threshold

November 3, 2017 ~ Adrian Colyer ~ 15 Comments

Zhao et al., SOSP’17. Logging has become an overloaded term. In this paper logging is used in the context of recording information about the execution of a piece of software, for the purposes of aiding troubleshooting. For these kinds of logging statements ...

The Morning Paper on Operability

September 21, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 3 Comments

I gave a 30 minute talk at the Operability.io conference yesterday on the topic of “The Morning Paper meets operability.” In a first for me, I initially prepared the talk as a long blog post, and then created a set of supporting slides at the end. Today’s post is the text of that talk - ...

DBSherlock: A performance diagnostic tool for transactional databases

July 14, 2016 (updated November 11, 2019) ~ Adrian Colyer ~ 5 Comments

Yoon et al., SIGMOD ’16. …tens of thousands of concurrent transactions competing for the same resources (e.g. CPU, disk I/O, memory) can create highly non-linear and counter-intuitive effects on database performance. If you’re a DBA responsible for figuring out what’s going on, this presents quite a challenge. ...

the morning paper

a random walk through Computer Science research, by Adrian Colyer

Operations