Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices
Gan et al., ASPLOS'19 Last time around we looked at the DeathStarBench suite of microservices-based benchmark applications and learned that microservices systems can be especially latency sensitive, and that hotspots can propagate through a microservices architecture in interesting ways. Seer is …
Tag: Operations
Identifying impactful service system problems via log analysis
He et al., ESEC/FSE'18 If something is going wrong in your system, chances are you’ve got two main sources to help you detect and resolve the issue: logs and metrics. You’re unlikely to be able to get to the bottom of a problem using metrics alone (though …
Maelstrom: mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently
Veeraraghavan et al., OSDI'18 Here’s a really valuable paper detailing four-plus years of experience dealing with datacenter outages at Facebook. Maelstrom is the system Facebook use in production to mitigate and recover from datacenter-level disasters. The high-level idea is simple: drain traffic …
Orca: differential bug localization in large-scale services
Bhagwan et al., OSDI'18 Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus …
REPT: reverse debugging of failures in deployed software
Cui et al., OSDI'18 REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix …
Capturing and enhancing in situ system observability for failure detection
Huang et al., OSDI'18 The central idea in this paper is simple and brilliant. The place where we have the most relevant information about the health of a process or thread is in the clients that call it. Today the state of the practice is …
Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding
Hundman et al., KDD'18 How do you effectively monitor a spacecraft? That was the question facing NASA’s Jet Propulsion Laboratory as they looked forward towards exponentially increasing telemetry data rates for Earth Science satellites (e.g., around 85 terabytes/day for a Synthetic Aperture Radar satellite). Spacecraft are …
Log20: Fully automated optimal placement of log printing statements under specified overhead threshold
Zhao et al., SOSP’17 Logging has become an overloaded term. In this paper logging is used in the context of recording information about the execution of a piece of software, for the purposes of aiding troubleshooting. For these kinds of logging statements …
The Morning Paper on Operability
I gave a 30-minute talk at the Operability.io conference yesterday on the topic of “The Morning Paper meets operability.” In a first for me, I initially prepared the talk as a long blog post, and then created a set of supporting slides at the end. Today’s post is the text of that talk - …
DBSherlock: A performance diagnostic tool for transactional databases
Yoon et al. SIGMOD ’16 …tens of thousands of concurrent transactions competing for the same resources (e.g. CPU, disk I/O, memory) can create highly non-linear and counter-intuitive effects on database performance. If you’re a DBA responsible for figuring out what’s going on, this presents quite a challenge. …