Brownout: building more robust cloud applications – Klein et al. 2014
How can we design cloud applications to be resilient in the face of varying resources and user load, and always deliver the best possible user experience? That’s a pretty important question these days, and Klein et al. report on a very interesting new development combining control theory and adaptive application behaviour with impressive results.
Our work borrows from the concept of brownout in electrical grids. Brownouts are an intentional voltage drop often used to prevent blackouts through load reduction in case of emergency. In such a situation, incandescent light bulbs dim, hence originating the term.
Applications can saturate – i.e. become unable to serve users in a timely manner. Some users may experience high latencies, while others may not receive any service at all. The authors argue that it is better to downgrade the user experience and continue serving a larger number of clients with reasonable latency.
We define a cloud application as brownout compliant if it can gradually downgrade user experience to avoid saturation.
This is actually very reminiscent of circuit breakers, as described in Nygard’s ‘Release It!’ and popularized by Netflix. If you’re already designing with circuit breakers, you’ve probably got all the pieces you need to add brownout support to your application relatively easily.
To lower the maintenance effort, brownouts should be automatically triggered. This enables cloud applications to rapidly and robustly avoid saturation due to unexpected environmental changes, lowering the burden on human operators.
Of course, the other thing we might do is provide the application with more resources. Studies later on in the paper look at what happens when brownout controls are applied as resources are added and removed. The results indicate that brownout control should be able to smooth out the application response and maximise user experience during such transitions.
How does the brownout model work?
- Application designers need to identify the parts of the response that may be considered optional (for example, returning product information but not recommendations, or showing a post but not comments), and make it possible to activate the optional computations on a per-request basis.
- The application needs to export a dynamically changeable runtime parameter called the ‘dimmer’. The setting of the dimmer controls the probability that the optional computations will be performed when generating a given response.
- A new application component called the controller is added; its goal is to adjust the dimmer as a function of the current performance.
So whereas a circuit breaker is triggered by failures and timeouts, the dimmer acts more like a flow control valve, determining how many requests get to execute the optional components.
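To make that concrete, here is a minimal Python sketch of a brownout-compliant request handler. The class and method names are mine, purely illustrative, and not taken from the paper; the point is simply that the mandatory part of a response is always computed, while the optional part is computed with a probability equal to the dimmer.

```python
import random


class BrownoutHandler:
    """Illustrative sketch of a brownout-compliant request handler.

    `dimmer` is a value in [0, 1]: the probability that the optional
    part of the response (e.g. recommendations, comments) is computed.
    """

    def __init__(self):
        self.dimmer = 1.0  # start with the full user experience

    def handle_request(self, request):
        response = self.mandatory_content(request)       # always served
        if random.random() < self.dimmer:                # served probabilistically
            response += self.optional_content(request)   # e.g. recommendations
        return response

    def mandatory_content(self, request):
        return "product details"

    def optional_content(self, request):
        return " + recommendations"
```

The dimmer itself is set from outside the handler, by the controller described next.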
we synthesize a control-theoretical solution to automatically decide when to activate those optional features
I’ve long said that adapting an application to changing demands is a control-theory problem (and implementing a RabbitMQ-based autoscaler for a SpringOne keynote a couple of years ago made that abundantly clear), so it’s great to see this approach being used here. It’s also why I have a copy of ‘Feedback Control’ on my Kindle waiting to be read.
…control theory allows us to provide some formal guarantees on the system. Our main aim is to close a loop around a cloud application, constraining the application to have a behaviour that is as predictable as possible.
If your knowledge of control theory is better than mine, you might be able to follow along with the derivation of the controller algorithm! The end result (after a bit of time spent decoding on my part) actually seems pretty straightforward. It’s a little bit like the Wikipedia page on PID controllers that I had to refer to: scroll past lots of theory and math until you get to the ‘pseudocode’ section at the bottom and you’ll see what I mean!
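To give a feel for that end result, here is a toy Python sketch of the feedback loop around the dimmer. This is not the paper’s controller: the real design is adaptive, estimating online how latency responds to the dimmer so it can offer formal guarantees. The names, gain, and setpoint below are all assumptions of mine; the sketch just shows the shape of the idea: measure latency each control period, compare it to the setpoint, and nudge the dimmer up or down accordingly.

```python
class DimmerController:
    """Toy integral-style controller for the dimmer (not the paper's design)."""

    def __init__(self, setpoint_s=1.0, gain=0.1):
        self.setpoint = setpoint_s  # target latency in seconds (e.g. a high percentile)
        self.gain = gain            # how aggressively to react to the error
        self.dimmer = 1.0           # start with the full user experience

    def update(self, measured_latency_s):
        # Positive error (we're faster than the target): raise the dimmer and
        # serve more optional content. Negative error: lower it to shed work.
        error = self.setpoint - measured_latency_s
        self.dimmer += self.gain * error
        self.dimmer = min(1.0, max(0.0, self.dimmer))  # clamp to [0, 1]
        return self.dimmer
```

In use, the application would periodically feed a measured latency statistic into `update()` and push the returned dimmer value to its request handlers.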
The question everyone wants answered, of course, is ‘does it work?’ Experiments suggest a very strong yes. Tests were performed first with a constant load and varying resources (e.g. to simulate failure or loss of nodes and subsequent recovery); then with constant resources and varying load (e.g. to simulate usage spikes); and finally with both load and resources varying.
The time-series results show that the self-adaptive application behaves as intended. The controller adapts the dimmer both to the available capacity and number of users as expected, and keeps the perceived latencies close to the setpoint. Moreover, the advantages that the brownout paradigm brings to previously non-adaptive applications can clearly be observed from the results.
The paper includes a number of charts that show very significant improvements in the ability to continue serving user requests within the desired latency targets when the system is under stress. A word of caution though: they’re not the easiest to interpret.
…self-adaptation through brownout can allow applications to support more users or run on fewer resources than their non-adaptive counterparts. Hence our proposition enables cloud applications to more robustly deal with unexpected peaks or unexpected failures, without requiring spare capacity.
The work in this paper only considers a single server! There’s an important extension for multiple servers, some much easier to follow charts, and a discussion on the implications for load balancing that we’ll look at next time…