Oncall control of software systems

By Will Angley
23 Feb 2017

I read On-call is broken: Kahneman and Tversky told me so, by Google’s Niall Murphy, a couple weeks ago and have been thinking about it on and off since. I’ve been oncall for complicated systems before, like Google Search Appliances, and it is not the most fun place to be.

I’ve also been re-reading Feedback Control of Computer Systems and think they have something to say about each other.

Feedforward and feedback control

Feedforward control takes a controller and a system, and tries to figure out the correct action for the system based on the controller’s understanding of the world. The controller often looks at many things, is complicated, and is difficult to test correctly.

Feedback control takes a controller, a system, and a sensor that monitors the result of the system’s interaction with the world. It tries to control the system based only on the sensor.

Feedback controllers can be built from simple math when they’re controlling one input to a system, but are hard – or impossible – to build when controlling more than one input at a time.
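
To make the one-input case concrete, here’s a minimal feedback loop in Python. The service and its hooks are hypothetical stand-ins for illustration, not anything from the book: measure_utilization() is the sensor and set_worker_count() is the one input being controlled.

    # Minimal single-input feedback loop: steer a worker pool's size so that
    # measured CPU utilization tracks a setpoint. measure_utilization() and
    # set_worker_count() are hypothetical hooks, not a real API.

    import time

    SETPOINT = 0.60   # target average CPU utilization (0.0 - 1.0)
    GAIN = 20.0       # workers to add per unit of utilization error

    def control_loop(measure_utilization, set_worker_count, workers=10):
        while True:
            measured = measure_utilization()
            error = measured - SETPOINT      # positive means we're running too hot
            # The whole control law is one line of simple arithmetic; the
            # controller never needs a model of *why* utilization moved.
            workers = max(1, workers + GAIN * error)
            set_worker_count(round(workers))
            time.sleep(30)                   # control interval

A feedforward controller for the same pool would instead have to predict utilization from a model of incoming traffic, which is exactly the complicated, hard-to-test part.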

Production software systems always have multiple inputs that need control

The first input is “whatever the system is operating on,” like orders for a Google Shopping Express warehouse, and the second is the version of the software itself. And if the system receives its input from other software, the version of that upstream software is an input too.

How can the upstream software version be an input? It becomes one when a change upstream causes your service to start crashing, and the crash is due to a bug that has always existed in your software but was never uncovered by testing.

You can’t recover your system by rolling it back in this case. So you need to roll back upstream. (Or fix the crash and roll your system forward, but this is both riskier and even harder to do under automatic control.)
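
A contrived sketch of how that happens, with made-up names: the bug below shipped in every version of your service, so it doesn’t matter how far back you roll your own binary; only the upstream change (or a forward fix) makes the crash stop.

    # A latent bug that an upstream change exposes. Every version of *your*
    # service contains this code, so rolling your own service back can't help.

    def handle_order(order: dict) -> float:
        # Latent bug: assumes upstream always sends at least one line item.
        # It always had, until an upstream release started emitting empty
        # orders for cancelled carts.
        total = sum(item["price"] for item in order["items"])
        return total / len(order["items"])

    # Before the upstream change: fine.
    handle_order({"items": [{"price": 5.0}, {"price": 7.0}]})

    # After the upstream change: crashes, in every version you could roll back to.
    handle_order({"items": []})   # ZeroDivisionError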

Oncall supplies this control

Oncall is essentially adding a human into the control loop of a malfunctioning system. The human is likely to be a fairly limited controller for all of the reasons given in Niall’s post; they will consistently underperform a machine at feedback control, and you should be automating feedback loops where you can!

But humans can easily outperform machines when the issues require symbolic reasoning about software. 2016 was the first year that a computer program participated in DEF CON’s Capture the Flag challenge, and even though Mayhem was the best computer program available, it came in last place against the human teams.

And complicated controllers are inherently perilous things. It’s very easy to write control rules that will fail catastrophically, especially if the system in question is interacting with other systems under the same control rules.

For instance, if you have your own software roll back automatically when it starts crashing, but the crash bug was always latent, how long will it take your automation to realize this and give up?
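
A sketch of that failure mode, with hypothetical deploy() and is_crash_looping() hooks: if the crash is latent in every version because the real change was upstream, this rollback loop marches through the entire release history before it can conclude anything.

    # Naive automatic-rollback controller. If the crash bug is latent in every
    # version of our service -- because the triggering change was upstream --
    # this loop rolls back through the whole release history and never recovers.

    def auto_rollback(versions, deploy, is_crash_looping):
        """versions is a newest-first list of releases of *our* service."""
        for candidate in versions:
            deploy(candidate)
            if not is_crash_looping():
                return candidate       # the sensor says we're healthy again
        # Every version we own still crashes: the fault isn't in this input.
        # Only now, after burning the whole history, can we give up and escalate.
        raise RuntimeError("rollback exhausted; page a human / roll back upstream")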

As I’ve heard the story, this is why it took so long to get from TCP CUBIC to TCP BBR, and why systems like TCP Remy were never deployed on the public Internet; since the learned control laws in Remy were beyond a human’s ability to write, they were even farther beyond a human’s ability to prove safe from collapse!

We can do much better at oncall

That said, oncall today neglects lessons in human performance that were learned decades ago, and are used universally by, for example, commercial airline pilots.

We often don’t train in the situations we’re expected to resolve, and we seem to have copied practices from doctors, who need to be physically present, onto oncall shifts that don’t need that physical presence.

  • If your team relies on engineers being oncall at night, has anyone ever paged them at night for a simulated outage?

  • And if that doesn’t work all that well, have you tried paging a friend in Europe, someone who doesn’t know your system all that well but is fully awake, and comparing that against waking your own engineer up at 3AM?

    There might be a lot of roadblocks to making this work at scale in your company’s culture, but given what we know about how bad sleep deprivation is, it’s surprising no one asks.

I imagine that “being awake” matters more than we give it credit for, and that our response plans would look very different – and rely much less on waking people up at 3AM! – if we gave it that credit.

Similarly, aircraft manufacturers engineer the user interface of their cockpits to give pilots the information they need to fly the airplane, and can point to scientific research that justifies these design decisions.

Software monitoring consoles don’t seem nearly as well grounded in research or tradition. The consoles I’ve used for monitoring are more complicated than the process control charts in Deming’s The New Economics, yet they don’t display the summary statistics that would let you see whether a service has slipped out of control!
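
For what it’s worth, those summary statistics are tiny. A hedged sketch of the kind of thing a console could show, assuming you can pull a window of per-interval latency samples out of whatever monitoring system you already have:

    # Shewhart-style control limits: compute the center line and 3-sigma limits
    # from a baseline window that looked healthy, then flag later points that
    # fall outside them.

    from statistics import mean, stdev

    def control_limits(baseline):
        center = mean(baseline)
        sigma = stdev(baseline)
        return center, center - 3 * sigma, center + 3 * sigma

    # e.g. p99 latency in milliseconds per five-minute interval
    baseline = [212, 208, 215, 210, 209, 214, 211, 207, 213, 210]
    center, lower, upper = control_limits(baseline)

    recent = [211, 216, 247, 255, 262]
    out_of_control = [x for x in recent if not lower <= x <= upper]
    print(f"center={center:.0f}ms  limits=({lower:.0f}, {upper:.0f})  "
          f"out of control: {out_of_control}")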

These problems aren’t what SRE would normally work on, but I think there’s enough low-hanging fruit here that “making humans better at oncall” might still be more productive for the near future than getting an oncall AI right.