When Feature Flags Go Wrong
Most people don’t like to write about their code disasters, which is a shame, because upon the salt flats of their dried lake of tears we can all mine the lithium of Valuable Lessons For The Future.
So we scoured YouTube for the best videos on absolute disasters caused by feature flags.
We found this one thanks to Andy Davies’ wonderful talk. You can find him on Twitter here, or watch his speech in full below.
He gives us the sorry tale of the market trading company Knight Capital. Knight Capital were happily and innocently testing their new feature for making rapid automated trades in QA, doing everything by the book. It all worked, so they knew they were ready to deploy to production.
They’d had a feature toggle to do these trades before. 8 years before. And when they deployed to their servers, one of those servers did not have the toggle for the new feature. Instead of just not doing anything, that server used the old feature flag. It therefore started making as many trades as fast as it possibly could, without limit.
Long story short, rolling back didn’t work, so they solved the problem by switching off the feature toggle. They then presumably sat in their office in sickened silence for a bit, because the impact of those rapidly made automated trades was very, very bad.
In the 45 minutes that it took to halt the rogue element, they lost nearly half a billion dollars.
Knight Capital’s Wikipedia page is now written in the past tense.
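One way to read that failure: a server that doesn't know about a toggle should do nothing, rather than fall back on whatever old flag it happens to have. Here is a minimal Python sketch of that "fail closed" lookup. The flag names and the dictionary-based flag store are made up for illustration; this is not how Knight Capital's systems worked.

```python
def is_enabled(flags, name):
    """Return True only if the flag is explicitly present and switched on.

    An unknown flag counts as off, so a server that missed a deploy
    does nothing instead of guessing.
    """
    return flags.get(name, False) is True


# Hypothetical flag stores on two servers after a partial deploy.
server_with_new_release = {"new-trading-feature": True}
server_missed_by_deploy = {"legacy-toggle": True}  # leftover from years ago

print(is_enabled(server_with_new_release, "new-trading-feature"))
print(is_enabled(server_missed_by_deploy, "new-trading-feature"))
```

The second server never sees the new flag, so the feature simply stays off there.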
For the rest of the valuable lessons to be learned, watch Andy’s video to hear him explain why you need his three golden rules to avoid the same catastrophe.
Rule 1: never reuse a feature toggle.
Rule 2: feature toggles should have a short life span.
Rule 3: name it something sensible for crying out loud.
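To make the three rules a bit more concrete, here is a rough Python sketch of a flag registry that refuses to break them. Everything in it is an assumption for illustration: the FlagRegistry class, the 90-day limit and the crude name check are not from Andy's talk or from any real library.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Flag:
    name: str
    description: str
    expires: date


class FlagRegistry:
    """Hypothetical registry that checks the three rules at registration time."""

    MAX_LIFETIME = timedelta(days=90)  # assumed limit; pick what suits your team

    def __init__(self):
        self._active = {}
        self._retired = set()  # every name that was ever used and then removed

    def register(self, name, description, expires, today=None):
        today = today or date.today()
        # Rule 1: never reuse a toggle name, even one retired years ago.
        if name in self._active or name in self._retired:
            raise ValueError(f"flag name {name!r} has been used before")
        # Rule 2: a toggle must have a short life span.
        if expires - today > self.MAX_LIFETIME:
            raise ValueError(f"flag {name!r} would live too long")
        # Rule 3: a sensible name. This check is deliberately crude.
        if len(name) < 8 or not description.strip():
            raise ValueError("use a descriptive name and say what the flag does")
        self._active[name] = Flag(name, description, expires)

    def retire(self, name):
        """Remove a flag but remember its name so it can never come back."""
        del self._active[name]
        self._retired.add(name)

    def active(self):
        return sorted(self._active)

    def overdue(self, today=None):
        """Names of flags that are past their expiry date and need cleaning up."""
        today = today or date.today()
        return sorted(n for n, f in self._active.items() if f.expires < today)
```

Retiring a flag keeps its name on a blocklist, so an eight-year-old toggle can't quietly come back to life under a new meaning.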
Andy also gives us some examples of feature toggles done well. He’s a proponent of trunk-based delivery, and he takes us through the four types of toggles (compile toggles, startup toggles for microservices, periodic toggles, and activity toggles). He also accidentally invents the word “contrologging”, controlling with toggles, which should totally be a thing.
Edith Harbaugh presents this next video.
Edith is a firm believer in feature flags, and she lists good use cases for them, such as kill switches. Intriguingly, she also includes an example of feature flags used for customer experience and marketing purposes:
The example is to do with early releases. Some people really like getting early access to things, and feature flags can facilitate that: you can control those early releases for PR purposes, premium clients or brand fans, which is a really powerful thing to do. And vice versa, some people hate getting early releases, like B2B software users who are used to the tool as it is right now and hate changes. Feature flags mean they don’t get those new changes until they’re cast iron and as bug free as Dettol.
Edith mentions Knight Capital too, but she has another valuable example of feature flags gone rogue for us.
This company remains nameless, and their error didn’t kill them. Still, it’s a Looney Tunes farce that your company could do without.
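As a rough illustration of those two use cases, here is a small Python sketch of a flag that acts as both a kill switch and an early-access gate. The ReleaseFlag class, its fields and the user IDs are all hypothetical and not taken from Edith's talk or from any particular feature flag product.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseFlag:
    name: str
    enabled: bool = True  # kill switch: set to False to turn the feature off for everyone
    early_access: set = field(default_factory=set)  # users who asked for early releases
    generally_available: bool = False  # flip once the feature is stable

    def is_on_for(self, user_id):
        if not self.enabled:
            return False  # the kill switch beats everything else
        if self.generally_available:
            return True
        return user_id in self.early_access


flag = ReleaseFlag("new-dashboard", early_access={"brand-fan-42"})
print(flag.is_on_for("brand-fan-42"))       # early adopter gets the feature
print(flag.is_on_for("cautious-b2b-user"))  # everyone else waits
```

The cautious users only see the change once generally_available is switched on, and the kill switch can still pull it from everyone if something goes wrong.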
These guys didn’t properly look after their old feature flags either. But we’re focusing on another mistake too: they gave access to almost everybody, even non-devs.
In the business context, it made sense, but still. Humans were involved where they shouldn’t be.
The problem came home to roost when one day the company realised that no one could upload a file to the CMS. This escalated quickly to the CEO, who was getting bombarded with angry phone calls from customers.
Three hours later they discovered the issue was caused by an old feature flag which had been turned off, be it through fat fingers or obtuseness. The feature flag had been designed to throttle traffic, which is an admirable reason for a feature flag, but now it was crippling the whole product.
The immediate solution was simple. They switched the feature flag back on, and everything worked again.
But what horrifies me is that, as Edith says in the video, the company didn’t then actually take further action. “We didn’t really know what it did so we left it there.” Oof.
Ok fine, at least they made sure only those qualified to touch it were able to access it?
Nope. They left it unsecured and open to basically any employee. Instead, they just put a label on it which said “do not ever touch this button.”
6 months later someone touched the button.
The last we hear about this company is that they still haven’t limited access or removed the old feature flag. Yes, the “do not ever touch this button” label is still there, but it’s now “DO NOT EVER TOUCH THIS BUTTON”.
And for all we know, it’s working. But, maybe, don’t do that in your company…
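What could they have done instead of a sternly worded label? One option is to restrict who can flip the flag and to record every attempt. The Python sketch below shows the idea under some loud assumptions: the GuardedFlag class, the flag name and the user names are invented for this post, and a real setup would lean on whatever permissions your flag tool already offers.

```python
from datetime import datetime, timezone


class GuardedFlag:
    """Hypothetical flag that only its named owners may change."""

    def __init__(self, name, enabled, owners):
        self.name = name
        self.enabled = enabled
        self.owners = frozenset(owners)
        self.audit_log = []  # (timestamp, user, requested value, allowed)

    def set(self, enabled, user):
        allowed = user in self.owners
        # Record every attempt, including refused ones, so nobody has to
        # spend three hours working out who touched the button.
        self.audit_log.append((datetime.now(timezone.utc), user, enabled, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not change flag {self.name!r}")
        self.enabled = enabled


flag = GuardedFlag("cms-upload-throttle", enabled=True, owners={"alice"})
try:
    flag.set(False, user="bob")  # bob is not an owner
except PermissionError as err:
    print(err)
print(flag.enabled)  # still on
```

The refused attempt still lands in the audit log, which is more than a label can offer.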
And if you know any more feature flag disasters, or best use cases, send them my way. We’re obsessed with feature flags, which is why we built Bullet Train in the first place.
Also, more feature flag content is coming up, along with continuous integration stuff. For more info on all of that, visit our content here.