When Feature Flags Go Wrong

Flagsmith
4 min read · Sep 20, 2018

Most people don’t like to write about their code disasters, which is a shame because upon the salt flats of their dried lake of tears, we can all mine the lithium of Valuable Lessons For The Future.

We scoured YouTube for the best videos on absolute disasters caused by feature flags.

Jumping on the dried tears of failure

Feature Flag Disaster One

We found this one thanks to Andy Davies’ wonderful talk. You can find him on Twitter here, or watch his speech in full below.

He gives us the sorry tale of the market trading company, Knight Capital. The Knight Capital team were happily and innocently testing their new feature for making rapid automated trades in QA, doing everything by the book. It all worked, and they knew they were ready to deploy to production.

Except.

They had used a feature toggle for automated trades before, eight years earlier, and the new feature reused it. When they deployed to their servers, one of those servers did not get the new code, so instead of doing nothing when the toggle was switched on, that server ran the old feature. It started making as many trades as fast as it possibly could, without limit.

Long story short, rolling back didn’t work, so they solved the problem by switching off the feature toggle. They then presumably sat in their office in sickened silence for a bit. The impact of those rapidly made automated trades was very very bad.

In the 45 minutes that it took to halt the rogue element, they lost nearly half a billion dollars.

Knight Capital’s Wikipedia page is now written in the past tense.

For the rest of the valuable lessons to be learned, watch Andy’s video to hear him explain why you need his three golden rules to avoid the same catastrophe.

Rule 1: Never reuse a feature toggle.

Rule 2: Feature toggles should have a short lifespan.

Rule 3: Name it something sensible, for crying out loud.
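
To make the three rules concrete, here is a minimal sketch of how a team might enforce them in code. It is not taken from Andy’s talk, and it is not the API of any real flag service: `FlagRegistry`, the 90-day lifetime, the naming pattern and the flag names are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass, field
from datetime import date, timedelta

# Assumed convention for Rule 3: lowercase words joined by underscores,
# at least two words, e.g. "bulk_order_router".
_NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")


class FlagError(Exception):
    """Raised when a flag is registered or read in a way the rules forbid."""


@dataclass
class FlagRegistry:
    """Hypothetical in-memory registry, for illustration only."""

    max_lifetime: timedelta = timedelta(days=90)  # assumed limit for Rule 2
    _active: dict = field(default_factory=dict)   # name -> (enabled, expires_on)
    _retired: set = field(default_factory=set)    # names that may never come back

    def register(self, name: str, today: date, enabled: bool = False) -> None:
        # Rule 3: reject names that do not describe anything.
        if not _NAME_PATTERN.fullmatch(name):
            raise FlagError(f"flag name {name!r} is not descriptive enough")
        # Rule 1: a name that was ever used, live or retired, cannot be reused.
        if name in self._active or name in self._retired:
            raise FlagError(f"flag name {name!r} has been used before")
        self._active[name] = (enabled, today + self.max_lifetime)

    def retire(self, name: str) -> None:
        # Retiring a flag keeps its name on record so it cannot be recycled.
        del self._active[name]
        self._retired.add(name)

    def is_enabled(self, name: str, today: date) -> bool:
        # Unknown or retired flags fail closed instead of waking up old code.
        if name not in self._active:
            return False
        enabled, expires_on = self._active[name]
        # Rule 2: a flag past its lifespan is an error to fix, not a default.
        if today > expires_on:
            raise FlagError(f"flag {name!r} expired on {expires_on}")
        return enabled


registry = FlagRegistry()
registry.register("old_trading_mode", today=date(2010, 1, 1))
registry.retire("old_trading_mode")
try:
    # Eight years later, someone tries to recycle the name.
    registry.register("old_trading_mode", today=date(2018, 1, 1))
except FlagError as err:
    print(err)
```

The design choice worth noting is that retired names are remembered forever. That costs almost nothing, and it turns an accidental reuse into a loud failure at registration time instead of a quiet one in production.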

Andy also gives us some examples of feature toggles done well. He’s a proponent of trunk-based delivery, and he takes us through the four types of toggles (compile toggles, startup toggles for microservices, periodic toggles, and activity toggles) and accidentally invents the word “contrologging”, controlling with toggles, which should totally be a thing.

Feature Flag Disaster Two

Edith Harbaugh presents this next video.

Edith is a firm believer in feature flags, and she lists good use cases for them, such as kill switches. Intriguingly, she also includes an example of feature flags used for customer experience and marketing purposes:

The example concerns early releases. Some people really like getting early access to things, and feature flags let you control those early releases for PR purposes, premium clients or brand fans, which is a really powerful thing to do. Conversely, some people hate early releases, like B2B software users who are used to the tool as it is right now and dislike change. Feature flags mean they don’t get new changes until they’re cast iron and as bug-free as Dettol.
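
As a rough sketch of that idea (not something shown in Edith’s talk), you could give each feature a release stage and let each customer choose the earliest stage they are willing to receive. The `Stage` names and the `sees_feature` helper below are hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical release stages, ordered from earliest to most mature."""

    EARLY_ACCESS = 1  # brand fans, premium clients, PR previews
    GENERAL = 2       # the default for most users
    HARDENED = 3      # change-averse users, e.g. some B2B customers


def sees_feature(feature_stage: Stage, user_min_stage: Stage) -> bool:
    """A user sees a feature once it has reached the stage they opted into."""
    return feature_stage >= user_min_stage


# A feature that has only just been opened up for early access:
new_editor = Stage.EARLY_ACCESS
print(sees_feature(new_editor, Stage.EARLY_ACCESS))  # early adopter
print(sees_feature(new_editor, Stage.HARDENED))      # change-averse customer
```

Promoting the feature is then a single change to its stage, and nobody is moved forward earlier than they asked to be.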

Edith mentions Knight Capital too, but she has another valuable example of feature flags gone rogue for us.

This company remains nameless and their error didn’t kill their company. Still, it’s a Looney Tunes farce that your company could do without.

A poorly managed feature toggle

These guys didn’t properly look after their old feature flags either. But we’re focusing on another mistake too: they gave access to almost everybody, even non-devs.

In the business context, it made sense, but still. Humans were involved where they shouldn’t be.

The problem came home to roost when one day the company realised that no one could upload a file to the CMS. This escalated quickly to the CEO who was getting bombarded with angry phone calls from customers.

Three hours later they discovered the issue was caused by an old feature flag which had been turned off, be it through fat fingers or obtuseness. The feature flag had been designed to throttle traffic, which is an admirable reason for a feature flag, but now it was crippling the whole product.

The immediate solution was simple. They switched the feature flag back on, and everything worked again.

But what horrifies me is that, as Edith says in the video, the company didn’t take any further action. “We didn’t really know what it did so we left it there.” Oof.

Ok fine, at least they made sure only those qualified to touch it were able to access it?

Nope.

They left it unsecured and open to basically any employee. Instead they just put a label on it, which said “do not ever touch this button.”

Six months later someone touched the button.

The last we hear about this company is that they still haven’t limited access or removed the old feature flag. Yes, the “do not touch” label is still there, but it’s now “DO NOT EVER TOUCH THIS BUTTON”.

DO NOT TOUCH THE BUTTON

And for all we know, it’s working. But, maybe, don’t do that in your company…
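
If you would rather not rely on a label, here is a minimal sketch of the alternative: the flag records what it does, only named editors can change it, and every attempt is logged. `GuardedFlag`, its fields and the user names are assumptions for illustration, not a description of what this company ran or of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedFlag:
    """Hypothetical flag with an owner list and an audit trail."""

    name: str
    description: str   # what the flag does, so nobody has to guess later
    enabled: bool
    editors: frozenset  # user names allowed to change the flag
    audit_log: list = field(default_factory=list)

    def set_enabled(self, user: str, enabled: bool) -> None:
        # Refuse changes from anyone outside the editor list, and log the try.
        if user not in self.editors:
            self.audit_log.append((user, enabled, "denied"))
            raise PermissionError(f"{user} may not change flag {self.name!r}")
        self.enabled = enabled
        self.audit_log.append((user, enabled, "applied"))


upload_throttle = GuardedFlag(
    name="cms_upload_throttle",
    description="Throttles upload traffic to the CMS; uploads fail when off.",
    enabled=True,
    editors=frozenset({"platform_lead"}),
)

try:
    upload_throttle.set_enabled("curious_intern", False)
except PermissionError as err:
    print(err)

print(upload_throttle.enabled)    # unchanged by the denied attempt
print(upload_throttle.audit_log)  # shows who tried to touch the button
```

With something like this in place, the three-hour hunt for the cause becomes a look at the audit log, and the label becomes unnecessary.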

Conclusion

If you know any more feature flag disasters, or best use cases, send them our way. We’re obsessed with feature flags, which is why we built Flagsmith in the first place.

More Feature Flag Content

For more feature flag content, best practices, and things to consider when building out feature flags, visit our content here.
