
When your app starts to get bigger and more complex, the idea of needing to restart a process to pick up any new kind of data starts to seem silly.

Have seen the pattern many times:

Hard-code values in code -> configure via env -> configure slow things via env and fast things via redis -> configure almost everything via a config management system

I do not want to reboot every instance in a fleet of 2000 nodes just to enable a new feature for a new batch of beta testers. How do I express that in an env var anyways? What if I have 100s of flags I need to control?

In other cases I need some set of nodes to behave one way and some other set to behave another way - say the nodes in us-west-2 vs. the nodes in eu-central-1. Do I really want to teach my deploy system the exhaustive configuration differences between environments? No - I want my orchestration and deploy layer to be as similar as possible between regions, and to push almost everything besides region & environment identification into the app layer. Those two can be env vars because they basically never change for the life of the cluster.
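A minimal sketch of that split, in Python (the config dict and `get_config` are hypothetical stand-ins for a real config management client that would watch for live updates):

```python
import os

# The only things baked into the environment: identity, not behavior.
REGION = os.environ.get("REGION", "us-west-2")
ENVIRONMENT = os.environ.get("ENVIRONMENT", "production")

# Hypothetical in-memory stand-in for a config management system.
_CONFIG = {
    ("us-west-2", "production"): {"max_connections": 500},
    ("eu-central-1", "production"): {"max_connections": 200},
}

def get_config(key):
    """Resolve a setting for this node's region/environment at runtime."""
    return _CONFIG[(REGION, ENVIRONMENT)][key]

print(get_config("max_connections"))
```

The env vars tell the process *who it is*; everything about *how it behaves* comes from the config layer and can change without a restart.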



I would add two things:

It's often important that flag changes be atomic. Having subsequent requests get different flag values because they got routed to different backend nodes while a change is rolling out can cause some nasty bugs. A big part of the value of feature flags is to help avoid those kinds of problems when rolling out config changes; if your flags implementation suffers from the same problem, it's not very useful.

Second, config changes are notorious as the cause of incidents. It's hard to "unit test" config changes to the production environment the same way you can with application code. Having people editing a config every time they want to change a flag setting (we're a tiny company and we change our flags multiple times per day) seems like a recipe for disaster.


Making changes atomic is literally impossible; it's easier to just assume they won't be than to chase down something computer science tells us is impossible. I assume that by "atomic" you mean "every node sees the same change at the same time."

As for unit testing flags, you better unit test them! Just mock out your feature flag provider/whatever and test your feature in isolation; like everything else.
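A sketch of what that looks like in Python with `unittest.mock` (the `checkout_total` function and the "beta_discount" flag name are made up for illustration):

```python
import unittest
from unittest import mock

# Hypothetical application code under test: a new discount path
# gated behind a feature flag.
def checkout_total(price, flags):
    if flags.is_enabled("beta_discount"):
        return price * 0.9
    return price

class CheckoutTest(unittest.TestCase):
    def test_discount_on(self):
        flags = mock.Mock()
        flags.is_enabled.return_value = True
        self.assertEqual(checkout_total(100, flags), 90.0)

    def test_discount_off(self):
        flags = mock.Mock()
        flags.is_enabled.return_value = False
        self.assertEqual(checkout_total(100, flags), 100)
```

The mock replaces the real flag provider (redis, database, whatever), so each branch of the feature gets exercised in isolation regardless of the flag's live value.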


It seems like you've kind of missed both of my points.

If you're doing canary deploys to a fleet of 2000 nodes, it might take hours for the config to make it to all of them (I've seen systems where a fleet upgrade can take a week to make it all the way out). If your feature flags are configured that way, there's a long window during which a flag is in that in-between state. We put feature flags in the database, not config/environment, so that we can turn a feature on or off more or less atomically. I.e., an admin goes into the management interface, flips a flag from off to on, and every single request that the system serves after that reflects the new state. As long as you're using a database that supports transactions, you absolutely can have a clear point in time that delineates before/after that change. Rolling out a config change to a large fleet, you don't get that.
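The pattern is small enough to sketch with sqlite (an assumed schema; a real system would use its production database and a connection per request):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature_flags (name TEXT PRIMARY KEY, enabled INTEGER)")
conn.execute("INSERT INTO feature_flags VALUES ('new_checkout', 0)")
conn.commit()

def flag_enabled(name):
    # Every request reads the current value; there are no per-node
    # copies of the flag to drift out of sync during a rollout.
    row = conn.execute(
        "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
    ).fetchone()
    return bool(row[0])

def set_flag(name, enabled):
    # The commit is the single point in time that delineates
    # before/after the change.
    with conn:
        conn.execute(
            "UPDATE feature_flags SET enabled = ? WHERE name = ?",
            (int(enabled), name),
        )

set_flag("new_checkout", True)
print(flag_enabled("new_checkout"))  # True
```

Requests served before the commit see the old value, requests after it see the new one; there is no hours-long mixed state.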

On the second point, what I'm saying is that (talk to your friendly local SRE if you don't believe me) a large percentage of production incidents in large systems are caused by configuration changes, not application changes. This is because those changes are significantly harder to really test than application code. E.g., if someone sets an environment variable for the production environment like `REDIS_IP=10.0.0.13`, how do you know that's the correct IP address in that environment? You can add a ton of linting, you can do reviews, etc., but ultimately it's a common vector for mistakes and one of the hardest areas in which to completely prevent human error from creating a disaster.

One of the best strategies we have is to structure the system so you don't have to make manual environment/config changes that often. If you implement your feature flag system with environment variables/config, you'll massively increase the frequency with which people edit that part of the system, which increases the chances of somebody making a typo, forgetting to close a quote, missing a trailing comma in a JSON file, etc.
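To make the linting point concrete, here's a hypothetical validator sketch: it catches a malformed value, but it fundamentally cannot tell you whether 10.0.0.13 is actually the redis host in *this* environment.

```python
import ipaddress

def validate_config(cfg):
    """Lint a config dict; returns a list of error messages."""
    errors = []
    try:
        ip = ipaddress.ip_address(cfg["REDIS_IP"])
        # Syntactic/structural checks are possible; semantic
        # correctness ("is this the right redis?") is not.
        if not ip.is_private:
            errors.append("REDIS_IP should be a private address")
    except (KeyError, ValueError):
        errors.append("REDIS_IP missing or not a valid IP")
    return errors

assert validate_config({"REDIS_IP": "10.0.0.13"}) == []      # passes lint
assert validate_config({"REDIS_IP": "10.0.0.13x"})           # typo caught
```

A typo'd but syntactically valid address sails straight through, which is exactly why reducing the frequency of manual edits matters.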

Where I work we make production config changes maybe once a week or so and it's done by people who know the infrastructure very well, there's a bunch of linting and validation, and the change is rolled out with a canary system. In contrast, feature flags are in the database and we have a nice, very safe custom UI so folks on the Product and Support teams can manage the flags themselves, turning them on/off for different customers without having to go through an engineer; they might toggle flags a dozen times a day.


How do you do software upgrades if you don't have a good system for handling process restarts without downtime?


Then again, there's speed and performance to consider.

At my last job, updating a production game server cluster took an hour or so with minimal to no customer interruption. Though you could still see and measure how the systems needed another hour or two to get their JITs, database caches, code caches and all of these things back on track. Maybe you can just say "then architect better" or "just use Rust instead of Java", but the system was as it was and honestly, it performed very, very well.

On the other hand, the game servers checked the marketing backend once a minute for which promotion events should be active, and reacted to changes without major caching/performance impact.
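That polling pattern is a few lines in any language; a Python sketch (with a hypothetical `fetch_active_promotions` standing in for the call to the marketing backend):

```python
import threading
import time

ACTIVE_PROMOTIONS = set()

def fetch_active_promotions():
    # Hypothetical stand-in for an HTTP call to the marketing backend.
    return {"summer_sale"}

def refresh_loop(interval_seconds=60):
    """Poll once a minute; request handlers just read ACTIVE_PROMOTIONS."""
    global ACTIVE_PROMOTIONS
    while True:
        ACTIVE_PROMOTIONS = fetch_active_promotions()
        time.sleep(interval_seconds)

# Run the poller off the request path so a slow marketing backend
# can never block the game loop.
threading.Thread(target=refresh_loop, daemon=True).start()
```

Handlers only ever touch a local set, so the per-request cost is effectively zero regardless of how slow the backend is.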

Similar things at my current place. Teams have stable and reliable deployment mechanisms that can bring code to Prod in 10 - 15 minutes, including rollbacks if necessary. It's still both safer to gate new features behind feature toggles, and faster to turn feature toggles on and off. Currently, such per-customer configs apply in 30 - 60 seconds across however many applications deem it relevant.

I would have to think quite a bit to bring binaries to servers that quickly, as well as coordinate restarts properly. The latter would dominate the time easily.


Software updates happen once per two hours, config changes happen once per 5 minutes or faster.

A few days ago I was tuning performance parameters for a low-latency stream processing system; I could iterate in 90 seconds by twiddling some config management bits for 30s in the CLI, watching the graphs for 60s, then repeating.


I mean, isn't that even worse?

If I have 100 servers and I'm doing rolling deploys then I'm going to be in a circumstance where some ratio of my services are in one state and some ratio are in another state.

If I am reading per-request from redis (even with a server cache) I have finer-grained control.

For me it is a question of "is the config valid for the life of this process" vs. "is this config something that might change while this process is alive".



