Engineer at a unicorn SaaS startup: comparing the number of GitHub PRs across two spikes of my maximum output (had to write a lot of production code for weeks), one before AI and one after, shows a 27% increase.
My company heavily relies on Amazon SQS for background jobs. We use Redis as well, but it is hard to run at scale, so anything critical goes to SQS by default. SQS usage is so ubiquitous I can't imagine anyone being interested in writing a blog post or presenting at a conference about it. Once you get used to SQS specifics (more-than-once delivery, the message size limit, client/server tooling, expiration settings, DLQs), I doubt there's anything that can beat it in terms of performance/reliability, unless you have the resources to run Redis/Kafka/etc. yourself. I would recommend searching for talks by Shopify eng folks on their experience, in particular from Kir (e.g. https://kirshatrov.com/posts/state-of-background-jobs)
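FWIW, the more-than-once delivery is the part that bites people most, so here's a rough consumer sketch of how we think about it. Just an illustration, not our actual code: it assumes AWS SDK v3 for JS, a QUEUE_URL env var, and hypothetical alreadyProcessed/markProcessed/handleJob helpers standing in for a real idempotency store and job handler.

    import {
      SQSClient,
      ReceiveMessageCommand,
      DeleteMessageCommand,
    } from "@aws-sdk/client-sqs";

    const sqs = new SQSClient({});
    const queueUrl = process.env.QUEUE_URL!; // placeholder

    // Hypothetical idempotency store; in practice a DB unique key or similar.
    const seen = new Set<string>();
    const alreadyProcessed = (id: string) => seen.has(id);
    const markProcessed = (id: string) => seen.add(id);

    async function handleJob(payload: unknown): Promise<void> {
      // ...the actual background job...
    }

    async function poll(): Promise<void> {
      const { Messages } = await sqs.send(
        new ReceiveMessageCommand({
          QueueUrl: queueUrl,
          MaxNumberOfMessages: 10,
          WaitTimeSeconds: 20, // long polling
        })
      );

      for (const msg of Messages ?? []) {
        // At-least-once delivery: the same message can show up twice,
        // so processing has to be idempotent.
        if (!alreadyProcessed(msg.MessageId!)) {
          await handleJob(JSON.parse(msg.Body!));
          markProcessed(msg.MessageId!);
        }
        // Delete only after successful processing; otherwise the message
        // reappears after the visibility timeout and, with a redrive policy,
        // eventually lands in the DLQ.
        await sqs.send(
          new DeleteMessageCommand({
            QueueUrl: queueUrl,
            ReceiptHandle: msg.ReceiptHandle!,
          })
        );
      }
    }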
I think there are two separate perspectives. For developers, OpenTelemetry is a clear win: high-quality, vendor-agnostic instrumentation backed by a reputable org. I instrumented many business-critical repos at my company (a major customer support SaaS) with OTEL traces in Ruby, Python, and JS. Not once was I confused/blocked/distracted by the presence of logs/metrics in the spec. However, I can't say much from the perspective of an observability vendor trying to be fully compatible with the OTEL spec, including metrics/logs.

The article mentions customers having issues using the tracing instrumentation; it would have been great to back this up with the corresponding GitHub issues explaining the problems. Based on the presented JS snippet (just my guess), maybe the issue is with async code, where the "span.operation" span gets closed immediately without waiting for doTheThing()? Yeah, that's tricky in JS given its async primitives. We ended up just maintaining a global reference to the currently active span and patching some OTEL packages to respect it.

FWIW, Sentry's JS instrumentation IS really good and practical. It would have been great if Sentry could donate/contribute specific improvements to the OTEL JS SIG; that would be a win-win. As much as I hate DataCanine's pricing, they did effectively donate their Ruby tracing instrumentation to OTEL, and I think it is one of the best ones out there.
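To make that guess about the async issue concrete, here's a minimal sketch of the pitfall and the fix, assuming the standard @opentelemetry/api package; doTheThing() is just a stand-in for the async work in the article's snippet, and broken()/fixed() are names I made up for illustration.

    import { trace } from "@opentelemetry/api";

    const tracer = trace.getTracer("example");

    // Stand-in for the async work referenced in the snippet.
    async function doTheThing(): Promise<void> {
      /* ... */
    }

    // Likely broken: the span ends as soon as the callback returns,
    // long before the promise from doTheThing() has settled.
    function broken() {
      tracer.startActiveSpan("span.operation", (span) => {
        doTheThing(); // nobody awaits this
        span.end();
      });
    }

    // Fix: make the callback async, await the work, and end the span
    // in a finally block so errors still close it.
    async function fixed() {
      await tracer.startActiveSpan("span.operation", async (span) => {
        try {
          await doTheThing();
        } finally {
          span.end();
        }
      });
    }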
I've been rinsing my mouth after brushing my teeth my whole life. Turns out you should not. Logically, why even buy that expensive toothpaste only to negate all its positive impact by rinsing it away?
It would be great if the timeline covered the 19 minutes from 06:32 to 06:51. How long did it take to get the right people on the call? How long did it take to identify the deployment as a suspect?
Another massive gap is the rollback, 06:58 to 07:42: 44 minutes! What exactly was going on, and why did it take so long? What were those backup procedures mentioned briefly? Why were engineers stepping on each other's toes? What's the story with reverting reverts?
Adding more automation and tests and fixing that specific ordering issue is of course an improvement. But it adds more complexity, and any automation will ultimately fail some day.
The technical details are all appreciated, but it is going to be something else next time. It would be great to learn more about the human interactions. That's where the resilience of the socio-technical system actually played out, and I bet there is some room for improvement there.
It would be fun to be a fly on the wall when shit hits the fan in general. From nuclear meltdowns to 9/11 ATC recordings, it is fascinating to see how emergencies play out and what kinds of things go on in boots-on-the-ground, all-hands-on-deck situations.
Like, does Cloudflare have an emergency escalation procedure? What does it look like? How does the CTO get woken up in the middle of the night? How do they get hold of their most critical engineers? Who noticed Cloudflare was down first? How do quick decisions get made? Do people get on a giant Zoom call? Or are emails going around? What if they can't reach the most important people who can flip the switches? Do they have a control room like in the movies? The CTO looking over someone's shoulder calling "Affirmative, apply the fix," followed by a progress bar painfully crawling toward completion.
I think blaming the slide/presentation is severely affected by hindsight bias - of course, knowing the outcome, we can find loads of issues with the slide. More importantly, it is always easy to declare a "human factor" incident and blame the human for "bad slides". But the very fact that such an important decision (re-entry) was (presumably) made as a result of Boeing engineers presenting to NASA officials/managers is eyebrow-raising. The fact that this type of issue (foam hitting the tiles) was well known in advance and yet not properly addressed points to systemic organizational problems.
It would be great to study the organizational factors and processes that resulted in both tragedies: how was the risk managed? How did NASA "drift" into failure? I believe focusing on a slide completely misses the point.