
> “Speed was the most important thing,” said Jeff Gardner, a senior user experience designer at CrowdStrike who said he was laid off in January 2023 after two years at the company. “Quality control was not really part of our process or our conversation.”

This type of article - built upon disgruntled former employees - is worth about as much as the apology GrubHub gift card.

Look, I think just as poorly of CrowdStrike as anyone else out there... but you can find someone to say anything, especially when they have an axe to grind and a chance at some spotlight. Not to mention this guy was a designer and wouldn't be involved in QC anyway.

> Of the 24 former employees who spoke to Semafor, 10 said they were laid off or fired and 14 said they left on their own. One was at the company as recently as this summer. Three former employees disagreed with the accounts of the others. Joey Victorino, who spent a year at the company before leaving in 2023, said CrowdStrike was “meticulous about everything it was doing.”

So basically we have nothing.



>>So basically we have nothing.

Except the biggest IT outage ever. And a postmortem showing their validation checks were insufficient. And a rollout process that did not stage at all, just rawdogged straight to global prod. And no lab where the new code was actually installed and run prior to global rawdogging.

I'd say there's smoke, and numerous accounts of fire, and this article can be read in that context.


"Everyone" piles on Tesla all the time; a worthwhile comparison would be how Tesla roll out vehicle updates.

Sometimes people are up in arms asking "where's my next version?" (e.g. when adaptive headlights were introduced), yet Tesla prioritise a safe, slow rollout. Sometimes updates fail (and get resolved individually), but never on a global scale. (I've experienced none myself, as a TM3 owner on the "advanced" update preference.)

I understand the premise of Crowdstrike's model is to have up-to-date protection everywhere, but clearly they didn't think this through enough, if at all.


You can also say the same thing about Google. Just go look at the release notes on the App Store for the Google Home app. There was a period of more than six months where every single release said "over the next few weeks we're rolling out the totally redesigned Google Home app: new easier to navigate 5-tab layout."

When I read the same release notes so often I begin to question whether this redesign is really taking more than six months to roll out. And then I read about the Sonos app disaster and thought that was the other extreme.


> Just go look at the release notes on the App Store for the Google Home app. [...] When I read the same release notes so often I begin to question whether this redesign is really taking more than six months to roll out.

Google is terrible at release notes. For several years now, the release notes for the "Google" app on the Android app store have shown the exact same four unchanging entries, loosely translated from Portuguese: "enhanced search page appearance", "new doodles designed for app experience", "offline voice actions (play music, enable Wi-Fi, enable flashlight) - available only in the USA", "web pages opened directly within the app". I strongly doubt it's taking this many years to roll out these changes; they probably just don't care anymore and never update these app store release notes.


The sentence you quoted clearly meant, from the context, "clearly we have nothing [to learn from the opinions of these former employees]". Nothing in your comment is really anything to do with that.


Triangulation versus new signal.


There definitely was a huge outage, but based on the given information we still can't know for sure how much they invested in testing and quality control.

There's always a chance of failure even for the most meticulous companies.

Now I'm not defending or excusing the company, but a singular event like this can happen to anyone and nothing is 100%.

If a thorough investigation revealed underinvestment in quality control relative to what would be appropriate for a company like this, then we could say so for sure.


Two things are clear, though:

Nobody at CrowdStrike ran this update before shipping it.

The update was pushed globally to all computers at once.

With that alone we know they failed the simplest of quality control methods for a piece of software as widespread as theirs. And that's before considering that there should have been some kind of error handling to allow the computer to boot even if they did push bad code.


While I agree with this, from a software engineering perspective I think it's more useful to look at the lessons learned. I think it's too easy to just throw "Crowdstrike is a bunch of idiots" against the wall, and I don't think that's true.

It's clear to me that CrowdStrike saw this as a data update vs. a code update, and that they had much more stringent QA procedures for code updates than they did for data updates. It's very easy for organizations to lull themselves into this false sense of security when they make these kinds of delineations (sometimes even subconsciously at first), and then over time they lose sight of the fact that a bad data update can be just as catastrophic as a bad code update. I've seen shades of this issue elsewhere many times.

So all that said, I think your point is valid. I know Crowdstrike had the posture that they wanted to get vulnerability files deployed globally as fast as possible upon a new threat detection in order to protect their clients, but it wouldn't have been that hard to build in some simple checks in their build process (first deploy to a test bed, then deploy globally) even if they felt a slower staged rollout would have left too many of their clients unprotected for too long.

Hindsight is always 20/20, but I think the most important lesson is that this code vs data dichotomy can be dangerous if the implications are not fully understood.


It could have been OK to expedite data updates if the code treated configuration data as untrusted input, as though it could have been written by an attacker. That means fuzz testing and all that.
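For what "treat it as untrusted input" looks like in practice, here is a minimal libFuzzer harness sketch (parse_channel_file is a hypothetical entry point; the real parser and file format aren't public). The only property it enforces is that arbitrary bytes never crash the parser:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical parser under test: returns 0 on success, negative on error. */
    int parse_channel_file(const uint8_t *data, size_t len);

    /* libFuzzer entry point: the fuzzer feeds arbitrary byte strings here.
     * Any return value is acceptable; crashes, hangs, and sanitizer reports are bugs. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        (void)parse_channel_file(data, size);
        return 0;
    }

Built with clang -fsanitize=fuzzer,address and run continuously in CI, an out-of-bounds read of the kind triggered by Channel File 291 shows up as a sanitizer report instead of a fleet-wide crash.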

Obviously the system wasn't very robust, since a simple, within-spec change could break it. A company like CrowdStrike, which routinely deals with memory exploits and claims to do "zero trust", should know better.

As is often the case, there is a good chance it is an organizational problem. The team in charge of parsing expected the team in charge of the data to run their tests and make sure the files weren't broken, while the data team expected the parser to be robust and assumed that, at worst, a quick rollback could fix the problem. This may indeed be the sign of a broken company culture, which would give some credit to the ex-employees.


> Obviously the system wasn't very robust, as a simple, within specs change could break it.

From my limited understanding, the file was corrupted in some way. Lots of NULL bytes, something like that.


That rumor floated around Twitter but the company quickly disavowed it. The problem was that they added an extra parameter to a common function but never tested it with a non-wildcard value, revealing a gap in their code coverage review:

https://www.crowdstrike.com/wp-content/uploads/2024/08/Chann...


From the report, it seems the problem is that they added a feature that could use 21 arguments, but there was only enough space for 20. Until then, no configuration used all 21 (the last one was a wildcard regex, which apparently didn't count), but when one finally did, it caused an out-of-bounds read and crashed.
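A rough sketch of that failure mode, with made-up names (the actual sensor code isn't public): the content supplies 21 parameters, the sensor was compiled expecting 20, and an unchecked read walks past the end of the array.

    #include <stdbool.h>
    #include <stdio.h>

    #define SENSOR_FIELD_SLOTS 20   /* hypothetical: what the sensor was built to hold */

    /* param_count comes from the content update itself (21 once the new
     * parameter was added). Without the guard, the loop reads slot 20 of a
     * 20-slot array: an out-of-bounds read, which in kernel context means a crash. */
    bool evaluate_rule(const char *slots[SENSOR_FIELD_SLOTS], size_t param_count) {
        if (param_count > SENSOR_FIELD_SLOTS) {
            /* the missing check: reject the content instead of dereferencing */
            fprintf(stderr, "content rejected: %zu params, only %d slots\n",
                    param_count, SENSOR_FIELD_SLOTS);
            return false;
        }
        for (size_t i = 0; i < param_count; i++) {
            const char *value = slots[i];   /* safe only because of the guard above */
            (void)value;                    /* ...pattern-match against it here... */
        }
        return true;
    }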


> It's clear to me that CrowdStrike saw this as a data update vs. a code update, and that they had much more stringent QA procedures for code updates that they did data updates.

It cannot have been a surprise to Crowdstrike that pushing bad data had the potential to bork the target computer. So if they had such an attitude that would indicate striking incompetence. So perhaps you are right.


> It's clear to me that CrowdStrike saw this as a data update vs. a code update

> Hindsight is always 20/20, but I think the most important lesson is that this code vs data dichotomy can be dangerous if the implications are not fully understood.

But it's not some new condition that the industry hasn't already been dealing with for many many decades (i.e. code vs config vs data vs any other type of change to system, etc.).

There are known strategies to reduce the risk.


If they weren't idiots they wouldn't be parsing data in the kernel level module


Crowdstrike is a bunch of idiots


I'm sorry but there comes a point where you have to call a spade a spade.

When you have the trifecta of regex, *argv packing and uninitialized memory you're reaching levels of incompetence which require being actively malicious and not just stupid.


Also, it's the _second_ time they've done this in a few short months.

They had previously bricked Linux hosts with a similar type of update.

So we also know that they don't learn from their mistakes.


The blame for the Linux situation isn't as clear-cut as you make it out to be. Red Hat rolled out a breaking change to BPF, which was likely a regression. That wasn't caused directly by a CrowdStrike update.


At least one of the incidents involved Debian machines, so I don't understand how Red Hat's change would be related.


Sorry, that's correct, it was Debian, but Debian did apply a RHEL-specific patch to their kernel. That's the relationship to Red Hat.


It's not about the blame, it's about how you respond to incidents and what mitigation steps you take. Even if they aren't directly responsible, they clearly didn't take proper mitigation steps when they encountered the problem.


How do you mitigate the OS breaking an API below you in an update? Test the updates before they come out? Even if you could, you'd still need to deploy a fix before the OS update hits the customers, and anyone that didn't update would still be affected.

The Linux case is just _very_ different from the Windows case. The mitigation steps that could have been taken to avoid the Linux problem would not have helped with the Windows outage anyway; the problems are just too different. The Linux update was about an OS update breaking their program, while the Windows issue was about a configuration change they made triggering crashes in their driver.


You're missing the forest for the trees.

It's: a) an update, b) pushed out globally without proper testing, c) that bricked the OS.

It's an obvious failure mode that, with a proper incident response process, would have been revealed by that specific incident and flagged as needing mitigation.

I do this specific thing for a living. You don't just address the exact failure that happened but try to identify classes of risk in your platform.

> Even if you could, you'd still need to deploy a fix before the OS update hits the customers, and anyone that didn't update would still be affected.

And yet the problem would still only affect CrowdStrike's paying customers. No matter how much you blame upstream, your paying customers are only ever going to blame their vendor, because the vendor had the discretion to test and not release the update. As their customers should.


Sure, customers are free to blame their vendor. But please, we're on HN, we aren't customers, we don't have skin in this game. So we can do better here and properly allocate blame, instead of piling on the CrowdStrike hate for internet clout.

And again, you cannot prevent your vendor from breaking you. Sure, you can magic up some convoluted process to catch it ASAP. But that won't help the poor sods who get caught in between.


> there should have been some kind of error handling

This is the point I would emphasize. A kernel module that parses configuration files must defend itself against a failed parse.
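Roughly, the shape of that defense (hypothetical names and a userspace sketch; the real sensor's structures aren't public): validate the blob as untrusted input, and if validation fails, keep running on the last known good configuration instead of faulting.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_FIELDS 20

    typedef struct {
        uint8_t field_count;
        uint8_t fields[MAX_FIELDS];
    } rules_t;

    static rules_t last_known_good;
    static bool have_lkg = false;

    /* Treat the blob as untrusted: return false on anything malformed
     * instead of assuming the content pipeline shipped a valid file. */
    static bool parse_rules(const uint8_t *blob, size_t len, rules_t *out) {
        if (len < 1 || blob[0] > MAX_FIELDS || len < (size_t)1 + blob[0])
            return false;                   /* truncated or too many fields */
        out->field_count = blob[0];
        memcpy(out->fields, blob + 1, blob[0]);
        return true;
    }

    void load_content_update(const uint8_t *blob, size_t len) {
        rules_t candidate;
        if (!parse_rules(blob, len, &candidate)) {
            fprintf(stderr, "bad content update: %s\n",
                    have_lkg ? "keeping last known good config"
                             : "no config loaded yet, staying on defaults");
            return;                         /* degrade gracefully, don't crash the box */
        }
        last_known_good = candidate;        /* promote only after a clean parse */
        have_lkg = true;
    }

The runtime bounds checks in CrowdStrike's RCA cover the validation half of this; falling back to a last known good configuration is the generic reliability technique mentioned elsewhere in this thread.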


> If thorough investigation revealed poor quality control investment compared to what would be appropriate for a company like this, then we can say for sure.

We don't really need that thorough of an investigation. They had no staged deploys when servicing millions of machines. That alone is enough to say they're not running the company correctly.
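For a sense of how little machinery a staged deploy needs, a generic sketch (not CrowdStrike's actual mechanism): hash a stable machine ID against a rollout percentage, start at around 1%, and widen only after crash telemetry stays clean.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a: a simple, stable hash so a given machine always lands in the
     * same bucket for a given update ID. */
    static uint64_t fnv1a(const char *s) {
        uint64_t h = 14695981039346656037ULL;
        while (*s) {
            h ^= (uint8_t)*s++;
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Returns true if this machine is in the current rollout ring. */
    bool machine_in_rollout(const char *machine_id, const char *update_id,
                            unsigned rollout_percent) {
        char key[256];
        snprintf(key, sizeof key, "%s:%s", machine_id, update_id);
        return fnv1a(key) % 100 < rollout_percent;
    }

Widening from 1% to 100% over a matter of hours, gated on crash telemetry, would most likely have turned a global outage into a contained one.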


Totally agree.

I’d consider staggering a rollout to be the absolute basics of due diligence.

Especially when you’re building a critical part of millions of customer machines.


I also fall on the side of "stagger the rollout" (or "give customers tools to stagger the rollout"), but at the same time I recognize that a lot of customers would not accept delays on the latest malware data.

Before the incident, if you asked a customer if they would like to get updates faster even if it means that there is a remote chance of a problem with them... I bet they'd still want to get updates faster.


There must be balance


I would say that a canary release is an absolute must, 100%. But I can think of cases where it might still not be enough, so I just don't feel comfortable judging them out of hand. Does all the evidence seem to point against them? For sure. But I just don't feel comfortable giving that final verdict without knowing for certain.

Specifically because this is about fighting against malicious actors, where time can be of essence to deploy some sort of protection against a novel threat.

If there are deadlines you can slip without anything bad happening, then sure: always have canary releases, perfect QA, and thorough monitoring of everything. But I'm just saying there can be cases where the damage done by not acting fast enough is just so much worse.

And I don't know that it wasn't the case for them. I just don't know.


> Specifically because this is about fighting against malicious actors, where time can be of essence to deploy some sort of protection against a novel threat.

This is severely overstating the problem: an extra few minutes is not going to be the difference between their customers being compromised or not. Most of the devices they run on are never compromised, because anyone remotely serious has defense in depth.

If it were true, or even close to true, that would make the criticism stronger rather than weaker. If time is of the essence, you invest in things like reviewing test coverage (their most glaring lapse), fuzz testing, and common reliability engineering techniques like having the system roll back to the last known good configuration after it has failed to load. We think of progressive rollouts as common now, but they became mainstream in large part because the Google Chrome team realized rapid updates are important and then asked what they needed to do to make them safe. CrowdStrike's report suggests that they wanted rapid but weren't willing to invest in the implementation, because that isn't a customer-visible feature – until it very painfully became one.


In this case, they pretty much caused a worst case scenario…


They literally half-assed their deployment process - one part enterprisey, one part "move fast and break things".

Guess which part took down much of the corporate world?

from Preliminary Post Incident Review at https://www.crowdstrike.com/falcon-content-update-remediatio... :

"CrowdStrike delivers security content configuration updates to our sensors in two ways: Sensor Content that is shipped with our sensor directly, and Rapid Response Content that is designed to respond to the changing threat landscape at operational speed.

...

The sensor release process begins with automated testing, both prior to and after merging into our code base. This includes unit testing, integration testing, performance testing and stress testing. This culminates in a staged sensor rollout process that starts with dogfooding internally at CrowdStrike, followed by early adopters. It is then made generally available to customers. Customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’) through Sensor Update Policies.

The event of Friday, July 19, 2024 was not triggered by Sensor Content, which is only delivered with the release of an updated Falcon sensor. Customers have complete control over the deployment of the sensor — which includes Sensor Content and Template Types.

...

Rapid Response Content is used to perform a variety of behavioral pattern-matching operations on the sensor using a highly optimized engine.

Newly released Template Types are stress tested across many aspects, such as resource utilization, system performance impact and event volume. For each Template Type, a specific Template Instance is used to stress test the Template Type by matching against any possible value of the associated data fields to identify adverse system interactions.

Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published.

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production."


> one part enterprisey, one part "move fast and break things".

When there's a 0-day, how enterprisey would you like to be about catching the 0-day?


Not sure, but definitely more enterprisey than "release a patch to the entire world at once before running it on a single machine in-house".


So it would be preferable to have your data encrypted and held hostage unless you pay, and to be down for days, instead of just six hours of downtime?


Do you seriously believe that all of CrowdStrike's Windows customers were at such imminent risk of ransomware that taking one or two hours to run this on one internal setup and catch the critical error they released would have been dangerous?

This is a ludicrous position, and it has been proven obviously false by events: none of the systems crashed by this critical failure were, in fact, attacked with ransomware once the CS agent was uninstalled (at great pain).


I'd challenge you to be a CISO :)

You don't want to be in a situation where you're taken hostage and asked for a hundred million in ransom just because you were too slow to mitigate the situation.


That's a false dichotomy


Crowdstrike exploited their own 0-day. Their market cap went down by several billion dollars.

A patch should, at minimum:

1. Let the app run
2a. Block the offending behaviour
2b. Allow normal behaviour

Part 1. can be assumed if Parts 2a and 2b work correctly.

We know CrowdStrike didn't ensure 2a or 2b since the app caused the machine to reboot when the patch caused a fault in the app.

CrowdStrike's Root Cause Analysis, https://www.crowdstrike.com/wp-content/uploads/2024/08/Chann..., lists what they're going to do:

====

Mitigation: Validate the number of input fields in the Template Type at sensor compile time

Mitigation: Add runtime input array bounds checks to the Content Interpreter for Rapid Response Content in Channel File 291
- An additional check that the size of the input array matches the number of inputs expected by the Rapid Response Content was added at the same time.
- We have completed fuzz testing of the Channel 291 Template Type and are expanding it to additional Rapid Response Content handlers in the sensor.

Mitigation: Correct the number of inputs provided by the IPC Template Type

Mitigation: Increase test coverage during Template Type development

Mitigation: Create additional checks in the Content Validator

Mitigation: Prevent the creation of problematic Channel 291 files

Mitigation: Update Content Configuration System test procedures

Mitigation: The Content Configuration System has been updated with additional deployment layers and acceptance checks

Mitigation: Provide customer control over the deployment of Rapid Response Content updates

====


Nonsense. You don’t need any staged deploys if you simply make no mistakes.

/s


[flagged]


Could you please stop posting unsubstantive comments and/or flamebait? Posts like this one and https://news.ycombinator.com/item?id=41542151 are definitely not what we're trying for on HN.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


> And no lab where the new code was actually installed and run prior to global rawdogging.

I thought the new code was actually installed; the running part depends on the script input...?


I just don't think a company like Crowdstrike has a leg to stand on when leveling the "disgruntled" label in the face of their, let's face it, astoundingly epic fuck up. It's the disgruntled employees that I think would have the most clear picture of what was going on, regardless of them being in QA/QC or not because they, at that point, don't really care any more and will be more forthright with their thoughts. I'd certainly trust their info more than a company yes-man which is probably where some of that opposing messaging came from.


Why would you trust a company no-man any more than a company yes-man? They both have agendas and biases. Is it just that you personally prefer one set of biases (anti-company) more than the other (pro-company)?


Yes, I am very much biased toward being anti-company, and I make no apologies for that. I've been in the corporate world long enough to know first-hand the sins that PR and corporate management commit on the company's behalf and the harm they do. I find information coming from individuals more reliable than having it filtered through corpo PR, legal, ass-covering nonsense; the latter group often cares more about preserving the status quo than getting out actual info.


OK just checking. Nice that you at least acknowledge your bias.


Because there is still an off chance that an employee who has been let go isn't speaking out of spite and is merely stating the facts - it depends on a combination of their honesty and the feelings they harbor about being let go. Not everyone who is let go is bitter and/or a liar.

However, every company yes-man is paid to be a yes-man and will speak in favor of the company without exception - that literally is the job. Otherwise they will be fired and will join the ranks of the aforementioned people.

So logically it makes more sense for me to believe the former over the latter. The two sides are not equivalent (as you may have implied) in terms of trustworthiness.


Agreed. As a data point, I'm not disgruntled (I'm quoted in this article).

Mostly disappointed.


Well, in this case, we know one side (pro-company) fucked up big time. The other side (anti-company) may or may not have fucked up.

That makes it easier to trust one side over another.


You’ve kind of set yourself up in a no-lose situation here.

If the employees fucked up then you’ll say the company still fucked up because it wasn’t managing the employees well.

And then in that situation you'll still believe the lying employees who say it's the company's fault while leaving out their own culpability.


> So basically we have nothing.

No, what we have is a publication that is claiming the people they talked to were credible and had points that were interesting and tended to match one another and/or other evidence.

You can make the claim that Semafor is bad at their jobs, or even that they're malicious. But that's a hard claim to make given that in the paragraph you've quoted they are giving you the contrary evidence that they found.

And this is a process many of us have done informally. When we talk to one ex-employee of a company, well maybe it was just that guy, or just where he was in the company. But when a bunch of people have the same complaint, it's worth taking it much more seriously.


This is like online reviews. If you selectively take positive or negative reviews and somehow censor the rest, the reviews are worthless. Yet, if you report on all the ones you find, it's still useful.

Yes, I'm more likely to leave reviews if I'm unsatisfied. Yes, people are more likely to leave CS if they were unhappy. Biased data, but still useful data.


If design isn't involved in QC, you're not doing QC very well. If design isn't plugged into the development process enough to understand QC, then you're not doing design very well.


Why would a UX designer be involved in any way, shape, or form in kernel level code patches? They would literally never ship an update if they had that many hands in the pot for something completely unrelated. Should they also have their sales reps and marketing folks pre-brief before they make any code changes?


A UX designer might have told them it was a bad idea to deploy the patch widely without testing a smaller cohort, for instance. That’s an obvious measure that they skipped this time.


But that doesn't have anything to do with what UX designers typically do


I can't believe people on HN are posting this stuff over and over again. Either you are wholly disconnected from what proper software development should look like, or you are outright creating the same environments that resulted in the CrowdStrike issue.

Software security and quality is the responsibility of everyone on the team. A good UX designer should be thinking of ways a user can escape the typical flow or operate in unintended ways and express that to testers. And in decisions where management is forcing untested patches everyone should chime in.


Not true; UX designers typically are responsible for advocating for a robust, intuitive experience for users. The fact that kernel updates don’t have a user interface doesn’t make them exempt from asking the simple question: how will this affect users? And the subsequent question: is there a chance that deploying this eviscerates the user experience?

Granted, a company that isn’t focused on the user experience as much as it is on other things might not prioritise this as much in the first place.


The person you're replying to will not accept any sane argument once they've decided that UX must be involved in kernel-level technical decisions...


How would it not be related? Jamming untested code down the pipe with no way for users to configure when it's deployed and then rendering their machines inoperable is an extremely bad user experience and I would absolutely expect a UX expert to step in to try to avoid that.


Pick any large company that has a division working on the Linux kernel (say, Android).

I bet my ass UX is not anywhere close to the low-level OS team.

UX is definitely embedded in the app-level team, but not in the low-level one.


Pfft, I never said that at all. I’m not talking about technical decisions. OP was talking about QC, which is verifying software for human use. If you don’t have user-centered people involved (UX or product or proserve) then you end up with user-hostile decisions like these people made.


I would agree if it were a UI designer, but a good UX designer designs for the users, which in this case includes the system admins who will be applying kernel-level code patches. Ensuring they have a good experience, e.g. no crashes, is their job. A likely recommendation would be, for example, small roll-outs to minimise the number of people having a bad user experience when a roll-out goes wrong.


I'm going with principle of least astonishment, where productivity is more highly valued in most companies than quality control.


I feel like crowdstrike is perfectly capable of mounting its own defense


There are some very specific accusations backed up by non-denials from crowdstrike.

Ex-employees said bugs caused the log monitor to drop entries. Crowdstrike responded the project was never designed to alert in real time. But Crowdstrike's website currently advertises it as working in real time.

Ex-employees said people trained to monitor laptops were assigned to monitor AWS accounts with no extra training. Crowdstrike replied that "there were no experienced ‘cloud threat hunters’ to be had" in 2022 and that optional training was available to the employees.


> Quality control was not really part of our process or our conversation.

Is anyone really surprised, or did anyone learn any new information? For those of us who have worked at tech companies, this is one of those repeated complaints you hear across orgs that indicates a less-than-stellar engineering culture.

I've worked with numerous F500 orgs, and I would say that in 3 out of 5 of them the code was so bad it made me wonder how they hadn't had a major incident yet.


In principle, yes, I agree that former employees' sentiments have an obvious bias, but if they all trend in the same direction - people who worked at different times, in different functions, and didn't know each other on the job - that points to a likely underlying truth.


Well they certainly don't care about the speed of the endpoints their malware runs on. Shit has ruined my macos laptop's performance.


All EDR software does (at least on macOS).

Source: me, a developer who also codes in my free time and notices how bad filesystem performance in particular gets.

I've had the CrowdStrike sensor, and my current company is using Cyberhaven.

So... while two data points don't technically make a pattern, it does begin to raise suspicion.


> This type of article - built upon disgruntled former employees - is worth about as much as the apology GrubHub gift card

To you and me, maybe. To the insurers and airlines paying out over the problem, maybe not.


I do agree that you have to expect bias there, but who else do you really expect to speak out? Any current employee would very quickly become an ex-employee if they spoke out with any specifics.

I would expect any contractor that may have worked for CrowdStrike, or done something like a third-party audit, would be under an NDA covering their work.

Who's left to speak out with any meaningful details?


Here's some anecdotal evidence - a friend worked at CrowdStrike and was horrified at how incredibly disorganised the whole place was. They said it was completely unsurprising to them that the outage occurred. More surprising to them was that it hadn't happened more often given what a clusterfrock the place was.


> So basically we have nothing.

Except the fact that CrowdStrike fucked up the one thing they weren't supposed to fuck up.

So yeah, at this point I'm taking the ex-employees' word, because it matches the result we already know -- there is no way that update could have gone out if proper "safety first" protocols had been in place and CrowdStrike had been "meticulous".


The disgruntled ones are the CrowdStrike customers who had to deal with the outage. These employees have a lot of reputation to lose by coming forward. CrowdStrike is a disgrace of a company, and many others like it are engaging in the same behaviors but just haven't gotten caught yet. Software development became a disgrace when the bottom line of squeezing margins to please investors took over.


> These employees have a lot of reputation to lose for coming forward.

Employees don't typically have much reputation to ruin. I am perfectly content putting this on their shoulders.


Honestly, this article describes nearly all companies (from the perspective of the engineers), so I don't find it hard to believe this one is the same.



