Redsauce's Software QA Blog

Crowdstrike and the global collapse bug

Posted by Pablo Gómez


Friday, July 19, 2024. Airports and train and bus stations are buzzing, summer vacations are starting, the Olympic Games are a week away and... crash. The blue screen, the Blue Screen of Death (BSoD), or whatever you want to call it, appears on terminals around the world.

This large-scale blackout, which affects airlines, hospitals, emergency services, national TV stations, banks, large supermarkets and more, has direct consequences: delayed flights, delayed medical services, cancelled live broadcasts, impossible payments... A collapse that started in the USA and spread to the rest of the world.



BSoD image, blue screen of death. EFE.

But why did the collapse of Crowdstrike affect all these companies?

Their services run on Windows operating systems that also use antivirus software from Crowdstrike, a U.S. cybersecurity company founded in 2011 and a Microsoft vendor.


Crowdstrike's antivirus, called “Falcon Agent”, received an update during the night of Thursday to Friday that was incompatible with the normal operation of the operating system, causing it to crash and triggering the global collapse that followed.

A few hours later, INCIBE (Spain's National Cybersecurity Institute) classified the incident as critical (level 5) and indicated that the following steps should be taken to solve the problem:

  1. Start the computer in safe mode or in the Windows recovery environment.

  2. Go to the folder C:\Windows\System32\drivers\CrowdStrike.

  3. Delete the file “C-00000291*.sys”.

  4. Restart the computer.

INCIBE also notes that the fourth step alone, restarting the computer, may be enough to solve the problem.
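For anyone who prefers to script steps 2 and 3, here is a minimal Python sketch. It assumes the folder in step 2 uses the usual Windows backslash separators, and the function name and the dry_run flag are our own inventions, not part of the INCIBE advisory. It is an illustration only: a machine stuck in safe mode or in the recovery environment may not have Python available at all.

```python
from pathlib import Path

# File pattern quoted in step 3 of the advisory.
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Find files matching the faulty pattern; delete them unless dry_run is True.

    Hypothetical helper for illustration. Returns the matching paths either way.
    """
    matches = sorted(driver_dir.glob(FAULTY_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()  # step 3: delete the file
    return matches


if __name__ == "__main__":
    # Folder from step 2 (assumed separators); only meaningful on an affected Windows machine.
    target = Path(r"C:\Windows\System32\drivers\CrowdStrike")
    for found in remove_faulty_files(target, dry_run=True):
        print(f"would delete: {found}")
```

Running it first with dry_run=True lists what would be removed, a small safety net before deleting anything from a drivers folder.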

If Crowdstrike has not been cyber-attacked, why is it classified as a security incident?

Because Confidentiality, Integrity and Availability (known as the CIA triad) are the three pillars of an organization's computer security, and this was a clear availability failure. A system that is not available has a tremendous impact on reputation and user trust. An organization may also be more exposed to attacks while it tries to restore service, including potential exploits during reboots or emergency patching. Attackers can also take advantage of the lack of availability to trick users with social engineering tactics, making them believe that they must take actions that compromise their security.

How did this happen to Crowdstrike?

What happened is that the antivirus FAILED because it had a DEFECT (bug) in its update process, caused by an ERROR in its development. And that error is human. What can be confusing is that the company is a security specialist, yet the defect stemmed from a problem in the development life cycle, and the resulting failure became a security problem through lack of availability.

The post-mortem analysis of the incident will give more details of what happened, although they will probably not be shared beyond Crowdstrike's IT team. Everything suggests that one or more phases of the validation process for this update were skipped in pre-production environments: safe places where manual and automated, functional and non-functional tests can be run without risk to the business. Or that the pre-production environments were not reliable enough and did not reproduce the error. Or that the tests were not detailed enough. Or that the code review was not thorough enough. Or maybe all the testing was done and passed, and the deployment process picked the wrong branch. Maybe with a progressive deployment...
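To make the idea of a progressive deployment concrete, here is a minimal Python sketch. It says nothing about how Crowdstrike actually ships updates: the function, its parameters and the stage percentages are all hypothetical, chosen only to show the principle of updating a small slice of machines first and stopping if that slice turns out unhealthy.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
) -> tuple[bool, list[str]]:
    """Deploy to growing slices of hosts and halt as soon as a stage is unhealthy.

    deploy(host) is a hypothetical callback returning True if the host is
    healthy after the update. Returns (completed, hosts_that_got_the_update).
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, round(len(hosts) * fraction))
        batch = hosts[len(updated):target]
        if not batch:
            continue
        failures = sum(0 if deploy(host) else 1 for host in batch)
        updated.extend(batch)
        if failures / len(batch) > max_failure_rate:
            # Halt: the remaining hosts never receive the faulty update.
            return False, updated
    return True, updated
```

With 1,000 machines and the default stages, an update that breaks every machine it touches would stop after the first 10 instead of reaching all 1,000.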

Be that as it may, human hands are involved in a quality process that has clearly shown room for improvement. It is also true that it is very difficult (actually impossible) to present software as 100% defect-free. The reason the defect was not detected before reaching production was probably a silly oversight, one of those that make a lot of noise.

Consequences

The problem did not lie with Microsoft, which was rather one of the victims, but the consequences for it, and especially for Crowdstrike, remain to be seen; on the very same day Crowdstrike already suffered a double-digit drop in its share price. Having a QA role in a technical team, or implementing automated validation processes as a safety net against failures in production, is often not considered relevant... when everything is going well. But cases like today's show that the return on investment can be immediate, not only in economic terms but also in terms of brand image, credibility, security, etc.


PS: my solidarity with the IT teams of many companies. With this happening on a Friday, they are going to have a busy weekend.
