fmt::format vs. std::format

The excellent {fmt} largely served as the blueprint for the C++20 standard formatting library. That alone speaks for its quality. But I was curious: should you now just use std::format for everything, or is fmt::format still a good option? In this particular instance, I wanted to know which one is faster, so I wrote a small benchmark. Of course, the outcome very much depends on the standard library you are using. In my case, I’m using Visual Studio 17.13.0 and its standard library, and {fmt} version 11.1.3.

I started with a benchmark helper function:


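#include <chrono>
#include <concepts>
#include <format>
#include <iostream>
#include <string_view>

using namespace std::chrono;

// Runs f once, prints the elapsed time in microseconds and returns it.
template <std::invocable<> F>
steady_clock::duration benchmark(std::string_view label, F f)
{
  auto start = steady_clock::now();
  f();
  auto end = steady_clock::now();
  auto time = end - start;
  auto us = duration_cast<nanoseconds>(time).count() / 1000.0;
  std::cout << std::format("{0} took {1:.3f}us", label, us) << std::endl;
  return time;
}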

Then I called it with a lambda like this, with NUMBER_OF_ITERATIONS set to 500000:

int integer = 567800;
float real = 1234.0089f;
for (std::size_t i = 0; i < NUMBER_OF_ITERATIONS; ++i)
  auto _ = fmt::format("an int: {}, and a float: {}", integer, real);

… and the same thing with std::format.

Interestingly, fmt::format needed only about 75%-80% of the time of std::format in a release build, while the situation reversed in a debug build, where it took about 106%-108%.

It seems hard to construct a benchmark with little overhead from other things while still preventing the compiler from optimizing everything away. My code assumes the compiler keeps the formatting call even though its result is thrown away. So take all my results with a grain of salt!
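
One way to make it harder for the compiler to discard the formatting is to consume every result. The following is a minimal sketch under that assumption, not the code measured above: timed_format_loop, LoopResult and the checksum are hypothetical names, and the callable stands in for a lambda that returns the formatted string.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical result type: elapsed time plus a value derived from every string.
struct LoopResult {
    std::chrono::steady_clock::duration elapsed;
    std::size_t checksum;
};

// Calls make_string `iterations` times and folds each result into a checksum,
// so the formatted strings are actually used instead of being thrown away.
template <typename F>
LoopResult timed_format_loop(std::size_t iterations, F make_string)
{
    std::size_t checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        const std::string s = make_string();
        checksum += s.size();
        if (!s.empty())
            checksum += static_cast<unsigned char>(s.front());
    }
    const auto end = std::chrono::steady_clock::now();

    // A write to a volatile object is an observable side effect,
    // so the checksum (and the work feeding it) has to be computed.
    volatile std::size_t sink = checksum;
    (void)sink;

    return LoopResult{end - start, checksum};
}
```

A call could pass a lambda such as [&] { return fmt::format("an int: {}, and a float: {}", integer, real); } and the same with std::format. The checksum adds a little work of its own, but it is identical for both libraries, so the comparison should stay fair.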

Exploring Dynamic SQL in Oracle: Counting Specific Values Across Multiple Tables

Imagine you have a large database where multiple tables contain a column named BOOK_ID. Perhaps you’re tasked with finding how many times a particular book (say, with an ID of 12345) appears across all these tables. How would you approach this?

In this post, I’ll explain a small piece of Oracle PL/SQL code that uses dynamic SQL to search for a specific value in any table that has a column with a specific name.

Finding the relevant tables

First, we need to determine which tables in the schema have a column with the desired name. To do this, we must look at a metadata table or view that contains information about all the tables in the schema and their columns. Most database systems offer such metadata tables, although their names and their structures vary greatly between different systems. In Oracle, the metadata table relevant to our task is called DBA_TAB_COLUMNS. Therefore, to find the names of all tables that contain a column BOOK_ID, you can use the following query:

SELECT table_name 
  FROM dba_tab_columns 
  WHERE column_name = 'BOOK_ID';

The output might be:

TABLE_NAME
-----------------
BOOKS
LIBRARY_BOOKS
ARCHIVED_BOOKS
TEMP_BOOKS
OLD_BOOKS

Looping through the tables

Now we want to loop through these tables in order to execute an SQL query for each of them. In Oracle we use a PL/SQL FOR loop to do this:

BEGIN
  FOR rec IN (SELECT table_name 
              FROM dba_tab_columns 
              WHERE column_name = 'BOOK_ID')
  LOOP
    -- do something for each record
  END LOOP;
END;
/

Dynamic SQL Construction

We can use the loop variable rec to dynamically create an SQL statement as a string by using the string concatenation operator || and assign it to a variable, in this case v_sql:

v_sql := 'SELECT COUNT(BOOK_ID) FROM ' || rec.table_name || ' WHERE BOOK_ID = :val';

The :val part is a placeholder that will become relevant later.

Of course, the variable needs to be declared for the PL/SQL code block first, so we add a DECLARE section:

DECLARE
  v_sql VARCHAR2(4000);
BEGIN
  -- ...
END;

How do we execute the SQL statement that we stored in the variable? By using the EXECUTE IMMEDIATE statement and two other variables; let’s call them v_result and v_value:

EXECUTE IMMEDIATE v_sql INTO v_result USING v_value;

This will execute the SQL in v_sql, replace the :val placeholder with the value in v_value, and store the result in v_result. The latter will capture the result of our dynamic query, which is the count of occurrences.

Of course, we have to declare these two variables as well. We’ll set v_value to the book ID we are looking for. The whole code so far is:

DECLARE
  v_sql VARCHAR2(4000);
  v_value NUMBER := 12345;  -- Value to search for
  v_result VARCHAR2(4000);
BEGIN
  FOR rec IN (SELECT table_name 
              FROM dba_tab_columns 
              WHERE column_name = 'BOOK_ID')
  LOOP
    v_sql := 'SELECT COUNT(BOOK_ID) FROM ' || rec.table_name || ' WHERE BOOK_ID = :val';

    EXECUTE IMMEDIATE v_sql INTO v_result USING v_value;
  END LOOP;
END;
/

Printing the results

If we execute the code above, we might be a little disappointed because it is accepted and executed without any errors, but nothing is printed out. How can we see the results? For that, we need to include DBMS_OUTPUT.PUT_LINE calls:

DBMS_OUTPUT.PUT_LINE('Found in ' || rec.table_name || ': ' || v_result);

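But how do we handle the case that there was an error in the SQL query? We’ll wrap it in an EXCEPTION handling block. The NO_DATA_FOUND branch is purely defensive here, because a COUNT query always returns exactly one row, even if no record matches: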

BEGIN
  EXECUTE IMMEDIATE v_sql INTO v_result USING v_value;
  DBMS_OUTPUT.PUT_LINE('Found in ' || rec.table_name || ': ' || v_result);
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    NULL;  -- No matching rows found in this table
  WHEN OTHERS THEN
    DBMS_OUTPUT.PUT_LINE('Error in ' || rec.table_name || ': ' || SQLERRM);
END;

There’s still one thing to do. We first have to enable server output to see any of the printed lines:

SET SERVEROUTPUT ON;

This command ensures that any output generated by DBMS_OUTPUT.PUT_LINE will be displayed in your SQL*Plus or SQL Developer session.

Let’s put it all together. Here’s the full code:

SET SERVEROUTPUT ON;

DECLARE
  v_sql VARCHAR2(4000);
  v_value NUMBER := 12345;  -- Value to search for
  v_result VARCHAR2(4000);
BEGIN
  FOR rec IN (SELECT table_name 
              FROM dba_tab_columns 
              WHERE column_name = 'BOOK_ID')
  LOOP
    v_sql := 'SELECT COUNT(BOOK_ID) FROM ' || rec.table_name || ' WHERE BOOK_ID = :val';

    BEGIN
      EXECUTE IMMEDIATE v_sql INTO v_result USING v_value;
      DBMS_OUTPUT.PUT_LINE('Found in ' || rec.table_name || ': ' || v_result);
    EXCEPTION
      WHEN NO_DATA_FOUND THEN
        NULL; -- No rows found in this table
      WHEN OTHERS THEN
        DBMS_OUTPUT.PUT_LINE('Error in ' || rec.table_name || ': ' || SQLERRM);
    END;
  END LOOP;
END;
/

Here’s what the output might look like when the block is executed:

Found in BOOKS: 5
Found in LIBRARY_BOOKS: 2
Found in ARCHIVED_BOOKS: 0

Alternatively, if one of the tables throws an error (for instance, due to a permissions issue or if the table doesn’t exist in the current schema), you might see an output like this:

Found in BOOKS: 5
Error in TEMP_BOOKS: ORA-00942: table or view does not exist
Found in LIBRARY_BOOKS: 2

Conclusion

Dynamic SQL is particularly useful when the structure of your query is not known until runtime. In this case, since the table names come from a data dictionary view (dba_tab_columns), the query must be constructed dynamically.

Instead of writing a separate query for each table, the above code automatically finds and processes every table with a BOOK_ID column. It works on any table with the right column, making it useful for large databases.

Building and running SQL statements on the fly allows you to handle tasks that are not possible with static SQL alone.

Local JavaScript module development

https://www.viget.com/articles/how-to-use-local-unpublished-node-packages-as-project-dependencies/

yalc: no version upgrade, no publish etc.

Building an application using libraries – called packages or modules in JavaScript – has been common practice for decades. We often use third-party libraries in our projects so that we do not have to implement everything ourselves.

In this post I want to describe the less common situation where we are using a library we developed on our own and/or are actively maintaining. While working on the consuming application we need to change the library sometimes, too. This can lead to a cumbersome process:

  1. Need to implement a feature or fix in the application leads to changes in our library package.
  2. Make a release of the library and publish it.
  3. Make our application reference the new version of our library.
  4. Test everything and find out that more changes are needed.
  5. Goto 1.

This round-trip cycle takes time, probably creates useless releases of our library and makes our development iterations visible to the public.

A faster and lighter alternative

Many may point to npm link or yarn link but there are numerous problems associated with these solutions, so I tried the tool yalc.

After installing the tool (globally) you can make changes to the library and publish them locally using yalc publish.

In the dependent project you add the local dependency using yalc add <dependency_name>. Now we can quickly iterate without creating public releases of our library and test everything locally until we are truly ready.

This approach worked nicely for me. yalc has a lot more features and there are nice guides and of course its documentation.

Conclusion

Developing several JavaScript modules locally in parallel is relatively easy given the right tooling.

Do you have similar experiences? Do you use other tools you would recommend?

You are misled about Big-O notation

One statement I hear people say and repeat a lot, especially in the data-oriented design bubble, is that Big-O notation cannot accurately describe the real-life performance of contemporary computer programs, especially in the presence of multi-tier memory hierarchies like L1/L2/L3-caches for RAM. This is, at best, misleading and gives this fantastic tool a bad reputation.

At its core, Big-O is just a way to categorize functions by how they scale. There’s nothing in the formal definition about performance at all. Of course, it is often used to categorize the performance of algorithms and their implementations. But to use it for that, you need two other things: a machine model and a metric for it.

Traditionally, when performance categorization using Big-O is taught, the machine model is either the Turing-machine or the slightly closer-to-reality RAM-machine. The metric is a number of operations. The operation that is counted has a huge impact. For example, insertion sort can easily be implemented in O(n*log(n)) when counting the number of comparisons (by using binary search to find the insertion point), but is in O(n²) when counting the number of memory moves/swaps.
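
To make the difference between the two metrics tangible, here is a small sketch of a binary insertion sort that counts both. It is an illustration under my own assumptions: SortCounts and binary_insertion_sort are hypothetical names, and "moves" counts only the element shifts, not the final placement of the key.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical counters for the two metrics discussed above.
struct SortCounts {
    std::size_t comparisons = 0;
    std::size_t moves = 0;
};

// Insertion sort that finds the insertion point with binary search.
// Comparisons grow like n*log(n), element shifts like n^2 in the worst case.
SortCounts binary_insertion_sort(std::vector<int>& values)
{
    SortCounts counts;
    for (std::size_t i = 1; i < values.size(); ++i) {
        const int key = values[i];

        // Binary search in the sorted prefix [0, i).
        std::size_t lo = 0;
        std::size_t hi = i;
        while (lo < hi) {
            const std::size_t mid = lo + (hi - lo) / 2;
            ++counts.comparisons;
            if (key < values[mid])
                hi = mid;
            else
                lo = mid + 1;
        }

        // Shift everything in [lo, i) one slot to the right.
        for (std::size_t j = i; j > lo; --j) {
            values[j] = values[j - 1];
            ++counts.moves;
        }
        values[lo] = key;
    }
    return counts;
}
```

For eight elements in reverse order, this sketch needs 28 shifts (1 + 2 + ... + 7), while no single insertion needs more than three comparisons. Same algorithm, same input, two very different growth rates depending on what you count.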

Neither the model nor the metric is intrinsic to Big-O. To use it in the context of memory hierarchies, you just need to start counting what matters to you, e.g. memory accesses, cache misses or branch mispredictions. This is not new either; I learned about cache-aware and cache-oblivious machine models for this in university over 15 years ago.
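
As a sketch of what "counting what matters" can look like, the following toy model counts cache-line misses instead of operations. Everything in it is an assumption for illustration: a cache that holds exactly one 64-byte line, a row-major n x n matrix of 4-byte values, and the hypothetical function count_line_misses. Real caches are far more forgiving than this model.

```cpp
#include <cstddef>
#include <limits>

// Toy machine model (an assumption for illustration): the cache holds a single
// line of kLineBytes bytes, and touching any other line costs one miss.
constexpr std::size_t kLineBytes = 64;

// Counts line misses when visiting every element of an n x n row-major
// matrix of 4-byte values, either row by row or column by column.
std::size_t count_line_misses(std::size_t n, bool walk_rows)
{
    constexpr std::size_t kElementBytes = 4;
    std::size_t misses = 0;
    // No line is cached before the first access.
    std::size_t cached_line = std::numeric_limits<std::size_t>::max();

    for (std::size_t outer = 0; outer < n; ++outer) {
        for (std::size_t inner = 0; inner < n; ++inner) {
            const std::size_t row = walk_rows ? outer : inner;
            const std::size_t col = walk_rows ? inner : outer;
            const std::size_t line = (row * n + col) * kElementBytes / kLineBytes;
            if (line != cached_line) {
                ++misses;
                cached_line = line;
            }
        }
    }
    return misses;
}
```

Both walks perform n² element accesses, so they look identical when counting operations. Counting misses in this model, the row walk for n = 64 needs 256 of them and the column walk 4096. With the line size as a parameter B of the machine model, that is the difference between n²/B and n².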

TL;DR: Big-O is not obsolete, you just have to use it to count the appropriate performance-critical element in your algorithm.

Integrating API Key Authorization in Micronaut’s OpenAPI Documentation

In a Java Micronaut application, endpoints are often secured using @Secured(SecurityRule.IS_AUTHENTICATED), along with an authentication provider. In this case, authentication takes place using API keys, and the authentication provider validates them. If you also provide Swagger documentation for users to test API functionalities quickly, you need a way for users to specify an API key in Swagger that is automatically included in the request headers.

For a general guide on setting up a Micronaut application with OpenAPI Swagger and Swagger UI, refer to this article.

The following article focuses on how to integrate API key authentication into Swagger so that users can authenticate and test secured endpoints directly within the Swagger UI.

Accessing Swagger Without Authentication

To ensure that Swagger is always accessible without authentication, update the application.yml file with the following settings:

micronaut:  
  security:
    intercept-url-map:
      - pattern: /swagger/**
        access:
          - isAnonymous()
      - pattern: /swagger-ui/**
        access:
          - isAnonymous()
    enabled: true

These settings ensure that Swagger remains accessible without requiring authentication while keeping API security enabled.

Defining the Security Scheme

Micronaut supports various Swagger annotations to configure OpenAPI security. To enable API key authentication, use the @SecurityScheme annotation:

import io.swagger.v3.oas.annotations.security.SecurityScheme;
import io.swagger.v3.oas.annotations.enums.SecuritySchemeIn;
import io.swagger.v3.oas.annotations.enums.SecuritySchemeType;

@SecurityScheme(
    name = "MyApiKey",
    type = SecuritySchemeType.APIKEY,
    in = SecuritySchemeIn.HEADER,
    paramName = "Authorization",
    description = "API Key authentication"
)

This defines an API key security scheme with the following properties:

  • Name: MyApiKey
  • Type: APIKEY
  • Location: Header (Authorization field)
  • Description: Explains how the API key authentication works

Applying the Security Scheme to OpenAPI

Next, we configure Swagger to use this authentication scheme by adding it to @OpenAPIDefinition:

import io.swagger.v3.oas.annotations.OpenAPIDefinition;
import io.swagger.v3.oas.annotations.info.*;
import io.swagger.v3.oas.annotations.security.SecurityRequirement;

@OpenAPIDefinition(
    info = @Info(
        title = "API",
        version = "1.0.0",
        description = "This is a well-documented API"
    ),
    security = @SecurityRequirement(name = "MyApiKey")
)

This ensures that the Swagger UI recognizes and applies the defined authentication method.

Conclusion

With these settings, your Swagger UI will display an Authorization field in the top-left corner.

Users can enter an API key, which will be automatically included in all API requests as a header.

This is just one way to implement authentication. The @SecurityScheme annotation also supports more advanced authentication flows like OAuth2, allowing seamless token-based authentication through a token provider.

By setting up API key authentication correctly, you enhance both the security and usability of your API documentation.

How React components can know their actual dimensions

Every once in a while, styling a web application can be oh so frustrating (or rather: quite interesting), because stuff that appears easy does not actually comply with any of your suggestions. And there are some fields that ambush me more often than I’d like to admit, and with each application there appears some unique quirk that makes a universal solution hard.

Right now, I’m thinking about a CSS-only nested layout of several areas on your available screen that need to make good use of the available space, but still be somewhat dynamic in order to be maintainable.

Web Apps are especially delicate in layout things because if you ask most customers, a fully responsive layout is never the goal (as in, way too expensive for their use case), but no matter how often you make them assure you that there is only a small set of target resolutions, there will be one day where something changed and well-yeah-these-ones-too-of-course.

It is also commonly encountered that “looking good” is “not that important”, but as progress goes, everyone still knows that that was a pure lie.

Of course, this is a manifestation of Feature Creep, but one that is hard to argue about. And we do not want to argue with customers anyway, we want to solve their problems with as little friction as possible.

So by now, one would have thought that CSS would have evolved quite enough in order to at least place dynamic content somewhat predictably. There are flexbox and grid displays and these are useful as hell, but still.

And while, for some reason or another, the width of dynamic nested content can usually be accounted for in some pure CSS solution that one can find in under a day’s work; getting the height quite right is a problem that is officially harder than all multi-order corrections I ever encountered in my studies of quantum field theory. Only solvable in some oversimplified use-cases.

The limits of “height: 100%;” are reached in cases where an element’s size is dominated by its content instead of its container, as with nested <svg> elements that love to disagree about the meaning of “100%”. Dynamic SVG content is especially cumbersome because you want neither distorted nor cut-off content, and you can try to get along with viewBox and preserveAspectRatio, but even then.

Maybe it won’t budge, and maybe that’s the point where I find it acceptable to read the actual DOM elements even from within a React component, an approach that is usually as dangerous as it is intrusive,

but is it a code smell if it is rather concise and reliably does the job?

import { useEffect, useRef, useState } from "react";

const useHeightAwareRef = () => {
    const [height, setHeight] = useState({
        initialized: false,
        value: null,
    });
    const ref = useRef(null);

    useEffect(() => {
        if (!ref.current) {
            return;
        }

        const adjustHeight = () => {
            const rect = ref.current?.getBoundingClientRect();
            setHeight({
                value: rect?.height ?? null,
                initialized: true
            });
        };

        // Measure once after mount, then again on every window resize.
        adjustHeight();
        window.addEventListener("resize", adjustHeight);
        return () => {
            window.removeEventListener("resize", adjustHeight);
        };
        // Run only once: depending on height.initialized would re-run the
        // effect after the first measurement and drop the resize listener.
    }, []);

    return {
        height: height.initialized ? height.value : null,
        ref
    };
};

// then use this like:

const SomeNestedContent = () => {
    const {height, ref} = useHeightAwareRef();

    return (
        <div ref={ref}>
        {
            height &&
            <svg height={height} width={"100%"}>
                { /* ... Dragons be here ... */ }
            </svg>
        }
        </div>
    );
};

I find this worthwhile to have in your toolbox. If you manage your super-dynamic* content in some other super-responsive** fashion in a way that is super-arguable*** to your customer, sure, go by it. But remember, at some point each, possibly,

  • (*) your customer might have data outside the mutually agreed use cases,
  • (**) your customer might have screens outside the mutually agreed ones,
  • (***) your customer might have less patience / time than originally intended,

so maybe move the idea of “there must be one super-elegant pure-CSS solution in the year of 2025” back into your dreams and shoehorn that <svg> & friends into where they belong :´)

Beware of using Git LFS on Github

In my private game programming projects, I am often using data files alongside my code for all kinds of game assets like images and sounds. So I thought it might be a good idea to use the Git Large File Storage (=LFS) extension for that.

What is Git LFS?

Essentially, if you’re not using it, every file will be in your local .git folder if it was part of your repository at any time in your history. E.g. if you accidentally added and committed an 800 MB video file and then deleted it again, it will still be in your local .git folder. This problem multiplies when using a CI with many branches: each branch will typically have a copy of all files ever used in your repository. This is not a problem with source code files, because they are not that big and they can be compressed really well with different versions of themselves, which is what git typically does.

With Git LFS, the big files are only stored as references in the .git folder. This means that you might need an additional request to your remote when checking them out again, but it will save you lots of space and traffic when cloning repositories.

In my previous projects on github, I just did not enable LFS for my assets. And that worked fine, as my assets are usually pretty small and I don’t change them often. But this time I wanted to try it.

Sorry, Github, what?

Imagine my surprise when I got an e-mail from github last month warning me that my LFS traffic quota was almost reached and that I would have to pay to extend it. What? I never had any traffic quota problems without LFS. Github doesn’t even seem to have one, if I just keep my big files in ‘pure’ git. So that’s what I get for trying to save Github traffic.

Now the LFS quota is a meager 1 gb per month with Github Pro. That’s nothing. Luckily, my current project is not asset heavy: the full repo is very small at ~60mb. But still the quota was reached with me as a single developer. How did that happen? I just enabled CI for my project on my home server and I was creating lots of branches my CI wanted to build. That’s only 12 branches cloned for the 80% warning to be reached.

Workarounds

Jenkins, which I’m using as a CI tool, has the ability to use a ‘reference repository’ when cloning. This can be used to get the bulk of the data from a local remote, while getting the rest from Github. This is what I’m now using to avoid excess LFS traffic. It is a bit of a pain to set up: you have to manually maintain this reference repository, Jenkins will not do it for you, and you have to do that on each agent. I only have one at this point, so that’s an okay trade-off. But next time, I sure won’t use Git LFS on Github, if I can avoid it.

Inline and Implicit Foreign Key Constraints in SQL

Foreign key constraints are a key part of database design, ensuring that relationships between tables are consistent and reliable. They create a relationship between two tables, ensuring that data matches across them. For example, a column in an “Orders” table (like CustomerID) might reference a column in a “Customers” table. This guarantees that every order belongs to a valid customer.

In earlier versions of SQL systems, defining foreign key constraints often required separate ALTER TABLE statements after the table was created:

CREATE TABLE Orders (
  OrderID    INT PRIMARY KEY,
  CustomerID INT NOT NULL,
  OrderDate  DATE
);

ALTER TABLE Orders
ADD CONSTRAINT FK_Customer FOREIGN KEY (CustomerID)
REFERENCES Customers(CustomerID);

This two-step process was prone to errors and required careful management to ensure all constraints were applied correctly.

Inline Foreign Key Constraints

Most of the popular SQL database systems – PostgreSQL, Oracle, SQL Server, and MySQL since version 9.0, released in July 2024 – now support inline foreign key constraints. This means you can define the relationship directly in the column definition, making table creation easier to read:

CREATE TABLE Orders (
  OrderID    INT PRIMARY KEY,
  CustomerID INT NOT NULL REFERENCES Customers(CustomerID),
  OrderDate  DATE
);

Fortunately, this syntax is the same across these systems. In addition, MySQL 9 supports an implicit form that omits the referenced column:

CREATE TABLE Orders (
  OrderID    INT PRIMARY KEY,
  CustomerID INT NOT NULL REFERENCES Customers,
  OrderDate  DATE
);

By leaving out the (CustomerID) in the REFERENCES clause, you tell the database to reference the primary key of the parent table. This shorthand is not limited to MySQL (PostgreSQL, for example, also defaults to the parent table’s primary key when the column is omitted), but spelling out the referenced column is the safer choice if you need to write SQL DDL statements that work across multiple database systems.

No more Schneide Dev Brunches

In the year 2004, we had an idea: What if we met on Sunday for a late breakfast or early lunch (hence the word “brunch”) and talked about software, software development and IT in general while we ate?

The “brunch” theme pinpointed the time: 11 o’clock. The presence of food implied the presence of a dining table, something we found in our company kitchen (and later directly beside it). This defined the place: The kitchen of the Softwareschneiderei became the location to meet, eat and talk every other month, six times a year.

The brunches had no real structure or guest list, we showed up and were open to contributions, ideas and nearly everything that happened, as long as it fitted the shared interest of the participants.

For 20 years, we continued this series of events. It was great fun, real inspiration and always a source for thoughts and new ideas. Because it was designed as a meeting in presence, we had some challenges over the years:

First, the list of regular guests grew until the demand for some kind of transcript of meetings that were missed added the habit of “recap blog entries”. You can still find them, they have their own category on our blog:

https://schneide.blog/category/events/dev-brunch/

The recap was dropped when we shared more details about topics on the mailing list that acted as a planning tool for the meetings.

Second, in early 2020, we were faced with the reality that in-person meetings wouldn’t be feasible for quite some time. We didn’t have a crystal ball that could predict the future of the corona pandemic, so we couldn’t be sure, but we had some knowledge about the 1918 influenza pandemic from the book “Pale Rider”, written by Laura Spinney that some of us read in 2018. We took it as a prediction of the things that might come and switched to online meetings, which changed the character of the brunch, not least because nobody ate on camera anymore. The beginning of the pandemic was an eventful period for us, as we not only changed the Dev Brunch, but also everything else in our daily work.

A third thing became obvious when it would have been feasible to meet in person again: Everybody lives somewhere else now. We would have a stark drop in attendance if we went back to in-person brunches, just because of basic geography. We considered hybrid events, but they might have been the “worst of both worlds” instead of a good compromise.

The fourth thing that is about to change is the discontinuation of our mailing list infrastructure. This might sound like a small thing, but we take privacy and data protection really seriously and don’t want to move e-mail addresses around without proper consent by everybody. If we build up a new equivalent to the mailing list, we will start from scratch with proper agreements. Again, this might sound over the top in comparison to other companies’ conduct in regard to e-mail addresses, but that’s no reason to act the same.

So, this is farewell to a series of events that helped shape us the way we are. The Dev Brunch was a wonderful idea that facilitated our passion: software development. That doesn’t mean we are less passionate or less inclined to talk for hours about software development. It just means that the setting of future talks will be different.

Let us stay in touch!

Surviving the “Big One” in IT – Part I

For every kind of natural disaster, there is a “Big One”. Everybody who lived through it remembers the time, everybody else has at least heard stories about it. Every time a similar natural disaster occurs, it gets compared to this one.

We just commemorated the “Boxing Day Tsunami” of twenty years ago. Another example might be “The Big One”, the devastating earthquake of San Francisco in 1906. From today’s viewpoint, it wasn’t the strongest earthquake since, but it was one of the first to be extensively covered by “modern” media. It preceded the Richter scale, so we can’t directly compare it to current events.

In the rather young history of IT, we had our fair share of “natural” disasters as well. We used to give the really bad ones nicknames. The first vulnerability that was equipped with a logo and its own domain was Heartbleed in 2014, ten years ago.

Let’s name-drop some big incidents:

The first entry in this list is different from the others in that it was a “near miss”. It would have been a veritable catastrophe with millions of potentially breached and compromised systems. It just got discovered and averted right before it would have been distributed worldwide.

Another thing we can deduce from the list is the number of incidents per year:

https://www.cve.org/about/Metrics

From around 5k published vulnerabilities per year until 2014 (roughly one every two hours), it rose to 20k in 2021 and 30k in 2024. That’s about 80 reports per day, or more than 3 per hour. A single human cannot keep up with these numbers. We need to rely on filters that block out the noise and highlight the relevant issues for us.

But let’s assume that the next “Big One” happens and gets our attention. There is one common characteristic for all incidents I witnessed that is similar to earthquakes or floods: It happens everywhere at once. Let me describe the situation using the example of Log4Shell:

The first reports indicated a major vulnerability in the log4j package. That seemed bad, but it was a logging module, what could possibly happen? We could lose the log files?

It soon became clear that the vulnerability can be used from a distance by just sending over a malicious request that gets logged. Like a web request without proper authentication to a route that doesn’t exist. That’s exactly what logging is for: Capturing the outliers and preserving them for review.

Right at the moment that it dawned on us that every system with any remote accessibility was at risk, the first reports of automated attacks emerged. It was now late Friday evening, the weekend had just started, and you realized that you were in a race against bots. The last thing you can do is call it a week and relax for 2 days. In these 48 hours, the war would be lost and the systems compromised. You know that you have at most 4 hours to:

  • Gather a list of affected projects/systems
  • Assess the realistic risk based on current knowledge
  • Hand over concrete advice to the system’s admins
  • Or employ the countermeasures yourself

In our case, that meant reviewing nearly 50 projects, documenting the decisions and communicating with the operators.

While we did that, during Friday night, new information emerged that not only log4j 2.x, but also 1.x was susceptible to similar attacks.

We had to review our list and decisions based on the new situation. While we were doing that, somebody on the internet refuted the claim and proclaimed the 1.x versions safe.

We had to split our investigation into two scenarios that both got documented:

  • scenario 1: Only log4j 2.x is affected
  • scenario 2: All versions of log4j are vulnerable

We employed actions based on scenario 1 and held our breath that scenario 2 wouldn’t come true.
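
A two-scenario review like this can be kept manageable if the inventory is data and the scenario is just a parameter. The following is a minimal sketch of that idea, not the tooling we used: SystemEntry, Scenario and affected_systems are hypothetical names, and a real inventory would carry far more detail than a log4j major version.

```cpp
#include <string>
#include <vector>

// The two scenarios we had to document side by side.
enum class Scenario { OnlyLog4j2, AllLog4jVersions };

// Hypothetical inventory entry; log4j_major == 0 means "no log4j found".
struct SystemEntry {
    std::string name;
    int log4j_major;
};

// Decides whether a system needs attention under the given scenario.
bool is_affected(const SystemEntry& entry, Scenario scenario)
{
    if (entry.log4j_major == 0)
        return false;
    if (scenario == Scenario::AllLog4jVersions)
        return true;
    return entry.log4j_major == 2;
}

// Returns the names of all affected systems, in inventory order.
std::vector<std::string> affected_systems(const std::vector<SystemEntry>& inventory,
                                          Scenario scenario)
{
    std::vector<std::string> names;
    for (const SystemEntry& entry : inventory) {
        if (is_affected(entry, scenario))
            names.push_back(entry.name);
    }
    return names;
}
```

When the assessment changes in the middle of the night, only the scenario argument changes, and the difference between the two result lists is exactly the set of systems that needs a second look.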

One system with log4j 1.x was deemed “low impact” if down, so we took it off the net as a precaution. Spoiler: scenario 2 was not true, so this was an unnecessary step in hindsight. But in the moment, it was one problem off the list, regardless of scenario validity.

The thing to recognize here is that the engagement with the subject is not linear and not fixed. The scope and details of the problem change while you work on it. Uncertainties arise and need to be taken into account. When you look back on your work, you’ll notice all the unnecessary actions that you did. They didn’t appear unnecessary in the moment or at least you weren’t sure.

After we completed our system review and had carried out all the necessary actions, we switched to “survey and communicate” mode. We monitored the internet talk about the vulnerability and stayed in contact with the admins that were online. I remember an e-mail from an admin that copied some excerpts from the server logfiles with the caption: “The attacks are here!”.

And that was the moment my heart sank, because we had totally forgotten about the second front: Our own systems!

Every e-mail is processed by our mailing infrastructure and one piece of it is the mail archive. And this system is written in Java. I raced to find out which specific libraries are used in it. Because if a log4j 2.x library were included, the friendly admin would have just inadvertently performed a real attack on our infrastructure.

A few minutes after I finished my review (and found a log4j 1.x library), the producer of the product sent an e-mail, validating my result by saying that the product is not at risk. But those 30 minutes of uncertainty were pure panic!

In case of an airplane emergency, they always tell you to make sure you are stable first (i.e. place your own oxygen mask first). The same thing can be said about IT vulnerabilities: Mind your own systems first! We would have secured our clients’ systems and then fallen prey to friendly fire if the mail archive had been vulnerable.

Let’s re-iterate the situation we will find ourselves in when the next “Big One” hits:

  • We need to compile a list of affected instances, both under our direct control (our own systems) and under our ministration.
  • We need to assess the impact of immediate shutdown. If feasible, we should take as many systems as possible out of the equation by stopping or airgapping them.
  • We need to evaluate the risk of each instance in relation to the vulnerability. These evaluations need to be prioritized and timeboxed, because they need to be performed as fast as possible.
  • We need to document our findings (for later revision) and communicate the decision or recommendation with the operators.

This situation is remarkably similar to real-world disaster mitigation:

  • The lists of instances are disaster plans
  • The shutdowns are like evacuations
  • The risk evaluation is essentially a triage task
  • The documentation and delegation phase is the command and control phase of disaster relief crews

This helps a lot to see which elements can be prepared beforehand!

The disaster plans are the most obvious element that can be constructed during quiet times. Because no disaster occurs according to plan and plans tend to get outdated quickly, they need to be intentionally fuzzy on some details.

The evacuation itself cannot be fully prepared, but it can be facilitated by plans and automation.

The triage cannot be prepared either, but it can be supported by checklists and training.

The documentation and communication can be somewhat formalized, but will probably happen in a chaotic and unpredictable manner.

With this insight, we can look at possible ideas for preparation and planning in the next part of this blog series.