Surviving the “Big One” in IT – Part I

For every kind of natural disaster, there is a “Big One”. Everybody who lived through it remembers the time, everybody else has at least heard stories about it. Every time a similar natural disaster occurs, it gets compared to this one.

We just remembered the “Boxing Day Tsunami” twenty years ago. Another example might be “The Big One”, the devastating earthquake of San Francisco in 1906. From today’s viewpoint, it wasn’t the strongest earthquake since, but it was one of the first to be extensively covered by “modern” media. It preceded the Richter scale, so we can’t directly compare it to current events.

In the rather young history of IT, we had our fair share of “natural” disasters as well. We used to give the really bad ones nicknames. The first vulnerability that was equipped with a logo and its own domain was heartbleed in 2014, ten years ago.

Let’s name-drop some big incidents:

The first entry in this list is different from the others in that it was a “near miss”. It would have been a vertitable catastrophe with millions of potentially breached and compromised systems. It just got discovered and averted right before it would have been distributed worldwide.

Another thing we can deduce from the list is the number of incidents per year:

https://www.cve.org/about/Metrics

From around 5k published vulnerabilities per year until 2014 (roughly one every two hours) it rose to 20k in 2021 and 30k in 2024. That’s 80 reports per day or 4 per hour. A single human cannot keep up with these numbers. We need to rely on filters that block out the noise and highlight the relevant issues for us.

But let’s assume that the next “Big One” happens and attains our attention. There is one common characteristic for all incidents I witnessed that is similar to earthquakes or floods: It happens everywhere at once. Let me describe the situation at the example of Log4Shell:

The first reports indicated a major vulnerability in the log4j package. That seemed bad, but it was a logging module, what could possibly happen? We could lose the log files?

It soon became clear that the vulnerability can be used from a distance by just sending over a malicious request that gets logged. Like a web request without proper authentication to a route that doesn’t exist. That’s exactly what logging is for: Capturing the outliers and preserving them for review.

Right at the moment that it dawned on us that every system with any remote accessibility was at risk, the first reports of automated attacks emerged. It was now friday late evening, the weekend just started and you realized that you are in a race against bots. The last thing you can do is call it a week and relax for 2 days. In these 48 hours, the war is lost and the systems are compromised. You know that you have at most 4 hours to:

  • Gather a list of affected projects/systems
  • Assess the realistic risk based on current knowledge
  • Hand over concrete advice to the system’s admins
  • Or employ the countermeasures yourself

In our case, that meant to review nearly 50 projects, document the decision and communicate with the operators.

While we did that, during friday night, new information occurred that not only log4j 2.x, but also 1.x was susceptible to similar attacks.

We had to review our list and decisions based on the new situation. While we were doing that, somebody on the internet refuted the claim and proclaimed the 1.x versions safe.

We had to split our investigation into two scenarios that both got documented:

  • scenario 1: Only log4j 2.x is affected
  • scenario 2: All versions of log4j are vulnerable

We employed actions based on scenario 1 and held our breath that scenario 2 wouldn’t come true.

One system with log4j 1.x was deemed “low impact” if down, so we took it off the net as a precaution. Spoiler: scenario 2 was not true, so this was an unnecessary step in hindsight. But in the moment, it was one problem off the list, regardless of scenario validity.

The thing to recognize here is that the engagement with the subject is not linear and not fixed. The scope and details of the problem change while you work on it. Uncertainties arise and need to be taken into account. When you look back on your work, you’ll notice all the unnecessary actions that you did. They didn’t appear unnecessary in the moment or at least you weren’t sure.

After we completed our system review and had carried out all the necessary actions, we switched to “survey and communicate” mode. We monitored the internet talk about the vulnerability and stayed in contact with the admins that were online. I remember an e-mail from an admin that copied some excerpts from the server logfiles with the caption: “The attacks are here!”.

And that was the moment my heart sank, because we had totally forgotten about the second front: Our own systems!

Every e-mail is processed by our mailing infrastructure and one piece of it is the mail archive. And this system is written in Java. I raced to gather insights what specific libraries are used in it. Because if a log4j 2.x library were included, the friendly admin would have just inadvertently performed a real attack on our infrastructure.

A few minutes after I finished my review (and found a log4j 1.x library), the producer of the product sent an e-mail, validating my result by saying that the product is not at risk. But those 30 minutes of uncertainty were pure panic!

In case of an airplane emergency, they always tell you to make sure you are stable first (i.e. place your own oxygen mask first). The same thing can be said about IT vulnerabilities: Mind your own systems first! We would have secured our client’s systems and then fallen prey to friendly fire if the mail archive would have been vulnerable.

Let’s re-iterate the situation we will find ourselves in when the next “Big One” hits:

  • We need to compile a list of affected instances, both under our direct control (our own systems) and under our ministration.
  • We need to assess the impact of immediate shutdown. If feasible, we should take as many systems as possible out of the equation by stopping or airgapping them.
  • We need to evaluate the risk of each instance in relation to the vulnerability. These evaluations need to be prioritized and timeboxed, because they need to be performed as fast as possible.
  • We need to document our findings (for later revision) and communicate the decision or recommendation with the operators.

This situation is remarkably similar to real-world disaster mitigation:

  • The lists of instances are disaster plans
  • The shutdowns are like evacuations
  • The risk evaluation is essentially a triage task
  • The documentation and delegation phase is the command and control phase of disaster relief crews

This helps a lot to see which elements can be prepared beforehands!

The disaster plans are the most obvious element that can be constructed during quiet times. Because no disaster occurs according to plan and plans tend to get outdated quickly, they need to be intentionally fuzzy on some details.

The evacuation itself cannot be fully prepared, but it can be facilitated by plans and automation.

The triage cannot be prepared either, but supported by checklists and training.

The documentation and communication can be somewhat formalized, but will probably happen in a chaotic and unpredictable manner.

With this insight, we can look at possible ideas for preparation and planning in the next part of this blog series.

String Representation and Comparisons

Strings are a fundamental data type in programming, and their internal representation has a significant impact on performance, memory usage, and the behavior of comparisons. This article delves into the representation of strings in different programming languages and explains the mechanics of string comparison.

String Representation

In programming languages, such as Java and Python, strings are immutable. To optimize performance in string handling, techniques like string pools are used. Let’s explore this concept further.

String Pool

A string pool is a memory management technique that reduces redundancy and saves memory by reusing immutable string instances. Java is a well-known language that employs a string pool for string literals.

In Java, string literals are automatically “interned” and stored in a string pool managed by the JVM. When a string literal is created, the JVM checks the pool for an existing equivalent string:

  • If found, the existing reference is reused.
  • If not, a new string is added to the pool.

This ensures that identical string literals share the same memory location, reducing memory usage and enhancing performance.

Python also supports the concept of string interning, but unlike Java, it does not intern every string literal. Python supports string interning for certain strings, such as identifiers, small immutable strings, or strings composed of ASCII letters and numbers.

String Comparisons

Let’s take a closer look at how string comparisons work in Java and other languages.

Comparisons in Java

In this example, we compare three strings with the content “hello”. While the first comparison return true, the second does not. What’s happening here?

String s1 = "hello";
String s2 = "hello";
String s3 = new String("hello");

System.out.println(s1 == s2); // true
System.out.println(s1 == s3); // false

In Java, the == operator compares references, not content.

First Comparison (s1 == s2): Both s1 and s2 reference the same object in the string pool, so the comparison returns true.

Second Comparison (s1 == s3): s3 is created using new String(), which allocates a new object in heap memory. By default, this object is not added to the string pool, so the object reference is unequal and the comparison returns false.

You can explicitly add a string to the pool using the intern() method:

String s1 = "hello";
String s2 = new String("hello").intern();

System.out.println(s1 == s2); // true

To compare the content of strings in Java, use the equals() method:

String s1 = "hello";
String s2 = "hello";
String s3 = new String("hello");

System.out.println(s1.equals(s2)); // true
System.out.println(s1.equals(s3)); // true
Comparisons in Other Languages

Some languages, such as Python and JavaScript, use == to compare content, but this behavior may differ in other languages. Developers should always verify how string comparison operates in their specific programming language.

s1 = "hello"
s2 = "hello"
s3 = "".join(["h", "e", "l", "l", "o"])

print(s1 == s2)  # True
print(s1 == s3)  # True

print(s1 is s2)  # True
print(s1 is s3)  # False

In Python, the is operator is used to compare object references. In the example, s1 is s3 returns False because the join() method creates a new string object.

Conclusion

Different approaches to string representation reflect trade-offs between simplicity, performance, and memory efficiency. Each programming language implements string comparison differently, requiring developers to understand the specific behavior before relying on it. For example, some languages differentiate between reference and content comparison, while others abstract these details for simplicity. Languages like Rust, which lack a default string pool, emphasize explicit memory management through ownership and borrowing mechanisms. Languages with string pools (e.g., Java) prioritize runtime optimizations. Being aware of these nuances is essential for writing efficient, bug-free code and making informed design choices.

Python desktop applications with asynchronous features (part 1)

The Python world has this peculiar characteristic that for nearly all ideas that exist out there in the current programming world, there is a prepared solution that appears well-rounded at first and when you then try to “just plug it together”, after a certain while you encounter some most specific cases that make all the “just” go up in smoke and leave you with some hours of research.

That is also, why now, for quite some of these “solutions” there seem to be various similar-but-distinct packages that you better care to understand; which is again, hard, because every Python developer always likes to advertise their package with very easy sounding words.

But that is the fun of Python. If I choose it as suitable for a project, this is the contract I sign 🙂

So I recently ran into a surprisingly non-straightforward case with no simple go-to-solution. Maybe you know one and then I’ll be thankful to try that and discuss it through; sometimes one is just blind from all the options. (It’s also what you sign when choosing Python. I cope.)

Now, I thought a lot about my use case and I will split these findings up into multiple blog posts – thinking asynchronously (“async”) is tricky in many languages, but the Python way, again, is to hide its intricacies in very carefully selected places.

A desktop app with TPC socket HANDLING

As long as your program is a linear script, you do not need async features. But if there is some thing to be done for a longer time (or endless), you cannot afford to wait for it or otherwise intervene.

E.g for our desktop application (employing Python’s very basic tkinter) we sooner or later run into the tk.mainloop() , which, from the OS thread it was called from, is a endless loop that draws the interface, handles input events, updates the interface, repeat. This is blocking, i.e. that thread can now only do other stuff from the event handlers acting on that interface.

You might know: Any desktop UI framework really hates if you try to update its interface from “outside the UI thread”. Just think of any order of unpredictable behaviour if you would e.g. draw a button, then handle its click event and while you’re at it, a background thread removes that button – etc.

The good thing is, you will quickly be told not to do such a thing, the bad thing is, you might end up with an unexpected or difficult to parse error and some question marks about what else to do.

The UI-thread problem is a specific case of “doing stuff in parallel requires you to really manage who can actually access a given resource”. Just google race condition. If you think about it, this holds for managing projects / humans in general, but also for our desktop app that allows simple network connections via TCP socket.

Now the first clarifications have to be done. Python gives you

  • on any tkinter widget, you have “.after()” to call any function some time in the future, i.e. you enqueue that function call to be executed after a time has passed; so this is nice for making stuff happen in the UI thread.
  • But even small stuff like writing some characters to a file might delay the interface reaction time and people nowadays have no time for that (maybe I’m just particularly impatient.)
  • Python’s standard library gives us packages like threading, asyncio and multiprocessing package, now for long enough to consider them mature.
  • There are also more advanced solutions, like looking into PyQt, or the mindset “everything is a Web App nowadays”, and they might equip you with asynchronous handling from the start, but remember – The Schneiderei in Softwareschneiderei means that we prefer tailor-made software over having to integrating bloated dependencies that we neither know nor want to maintain all year long.

We conclude with shining our light to the general choice in this project, and a few reasons why.

  1. What Python calls “threads” are not the same thing as what operating systems understands as a thread (which also differs) among them. See also the Python Global Interpreter Lock. We do not want to dive into all of that.
  2. multiprocessing is the only way to do anything outside the current sytem process, which is what you need for running CPU-heavy tasks in parallel. It’s out our use case, and while it is more “true parallel” it comes with slower startup, more expensive communication costs and also some constraints for such exchange (e.g. to serializable data).
  3. We are IO-bound, i.e. we have no idea when the next network package arrives, so we want to wait in separate event loops that would be blocking on their own (i.e. “while True”, similar to what tkinter does with our main thread.).
  4. Because we have the main thread stolen by tkinter anyway, we would use a threading.Thread anyway to allow concurrency, but then we face the choice to construct everything with Threads or to employ the newer asyncio features.

The basic block to do that would use a threading.Thread:

class ListenerThread(threading.Thread):

    # initialize somehow

    def run(self):
        while True:
            wait_for_input()
            process_input()
            communicate_with_ui_thread()


# usage like
#   listener = ListenerThread(...) 
#   listener.start(...)

Now to think naively, the pseudo-code communicate_with_ui_thread() could then either lead to the tkinter .after() calls, or employ callback functions passed either in the initialization or the .start() call of the thread. But sadly, it was not as easy as that, because you run several risks of

  • just masking your intention and still executing your callbacks blockingly (that thread can freeze the UI)
  • still pass UI widget references to your background loop (throwing bad not-the-main-thread-error exceptions)
  • have memory leaks in the background gobble up more than you ever wanted
  • Lock starvation: bad error telling you RuntimeError: can't allocate lock
  • deadlocks, by interdependent callbacks

This list is surely not exhaustive.

So what are the options? Let me discuss these in an upcoming post. For now, let’s just linger all these complications in your brain for a while.

Trait-queries for my C++ dependency injection container

This posts builds upon my previous posts on my C++ dependency-injection container: Automated instance construction in C++, Improved automated instance construction in C++ and Even better automated instance construction in C++. I was actually quite happy with the version from the last post and didn’t really touch the implementation for a good long while. But lately, I identified a few related requirements that could be solved elegantly by an extension to the container, so I decided to give it a go.

New Requirements

  1. Sometimes I need to get services from the DI just to create them. They would then register themselves with an event bus or some other system. I would not really call into them actively, and therefore I did not need access to the instances created. This could previously be done via something like (void)provider.get<my_autonomous_system>(), after all the services were registered. That works, but doesn’t scale up very well once you have a few of those. It would be much better to have something like provider.instantiate_all_autonomous_systems().
  2. Some groups of systems I would instantiate and keep around just to call them in a totally homogeneous way, like system_one.update(), system_two.update(), etc.. Again it would be better to not require the concrete types at the call site and instead just get the requested systems and call their update() in a loop.

Query Interface

It turns out that both requirements can be solved by requesting instances for “a group” of registered services. In the case of the first requirement, that’s actually all that is needed, but for the second requirement, the instances also need to be processed in some way, e.g. upcasting or other forms of type-erasure. Here’s how I wanted it to look:

di di;
di.insert_unique<actual_update_thing_one>().trait<update_trait>();
di.insert_unique<actual_update_thing_two>().trait<update_trait>();

auto updaters = di.query_trait<update_trait>();
for (auto const& each : updaters)
  each->update();

After registration with the DI, types can be marked with one or many traits, which can later be queried. For this example, the trait looks like this:

struct update_trait
{
  using type = update_service*;
  static update_service* type_erase(update_service* x)
  {
    return x;
  }
};

It really just does an upcast to update_service, which is derived-from by both of the types. But it would be equally possible to use std::function<> in case the types are only compatible via duck-typing:

struct update_trait
{
  using type = std::function<void()>;
  template <class T> static std::function<void()> type_erase(T* x)
  {
    return [x]
    {
      x->update();
    };
  }
};

Of course, that changes the final loop in the example to:

for (auto const& each : updaters)
  each();

So a traits type needs to contain a type-alias for the target type and a function to process the instance pointer into that target type, be it by wrapping it in some sort of adaptor or via upcasting. They type is separate, and not the return type of the function, because it has to be independent of the instance type that goes in, while the function can be a template and thus have different return (which is fine if they all convert to the target type).

Implementation

When you add a trait for a type T via the .trait<Trait>() template, I register a what I call a ‘resolver’, which is just a std::function<typename Trait::type()> that invokes Trait::type_erase(get_ptr<T>()). These are all put into a std::vector<>:

template <typename Trait> using trait_resolvers =
  std::vector<std::function<typename Trait::type()>>;

For all the traits, these are stored in an std::unordered_map<std::type_index, std::any> where the key is typeid(Trait).

On query_trait<Trait>, I look into that map, get the trait_resolvers<Trait> out of it, and call each resolver to fill a new std::vector<typename Trait::type>, which is then returned and can be iterated by the user.

This implementation maps better to to the second use-case, but the first can be done with bogus type_erase function in the trait like this:

struct auto_create_trait
{
  using type = int;
  template <class T>
  static int type_erase(T* x)
  {
    return 0;
  }
};

This creates an std::vector<int> that isn’t needed, which is not ideal but not a deal-breaker either. On the other hand, it is not too hard to properly support void as the type with just two if constexpr (std::is_same_v<typename Traits::type, void>), one in the resolver lambda that omits the type_erase call and one in query_trait that omits storing the resolver result. This way, I can also use [[nodiscard]] on query_trait, and the trait can be written as just struct auto_create_trait { using type = void; };.

Dependent Subqueries and LATERAL in PostgreSQL

When working with databases, you often need to run a query that depends on results from another query. PostgreSQL offers two main ways to handle this: Dependent Subqueries and LATERAL joins.

A dependent subquery is like a mini-query inside another query. The inner query (subquery) depends on the outer query for its input. For every row in the outer query, the subquery runs separately.

Imagine you have two tables: customers, which holds information about customers (e.g., their id and name), and orders, which holds information about orders (e.g., customer_id, order_date, and amount). Now, you want to find the latest order date for each customer. You can use a dependent subquery like this:

SELECT id AS customer_id,
  name,
  (SELECT order_date
     FROM orders
    WHERE customer_id=customers.id
    ORDER BY order_date DESC
    LIMIT 1) AS latest_order_date
FROM customers;

For each customer in the customers table, the subquery looks for their orders in the orders table. The subquery sorts the orders by date (ORDER BY order_date DESC) and picks the most recent one (LIMIT 1).

This works, but it has a drawback: If you have many customers, this approach can be slow because the subquery runs once for every customer.

LATERAL join

A LATERAL join is a smarter way to solve the same problem. It allows you to write a subquery that depends on the outer query, but instead of repeating the subquery for every row, PostgreSQL handles it more efficiently.

Let’s solve the “latest order date” problem using LATERAL:

SELECT
  c.id AS customer_id,
  c.name,
  o.order_date AS latest_order_date
FROM customers c
LEFT JOIN LATERAL (
  SELECT order_date
   FROM orders
  WHERE orders.customer_id=c.id
  ORDER BY order_date DESC
  LIMIT 1
) o ON TRUE;

For each customer (c.id), the LATERAL subquery finds the latest order in the orders table. The LEFT JOIN ensures that customers with no orders are included in the results, with NULL for the latest_order_date.

It’s easier to read and often faster, especially for large datasets, because PostgreSQL optimizes it better.

Both dependent subqueries and LATERAL joins allow you to handle scenarios where one query depends on the results of another. Dependent subqueries are straightforward and good enough for simple tasks with small datasets. However, you should consider using LATERAL for large datasets where performance matters.