Data minimization

The European General Data Protection Regulation (GDPR) often refers to the principle of data minimization. But what does this mean, and how can it be applied in practice?

The EU website states the following: “The principle of “data minimisation” means that a data controller should limit the collection of personal information to what is directly relevant and necessary to accomplish a specified purpose. They should also retain the data only for as long as is necessary to fulfil that purpose. In other words, data controllers should collect only the personal data they really need, and should keep it only for as long as they need it.”

But what is necessary? In the following, I will give you an example of how I was able to implement data minimization in a new project.

Verification of training certificates

One of my customers has to train his employees so that they can use certain devices. The training certificates are printed out and kept. So if an employee wants to use the device, the employee has to look for the corresponding certificate at reception, and it cannot be verified whether the printout was really generated by the training system.

Now they wanted us to build an admin area in which the certificates are displayed so that reception can search for it in this area.

The first name, surname, birthday etc. of the employees should be saved with the certificate for a long time.

What does the GDPR say about this?
I need the data so that reception can see and check everything. It is necessary! Or is it not?

Let’s start with the purpose of the request. The customer wants to verify whether an employee has the certificate.

If I want to display a list of all employees in plain text, I need the data. But is this the only solution to achieve the purpose, or can I perhaps find a more data-minimizing option? The purpose says nothing about displaying the names in plain text. A hash comparison, for example, is sufficient for verification, as is usually done with passwords.

So, how do I implement the project? Whenever I issue a certificate, I hash the personal data. In the administration area, there is a dialog in which the recipient can enter the personal data to check whether a valid certificate is available. The data is also hashed, and a hash comparison is used to determine the match.
Instead of an application that juggles a large amount of personal data, my application no longer stores any personal data at all. And the purpose can still be achieved.

Conclusion

Data minimization is therefore not only to be considered in the direct implementation of a function. Data minimization starts with the design.

Web Components – Reusable HTML without any framework magic, Part 1

Lately, I decided to do the frontend for a very small web application while learning something new, and, for a while, tried doing everything without any framework at all.

This worked only for so long (not very), but along the way, I found some joy in figuring out sensible workflows without the well-worn standards that React, Svelte and the likes give you. See the last paragraph for a quick comment about some judgement.

Now many anything-web-dev-related people might have heard of Web Components, with their custom HTML elements that are mostly supported in the popular browsers.

Has anyone used them, though? I personally haven’t had, and now I did. My use case was pretty easy – I wanted several icons, and wanted to be able to style them in a unified fashion.

It shouldn’t be too ugly, so why not take something like Font Awesome or heroicons, these give you pure SVG elements but now I have the Font Awesmoe “Magic Sparkles Wand” like

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 576 512"><!--!Font Awesome Free 6.5.1 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free Copyright 2024 Fonticons, Inc.--><path d="M234.7 42.7L197 56.8c-3 1.1-5 4-5 7.2s2 6.1 5 7.2l37.7 14.1L248.8 123c1.1 3 4 5 7.2 5s6.1-2 7.2-5l14.1-37.7L315 71.2c3-1.1 5-4 5-7.2s-2-6.1-5-7.2L277.3 42.7 263.2 5c-1.1-3-4-5-7.2-5s-6.1 2-7.2 5L234.7 42.7zM46.1 395.4c-18.7 18.7-18.7 49.1 0 67.9l34.6 34.6c18.7 18.7 49.1 18.7 67.9 0L529.9 116.5c18.7-18.7 18.7-49.1 0-67.9L495.3 14.1c-18.7-18.7-49.1-18.7-67.9 0L46.1 395.4zM484.6 82.6l-105 105-23.3-23.3 105-105 23.3 23.3zM7.5 117.2C3 118.9 0 123.2 0 128s3 9.1 7.5 10.8L64 160l21.2 56.5c1.7 4.5 6 7.5 10.8 7.5s9.1-3 10.8-7.5L128 160l56.5-21.2c4.5-1.7 7.5-6 7.5-10.8s-3-9.1-7.5-10.8L128 96 106.8 39.5C105.1 35 100.8 32 96 32s-9.1 3-10.8 7.5L64 96 7.5 117.2zm352 256c-4.5 1.7-7.5 6-7.5 10.8s3 9.1 7.5 10.8L416 416l21.2 56.5c1.7 4.5 6 7.5 10.8 7.5s9.1-3 10.8-7.5L480 416l56.5-21.2c4.5-1.7 7.5-6 7.5-10.8s-3-9.1-7.5-10.8L480 352l-21.2-56.5c-1.7-4.5-6-7.5-10.8-7.5s-9.1 3-10.8 7.5L416 352l-56.5 21.2z"/></svg>

Say I want to have multiple of these and I want them to have different sizes. And I have no framework for that. I might, of course, write JavaScript functions that create a SVG element, equip it with the right attributes and children, and use that throughout my code, like

// HTML part
<div class="magic-sparkles-container">
</div>

// JS part
for (const element of [...document.getElementsByClassName("magic-sparkles-container")]) {
    elements.innerHTML = createMagicSparkelsWand({size: 24});
}

// note that you need the array destructuring [...] to convert the HTMLCollection to an Array

// also note that the JS part would need to be global, and to be executed each time a "magic-sparkles-container" gets constructed again

But one of the main advantages of React’s JSX is that it can give you a smooth look on your components, especially when the components have quite speaking names. And what I ended up to have is way smoother to read in the HTML itself

// HTML part
<magic-sparkles></magic-sparkles>
<magic-sparkles size="64"></magic-sparkles>

// global JS part (somewhere top-level)
customElements.define("magic-sparkles", MagicSparklesIcon);

// JS class definition
class MagicSparklesIcon extends HTMLElement {
    connectedCallback() {
        // take "size" attribute with default 24px
        const size = this.getAttribute("size") || 24;
        const path = `<path d="M234.7..."/>`;
        this.innerHTML = `
            <svg
                xmlns="http://www.w3.org/2000/svg"
                viewBox="0 0 576 512"
                width="${size}"
                height="${size}"
            >
                ${path}
            </svg>
        `;
    }
}

The customElements.define needs to be defined very top-level, once, and this thing can be improved, e.g. by using shadowRoot() and by implementing the attributesChangedCallback etc. but this is good enough for a start. I will return to some refinements in upcoming blog posts, but if you’re interested in details, just go ahead and ask 🙂

I figured out that there are some attribute names that cause problems, that I haven’t really found documented much yet. Don’t call your attributes “value”, for example, this gave me one hard to solve conflict.

But other than that, this gave my No-Framework-Application quite a good start with readable code for re-usable icons.

To be continued…

In the end – should you actually go framework-less?

In short, I wouldn’t ever do it for any customer project. The above was a hobby experience, just to have went down that road for once, but it feels there’s not much to gain in avoiding all frameworks.

I didn’t even have hard constraints like “performance”, “bundle size” or “memory usage”, but even if I did, there’s post like Framework Overhead is bikeshedding that might question the notion that something like React is inherently slower.

And you pay for such “lightweightness” dearly, in terms of readability, understandibility, code duplication, trouble with code separation, type checks, violation of Single-Level-of-Abstraction; not to mention that you make it harder for your IDE to actually help you.

Don’t reinvent the wheel more often than necessary. Not for a customer that just wants his/her product soon.

But it can be fun, for a while.

My Favorite Pattern

It has become somewhat of an internal meme that I do not like it when programmers use the word “wrapper”. When someone does say it, I usually get a cue from one of the others to start complaining about it. Do not get me wrong, though. I am very much in favor of wrapping things, but with purpose. And my favorite one is the façade.

When simple becomes complex

Many times, APIs start out simple and elegant. This usually works for a while and the API gets used a lot precisely because of its beauty and simplicity. But eventually, a new use case comes along that demands more of the API than it can currently serve. It has to be extended. This usually takes the form of an additional method or function parameter, or an additional function that needs to be called. Using the API now becomes more complex all its users.

Do not underestimate this effect. I have only anecdotal evidence, but in my experience, a lot of unnecessary software complexity can be attributed to this1. The Pareto-Principle applies here: A single use case causes all the users of the previously simple API to deal with new complexity (e.g. 10% of the use cases cause 90% of the complexity in the user-/call-sites).

Façades make it look beautiful

Luckily, it can be dealt with beautifully: using the façade pattern. This pattern abstracts a complex API behind a simple API. The trade-off, of course, is that it is less powerful than the “full API”. In our example though, all of the previous use-cases can keep using the simple API via a façade.

When to apply this

The aforementioned example, extending an API, is a very nice opportunity to apply the façade. Just keep the interface of the old API around, and re-implement it using the new, extended API, which is usually created by modifying the old API’s implementation. Now all the old call-sites can stay the same, yet you can have a more powerful API for those rare cases that need it.

Of course, you can also identify common usage patterns and refactor them using a façade, but that’s usually much harder to do.

What exactly are façades made of?

Façades do not hide the more complex API in the sense that the APIs users are not allowed to use it. Yes, façades make APIs look beautiful, but that is where the metaphor ends. You can still access what is behind the façade. You can even write more façades for the behind. Many APIs have multiple common cases and only very few complex ones.

So… Classes? Functions? Data? Any of those, in fact. Whenever you enable writing something in a simpler way for a common case, you have a façade . Very often, a small function with a simple signature is all the façade you need.

But it makes all the difference.

Now can someone please tell me what that little hook under the c is called?

  1. Façades can, of course, also contribute to creating complexity by growing the codebase and creating ‘variants’. But they rarely do. ↩︎

SQL Database Window Functions

Window functions allow users to perform calculations across a set of rows that are somehow related to the current row. This can include calculations like running totals, moving averages, and ranking without the need to group the entire query into one aggregate result.

Despite their flexibility, window functions are sometimes underutilised, either because users are unaware of them or because they’re considered too complex for everyday tasks. Learning how to effectively use window functions can improve the efficiency and readability of SQL queries, particularly for reporting and data analysis purposes. This article will explore several use cases.

Numbering Rows

The simplest application area for window functions is the numbering of rows. The ROW_NUMBER() function assigns a unique number to each row within the partition of a result set. The numbering is sequential and starts at 1. It’s useful for creating a unique identifier for rows within a partition, even when the rows are identical in terms of data.

Consider the following database table of library checkouts:

bookcheckout_datemember_id
The Great Adventure2024-02-15102
The Great Adventure2024-01-10105
Mystery of the Seas2024-01-20103
Mystery of the Seas2024-03-01101
Journey Through Time2024-02-01104
Journey Through Time2024-02-18102

We want to assign a unique row number to each checkout instance for every book, ordered by the checkout date to analyze the circulation trend:

SELECT
 book,
checkout_date,
member_id,
ROW_NUMBER() OVER (PARTITION BY book ORDER BY checkout_date) AS checkout_order
FROM library_checkouts;

The result:

bookcheckout_datemember_idcheckout_order
The Great Adventure2024-01-101051
The Great Adventure2024-02-151022
Mystery of the Seas2024-01-201031
Mystery of the Seas2024-03-011012
Journey Through Time2024-02-011041
Journey Through Time2024-02-181022

Ranking

In the context of SQL and specifically regarding window functions, “ranking” refers to the process of assigning a unique position or rank to each row within a partition of a result set based on a specified ordering.

The RANK() function provides a ranking for each row within a partition, with gaps in the ranking sequence when there are ties. It’s useful for ranking items that have the same value.

Consider the following database table of scores in a game tournament:

playergamescore
AliceSpace Invaders4200
BobSpace Invaders5700
CharlieSpace Invaders5700
DanaDonkey Kong6000
EveDonkey Kong4800
FrankDonkey Kong6000
AliceAsteroids8500
BobAsteroids9300
CharlieAsteroids7600

We want to rank the players within each game based on their score, with gaps in rank for ties:

SELECT
 player,
 game,
score,
RANK() OVER (PARTITION BY game ORDER BY score DESC) AS rank
FROM scores;

The result looks like this:

playergamescorerank
BobSpace Invaders57001
CharlieSpace Invaders57001
AliceSpace Invaders42003
DanaDonkey Kong60001
FrankDonkey Kong60001
EveDonkey Kong48003
BobAsteroids93001
AliceAsteroids85002
CharlieAsteroids76003

If you don’t want to have gaps in the ranking sequence when there are ties, you can substitute DENSE_RANK() for RANK().

Cumulative Sum

The SUM() function can be used as a window function to calculate the cumulative sum of a column over a partition of rows.

Example: We are tracking our garden’s vegetable harvest in a database table, and we want to calculate the cumulative yield for each type of vegetable over the harvesting season.

vegetableharvest_dateyield_kg
Carrots2024-06-1810
Carrots2024-07-1015
Tomatos2024-06-1520
Tomatos2024-07-0130
Tomatos2024-07-2025
Zucchini2024-06-2015
Zucchini2024-07-0520

We calculate the running total (cumulative yield) for each vegetable type as the season progresses, using the SUM() function:

SELECT
 vegetable,
harvest_date,
yield_kg,
SUM(yield_kg) OVER (PARTITION BY vegetable ORDER BY harvest_date ASC) AS cumulative_yield
FROM garden_harvest;

Now we can see which vegetables are most productive and how yield accumulates throughout the season:

vegetableharvest_dateyield_kgcumulative_yield
Carrots2024-06-181010
Carrots2024-07-101525
Tomatos2024-06-152020
Tomatos2024-07-013050
Tomatos2024-07-202575
Zucchini2024-06-201515
Zucchini2024-07-052035

Serving static files from a JAR with with Jetty

Web applications are often deployed into some kind of web server or web container like Wildfly, Tomcat, Apache or IIS. For single-instance services this can be overkill and serving requests can easily be done in-process without interfacing with some external web container or web server.

A proven and popular framework for an in-process web container and webserver is Eclipse Jetty for Java. For easy deployment and distribution you can build a single application archive containing everything: the static resources, web request handling, all your code and dependencies needed to run your application.

If you package your application that way there is one caveat when trying to serve static resources from this application archive. Let us have a look how to do it with Jetty.

Using the DefaultServlet for static resources on the file system

Jetty comes with a Servlet-Implementation for serving static resources out-of-the-box. It is called DefaultServlet and can be used like below:

Server server = new Server();
ServerConnector connector = new ServerConnector(server);
connector.setHost(listenAddress);
connector.setPort(listenPort);
server.addConnector(connector);
var context = new ServletContextHandler();
context.setContextPath("/");
var defaultServletHolder = context.addServlet(DefaultServlet.class, "/static/*");
defaultServletHolder.setInitParameter("resourceBase", "/var/www/static");
server.setHandler(context);
server.start();

This works great in the case of static resources available somewhere on the filesystem.

But how do we specify where the DefaultServlet can find the resources within our application archive?

Using the DefaultServlet for static in-JAR resources

The only thing that we need to change is the init-parameter called resourceBase from a normal file path to a path in the JAR. What does a path to the files inside a JAR look like and how do we construct it? It took me a while to figure it out, but here is what I came up with and it works perfectly in my use cases:

private String getResourceBase() throws MalformedURLException {
	URL resourceFile = getClass().getClassLoader().getResource("static/existing-file-inside-jar.txt");
    return new URL(Objects.requireNonNull(resourceFile).getProtocol(), resourceFile.getHost(), resourceFile.getPath()
        .substring(0, resourceFile.getPath().lastIndexOf("/")))
        .toString();
}

The method results in a string like jar:file:/path/to/application.jar!/static. Using such a path as the resourceBase for the DefaultServlet allows you to serve all the stuff from the /static directory inside your application (or any other) jar containing the class this method resides in.

A few notes on the code

Why don’t we just hardcode a path after the jar:-protocol? Well, the file path may chance in several scenarios:

  • Running the application on a different platform or operating system
  • Renaming the application archive – it could contain the version number for example…
  • Installing or copying the application archive to a different location on the file system

Using an existing resource and reusing the relevant parts of the URL-specification for the resource base directory solves all these issues because it is computed at runtime.

However, the code assumes that there is at least one resource available in the JAR and that its path is known at compile time.

In newer JDKs like 21 LTS constructing an URL using the constructor is deprecated but I did not bother to rewrite the code to use URI because of time constraints. That is left up to you or a future release…

As always I hope someone finds the code useful and drops a comment.