How to migrate a create-react-app project to vite

It seems that the React community is finally accepting that its old way of scaffolding new projects, create-react-app (CRA for short), has outlived its usefulness. While there is no official statement about that, there has been no update on npm in about a year, which in the JS universe screams “TOXIC WASTE” in very clear words, and meanwhile it has also vanished from the official “Start a new React Project” docs.

In search of alternatives, one can do some quick Google searches, and at the moment I’m giving Vite a chance. It has not disappointed me yet, on the contrary:

  • the build definitely feels faster (as the French would say: plus vite), but I never quantified it
  • those over 9000 deprecation warnings one was accustomed to when using CRA – gone TO ZERO
  • and the biggest point, no dependency on webpack. Webpack has this weird habit of introducing brutally breaking changes between versions, and then you have to polyfill Node.js core modules or whatever floats their boat, giving users no choice – i.e. making it highly TOXIC in itself

But still, the react-scripts which CRA employs have played quite a role in development, as they also provided the “npm start” development server and served as a test runner – so generally, if you have developed your project over some years, you might have relied on them quite a bit, and now you don’t want to recreate everything from scratch.

I recently migrated one of our projects and this is what worked for me. There were three main concerns:

  • switch the general infrastructure to vite, so we can develop and build again
  • introduce vitest as a test runner
  • migrate Redux store tests specifically

Today, let’s focus on the infrastructure part without the tests; I will come back to the tests next time.

Migrate to vite INFRASTRUCTURE

This was actually surprisingly concise, I just had to

npm install -D vite @vitejs/plugin-react
npm uninstall react-scripts

(when in doubt, remove the node_modules folder and run npm install again, but I didn’t have to), then I adjusted package.json to:

  "scripts": {
    "start": "vite",
    "build": "vite build"
  },

You might prefer to call your dev server via “npm run dev” instead of “npm start”, in that case just replace the "start": "vite" with "dev": "vite" above.

The Vite templates prefer to include a script "preview": "vite preview" but I do not use it, so I didn’t copy that.

It also was required to set this package.json entry:

  // somewhere top-level, i.e. next to "version" or somewhere like that
  "type": "module",

(I’m not entirely sure whether we can now safely remove the “browserslist” or “babel” entries from the package.json because they might be useless now, but I will have to think about that some other time.)

Now, some real code changes. One of the larger todos here might be to make sure that every JSX-containing source file ends with .jsx – there have been discussions about this and beforehand, it was still possible to just place your <App/> etc. inside an App.js, but vite does not like that anymore, so this is a thing you have to do.

So the code changes amount to:

  • Rename every .js file which has some JSX in it to .jsx – pro tip: do it via the IDE so you do not have to fix every import / require statement manually!
  • Move the template in ./public/index.html directly to ./index.html and in there, replace every occurrence of %PUBLIC_URL%/ with a single slash /, so that e.g. %PUBLIC_URL%/favicon.ico becomes /favicon.ico
  • In the index.html <body>, include your index.jsx e.g. like:
  <body>
    <noscript>You need to enable JavaScript to run this app.</noscript>
    <div id="root"></div>
    <script type="module" src="/src/index.jsx"></script>
  </body>
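If your template contains many %PUBLIC_URL% references, the replacement can also be scripted. The following is only a minimal sketch, assuming the placeholder always sits directly in front of a path that already starts with a slash (as in %PUBLIC_URL%/favicon.ico). The helper name rewritePublicUrl is made up for this illustration and is not part of Vite or CRA.

```javascript
// Hypothetical helper (not part of Vite or CRA): removes the CRA placeholder
// so that "%PUBLIC_URL%/favicon.ico" becomes "/favicon.ico".
function rewritePublicUrl(html) {
  // Dropping the placeholder keeps the slash that already follows it,
  // which avoids ending up with a double slash.
  return html.replace(/%PUBLIC_URL%/g, "");
}

const before = '<link rel="icon" href="%PUBLIC_URL%/favicon.ico" />';
console.log(rewritePublicUrl(before));
// prints: <link rel="icon" href="/favicon.ico" />
```

For a single index.html, a search-and-replace in the editor does the same job, of course.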

It might be said that the vite templates like to call their index file “main.jsx”, but it’s not important – just match whatever you put inside the <script src="..."/>.

Now in order not to change your habits too much, i.e. keep your CI build as it is, plus maybe some Docker Dev Containers or even browser bookmarks, you can use this vite.config.js – see docs:

import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  server: {
    port: 3000,
    host: true
  },
  build: {
    outDir: './build'
  },
});

Otherwise, Vite prefers to run its dev server on port 5173 (guess it’s Leetspeak) and to build into ./dist – just so you know.

Addon: Using ReactComponents from SVGs with Vite. Also with refs.

Since this morning, when I wrote this article, I have already learned something new. In another project we were importing SVG files via the approach

import {ReactComponent as Bla} from "./bla.svg";

const ExampleUsage = () => {
  return <Bla />;
};

Doing so now results in

Uncaught SyntaxError: ambiguous indirect export: ReactComponent

This can be solved by npm install vite-plugin-svgr and then updating vite.config.js:

import {defineConfig} from "vite";
import react from "@vitejs/plugin-react";
import svgr from "vite-plugin-svgr";

export default defineConfig({
    plugins: [
        svgr({
            svgrOptions: {
                ref: true,
            },
        }),
        react(),
    ],
    server: {
        port: 3000,
        host: true,
    },
    build: {
        outDir: "./build",
    },
});

The { svgrOptions: {ref: true} } was a specific requirement for our use case. It is necessary if you ever want to access the imported ReactComponent’s ref, i.e. in our ExampleUsage we needed to write <Bla ref={...}/>. Leaving the svgrOptions ref at false (its default) gives us the error:

Warning: Function components cannot be given refs. Attempts to access this ref will fail. Did you mean to use React.forwardRef()?

Then, Make the tests work again

As mentioned above, these were a bit trickier, and while I found a way to leave most tests untouched, there was some specific tweaking to be done with Redux store tests, and also with mocking a foreign class (GraphQLClient from “graphql-request” in my case).

But as also mentioned above, I guess this might be a topic for my next blog post. In case you urgently need that knowledge, drop us a mail or something.. 🙂

Have we made things too easy?

One of the old mantras for API design is “Make doing the right thing easy and the wrong thing hard”. This, of course, applies to much broader topics as well, such as software development or UX.

For software development specifically, are we maybe making “doing the wrong thing” too easy as well? Here are two examples:

Web Requests

In the old times, requesting data from a web server required first setting up the request, sending it, and then getting the result back to your application either via polling or callbacks. Dave Mark once aptly called this solving the “waiting problem”. It was cumbersome, to say the least. It was clear that making such a request was something to be avoided. You did it when you had to, but you implicitly avoided setting up too many different kinds of requests.

Nowadays, with the advent of anonymous functions/lambdas in most mainstream programming languages, continuations became the new way to handle these things: do_request(...).then(result -> ...). This already made things a lot easier. And even better, now we have some form of coroutines in many languages where you can just do result = await do_request(...). It even looks almost like a normal function call.

With this, programmers can just do requests one after the other. Need one thing from a server? Do one request. Need ten things from a server? Do ten requests. Of course, this is horribly wasteful: each request will incur the full overhead of http/https and a server roundtrip. In the old times, doing the request was painful, so you automatically looked for ways to avoid doing more, and bundle your asks into one request, arguably leading to a better program.
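To make the difference tangible, here is a small sketch in JavaScript. It does no real HTTP: the “server” is a made-up in-memory object that merely counts roundtrips, and all names (createFakeServer, fetchMany, loadOneByOne, loadBundled) are hypothetical. It only illustrates how innocent the wasteful version looks.

```javascript
// Fake in-memory "server" for illustration only: it counts roundtrips
// instead of doing real HTTP. All names here are hypothetical.
function createFakeServer(data) {
  let roundtrips = 0;
  return {
    // One roundtrip per call, no matter how many keys are requested.
    async fetchMany(keys) {
      roundtrips += 1;
      return keys.map((key) => data[key]);
    },
    get roundtrips() {
      return roundtrips;
    },
  };
}

// The "too easy" way: one awaited request per item.
async function loadOneByOne(server, keys) {
  const results = [];
  for (const key of keys) {
    const [value] = await server.fetchMany([key]);
    results.push(value);
  }
  return results;
}

// The bundled way: ask for everything in a single request.
async function loadBundled(server, keys) {
  return server.fetchMany(keys);
}

async function demo() {
  const keys = ["a", "b", "c"];
  const slow = createFakeServer({ a: 1, b: 2, c: 3 });
  await loadOneByOne(slow, keys);
  const fast = createFakeServer({ a: 1, b: 2, c: 3 });
  await loadBundled(fast, keys);
  return { oneByOne: slow.roundtrips, bundled: fast.roundtrips };
}

demo().then((counts) => console.log(counts)); // { oneByOne: 3, bundled: 1 }
```

Both functions return the same values, and in a review the loop looks perfectly harmless. Only the roundtrip counter shows the cost.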

Dependencies

Before nice package managers were a thing, handling dependencies was a huge pain. You would have to manually get, unpack, configure and install the dependency for each developer and/or consumer system. As a consequence, libraries were big and often duplicated foundational things. But it also caused developers to carefully groom their library selections.

Now with package managers, libraries have started to become small. Duplication within libraries certainly seems to have decreased, and the average library size has decreased. But this also caused developers to be much less cautious when adopting a dependency, with package managers handling thousands of dependencies that no one developer can possibly have a full understanding of. And this then leads to things like the leftpad disaster.

Better or worse?

I am pretty sure that both having nice abstractions to deal with asynchronicity and package managers are good things. But if they make certain things too easy, how can we deal with that? The only thing I can currently think of is figuratively sticking warning-labels on these things during review time, but because those things are now so easy and subtle, it is also easy to miss them.

Are there other examples where we maybe made the wrong thing too easy? Do you have any ideas how to deal with this problem?

Materialized views in Oracle

Most relational database management systems (RDBMSs) support not only views, but also materialized views. Materialized views and normal views are both database objects used to present data to users, but they work in different ways.

A database view is a virtual table or a named query that presents data from one or more tables in a specific way. Views are not physical tables and do not store data directly, but instead retrieve data from the underlying tables based on a specified query.

Materialized views are similar to normal views, but they store pre-computed results. When a materialized view is created, the results of the underlying query are computed and stored in the database. One advantage of materialized views is faster query performance by avoiding the need to compute the same results repeatedly. This is especially useful for complex and time-consuming queries, as the results can be stored and accessed quickly.

The syntax for creating a normal view in an Oracle database is as follows:

CREATE VIEW view_name AS SELECT … FROM … WHERE …;

To create a materialized view instead of a normal view you add the MATERIALIZED keyword:

CREATE MATERIALIZED VIEW view_name AS SELECT … FROM … WHERE …;

When creating a materialized view you should think about and decide on three aspects of materialized views:

  1. the refresh method,
  2. the refresh interval,
  3. and the storage properties

The refresh method

The refresh method determines how the data in the materialized view is updated or refreshed to reflect changes in the base tables.

  • COMPLETE: This one completely rebuilds the materialized view from scratch. It drops the existing contents of the materialized view and then re-executes the query to populate it with fresh data. This method can be resource-intensive and slow, especially for large materialized views.
  • FAST: Updates only the rows in the materialized view that have changed since the last refresh. It uses the materialized view logs on the base tables to identify the changed rows and then applies the changes to the materialized view. It can be much faster than a complete refresh, especially if there are only a few changes to the data.
  • FORCE: Tries to perform a fast refresh if possible, but falls back to a complete refresh if necessary. This method is useful if you want to try to perform a fast refresh, but you’re not sure if it will be possible due to the nature of the data or the query.

You can specify the refresh method when creating the materialized view using the REFRESH keyword:

CREATE MATERIALIZED VIEW view_name
  REFRESH FAST
  AS SELECT ...;

If you do not specify a refresh method, FORCE is the default.

The refresh interval

The refresh interval controls how often the materialized view is automatically refreshed. It determines how frequently the materialized view is updated to reflect changes in the underlying data.

Some refresh interval options in Oracle are:

  • ON COMMIT: The materialized view is refreshed automatically every time a transaction that modifies the underlying data is committed. This interval is useful when you need to keep the materialized view up-to-date in near real-time.
  • ON DEMAND: The materialized view is refreshed only when you explicitly request a refresh using the DBMS_MVIEW.REFRESH procedure.
  • START WITH … NEXT: With this interval, the materialized view is refreshed automatically at regular intervals. It is useful when you want to balance the need for up-to-date data with the resources required to refresh the materialized view.

You can specify the refresh interval when creating the materialized view by adding it to the REFRESH clause when creating the view:

CREATE MATERIALIZED VIEW view_name
  REFRESH FAST ON COMMIT
  AS SELECT ...;

The following materialized view gets refreshed every hour (in Oracle date arithmetic, 1 stands for one day, so SYSDATE + 1/24 is one hour from now):

CREATE MATERIALIZED VIEW view_name
  REFRESH FAST START WITH SYSDATE NEXT SYSDATE + 1/24
  AS SELECT ...;

Storage properties

Storage properties affect how the data in the materialized view is stored and accessed. In Oracle, some of these are:

  • CACHE: Marks the blocks of the materialized view as preferred candidates for staying in the database buffer cache, which is a portion of memory used to cache frequently accessed data. It can improve query performance by reducing disk I/O, but the cached data competes with everything else for that memory.
  • LOGGING: Changes to the materialized view data are logged in the database redo logs. This property ensures that changes to the materialized view can be recovered in case of a system failure but can result in additional overhead.
  • TABLESPACE: Allows you to specify the tablespace where the materialized view data is stored.

Again, you can specify these properties when creating the materialized view:

CREATE MATERIALIZED VIEW view_name
  CACHE
  LOGGING
  TABLESPACE tablespace_name
AS SELECT ... FROM ... WHERE ...;

Now you know the basics for creating materialized views in an Oracle database when needed. There is still more to learn about them. You can find the full reference here.

Grails Domain update optimisation

As many readers may know, we have been developing and maintaining some Grails applications for more than 10 years now. One of the main selling points of Grails is its domain model and object-relational-mapper (ORM) called GORM.

In general ORMs are useful for easy and convenient development at the cost of a bit of performance and flexibility. One of the best features of GORM is the availability of several flexible APIs for use-cases where dynamic finders are not enough. Let us look at a real-world example.

The performance problem

In one part of our application we have personal messages that are marked as read after viewing. For some users there can be quite a lot of messages, so we implemented a “mark all as read”-feature. The naive implementation looks like this:

def markAllAsRead() {
    def user = securityService.loggedInUser
    // dynamic finder: loads every unread message of the user as a domain object
    def messages = Messages.findAllByAddresseeAndRead(user, false)
    messages.each { message ->
        message.read = true
        message.save()
    }
    Messages.withSession { session -> session.flush() }
}

While this is both correct and simple, it only works well for a limited number of messages per user. Performance will degrade because all the matching rows are loaded into domain objects, then modified and saved one by one to the session. Finally the session is persisted to the database. In our use case this could take several seconds, which is much too long for a good user experience.

DetachedCriteria to the rescue

GORM offers a much better solution for such use-cases that does not sacrifice expressiveness. Instead it offers a succinct API called “Where Queries” that creates DetachedCriteria and offers batch-updates.

def markAllAsRead() {
    def user = securityService.loggedInUser
    def messages = Messages.where {
        read == false
        addressee == user
    }
    messages.updateAll(read: true)
}

This implementation takes only a few milliseconds to execute with the same dataset as above which is de facto native SQL performance.

Conclusion

Before cursing GORM for bad performance, you should have a deeper look at the alternative querying APIs like Where Queries, Criteria, DetachedCriteria, SQL Projections and Restrictions to enhance your ORM toolbox. Compared to dynamic finders and GORM-methods on domain objects they offer better composability and performance without resorting to HQL or plain SQL.

The Optional Wildcast

This blog post presents a particular programming technique that I happen to use more often in recent months. It doesn’t state that this technique is superior or more feasible than others. It’s just a story about a different solution to an old programming problem.

Let’s program a class hierarchy for animals, in particular for mammals and birds. You probably know where this leads up to, but let’s start with a common solution.

Both mammals and birds behave like animals, so they are subclasses of it. Birds have the additional behaviour of laying eggs for reproduction. We indicate this feature by implementing the Egglaying interface.

Mammals feed their offspring by giving them milk. There are two mammals in our system, a cow and the platypus. The cow behaves like the typical mammal and gives a lot of milk. The platypus also feeds its young with milk, but only after they have hatched from their egg. Yes, the platypus is a rare exception in that it is both a mammal and egglaying. We indicate this odd fact by implementing the Egglaying interface, too.

If our code wants to access the additional methods of the Egglaying interface, it has to check if the given object implements it and then cast it. I call this type of cast a “wildcast” because such casts seem to appear out of nowhere when reading the code and seemingly don’t lead up or down the typical type hierarchy. Why would a mammal lay eggs?

One of my approaches that I happen to use more often recently is to indicate the existence of a real wildcast with an Optional return type. In theory, you can wildcast from anywhere to anyplace you want. But only some of these jumps have a purpose in the domain. And an explicit casting method is a good way to highlight this purpose:

public abstract class Mammal {
	public Optional<Egglaying> asEgglaying() {
		return Optional.empty();
	}
}

The “asEgglaying()” method might return an Egglaying object, or it might not. As you can see, by default, it returns only an empty Optional. This means that no cow, horse, cat or dog has to think about laying eggs, they just aren’t into it by default.

public class Platypus extends Mammal implements Egglaying {
	@Override
	public Optional<Egglaying> asEgglaying() {
		return Optional.of(this);
	}
}

The platypus is another story. It is the exception to the rule and knows it. The code “Optional.of(this)” is typical for this coding technique.

A client that iterates over a collection of mammals can now incorporate the special case with more grace:

for (Mammal each : List.of(mammals())) {
	each.lactate();
	each.asEgglaying().ifPresent(Egglaying::breed);
}

Compare this code with a more classic approach using a wildcast:

for (Mammal each : List.of(mammals())) {
	each.lactate();
	if (each instanceof Egglaying) {
		((Egglaying) each).breed();
	}
}

My biggest gripe with the classic approach is that the instanceof is necessary for the functionality, but not guided by the domain model. It comes as a surprise and has no connection to the Mammal type. In the Optional wildcast version, you can look up the callers of “asEgglaying()” and see all the special code that is written for the small number of mammals that lay eggs. In the classic approach, you need to search for conditional casts or separate between code for birds and special mammal code when looking up the callers.

In my real-world projects, this “optional wildcast” style facilitates domain discovery by code completion and seems to lead me to more segregated type systems. These impressions are personal and probably biased, so I would like to hear from your experiences or at least opinions in the comments.

Cancellation Token in C#

Perhaps you know this problem from your first steps with asynchronous functions. You want to call them, but a CancellationToken is requested. But where does it come from if you do not have one at hand? At first glance, a quick way is to use an empty CancellationToken. But is that really the solution?

In this blog article, I explain what CancellationTokens are for and what the CancellationTokenSource is for.

Let us start with a thought experiment: You have an Application with different Worker threads. The Application is closed. How can you make the whole Application with all the Worker threads close?
One solution would be a global bool variable “closed”, which is set in this case. All Workers have a loop built in, which regularly checks this bool and terminates if it is set. Basically, the Application and all Workers must cooperate and react to the signal.

class Application
{
    public static bool closed = false;
    Application()
    {
        var worker = new Worker();
        worker.Run();

        Thread.Sleep(2500);
        closed = true;

        // Wait for termination of Tasks
        Thread.Sleep(2500);
    }
}
class Worker
{
    public async void Run()
    {
        bool moreToDo = true;
        while (moreToDo)
        {
            await DoSomething();
            if (Application.closed)
            {
                return;
            }
        }
    }
}

This is what the CancellationToken can be used for. The token is passed to the involved Workers and functions. Through it, the status of whether or not it was cancelled can be queried. The token is provided via a CancellationTokenSource, which owns and manages the token.

class Application
{
    Application()
    {
        var source = new CancellationTokenSource();
        var token = source.Token;
        var worker = new Worker();
        worker.Run(token);

        Thread.Sleep(2500);
        source.Cancel();

        // Wait for cancellation of Tasks and dispose source
        Thread.Sleep(2500);
        source.Dispose();
    }
}
class Worker
{
    public async void Run(CancellationToken token)
    {
        bool moreToDo = true;
        while (moreToDo)
        {
            await DoSomething(token);
            if (token.IsCancellationRequested)
            {
                return;
            }
        }
    }   
}

The call to source.Cancel() sets the token to canceled. CancellationTokens and sources offer significantly more built-in possibilities, such as callbacks, WaitHandle, CancelAfter and many more.

So using an empty CancellationToken is not a good solution. This means that cancellations cannot be passed on and a task could continue to run in the background even though the Application should have been terminated a long time ago.

I hope I have been able to bring the topic a little closer to you. If you have already encountered such problems and found solutions, please leave a comment.

Using Docker Containers in Development with WebStorm: Next Iteration

We are always in pursuit of improving our build and development infrastructures. Who isn’t?

At Softwareschneiderei, we have about five times as many projects as we have developers (without being overworked, by the way) and each of them comes with its own requirements, so it is important to be able to switch between different projects as easily as cloning a git repository, avoiding meticulous configuration of your development machines that might break on any change.

This is the main advantage of the development container (DevContainer) approach (with Docker being the major contestant at the moment), and last November, I tried to outline my then-current understanding of integrating such an approach with the JetBrains IDEs. E.g. for WebStorm, there is some kind of support for dockerized run configurations, but that does some weird stuff (see below), and JetBrains did not care enough yet to make that configurable, or at least to communicate the sense behind that.

Preparing our Dev Container

In our projects, we usually have at least two Docker build stages:

  • one to prepare the build platform (this will be used for the DevContainer)
  • one to execute the build itself (only this stage copies actual sources)

There might be more (e.g. for running the build in production, or for further dependencies), but the basic distinction above helps us to speed up the development process already. (Further reading: Docker cache management)

For one of our current React projects (in which I chose to try Vite instead of the outdated Create-React-App, see also here), the Dockerfile might look like

# --------------------------------------------
FROM node:18-bullseye AS build-platform

WORKDIR /opt
COPY package.json .
COPY package-lock.json .

# see comment below
RUN npm install -g vite

RUN npm ci --ignore-scripts
WORKDIR /opt/project

# --------------------------------------------
FROM build-platform AS build-stage

RUN mkdir -p /build/result
COPY . .
CMD npm run build && mv dist /build/result/app

The “build platform” stage can then be used as our Dev Container, from the command line as (assuming this Dockerfile resides inside your project directory where also src/ etc. are chilling)

docker build -t build-platform-image --target build-platform .
docker run --rm -v ${PWD}:/opt/project build-platform-image <command_for_starting_dev_server>

Some comments:

  • The RUN step to npm install -g vite is required for a Vite project because our chosen base image node:18-bullseye does not know about the vite binaries. One could improve that by adding another step beforehand, only preparing a vite+node base image and taking advantage of Docker caching from then on.
  • We specifically have to take the WORKDIR /opt/project because our mission statement is to integrate the whole thing with WebStorm. If you are not interested in that, that path is for you to choose.

Now, if we are not working against any idiosyncrasies of an IDE, the preparation step “npm ci” gives us all our node dependencies in the current directory inside a node_modules/ folder. Because this blog post is going somewhere, already now we chose to place that node_modules in the parent folder of the actual WORKDIR. This will work because for lack of an own node_modules, node will find it above (this fact might change with future Node versions, but for now it holds true).
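The lookup that makes this work can be sketched in a few lines of JavaScript. This is a deliberately simplified illustration of the directory walk (POSIX paths only, no symlinks, no NODE_PATH), and the function name nodeModulesLookupPaths is made up for this post, it is not a Node API.

```javascript
// Simplified sketch of where Node looks for a package that is imported from
// a file inside startDir: first startDir/node_modules, then each parent
// directory's node_modules, up to the file system root.
function nodeModulesLookupPaths(startDir) {
  const parts = startDir.split("/").filter((part) => part !== "");
  const paths = [];
  for (let i = parts.length; i >= 0; i--) {
    // Node does not append node_modules to a directory already named so.
    if (parts[i - 1] === "node_modules") continue;
    const dir = "/" + parts.slice(0, i).join("/");
    paths.push((dir === "/" ? "" : dir) + "/node_modules");
  }
  return paths;
}

const candidates = nodeModulesLookupPaths("/opt/project");
// candidates is:
// ["/opt/project/node_modules", "/opt/node_modules", "/node_modules"]
console.log(candidates.join("\n"));
```

With our Dockerfile, the first candidate does not exist inside the image, so the second one, /opt/node_modules, is where the dependencies installed by “npm ci” are found.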

The Challenge with JetBrains

Now, the current JetBrains IDEs allow you to run your project with the node interpreter (containerized within the build-platform image) in the “Run/Debug Configurations” window via

“+” ➔ “npm” ➔ Node interpreter “Add…” ➔ “Add Remote” ➔ “Docker”

then choose the right image (e.g. build-platform-image:latest).

Now enters that strange IDE behaviour that is not really documented or changeable anywhere. If you run this configuration, your current project directory is going to be mounted in two places inside the container:

  • /opt/project
  • /tmp/<temporary UUID>

This mounting behaviour explains why we cannot install our node_modules dependencies inside the container in the /opt/project path – mounting an external folder always overrides anything that might exist at the corresponding mount point, e.g. any /opt/project/node_modules will be overwritten by force.

As we cared about that by using the /opt parent folder for the node_modules installation, and we set the WORKDIR to be /opt/project one could think that now we can just call the development server (written as <command_for_starting_dev_server> above).

But we couldn’t!

For reasons that made us question our reality way longer than it made us happy, it turned out that the IDE somehow always chose the /tmp/<uuid> path as WORKDIR. We found no way of changing that. JetBrains doesn’t tell us anything about it. The “docker run -w / --workdir” parameter did not help. We really had to use that less-than-optimal hack to modify the package.json “scripts” options, by

 "scripts": {
    "dev": "vite serve",
    "dev-docker": "cd /opt/project && vite serve",
    ...
  },

The “dev” line was there already (if you use create-react-app or something else, this calls that something else accordingly). We added another script with an explicit “cd /opt/project”. One can then select that script in the new Run Configuration from above and now that really works.

We do not like this way because doing so, one couples a bad IDE behaviour with hard coded paths inside our source files – but at least we separate it enough from our other code that it doesn’t destroy anything – e.g. in principle, you could still run this thing with npm locally (after running “npm install” on your machine etc.)

Side note: Dealing with the “@esbuild/linux-x64” error

The internet has not widely adopted Vite as a scaffolding / build tool for React projects yet, and one of the problems on our way was a nasty error along the lines of

Error: The package "esbuild-linux-64" could not be found, and is needed by esbuild

We found the best solution for that problem was to add the following to the package.json:

"optionalDependencies": {
    "@esbuild/linux-x64": "0.17.6"
}

… using the “optionalDependencies” rather than the other dependency entries because this way, we still allow the local installation on a Windows machine. If the dependency were not optional, npm install would just throw a wrong-OS error.

(Note that as a rule, we do not like the default usage of SemVer ^ or ~ inside the package.json – we rather pin every dependency, and do our updates specifically when we know we are paying attention. That makes us less vulnerable to sudden npm-hacks or sneaky surprises in general.)

I hope all this information might be useful to you. It took us a considerable amount of thought and research to come to this conclusion, so if you have any further tips or insights, I’d be glad to hear from you!

Unit-Testing Deep-Equality in C#

In the suite of redux-style applications we are building in C#, we are making extensive use of value-types, which implies that a value compares as equal exactly if all of its contents are equal, also known as “deep equality”, as opposed to “reference equality” or “shallow equality”. Both of those imply deep equality, but the other way around is not true. The same object is of course equal to itself, no matter how deep you look. And an object that references the same data as another object also has equal content. But a simple object that contains different lists with equal content will be unequal under shallow comparison, but equal under deep comparison.
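The three notions are easiest to tell apart side by side. The following sketch uses JavaScript rather than C#, simply because plain objects keep it short; shallowEqual and deepEqual are hand-written helpers for this illustration, not library functions.

```javascript
// Shallow: the top-level properties must be identical (===).
function shallowEqual(a, b) {
  const keys = Object.keys(a);
  return keys.length === Object.keys(b).length &&
    keys.every((key) => a[key] === b[key]);
}

// Deep: recurse into nested arrays and plain objects.
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keys = Object.keys(a);
  return keys.length === Object.keys(b).length &&
    keys.every((key) =>
      Object.prototype.hasOwnProperty.call(b, key) && deepEqual(a[key], b[key]));
}

const first = { name: "state", items: [1, 2, 3] };
const second = { name: "state", items: [1, 2, 3] }; // different list, equal content

console.log(first === second);            // false (reference equality)
console.log(shallowEqual(first, second)); // false (the two lists are different objects)
console.log(deepEqual(first, second));    // true
console.log(shallowEqual(first, { ...first })); // true (both reference the same list)
```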

Though init-only records already provide a per-member comparison as Equals by default, this fails for collection types such as ImmutableList<> that, against all intuition but in accordance to , only provide reference-equality. For us, this means that we have to override Equals for any value type that contains a collection. And this is where the trouble starts. Once Equals is overridden, it’s extremely easy to forget to also adapt Equals when adding a new property. Since our redux-style machinery relies on a proper “unequal”, this would manifest in the application as a sporadically missing UI update.

So we devised a testing strategy for those types, using a little bit of reflection:

  1. Create a sample instance of the value type with no member retaining its default value
  2. Test, by going over all properties and comparing to the same property in a default instance, if indeed all members in the sample are non-default
  3. For each property, use Equals to compare the sample instance to a modified sample instance with that property set to the value from a default instance. The two must compare as unequal.

If step 2 fails, it means there’s a member that’s still at its default value in the sample instance, e.g. the test wasn’t updated after a new property was added. If step 3 fails, the sample was updated, but the new property is not considered in Equals – and it can even tell which property is missing.
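As a rough sketch of the strategy, here is the same idea in Python, using dataclasses as a stand-in for C# records and dataclasses.fields for the reflection part. The Settings type, its members and both helper functions are hypothetical. The hand-written __eq__ deliberately forgets one member so that step 3 has something to report:

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True, eq=False)
class Settings:
    """Hypothetical value type with a hand-written equality."""
    name: str = ""
    retries: int = 0
    tags: tuple = ()

    def __eq__(self, other):
        if not isinstance(other, Settings):
            return NotImplemented
        # Deliberate bug for the demonstration: 'tags' was forgotten here.
        return self.name == other.name and self.retries == other.retries

    def __hash__(self):
        return hash((self.name, self.retries))


def members_still_default(sample):
    """Step 2: names of members that still hold their default value."""
    default = type(sample)()
    return [f.name for f in fields(sample)
            if getattr(sample, f.name) == getattr(default, f.name)]


def members_ignored_by_eq(sample):
    """Step 3: names of members whose change goes unnoticed by __eq__."""
    default = type(sample)()
    ignored = []
    for f in fields(sample):
        # Copy of the sample with exactly one member reset to its default.
        modified = replace(sample, **{f.name: getattr(default, f.name)})
        if sample == modified:
            ignored.append(f.name)
    return ignored


# Step 1: a sample instance with no member left at its default value.
sample = Settings(name="x", retries=3, tags=("a",))
print(members_still_default(sample))  # []
print(members_ignored_by_eq(sample))  # ['tags']
```

In a real test, both lists would be asserted to be empty, and a non-empty list names the offending members directly.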

The same problems of course arise with GetHashCode, but are usually less severe. Forgetting to add a property just makes collisions more likely. It can be tested much in the same way, but can potentially lead to false positives: collisions can occur even if all properties are correctly considered in the function. In that case, however, the sample can usually be altered to remove the collision – and it is really unlikely. In fact, we never had a false positive.

Partitioning in Oracle Database: Because Who Wants to Search an Endless Table?

As data volumes continue to grow, managing large database tables and indexes can become a challenge. This is where partitioning comes in. Partitioning is a feature of database systems that allows you to divide large tables and indexes into smaller, more manageable parts, known as partitions. This can improve the performance and manageability of your database. Aside from performance considerations, maintenance operations, such as backups and index rebuilds, can become easier by allowing them to be performed on smaller subsets of data.

The performance gain is achieved by reducing the amount of data that needs to be scanned during query execution. When a query is executed, the database can use the partitioning information to skip over partitions that do not contain the relevant data, instead of having to scan the entire table. This reduces the amount of I/O required to execute the query, which can result in significant performance gains, especially for large tables.

There are several types of partitioning available in Oracle Database, including range partitioning, hash partitioning, list partitioning, and composite partitioning. Each type of partitioning is suited to different use cases and can be used to optimize the performance of your database in different ways. In this blog post we will look at range partitioning.

Range partitioning

Here is an example of range-based partitioning in Oracle:

CREATE TABLE books (
  id NUMBER,
  title VARCHAR2(200),
  publication_year NUMBER
)
PARTITION BY RANGE (publication_year) (
  PARTITION p_before_2000 VALUES LESS THAN (2000),
  PARTITION p_2000s VALUES LESS THAN (2010),
  PARTITION p_2010s VALUES LESS THAN (2020),
  PARTITION p_after_2020 VALUES LESS THAN (MAXVALUE)
);

In this example, we have created a table called books that stores book titles, partitioned by the year of publication. We have defined four partitions, p_before_2000, p_2000s, p_2010s, and p_after_2020.

Now, when we insert data into the books table, it will automatically be placed in the appropriate partition based on the year of publication:

INSERT INTO books (id, title, publication_year)
  VALUES (1, 'Nineteen Eighty-Four', 1949);

This book will be inserted into partition p_before_2000, as the year of publication is before 2000. The following book will be placed into partition p_2000s:

INSERT INTO books (id, title, publication_year)
  VALUES (2, 'The Hunger Games', 2008);

When we query the books table, the database will only access the partitions that contain the data we need. For example, if we want to retrieve data for books published in 2015 and 2016, the database will only access partition p_2010s:

SELECT * FROM books WHERE publication_year >= 2015 AND publication_year <= 2016;
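To make the routing and the skipping of partitions more tangible, here is a toy model in Python. It only mimics the "VALUES LESS THAN" rule of the example table and says nothing about how Oracle implements this internally. The function names are made up for this sketch:

```python
from bisect import bisect_right

# Exclusive upper bounds of the first three partitions.
# The last partition is open-ended (MAXVALUE).
BOUNDS = [2000, 2010, 2020]
NAMES = ["p_before_2000", "p_2000s", "p_2010s", "p_after_2020"]


def partition_for(year):
    """Partition that a row with this publication year is placed in."""
    return NAMES[bisect_right(BOUNDS, year)]


def partitions_for_range(first_year, last_year):
    """Partitions that can contain rows with first_year <= year <= last_year."""
    lowest = bisect_right(BOUNDS, first_year)
    highest = bisect_right(BOUNDS, last_year)
    return NAMES[lowest:highest + 1]


print(partition_for(1949))               # p_before_2000
print(partition_for(2008))               # p_2000s
print(partitions_for_range(2015, 2016))  # ['p_2010s']
```

All other partitions are never touched for the range query, which is where the savings come from.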

However, you should be aware that while partitioning can improve query performance for some types of queries, it can also negatively impact query performance for others, especially if the partitioning scheme does not align well with the query patterns. Therefore, you should tailor the partitioning to your needs and check if it brings the desired effect.

Format-based sorting looks clever, but is dangerous

A neat trick I learnt early in my career, even before I learnt about version control, was how to format a date as a string so that alphabetically sorted lists would contain them in the “correct” order:

“YYYYMMDD” is the magic string.

If you format your dates as 20230122 and 20230123, the second name will be sorted after the first one. With nearly any other format, your date strings will not be sorted chronologically in the file system.
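A quick Python sketch shows the effect. The day-first format used for comparison is just one arbitrary example of "nearly any other format":

```python
from datetime import date

dates = [date(2023, 1, 23), date(2022, 12, 31), date(2023, 1, 22)]

# Plain alphabetical sorting of the formatted strings, as a file system would do.
year_first = sorted(d.strftime("%Y%m%d") for d in dates)
day_first = sorted(d.strftime("%d.%m.%Y") for d in dates)

print(year_first)  # ['20221231', '20230122', '20230123']
print(day_first)   # ['22.01.2023', '23.01.2023', '31.12.2022']
```

Only the first list is in chronological order. In the second one, the oldest date ends up last.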

I’ve found out that this is also nearly the only format that most people cannot intuitively recognize as a date. So while it is familiar to me and conveniently sorted, it is confusing or at least in need of explanation for virtually every user of my systems.

Keep that in mind when listening to the following story:

One project I adopted is a custom enterprise resource planning system that was developed by a single developer who one day left the company and the code behind. The software was in regular use and in dire need of maintenance and new features.

One concept in the system is central to its users: the list of items in an invoice or a bill of delivery. This list contains items in a defined order that is important to the company and its customers.

To my initial surprise, the position of an item in the list was not defined by an integer, but a string. This can be explained by the need of “sub-positions” that form a hierarchy of items, like in this example:

1 – basic item

1.1 – item upgrade #1

1.2 – item upgrade #2

Both positions “1.1” and “1.2” are positioned “underneath” position “1” and should be considered glued to it. If you move position “1” to position “4”, you also move 1.1 to 4.1 and 1.2 to 4.2.

But there was a strange formatting thing going on with the positions: They were stored as strings in the database, but with a strange padding in front. Instead of “1”, “2” and “3”, the entries contained the positions “  1”, “  2” and “  3”. All positions were prefixed with two space characters!

Well, nearly all positions. As soon as the list grew, the padding turned out to be dependent on the number of digits in the position: “  9”, but then “ 10” and “100”.

The reason is relatively simple: If you prefix with spaces (or most other characters, maybe “0”), your strings will be ordered in a numerical way. Without the prefixes, they would be sorted like “1”, “10”, “11”, “2”.

That means that the desired ordering of the positions is hardcoded in the database representation! You probably already thought about the case of a position greater than 999. That’s when trouble begins! Luckily, an invoice with a thousand items on the list is unheard of in the company (yet!).
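The effect of the padding, and the point where it stops working, can be reproduced with a few lines of Python. This is only an illustration with made-up sample values, not code from the system in question:

```python
positions = ["1", "2", "10", "11", "100", "9"]

# Without padding, alphabetical order is not numerical order.
print(sorted(positions))
# ['1', '10', '100', '11', '2', '9']

# Right-aligned to three characters, alphabetical order matches numerical order.
padded = sorted(p.rjust(3) for p in positions)
print(padded)
# ['  1', '  2', '  9', ' 10', ' 11', '100']

# A fourth digit no longer fits into the padding width and breaks the order again.
too_long = sorted(p.rjust(3) for p in ["999", "1000"])
print(too_long)
# ['1000', '999']
```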

Please note that while the desired ordering is hardcoded in the database, the items are still loaded in a different order (as they were entered into the system) and need to be sorted by the application. The default sorting for strings is the alphabetical order, so the original developer probably was clever/lazy, went with it and formatted the data in a way that would produce the result and not require additional logic during the sorting.

If you look at the code, you see seemingly strange formatting calls to the position all over the place. This is necessary because, for example, every time a user enters a position into the system, it needs to be reformatted (or at least sanitized) in order to adhere to the “auto-sortable” format.

If you wonder what a hierarchical sub-position looks like with this format, it’s “  1.  1”, “  1. 10” and even “  1. 17.  2.  4”. The database stores mostly blanks in this field.

While this approach might seem clever at the moment, it is highly dangerous. It conflates several things that should stay separated, like “storage format” and “display format”, “item order” or just “valid value range”. It is a clear violation of the “separation of concerns” principle. And it broke the application when I missed one place where the formatting was required, but not present. Of course, this only manifests in a problem when your test cases (or manual tries) exceed a list of 9 entries – lesson learnt here.
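For contrast, here is a minimal Python sketch of what keeping the item order separate from the stored string could look like. This is a hypothetical alternative, not what the system does: the position stays a plain string like “1.10”, and a sort key (here called position_key, a made-up name) is derived from it only when sorting:

```python
def position_key(position):
    """Turn a position like '1.17.2' into a tuple of integers for sorting."""
    # int() ignores surrounding blanks, so padded legacy values work as well.
    return tuple(int(part) for part in position.split("."))


positions = ["10", "1.2", "2", "1", "1.10", "1000", "1.1"]
ordered = sorted(positions, key=position_key)
print(ordered)
# ['1', '1.1', '1.2', '1.10', '2', '10', '1000']
```

With such a key, the order no longer depends on any padding width, so positions beyond 999 need no special treatment.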

I dread the moment when the company calls to tell me about this “unusually large invoice” that exceeds the 999 limit. This would mean a reformatting of all stored data or another even more clever hack to circumvent the problem.

Did you encounter a format that was purely there for sorting in the wild? What was the story? Tell us in the comments!