Using PostgreSQL for time-series data

The number of sensors and other things that periodically collect data is ever growing. The advent of the internet of things (IoT) demands a way of storing and analyzing all this so-called time-series data. There are many options for such data – the most prominent being dedicated time-series databases like InfluxDB or well-suited, nicely scaling databases like Apache Cassandra.

The problem: you have to tailor your solution to one of these technologies, whereas SQL already offers mature database management systems (DBMS) and drivers/bindings for almost any programming language.

Why not use a plain SQL database?

Relational SQL databases are a mature and well-understood piece of technology, albeit not as sexy as all those new NoSQL databases. Using them for time-series data may not be a problem for smaller datasets, but sooner or later your ingestion and query performance will degrade massively. So in general it is not a good option to store all your time-series data in a traditional relational DBMS (RDBMS).

Why use PostgreSQL with TimescaleDB?

With the PostgreSQL extension TimescaleDB you get the best of both worlds: a well-known query language, robust tools and scalability.

You access and manage your time-series database just like your ordinary PostgreSQL database. Almost everything, including replication and backups, will continue to work like before.
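For illustration, here is what setting up a time-series table could look like – ordinary SQL plus one call to TimescaleDB’s create_hypertable function (a minimal sketch; the table and column names are made up):

-- Enable the extension once per database
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- An ordinary table for sensor readings (hypothetical schema)
CREATE TABLE conditions (
  time        TIMESTAMPTZ      NOT NULL,
  sensor_id   INTEGER          NOT NULL,
  temperature DOUBLE PRECISION
);

-- Turn the table into a hypertable partitioned by time
SELECT create_hypertable('conditions', 'time');

From then on, inserts and queries against the table are plain SQL.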

You do not have to deal with limitations of specialized solutions or learn a completely new ecosystem just for one aspect of your solution.

The future

We are successfully using TimescaleDB in one of our projects and will continue to share tips and experiences with this technology, taking its rising importance into account.


The lorem ipsum in development

We’ve all been guilty of this: showing a new UI mockup to the user with lorem ipsum text or typing fake data into forms for testing. That’s bad, because fake data does not have the same characteristics as real data. Real data has specific lengths, formats and structure. Fake data is arbitrary. It is even more embarrassing if a lorem ipsum makes it into production. But using fake data is not only a problem in the user interface.
Using fake data is also a problem in programming. We see it in the names of classes: *Impl, *Manager or *Holder.
These names do not communicate well. Why not name classes with intent? Take a look at domain-driven design and its ubiquitous language for a start.
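A made-up example of what naming with intent can look like (both class names are hypothetical):

// A generic, fake name that says nothing about the domain:
class AccountDataManager { /* ... */ }

// A name taken from the domain and its ubiquitous language:
class AccountLedger { /* ... */ }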
Using only fake data in tests is bad as well. Yes, you need to test the extreme cases, but even here you need to inform your selection of data from the real world. Build in the constraints that the domain provides. Without the domain you build highly flexible software, which is a nightmare to maintain. The test data should not be arbitrary either: you need the corner cases, and to find them, look again into the domain, the real world. To do this you need to gather real data from users, domain experts or existing systems and reports. Real data has constraints, and constraints drive creativity and decisions. The problem with deferring decisions for too long is that you have to maintain your flexibility until then. So make decisions – but from real data, not fake data.

Developer Experience means nothing

You’ve probably already heard about the concept of User Experience (UX) and the matching virtue of User Experience Design. If not, you might want to go check it out. I would suggest you start with the excellent book “The Design of Everyday Things” by Donald Norman. It has nothing to do with software development, so you won’t mix up User Experience Design with Graphical User Interface Design or even Graphics Design.

In short, User Experience is what a user of your software will experience while using your software. If your software makes them smile, you’re some kind of UX god. If your software makes them curse repeatedly, you’re probably not. You can try to improve the experience of your users by making changes to the software. This is called User Experience Design and is an important part of software development. Most developers know nothing about User Experience Design.

What developers naturally know about is Developer Experience (DX), a concept that isn’t really defined in the literature. I made it up to explain my point to you. Every software developer has a favorite programming language and IDE. Remember, source code solving the same problem in different programming languages will ultimately yield the same machine code. The machine is totally agnostic about your preference for a certain programming language. Your choice of a programming language for a certain problem is more about your Developer Experience than anything else. Developer Experience is everything you feel and think during software development. A bad Developer Experience makes you swear about the tools, the code and everything and everybody else around you. A good Developer Experience gives you a sense of accomplishment and safety while coding and makes your work no harder than necessary.

Because you can’t read elsewhere about the concept of Developer Experience, I want to give two more examples and show how it affects the User Experience. The first example is a big, long-lasting software system that is in production and still developed further. If you have a software system in production, everything requires an additional thought about the matching upgrade strategy. You can’t just modify the database structure, you need to provide migrations. You can’t just add a configuration entry, you need to make it optional or consider the least harmful default value. In the project of our example, the developers had three possible approaches to a new configuration entry:

  • They could add it to the code but leave the configuration files unchanged. This required the code to handle the “absent from configuration” case in a useful manner. It required extra effort from the developer in the code. The user would not know about the new configuration entry unless it was stated in some external documentation.
  • They could add it to the code and write a configuration migration script that added it to the existing configuration files on the customer’s installation. The code could now expect the entry to be present, but the developer had to write and test the migration script code. The user would see the new configuration entry with the default value.
  • They could introduce a new configuration file to the system and place the new configuration entry in it. The code could expect the entry to be present, because new configuration files were added to an existing installation during the upgrade process. The user would see the new configuration file and the new configuration entry with the default value.

You can probably guess which of the three options got used so excessively that users complained about the configuration being all over the place, in a myriad of little one-liner configuration files with ominous names. The developers chose the best option for themselves and, in the short view, for the users. But in the long run, the User Experience declined.

The second example is from a computer game in the mid-2000s. It was a massively multiplayer online shooter with a decent implementation ahead of its time. But one thing was still from the last century: after each update of the game, the key bindings were reset to their defaults – as was every other local modification to the settings. After each update, you had to configure your video, audio, controls and everything else, like your in-game equipment, again and again. The game didn’t offer any means to copy or reload your settings. It was up to you to maintain a recent backup of the settings files, just in case. And if the file format had changed, you needed to combine their changes with yours. WinMerge is a decent tool for that task. But the problem is clear: the game developers couldn’t be bothered to think about how their upgrade strategy would affect their users. They ignored the problem and let the users figure it out. They chose better Developer Experience, free from complexities like user-side modifications, over good User Experience like a game that can be upgraded without drawbacks for the gamers.

Sadly, this is a common formula in software development: Developer Experience is treated as more important than User Experience. Just look at it from a utilitarian point of view: if you burden thousands or even millions of users with just an easy one-minute task that you could fix during development, you have a budget of two workdays up to 10 person-years. Do the math yourself: one million users, each spending one minute of work on your software, equal over 2000 workdays (a million minutes is roughly 16,700 hours, or about 2,000 eight-hour days) lost on a thing you could probably fix for everybody in an hour or two.

And this brings me to my central statement: Developer Experience, as opposed to User Experience, means nothing. It’s just not important – at least not to the users. It is important to the small group of developers and should not be forgotten. But never should a decision lead to better Developer Experience at the expense of User Experience. It’s a small inconvenience for us developers to think about smooth upgrades, meaningful and consistent control element titles or an easy installation. It’s a whole new game for our users.

Always choose User Experience over Developer Experience.

Call to action: If you have another good example of developers being lazy at the expense of their users, please share it as a comment.

Call to action 2: The initial picture is linked from the excellent website badhtml.com. If you ever feel like an imposter for your latest design, go and visit this website.

luabind deboostified tips and tricks

luabind deboostified is a fork of the luabind project, which helps you expose C++ APIs to Lua. As the name implies, the fork replaces the boost dependency with modern C++, which makes it a lot more pleasant to work with.

Here are a few tips and tricks I learned while working with it. Some tricks might be applicable to the original luabind – I do not know.

1. Splitting module registration

You can split the registration code for different classes. I usually add a register function per class, like this:

#include <luabind/luabind.hpp>

struct A {
  void doSomething();
  static luabind::scope registerWithLua();
};

struct B {
  void goodStuff();
  static luabind::scope registerWithLua();
};

You can then combine their registration code into a single module on the Lua side:

void registerAll(lua_State* L) {
  luabind::module(L)[
    A::registerWithLua(),
    B::registerWithLua()];
}

The implementation of a registration function looks like this:

luabind::scope A::registerWithLua()
{
  return luabind::class_<A>("A")
    .def("doSomething", &A::doSomething);
}

2. Multiple policies and multiple return values

Unlike C++, Lua has real multiple return values. You can make use of that via the return-value policies that luabind offers. Let’s say you want to write this in Lua:

local x, y = getPosition(a)

The C++ side could look like this:

void getPosition(A const& a, float& x, float& y);

The deboostified fork needs its policies supplied in a type list. Let’s use a small helper meta-function to build that:

template <typename... T>
using joined = 
  typename luabind::meta::join<T...>::type;

Once you have that, you can expose it like this:

luabind::def("getPosition", &getPosition,
              joined<
                luabind::pure_out_value<2>,
                luabind::pure_out_value<3>
              >());

3. Specialized data structures using luabind::object

Using the converters in luabind is not the only way to make Lua values from C++. Almost everything you can do in Lua itself, you can do with luabind::object. Here is a somewhat contrived example:

luabind::object repeat(luabind::object what,
                       int count) {
  // Create a new table object
  auto result = luabind::newtable(
    what.interpreter());
  // Fill it as an array [1..N]
  for (int i = 1; i <= count; ++i)
    result[i] = what;
  return result;
}

This function can then be exported via luabind::def and used just like any other function. And this is just the tip of the iceberg. For example, you can also write functions that, at runtime, behave differently when a number is passed in than when a table is passed in. You can find out the Lua type with luabind::type(myObject).
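A minimal sketch of such a type dispatch (the describe function is hypothetical, not part of luabind):

// Branch on the runtime Lua type of a luabind::object
luabind::object describe(luabind::object value) {
  auto* L = value.interpreter();
  switch (luabind::type(value)) {
  case LUA_TNUMBER:
    return luabind::object(L, std::string("a number"));
  case LUA_TTABLE:
    return luabind::object(L, std::string("a table"));
  default:
    return luabind::object(L, std::string("something else"));
  }
}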

Of course, as soon as you want to create new objects to return to Lua, you need the lua_State pointer in that function. Using the interpreter from a passed-in luabind::object is one way, but I have yet to find another pleasant way to do this. It is probably possible to use the policies to pass it in as a special parameter, but for now I am using some complicated machinery to bind lambda functions that capture the Lua interpreter.

That’s it for now.

Keep in mind that these are not thoroughly researched best practices, but patterns I have used to solve actual problems. There might be better solutions out there – if you know any, please let me know. I hope this helped!

Using Ansible vault for sensitive data

We like using Ansible for our automation because it has minimal requirements for the target machines and the surrounding infrastructure: you need nothing more than ssh and Python with some libraries. In contrast to alternatives like Puppet and Chef, you do not need special server and client programs running all the time and communicating with each other.

The problem

When setting up remote machines and deploying software systems for your customers, you will often have to use sensitive data like private keys, passwords and maybe machine or account names. On the one hand, you want to put your automation scripts and their data under version control and use them from your continuous integration infrastructure. On the other hand, you do not want to spread the secrets of your customers all around your infrastructure – and definitely never ever store them in your source code repository.

The solution

Ansible supports encrypting sensitive data and using it in playbooks with the concept of vaults and the accompanying commands. Setting it up requires some work, but then usage is straightforward and works seamlessly.

The high-level conversion process is the following:

  1. create a directory for the variables of the host or group in question
  2. extract all sensitive variables into vars.yml
  3. copy vars.yml to vault.yml
  4. prefix the variables in vault.yml with vault_
  5. reference the vault variables in vars.yml

Then you can encrypt vault.yml using the ansible-vault command providing a password.

All you have to do subsequently is to provide the vault password along with your usual playbook commands. Decryption for playbook execution is done transparently on-the-fly for you, so you do not need to care about decryption and encryption of your vault unless you need to update the data in there.

The step-by-step guide

Suppose we want to work on a target machine that is run by your customer but that you can access via ssh. You do not want to store your ssh user name and password in your repository, but you want to be able to run the automation scripts unattended, e.g. from a Jenkins job. Let us call the target machine ceres.

So first you set up the directory structure by creating a directory for the target machine called $ansible_script_root$/host_vars/ceres.

To log into the machine we need two sensitive variables: ansible_user and ansible_ssh_pass. We put them into a file called $ansible_script_root$/host_vars/ceres/vars.yml:

ansible_user: our_customer_ssh_account
ansible_ssh_pass: our_target_machine_pwd

Then we copy vars.yml to vault.yml and prefix the variables with vault_ resulting in $ansible_script_root$/host_vars/ceres/vault.yml with content of:

vault_ansible_user: our_customer_ssh_account
vault_ansible_ssh_pass: our_target_machine_pwd

Now we use these new variables in our vars.yml like this:

ansible_user: "{{ vault_ansible_user }}"
ansible_ssh_pass: "{{ vault_ansible_ssh_pass }}"

Now it is time to encrypt the vault using the following command, which will prompt you for the vault password:

ansible-vault encrypt host_vars/ceres/vault.yml

resulting in an encrypted vault that can be put in source control. It looks something like this:

$ANSIBLE_VAULT;1.1;AES256
35323233613539343135363737353931636263653063666535643766326566623461636166343963
3834323363633837373437626532366166366338653963320a663732633361323264316339356435
33633861316565653461666230386663323536616535363639383666613431663765643639383666
3739356261353566650a383035656266303135656233343437373835313639613865636436343865
63353631313766633535646263613564333965343163343434343530626361663430613264336130
63383862316361363237373039663131363231616338646365316236336362376566376236323339
30376166623739643261306363643962353534376232663631663033323163386135326463656530
33316561376363303339383365333235353931623837356362393961356433313739653232326638
3036

Using your playbook looks similar to before; you just need to provide the vault password using one of several options: interactive input (--ask-vault-pass), a vault password file (--vault-password-file) or the ANSIBLE_VAULT_PASSWORD_FILE environment variable. In our example we pass a password file on the command line, assuming the password is stored in ~/.vault_pass:

ansible-playbook --vault-password-file ~/.vault_pass -i inventory work-on-customer-machines.yml

After setting up your environment appropriately with a password file and the ANSIBLE_VAULT_PASSWORD_FILE environment variable, your playbook commands are exactly the same as without using a vault.
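For the unattended Jenkins scenario mentioned above, the setup could look like this (a sketch; the file location is just an example):

# Store the vault password in a file readable only by the CI user
echo 'ourpwd' > /var/lib/jenkins/.vault_pass
chmod 600 /var/lib/jenkins/.vault_pass

# Point Ansible at the password file, e.g. in the job environment
export ANSIBLE_VAULT_PASSWORD_FILE=/var/lib/jenkins/.vault_pass

# From now on, playbook commands need no vault-specific options
ansible-playbook -i inventory work-on-customer-machines.yml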

Conclusion

The Ansible vault feature allows you to safely store and use sensitive data in your infrastructure without changing too much about how you use your automation scripts.

Client-side web development: Drink the Kool-Aid or be cautious?

Client-side web development is a fast-changing world. JavaScript libraries and frameworks come and go monthly. A couple of years ago jQuery was a huge thing, then AngularJS, and nowadays people use React or Vue.js with a state container like Redux. And so do we for new projects. Unfortunately, these modern client-side frameworks are based on the npm ecosystem, which is notorious for its dependency bloat. Even if you only have a couple of direct dependencies, the package manager lock file will list hundreds of indirect dependencies. Experience has shown that lots of dependencies result in a maintenance burden as time passes, especially when you have to do major version updates. Also, as mentioned above, frameworks come and go out of fashion, and the maintainers of a framework may move on to their next pet project, leaving you and your project sitting on a barely or no longer maintained base. And frameworks can’t easily be replaced, because they tend to permeate every aspect of your application.

With this frustrating experience in mind, we recently did an experiment in a new medium-sized web project. We avoided frameworks and the npm ecosystem and only used JavaScript libraries with no or very few indirect dependencies, and only where really necessary. Browsers have become better at being compatible with web standards, at least regarding the basics. Libraries like jQuery and polyfills that paper over the incompatibilities can mostly be avoided – an interesting resource is the website You Might Not Need jQuery.

We still organised our views as components, and they communicate via a very simple event dispatcher. Some things had to be done by hand, but not too many. It works, although the result is not as pure as it would have been with declarative views as facilitated by React and a functional state container like Redux. We’re still fans of the React+Redux approach and we’re using it happily (at least for now) in other projects, but we’re also skeptical regarding the long-term costs, especially from relying on the npm ecosystem. Which approach will result in less maintenance burden? We don’t know yet. Time will tell.
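For illustration, such an event dispatcher can be as small as this (a minimal sketch in plain JavaScript, not our actual implementation):

class EventDispatcher {
  constructor() {
    this.handlers = {}; // event name -> list of callback functions
  }

  subscribe(eventName, handler) {
    (this.handlers[eventName] = this.handlers[eventName] || []).push(handler);
  }

  publish(eventName, payload) {
    (this.handlers[eventName] || []).forEach((handler) => handler(payload));
  }
}

// Components communicate without knowing each other:
const dispatcher = new EventDispatcher();
dispatcher.subscribe('itemSelected', (item) => console.log(item.name));
dispatcher.publish('itemSelected', { name: 'example' });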

Books and talks that shaped my mind as a developer

Over the years I’ve read many books and watched many talks, but a few stand out (at least for me) as having influenced me in my development career.

The inmates are running the asylum by Alan Cooper
This book opened my eyes to the fact that I had approached software development from completely the wrong standpoint: the software should serve the user, not vice versa.

Design Patterns by Erich Gamma et al.
Oh, others use the same patterns as me – and what’s more, you can even talk about them without explaining every detail…

Refactoring by Martin Fowler
This book taught me that you can change the structure and the design of the software without changing its function. Cool.

Inventing on Principle by Bret Victor
Seeing a new way of interacting with your software in development blew my mind. Think WYSIWYG on steroids.

Getting real by 37signals
Getting to the core of what is essential and what really needs to be done in software/product development is laid out here so clearly and in such a stripped-down way that it struck me.

Information visualization by Edward Tufte
Another book which reduces its topic (this time: presenting information) to the core, and in doing so identifies so much unnecessary practice that it hurts.

Start with why by Simon Sinek
Purpose. Why do you develop software? Why do I arrange a UI or the architecture of an application the way I do? This is what design is about.

Only openings by Frank Chimero
Do I try to eliminate failures and therefore options, or do I leave the user the possibility to choose…

Web design is 95% typography by Oliver Reichenstein
Concentrate on the main part, the bigger part, the 95%. If you get that right, the rest isn’t so important after all.

Discount usability by Jakob Nielsen
Do what you can do with what you have.