PostgreSQL’s new MERGE command

PostgreSQL version 15 introduces a new SQL command: the MERGE command. This allows merging a table into another table. The MERGE command has existed for some time in other databases such as Oracle or SQL Server.

The principle of this command is that you have a target table into which you want to insert or remove data based on a source table under certain conditions, or whose existing entries you want to update with data from the source table. The source table doesn’t have to be a real table; it can just as easily be a SELECT query.

How to use it, step-by-step

The command begins with MERGE INTO, followed by the name of the target table. We call it dest here:

MERGE
  INTO dest ...

Then you specify the source table with USING. Here we call it src:

MERGE
  INTO dest
  USING src
  ...

If you want to use a SELECT query as the source instead of a real table, you can do it like this:

MERGE
  INTO dest
  USING (SELECT ... FROM ...) AS src
  ...

Now you need a condition that is used to match entries from one table to entries from the other table. This is specified after ON. In this example we simply use the IDs of the two tables:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  ...

This is followed by a case distinction that describes what should happen when the condition matches and when it does not. The possible actions are: UPDATE, DELETE, INSERT, or DO NOTHING.

The two cases are specified with WHEN MATCHED THEN and WHEN NOT MATCHED THEN:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  WHEN MATCHED THEN
    UPDATE SET ...
  WHEN NOT MATCHED THEN
    INSERT (...) VALUES (...);

If a match exists, then reasonable actions are UPDATE, DELETE, or DO NOTHING. If no match exists, then reasonable actions are INSERT or DO NOTHING.

In the WHEN cases, additional conditions can be specified with AND:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  WHEN MATCHED AND dest.value > src.value THEN
    DELETE
  WHEN MATCHED THEN
    UPDATE SET ...
  WHEN NOT MATCHED THEN
    DO NOTHING;

A realistic example

Here’s an example demonstrating a use case that might occur in the real world:

MERGE
  INTO account a
  USING transaction t
  ON a.id=t.account_id
WHEN MATCHED THEN
  UPDATE SET balance = a.balance + t.amount
WHEN NOT MATCHED THEN
  INSERT (id, balance) VALUES (t.account_id, t.amount);

This statement processes a table of monetary transactions and applies them to their matching customer accounts by adding the amount of each transaction to the balance of the matching account. If no matching account exists, it will be created, and its initial balance is the amount of the first transaction.
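
For reference, here is a self-contained sketch of this example that you can run in psql on PostgreSQL 15 or later; the sample data is of course made up:

CREATE TABLE account (
  id bigint PRIMARY KEY,
  balance numeric NOT NULL
);
CREATE TABLE transaction (
  id bigint PRIMARY KEY,
  account_id bigint NOT NULL,
  amount numeric NOT NULL
);

INSERT INTO account VALUES (1, 100);
INSERT INTO transaction VALUES (10, 1, 25), (11, 2, 40);

MERGE
  INTO account a
  USING transaction t
  ON a.id=t.account_id
WHEN MATCHED THEN
  UPDATE SET balance = a.balance + t.amount
WHEN NOT MATCHED THEN
  INSERT (id, balance) VALUES (t.account_id, t.amount);

-- account 1 now has a balance of 125,
-- account 2 was created with a balance of 40

One caveat: MERGE raises an error if more than one source row matches the same target row, so with several transactions per account you would first aggregate them in the USING clause, e.g. with SUM and GROUP BY.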

Speeding up your HQL

Using an object-relational mapper (ORM) to persist your entities, manage their state and query subsets for lists or reports is a widespread practice and may speed up your development.

If not used correctly, it may introduce unexpected performance problems because of inefficient default queries and the overhead the mapping introduces, since most of the time table rows are converted to domain objects. Often this results in many queries and the N+1 query problem.

Nevertheless, the benefits of using an ORM may outweigh the problems, and most problems can be mitigated by the tool’s features and correct usage.

Today I want to present a performance problem we had using GORM/Hibernate and how we easily fixed it without major code restructuring or workarounds.

The Problem

We used an HQL query to load quite a lot of entities, which took about 3 seconds. This was acceptable for our customer. If the user tried to narrow down the results using a filter, however, loading a smaller number of the same entities took over a minute. Obviously, this was totally unacceptable and counter-intuitive.

The Analysis

Further analysis revealed that a particular part of the WHERE clause was responsible for the observed slowdown:

FROM Report r
WHERE r.project.proposal.id = p.id

So we filtered the root entity Report on an entity called Proposal, but needed to load the associated Project entity for all reports under consideration. Even though we are only using entity ids for filtering, the innocent-looking path r.project.proposal.id leads to loading and mapping of hundreds of Project entities.

The Solution

In our example we can fortunately do a lot better without big changes to our domain model, the application code or the query.

The relevant part of the schema is simple: both a Report and a Proposal are associated with a certain Project, and therefore both carry the project’s id. Remember that in Hibernate your entities contain only the id of their one-to-one mapped sub-entities by default. This means that if we change the filter clause to

WHERE r.project.id = p.project.id

we skip loading and mapping all of the Project entities and only load the needed reports and proposals. Since both contain the project id, we can use it in our filter. This resulted in a more than 10x speedup from such a simple and non-invasive change.

General Takeaway

ORMs can be a great tool, but it is very easy to shoot yourself in the foot. With enough care you can achieve both simple code and good performance, but you may run into non-obvious problems every now and then.

Re-ordering table columns in an Oracle database

In an Oracle database, once a table is created, there is no obvious way to change the order of its columns. Sometimes you add a new column to an existing table and want it to be displayed in a different position by default for query results via SELECT *. Imagine you add an ID column to a table after the fact. Wouldn’t it be nice if this appeared in the first position?

Of course you can drop the whole table and create it again with the new column order. But this is cumbersome and potentially dangerous if the table is already filled with data. However, there is a trick that allows you to rearrange the columns without having to recreate the table.

The key to this is an Oracle feature that allows invisible columns. The feature itself is interesting in its own right, but it has a useful side effect that we’ll exploit. The documentation says:

When you make an invisible column visible, the column is included in the table’s column order as the last column. When you make a visible column invisible, the invisible column is not included in the column order, and the order of the visible columns in the table might be re-arranged.

So the plan is to first make a cleverly chosen set of columns invisible, and then to make them visible again in the desired order. This is how it works:

First we have a table with the following columns.

CREATE TABLE t (a NUMBER, b NUMBER, c NUMBER, e NUMBER, f NUMBER);

Later we realize that we need a column d that should be between c and f. So we add it to the table:

ALTER TABLE t ADD (d NUMBER);

This is of course added at the end:

DESC t;
Name Null? Type   
---- ----- ------ 
A          NUMBER 
B          NUMBER 
C          NUMBER 
E          NUMBER 
F          NUMBER 
D          NUMBER 

To get it in the right position, we first hide the columns e and f, and then make them visible again.

ALTER TABLE t MODIFY (e INVISIBLE, f INVISIBLE);
ALTER TABLE t MODIFY (e VISIBLE, f VISIBLE);

And voilà, we have our desired order:

DESC t;
Name Null? Type   
---- ----- ------ 
A          NUMBER 
B          NUMBER 
C          NUMBER 
D          NUMBER 
E          NUMBER 
F          NUMBER 

Note that this doesn’t change the internal, physical layout of the table on the disk. It’s just a cosmetic change.

PostgreSQL’s “DISTINCT ON” clause

Anyone who uses SQL databases knows the DISTINCT modifier for SELECT queries to get result sets without duplicates. However, PostgreSQL has another variant of it that not everyone knows, but which is very useful: the SELECT DISTINCT ON clause. It can be used to query only the first row of each set of rows according to a grouping.

To understand its usefulness, let’s look at an example and solve it in the classical way first.

The complicated way

Given the following table of items, we want to query, for each category, the item with the highest value.

 name │ category │ value
-------------------------
 A    │ X        │ 52
 B    │ X        │ 35
 C    │ X        │ 52
 D    │ Y        │ 27
 E    │ Y        │ 31
 F    │ Y        │ 20
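
If you want to try the queries yourself, a minimal setup for this sample table could look like this:

CREATE TABLE items (name text, category text, value int);

INSERT INTO items VALUES
  ('A', 'X', 52),
  ('B', 'X', 35),
  ('C', 'X', 52),
  ('D', 'Y', 27),
  ('E', 'Y', 31),
  ('F', 'Y', 20);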

Usually we’d start out with a query like this:

SELECT
  category,
  MAX(value) AS highest_value
FROM items
GROUP BY category;
category │ highest_value
--------------------------
 X       │ 52
 Y       │ 31

And then use this query as a sub-select:

SELECT * FROM items
WHERE (category, value) IN (
  SELECT
    category,
    MAX(value) AS highest_value
  FROM items
  GROUP BY category
);
 name │ category │ value
-------------------------
 A    │ X        │ 52
 C    │ X        │ 52
 E    │ Y        │ 31

Unfortunately, there are multiple items in category X with the same highest value 52. But we really only want one row for each category. In this case we might use the ROW_NUMBER() function:

SELECT
  name, category, value
FROM (
  SELECT
    items.*,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY value DESC, name
    ) AS rownum
  FROM items
) AS ranked WHERE rownum = 1;
 name │ category │ value
-------------------------
 A    │ X        │ 52
 E    │ Y        │ 31

This is finally our desired result.

The easy way

But I promised it can be easier with the DISTINCT ON clause. How does it work?

SELECT DISTINCT ON (category) *
FROM items
ORDER BY
  category, value DESC, name;

After DISTINCT ON we specify, in parentheses, one or more columns by which to group. The ORDER BY clause determines which row will be the first in each group. We get the same result:

 name │ category │ value
-------------------------
 A    │ X        │ 52
 E    │ Y        │ 31

Range Types in PostgreSQL

How do you store ranges in an SQL database? By ranges I mean things like price ranges, temperature ranges, date ranges for scheduling, etc. You’d probably represent them with two columns in a table, like min_price and max_price, min_temperature and max_temperature, start_date and end_date. If you want to represent an unbounded range, you’d probably make one or both columns nullable and interpret NULL as +/- infinity.

If you want to test if a value is in a range you can use the BETWEEN operator:

SELECT * FROM products WHERE
  target_price BETWEEN min_price AND max_price;

This doesn’t work as nicely anymore if you work with unbounded ranges as described above: you’d have to add additional NULL checks. And what if you want to test whether one of the ranges in the table overlaps with a given range?

SELECT * FROM products WHERE
  max_given >= min_price AND
  min_given <= max_price;

Did I make a mistake here? I’m not sure. What if they should overlap but not cover each other? And again, this becomes even more complicated with unbounded ranges.

Enter range types

PostgreSQL has a better solution for these problems — range types. It comes with these additional built-in data types:

  • int4range: Range of integer
  • int8range: Range of bigint
  • numrange: Range of numeric
  • tsrange: Range of timestamp without time zone
  • tstzrange: Range of timestamp with time zone
  • daterange: Range of date

You can use them as a column type in a table:

CREATE TABLE products (…, price_range numrange);

Construction

You can construct range values for these types like this:

'[20,35]'::int4range
'(5,12]'::int4range
'(6.2,12.5)'::numrange
'[2022-05-01, 2022-05-31]'::daterange
'[2022-05-01 09:30, 2022-05-01 12:00)'::tsrange

As you can see, they use mathematical interval notation. A square bracket means inclusive bound, and a round parenthesis means exclusive bound. They can also be unbounded (infinite) or empty:

'[5,)'::int4range
'(,20]'::int4range
'empty'::int4range

You can get the bounds of a range individually with the lower() and upper() functions:

SELECT * FROM products ORDER BY lower(price_range);

Operators

The range types become really powerful through the range operators. There are a lot, so I will only show some basic examples:

  • The && operator tests if two ranges overlap: range_a && range_b
  • The @> and <@ operators test if the first range contains the second or vice versa: range_a <@ range_b. Used with an element on one side, they test if the element is in a range: element <@ range or range @> element.
  • The -|- operator tests if two ranges are adjacent: range_a -|- range_b

In addition to these boolean tests, you can also calculate new ranges based on existing ranges:

The + operator computes the union of two overlapping or adjacent ranges: range_a + range_b. The * operator computes the intersection of ranges, and the - operator the difference.
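
To make this concrete, here are a few examples; the products table with its price_range column is the one sketched above, and the comments show the expected results:

-- rows whose price range overlaps a given range
SELECT * FROM products
  WHERE price_range && '[20,30]'::numrange;

-- test if an element lies in a range
SELECT 25 <@ '[20,30)'::int4range;               -- true

-- union of two adjacent ranges
SELECT '[1,5)'::int4range + '[5,9)'::int4range;  -- [1,9)

-- intersection of two overlapping ranges
SELECT '[1,6)'::int4range * '[4,9)'::int4range;  -- [4,6)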

Multiranges

There is one more thing I want to mention: for each of the range types there is also a corresponding multirange type: int4multirange, int8multirange, nummultirange, tsmultirange, tstzmultirange, datemultirange. As their names suggest, they store multiple ranges in one value:

'{}'::int4multirange
'{[2,9)}'::int4multirange
'{[2,9), [12,20)}'::int4multirange

The range operators mentioned above work with multiranges as well.
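
For example (again with the expected results in the comments):

-- overlap test between a multirange and a range
SELECT '{[2,9), [12,20)}'::int4multirange && '[8,13)'::int4range;
-- true

-- union of two multiranges
SELECT '{[2,9)}'::int4multirange + '{[12,20)}'::int4multirange;
-- {[2,9),[12,20)}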

Full-text Search with PostgreSQL

If you want to add simple text search functionality to an application backed by an SQL database one of the first things that may come to your mind is the SQL LIKE operator. The LIKE operator and its case-insensitive sibling ILIKE find substrings in text data via wildcards such as %, which matches any sequence of zero or more characters:

SELECT * FROM book WHERE title ILIKE '%dog%';

However, this approach satisfies only very basic requirements for text search, because it only matches exact substrings. That’s why application developers often use an external search engine like Elasticsearch based on the Apache Lucene library.

With a PostgreSQL database there is another option: it comes with a built-in full-text search. A full-text search analyzes text according to the language of the text, parses it into tokens and converts them into so-called lexemes. These are strings, just like tokens, but they have been normalized so that different forms of the same word, for example “pony” and “ponies”, are made alike. Additionally, stop words are eliminated, which are words that are so common that they are useless for searching, like “a” or “the”. For this purpose the search engine uses a dictionary of the target language.

In PostgreSQL, there are two main functions for performing full-text search: to_tsvector and to_tsquery. The ts part in the function names stands for “text search”. The to_tsvector function breaks up the input string and creates a vector of lexemes out of it, which can then be matched against search queries built with the to_tsquery function. The two functions can be combined with the @@ (match) operator, which applies a search query to a search vector:

SELECT title
  FROM book
  WHERE to_tsvector(title) @@ to_tsquery('(cat | dog) & pony');

The query syntax of to_tsquery supports boolean operators like | (or), & (and), ! (not) and grouping using parentheses, but also other operators like <-> (“followed by”) and * (prefix matching).

You can specify the target language as a parameter of to_tsvector:

# SELECT to_tsvector('english', 'Thousands of ponies were grazing on the prairie.');

'graze':5 'poni':3 'prairi':8 'thousand':1

Here’s another example in German:

# SELECT to_tsvector('german', 'Wer einen Fehler begeht, und ihn nicht korrigiert, begeht einen zweiten (Konfuzius)');

'begeht':4,9 'fehl':3 'konfuzius':12 'korrigiert':8 'wer':1 'zweit':11

PostgreSQL comes with dictionaries and configurations for a few dozen languages out of the box, and more can be installed.
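
On larger tables you will want an index so that the match operator does not have to compute tsvectors for every row on the fly. A minimal sketch, assuming the book table from the examples above:

CREATE INDEX book_title_search_idx ON book
  USING GIN (to_tsvector('english', title));

-- queries must use the same configuration ('english')
-- so that the planner can use this expression index
SELECT title FROM book
  WHERE to_tsvector('english', title) @@ to_tsquery('english', 'pony');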

The examples in this article are just a small glimpse of what is possible with regards to full-text search in PostgreSQL. If you want to learn more you should consult the documentation. The key takeaway is that there is another option between simple LIKE clauses and an external search engine.

Commenting SQL database objects

Did you know that you can annotate database objects like tables, views and columns with comments in many SQL database systems? By that I don’t mean comments in SQL scripts, indicated by double dashes (--), but comments attached to the objects themselves, stored in the database. These can be helpful to the database admin by providing a description of what is stored in these objects.

For PostgreSQL and Oracle databases the syntax is as follows:

COMMENT ON TABLE [schema_name.]table_name IS '...';
COMMENT ON COLUMN [schema_name.]table_name.column_name IS '...';

For example:

COMMENT ON COLUMN books.author IS 'The main author''s last name';
COMMENT ON TABLE books IS 'Contains only the best books';

These comments can be viewed in database tools like SQL Developer, which displays them for both tables and columns.

You can also view the comments in psql:

db=# \d+ books
 Column |  Type   |          Description
--------+---------+------------------------------
id      | integer |
author  | text    | The main author's last name
title   | text    |

And for a table:

db=# \dt+ books
                    List of relations
 Schema | Name  | Type  |     |        Description
--------+-------+-------+ ... +------------------------------
public  | books | table |     | Contains only the best books
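
In PostgreSQL you can also retrieve the comments programmatically, e.g. for generated documentation, with the obj_description and col_description functions:

-- the comment on the table itself
SELECT obj_description('books'::regclass, 'pg_class');

-- the comment on a column, addressed by its position (here: 2 = author)
SELECT col_description('books'::regclass, 2);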

In Oracle you can query the comments from the data dictionary views ALL_TAB_COMMENTS and ALL_COL_COMMENTS:

> SELECT * FROM all_col_comments WHERE table_name='BOOKS';
OWNER    TABLE_NAME  COLUMN_NAME  COMMENTS
--------------------------------------------------------------
LIBRARY	 BOOKS	     ID           (null)
LIBRARY	 BOOKS	     AUTHOR       The main author's last name
LIBRARY	 BOOKS	     TITLE        (null)

> SELECT * FROM all_tab_comments WHERE table_name='BOOKS';
OWNER    TABLE_NAME  TABLE_TYPE  COMMENTS
--------------------------------------------------------------
LIBRARY	 BOOKS	     TABLE       Contains only the best books

In Oracle, comments are limited to tables, views, materialized views, columns, operators and indextypes, but in PostgreSQL you can attach comments to nearly everything. Another good use case for this is documentation comments on database functions:

COMMENT ON FUNCTION my_function IS $$
This function does something important.

Parameters:
...
Example usage:
...
$$;

Note: the $$ delimits multi-line strings (so-called dollar-quoted string constants).

Converting character sets in an Oracle database

I recently had to correct text data stored in an Oracle database with a wrong character set. A bug in one of our applications caused text data to be stored in the ISO 8859-1 West European character encoding. The database, however, is configured to work with UTF-8 encoded strings. The bug was fixed, but the existing data had to be corrected afterwards. If you ever encounter the same problem you can use Oracle’s CONVERT() function to fix the data. The syntax is as follows:

CONVERT(text, to_charset, from_charset)

First you have to find out Oracle’s names for the desired source and target character sets. You can look them up in the table of character set names in the Oracle documentation. In my case the character set names are 'UTF8' and 'WE8ISO8859P1' (for Western European 8-bit ISO 8859-1). You can check the function by selecting from DUAL:

SELECT CONVERT('ä ö ü Ä Ö Ü ß', 'UTF8', 'WE8ISO8859P1') FROM DUAL;

If the result shows the same scrambled characters you see in the broken data (for the lowercase umlauts above, mojibake along the lines of Ã¤ Ã¶ Ã¼), you know that you have chosen the correct character set names. The conversion in the other direction should unscramble the characters again.
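
You can verify both directions at once with a round-trip check: scrambling the sample string and unscrambling it again should return the original, unchanged:

SELECT CONVERT(
         CONVERT('ä ö ü Ä Ö Ü ß', 'UTF8', 'WE8ISO8859P1'),
         'WE8ISO8859P1', 'UTF8')
  FROM DUAL;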

Finally, you can convert your data:

UPDATE example_table
  SET text_column=CONVERT(text_column, 'WE8ISO8859P1', 'UTF8');

Understanding, identifying and fixing the N+1 query problem

One of the most common performance pitfalls for applications accessing data from databases is the so-called “N+1 query problem”, sometimes also called the “N+1 selects problem”. It is the first thing you should look for when an application has performance issues related to database access. It is especially easy to run into with object-relational mappers (ORMs).

The problem

The problem typically arises when your entity-relationship model has a 1:n or n:m association. It occurs when application code executes one query to get the objects of one entity and then executes another query for each of these objects to get the objects of an associated entity. An example would be a blog application that executes one query to fetch all authors whose names start with the letter ‘B’, and then another query for each of these authors to fetch their articles. In pseudocode:

# The 1 query
authors = sql("SELECT * FROM author WHERE name LIKE 'B%'");

# The N queries
articles = []
FOR EACH author IN authors:
    articles += sql("SELECT * FROM article WHERE author_id=:aid", aid: author.id)

The first query is the “1” in “N+1”, the following queries in the loop are the “N”.

Of course, to anybody who knows SQL this is a very naive way to get the desired result. However, OR mappers often seduce their users into writing inefficient database access code by hiding the SQL queries and letting users reach for the familiar tools of their favorite programming language, like loops or collection operations such as map. A lot of popular web application frameworks come with OR mappers: Rails with Active Record, Grails with GORM (Hibernate based), Laravel with Eloquent.

How to detect

The easiest way to detect the problem in an application is to log the database queries. Virtually all ORMs have a configuration option to enable query logging.

For Grails/GORM the logging can be enabled per data source in the application.yml config file:

dataSource:
    logSql: true
    formatSql: true

For Rails/ActiveRecord query logging is automatically enabled in the development environment. Since Rails 5.2 the Verbose Query Logs format is enabled by default, which you had to enable explicitly in earlier versions.

For Laravel/Eloquent you can enable and access the query log with these two methods/functions:

DB::connection()->enableQueryLog();
DB::getQueryLog();

Once query logging is enabled, you will quickly see if the same query is executed over and over again, which usually indicates the presence of the N+1 problem.

How to fix

The goal is to replace the N+1 queries with a single query. In SQL this means joining. The example above would be written as a single query:

SELECT article.*
FROM article
JOIN author
  ON article.author_id=author.id
WHERE author.name LIKE 'B%'

The query interface of ORMs usually allows you to write joins as well. Here is the example in ActiveRecord:

Article.joins(:author).where("authors.name LIKE ?", "B%")

Another option when using ORMs is to enable eager loading for associations. In GORM this can be enabled via the fetchMode static property:

class Author {
    static hasMany = [articles: Article]
    static fetchMode = [articles: 'eager']
}

REST APIs

The problem isn’t limited to SQL databases and SQL queries. For REST APIs it’s the “N+1 requests problem”, describing the situation where a client application has to call the server N+1 times to fetch one collection resource plus N child resources. Here the REST API has to be extended or modified to serve the client’s use cases with a single request. Another option is to offer a GraphQL API instead of a REST API. GraphQL is a query language for HTTP APIs that allows complex queries, so the client application can specify exactly which resources it needs in a single request.

Migrating a Grails application from Oracle to PostgreSQL

In my previous post I explained how to migrate an Oracle schema with data to a PostgreSQL database management system (DBMS). Besides the general tasks and issues, there are additional topics when migrating a complete application that uses the database to the other DBMS.

In our specific case we have a Grails application which we have maintained since Grails 1.0 times, for more than 12 years. During that time we did a ton of feature development with lots of refactoring and many database migrations. So the source database will most likely not be perfectly consistent and clean.

General approach

Since Grails/GORM and the DatabaseMigration-Plugin (DBM-Plugin) do a great job of preparing an empty database with a matching schema for the application to run, we let the framework tools generate the schema and only migrate the data using Ora2Pg.

Sounds simple, but how is it done in detail, and what else is there to look out for?

Generating the initial database schema

The DBM-Plugin provides a script to create a database changelog with a schema matching the domain model of your Grails application. It is integrated with Gradle, so you can run grails dbm-generate-gorm-changelog initialdb.groovy to create the migration script providing a fitting schema. You then include this script in grails-app/migrations/changelog.groovy or replace all the migrations previously included there with this initial database changelog.

To prepare an empty database to run with your application you call the Gradle task dbmUpdate.

Checking all plain SQL code

If you are only using GORM’s dynamic finders, the save()/update()/delete() methods, HQL and the criteria API, you are probably fine to run your application and perform the data migration step.

Our application has some specific parts where we use plain SQL. Because of syntactical differences you will want to check all the plain SQL to see if it works with PostgreSQL. The most obvious things to look for are queries dealing with sequences or other queries where you need the DUAL table in Oracle.

Migrating the data

This is probably the part where the most things can go wrong. We had quite some work with data inconsistencies and left-overs from manual corrections that happened over the course of running and upgrading the application for so many years. For younger and simpler applications this may not present any challenges, but for us it was quite time-consuming.

Now you can use Ora2Pg to import the data. After the whole data import has worked as intended, you should check the value of the hibernate_sequence. This sequence is used to generate the ids of all Grails domain objects.
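
Concretely, checking means: the sequence must be ahead of the highest id already imported, otherwise newly saved domain objects will collide with existing rows. A minimal sketch in SQL (the table name is only a placeholder; take the maximum over all your domain object tables):

SELECT setval('hibernate_sequence',
  (SELECT MAX(id) FROM book));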

Do not let the sequences of the autoincrement columns of your domain objects’ tables confuse you! They are not used by Grails/GORM. To avoid this confusion you can remove the default value of the id columns and the accompanying sequences, as sketched below.
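
For a single table this could look like the following sketch; the table and sequence names are hypothetical and depend on what Ora2Pg generated:

ALTER TABLE book ALTER COLUMN id DROP DEFAULT;
DROP SEQUENCE IF EXISTS book_id_seq;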

Checking the result

You should always run acceptance or manual tests to make sufficiently sure that the migration worked as intended. There is always the possibility of a configuration or software error or some oversights in checking the application code.

If possible, test the result on a dedicated system with a snapshot of the real-world data before making the switch on the production system. Good luck!