PostgreSQL’s new MERGE command

PostgreSQL version 15 introduces a new SQL command: MERGE. It allows merging one table into another. The MERGE command has existed for some time in other databases such as Oracle and SQL Server.

The principle of this command is that you have a target table in which you want to insert, update, or delete rows based on a source table and certain match conditions. The source table doesn’t have to be a real table; it can just as easily be a SELECT query.

How to use it, step-by-step

The command begins with MERGE INTO, followed by the name of the target table. We call it dest here:

MERGE
  INTO dest ...

Then you specify the source table with USING, here we call it src:

MERGE
  INTO dest
  USING src
  ...

If you want to use a SELECT query as the source instead of a real table, you can do it like this:

MERGE
  INTO dest
  USING (SELECT ... FROM ...) AS src
  ...

Now you need a condition that is used to match entries from one table to entries from the other table. This is specified after ON. In this example we simply use the IDs of the two tables:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  ...

This is followed by a case distinction that describes what should happen if the condition either applies or not. The possible actions can be: UPDATE, DELETE, INSERT, or DO NOTHING.

The two cases are specified with WHEN MATCHED THEN and WHEN NOT MATCHED THEN:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  WHEN MATCHED THEN
    UPDATE SET ...
  WHEN NOT MATCHED THEN
    INSERT (...) VALUES (...);

If a match exists, then reasonable actions are UPDATE, DELETE, or DO NOTHING. If no match exists, then reasonable actions are INSERT or DO NOTHING.

In the WHEN cases, additional conditions can be specified with AND:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  WHEN MATCHED AND dest.value > src.value THEN
    DELETE
  WHEN MATCHED THEN
    UPDATE SET ...
  WHEN NOT MATCHED THEN
    DO NOTHING;

A realistic example

Here’s an example demonstrating a use case that might occur in the real world:

MERGE
  INTO account a
  USING transaction t
  ON a.id=t.account_id
WHEN MATCHED THEN
  UPDATE SET balance = a.balance + t.amount
WHEN NOT MATCHED THEN
  INSERT (id, balance) VALUES (t.account_id, t.amount);

This statement processes a table of monetary transactions and applies them to their matching customer accounts by adding the amount of each transaction to the balance of the matching account. If no matching account exists, it is created and its initial balance is the amount of the first transaction.
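
For reference, a minimal schema that these statements could run against might look like this; the table and column names come from the example above, while the exact data types are an assumption:

CREATE TABLE account (
  id      INTEGER PRIMARY KEY,
  balance NUMERIC NOT NULL DEFAULT 0    -- assumed type; the example only names the columns
);

CREATE TABLE transaction (
  id         SERIAL PRIMARY KEY,        -- assumed surrogate key, not used by the MERGE
  account_id INTEGER NOT NULL,
  amount     NUMERIC NOT NULL
);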

Re-ordering table columns in an Oracle database

In an Oracle database, once a table is created, there is no obvious way to change the order of its columns. Sometimes you add a new column to an existing table and want it to be displayed in a different position by default for query results via SELECT *. Imagine you add an ID column to a table after the fact. Wouldn’t it be nice if this appeared in the first position?

Of course you can drop the whole table and create it again with the new column order. But this is cumbersome and potentially dangerous if the table is already filled with data. However, there is a trick that allows you to rearrange the columns without having to recreate the table.

The key to this is an Oracle feature that allows invisible columns. The feature itself is interesting in its own right, but it has a useful side effect that we’ll exploit. The documentation says:

When you make an invisible column visible, the column is included in the table’s column order as the last column. When you make a visible column invisible, the invisible column is not included in the column order, and the order of the visible columns in the table might be re-arranged.

So the plan is to cleverly choose which columns to make invisible, and then to make them visible again in the desired order. This is how it works:

First we have a table with the following columns.

CREATE TABLE t (a NUMBER, b NUMBER, c NUMBER, e NUMBER, f NUMBER);

Later we realize that we need a column d that should go between c and e. So we add it to the table:

ALTER TABLE t ADD (d NUMBER);

This is of course added at the end:

DESC t;
Name Null? Type   
---- ----- ------ 
A          NUMBER 
B          NUMBER 
C          NUMBER 
E          NUMBER 
F          NUMBER 
D          NUMBER 

To get it in the right position, we first hide the columns e and f, and then make them visible again.

ALTER TABLE t MODIFY (e INVISIBLE, f INVISIBLE);
ALTER TABLE t MODIFY (e VISIBLE, f VISIBLE);

And voilà, we have our desired order:

DESC t;
Name Null? Type   
---- ----- ------ 
A          NUMBER 
B          NUMBER 
C          NUMBER 
D          NUMBER 
E          NUMBER 
F          NUMBER 

Note that this doesn’t change the internal, physical layout of the table on the disk. It’s just a cosmetic change.
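
If you want to verify this, you can query the data dictionary. USER_TAB_COLS shows both the logical column order (COLUMN_ID) and the physical order (SEGMENT_COLUMN_ID), so a query along these lines should confirm that only the former has changed:

SELECT column_name, column_id, segment_column_id, hidden_column
  FROM user_tab_cols
 WHERE table_name = 'T'
 ORDER BY column_id;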

PostgreSQL’s “DISTINCT ON” clause

Anyone who uses SQL databases knows the DISTINCT modifier for SELECT queries to get result sets without duplicates. However, PostgreSQL has another variant of it that not everyone knows, but which is very useful: the SELECT DISTINCT ON clause. It can be used to query only the first row of each set of rows according to a grouping.

To understand its usefulness, let’s look at an example and solve it in the classical way first.

The complicated way

Given the following table of items, we want to query, for each category, the item with the highest value:

 name │ category │ value
-------------------------
 A    │ X        │ 52
 B    │ X        │ 35
 C    │ X        │ 52
 D    │ Y        │ 27
 E    │ Y        │ 31
 F    │ Y        │ 20
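
If you want to follow along, a small setup script for this example table might look like this (the column types are an assumption):

CREATE TABLE items (name TEXT, category TEXT, value INTEGER);

INSERT INTO items (name, category, value) VALUES
  ('A', 'X', 52), ('B', 'X', 35), ('C', 'X', 52),
  ('D', 'Y', 27), ('E', 'Y', 31), ('F', 'Y', 20);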

Usually we’d start out with a query like this:

SELECT
  category,
  MAX(value) AS highest_value
FROM items
GROUP BY category;
category │ highest_value
--------------------------
 X       │ 52
 Y       │ 31

And then use this query as a sub-select:

SELECT * FROM items
WHERE (category, value) IN (
  SELECT
    category,
    MAX(value) AS highest_value
  FROM items
  GROUP BY category
);
 name │ category │ value
-------------------------
 A    │ X        │ 52
 C    │ X        │ 52
 E    │ Y        │ 31

Unfortunately, there are multiple items in category X with the same highest value 52. But we really only want one row for each category. In this case we might use the ROW_NUMBER() function:

SELECT
  name, category, value
FROM (
  SELECT
    items.*,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY value DESC, name
    ) AS rownum
  FROM items
) AS ranked WHERE rownum = 1;
 name │ category │ value
-------------------------
 A    │ X        │ 52
 E    │ Y        │ 31

This is finally our desired result.

The easy way

But I promised it can be easier with the DISTINCT ON clause. How does it work?

SELECT DISTINCT ON (category) *
FROM items
ORDER BY
  category, value DESC, name;

After DISTINCT ON we specify, in parentheses, one or more columns to group by. The ORDER BY clause determines which row will be the first in each group. We get the same result:

 name │ category │ value
-------------------------
 A    │ X        │ 52
 E    │ Y        │ 31

Range Types in PostgreSQL

How do you store ranges in an SQL database? By ranges I mean things like price ranges, temperature ranges, date ranges for scheduling, etc. You’d probably represent them with two columns in a table, like min_price and max_price, min_temperature and max_temperature, start_date and end_date. If you want to represent an unbounded range, you’d probably make one or both columns nullable and then take NULL as +/- infinity.

If you want to test if a value is in a range you can use the BETWEEN operator:

SELECT * FROM products WHERE
  target_price BETWEEN min_price AND max_price;

This doesn’t work as nicely anymore if you work with unbounded ranges as described above. You’d have to add additional checks for NULL. What if you want to test if one of the ranges in the table overlaps with a given range?

SELECT * FROM products WHERE
  max_given >= min_price AND
  min_given <= max_price;

Did I make a mistake here? I’m not sure. What if they should overlap but not cover each other? And again, this becomes even more complicated with unbounded ranges.
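
Just to illustrate the point, an overlap test that also treats NULL bounds as unbounded might look something like this, assuming the given bounds themselves are never NULL:

SELECT * FROM products WHERE
  (max_price IS NULL OR max_price >= min_given) AND
  (min_price IS NULL OR min_price <= max_given);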

Enter range types

PostgreSQL has a better solution for these problems — range types. It comes with these additional built-in data types:

  • int4range: Range of integer
  • int8range: Range of bigint
  • numrange: Range of numeric
  • tsrange: Range of timestamp without time zone
  • tstzrange: Range of timestamp with time zone
  • daterange: Range of date

You can use them as a column type in a table:

CREATE TABLE products (…, price_range numrange);

Construction

You can construct range values for these types like this:

'[20,35]'::int4range
'(5,12]'::int4range
'(6.2,12.5)'::numrange
'[2022-05-01, 2022-05-31]'::daterange
'[2022-05-01 09:30, 2022-05-01 12:00)'::tsrange

As you can see, they use mathematical interval notation. A square bracket means inclusive bound, and a round parenthesis means exclusive bound. They can also be unbounded (infinite) or empty:

'[5,)'::int4range
'(,20]'::int4range
'empty'::int4range

You can get the bounds of a range individually with the lower() and upper() functions:

SELECT * FROM products ORDER BY lower(price_range);

Operators

The range types become really powerful through the range operators. There are a lot, so I will only show some basic examples:

  • The && operator tests if two ranges overlap: range_a && range_b
  • The @> and <@ operators test if the first range contains the second or vice versa: range_a <@ range_b. If used with an element on one side they test if the element is in a range: element <@ range or range @> element.
  • The -|- operator tests if two ranges are adjacent: range_a -|- range_b

In addition to these boolean tests, you can also calculate new ranges from existing ranges:

The + operator computes the union of two overlapping or adjacent ranges: range_a + range_b. The * computes the intersection of ranges, and the - operator the difference.
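
As a quick illustration, here are a few queries against the products table from above that use these operators:

-- all products whose price range overlaps the range 10 to 20 (inclusive)
SELECT * FROM products WHERE price_range && '[10,20]'::numrange;

-- all products whose price range contains the value 15
SELECT * FROM products WHERE price_range @> 15::numeric;

-- the intersection of two ranges; results in [10,15]
SELECT '[5,15]'::numrange * '[10,20]'::numrange;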

Multiranges

There is one more thing I want to mention: For each one of the range types there is also a multirange type: int4multirange, int8multirange, nummultirange, tsmultirange, tstzmultirange, datemultirange. As their names suggest, they store multiple ranges in one value:

'{}'::int4multirange
'{[2,9)}'::int4multirange
'{[2,9), [12,20)}'::int4multirange

The mentioned range operators work with them as well.
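
For example, the containment operator from above also works on a multirange (PostgreSQL 14 or later):

SELECT '{[2,9), [12,20)}'::int4multirange @> 15;   -- true
SELECT '{[2,9), [12,20)}'::int4multirange @> 10;   -- false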

Full-text Search with PostgreSQL

If you want to add simple text search functionality to an application backed by an SQL database, one of the first things that may come to your mind is the SQL LIKE operator. The LIKE operator and its case-insensitive sibling ILIKE find substrings in text data via wildcards such as %, which matches any sequence of zero or more characters:

SELECT * FROM book WHERE title ILIKE '%dog%';

However, this approach satisfies only very basic requirements for text search, because it only matches exact substrings. That’s why application developers often use an external search engine like Elasticsearch, which is based on the Apache Lucene library.

With a PostgreSQL database there is another option: it comes with a built-in full-text search. A full-text search analyzes text according to the language of the text, parses it into tokens and converts them into so-called lexemes. These are strings, just like tokens, but they have been normalized so that different forms of the same word, for example “pony” and “ponies”, are made alike. Additionally, stop words are eliminated; these are words that are so common that they are useless for searching, like “a” or “the”. For this purpose the search engine uses a dictionary of the target language.

In PostgreSQL, there are two main functions for performing full-text search: to_tsvector and to_tsquery. The ts part in the function names stands for “text search”. The to_tsvector function breaks up the input string and creates a vector of lexemes out of it, which are then used to perform full-text search using the to_tsquery function. The two functions can be combined with the @@ (match) operator, which applies a search query to a search vector:

SELECT title
  FROM book
  WHERE to_tsvector(title) @@ to_tsquery('(cat | dog) & pony')

The query syntax of to_tsquery supports boolean operators like | (or), & (and), ! (not) and grouping using parentheses, but also other operators like <-> (“followed by”) and :* (prefix matching, appended to a search term).

You can specify the target language as a parameter of to_tsvector:

# SELECT to_tsvector('english', 'Thousands of ponies were grazing on the prairie.');

'graze':5 'poni':3 'prairi':8 'thousand':1

Here’s another example in German:

# SELECT to_tsvector('german', 'Wer einen Fehler begeht, und ihn nicht korrigiert, begeht einen zweiten (Konfuzius)');

'begeht':4,9 'fehl':3 'konfuzius':12 'korrigiert':8 'wer':1 'zweit':11

PostgreSQL ships with text search configurations for a wide range of languages out of the box; you can list the available ones with \dF in psql.
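
In practice you will usually want an index to keep such searches fast. One common approach is a GIN index over the text search vector; a minimal sketch for the book table from above (assuming English titles) could look like this. Note that a query only uses such an expression index if it calls to_tsvector with the same explicit language argument:

CREATE INDEX book_title_search_idx
  ON book USING GIN (to_tsvector('english', title));

SELECT title
  FROM book
  WHERE to_tsvector('english', title) @@ to_tsquery('english', 'pony');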

The examples in this article are just a small glimpse of what is possible with full-text search in PostgreSQL. If you want to learn more, you should consult the documentation. The key takeaway is that there is another option between simple LIKE clauses and an external search engine.

Commenting SQL database objects

Did you know that you can annotate database objects like tables, views and columns with comments in many SQL database systems? By that I don’t mean comments in SQL scripts, indicated by double dashes (--), but comments attached to the objects themselves, stored in the database. These may be helpful to the database admin by providing context via a description text on what is stored in these objects.

For PostgreSQL and Oracle databases the syntax is as follows:

COMMENT ON TABLE [schema_name.]table_name IS '...';
COMMENT ON COLUMN [schema_name.]table_name.column_name IS '...';

For example:

COMMENT ON COLUMN books.author IS 'The main author''s last name';
COMMENT ON TABLE books IS 'Contains only the best books';

These comments can be viewed in database tools like SQL Developer, which displays them for tables as well as for individual columns.

You can also view the comments in psql:

db=# \d+ books
 Column |  Type   |          Description
--------+---------+------------------------------
id      | integer |
author  | text    | The main author's last name
title   | text    |

And for a table:

db=# \dt+ books
                    List of relations
 Schema | Name  | Type  |     |        Description
--------+-------+-------+ ... +------------------------------
public  | books | table |     | Contains only the best books
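
If you prefer plain SQL over psql meta-commands, PostgreSQL also exposes the comments through functions such as obj_description and col_description (a sketch, assuming the books table from above with author as its second column):

SELECT obj_description('books'::regclass, 'pg_class');  -- comment on the table
SELECT col_description('books'::regclass, 2);           -- comment on the second column (author)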

In Oracle you can query the comments from the data dictionary views ALL_TAB_COMMENTS and ALL_COL_COMMENTS:

> SELECT * FROM all_col_comments WHERE table_name='BOOKS';
OWNER    TABLE_NAME  COLUMN_NAME  COMMENTS
--------------------------------------------------------------
LIBRARY  BOOKS       ID           (null)
LIBRARY  BOOKS       AUTHOR       The main author's last name
LIBRARY  BOOKS       TITLE        (null)

> SELECT * FROM all_tab_comments WHERE table_name='BOOKS';
OWNER    TABLE_NAME  TABLE_TYPE  COMMENTS
--------------------------------------------------------------
LIBRARY  BOOKS       TABLE       Contains only the best books

In Oracle comments are limited to tables, views, materialized views, columns, operators and indextypes, but in PostgreSQL you can attach comments to nearly everything. Another good use case for this is documentation comments on database functions:

COMMENT ON FUNCTION my_function IS $$
This function does something important.

Parameters:
...
Example usage:
...
$$;

Note: the $$ delimits multi-line strings (so-called dollar-quoted string constants).

Converting character sets in an Oracle database

I recently had to correct text data stored in an Oracle database with a wrong character set. A bug in one of our applications caused text data to be stored in the ISO 8859-1 West European character encoding. The database, however, is configured to work with UTF-8 encoded strings. The bug was fixed, but the existing data had to be corrected afterwards. If you ever encounter the same problem you can use Oracle’s CONVERT() function to fix the data. The syntax is as follows:

CONVERT(text, to_charset, from_charset)

First you have to find out Oracle’s names for the desired source and target character sets. You can look them up in the table of character set names in the Oracle documentation. In my case the character set names are 'UTF8' and 'WE8ISO8859P1' (for Western European 8-bit ISO-8859-1). You can check the function by selecting from DUAL:

SELECT CONVERT('ä ö ü Ä Ö Ü ß', 'UTF8', 'WE8ISO8859P1') FROM DUAL;

If the result shows the same scrambled characters you see in the data, you know that you have chosen the correct character set names.

The other direction should unscramble the characters again. You can check this by feeding the scrambled output of the previous query back into CONVERT:
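
-- Paste the scrambled output of the previous query in place of the placeholder;
-- the result should show the original umlauts again.
SELECT CONVERT('<scrambled text from above>', 'WE8ISO8859P1', 'UTF8') FROM DUAL;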

Finally, you can convert your data:

UPDATE example_table
  SET text_column=CONVERT(text_column, 'WE8ISO8859P1', 'UTF8');

Understanding, identifying and fixing the N+1 query problem

One of the most common performance pitfalls for applications accessing data from databases is the so-called “N+1 query problem”, sometimes also called the “N+1 selects problem”. It is the first thing you should look for when an application has performance issues related to database access. It is especially easy to run into with object-relational mappers (ORMs).

The problem

The problem typically arises when your entity-relationship model has a 1:n or n:m association. It exists when application code executes one query to get objects of one entity and then executes another query for each of these objects to get the objects of an associated entity. An example would be a blog application that executes one query to fetch all authors whose names start with the letter ‘B’, and then another query for each of these authors to fetch their articles. In pseudocode:

# The 1 query
authors = sql("SELECT * FROM author WHERE name LIKE 'B%'");

# The N queries
articles = []
FOR EACH author IN authors:
    articles += sql("SELECT * FROM article WHERE author_id=:aid", aid: author.id)

The first query is the “1” in “N+1”, the following queries in the loop are the “N”.

Of course, to anybody who knows SQL this is a very naive way to get the desired result. However, OR mappers often seduce their users into writing inefficient database access code by hiding the SQL queries and allowing their users to reach for the normal tools of their favorite programming language like loops or collection operations such as map. A lot of popular web application frameworks come along with OR mappers: Rails with Active Record, Grails with GORM (Hibernate based), Laravel with Eloquent.

How to detect

The easiest way to detect the problem in an application is to log the database queries. Virtually all ORMs have a configuration option to enable query logging.

For Grails/GORM the logging can be enabled per data source in the application.yml config file:

dataSource:
    logSql: true
    formatSql: true

For Rails/ActiveRecord query logging is automatically enabled in the development environment. Since Rails 5.2 the Verbose Query Logs format is enabled by default, which you had to enable explicitly in earlier versions.

For Laravel/Eloquent you can enable and access the query log with these two methods/functions:

DB::connection()->enableQueryLog();
DB::getQueryLog();

Once query logging is enabled you will quickly see if the same query is executed over and over again, usually indicating the presence of the N+1 problem.

How to fix

The goal is to replace the N+1 queries with a single query. In SQL this means joining. The example above would be written as a single query:

SELECT article.*
FROM article
JOIN author
  ON article.author_id=author.id
WHERE author.name LIKE 'B%'

The query interface of ORMs usually allows you to write joins as well. Here is the example in ActiveRecord (assuming an Article belongs_to :author association):

Article.joins(:author).where("authors.name LIKE ?", "B%")

Another option when using ORMs is to enable eager loading for associations. In GORM this can be enabled via the fetchMode static property:

class Author {
    static hasMany = [articles: Article]
    static fetchMode = [articles: 'eager']
}

REST APIs

The problem isn’t limited to SQL databases and SQL queries. For REST APIs it’s the “N+1 requests problem”, describing the situation where a client application has to call the server N+1 times to fetch one collection resource + N child resources. Here the REST API has to be extended or modified to serve the client’s use cases with a single request. Another option is to offer a GraphQL API instead of a REST API. GraphQL is a query language for HTTP APIs that allows complex queries, so the client application can specify exactly what resources it needs in a single request.

Linking separate PostgreSQL servers with a Foreign Data Wrapper

If you want to query a table in one database server from another you need a way to connect these two servers with each other. For PostgreSQL databases the feature that makes this possible is called Foreign Data Wrapper.

To use this feature you have to load its extension into the current database:

CREATE EXTENSION postgres_fdw;

The postgres_fdw extension ships with a default PostgreSQL installation. It allows you to connect a PostgreSQL database with another PostgreSQL database. You can connect a PostgreSQL database with databases from other vendors, too, but then you need external extensions like oracle_fdw for Oracle databases or mysql_fdw for MySQL databases. In this article we will only use postgres_fdw.

You can check if the extension was loaded successfully. The following query should return a row for the extension:

SELECT * FROM pg_extension WHERE extname='postgres_fdw';

The next step is to set up the remote server instance:

CREATE SERVER remotesrv
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (
    host '127.0.0.1',
    port '5432',
    dbname 'remotedb'
  );

This statement registers a remote server under the name remotesrv, a name you can choose freely. You have to specify which Foreign Data Wrapper to use (postgres_fdw in this case) as well as the target host, port and database name.

The CREATE SERVER statement didn’t contain any user login information; you have to provide it via a user mapping:

CREATE USER MAPPING
  FOR CURRENT_USER
  SERVER remotesrv
  OPTIONS (
    user 'joe',
    password 'secret'
  );

In this case we map the remote user joe to the current user (CURRENT_USER) of the local server. It doesn’t have to be the current user; you could specify any local user name.

Now you have to import tables from the remote database. You can either explicitly import individual tables or a whole schema. Here’s how to import the public schema of the remote (“foreign”) database into the public schema of the local database:

IMPORT FOREIGN SCHEMA public
  FROM SERVER remotesrv
  INTO public;

You can restrict which tables to import with the LIMIT TO or EXCEPT clauses. The following statement will only import the tables books and students:

IMPORT FOREIGN SCHEMA public
  LIMIT TO (students, books)
  FROM SERVER remotesrv
  INTO public;

Now you can access these tables as if they were in the local database.
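
For example, a simple query against one of the imported tables is now transparently executed on the remote server:

-- students is the foreign table imported above; the rows live in remotedb
SELECT count(*) FROM students;

Foreign tables imported this way can also be joined with local tables like any other table.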

Tables as types in PostgreSQL

In SQL each column of a database table has a data type. These are types like NUMBER, VARCHAR(size) / TEXT, TIMESTAMP. What you perhaps don’t know yet is that in PostgreSQL you can use tables as types. What does this mean? Let’s say we have a table:

CREATE TABLE person (
  firstname TEXT,
  lastname  TEXT,
  email     TEXT
);

Now you can use this table as a type for columns in other tables:

CREATE TABLE article (
  id       SERIAL,
  content  TEXT,
  author   person,
  reviewer person
);

Instead of repeating the three columns of a person twice, e.g. author_firstname, author_lastname, author_email, and reviewer_firstname, reviewer_lastname, reviewer_email, the person table defined before acts as a type. Of course, the usual way in SQL is to give each person an ID and reference persons via these IDs from other tables. But sometimes you do not want these reference semantics. In this example you might want to fix the values of author and reviewer for an article at the time of writing and not have them updated retroactively if a person changes their last name or email address later.

How do you access the columns of these types? For INSERT the syntax is as follows:

INSERT INTO article (content, author, reviewer)
  VALUES ('...',
    ('Jane', 'Doe', 'jane.doe@example.com'),
    ('John', 'Roe', 'jroe@example.com')
  ); 

Or with explicit names of the sub-columns:

INSERT INTO article (content,
    author.firstname,
    author.lastname,
    author.email,
    reviewer.firstname,
    reviewer.lastname,
    reviewer.email)
  VALUES ('...',
    'Jane', 'Doe', 'jane.doe@example.com',
    'John', 'Roe', 'jroe@example.com'
  ); 

In a SELECT query individual values can be accessed with the following syntax:

SELECT
  content,
  (author).lastname,
  (reviewer).lastname
FROM article;
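
Updating an individual sub-column works too. Note that on the left-hand side of SET the field is referenced without parentheses:

UPDATE article
  SET author.email = 'jane.doe@example.org'   -- changes only the email field of the author
  WHERE id = 1;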

Of course, tables that use other tables as data types for their columns can themselves be used as data types again.

One last thing worth mentioning is that these nested definitions can be mapped nicely to JSON:

SELECT jsonb_pretty(to_jsonb(article)) FROM article;
{
  "id": 1,
  "content": "...",
  "author": {
    "email": "jane.doe@example.com",
    "firstname": "Jane",
    "lastname": "Doe"
  },
  "reviewer": {
    "email": "jroe@example.com",
    "firstname": "John",
    "lastname": "Roe"
  }
}