PostgreSQL’s new MERGE command

PostgreSQL version 15 introduces a new SQL command: the MERGE command. This allows merging a table into another table. The MERGE command has existed for some time in other databases such as Oracle or SQL Server.

The principle of this command is that you have a target table in which you want to insert or remove data based on a source table under certain conditions, or you want to update existing entries with data from the source table. The source table doesn’t have to be a real table; it can just as easily be a SELECT query.

How to use it, step-by-step

The command begins with MERGE INTO, followed by the name of the target table. We call it dest here:

MERGE
  INTO dest ...

Then you specify the source table with USING. Here we call it src:

MERGE
  INTO dest
  USING src
  ...

If you want to use a SELECT query as the source instead of a real table, you can do it like this:

MERGE
  INTO dest
  USING (SELECT ... FROM ...) AS src
  ...

Now you need a condition that is used to match entries from one table to entries from the other table. This is specified after ON. In this example we simply use the IDs of the two tables:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  ...

This is followed by a case distinction that describes what should happen if the condition either applies or not. The possible actions can be: UPDATE, DELETE, INSERT, or DO NOTHING.

The two cases are specified with WHEN MATCHED THEN and WHEN NOT MATCHED THEN:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  WHEN MATCHED THEN
    UPDATE SET ...
  WHEN NOT MATCHED THEN
    INSERT (...) VALUES (...);

If a match exists, then reasonable actions are UPDATE, DELETE, or DO NOTHING. If no match exists, then reasonable actions are INSERT or DO NOTHING.

In the WHEN cases, additional conditions can be specified with AND:

MERGE
  INTO dest
  USING src
  ON dest.id=src.id
  WHEN MATCHED AND dest.value > src.value THEN
    DELETE
  WHEN MATCHED THEN
    UPDATE SET ...
  WHEN NOT MATCHED THEN
    DO NOTHING;
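
The order of the WHEN clauses matters: PostgreSQL evaluates them in the order written and executes only the first one whose condition holds for a given row. The following Python sketch models that behavior for the statement above on plain dictionaries. The function name merge_rows and the data layout (id mapped to value) are illustrative assumptions, and the elided UPDATE SET ... is modeled as simply copying the source value.

```python
def merge_rows(dest, src):
    """Model the MERGE above on dicts mapping id -> value (hypothetical layout)."""
    result = dict(dest)
    for row_id, src_value in src.items():
        if row_id in dest:                  # WHEN MATCHED ...
            if dest[row_id] > src_value:    # ... AND dest.value > src.value
                del result[row_id]          # DELETE
            else:                           # second WHEN MATCHED clause
                result[row_id] = src_value  # UPDATE (assumed: copy the source value)
        # WHEN NOT MATCHED THEN DO NOTHING: unmatched source rows are ignored
    return result


print(merge_rows({1: 10, 2: 5, 3: 7}, {1: 3, 2: 8, 4: 1}))  # → {2: 8, 3: 7}
```

Row 1 is deleted because the first clause already applies, so the plain WHEN MATCHED clause is never reached for it.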

A realistic example

Here’s an example demonstrating a use case that might occur in the real world:

MERGE
  INTO account a
  USING transaction t
  ON a.id=t.account_id
  WHEN MATCHED THEN
    UPDATE SET balance = a.balance + t.amount
  WHEN NOT MATCHED THEN
    INSERT (id, balance) VALUES (t.account_id, t.amount);

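This statement processes a table of monetary transactions and applies them to their matching customer accounts by adding the amount of each transaction to the balance of the matching account. If no matching account exists, it is created with the transaction amount as its initial balance. Note that PostgreSQL raises an error if more than one source row would modify the same target row, so the statement as written assumes at most one transaction per account. If there can be several, aggregate them per account in a SELECT query and use that as the source.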
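
Because MERGE refuses to modify the same target row twice, a source with several transactions per account has to be summed up per account first. The Python sketch below models that two-step logic (aggregate, then merge) on plain data structures. The name apply_transactions and the tuple layout are illustrative assumptions, not part of any database API.

```python
from collections import defaultdict


def apply_transactions(accounts, transactions):
    """accounts: dict id -> balance; transactions: list of (account_id, amount)."""
    # Step 1: one total per account, like SELECT account_id, SUM(amount) ... GROUP BY account_id
    totals = defaultdict(int)
    for account_id, amount in transactions:
        totals[account_id] += amount

    # Step 2: the MERGE itself, now with at most one source row per account
    result = dict(accounts)
    for account_id, total in totals.items():
        if account_id in result:         # WHEN MATCHED THEN UPDATE
            result[account_id] += total
        else:                            # WHEN NOT MATCHED THEN INSERT
            result[account_id] = total
    return result


print(apply_transactions({1: 100, 2: 50}, [(1, 25), (1, -5), (3, 10)]))
# → {1: 120, 2: 50, 3: 10}
```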

Re-ordering table columns in an Oracle database

In an Oracle database, once a table is created, there is no obvious way to change the order of its columns. Sometimes you add a new column to an existing table and want it to be displayed in a different position by default for query results via SELECT *. Imagine you add an ID column to a table after the fact. Wouldn’t it be nice if this appeared in the first position?

Of course you can drop the whole table and create it again with the new column order. But this is cumbersome and potentially dangerous if the table is already filled with data. However, there is a trick that allows you to rearrange the columns without having to recreate the table.

The key to this is an Oracle feature that allows invisible columns. The feature itself is interesting in its own right, but it has a useful side effect that we’ll exploit. The documentation says:

When you make an invisible column visible, the column is included in the table’s column order as the last column. When you make a visible column invisible, the invisible column is not included in the column order, and the order of the visible columns in the table might be re-arranged.

So the plan is to make the appropriate columns invisible first by clever choice, and then to make them visible again in the desired order. This is how it works:

First we have a table with the following columns.

CREATE TABLE t (a NUMBER, b NUMBER, c NUMBER, e NUMBER, f NUMBER);

Later we realize that we need a column d that should be between c and e. So we add it to the table:

ALTER TABLE t ADD (d NUMBER);

This is of course added at the end:

DESC t;
Name Null? Type   
---- ----- ------ 
A          NUMBER 
B          NUMBER 
C          NUMBER 
E          NUMBER 
F          NUMBER 
D          NUMBER 

To get it in the right position, we first hide the columns e and f, and then make them visible again.

ALTER TABLE t MODIFY (e INVISIBLE, f INVISIBLE);
ALTER TABLE t MODIFY (e VISIBLE, f VISIBLE);

And voilà, we have our desired order:

DESC t;
Name Null? Type   
---- ----- ------ 
A          NUMBER 
B          NUMBER 
C          NUMBER 
D          NUMBER 
E          NUMBER 
F          NUMBER 

Note that this doesn’t change the internal, physical layout of the table on the disk. It’s just a cosmetic change.
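
To plan such a reordering for a larger table, it can help to model the documented behavior first. The Python sketch below is only a mental model and assumes nothing beyond the quoted rule: a column that becomes invisible drops out of the column order, and a column that becomes visible is appended at the end. The function names are made up for this illustration.

```python
def make_invisible(order, *columns):
    """Invisible columns are no longer part of the column order."""
    return [c for c in order if c not in columns]


def make_visible(order, *columns):
    """Columns made visible are appended at the end, in the given sequence."""
    return order + list(columns)


order = ["a", "b", "c", "e", "f", "d"]    # state after ALTER TABLE t ADD (d NUMBER)
order = make_invisible(order, "e", "f")   # ["a", "b", "c", "d"]
order = make_visible(order, "e", "f")
print(order)  # → ['a', 'b', 'c', 'd', 'e', 'f']
```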

Range Types in PostgreSQL

How do you store ranges in an SQL database? By ranges I mean things like price ranges, temperature ranges, date ranges for scheduling, etc. You’d probably represent them with two columns in a table, like min_price and max_price, min_temperature and max_temperature, start_date and end_date. If you want to represent an unbounded range, you’d probably make one or both columns nullable and then take NULL as +/- infinity.

If you want to test if a value is in a range you can use the BETWEEN operator:

SELECT * FROM products WHERE
  target_price BETWEEN min_price AND max_price;

This doesn’t work as nicely anymore if you work with unbounded ranges as described above. You’d have to add additional checks for NULL. What if you want to test if one of the ranges in the table overlaps with a given range?

SELECT * FROM products WHERE
  max_given >= min_price AND
  min_given <= max_price;

Did I make a mistake here? I’m not sure. What if they should overlap but not cover each other? And again, this becomes even more complicated with unbounded ranges.
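
To see how much the NULL handling adds, here is the overlap test with unbounded ranges written out as a small Python sketch. It assumes inclusive bounds and uses None in place of NULL for "unbounded on this side". The function name is made up for this illustration.

```python
def ranges_overlap(min_a, max_a, min_b, max_b):
    """Inclusive bounds; None means the range is unbounded on that side."""
    a_starts_before_b_ends = min_a is None or max_b is None or min_a <= max_b
    b_starts_before_a_ends = min_b is None or max_a is None or min_b <= max_a
    return a_starts_before_b_ends and b_starts_before_a_ends


print(ranges_overlap(10, 20, 15, 30))        # → True
print(ranges_overlap(10, 20, 21, 30))        # → False
print(ranges_overlap(None, 20, 15, None))    # → True
```

Every comparison needs its own guard for the unbounded case, and the SQL version needs the same guards as IS NULL checks.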

Enter range types

PostgreSQL has a better solution for these problems — range types. It comes with these additional built-in data types:

  • int4range: Range of integer
  • int8range: Range of bigint
  • numrange: Range of numeric
  • tsrange: Range of timestamp without time zone
  • tstzrange: Range of timestamp with time zone
  • daterange: Range of date

You can use them as a column type in a table:

CREATE TABLE products (…, price_range numrange);

Construction

You can construct range values for these types like this:

'[20,35]'::int4range
'(5,12]'::int4range
'(6.2,12.5)'::numrange
'[2022-05-01, 2022-05-31]'::daterange
'[9:30, 12:00)'::timerange  -- not built in; requires CREATE TYPE timerange AS RANGE (subtype = time)

As you can see, they use mathematical interval notation. A square bracket means inclusive bound, and a round parenthesis means exclusive bound. They can also be unbounded (infinite) or empty:

'[5,)'::int4range
'(,20]'::int4range
'empty'::int4range

You can get the bounds of a range individually with the lower() and upper() functions:

SELECT * FROM products ORDER BY lower(price_range);

Operators

The range types become really powerful through the range operators. There are a lot, so I will only show some basic examples:

  • The && operator tests if two ranges overlap: range_a && range_b
  • The @> and <@ operators test if the first range contains the second or vice versa: range_a @> range_b or range_a <@ range_b. If used with an element on one side, they test if the element is in a range: element <@ range or range @> element.
  • The -|- operator tests if two ranges are adjacent: range_a -|- range_b

In addition to these boolean tests you can also calculate new ranges based on existing ranges:

The + operator computes the union of two overlapping or adjacent ranges: range_a + range_b. The * computes the intersection of ranges, and the - operator the difference.
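
If the operator symbols look cryptic, the following Python sketch spells out what some of them compute. It is a simplified model, not PostgreSQL’s implementation: it only covers bounded integer ranges in the half-open form [lower, upper), which is the canonical form PostgreSQL uses for discrete types such as int4range (for example, '[20,35]' is stored as [20,36)). The class and method names are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntRange:
    """Bounded integer range in half-open form [lower, upper)."""
    lower: int
    upper: int

    def is_empty(self):
        return self.lower >= self.upper

    def overlaps(self, other):      # range_a && range_b
        return self.lower < other.upper and other.lower < self.upper

    def contains(self, other):      # range_a @> range_b
        return self.lower <= other.lower and other.upper <= self.upper

    def adjacent(self, other):      # range_a -|- range_b
        return self.upper == other.lower or other.upper == self.lower

    def intersection(self, other):  # range_a * range_b
        return IntRange(max(self.lower, other.lower), min(self.upper, other.upper))


a = IntRange(20, 36)   # '[20,35]'::int4range
b = IntRange(6, 13)    # '(5,12]'::int4range
c = IntRange(30, 40)

print(a.overlaps(c))                 # → True
print(a.overlaps(b))                 # → False
print(a.intersection(c))             # → IntRange(lower=30, upper=36)
print(a.intersection(b).is_empty())  # → True
```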

Multiranges

There is one more thing I want to mention: For each one of the range types there is also a multirange type: int4multirange, int8multirange, nummultirange, tsmultirange, tstzmultirange, datemultirange. As their names suggest, they store multiple ranges in one value:

'{}'::int4multirange
'{[2,9)}'::int4multirange
'{[2,9), [12,20)}'::int4multirange

The mentioned range operators work with them as well.

Commenting SQL database objects

Did you know that you can annotate database objects like tables, views and columns with comments in many SQL database systems? By that I don’t mean comments in SQL scripts, indicated by double dashes (--), but comments attached to the objects themselves, stored in the database. These can help the database admin by providing a description of what is stored in these objects.

For PostgreSQL and Oracle databases the syntax is as follows:

COMMENT ON TABLE [schema_name.]table_name IS '...';
COMMENT ON COLUMN [schema_name.]table_name.column_name IS '...';

For example:

COMMENT ON COLUMN books.author IS 'The main author''s last name';
COMMENT ON TABLE books IS 'Contains only the best books';

These comments can be viewed in database tools like SQL Developer:

[Screenshots: comments on columns; comments on tables]

You can also view the comments in psql:

db=# \d+ books
 Column |  Type   |         Description
--------+---------+-----------------------------
 id     | integer |
 author | text    | The main author's last name
 title  | text    |

And for a table:

db=# \dt+ books
                    List of relations
 Schema | Name  | Type  |     |         Description
--------+-------+-------+ ... +------------------------------
 public | books | table |     | Contains only the best books

In Oracle you can query the comments from the data dictionary views ALL_TAB_COMMENTS and ALL_COL_COMMENTS:

> SELECT * FROM all_col_comments WHERE table_name='BOOKS';
OWNER    TABLE_NAME  COLUMN_NAME  COMMENTS
--------------------------------------------------------------
LIBRARY  BOOKS       ID           (null)
LIBRARY  BOOKS       AUTHOR       The main author's last name
LIBRARY  BOOKS       TITLE        (null)

> SELECT * FROM all_tab_comments WHERE table_name='BOOKS';
OWNER    TABLE_NAME  TABLE_TYPE  COMMENTS
--------------------------------------------------------------
LIBRARY  BOOKS       TABLE       Contains only the best books

In Oracle, comments are limited to tables, views, materialized views, columns, operators and indextypes, but in PostgreSQL you can attach comments to nearly everything. Another good use case for this is documentation comments on database functions:

COMMENT ON FUNCTION my_function IS $$
This function does something important.

Parameters:
...
Example usage:
...
$$;

Note: the $$ delimits multi-line strings (called dollar quoted string constants).

Understanding, identifying and fixing the N+1 query problem

One of the most common performance pitfalls for applications accessing data from databases is the so-called “N+1 query problem”, or sometimes also called the “N+1 selects problem”. It is the first thing you should look for when an application has performance issues related to database access. It is especially easy to run into with object-relational mappers (ORMs).

The problem

The problem typically arises when your entity-relationship model has a 1:n or n:m association. It exists when application code executes one query to get objects of one entity and then executes another query for each of these objects to get the objects of an associated entity. An example would be a blog application that executes one query to fetch all authors whose names start with the letter ‘B’, and then another query for each of these authors to fetch their articles. In pseudocode:

# The 1 query
authors = sql("SELECT * FROM author WHERE name LIKE 'B%'");

# The N queries
articles = []
FOR EACH author IN authors:
    articles += sql("SELECT * FROM article WHERE author_id=:aid", aid: author.id)

The first query is the “1” in “N+1”, the following queries in the loop are the “N”.

Of course, to anybody who knows SQL this is a very naive way to get the desired result. However, OR mappers often seduce their users into writing inefficient database access code by hiding the SQL queries and allowing their users to reach for the normal tools of their favorite programming language like loops or collection operations such as map. A lot of popular web application frameworks come along with OR mappers: Rails with Active Record, Grails with GORM (Hibernate based), Laravel with Eloquent.

How to detect

The easiest way to detect the problem in an application is to log the database queries. Virtually all ORMs have a configuration option to enable query logging.

For Grails/GORM the logging can be enabled per data source in the application.yml config file:

dataSource:
    logSql: true
    formatSql: true

For Rails/ActiveRecord query logging is automatically enabled in the development environment. Since Rails 5.2 the Verbose Query Logs format is enabled by default, which you had to explicitly enable in earlier versions.

For Laravel/Eloquent you can enable and access the query log with these two methods/functions:

DB::connection()->enableQueryLog();
DB::getQueryLog();

Once query logging is enabled you will quickly see if the same query is executed over and over again, usually indicating the presence of the N+1 problem.

How to fix

The goal is to replace the N+1 queries with a single query. In SQL this means joining. The example above would be written as a single query:

SELECT article.*
FROM article
JOIN author
  ON article.author_id=author.id
WHERE author.name LIKE 'B%'

The query interface of ORMs usually allows you to write joins as well. Here the example in ActiveRecord:

Article.joins(:author).where("authors.name LIKE ?", "B%")

Another option when using ORMs is to enable eager loading for associations. In GORM this can be enabled via the fetchMode static property:

class Author {
    static hasMany = [articles: Article]
    static fetchMode = [articles: 'eager']
}

REST APIs

The problem isn’t limited to SQL databases and SQL queries. For REST APIs it’s the “N+1 requests problem”, describing the situation where a client application has to call the server N+1 times to fetch one collection resource + N child resources. Here the REST API has to be extended or modified to serve the client’s use cases with a single request. Another option is to offer a GraphQL API instead of a REST API. GraphQL is a query language for HTTP APIs that allows complex queries, so the client application can specify exactly what resources it needs in a single request.

Tables as types in PostgreSQL

In SQL each column of a database table has a data type. These are types like NUMBER, VARCHAR(size) / TEXT, TIMESTAMP. What you perhaps don’t know yet is that in PostgreSQL you can use tables as types. What does this mean? Let’s say we have a table:

CREATE TABLE person (
  firstname TEXT,
  lastname  TEXT,
  email     TEXT
);

Now you can use this table as a type for columns in other tables:

CREATE TABLE article (
  id       SERIAL,
  content  TEXT,
  author   person,
  reviewer person
);

Instead of repeating the three columns of a person twice, e.g. author_firstname, author_lastname, author_email, and reviewer_firstname, reviewer_lastname, reviewer_email, the person table defined before acts as a type. Of course, the usual way in SQL is to give each person an ID and reference persons via these IDs from other tables. But sometimes you do not want this reference semantics. In this example you might want to fix the values of author and reviewer for articles in time and not retroactively update them automatically if a person changes their last name or email address later.
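
If you think in terms of application code, this is the difference between holding a reference to a shared object and embedding a value. The Python sketch below is only a rough analogy using immutable dataclasses. The class names mirror the tables above, and the sample people are made up for this illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Person:
    firstname: str
    lastname: str
    email: str


@dataclass(frozen=True)
class Article:
    content: str
    author: Person
    reviewer: Person


jane = Person("Jane", "Doe", "jane.doe@example.com")
john = Person("John", "Roe", "jroe@example.com")
article = Article("...", author=jane, reviewer=john)

# Jane changes her last name later. This creates a new value;
# the author stored with the article stays as it was.
jane_renamed = replace(jane, lastname="Smith")

print(article.author.lastname)  # → Doe
print(jane_renamed.lastname)    # → Smith
```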

How to access the columns of these types? For INSERT the syntax is as follows:

INSERT INTO article (content, author, reviewer)
  VALUES ('...',
    ('Jane', 'Doe', 'jane.doe@example.com'),
    ('John', 'Roe', 'jroe@example.com')
  ); 

Or with explicit names of the sub-columns:

INSERT INTO article (content,
    author.firstname,
    author.lastname,
    author.email,
    reviewer.firstname,
    reviewer.lastname,
    reviewer.email)
  VALUES ('...',
    'Jane', 'Doe', 'jane.doe@example.com',
    'John', 'Roe', 'jroe@example.com'
  ); 

In a SELECT query individual values can be accessed with the following syntax:

SELECT
  content,
  (author).lastname,
  (reviewer).lastname
FROM article;

Of course, tables that use other tables as data types for their columns can be used as data types again.

One last thing worth mentioning is that these nested definitions can be mapped nicely to JSON:

SELECT jsonb_pretty(to_jsonb(article)) FROM article;
{
    "id": 1,
    "author": {
        "email": "jane.doe@example.com",
        "lastname": "Doe",
        "firstname": "Jane"
    },
    "content": "...",
    "reviewer": {
        "email": "jroe@example.com",
        "lastname": "Roe",
        "firstname": "John"
    }
}

Geometric shapes, functions and operators in PostgreSQL

On this blog I frequently write about features of relational database systems and their SQL dialects. One feature many developers do not know about is support for geometric shapes, although a lot of RDBMSs support them in one form or another, each with its own syntax, of course. In this article I’m going to demonstrate this feature with PostgreSQL.

PostgreSQL has data types for geometric shapes like point, box, line segment, line, path, polygon, and circle. These data types are only defined for two dimensions with Euclidean (x, y) coordinates. Here are some literals for these types:

point '(3.2,4)'
box '((1,2),(6,4))'
lseg '((-4,0),(3,2))'
path '((0,0),(2,1),(5,3))'
polygon '((0,0),(1,1),(2,0),(3,1))'
circle '((5,2),1.5)'

You can create tables with columns of these types and insert shapes:

CREATE TABLE shapes (p point, c circle);

INSERT INTO shapes (p, c) VALUES
  (point '(1,0)', circle '(0,0),3'),
  (point '(10,20)', circle '(2,3),4'),
  (point '(0.5,1.5)', circle '(1,2),1');

Now you can query shapes and filter them with special operators:

SELECT * FROM shapes WHERE c @> p;

This query uses the contains operator @> in the WHERE clause. It selects all rows where the circle c contains the point p.
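
To check by hand which rows this query should return, the containment test can be written as a distance comparison. The Python sketch below applies it to the three rows inserted above. It assumes that a point on the boundary counts as contained, which does not matter for this data, and the function name is made up for this illustration.

```python
import math


def circle_contains_point(center, radius, point):
    """A point is inside a circle if its distance to the center is at most the radius."""
    return math.dist(center, point) <= radius


# (point, (circle center, circle radius)) for the three rows in the shapes table
shapes = [
    ((1, 0), ((0, 0), 3)),
    ((10, 20), ((2, 3), 4)),
    ((0.5, 1.5), ((1, 2), 1)),
]

matching = [(p, c) for p, c in shapes if circle_contains_point(c[0], c[1], p)]
print([p for p, _ in matching])  # → [(1, 0), (0.5, 1.5)]
```

So the query should return the first and the third row.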

Here’s another operator: <-> determines the Euclidean distance between two points.

SELECT point '(0,0)' <-> point '(1,1)';
=> 1.4142135623730951

The ?|| operator tests if two lines are parallel:

SELECT line '((1,2),(1,3))' ?|| line '((2,3),(2,4))';
=> true

You can translate a shape with the + operator:

SELECT box '((0,0),(1,1))' + point '(1,2)';
=> box '(2,3),(1,2)'

Or you can test with && if two shapes overlap:

SELECT box '((1,2),(4,3))' && box '(2,3),(1,2)';
=> true

This is only a small selection of geometric operators. See the full list in the official documentation. There you can also find a list of geometric functions like area, center, isclosed, npoints, etc.

SELECT area(box '((4,6),(10,12))');
=> 36

As mentioned in the beginning, other database systems support similar functionality. Check out MySQL’s spatial data types, Oracle Spatial, and MS SQL’s spatial data types.

Pagination in SQL

Pagination is the task of dividing a data set into consecutive pages. For example, a search engine initially only shows the first 15 results for a search query. The user can then step through the rest of the results by clicking a “Next” button.

Ideally this feature is also supported by the underlying database system. Otherwise, the application would have to load all matching data records from the database, just to discard most of them, because the user only wanted to see page 3 of 50. A pagination request has two components: a limit and an offset. If a page contains a maximum of 15 items and page 3 is requested, then the limit would be 15 and the offset would be 30 = (page-1) × limit.
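
In application code this calculation usually lives in a small helper. The following Python sketch shows one way to do it, using the built-in sqlite3 module as a stand-in database because SQLite understands the same LIMIT ... OFFSET syntax. The function names and the users table with a name column are assumptions for this illustration.

```python
import sqlite3


def limit_and_offset(page, page_size):
    """Translate a 1-based page number into (limit, offset)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return page_size, (page - 1) * page_size


def fetch_page(conn, page, page_size):
    """Fetch one page of user names; ORDER BY keeps the pages stable."""
    limit, offset = limit_and_offset(page, page_size)
    rows = conn.execute(
        "SELECT name FROM users ORDER BY name LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return [name for (name,) in rows]


print(limit_and_offset(3, 15))  # → (15, 30)
```

The ORDER BY clause is essential: without a defined order the database is free to return rows in any order, and pages could overlap or skip entries.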

PostgreSQL, MySQL, MariaDB

The database systems PostgreSQL, MySQL and MariaDB have a straightforward syntax for pagination: LIMIT {number} OFFSET {number}. So a simple SQL query with pagination might look like this:

SELECT * FROM users ORDER BY name LIMIT 15 OFFSET 30;

Oracle DB

Oracle DB didn’t have a dedicated syntax for pagination before Oracle 12c, but it was still possible to achieve the same result with other means. With Oracle 12c a new syntax for pagination was introduced under the name “Row limiting clause”. First I’ll show the old method, then the new syntax.

The old method is based on ROWNUM. If you wanted to specify both an offset and a limit, you had to nest multiple queries:

SELECT *
FROM (SELECT u.*, rownum AS rnum
      FROM (SELECT *
            FROM users
            ORDER BY name) u
      WHERE rownum <= 45)
WHERE rnum > 30;

The newer row limiting clause syntax is shorter and looks as follows:

SELECT * FROM users ORDER BY name
  OFFSET 30 ROWS FETCH NEXT 15 ROWS ONLY;

This syntax also allows the option to specify a percentage of rows instead of a fixed number of rows:

SELECT * FROM users ORDER BY name
  FETCH FIRST 20 PERCENT ROWS ONLY;

MS SQL Server

Microsoft’s SQL Server also supports the Oracle-like syntax with OFFSET and FETCH clauses and recommends the usage of this syntax for pagination.

The World of SQL Dialects

For software projects I work with various relational database management systems (RDBMSs), mainly PostgreSQL, MySQL/MariaDB, Oracle Database and Microsoft SQL Server. All of these use SQL as a query language, but the dialects of this language vary wildly, especially when it comes to non-standardized features. One such feature I often use is the aggregation of a list to a string. It does the following.

LEGS    ANIMAL
-----------------
2       Ostrich
2       Human
4       Cat
4       Dog
4       Capybara
6       Ant
8       Spider

Given a table like the one above it groups the elements of a column that have the same value in another column together in a string, concatenated by a separator like a comma:

LEGS    ANIMALS
----------------------------
2       Human, Ostrich
4       Capybara, Cat, Dog
6       Ant
8       Spider

This simple operation has four different syntaxes in the four mentioned database systems, which I want to demonstrate.
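
Before looking at the SQL variants, here is the operation itself as a short Python sketch, in case the description above is too abstract. The function name aggregate_strings is made up for this illustration. It sorts both the groups and the values within each group, like the ORDER BY clauses in the queries that follow.

```python
from collections import defaultdict


def aggregate_strings(rows, separator=", "):
    """Group (key, value) pairs by key and join each group's sorted values."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return [(key, separator.join(sorted(values))) for key, values in sorted(groups.items())]


fauna = [
    (2, "Ostrich"), (2, "Human"),
    (4, "Cat"), (4, "Dog"), (4, "Capybara"),
    (6, "Ant"), (8, "Spider"),
]
for legs, animals in aggregate_strings(fauna):
    print(legs, animals)
# 2 Human, Ostrich
# 4 Capybara, Cat, Dog
# 6 Ant
# 8 Spider
```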

PostgreSQL

In PostgreSQL the function is called STRING_AGG:

SELECT legs,
  STRING_AGG(animal, ', ' ORDER BY animal) AS animals
FROM fauna
GROUP BY legs
ORDER BY legs;

MySQL / MariaDB

In MySQL and its fork MariaDB the function is called GROUP_CONCAT, and it has a special syntax to specify the separator:

SELECT legs,
  GROUP_CONCAT(animal ORDER BY animal SEPARATOR ', ') AS animals
FROM fauna
GROUP BY legs
ORDER BY legs;

Oracle

Oracle calls it LISTAGG and specifies the ordering within each group via WITHIN GROUP:

SELECT legs,
  LISTAGG(animal, ', ') WITHIN GROUP (ORDER BY animal) AS animals
FROM fauna
GROUP BY legs
ORDER BY legs;

Microsoft SQL Server

SQL Server calls it STRING_AGG like PostgreSQL, but specifies the ordering via WITHIN GROUP like Oracle:

SELECT legs,
  STRING_AGG(animal, ', ') WITHIN GROUP (ORDER BY animal) AS animals
FROM fauna
GROUP BY legs
ORDER BY legs;

Unfortunately, as developers we have to live with all these dialects of SQL. Even though there is an ISO standards committee for SQL, database creators love to build non-standard extensions into their products. The situation is worse than the browser-specific extensions and differences of JavaScript, HTML and CSS in modern web browsers. One thing that can paper over these differences is an OR mapper like Hibernate, or a query language like Hibernate’s HQL that abstracts over SQL, but these come with their own set of problems.

Contiguous date ranges in Oracle SQL

In one of my last posts from a couple of weeks ago I wrote about querying gaps between non-contiguous date ranges in Oracle SQL. This week’s post is about contiguous date ranges.

While non-contiguous date ranges are best represented in a database table with a start_date and an end_date column, it is better to represent contiguous date ranges only by one date column, so that we avoid redundancy and do not have to keep the start date of a date range in sync with the end date of the previous date range. In this post I will use the start date:

CREATE TABLE date_ranges (
  name       VARCHAR2(100),
  start_date DATE
);

The example content of the table is:

NAME	START_DATE
----	----------
A	05/02/2020
B	02/04/2020
C	16/04/2020
D	01/06/2020
E	21/06/2020
F	02/07/2020
G	05/08/2020

This representation means that the date range with the most recent start date does not have an end. The application using this data model can choose whether to interpret this as a date range with an open end or just as the end point for the previous range and not as a date range by itself.

While this is a nice non-redundant representation, it is less convenient for queries where we want to have both a start and an end date per row, for example in order to check whether a given date lies within a date range or not. Luckily, we can transform the ranges with a query:

SELECT
  date_ranges.*,
  LEAD(date_ranges.start_date)
    OVER (ORDER BY start_date)
    AS end_date
FROM date_ranges;

As in the previous post on non-contiguous date ranges the LEAD analytic function allows you to access the following row from the current row without using a self-join. Here’s the result:

NAME	START_DATE	END_DATE
----	----------	--------
A	05/02/2020	02/04/2020
B	02/04/2020	16/04/2020
C	16/04/2020	01/06/2020
D	01/06/2020	21/06/2020
E	21/06/2020	02/07/2020
F	02/07/2020	05/08/2020
G	05/08/2020	(null)
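
If window functions are new to you, the following Python sketch shows what LEAD does here: after sorting by start date, each row is paired with the start date of the row that follows it. The function names are made up for this illustration. The lookup function additionally treats each range as half-open and the last range as open-ended, which is only one of the two interpretations mentioned above.

```python
from datetime import date


def with_end_dates(ranges):
    """ranges: list of (name, start_date). Returns (name, start_date, end_date) tuples."""
    ordered = sorted(ranges, key=lambda r: r[1])
    # The end date of a range is the start date of the next one; the last has none.
    next_starts = [start for _, start in ordered[1:]] + [None]
    return [(name, start, end) for (name, start), end in zip(ordered, next_starts)]


def find_range(ranges, day):
    """Name of the range containing day, or None if day is before the first range."""
    for name, start, end in with_end_dates(ranges):
        if start <= day and (end is None or day < end):
            return name
    return None


ranges = [("A", date(2020, 2, 5)), ("B", date(2020, 4, 2)), ("C", date(2020, 4, 16))]
print(find_range(ranges, date(2020, 4, 2)))  # → B
```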

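By using a WITH clause you can use this query like a view and join it with another table, for example with the join condition that a date lies within a date range. The comparison treats each range as half-open (start date included, end date excluded), so that a date falling exactly on a boundary matches only one range. Rows in the last range have no end date and therefore do not match; add OR ranges.end_date IS NULL to the second condition if the last range should be open-ended:

WITH ranges AS
  (SELECT date_ranges.*,
          LEAD(date_ranges.start_date) OVER (ORDER BY start_date) AS end_date
   FROM date_ranges)
SELECT timeseries.*, ranges.name
FROM timeseries LEFT OUTER JOIN ranges ON
  timeseries.measurement_date >= ranges.start_date
  AND timeseries.measurement_date < ranges.end_date;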