Common SQL Performance Gotchas in Application Development

When building apps that use a SQL database, it’s easy to run into performance problems without noticing. Many of these issues come from the way queries are written and used in the code. Below are seven common SQL mistakes developers make, why they happen, and how you can avoid them.

Not Using Prepared Statements

One of the most common mistakes is building SQL queries by concatenating strings. This approach not only introduces the risk of SQL injection but also prevents the database from reusing execution plans. Prepared statements or parameterized queries let the database understand the structure of the query ahead of time, which improves performance and security. They also help avoid subtle bugs caused by incorrect string formatting or escaping.

// Vulnerable and inefficient
String userId = "42";
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM users WHERE id = " + userId);
// Safe and performant
String sql = "SELECT * FROM users WHERE id = ?";
PreparedStatement ps = connection.prepareStatement(sql);
ps.setInt(1, 42);
ResultSet rs = ps.executeQuery();

The N+1 Query Problem

The N+1 problem happens when an application fetches a list of items and then runs a separate query for each item to retrieve related data. For example, fetching a list of users and then querying each user’s posts in a loop. This results in one query to fetch the users and N additional queries for their posts. The fix is to restructure the query using joins or batch-fetching strategies, so all the data can be retrieved in fewer queries.

We have written about it on our blog before: Understanding, identifying and fixing the N+1 query problem

Missing Indexes

When queries filter or join on columns that do not have indexes, the database may need to scan entire tables to find matching rows. This can be very slow, especially as data grows. Adding the right indexes can drastically improve performance. It’s important to monitor slow queries and check whether indexes exist on the columns used in WHERE clauses, JOINs, and ORDER BY clauses.

Here’s how to create an index on an “orders” table for its “customer_id” column:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Once the index is added, the query can efficiently find matching rows without scanning the full table.

Retrieving Too Much Data

Using SELECT * to fetch all columns from a table is a common habit, but it often retrieves more data than the application needs. This can increase network load and memory usage. Similarly, not using pagination when retrieving large result sets can lead to long query times and a poor user experience. Always select only the necessary columns and use LIMIT or OFFSET clauses to manage result size.

For example:

String sql = "SELECT id, name, price FROM products LIMIT ? OFFSET ?";
PreparedStatement ps = connection.prepareStatement(sql);
ps.setInt(1, 50);
ps.setInt(2, 0);
ResultSet rs = ps.executeQuery();

Chatty Database Interactions

Some applications make many small queries in a single request cycle, creating high overhead from repeated database access. Each round-trip to the database introduces latency. Here’s an inefficient example:

for (int id : productIds) {
    PreparedStatement ps = connection.prepareStatement(
        "UPDATE products SET price = price * 1.1 WHERE id = ?"
    );
    ps.setInt(1, id);
    ps.executeUpdate();
}

Instead of issuing separate queries, it’s often better to combine them or use batch operations where possible. This reduces the number of database interactions and improves overall throughput:

PreparedStatement ps = connection.prepareStatement(
    "UPDATE products SET price = price * 1.1 WHERE id = ?"
);

for (int id : productIds) {
    ps.setInt(1, id);
    ps.addBatch();
}
ps.executeBatch();

Improper Connection Pooling

Establishing a new connection to the database for every query or request is slow and resource-intensive. Connection pooling allows applications to reuse database connections, avoiding the cost of repeatedly opening and closing them. Applications that do not use pooling efficiently may suffer from connection exhaustion or high latency under load. To avoid this use a connection pooler and configure it with appropriate limits for the workload.

Unbounded Wildcard Searches

Using wildcard searches with patterns like '%term%' in a WHERE clause causes the database to scan the entire table, because indexes cannot be used effectively. These searches are expensive and scale poorly. To handle partial matches more efficiently, consider using full-text search features provided by the database, which are designed for fast text searching. Here’s an example in PosgreSQL:

SELECT * FROM articles
WHERE to_tsvector('english', title) @@ to_tsquery('database');

One of our previous blog posts dives deeper into this topic: Full-text Search with PostgreSQL.

By being mindful of these common pitfalls, you can write SQL that scales well and performs reliably under load. Good database performance isn’t just about writing correct queries – it’s about writing efficient ones.

Have you faced any of these problems before? Every project is different, and we all learn a lot from the challenges we run into. Feel free to share your experiences or tips in the comments. Your story could help someone else improve their app’s performance too.

PostgreSQL’s Foreign Data Wrappers

When people think of SQL and PostgreSQL, they usually picture databases full of rows and columns – clean, organized, and stored on a persistent volume like a disk. But what if your data isn’t in a database at all? What if it lives on the web, behind a REST API, in JSON format?

Usually, you’d write a script. You’d use Python or JavaScript, send a request to the API, parse the JSON, and maybe insert the results into your database for analysis. But PostgreSQL has a shortcut that many people don’t know about: Foreign Data Wrappers.

Just like FUSE (Filesystem in Userspace) lets you mount cloud drives or remote folders as if they were local, Foreign Data Wrappers lets PostgreSQL mount external resources like other database management systems, files, or web APIs and treat them like SQL tables.

Example

In this article we’ll use http_fdw as an example. With it, you can run a SQL query against a URL and read the result – no extra code, no data pipeline, just SQL. Here’s how you could set that up.

Let’s say you want to explore data from JSONPlaceholder, a free fake API for testing. It has a /posts endpoint that returns a list of blog posts in JSON format.

First, make sure the extension is installed (this may require building from source, depending on your system):

CREATE EXTENSION http_fdw;

Now create a foreign server that points to the API:

CREATE SERVER json_api_server
  FOREIGN DATA WRAPPER http_fdw
  OPTIONS (uri 'https://jsonplaceholder.typicode.com/posts', format 'json');

Then define a foreign table that maps to the shape of the JSON response:

CREATE FOREIGN TABLE api_posts (
  userId integer,
  id     integer,
  title  text,
  body   text
)
SERVER json_api_server
OPTIONS (rowpath '');

Since the API returns an array of posts, and each one is an object with userId, id, title, and body, this table matches that structure.

Now you can query it:

SELECT title FROM api_posts WHERE userId = 1;

Behind the scenes, PostgreSQL sends a GET request to the API, parses the JSON, and returns the result like any other table.

Things to Keep in Mind

Not everything is perfect. http_fdw is read-only, so you can’t use it to send data back to the API. It also relies on the API being available and responsive, and it doesn’t support authentication out of the box – you’ll need to handle that with custom headers if the API requires it. Complex or deeply nested JSON might also require some extra configuration.

But for many use cases, it’s an interesting option to work with external data. You don’t have to leave SQL. You don’t have to wire up a data pipeline. You just run a query.

What are GIN Indexes in PostgreSQL?

If you’ve worked with PostgreSQL and dealt with things like full-text search, arrays, or JSON data, you might have heard about GIN indexes. But what exactly are they, and why are they useful?

GIN stands for Generalized Inverted Index. Most indexes (like the default B-tree index) work best when there’s one clear value per row – like a number or a name. But sometimes, a single column can hold many values. Think of a column that stores a list of tags, words in a document, or key-value data in JSON. That’s where GIN comes in.

Let’s walk through a few examples to see how GIN indexes work and why they’re helpful.

Full-text search example

Suppose you have a table of articles:

CREATE TABLE articles (
  id    serial PRIMARY KEY,
  title text,
  body  text
);

You want to let users search the content of these articles. PostgreSQL has built-in support for full-text search, which works with a special data type called tsvector. To get started, you’d add a column to store this processed version of your article text:

ALTER TABLE articles ADD COLUMN tsv tsvector;
UPDATE articles SET tsv = to_tsvector('english', body);

Now, to speed up searches, you create a GIN index:

CREATE INDEX idx_articles_tsv ON articles USING GIN(tsv);

With that in place, you can search for articles quickly:

SELECT * FROM articles WHERE tsv @@ to_tsquery('tonic & water');

This finds all articles that contain both “tonic” and “water”, and thanks to the GIN index, it’s fast – even if you have thousands of articles.

Array example

GIN is also great for columns that store arrays. Let’s say you have a table of photos, and each photo can have several tags:

CREATE TABLE photos (
  id   serial PRIMARY KEY,
  tags text[]
);

You want to find all photos tagged with “capybara”. You can create a GIN index on the tags column:

CREATE INDEX idx_photos_tags ON photos USING GIN(tags);
SELECT * FROM photos WHERE tags @> ARRAY['capybara'];

(The @> operator means “contains” or “is a superset of”.)

The index lets PostgreSQL find matching rows quickly, without scanning the entire table.

JSONB example

PostgreSQL’s jsonb type lets you store flexible key-value data. Imagine a table of users with extra info stored in a jsonb column:

CREATE TABLE users (
  id   serial PRIMARY KEY,
  data jsonb
);

One row might store {"age": 42, "city": "Karlsruhe"}. To find all users from New York, you can use:

SELECT * FROM users WHERE data @> '{"city": "Karlsruhe"}';

And again, with a GIN index on the data column, this query becomes much faster:

CREATE INDEX idx_users_data ON users USING GIN(data);

Things to keep in mind

GIN indexes are very powerful, but they come with some tradeoffs. They’re slower to build and can make insert or update operations a bit heavier. So they’re best when you read (search) data often, but don’t write to the table constantly.

In short, GIN indexes are your friend when you’re dealing with columns that contain multiple values – like arrays, full-text data, or JSON. They let PostgreSQL break apart those values and build a fast lookup system. If your queries feel slow and you’re working with these kinds of columns, adding a GIN index might be exactly what you need.

Dependent Subqueries and LATERAL in PostgreSQL

When working with databases, you often need to run a query that depends on results from another query. PostgreSQL offers two main ways to handle this: Dependent Subqueries and LATERAL joins.

A dependent subquery is like a mini-query inside another query. The inner query (subquery) depends on the outer query for its input. For every row in the outer query, the subquery runs separately.

Imagine you have two tables: customers, which holds information about customers (e.g., their id and name), and orders, which holds information about orders (e.g., customer_id, order_date, and amount). Now, you want to find the latest order date for each customer. You can use a dependent subquery like this:

SELECT id AS customer_id,
  name,
  (SELECT order_date
     FROM orders
    WHERE customer_id=customers.id
    ORDER BY order_date DESC
    LIMIT 1) AS latest_order_date
FROM customers;

For each customer in the customers table, the subquery looks for their orders in the orders table. The subquery sorts the orders by date (ORDER BY order_date DESC) and picks the most recent one (LIMIT 1).

This works, but it has a drawback: If you have many customers, this approach can be slow because the subquery runs once for every customer.

LATERAL join

A LATERAL join is a smarter way to solve the same problem. It allows you to write a subquery that depends on the outer query, but instead of repeating the subquery for every row, PostgreSQL handles it more efficiently.

Let’s solve the “latest order date” problem using LATERAL:

SELECT
  c.id AS customer_id,
  c.name,
  o.order_date AS latest_order_date
FROM customers c
LEFT JOIN LATERAL (
  SELECT order_date
   FROM orders
  WHERE orders.customer_id=c.id
  ORDER BY order_date DESC
  LIMIT 1
) o ON TRUE;

For each customer (c.id), the LATERAL subquery finds the latest order in the orders table. The LEFT JOIN ensures that customers with no orders are included in the results, with NULL for the latest_order_date.

It’s easier to read and often faster, especially for large datasets, because PostgreSQL optimizes it better.

Both dependent subqueries and LATERAL joins allow you to handle scenarios where one query depends on the results of another. Dependent subqueries are straightforward and good enough for simple tasks with small datasets. However, you should consider using LATERAL for large datasets where performance matters.

JSON as a table in PostgreSQL 17

Some time ago, I described in a blog post how to work with JSON data in a PostgreSQL database. Last month (September 2024), PostgreSQL 17 was released, which offers another feature for working with JSON data: the JSON_TABLE() function. Such a function already existed in other database systems, such as MySQL and Oracle.

The core idea behind JSON_TABLE() is to temporarily transform JSON data into a relational format that can be queried using standard SQL commands. In doing so, developers can apply the full power of SQL, such as filtering, sorting, aggregating, and joining, to data that originally exists in JSON format: It enables you to use SQL to query JSON data stored in a column as though it were relational data, and you can join JSON data with regular relational tables for a seamless mix of structured and semi-structured data.

Syntax

The syntax for JSON_TABLE looks a bit complex at first glance, but once you break it down, it becomes quite clear. Here’s the basic structure:

JSON_TABLE(
  json_doc, path
  COLUMNS (
    column_name type PATH 'path_to_value' [DEFAULT default_value ON ERROR],
    ...
  )
) AS alias_name

The json_doc is the JSON data you’re querying. It can be a JSON column or a JSON string. It is followed by a path expression, which describes the location in the JSON document that you want to process. Path expressions are specified as strings in single quotation marks and have a special syntax called JSONPath (loosely inspired by XPath for XML). For example, such an expression could look like this: '$.store.book[0].title'. The dollar sign represents the root of the JSON tree, the sub-properties are separated by dots, and array elements are referenced using square brackets.

This is followed by the COLUMNS keyword. It specifies the columns to be extracted from the JSON document. Each column is specified by a name of the column in the resulting table, a data type of the extracted value (e.g., VARCHAR, INT), followed by the PATH keyword and a JSON path expression that references a JSON value below the original path expression.

The DEFAULT keyword optionally provides a default value if the path does not exist or if there’s an error during extraction.

After the JSON_TABLE function call you can specify the alias name for the table with the AS keyword.

Example

It’s time for an example. Let’s assume we have a table named fruit_store with a column named json_details that holds the some JSON data about fruits:

INSERT INTO fruit_store (id, json_details) VALUES (
    1,
    '{
        "category": "Citrus",
        "fruits": [
            {
                "name": "Orange",
                "color": "Orange",
                "taste": "Sweet",
                "price_per_kg": 3.5
            },
            {
                "name": "Lemon",
                "color": "Yellow",
                "taste": "Sour",
                "price_per_kg": 4.0
            }
        ]
    }'
);

Now we can use the JSON_TABLE function to extract the details of each fruit from the JSON document. The goal is to retrieve the fruit name, color, taste, and price for each fruit in the array. Here’s the query:

SELECT *
  FROM fruit_store,
  JSON_TABLE(
    json_details, '$.fruits[*]'
    COLUMNS (
      fruit_name   VARCHAR(50)  PATH '$.name',
      fruit_color  VARCHAR(50)  PATH '$.color',
      fruit_taste  VARCHAR(50)  PATH '$.taste',
      price_per_kg DECIMAL(5,2) PATH '$.price_per_kg'
    )
  ) AS fruit_table;

Running this query will give you the following result:

idfruit_namefruit_colorfruit_tasteprice_per_kg
1OrangeOrangeSweet3.50
1LemonYellowSour4.00

SQL Database Window Functions

Window functions allow users to perform calculations across a set of rows that are somehow related to the current row. This can include calculations like running totals, moving averages, and ranking without the need to group the entire query into one aggregate result.

Despite their flexibility, window functions are sometimes underutilised, either because users are unaware of them or because they’re considered too complex for everyday tasks. Learning how to effectively use window functions can improve the efficiency and readability of SQL queries, particularly for reporting and data analysis purposes. This article will explore several use cases.

Numbering Rows

The simplest application area for window functions is the numbering of rows. The ROW_NUMBER() function assigns a unique number to each row within the partition of a result set. The numbering is sequential and starts at 1. It’s useful for creating a unique identifier for rows within a partition, even when the rows are identical in terms of data.

Consider the following database table of library checkouts:

bookcheckout_datemember_id
The Great Adventure2024-02-15102
The Great Adventure2024-01-10105
Mystery of the Seas2024-01-20103
Mystery of the Seas2024-03-01101
Journey Through Time2024-02-01104
Journey Through Time2024-02-18102

We want to assign a unique row number to each checkout instance for every book, ordered by the checkout date to analyze the circulation trend:

SELECT
 book,
checkout_date,
member_id,
ROW_NUMBER() OVER (PARTITION BY book ORDER BY checkout_date) AS checkout_order
FROM library_checkouts;

The result:

bookcheckout_datemember_idcheckout_order
The Great Adventure2024-01-101051
The Great Adventure2024-02-151022
Mystery of the Seas2024-01-201031
Mystery of the Seas2024-03-011012
Journey Through Time2024-02-011041
Journey Through Time2024-02-181022

Ranking

In the context of SQL and specifically regarding window functions, “ranking” refers to the process of assigning a unique position or rank to each row within a partition of a result set based on a specified ordering.

The RANK() function provides a ranking for each row within a partition, with gaps in the ranking sequence when there are ties. It’s useful for ranking items that have the same value.

Consider the following database table of scores in a game tournament:

playergamescore
AliceSpace Invaders4200
BobSpace Invaders5700
CharlieSpace Invaders5700
DanaDonkey Kong6000
EveDonkey Kong4800
FrankDonkey Kong6000
AliceAsteroids8500
BobAsteroids9300
CharlieAsteroids7600

We want to rank the players within each game based on their score, with gaps in rank for ties:

SELECT
 player,
 game,
score,
RANK() OVER (PARTITION BY game ORDER BY score DESC) AS rank
FROM scores;

The result looks like this:

playergamescorerank
BobSpace Invaders57001
CharlieSpace Invaders57001
AliceSpace Invaders42003
DanaDonkey Kong60001
FrankDonkey Kong60001
EveDonkey Kong48003
BobAsteroids93001
AliceAsteroids85002
CharlieAsteroids76003

If you don’t want to have gaps in the ranking sequence when there are ties, you can substitute DENSE_RANK() for RANK().

Cumulative Sum

The SUM() function can be used as a window function to calculate the cumulative sum of a column over a partition of rows.

Example: We are tracking our garden’s vegetable harvest in a database table, and we want to calculate the cumulative yield for each type of vegetable over the harvesting season.

vegetableharvest_dateyield_kg
Carrots2024-06-1810
Carrots2024-07-1015
Tomatos2024-06-1520
Tomatos2024-07-0130
Tomatos2024-07-2025
Zucchini2024-06-2015
Zucchini2024-07-0520

We calculate the running total (cumulative yield) for each vegetable type as the season progresses, using the SUM() function:

SELECT
 vegetable,
harvest_date,
yield_kg,
SUM(yield_kg) OVER (PARTITION BY vegetable ORDER BY harvest_date ASC) AS cumulative_yield
FROM garden_harvest;

Now we can see which vegetables are most productive and how yield accumulates throughout the season:

vegetableharvest_dateyield_kgcumulative_yield
Carrots2024-06-181010
Carrots2024-07-101525
Tomatos2024-06-152020
Tomatos2024-07-013050
Tomatos2024-07-202575
Zucchini2024-06-201515
Zucchini2024-07-052035

PostgreSQL’s auto-explain feature and execution plans

PostgreSQL’s auto-explain is a built-in feature that automatically generates and logs execution plans for SQL statements. It’s a useful tool for developers to understand how the query planner is executing SQL queries.

You enable and configure auto-explain by setting parameters in the PostgreSQL configuration file (postgresql.conf). Set auto_explain.log_analyze to on to log execution plans along with statistics, and set auto_explain.log_min_duration to specify the minimum execution time in milliseconds that a query must take to be logged. For example, if you want to log queries taking longer than 100 milliseconds, set it to 100. Set auto_explain.log_buffers to on if you want to include information about memory usage, and auto_explain.log_timing to log timing information.

Here’s an example of how to configure these parameters in postgresql.conf:

auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_timing = on
auto_explain.log_min_duration = 100

Reading the execution plan

Suppose you have a “recipe database” that stores recipes, ingredients, and chefs. You want to retrieve a list of recipes along with the names of the chefs who created them and the ingredients they contain. Here’s a query that accomplishes this:

SELECT recipes.recipe_name, chefs.chef_name, ingredients.ingredient_name
  FROM recipes
  JOIN chefs ON recipes.chef_id=chefs.chef_id
  JOIN recipe_ingredients ON recipes.recipe_id=recipe_ingredients.recipe_id
  JOIN ingredients ON recipe_ingredients.ingredient_id=ingredients.ingredient_id
WHERE recipes.cuisine='Italian';

This query fetches Italian recipes, their respective chefs, and the ingredients they use.

When you run this query with auto-explain enabled, PostgreSQL will log the execution plan. The query plan might look something like this:

Hash Join  (cost=100.25..350.75 rows=50 width=96)
  Hash Cond: (recipe_ingredients.recipe_id = recipes.recipe_id)
  ->  Hash Join  (cost=50.12..200.37 rows=50 width=60)
        Hash Cond: (recipes.chef_id = chefs.chef_id)
        ->  Seq Scan on recipes  (cost=0.00..100.00 rows=50 width=24)
              Filter: (cuisine = 'Italian'::text)
        ->  Hash  (cost=30.00..30.00 rows=1000 width=36)
              ->  Seq Scan on chefs  (cost=0.00..30.00 rows=1000 width=36)
  ->  Hash  (cost=30.00..30.00 rows=1000 width=36)
        ->  Seq Scan on recipe_ingredients  (cost=0.00..30.00 rows=1000 width=36)
              Filter: (recipe_id IS NOT NULL)

In this query plan Hash Join indicates a join operation using a hash-based algorithm. Seq Scan signifies a sequential scan of the table, which might imply a full table scan. Hash Cond shows the join condition for the respective hash join.

cost represents the estimated execution cost for each operation, and rows indicates the estimated number of rows returned by each operation.

The estimated cost in PostgreSQL query execution plans is typically represented in an abstract unit known as “cost units.” These cost units are used for relative cost estimation and are not expressed in any specific real-world measurement like time or money. They are designed to provide a relative measure of the cost of different query plan operations so that the query planner can make informed decisions about which plan to choose.

Reading this plan, PostgreSQL starts by filtering Italian recipes (a Seq Scan with a filter). It then joins the recipes with chefs using a hash join, and the result is further joined with ingredients using another hash join. The cost values provide relative estimates of resource usage, allowing you to identify potentially expensive parts of the query, and you can consider improving the performance of the SQL statement with optimisations like indexing.

Table inheritance in PostgreSQL

In a previous post I have written about tables as data types in PostgreSQL. In addition to that, PostgreSQL has a feature similar to the inheritance mechanism of data types in object-oriented programming: table inheritance.

Table inheritance allows you to create a hierarchical relationship between tables. With this feature you define a parent table, and child tables inherit columns and some constraints (like CHECK constraints and NOT NULL constraints) from it.

How it works

To begin, we create the parent table products using the following SQL code:

CREATE TABLE products (
  id     SERIAL         PRIMARY KEY,
  name   TEXT           NOT NULL,
  price  DECIMAL(10, 2) NOT NULL
);

Next, let’s create child tables that inherit from the products table. We will create two product categories: electronics and clothing. Each child table will have its own specific columns in addition to the inherited columns. To set up the inheritance relation we use the INHERITS keyword:

CREATE TABLE electronics (
  warranty_period   INT,
  power_consumption INT
) INHERITS (products);

CREATE TABLE clothing (
  size  TEXT,
  color TEXT
) INHERITS (products);

Now we insert some rows into the child tables:

INSERT INTO electronics
  (name, price, warranty_period, power_consumption)
  VALUES ('Laptop', 1439.0, 2, 536);

INSERT INTO clothing
  (name, price, size, color)
  VALUES ('T-Shirt', 5.99, 'L', 'red');

When we query each child table individually, we only get the rows we inserted into it, as expected. However, when we query the parent table, we get all the products from the child the tables:

SELECT * FROM products;

id	name	 price
--------------------
1	Laptop	 1439.00
2	T-Shirt	 5.99

Note that you can also insert rows into the parent table:

INSERT INTO products
  (name, price)
  VALUES ('Milk', 1.15);

This row will not show up in the child tables, only when querying the parent table. If you are asking yourself now if there is an equivalent to an “abstract class”, there isn’t one. Just don’t insert into the parent table if it doesn’t make sense in your domain.

If you want to query rows only from the parent table you can use the ONLY keyword:

SELECT name, price FROM ONLY products;

This keyword works also with other commands like UPDATE and DELETE.

On a final note, you can also remove the inheritance relationship from a child table with the NO INHERIT clause:

ALTER TABLE clothing NO INHERIT products;

PostgreSQL’s hstore module for semi-structured data

PostgreSQL has an extension module called hstore that allows you to store semi-structured data in a key/value format. Values ​​of an hstore object are stored like in a dictionary. You can also reference its values in SQL queries.

To use the extension, it must first be loaded into the current database:

CREATE EXTENSION hstore;

Now you can use the data type hstore. Here, we create a table with some regular columns and one column of type hstore:

CREATE TABLE animals (
    id     serial PRIMARY KEY,
    name   text,
    props  hstore
);

Literals of type hstore are written in single quotes, containing a set of key => value pairs separated by commas:

INSERT INTO
    animals (name, props)
VALUES
    ('Octopus', 'arms => 8, habitat => sea, color => varying'),
    ('Cat',     'legs => 4, fur => soft'),
    ('Bee',     'legs => 6, wings => 4, likes => pollen');

The order of the pairs is irrelevant. Keys within a hstore are unique. If you declare the same key more than once only one instance will be kept and the others will be discarded. You can use double quotes to include spaces or special characters:

'"fun-fact" => "Cats sleep for around 13 to 16 hours a day (70% of their life)"'

If the type of the literal can’t be inferred you can append ::hstore as a type indicator:

'legs => 4, fur => soft'::hstore

Both keys and values are stored as strings, so these two are equivalent:

'legs => 4, fur => soft'
'"legs" => "4", "fur" => "soft"'

Another limitation of hstore values is that they cannot be nested, which means they are less powerful than JSON objects.

You can use the -> operator to dereference a key, for example in a SELECT:

SELECT
    name, props->'legs' AS number_of_legs
FROM
    animals;

It returns NULL if the key is not present. Of course, you can also use it in a WHERE clause:

SELECT * FROM animals WHERE props->'fur' = 'soft';

There are many other operators and functions that can be used with hstore objects. Here is a small selection (please refer to the documentation for a complete list):

  • The || operator concatenates (merges) two hstores: a || b
  • The ? operator checks the existence of a key and returns a boolean value: props ? 'fur'
  • The - operator deletes a key from a hstore: props - 'fur'
  • The akeys function returns an array of a hstore’s keys: akeys(hstore)

You can also convert a hstore object to JSON: hstore_to_json(hstore). If you want to learn more about JSON in PostgreSQL you can continue reading this blog post: Working with JSON data in PostgreSQL

Copying and moving rows between tables in PostgreSQL

In this article, I’ll show some helpful tips for copying and moving data between database tables in PostgreSQL.

Copying

The simplest operation is copying rows from one table to another table. The associated SQL query is known to most. You can simply combine an INSERT with a SELECT:

INSERT INTO short_books
  SELECT *
    FROM books
    WHERE pages < 50;

Of course, if you want to copy a complete table, you must first create the target table with the same columns. Instead of just repeating the original CREATE TABLE with all the column definitions with a different name, there is a shortcut in the form of CREATE TABLE … LIKE.

CREATE TABLE books_copy (LIKE books);

If you want the copy to inherit all constraints, indices and defaults of the source table you can add INCLUDING ALL:

CREATE TABLE books_copy (LIKE books INCLUDING ALL);

Instead of executing a CREATE TABLE first and then an INSERT, you can also directly combine CREATE TABLE with a SELECT:

CREATE TABLE books_copy AS
  SELECT * FROM books;

Moving

The direct method of moving specific rows from one table to another table is a bit less known. You can of course first copy the rows into the target table and then delete the rows from the source table. However, this is also possible with just one statement, in one go. To do this, you need to know the RETURNING clause. It can be appended to a DELETE or UPDATE statement and causes the affected rows to be returned as the result set after the respective action:

DELETE FROM books
  WHERE pages < 50
  RETURNING
    title, author, pages;

This can be used in combination with the WITH … AS clause to move rows between tables with just one SQL statement:

WITH selection AS (
  DELETE FROM books
  WHERE pages < 50
  RETURNING *
)
INSERT INTO short_books
  SELECT * FROM selection;

The function of WITH can be thought of as defining a named temporary view that can only be used in the current statement.