Temporary Tables in PostgreSQL and Oracle

Temporary tables are useful when you need to store data only for a short time while running SQL statements. Instead of writing one very large query, you can break the work into steps and store intermediate results in a temporary table.

PostgreSQL and Oracle both support temporary tables, but they implement them differently. This post explains how temporary tables work in both systems and highlights the main differences between them.

What Is a Temporary Table?

A temporary table stores data that is only needed for a limited time, typically during a session or a transaction. Typical situations include saving intermediate query results, preparing data before inserting it into other tables, simplifying complex SQL statements, or staging data during imports or batch jobs.

Temporary tables behave like normal tables in many ways. You can insert, update, and query them using standard SQL. The main difference is that their lifetime is limited and they are automatically cleaned up later.

Temporary Tables in PostgreSQL

In PostgreSQL, the table itself is temporary. The table is created during a session and automatically removed when the session ends.

This means temporary tables are usually created inside scripts, functions, or interactive sessions whenever they are needed. There is no need to define them permanently in the database schema.

Here’s how to create a temporary table:

CREATE TEMP TABLE temp_orders (
    id     INT,
    amount NUMERIC
);

You can use the table like any normal table.

INSERT INTO temp_orders VALUES (1, 100);

SELECT * FROM temp_orders;

When the session ends, the table is automatically dropped. Another useful property is that different sessions can create temporary tables with the same name without interfering with each other.

Transaction Options

PostgreSQL also supports an ON COMMIT clause that controls what happens when a transaction commits.

Example:

CREATE TEMP TABLE temp_orders (
    id INT
) ON COMMIT DELETE ROWS;

Possible options are ON COMMIT PRESERVE ROWS (the default), ON COMMIT DELETE ROWS, and ON COMMIT DROP.

Example:

CREATE TEMP TABLE temp_orders (
    id INT
) ON COMMIT DROP;

After the commit, the table itself is removed.

Temporary Tables in Oracle

Oracle implements temporary tables differently. Oracle provides Global Temporary Tables. The key idea is that the table definition is permanent, but the data stored in the table is temporary. Usually the table is created once as part of the database schema and reused by applications.

Example:

CREATE GLOBAL TEMPORARY TABLE temp_orders (
  id     NUMBER,
  amount NUMBER
)
ON COMMIT DELETE ROWS;

Like PostgreSQL, Oracle also supports ON COMMIT clauses. They determine when the rows disappear. However, Oracle does not support ON COMMIT DROP.

Example: Working with Intermediate Results

Temporary tables are often used to store intermediate query results. For example, suppose you want to find customers with the highest total spending.

First step:

CREATE TEMP TABLE customer_totals AS
  SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id;

Second step:

SELECT *
  FROM customer_totals
  ORDER BY total_spent DESC
  LIMIT 10;

Splitting a task into multiple steps like this can make queries easier to read and maintain.

Why PostgreSQL Temporary Tables Are Often More Practical

The PostgreSQL approach is often more flexible in everyday work. Because the table itself is temporary, it can be created exactly when it is needed and disappears automatically afterwards. This keeps the database schema cleaner because temporary helper tables do not remain in the system permanently.

This model also works well for scripts, reporting tasks, and data processing jobs. A script can create the temporary tables it needs, perform several processing steps, and then end the session. When the session ends, the database removes the tables automatically without any extra cleanup.

In contrast, Oracle requires the temporary table structure to exist permanently in the schema. Even though the stored data is temporary, the table itself remains part of the database design. Over time this can lead to many helper tables that exist only to support certain procedures or batch jobs.

Another advantage of the PostgreSQL approach is flexibility. Developers can define temporary tables with different structures whenever they are needed, without changing the permanent schema. This can make development, testing, and data exploration much easier.

For these reasons, I find the PostgreSQL model more natural and convenient, especially for ad-hoc queries, data analysis, and multi-step data processing tasks.

Schemas, naming and search path in PostgreSQL

Modern (SQL) databases provide a multi-level hierarchy of separated contexts.

A database management system (DBMS) can manage multiple databases.

A database can contain multiple schemas.

A schema usually contains multiple tables, views, sequences and other objects.

Most of the time we developers only care about our database and the objects it contains – ignoring schemas and the DBMS.

In PostgreSQL instances this means we are using the default schema – usually called public – without mentioning it anywhere. Neither in the connection string or in any queries.

Ok, and why does all of the above even matter?

Special situations

Sometimes customers or operators require us to use special schemas in certain databases. When we do not have full control over our database deployment and usage we have to adapt to the situation we encounter.

Camel-casing

One thing I really advise against is using upper case names for tables, columns and so on. SQL itself is not case sensitive but many DBMSes differentiate case when it comes to naming. So they require you to use double-quotes to reference mixed-case objects like schemas and tables, e.g. "MyImportantTable". Imho it is far better to use only lower case letters and use snake_case for all names.

Multiple schemas in one database

I do not endorse using multiple names schemas in one database. Most of the time you do not need this level of separation and can simply use multiple databases in the same DBMS with their default schemas. That way you do not have to specify the schema in your queries.

What if you are required to use a non-default schema? At least for PostgreSQL you can modify your login role to change the search path. So you can leave all your queries and maybe other deployments untouched and without some schema name scattered all around your project:

-- instead of specifying schema name like
select * from "MyTargetSchema".my_table;

-- you can alter the search path of your role
alter role my_login set search_path = public,"MyTargetSchema";

-- to only write
select * from my_table;

Wrapping it up

Today’s software systems are quite complex and the problems we are trying to solve with them have their own, ever growing complexity. So when using DBMSes follow common advice like the above to reduce complexity in your solution even if it seems to be mandated.

Sometimes there are easy to follow conventions or configuration options that can reduce your burden as a developer even if you do not have complete control over the relevant parts of the deployment environment.

Common SQL Performance Gotchas in Application Development

When building apps that use a SQL database, it’s easy to run into performance problems without noticing. Many of these issues come from the way queries are written and used in the code. Below are seven common SQL mistakes developers make, why they happen, and how you can avoid them.

Not Using Prepared Statements

One of the most common mistakes is building SQL queries by concatenating strings. This approach not only introduces the risk of SQL injection but also prevents the database from reusing execution plans. Prepared statements or parameterized queries let the database understand the structure of the query ahead of time, which improves performance and security. They also help avoid subtle bugs caused by incorrect string formatting or escaping.

// Vulnerable and inefficient
String userId = "42";
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM users WHERE id = " + userId);
// Safe and performant
String sql = "SELECT * FROM users WHERE id = ?";
PreparedStatement ps = connection.prepareStatement(sql);
ps.setInt(1, 42);
ResultSet rs = ps.executeQuery();

The N+1 Query Problem

The N+1 problem happens when an application fetches a list of items and then runs a separate query for each item to retrieve related data. For example, fetching a list of users and then querying each user’s posts in a loop. This results in one query to fetch the users and N additional queries for their posts. The fix is to restructure the query using joins or batch-fetching strategies, so all the data can be retrieved in fewer queries.

We have written about it on our blog before: Understanding, identifying and fixing the N+1 query problem

Missing Indexes

When queries filter or join on columns that do not have indexes, the database may need to scan entire tables to find matching rows. This can be very slow, especially as data grows. Adding the right indexes can drastically improve performance. It’s important to monitor slow queries and check whether indexes exist on the columns used in WHERE clauses, JOINs, and ORDER BY clauses.

Here’s how to create an index on an “orders” table for its “customer_id” column:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Once the index is added, the query can efficiently find matching rows without scanning the full table.

Retrieving Too Much Data

Using SELECT * to fetch all columns from a table is a common habit, but it often retrieves more data than the application needs. This can increase network load and memory usage. Similarly, not using pagination when retrieving large result sets can lead to long query times and a poor user experience. Always select only the necessary columns and use LIMIT or OFFSET clauses to manage result size.

For example:

String sql = "SELECT id, name, price FROM products LIMIT ? OFFSET ?";
PreparedStatement ps = connection.prepareStatement(sql);
ps.setInt(1, 50);
ps.setInt(2, 0);
ResultSet rs = ps.executeQuery();

Chatty Database Interactions

Some applications make many small queries in a single request cycle, creating high overhead from repeated database access. Each round-trip to the database introduces latency. Here’s an inefficient example:

for (int id : productIds) {
    PreparedStatement ps = connection.prepareStatement(
        "UPDATE products SET price = price * 1.1 WHERE id = ?"
    );
    ps.setInt(1, id);
    ps.executeUpdate();
}

Instead of issuing separate queries, it’s often better to combine them or use batch operations where possible. This reduces the number of database interactions and improves overall throughput:

PreparedStatement ps = connection.prepareStatement(
    "UPDATE products SET price = price * 1.1 WHERE id = ?"
);

for (int id : productIds) {
    ps.setInt(1, id);
    ps.addBatch();
}
ps.executeBatch();

Improper Connection Pooling

Establishing a new connection to the database for every query or request is slow and resource-intensive. Connection pooling allows applications to reuse database connections, avoiding the cost of repeatedly opening and closing them. Applications that do not use pooling efficiently may suffer from connection exhaustion or high latency under load. To avoid this use a connection pooler and configure it with appropriate limits for the workload.

Unbounded Wildcard Searches

Using wildcard searches with patterns like '%term%' in a WHERE clause causes the database to scan the entire table, because indexes cannot be used effectively. These searches are expensive and scale poorly. To handle partial matches more efficiently, consider using full-text search features provided by the database, which are designed for fast text searching. Here’s an example in PosgreSQL:

SELECT * FROM articles
WHERE to_tsvector('english', title) @@ to_tsquery('database');

One of our previous blog posts dives deeper into this topic: Full-text Search with PostgreSQL.

By being mindful of these common pitfalls, you can write SQL that scales well and performs reliably under load. Good database performance isn’t just about writing correct queries – it’s about writing efficient ones.

Have you faced any of these problems before? Every project is different, and we all learn a lot from the challenges we run into. Feel free to share your experiences or tips in the comments. Your story could help someone else improve their app’s performance too.

Using GENERATED AS IDENTITY Instead of SERIAL in PostgreSQL

In PostgreSQL, the SERIAL keyword is commonly used to create auto-incrementing primary keys. While it remains supported and functional, newer versions of PostgreSQL (version 10 and later) offer a more standardized and flexible alternative: the GENERATED … AS IDENTITY syntax.

Limitations of SERIAL

When you define a column as SERIAL, PostgreSQL automatically creates and links a sequence to that column behind the scenes. But this linkage is not explicitly part of the table definition. This can complicate schema management and make the behavior of the column less transparent.

The SERIAL keyword is also not part of the official SQL standard, which may be a concern in environments where cross-database compatibility is important. Additionally, the column remains writable, meaning it’s possible to insert values manually, potentially leading to inconsistencies or conflicts.

Identity Columns

The GENERATED … AS IDENTITY syntax addresses these concerns by making the auto-increment behavior explicit and standards-compliant. An identity column is defined as follows:

CREATE TABLE users (
  id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  username TEXT NOT NULL
);

This syntax makes it clear that the column is managed by the system. PostgreSQL offers two modes for identity columns:

GENERATED ALWAYS: PostgreSQL always generates a value. Manual insertion requires an override.

GENERATED BY DEFAULT: The application can supply a value, or PostgreSQL will use the next sequence value automatically.

To insert a value manually into an ALWAYS identity column, you must use the OVERRIDING SYSTEM VALUE clause:

INSERT INTO users (id, username)
  VALUES (999, 'admin') OVERRIDING SYSTEM VALUE;

Managing Sequences

Since identity columns integrate the sequence into the column definition, managing them is more straightforward. For example, to reset the sequence:

ALTER TABLE users ALTER COLUMN id RESTART WITH 1000;

The sequence is tied to the column, making it easier to inspect, back up, and restore using tools like pg_dump. This helps avoid issues that can arise with the implicit sequences used by SERIAL.

Conclusion

The GENERATED AS IDENTITY syntax offers clearer semantics, better standards compliance, and more predictable behavior than SERIAL. For new database designs, it is generally the preferred choice. While SERIAL continues to be supported, identity columns provide more transparency and control, especially in environments where portability and schema clarity are important.

PostgreSQL’s Foreign Data Wrappers

When people think of SQL and PostgreSQL, they usually picture databases full of rows and columns – clean, organized, and stored on a persistent volume like a disk. But what if your data isn’t in a database at all? What if it lives on the web, behind a REST API, in JSON format?

Usually, you’d write a script. You’d use Python or JavaScript, send a request to the API, parse the JSON, and maybe insert the results into your database for analysis. But PostgreSQL has a shortcut that many people don’t know about: Foreign Data Wrappers.

Just like FUSE (Filesystem in Userspace) lets you mount cloud drives or remote folders as if they were local, Foreign Data Wrappers lets PostgreSQL mount external resources like other database management systems, files, or web APIs and treat them like SQL tables.

Example

In this article we’ll use http_fdw as an example. With it, you can run a SQL query against a URL and read the result – no extra code, no data pipeline, just SQL. Here’s how you could set that up.

Let’s say you want to explore data from JSONPlaceholder, a free fake API for testing. It has a /posts endpoint that returns a list of blog posts in JSON format.

First, make sure the extension is installed (this may require building from source, depending on your system):

CREATE EXTENSION http_fdw;

Now create a foreign server that points to the API:

CREATE SERVER json_api_server
  FOREIGN DATA WRAPPER http_fdw
  OPTIONS (uri 'https://jsonplaceholder.typicode.com/posts', format 'json');

Then define a foreign table that maps to the shape of the JSON response:

CREATE FOREIGN TABLE api_posts (
  userId integer,
  id     integer,
  title  text,
  body   text
)
SERVER json_api_server
OPTIONS (rowpath '');

Since the API returns an array of posts, and each one is an object with userId, id, title, and body, this table matches that structure.

Now you can query it:

SELECT title FROM api_posts WHERE userId = 1;

Behind the scenes, PostgreSQL sends a GET request to the API, parses the JSON, and returns the result like any other table.

Things to Keep in Mind

Not everything is perfect. http_fdw is read-only, so you can’t use it to send data back to the API. It also relies on the API being available and responsive, and it doesn’t support authentication out of the box – you’ll need to handle that with custom headers if the API requires it. Complex or deeply nested JSON might also require some extra configuration.

But for many use cases, it’s an interesting option to work with external data. You don’t have to leave SQL. You don’t have to wire up a data pipeline. You just run a query.

What are GIN Indexes in PostgreSQL?

If you’ve worked with PostgreSQL and dealt with things like full-text search, arrays, or JSON data, you might have heard about GIN indexes. But what exactly are they, and why are they useful?

GIN stands for Generalized Inverted Index. Most indexes (like the default B-tree index) work best when there’s one clear value per row – like a number or a name. But sometimes, a single column can hold many values. Think of a column that stores a list of tags, words in a document, or key-value data in JSON. That’s where GIN comes in.

Let’s walk through a few examples to see how GIN indexes work and why they’re helpful.

Full-text search example

Suppose you have a table of articles:

CREATE TABLE articles (
  id    serial PRIMARY KEY,
  title text,
  body  text
);

You want to let users search the content of these articles. PostgreSQL has built-in support for full-text search, which works with a special data type called tsvector. To get started, you’d add a column to store this processed version of your article text:

ALTER TABLE articles ADD COLUMN tsv tsvector;
UPDATE articles SET tsv = to_tsvector('english', body);

Now, to speed up searches, you create a GIN index:

CREATE INDEX idx_articles_tsv ON articles USING GIN(tsv);

With that in place, you can search for articles quickly:

SELECT * FROM articles WHERE tsv @@ to_tsquery('tonic & water');

This finds all articles that contain both “tonic” and “water”, and thanks to the GIN index, it’s fast – even if you have thousands of articles.

Array example

GIN is also great for columns that store arrays. Let’s say you have a table of photos, and each photo can have several tags:

CREATE TABLE photos (
  id   serial PRIMARY KEY,
  tags text[]
);

You want to find all photos tagged with “capybara”. You can create a GIN index on the tags column:

CREATE INDEX idx_photos_tags ON photos USING GIN(tags);
SELECT * FROM photos WHERE tags @> ARRAY['capybara'];

(The @> operator means “contains” or “is a superset of”.)

The index lets PostgreSQL find matching rows quickly, without scanning the entire table.

JSONB example

PostgreSQL’s jsonb type lets you store flexible key-value data. Imagine a table of users with extra info stored in a jsonb column:

CREATE TABLE users (
  id   serial PRIMARY KEY,
  data jsonb
);

One row might store {"age": 42, "city": "Karlsruhe"}. To find all users from New York, you can use:

SELECT * FROM users WHERE data @> '{"city": "Karlsruhe"}';

And again, with a GIN index on the data column, this query becomes much faster:

CREATE INDEX idx_users_data ON users USING GIN(data);

Things to keep in mind

GIN indexes are very powerful, but they come with some tradeoffs. They’re slower to build and can make insert or update operations a bit heavier. So they’re best when you read (search) data often, but don’t write to the table constantly.

In short, GIN indexes are your friend when you’re dealing with columns that contain multiple values – like arrays, full-text data, or JSON. They let PostgreSQL break apart those values and build a fast lookup system. If your queries feel slow and you’re working with these kinds of columns, adding a GIN index might be exactly what you need.

Inline and Implicit Foreign Key Constraints in SQL

Foreign key constraints are a key part of database design, ensuring that relationships between tables are consistent and reliable. They create a relationship between two tables, ensuring that data matches across them. For example, a column in an “Orders” table (like CustomerID) might reference a column in a “Customers” table. This guarantees that every order belongs to a valid customer.

In earlier versions of SQL systems, defining foreign key constraints often required separate ALTER TABLE statements after the table was created:

CREATE TABLE Orders (
  OrderID    INT PRIMARY KEY,
  CustomerID INT NOT NULL,
  OrderDate  DATE
);

ALTER TABLE Orders
ADD CONSTRAINT FK_Customer FOREIGN KEY (CustomerID)
REFERENCES Customers(CustomerID);

This two-step process was prone to errors and required careful management to ensure all constraints were applied correctly.

Inline Foreign Key Constraints

Most of the popular SQL database systems – PostgreSQL, Oracle, SQL Server, and MySQL since version 9.0, released in July 2024 – now support inline foreign key constraints. This means you can define the relationship directly in the column definition, making table creation easier to read:

CREATE TABLE Orders (
  OrderID    INT PRIMARY KEY,
  CustomerID INT NOT NULL REFERENCES Customers(CustomerID),
  OrderDate  DATE
);

Fortunately, this syntax is the same across these systems. However, MySQL 9 additionally supports implicit foreign key constraints:

CREATE TABLE Orders (
  OrderID    INT PRIMARY KEY,
  CustomerID INT NOT NULL REFERENCES Customers,
  OrderDate  DATE
);

By leaving out the (CustomerID) in the REFERENCES clause it will assume that you want to reference the primary key of the parent table. This syntax is unique to MySQL, and you should avoid it if you need to write SQL DDL statements that works across multiple database systems.

Dependent Subqueries and LATERAL in PostgreSQL

When working with databases, you often need to run a query that depends on results from another query. PostgreSQL offers two main ways to handle this: Dependent Subqueries and LATERAL joins.

A dependent subquery is like a mini-query inside another query. The inner query (subquery) depends on the outer query for its input. For every row in the outer query, the subquery runs separately.

Imagine you have two tables: customers, which holds information about customers (e.g., their id and name), and orders, which holds information about orders (e.g., customer_id, order_date, and amount). Now, you want to find the latest order date for each customer. You can use a dependent subquery like this:

SELECT id AS customer_id,
  name,
  (SELECT order_date
     FROM orders
    WHERE customer_id=customers.id
    ORDER BY order_date DESC
    LIMIT 1) AS latest_order_date
FROM customers;

For each customer in the customers table, the subquery looks for their orders in the orders table. The subquery sorts the orders by date (ORDER BY order_date DESC) and picks the most recent one (LIMIT 1).

This works, but it has a drawback: If you have many customers, this approach can be slow because the subquery runs once for every customer.

LATERAL join

A LATERAL join is a smarter way to solve the same problem. It allows you to write a subquery that depends on the outer query, but instead of repeating the subquery for every row, PostgreSQL handles it more efficiently.

Let’s solve the “latest order date” problem using LATERAL:

SELECT
  c.id AS customer_id,
  c.name,
  o.order_date AS latest_order_date
FROM customers c
LEFT JOIN LATERAL (
  SELECT order_date
   FROM orders
  WHERE orders.customer_id=c.id
  ORDER BY order_date DESC
  LIMIT 1
) o ON TRUE;

For each customer (c.id), the LATERAL subquery finds the latest order in the orders table. The LEFT JOIN ensures that customers with no orders are included in the results, with NULL for the latest_order_date.

It’s easier to read and often faster, especially for large datasets, because PostgreSQL optimizes it better.

Both dependent subqueries and LATERAL joins allow you to handle scenarios where one query depends on the results of another. Dependent subqueries are straightforward and good enough for simple tasks with small datasets. However, you should consider using LATERAL for large datasets where performance matters.

Working with JSON-DOM mapping in EntityFramework and PostgreSQL

A while ago, one of my colleagues covered JSON usage in PostgreSQL on the database level in two interesting blog posts (“Working with JSON data in PostgreSQL” and “JSON as a table in PostgreSQL 17”).

Today, I want to show the usage of JSON in EntityFramework with PostgreSQL as the database. We have an event sourcing application similar to the one in my colleagues first blog post written in C#/AspNetCore using EntityFramework Core (EF Core). Fortunately, EF Core and the PostgreSQL database driver have relatively easy to use JSON support.

You have essentially three options when working with JSON data and EF Core:

  1. Simple string
  2. EF owned entities
  3. System.Text.Json DOM types

Our event sourcing use case requires query support on the JSON data and the data has no stable and fixed schema, so the first two options are not really appealing. For more information on them, see the npgsql documentation.

Let us have a deeper look at the third option which suits our event sourcing use-case best.

Setup

The setup is ultra-simple. Just declare the relevant properties in your entities as JsonDocument and make them disposable:

public class Event : IDisposable
{
    public long Id { get; set; }

    public DateTime Date { get; set; }
    public string Type { get; set; }
    public JsonDocument Data { get; set; }
    public string Username { get; set; }
    public void Dispose() => Json?.Dispose();
}

Using dotnet ef migrate EventJsonSupport should generate changes for the database migrations and the database context. Now we are good to start querying and deserializing our JSON data.

Saving our events to the database does not require additional changes!

Writing queries using JSON properties

With this setup we can use JSON properties in our LINQ database queries like this:

var eventsForId = db.Events.Where(ev =>
  ev.Data.RootElement.GetProperty("payload").GetProperty("id").GetInt64() == id
)
.ToList();

Deserializing the JSON data

Now, that our entities contain JsonDocument (or JsonElement) properties, we can of course use the System.Text.Json API to create our own domain objects from the JSON data as we need it:

var eventData = event.Data.RootElement.GetProperty("payload");
return new HistoryEntry
{
    Timestamp = eventData.Date,
    Action = new Action
    {
        Id = eventData.GetProperty("id").GetInt64(),
        Action = eventData.GetProperty("action").GetString(),
    },
    Username = eventData.Username,
};

We could for example deserialize different domain object depending on the event type or deal with evolution of our JSON data over time to accomodate new features or refactorings on the data side.

Conclusion

Working with JSON data inside a classical application using an ORM and a relational database has become suprisingly easy and efficient. The times of fragile full-text queries using LIKE or similar stuff to find your data are over!

JSON as a table in PostgreSQL 17

Some time ago, I described in a blog post how to work with JSON data in a PostgreSQL database. Last month (September 2024), PostgreSQL 17 was released, which offers another feature for working with JSON data: the JSON_TABLE() function. Such a function already existed in other database systems, such as MySQL and Oracle.

The core idea behind JSON_TABLE() is to temporarily transform JSON data into a relational format that can be queried using standard SQL commands. In doing so, developers can apply the full power of SQL, such as filtering, sorting, aggregating, and joining, to data that originally exists in JSON format: It enables you to use SQL to query JSON data stored in a column as though it were relational data, and you can join JSON data with regular relational tables for a seamless mix of structured and semi-structured data.

Syntax

The syntax for JSON_TABLE looks a bit complex at first glance, but once you break it down, it becomes quite clear. Here’s the basic structure:

JSON_TABLE(
  json_doc, path
  COLUMNS (
    column_name type PATH 'path_to_value' [DEFAULT default_value ON ERROR],
    ...
  )
) AS alias_name

The json_doc is the JSON data you’re querying. It can be a JSON column or a JSON string. It is followed by a path expression, which describes the location in the JSON document that you want to process. Path expressions are specified as strings in single quotation marks and have a special syntax called JSONPath (loosely inspired by XPath for XML). For example, such an expression could look like this: '$.store.book[0].title'. The dollar sign represents the root of the JSON tree, the sub-properties are separated by dots, and array elements are referenced using square brackets.

This is followed by the COLUMNS keyword. It specifies the columns to be extracted from the JSON document. Each column is specified by a name of the column in the resulting table, a data type of the extracted value (e.g., VARCHAR, INT), followed by the PATH keyword and a JSON path expression that references a JSON value below the original path expression.

The DEFAULT keyword optionally provides a default value if the path does not exist or if there’s an error during extraction.

After the JSON_TABLE function call you can specify the alias name for the table with the AS keyword.

Example

It’s time for an example. Let’s assume we have a table named fruit_store with a column named json_details that holds the some JSON data about fruits:

INSERT INTO fruit_store (id, json_details) VALUES (
    1,
    '{
        "category": "Citrus",
        "fruits": [
            {
                "name": "Orange",
                "color": "Orange",
                "taste": "Sweet",
                "price_per_kg": 3.5
            },
            {
                "name": "Lemon",
                "color": "Yellow",
                "taste": "Sour",
                "price_per_kg": 4.0
            }
        ]
    }'
);

Now we can use the JSON_TABLE function to extract the details of each fruit from the JSON document. The goal is to retrieve the fruit name, color, taste, and price for each fruit in the array. Here’s the query:

SELECT *
  FROM fruit_store,
  JSON_TABLE(
    json_details, '$.fruits[*]'
    COLUMNS (
      fruit_name   VARCHAR(50)  PATH '$.name',
      fruit_color  VARCHAR(50)  PATH '$.color',
      fruit_taste  VARCHAR(50)  PATH '$.taste',
      price_per_kg DECIMAL(5,2) PATH '$.price_per_kg'
    )
  ) AS fruit_table;

Running this query will give you the following result:

idfruit_namefruit_colorfruit_tasteprice_per_kg
1OrangeOrangeSweet3.50
1LemonYellowSour4.00