What are GIN Indexes in PostgreSQL?

If you’ve worked with PostgreSQL and dealt with things like full-text search, arrays, or JSON data, you might have heard about GIN indexes. But what exactly are they, and why are they useful?

GIN stands for Generalized Inverted Index. Most indexes (like the default B-tree index) work best when there’s one clear value per row – like a number or a name. But sometimes, a single column can hold many values. Think of a column that stores a list of tags, words in a document, or key-value data in JSON. That’s where GIN comes in.

Let’s walk through a few examples to see how GIN indexes work and why they’re helpful.

Full-text search example

Suppose you have a table of articles:

CREATE TABLE articles (
  id    serial PRIMARY KEY,
  title text,
  body  text
);

You want to let users search the content of these articles. PostgreSQL has built-in support for full-text search, which works with a special data type called tsvector. To get started, you’d add a column to store this processed version of your article text:

ALTER TABLE articles ADD COLUMN tsv tsvector;
UPDATE articles SET tsv = to_tsvector('english', body);

Now, to speed up searches, you create a GIN index:

CREATE INDEX idx_articles_tsv ON articles USING GIN(tsv);

With that in place, you can search for articles quickly:

SELECT * FROM articles WHERE tsv @@ to_tsquery('tonic & water');

This finds all articles that contain both “tonic” and “water”, and thanks to the GIN index, it’s fast – even if you have thousands of articles.

Array example

GIN is also great for columns that store arrays. Let’s say you have a table of photos, and each photo can have several tags:

CREATE TABLE photos (
  id   serial PRIMARY KEY,
  tags text[]
);

You want to find all photos tagged with “capybara”. You can create a GIN index on the tags column:

CREATE INDEX idx_photos_tags ON photos USING GIN(tags);
SELECT * FROM photos WHERE tags @> ARRAY['capybara'];

(The @> operator means “contains” or “is a superset of”.)

The index lets PostgreSQL find matching rows quickly, without scanning the entire table.

JSONB example

PostgreSQL’s jsonb type lets you store flexible key-value data. Imagine a table of users with extra info stored in a jsonb column:

CREATE TABLE users (
  id   serial PRIMARY KEY,
  data jsonb
);

One row might store {"age": 42, "city": "Karlsruhe"}. To find all users from New York, you can use:

SELECT * FROM users WHERE data @> '{"city": "Karlsruhe"}';

And again, with a GIN index on the data column, this query becomes much faster:

CREATE INDEX idx_users_data ON users USING GIN(data);

Things to keep in mind

GIN indexes are very powerful, but they come with some tradeoffs. They’re slower to build and can make insert or update operations a bit heavier. So they’re best when you read (search) data often, but don’t write to the table constantly.

In short, GIN indexes are your friend when you’re dealing with columns that contain multiple values – like arrays, full-text data, or JSON. They let PostgreSQL break apart those values and build a fast lookup system. If your queries feel slow and you’re working with these kinds of columns, adding a GIN index might be exactly what you need.

JSON as a table in PostgreSQL 17

Some time ago, I described in a blog post how to work with JSON data in a PostgreSQL database. Last month (September 2024), PostgreSQL 17 was released, which offers another feature for working with JSON data: the JSON_TABLE() function. Such a function already existed in other database systems, such as MySQL and Oracle.

The core idea behind JSON_TABLE() is to temporarily transform JSON data into a relational format that can be queried using standard SQL commands. In doing so, developers can apply the full power of SQL, such as filtering, sorting, aggregating, and joining, to data that originally exists in JSON format: It enables you to use SQL to query JSON data stored in a column as though it were relational data, and you can join JSON data with regular relational tables for a seamless mix of structured and semi-structured data.

Syntax

The syntax for JSON_TABLE looks a bit complex at first glance, but once you break it down, it becomes quite clear. Here’s the basic structure:

JSON_TABLE(
  json_doc, path
  COLUMNS (
    column_name type PATH 'path_to_value' [DEFAULT default_value ON ERROR],
    ...
  )
) AS alias_name

The json_doc is the JSON data you’re querying. It can be a JSON column or a JSON string. It is followed by a path expression, which describes the location in the JSON document that you want to process. Path expressions are specified as strings in single quotation marks and have a special syntax called JSONPath (loosely inspired by XPath for XML). For example, such an expression could look like this: '$.store.book[0].title'. The dollar sign represents the root of the JSON tree, the sub-properties are separated by dots, and array elements are referenced using square brackets.

This is followed by the COLUMNS keyword. It specifies the columns to be extracted from the JSON document. Each column is specified by a name of the column in the resulting table, a data type of the extracted value (e.g., VARCHAR, INT), followed by the PATH keyword and a JSON path expression that references a JSON value below the original path expression.

The DEFAULT keyword optionally provides a default value if the path does not exist or if there’s an error during extraction.

After the JSON_TABLE function call you can specify the alias name for the table with the AS keyword.

Example

It’s time for an example. Let’s assume we have a table named fruit_store with a column named json_details that holds the some JSON data about fruits:

INSERT INTO fruit_store (id, json_details) VALUES (
    1,
    '{
        "category": "Citrus",
        "fruits": [
            {
                "name": "Orange",
                "color": "Orange",
                "taste": "Sweet",
                "price_per_kg": 3.5
            },
            {
                "name": "Lemon",
                "color": "Yellow",
                "taste": "Sour",
                "price_per_kg": 4.0
            }
        ]
    }'
);

Now we can use the JSON_TABLE function to extract the details of each fruit from the JSON document. The goal is to retrieve the fruit name, color, taste, and price for each fruit in the array. Here’s the query:

SELECT *
  FROM fruit_store,
  JSON_TABLE(
    json_details, '$.fruits[*]'
    COLUMNS (
      fruit_name   VARCHAR(50)  PATH '$.name',
      fruit_color  VARCHAR(50)  PATH '$.color',
      fruit_taste  VARCHAR(50)  PATH '$.taste',
      price_per_kg DECIMAL(5,2) PATH '$.price_per_kg'
    )
  ) AS fruit_table;

Running this query will give you the following result:

idfruit_namefruit_colorfruit_tasteprice_per_kg
1OrangeOrangeSweet3.50
1LemonYellowSour4.00

JSON Web Token (JWT) and Security

General

JWT is an open standard for transmitting information as a JSON object. The most common scenario is authorization.

Unfortunately, the token keeps cropping up in connection with the security vulnerability. For example, it is mentioned in the OWASP top ten under the item “Broken Access Control”. In the following, I would like to briefly explain the JWT and point out a few risks when using it.

The token consists of three parts: the header, payload, and signature. Each part has been encoded with Base64Url, and all parts are joined together, separated by dots.

The type of the token and the signing algorithm is typically defined in the header.

{
  "alg": "HS256",
  "typ": "JWT"
}

The payload contains the information that should be transmitted. In cases of authorization, the user and permissions. It is also possible to define token metadata, like an expiration time. Such information must be checked by the developer. The token itself has no expiration.

{
  "exp": 1516242622,
  "name": "Max Mustermann",
  "admin": true
}

The signature take the encoded header, the encodes payload and a secret and encode it with the algorithm defined in header.

HMACSHA256(
  base64UrlEncode(header) + "." +
  base64UrlEncode(payload),
  secret)

In the authorization scenario, the token is requested from the client and created by the authorization server. The client uses the token to authorize a resource server. Therefore, the token is sent in the authorization header using the bearer schema.

Security Issues

JWT has a lot of attack surfaces, like the None Hashing Algorithm, ECDSA “Psychic Signatures”, Weak HMAC Keys, HMAC vs Public Key Confusion, Attacker Provided Public Key and the plaintext transmitting of the payload. All in all, JWT is vulnerable to replaying or tampering. Especially if the secret is not strong, the token does not expire, or no signature is used at all.

In the following, I will take a closer look at the none hashing algorithm issue as an example:

JWT allows you to set the algorithm in the header to none. So the signature part that tries to detect tampering is missing. It is just the first two parts and the point of waiting for a signature, which does not come.

eyJhbGciOiAibm9uZSIsICJ0eXAiOiAiSldUIn0K.eyJ1c2VybmFtZSI6ImFkbWluaW5pc3RyYXRvciIsImlzX2FkbWluIjp0cnVlLCJpYXQiOjE1MTYyMzkwMjIsImV4cCI6MTUxNjI0MjYyMn0.

All security precautions are already suspended, and the attacker can create his own tokens or modify intercepted ones at his leisure.

Some frameworks forbid this configuration, but in the JWT standard, it is possible. Although many applications and libraries that use JWT have disabled the option or at least tried to do so, there are still security vulnerabilities due to the configuration option. For example, because only case-sensitive “none”  was checked, settings like “None” or “NoNe” could still be used as an attack surface. So if you want to forbid this setting, it is important to do it case-insensitively.

Since this is a big problem that also occurs again and again, there is a website that counts the days since the last none algorithm problem: https://www.howmanydayssinceajwtalgnonevuln.com/.

The none algorithm is not the only attack surface on the algorithm usage. Most of the before named issues are related to the algorithm part.

How to use most secure

At best, the token should only be used once to confirm authentication and authorization. It should not be used for session management and has a short lifespan. Thus, the time span for attacks can be minimized. Also, the algorithm none option should be prevented, and a proper signature algorithm should be used. Additional information and configuration recommendations can be found under: https://jwt.io/ and OWASP cheatsheet.

Working with JSON data in PostgreSQL

Today most common SQL-based relational database management systems (DBMS) like PostgreSQL, MySQL, MariaDB, SQL Server and Oracle offer functionality to efficiently store and query JSON data in one form or another, with varying syntax. While a standard named SQL/JSON is in the works, it is not yet fully supported by all of these DBMS. This blog post is specific to PostgreSQL.

JSON data types

In PostgreSQL there are two data types for JSON columns: json and jsonb. The former stores JSON data as-is with any formatting preserved, while the latter stores JSON in a decomposed binary format. Operations on data in jsonb format are potentially more efficient.

We’ll use the jsonb data type in the following example to store a sequence of events, for example for an event sourcing based application, in a table.

CREATE TABLE events (date TIMESTAMP NOT NULL,
                     event JSONB NOT NULL);

JSON literals look like string literals. Let’s insert some events:

INSERT INTO events (date, event) VALUES
  (NOW(), '{"type": "add_shelf", "payload": {"id": 1}}'),
  (NOW(), '{"type": "add_book", "payload": {"title": "Ulysses", "shelf": 1}}'),
  (NOW(), '{"type": "add_book", "payload": {"title": "Moby Dick", "shelf": 1}}'),
  (NOW(), '{"type": "add_shelf", "payload": {"id": 2}}'),
  (NOW(), '{"type": "add_book", "payload": {"title": "Don Quixote", "shelf": 2}}');

Querying

PostgreSQL has two operators for navigating a JSON structure: -> and ->>. The former accesses an object field by key and the latter accesses an object field as text. These operators can be used both in the SELECT clause and the WHERE clause:

SELECT event->>'type' AS type FROM events;
type
add_shelf
add_book
add_book
add_shelf
SELECT event FROM events
        WHERE event->>'type'='add_book'
          AND event->'payload'->>'shelf'='1';
event
{"type":"add_book","payload":{"shelf":1,"title":"Ulysses"}}
{"type":"add_book","payload":{"shelf":1,"title":"Moby Dick"}}

Note that in the example above the value of "shelf" is compared to a string literal ('1'). In order to treat the value as a number we have to use the CAST function, and then we can use numerical comparison operators:

SELECT event FROM events
        WHERE CAST(
          event->'payload'->>'shelf' AS INTEGER
        ) > 1;
event
{"type":"add_book","payload":{"shelf":2,"title":"Don Quixote"}}

Updating

Updating JSON object fields is a bit more complicated. It is only possible with the jsonb data type and can be done via the JSONB_SET function, which takes four arguments:

1) the original JSON,
2) a path specifying which object fields should be updated,
3) a jsonb value, which is the new value, and
4) a boolean flag that specifies if missing fields should be created.

In this example we convert all the event "type" fields from lower case to upper case:

UPDATE events SET event=JSONB_SET(
  event,
  '{type}',
  TO_JSONB(UPPER(event->>'type')),
  false
);

PostgreSQL provides a lot more operators and functions for working with JSON data. This post only covered the most basic ones. See the PostgreSQL JSON reference for more.

Recap of the Schneide Dev Brunch 2016-12-11

If you couldn’t attend the Schneide Dev Brunch at 11th of December 2016, here is a summary of the main topics.

brunch64-borderedLast week at sunday, we held another Schneide Dev Brunch, a regular brunch on the second sunday of every other (even) month, only that all attendees want to talk about software development and various other topics. This brunch was so well-attended that we had to cramp around our conference table and gather all chairs on the floor. As usual, the main theme was that if you bring a software-related topic along with your food, everyone has something to share. Because we were so many, we established a topic list and an agenda for the event. As usual, a lot of topics and chatter were exchanged. This recapitulation tries to highlight the main topics of the brunch, but cannot reiterate everything that was spoken. If you were there, you probably find this list inconclusive:

Finland

We started with a report of one of our attendees who had studied in Finland for the last two years. He visited the Aalto university and shared a lot of cultural details about Finland and the Finnish people with us.

The two most important aspects of the report were sauna and singing. The Finnish love to visit a sauna, in fact, nearly every building has a functioning sauna. Every office building has a company sauna that will get visited often. So it might happen that your first visit of a company starts right in the sauna, naked with the bosses.

And the Finnish love singing so much that they usually start singing during the sauna session. There are open social events organized around singing together.

Alcohol plays a big role in Finland, mostly because the taxes makes it incredibly expensive to obtain a proper buzz. In the southern regions, much alcohol is imported from Russia or Estonia by ferry. There are even special ferry routes designed to be cost-neutral when shopping for alcohol. But alcohol isn’t the only thing that is made expensive with special taxes. Sugar and sugary food/drinks are heavily taxed, too. So it’s actually more expensive to eat unhealthy, which sounds like a good concept to counter some civilizational diseases.

The Finnish students often wear a special boilersuit during official events that identifies their affilition with their field of study and university. They apply patches and stickers to their suit when they have completed certain tasks or chores. It’s actually a lot like a military uniform with rank and campaign insignia. Only that the Finnish student boilersuit may not be cleaned or washed other than jumping into a body of water with you in it. And the Finnish lakes are frozen most of the year, with temperatures of -27 °C being nothing extraordinary.

As you probably have guessed right now, costs for rent and electricity are high. Our attendee enjoyed his time there, but is also glad to have the singing separated from the alcohol for the most part.

Lambdas and Concurrency

The next question revolved around the correlation between lambda expressions and concurrent execution of source code. The Vert.x framework relies heavily on lambdas and provides reactive programming patterns for Java. As such, it is event driven and non blocking. That makes it hard to debug or to reason about the backstory if an effect occurs in production. The traditional tools like stacktraces don’t tell the story anymore.

We took a deep dive into the concepts behind Optionals, Promises and Futures (but forgot to talk about the Expected type in C++). There is a lot of foggy implementation details in the different programming languages around these concepts and it doesn’t help that the Java Optional tries to be more than the C++ Optional, but doesn’t muster up the courage to be a full Monad. Whether deprecating the get()-method will make things better is open for discussion.

To give a short answer to a long discussion: Lambdas facilitate concurrent programming, but don’t require or imply it.

React.js and Tests

It was only a small step from the reactive framework Vert.x to the React.js framework in Javascript. One attendee reported his experiences with using different types of tests with the React framework. He also described the origin of the framework, mentioning the concept of Flux and Redux along the way.

Sorry if I’m being vague, but each written sentence about Javascript frameworks seem to have a halflife time of about six weeks. My take on the Javascript world is to lean back, grab some popcorn and watch the carnival from the terrace, because while we’re stuck with it forever, it is tragically unfortunate. Even presumed simple things like writing a correct parser for JSON end in nightmares.

It should be noted, though, that the vue.js framework entered the “assess” stage of the Thoughtworks Techradar, while AngularJS (or just Angular, as it should be called now) is in the “hold” stage.

Code Analysis

We also talked about source code analysis tools and plugins for the IDE. The gist of it seems to be that the products of JetBrains (especially the IntelliJ IDEA IDE) have all the good things readily included, while there are standalone products or plugins for other IDEs.

Epilogue

As usual, the Dev Brunch contained a lot more chatter and talk than listed here. The number of attendees makes for an unique experience every time. We are looking forward to the next Dev Brunch at the Softwareschneiderei in February 2017. We even have some topics already on the agenda (like a report about first-hand experiences with the programming language Rust). And as always, we are open for guests and future regulars. Just drop us a notice and we’ll invite you over next time.