full-text search – Schneide Blog

What are GIN Indexes in PostgreSQL?

If you’ve worked with PostgreSQL and dealt with things like full-text search, arrays, or JSON data, you might have heard about GIN indexes. But what exactly are they, and why are they useful?

GIN stands for Generalized Inverted Index. Most indexes (like the default B-tree index) work best when there’s one clear value per row – like a number or a name. But sometimes, a single column can hold many values. Think of a column that stores a list of tags, words in a document, or key-value data in JSON. That’s where GIN comes in.

Let’s walk through a few examples to see how GIN indexes work and why they’re helpful.

Full-text search example

Suppose you have a table of articles:

CREATE TABLE articles (
  id    serial PRIMARY KEY,
  title text,
  body  text
);

You want to let users search the content of these articles. PostgreSQL has built-in support for full-text search, which works with a special data type called tsvector. To get started, you’d add a column to store this processed version of your article text:

ALTER TABLE articles ADD COLUMN tsv tsvector;
UPDATE articles SET tsv = to_tsvector('english', body);

Now, to speed up searches, you create a GIN index:

CREATE INDEX idx_articles_tsv ON articles USING GIN(tsv);

With that in place, you can search for articles quickly:

SELECT * FROM articles WHERE tsv @@ to_tsquery('tonic & water');

This finds all articles that contain both “tonic” and “water”, and thanks to the GIN index, it’s fast – even if you have thousands of articles.

Array example

GIN is also great for columns that store arrays. Let’s say you have a table of photos, and each photo can have several tags:

CREATE TABLE photos (
  id   serial PRIMARY KEY,
  tags text[]
);

You want to find all photos tagged with “capybara”. You can create a GIN index on the tags column:

CREATE INDEX idx_photos_tags ON photos USING GIN(tags);
SELECT * FROM photos WHERE tags @> ARRAY['capybara'];

(The @> operator means “contains” or “is a superset of”.)

The index lets PostgreSQL find matching rows quickly, without scanning the entire table.

JSONB example

PostgreSQL’s jsonb type lets you store flexible key-value data. Imagine a table of users with extra info stored in a jsonb column:

CREATE TABLE users (
  id   serial PRIMARY KEY,
  data jsonb
);

One row might store {"age": 42, "city": "Karlsruhe"}. To find all users from New York, you can use:

SELECT * FROM users WHERE data @> '{"city": "Karlsruhe"}';

And again, with a GIN index on the data column, this query becomes much faster:

CREATE INDEX idx_users_data ON users USING GIN(data);

Things to keep in mind

GIN indexes are very powerful, but they come with some tradeoffs. They’re slower to build and can make insert or update operations a bit heavier. So they’re best when you read (search) data often, but don’t write to the table constantly.

In short, GIN indexes are your friend when you’re dealing with columns that contain multiple values – like arrays, full-text data, or JSON. They let PostgreSQL break apart those values and build a fast lookup system. If your queries feel slow and you’re working with these kinds of columns, adding a GIN index might be exactly what you need.

Improving search for special content

Nowadays many applications are very complex or handle so much data that users need and expect a fast and powerful search functionality. Popular search engine you can leverage are ElasticSearch and Solr. They both use Lucene for index management under the hood and provide a similar functionality regarding indexing and text search.

In previous posts I already gave some advice on using and improving search in your applications using these engines. For example a how-to for .NET (core) with ElasticSearch and using n-grams or wildcard fields for substring search.

Even if these frameworks work really great implementing a great search functionality can be quite hard, especially when handling special data like DOIs, IP-Adresses, Hostnames and other text containing special characters.

What’s the deal with special characters?

The standard analyzers are tuned for dealing with natural language based texts. So they split words on punctuation, whitespace and certain special characters. These are usually filtered out and not indexed. So if you index something like my-cool-hostname the dashes and the complete string will not land in the index only leaving the separate parts my, cool and hostname.

The problem with that is, that neither an exact match nor a substring like my-cool will yield any results. That is not what your users will expect…

Making your search work with special data

There are several ways to improve your search functionality for fields or texts containing different kinds of non-language strings. Here are some simple options that will improve or fix handling problematic cases and make your search work as expected by your users.

My examples use ElasticSearch features and wording, so they may be called differently for other search engines.

Using a keyword field

If you only need an exact match on some weird data field like a DOI 10.1000/182 containing punctuation and slashes, where a user normally just copy & pastes the search string from somewhere using a keyword field for indexing instead of the text type might be the better option. Usually this is easy to implement and fast for indexing and searching.

Using multiple index fields for one data field

ElasticSearch also offers the possibility to index the same data using multiple index fields. So you can keep sub-string features while adding exact match or special character support by adding different index fields like keyword above or an index field using a different analyzer (see this option below). This is called multi-field in ElasticSearch. Using multi-fields can also be used to improve sorting and scoring matches.

Using different analyzers

As I mentioned before the standard analyzers for text fields use tokenization rules and character filters useful for natural language. Sometimes you want to keep words together or preserve special characters. To implement an appropriate search for hostnames or IP addresses you could for example use a custom analyzer with a whitespace or pattern tokenizer.

Conclusion

Default full text search works great out-of-the-box in many cases. However, there are many cases of special, structured data where you need to fine-tune the way the index gets populated.

Many approaches can be combined using different analyzers and indexing a data in several ways.

There is a lot you can do to provide awesome search capabilities to your users but that requires quite some knowledge of the way the search engines work and about the data you want to be searchable.

Full-text Search with PostgreSQL

If you want to add simple text search functionality to an application backed by an SQL database one of the first things that may come to your mind is the SQL LIKE operator. The LIKE operator and its case-insensitive sibling ILIKE find substrings in text data via wildcards such as %, which matches any sequence of zero or more characters:

SELECT * FROM book WHERE title ILIKE '%dog%'.

However, this approach satisfies only very basic requirements for text search, because it only matches exact substrings. That’s why application developers often use an external search engine like Elasticsearch based on the Apache Lucene library.

With a PostgreSQL database there is another option: it comes with a built-in full-text search. A full-text search analyzes text according to the language of the text, parses it into tokens and converts them into so-called lexemes. These are strings, just like tokens, but they have been normalized so that different forms of the same word, for example “pony” and “ponies”, are made alike. Additionally, stop words are eliminated, which are words that are so common that they are useless for searching, like “a” or “the”. For this purpose the search engine uses a dictionary of the target language.

In PostgreSQL, there are two main functions to perform full-text search: they are to_tsvector and to_tsquery. The ts part in the function names stands for “text search”. The to_tsvector function breaks up the input string and creates a vector of lexemes out of it, which are then used to perform full-text search using the to_tsquery function. The two functions can be combined with the @@ (match) operator, which applies a search query to a search vector:

SELECT title
  FROM book
  WHERE to_tsvector(title) @@ to_tsquery('(cat | dog) & pony')

The query syntax of ts_query supports boolean operators like | (or), & (and), ! (not) and grouping using parentheses, but also other operators like and <-> (“followed by”) and * (prefix matching).

You can specify the target language as a parameter of to_tsvector:

# SELECT to_tsvector('english', 'Thousands of ponies were grazing on the prairie.');

'graze':5 'poni':3 'prairi':8 'thousand':1

Here’s another example in German:

# SELECT to_tsvector('german', 'Wer einen Fehler begeht, und ihn nicht korrigiert, begeht einen zweiten (Konfuzius)');

'begeht':4,9 'fehl':3 'konfuzius':12 'korrigiert':8 'wer':1 'zweit':11

PostgreSQL supports dictionaries for about 80+ languages out-of-the-box.

The examples in this article are just a small glimpse of what is possible with regards to full-text search in PostgreSQL. If you want to learn more you should consult the documentation. The key takeaway is that there is another option between simple LIKE clauses and an external search engine.

Modern substring search

Nowadays many applications need a good search functionality. They manage large amounts of content in sometimes complex structures so looking for it manually quickly becomes unfeasible and annoying.

ElasticSearch is a powerful tool for implementing a fast and scalable search functionality for your applications. Many useful features like scoring and prefix search are available out-of-the-box.

One often requested feature needs a bit of thought and special implementation: A fulltext search for substrings.

Wildcard search

An easy way is to use an wildcard query. It allows using wildcard characters like * and ? but is not recommended due to low performance, especially if you start you search patterns with wildcards. For the sake of completeness I mention the link to the official documentation here.

Aside from performance it requires using the wildcard characters, either by the user or your code and perhaps needs to be combined with other queries like the match or term queries. Therefore I do not advise usage of wildcard queries.

Using n-grams for indexing

The trick here is to break up the tokens in your texts into even smaller parts – called n-grams – for indexing only. A word like “search” would be split into the following terms using 3-grams: sea, ear, arc, rch.

So if the user searches for “ear” a document/field containing “search” will be matched. You can configure the analyzer to use for individual fields an the minimum and maximum length of the n-grams to work best for your requirements.

The trick here is to use the n-gram analyzer only for indexing and not for searching because that would also break up the search term and lead to many false positives.

See this example configuration using the C# ElasticSearch API NEST:

var client = new ElasticClient(settings);
var response = client.Indices.Create("device-index", creator => creator
  .Settings(s => s
		.Setting("index.max_ngram_diff", 10)
		.Analysis(analysis => analysis
			.Analyzers(analyzers => analyzers
				.Custom("ngram_analyzer", analyzerDescriptor => analyzerDescriptor
					.Tokenizer("ngram_tokenizer")
					.Filters("lowercase")
				)
			)
			.Tokenizers(tokenizers => tokenizers
				.NGram("ngram_tokenizer", ngram => ngram
					.MinGram(3)
					.MaxGram(10)
				)
			)
		)
	)
	.Map<SearchableDevice>(device => device
		.AutoMap()
		.Properties(props => props
			.Text(t => t
				.Name(n => n.SerialNumber)
				.Analyzer("ngram_analyzer")
				.SearchAnalyzer("standard")
			)
			.Text(t => t
				.Name(n => n.InventoryNumber)
				.Analyzer("ngram_analyzer")
				.SearchAnalyzer("standard")
			)
			.Text(t => t
				.Name(n => n.Model)
				.Analyzer("ngram_analyzer")
				.SearchAnalyzer("standard")
			)
		)
	)
));

Using the wildcard field

Starting with ElasticSearch 7.9 there is a new field type called “wildcard”. Usage is in general straight forward: You simply exchange the field type “text” or “keyword” with this new type “wildcard”. ElasticSearch essentially uses n-grams in combination with a so called “binary doc value” to provide seemless performant substring search. See this official blog post for details and guidance when to prefer wildcard over the traditional field types.

Conclusion

Generally, search is hard. In the old days many may have used SQL like queries with wildcards etc. to implement search. With Lucene and ElasticSearch modern, highly scalable and performant indexing and search solutions are available for developers. Unfortunately, this great power comes with a bunch of pitfalls where you have to adapt your solution to fit you use-case.

	Writing Integration… on Every Unit Test Is a Stage Pla…
	mariuselvert on C# is very strict about modify…
	Anonymous on C# is very strict about modify…
	Anonymous on Cache configuration with WildF…
	Miq on Nested queries like N+1 in pra…