Full-text Search with PostgreSQL

If you want to add simple text search functionality to an application backed by an SQL database, one of the first things that may come to mind is the SQL LIKE operator. The LIKE operator and its case-insensitive sibling ILIKE find substrings in text data via wildcards such as %, which matches any sequence of zero or more characters:

SELECT * FROM book WHERE title ILIKE '%dog%';

However, this approach satisfies only very basic requirements for text search, because it only matches exact substrings. That’s why application developers often use an external search engine like Elasticsearch based on the Apache Lucene library.

With a PostgreSQL database there is another option: it comes with a built-in full-text search. A full-text search analyzes text according to the language of the text, parses it into tokens and converts them into so-called lexemes. These are strings, just like tokens, but they have been normalized so that different forms of the same word, for example “pony” and “ponies”, are made alike. Additionally, stop words are eliminated, which are words that are so common that they are useless for searching, like “a” or “the”. For this purpose the search engine uses a dictionary of the target language.

In PostgreSQL, there are two main functions to perform full-text search: to_tsvector and to_tsquery. The ts part in the function names stands for “text search”. The to_tsvector function breaks up the input string and creates a vector of lexemes out of it, which is then matched against a search query created by the to_tsquery function. The two functions can be combined with the @@ (match) operator, which applies a search query to a search vector:

SELECT title
  FROM book
  WHERE to_tsvector(title) @@ to_tsquery('(cat | dog) & pony');

The query syntax of to_tsquery supports boolean operators like | (or), & (and), ! (not) and grouping using parentheses, but also other operators like <-> (“followed by”) and :* (prefix matching).
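
As a minimal sketch of the “followed by” and prefix matching operators, the following two queries should both return true. The sample sentences are made up for illustration, and the 'english' configuration is passed explicitly so that the result does not depend on the server's default text search configuration:

```sql
-- "followed by": quick must be directly followed by brown
SELECT to_tsvector('english', 'The quick brown fox')
       @@ to_tsquery('english', 'quick <-> brown');

-- prefix matching (written as :*): any lexeme starting with "thous"
SELECT to_tsvector('english', 'Thousands of ponies')
       @@ to_tsquery('english', 'thous:*');
```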

You can specify the target language as a parameter of to_tsvector:

# SELECT to_tsvector('english', 'Thousands of ponies were grazing on the prairie.');

'graze':5 'poni':3 'prairi':8 'thousand':1

Here’s another example in German:

# SELECT to_tsvector('german', 'Wer einen Fehler begeht, und ihn nicht korrigiert, begeht einen zweiten (Konfuzius)');

'begeht':4,9 'fehl':3 'konfuzius':12 'korrigiert':8 'wer':1 'zweit':11

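PostgreSQL ships with text search configurations for roughly 15 to 30 languages out of the box; the exact number depends on the version. The \dF command in psql lists them.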
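
To check which languages your own installation supports, you can query the system catalog pg_ts_config, which holds the available text search configurations. This is only a small sketch; the list you get depends on the PostgreSQL version and on any additionally installed configurations:

```sql
-- List the names of all installed text search configurations
SELECT cfgname
  FROM pg_ts_config
  ORDER BY cfgname;
```

Each returned name can be passed as the first argument to to_tsvector and to_tsquery.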

The examples in this article are just a small glimpse of what is possible with regard to full-text search in PostgreSQL. If you want to learn more, you should consult the documentation. The key takeaway is that there is another option between simple LIKE clauses and an external search engine.

Commenting SQL database objects

Did you know that you can annotate database objects like tables, views and columns with comments in many SQL database systems? By that I don’t mean comments in SQL scripts, indicated by double dashes (--), but comments attached to the objects themselves, stored in the database. These may be helpful to the database admin by providing context via a description text on what is stored in these objects.

For PostgreSQL and Oracle databases the syntax is as follows:

COMMENT ON TABLE [schema_name.]table_name IS '...';
COMMENT ON COLUMN [schema_name.]table_name.column_name IS '...';

For example:

COMMENT ON COLUMN books.author IS 'The main author''s last name';
COMMENT ON TABLE books IS 'Contains only the best books';

These comments can be viewed in database tools like SQL Developer:

[Screenshots: comments on columns and comments on tables in SQL Developer]

You can also view the comments in psql:

db=# \d+ books
 Column |  Type   |         Description
--------+---------+-----------------------------
 id     | integer |
 author | text    | The main author's last name
 title  | text    |

And for a table:

db=# \dt+ books
                    List of relations
 Schema | Name  | Type  |     |        Description
--------+-------+-------+ ... +------------------------------
public  | books | table |     | Contains only the best books
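
If you are not working in psql, the comments can also be read with plain SQL. The following is a sketch that assumes the books table from above and uses PostgreSQL's obj_description and col_description functions together with the pg_attribute system catalog:

```sql
-- Comment on the table itself
SELECT obj_description('books'::regclass, 'pg_class') AS table_comment;

-- Comments on all columns of the table
SELECT a.attname AS column_name,
       col_description(a.attrelid, a.attnum) AS column_comment
  FROM pg_attribute a
  WHERE a.attrelid = 'books'::regclass
    AND a.attnum > 0          -- skip system columns
    AND NOT a.attisdropped    -- skip dropped columns
  ORDER BY a.attnum;
```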

In Oracle you can query the comments from the data dictionary views ALL_TAB_COMMENTS and ALL_COL_COMMENTS:

> SELECT * FROM all_col_comments WHERE table_name='BOOKS';
OWNER    TABLE_NAME  COLUMN_NAME  COMMENTS
--------------------------------------------------------------
LIBRARY	 BOOKS	     ID           (null)
LIBRARY	 BOOKS	     AUTHOR       The main author's last name
LIBRARY	 BOOKS	     TITLE        (null)

> SELECT * FROM all_tab_comments WHERE table_name='BOOKS';
OWNER    TABLE_NAME  TABLE_TYPE  COMMENTS
--------------------------------------------------------------
LIBRARY	 BOOKS	     TABLE       Contains only the best books

In Oracle comments are limited to tables, views, materialized views, columns, operators and indextypes, but in PostgreSQL you can attach comments to nearly everything. Another good use case is documentation comments on database functions:

COMMENT ON FUNCTION my_function IS $$
This function does something important.

Parameters:
...
Example usage:
...
$$;

Note: $$ delimits multi-line strings (so-called dollar-quoted string constants).

Linking separate PostgreSQL servers with a Foreign Data Wrapper

If you want to query a table in one database server from another, you need a way to connect these two servers with each other. For PostgreSQL databases the feature that makes this possible is called Foreign Data Wrapper.

To use this feature you have to load its extension into the current database:

CREATE EXTENSION postgres_fdw;

The postgres_fdw extension ships with a default PostgreSQL installation. It allows you to connect a PostgreSQL database with another PostgreSQL database. You can connect a PostgreSQL database with databases by other vendors, too, but then you need external extensions like oracle_fdw for Oracle databases or mysql_fdw for MySQL databases. In this article we will only use postgres_fdw.

You can check if the extension was loaded successfully. The following query should return a row for the extension:

SELECT * FROM pg_extension WHERE extname='postgres_fdw';

The next step is to set up the remote server instance:

CREATE SERVER remotesrv
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (
    host '127.0.0.1',
    port '5432',
    dbname 'remotedb'
  );

This statement registers a remote server under the name remotesrv, a name you can choose freely. You have to specify which Foreign Data Wrapper to use (postgres_fdw in this case) as well as the target host, port and database name.

The CREATE SERVER statement didn’t contain any user login information. You have to provide it via a user mapping:

CREATE USER MAPPING
  FOR CURRENT_USER
  SERVER remotesrv
  OPTIONS (
    user 'joe',
    password 'secret'
  );

In this case we map the current user (CURRENT_USER) of the local server to the remote user joe. It doesn’t have to be the current user; you could specify any user name.

Now you have to import tables from the remote database. You can either explicitly import individual tables or a whole schema. Here’s how to import the public schema of the remote (“foreign”) database into the public schema of the local database:

IMPORT FOREIGN SCHEMA public
  FROM SERVER remotesrv
  INTO public;

You can restrict which tables to import with the LIMIT TO or EXCEPT clauses. The following statement will only import the tables books and students:

IMPORT FOREIGN SCHEMA public
  LIMIT TO (students, books)
  FROM SERVER remotesrv
  INTO public;

Now you can access these tables as if they were in the local database.
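
For example, a query against an imported table looks like any other query. Importing a single table explicitly is done with CREATE FOREIGN TABLE. In the sketch below the local name remote_books and the column list are assumptions for illustration; the columns have to match the actual remote table:

```sql
-- Query a table imported via IMPORT FOREIGN SCHEMA
SELECT * FROM books LIMIT 10;

-- Alternative: declare a single foreign table by hand
-- (table name remote_books and the columns are hypothetical)
CREATE FOREIGN TABLE remote_books (
  id     integer,
  author text,
  title  text
)
  SERVER remotesrv
  OPTIONS (schema_name 'public', table_name 'books');

SELECT title FROM remote_books WHERE author = 'Doe';
```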

Tables as types in PostgreSQL

In SQL each column of a database table has a data type. These are types like NUMBER, VARCHAR(size) / TEXT, TIMESTAMP. What you perhaps don’t know yet is that in PostgreSQL you can use tables as types. What does this mean? Let’s say we have a table:

CREATE TABLE person (
  firstname TEXT,
  lastname  TEXT,
  email     TEXT
);

Now you can use this table as a type for columns in other tables:

CREATE TABLE article (
  id       SERIAL,
  content  TEXT,
  author   person,
  reviewer person
);

Instead of repeating the three columns of a person twice, e.g. author_firstname, author_lastname, author_email, and reviewer_firstname, reviewer_lastname, reviewer_email, the person table defined before acts as a type. Of course, the usual way in SQL is to give each person an ID and reference persons via these IDs from other tables. But sometimes you do not want this reference semantics. In this example you might want to fix the values of author and reviewer for articles in time and not retroactively update them automatically if a person changes their last name or email address later.

How do you access the columns of these types? For INSERT the syntax is as follows:

INSERT INTO article (content, author, reviewer)
  VALUES ('...',
    ('Jane', 'Doe', 'jane.doe@example.com'),
    ('John', 'Roe', 'jroe@example.com')
  ); 

Or with explicit names of the sub-columns:

INSERT INTO article (content,
    author.firstname,
    author.lastname,
    author.email,
    reviewer.firstname,
    reviewer.lastname,
    reviewer.email)
  VALUES ('...',
    'Jane', 'Doe', 'jane.doe@example.com',
    'John', 'Roe', 'jroe@example.com'
  ); 

In a SELECT query individual values can be accessed with the following syntax:

SELECT
  content,
  (author).lastname,
  (reviewer).lastname
FROM article;
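
The same sub-column syntax also works in UPDATE statements. The following is a small sketch based on the article table from above; the new email address is made up for illustration:

```sql
-- Change a single sub-column: no parentheses after SET,
-- but parentheses are needed when reading the value
UPDATE article
  SET reviewer.email = 'john.roe@example.com'
  WHERE (reviewer).lastname = 'Roe';

-- Expand all sub-columns of a composite value at once
SELECT (author).* FROM article;
```

The parentheses around the column name tell the parser that author and reviewer are columns and not table names.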

Of course, tables that use other tables as data types for their columns can themselves be used as data types again.

One last thing worth mentioning is that these nested definitions can be mapped nicely to JSON:

SELECT jsonb_pretty(to_jsonb(article)) FROM article;

{
    "id": 1,
    "author": {
        "email": "jane.doe@example.com",
        "lastname": "Doe",
        "firstname": "Jane"
    },
    "content": "...",
    "reviewer": {
        "email": "jroe@example.com",
        "lastname": "Roe",
        "firstname": "John"
    }
}

Geometric shapes, functions and operators in PostgreSQL

On this blog I frequently write about features of relational database systems and their SQL dialects. One feature many developers do not know about is support for geometric shapes, although a lot of RDBMSs support them in one form or the other, each with its own syntax, of course. In this article I’m going to demonstrate this feature with PostgreSQL.

PostgreSQL has data types for geometric shapes like point, box, line segment, line, path, polygon, and circle. These data types are only defined for two dimensions with Cartesian (x, y) coordinates. Here are some literals for these types:

point '(3.2,4)'
box '((1,2),(6,4))'
lseg '((-4,0),(3,2))'
path '((0,0),(2,1),(5,3))'
polygon '((0,0),(1,1),(2,0),(3,1))'
circle '((5,2),1.5)'

You can create tables with columns of these types and insert shapes:

CREATE TABLE shapes (p point, c circle);

INSERT INTO shapes (p, c) VALUES
  (point '(1,0)', circle '(0,0),3'),
  (point '(10,20)', circle '(2,3),4'),
  (point '(0.5,1.5)', circle '(1,2),1');

Now you can query shapes and filter them with special operators:

SELECT * FROM shapes WHERE c @> p;

This query uses the contains operator @> in the WHERE clause. It selects all rows where the circle c contains the point p.

Here’s another operator: <-> determines the Euclidean distance between two points.

SELECT point '(0,0)' <-> point '(1,2)';
=> 2.23606797749979
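
The distance operator can also be used to sort rows. As a small sketch based on the shapes table from above, this query lists the points ordered by their distance from the origin:

```sql
-- Nearest points first
SELECT p, p <-> point '(0,0)' AS distance
  FROM shapes
  ORDER BY distance;
```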

The ?|| operator tests if two lines are parallel:

SELECT line '((1,2),(1,3))' ?|| line '((2,3),(2,4))';
=> true

You can translate a shape with the + operator:

SELECT box '((0,0),(1,1))' + point '(1,2)';
=> box '(2,3),(1,2)'

Or you can test with && if two shapes overlap:

SELECT box '((1,2),(4,3))' && box '(2,3),(1,2)';
=> true

This is only a small selection of geometric operators. See the full list in the official documentation. There you can also find a list of geometric functions like area, center, isclosed, npoints, etc.

SELECT area(box '((4,6),(10,12))');
=> 36
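
Here is a short sketch of the other functions mentioned above, using the literals from the beginning of this article. Note that a path written with parentheses is closed, while a path written with square brackets is open:

```sql
SELECT center(circle '((5,2),1.5)');          -- center point of the circle: (5,2)
SELECT npoints(path '((0,0),(2,1),(5,3))');   -- number of points: 3
SELECT isclosed(path '((0,0),(2,1),(5,3))');  -- closed path: true
SELECT isclosed(path '[(0,0),(2,1),(5,3)]');  -- open path: false
```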

As mentioned in the beginning, other database systems support similar functionality. Check out MySQL’s spatial data types, Oracle Spatial, and MS SQL’s spatial data types.

Pagination in SQL

Pagination is the task of dividing a data set into consecutive parts, called pages. For example, a search engine initially only shows the first 15 results for a search query. The user can then step through the rest of the results by clicking a “Next” button.

Ideally this feature is also supported by the underlying database system. Otherwise, the application would have to load all matching data records from the database, just to discard most of them, because the user only wanted to see page 3 of 50. A pagination request has two components: a limit and an offset. If a page contains a maximum of 15 items and page 3 is requested, then the limit would be 15 and the offset would be 30 = (page-1) × limit.

PostgreSQL, MySQL, MariaDB

The database systems PostgreSQL, MySQL and MariaDB have a straightforward syntax for pagination: LIMIT {number} OFFSET {number}. So a simple SQL query with pagination might look like this:

SELECT * FROM users ORDER BY name LIMIT 15 OFFSET 30;
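
In PostgreSQL the offset does not have to be computed by the application, because LIMIT and OFFSET accept expressions. The following sketch wraps the formula offset = (page-1) × limit in a prepared statement. The statement name users_page is made up for illustration, and the PREPARE/EXECUTE syntax shown here is the PostgreSQL one:

```sql
-- $1 = page number (starting at 1), $2 = page size
PREPARE users_page(int, int) AS
  SELECT * FROM users
    ORDER BY name
    LIMIT $2 OFFSET ($1 - 1) * $2;

-- Page 3 with 15 items per page, i.e. LIMIT 15 OFFSET 30
EXECUTE users_page(3, 15);
```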

Oracle DB

Oracle DB didn’t have a dedicated syntax for pagination before Oracle 12c, but it was still possible to achieve the same result with other means. With Oracle 12c a new syntax for pagination was introduced under the name “Row limiting clause”. First I’ll show the old method, then the new syntax.

The old method is based on ROWNUM. If you wanted to specify both an offset and a limit, you had to nest multiple queries:

SELECT *
FROM (SELECT u.*, rownum AS rnum
      FROM (SELECT *
            FROM users
            ORDER BY name) u
      WHERE rownum <= 45)
WHERE rnum > 30;

The newer row limiting clause syntax is shorter and looks as follows:

SELECT * FROM users ORDER BY name
  OFFSET 30 ROWS FETCH NEXT 15 ROWS ONLY;

This syntax also allows the option to specify a percentage of rows instead of a fixed number of rows:

SELECT * FROM users ORDER BY name
  FETCH FIRST 20 PERCENT ROWS ONLY;

MS SQL Server

Microsoft’s SQL Server also supports the Oracle-like syntax with OFFSET and FETCH clauses and recommends the usage of this syntax for pagination.

Working with JSON data in PostgreSQL

Today most common SQL-based relational database management systems (DBMS) like PostgreSQL, MySQL, MariaDB, SQL Server and Oracle offer functionality to efficiently store and query JSON data in one form or another, with varying syntax. While the SQL:2016 standard defines a common syntax named SQL/JSON, it is not yet fully supported by all of these DBMS. This blog post is specific to PostgreSQL.

JSON data types

In PostgreSQL there are two data types for JSON columns: json and jsonb. The former stores JSON data as-is with any formatting preserved, while the latter stores JSON in a decomposed binary format. Operations on data in jsonb format are potentially more efficient.

We’ll use the jsonb data type in the following example to store a sequence of events, for example for an event sourcing based application, in a table.

CREATE TABLE events (date TIMESTAMP NOT NULL,
                     event JSONB NOT NULL);

JSON literals look like string literals. Let’s insert some events:

INSERT INTO events (date, event) VALUES
  (NOW(), '{"type": "add_shelf", "payload": {"id": 1}}'),
  (NOW(), '{"type": "add_book", "payload": {"title": "Ulysses", "shelf": 1}}'),
  (NOW(), '{"type": "add_book", "payload": {"title": "Moby Dick", "shelf": 1}}'),
  (NOW(), '{"type": "add_shelf", "payload": {"id": 2}}'),
  (NOW(), '{"type": "add_book", "payload": {"title": "Don Quixote", "shelf": 2}}');

Querying

PostgreSQL has two basic operators for navigating a JSON structure: -> and ->>. The former accesses an object field by key and returns it as JSON, while the latter returns the field as text. These operators can be used both in the SELECT clause and the WHERE clause:

SELECT event->>'type' AS type FROM events;

type
add_shelf
add_book
add_book
add_shelf
add_book

SELECT event FROM events
        WHERE event->>'type'='add_book'
          AND event->'payload'->>'shelf'='1';

event
{"type": "add_book", "payload": {"shelf": 1, "title": "Ulysses"}}
{"type": "add_book", "payload": {"shelf": 1, "title": "Moby Dick"}}

Note that in the example above the value of "shelf" is compared to a string literal ('1'). In order to treat the value as a number we have to use the CAST function, and then we can use numerical comparison operators:

SELECT event FROM events
        WHERE CAST(
          event->'payload'->>'shelf' AS INTEGER
        ) > 1;

event
{"type": "add_book", "payload": {"shelf": 2, "title": "Don Quixote"}}
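
PostgreSQL also offers the shorter :: notation for type casts. As a sketch, the query above can be written as follows. The parentheses are required because :: binds more tightly than the JSON operators:

```sql
SELECT event FROM events
  WHERE (event->'payload'->>'shelf')::integer > 1;
```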

Updating

Updating JSON object fields is a bit more complicated. It is only possible with the jsonb data type and can be done via the JSONB_SET function, which takes four arguments:

1) the original JSON,
2) a path specifying which object fields should be updated,
3) a jsonb value, which is the new value, and
4) a boolean flag that specifies if missing fields should be created.

In this example we convert all the event "type" fields from lower case to upper case:

UPDATE events SET event=JSONB_SET(
  event,
  '{type}',
  TO_JSONB(UPPER(event->>'type')),
  false
);
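
The fourth argument matters when the path points to a field that does not exist yet. The following sketch adds a new field to the payload of all book events; the field name archived is made up for illustration, and lower() is used so that the filter matches regardless of the case of the "type" field:

```sql
UPDATE events SET event = JSONB_SET(
  event,
  '{payload,archived}',   -- path: field "archived" inside "payload"
  'false'::jsonb,         -- new value as jsonb
  true                    -- create the field because it does not exist yet
)
WHERE lower(event->>'type') = 'add_book';
```

With false as the last argument the statement would leave the events unchanged, because the field is missing.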

PostgreSQL provides a lot more operators and functions for working with JSON data. This post only covered the most basic ones. See the PostgreSQL JSON reference for more.

Using PostgreSQL with Entity Framework

The most widespread O/R (object-relational) mapper for the .NET platform is the Entity Framework. It is most often used in combination with Microsoft SQL Server as database. But the architecture of the Entity Framework allows you to use it with other databases as well. A popular and reliable open-source SQL database is PostgreSQL. This article shows how to use a PostgreSQL database with the Entity Framework.

Installing the Data Provider

First you need an Entity Framework data provider for PostgreSQL. It is called Npgsql. You can install it via NuGet. If you use Entity Framework 6 the package is called EntityFramework6.Npgsql:

> Install-Package EntityFramework6.Npgsql

If you use Entity Framework Core for the new .NET Core platform, you have to install a different package:

> Install-Package Npgsql.EntityFrameworkCore.PostgreSQL

Configuring the Data Provider

The next step is to configure the data provider and the database connection string in the App.config file of your project, for example:

<configuration>
  <!-- ... -->

  <entityFramework>
    <providers>
      <provider invariantName="Npgsql"
         type="Npgsql.NpgsqlServices, EntityFramework6.Npgsql" />
    </providers>
  </entityFramework>

  <system.data>
    <DbProviderFactories>
      <add name="Npgsql Data Provider"
           invariant="Npgsql"
           description="Data Provider for PostgreSQL"
           type="Npgsql.NpgsqlFactory, Npgsql"
           support="FF" />
    </DbProviderFactories>
  </system.data>

  <connectionStrings>
    <add name="AppDatabaseConnectionString"
         connectionString="Server=localhost;Database=postgres"
         providerName="Npgsql" />
  </connectionStrings>

</configuration>

Possible parameters in the connection string are Server, Port, Database, User Id and Password. Here’s an example connection string using all parameters:

Server=192.168.0.42;Port=5432;Database=mydatabase;User Id=postgres;Password=topsecret

The database context class

To use the configured database you create a database context class in the application code:

class AppDatabase : DbContext
{
  private readonly string schema;

  public AppDatabase(string schema)
    : base("AppDatabaseConnectionString")
  {
    this.schema = schema;
  }

  public DbSet<User> Users { get; set; }

  protected override void OnModelCreating(DbModelBuilder builder)
  {
    builder.HasDefaultSchema(this.schema);
    base.OnModelCreating(builder);
  }
}

The parameter to the base constructor call is the name of the configured connection string in App.config. In this example the method OnModelCreating is overridden to set the name of the schema to use. Here the schema name is injected via the constructor. For PostgreSQL the default schema is called “public”:

using (var db = new AppDatabase("public"))
{
  var admin = db.Users.First(user => user.UserName == "admin");
  // ...
}

The Entity Framework mapping of entity names and properties is case-sensitive. To make the mapping work you have to preserve the case when creating the tables by putting the table and column names in double quotes:

create table public."Users" ("Id" bigserial primary key, "UserName" text not null);

With these basics you’re now set up to use PostgreSQL in combination with the Entity Framework.