PostgreSQL’s “DISTINCT ON” clause

Anyone who uses SQL databases knows the DISTINCT modifier for SELECT queries to get result sets without duplicates. However, PostgreSQL has another variant of it that not everyone knows, but which is very useful: the SELECT DISTINCT ON clause. It can be used to query only the first row of each set of rows according to a grouping.

To understand its usefulness, let’s look at an example and solve it in the classical way first.

The complicated way

Given the following table of items we want to query for each category the item with the highest value.

 name │ category │ value
-------------------------
 A    │ X        │ 52
 B    │ X        │ 35
 C    │ X        │ 52
 D    │ Y        │ 27
 E    │ Y        │ 31
 F    │ Y        │ 20

Usually we’d start out with a query like this:

SELECT
  category,
  MAX(value) AS highest_value
FROM items
GROUP BY category;
category │ highest_value
--------------------------
 X       │ 52
 Y       │ 31

And then use this query as a sub-select:

SELECT * FROM items
WHERE (category, value) IN (
  SELECT
    category,
    MAX(value) AS highest_value
  FROM items
  GROUP BY category
);
 name │ category │ value
-------------------------
 A    │ X        │ 52
 C    │ X        │ 52
 E    │ Y        │ 31

Unfortunately, there are multiple items in category X with the same highest value 52. But we really only want one row for each category. In this case we might use the ROW_NUMBER() function:

SELECT
  name, category, value
FROM (
  SELECT
    items.*,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY value DESC, name
    ) AS rownum
  FROM items
) WHERE rownum = 1;
 name │ category │ value
-------------------------
 A    │ X        │ 52
 E    │ Y        │ 31

This is finally our desired result.

The easy way

But I promised it can be easier with the DISTINCT ON clause. How does it work?

SELECT DISTINCT ON (category) *
FROM items
ORDER BY
  category, value DESC, name;

After DISTINCT ON we specify one or more columns by which to group by in parentheses. The ORDER BY clause determines which row will be the first in each group. We get the same result:

 name │ category │ value
-------------------------
 A    │ X        │ 52
 E    │ Y        │ 31