Anyone who uses SQL databases knows the DISTINCT modifier for SELECT queries to get result sets without duplicates. However, PostgreSQL has another variant of it that not everyone knows, but which is very useful: the SELECT DISTINCT ON clause. It can be used to query only the first row of each set of rows according to a grouping.
To understand its usefulness, let’s look at an example and solve it in the classical way first.
The complicated way
Given the following table of items we want to query for each category the item with the highest value.
name │ category │ value ------------------------- A │ X │ 52 B │ X │ 35 C │ X │ 52 D │ Y │ 27 E │ Y │ 31 F │ Y │ 20
Usually we’d start out with a query like this:
SELECT
category,
MAX(value) AS highest_value
FROM items
GROUP BY category;
category │ highest_value -------------------------- X │ 52 Y │ 31
And then use this query as a sub-select:
SELECT * FROM items
WHERE (category, value) IN (
SELECT
category,
MAX(value) AS highest_value
FROM items
GROUP BY category
);
name │ category │ value ------------------------- A │ X │ 52 C │ X │ 52 E │ Y │ 31
Unfortunately, there are multiple items in category X with the same highest value 52. But we really only want one row for each category. In this case we might use the ROW_NUMBER()
function:
SELECT
name, category, value
FROM (
SELECT
items.*,
ROW_NUMBER() OVER (
PARTITION BY category
ORDER BY value DESC, name
) AS rownum
FROM items
) WHERE rownum = 1;
name │ category │ value ------------------------- A │ X │ 52 E │ Y │ 31
This is finally our desired result.
The easy way
But I promised it can be easier with the DISTINCT ON clause. How does it work?
SELECT DISTINCT ON (category) *
FROM items
ORDER BY
category, value DESC, name;
After DISTINCT ON we specify one or more columns by which to group by in parentheses. The ORDER BY clause determines which row will be the first in each group. We get the same result:
name │ category │ value ------------------------- A │ X │ 52 E │ Y │ 31