Today I’ll demonstrate a curiosity of SQL regarding the NOT IN operator in combination with a subquery and NULL values.
Let’s assume we have two database tables, users and profiles:
users profiles +--------------+ +-------------+ | id username | | id user_id | | 0 'joe' | | 0 2 | | 1 'kate' | | 1 0 | | 2 'john' | | 2 NULL | | 3 'maria' | +-------------+ +--------------+
We want to find all users, which have no associated profile. The intuitive solution would be a negated membership test (“NOT IN”) on the result set of a subquery:
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM profiles);
The anticipated result is:
+---------------+ | id username | | 1 'kate' | | 3 'maria' | +---------------+
However, the actual result is an empty set:
+--------------+ | id username | +--------------+
This is irritating, especially since the non-negated form produces a sensible result:
SELECT * FROM users WHERE id IN (SELECT user_id FROM profiles);
+--------------+ | id username | | 0 'joe' | | 2 'john' | +--------------+
So why does the NOT IN operator produce this strange result?
To understand what happens we replace the result of the subquery with a set literal:
SELECT * FROM users WHERE id NOT IN (2, 0, NULL);
This statement is internally translated to:
SELECT * FROM users WHERE id<>2 AND id<>0 AND id<>NULL;
And here comes the twist: a field<>NULL
clause evaluates to UNKNOWN in SQL, which is treated like FALSE in a boolean expression. The desired clause would be id IS NOT NULL
, but this is not what is used by SQL. As a consequence the result set is empty.
The result for the non-negated membership test (“IN”) can be explained as well. The IN clause is internally translated to:
SELECT * FROM users WHERE id=2 OR id=0 OR id=NULL;
A field=NULL
clause evaluates to UNKNOWN as well. But in this case it is of no consequence, since the clause is joined via OR.
Now that we know what’s going on, how can we fix it? There are two possibilities:
One is to use an outer join:
SELECT u.id FROM users u LEFT OUTER JOIN profiles p ON u.id=p.user_id WHERE p.id IS NULL;
The other option is to filter out all NULL values in the subquery:
SELECT id FROM users WHERE id NOT IN (SELECT user_id FROM profiles WHERE user_id IS NOT NULL);
Conclusion
Both field=NULL
and field<>NULL
evaluate to UNKNOWN in SQL. Unfortunately, SQL uses these clauses for IN and NOT IN set operations. The solution is to work around it.