Understanding SQL joins

2024-02-11

Understanding SQL joins is not hard if you think about them in terms of a Cartesian product filtered with a Boolean expression and then optionally extended with missing rows from the left and/or right side. This is how I will introduce them here. For some reason, people like to mislead beginners with Venn diagrams.¹ I’ll briefly show why this doesn’t work at the end of this post.

Syntax

A join has the following basic syntax.²

select *
from A
<join-type> join B on <bool-expression>

We select from a table A and join a table B based on some Boolean expression. For our purposes, the join type will be one out of CROSS, INNER, LEFT, RIGHT, and FULL. The Boolean expression is an SQL expression that can take a row of A and B as input and outputs true or false. (However, the Boolean expression does not need to operate on both rows or even a single row. In the simplest case, it’s just TRUE.)

Cartesian product

Let a database table be a set of rows and a row a tuple of values, where each value can also be accessed via a (column) name. Take the Cartesian product of the two tables A and B, such that we get the set of all ordered pairs (a, b), where a is a row from A and b is a row from B. Now evaluate the Boolean expression on each element (a, b) of the Cartesian product.

For example, if A has three rows and B has two rows, we end up with something like the following table.

A	B	Predicate
a1	b1	true/false
a1	b2	true/false
a2	b1	true/false
a2	b2	true/false
a3	b1	true/false
a3	b2	true/false

a1 is the first row from A, a2 the second row, and so on.

Let a0 be a row with the columns from A where each value is NULL, and b0 a row with the columns from B where each value is NULL. We’ll use both in the next section.

Join types

The following short definitions give you an accurate mental model for each join type.

The (result set of a) CROSS JOIN is the Cartesian product of A and B.
INNER JOIN (I) is the subset of pairs from the Cartesian product for which our Boolean expression evaluates to true.
LEFT (OUTER) JOIN (L) is the union of I and the set of pairs (a, b0), for all a that aren’t part of a pair in I.³
RIGHT (OUTER) JOIN (R) is the union of I and the set of pairs (a0, b), for all b that aren’t part of a pair in I.
FULL (OUTER) JOIN is the union of L and R.

Of course, this is not how you would actually implement all of this. There are optimizations to save time and memory, but you would get correct results.

Venn diagrams

Venn diagrams are only accurate in the special case where the join condition is an equality between columns of unique identifiers. Here is a short list of “pathological” cases to show you how this gives you wrong intuitions in general.

Based on the Venn diagram usually given for left joins, how many rows does this query return? The number of rows that A has?

select *
from A
left join B on B.name = A.name

Nope, if the same name appears multiple times in B, we will have multiple (a, b) tuples for the same row a, so there can be more rows than in A. OK, you might say that I just misunderstood the Venn diagram. The result set has all the rows from A and a subset of the rows from B, so it’s not wrong.

What about this? Does the result contain the intersection of rows from A and B?

select *
from A
inner join B on B.ident <> A.ident

Nope, each row of A is combined with each row of B, where the ids don’t match. So actually, ids that only appear in either A or B are guaranteed to show up as long as A and B aren’t empty.

In the next example, the Boolean expression operates on columns from both rows, but doesn’t relate them to another.

select *
from A
inner join B on B.dt > '2022-09-01' and A.dt > '2022-10-01'

Here we relate them to another, but with an inequality.

select *
from A
inner join B on B.dt > A.dt

Again and again, the Venn diagram with intersecting circles is worse than useless.

I’m not the first one to notice, see for example this blog post from 2016 – not sold on their visualization either, though.↩︎
I ignore the using clause for clarity.↩︎
In other words: all rows from the inner join together with all left rows missing from the inner join right-padded with NULL values.↩︎