Understanding SQL joins
Understanding SQL joins is not hard if you think about them in terms of a Cartesian product filtered with a Boolean expression and then optionally extended with missing rows from the left and/or right side. This is how I will introduce them here. For some reason, people like to mislead beginners with Venn diagrams.1 I’ll briefly show why this doesn’t work at the end of this post.
Syntax
A join has the following basic syntax.2
select *
from A
<join-type> join B on <bool-expression>
We select from a table A and join a table B based on some Boolean expression. For our purposes, the join type will be one out of CROSS
, INNER
, LEFT
, RIGHT
, and FULL
. The Boolean expression is an SQL expression that can take a row of A and B as input and outputs true or false. (However, the Boolean expression does not need to operate on both rows or even a single row. In the simplest case, it’s just TRUE
.)
Cartesian product
Let a database table be a set of rows and a row a tuple of values, where each value can also be accessed via a (column) name. Take the Cartesian product of the two tables A and B, such that we get the set of all ordered pairs (a, b), where a is a row from A and b is a row from B. Now evaluate the Boolean expression on each element (a, b) of the Cartesian product.
For example, if A has three rows and B has two rows, we end up with something like the following table.
A | B | Predicate |
---|---|---|
a1 | b1 | true/false |
a1 | b2 | true/false |
a2 | b1 | true/false |
a2 | b2 | true/false |
a3 | b1 | true/false |
a3 | b2 | true/false |
a1 is the first row from A, a2 the second row, and so on.
Let a0 be a row with the columns from A where each value is NULL
, and b0 a row with the columns from B where each value is NULL
. We’ll use both in the next section.
Join types
The following short definitions give you an accurate mental model for each join type.
- The (result set of a)
CROSS JOIN
is the Cartesian product of A and B. INNER JOIN
(I) is the subset of pairs from the Cartesian product for which our Boolean expression evaluates to true.LEFT (OUTER) JOIN
(L) is the union of I and the set of pairs (a, b0), for all a that aren’t part of a pair in I.3RIGHT (OUTER) JOIN
(R) is the union of I and the set of pairs (a0, b), for all b that aren’t part of a pair in I.FULL (OUTER) JOIN
is the union ofL
andR
.
Of course, this is not how you would actually implement all of this. There are optimizations to save time and memory, but you would get correct results.
Venn diagrams
Venn diagrams are only accurate in the special case where the join condition is an equality between columns of unique identifiers. Here is a short list of “pathological” cases to show you how this gives you wrong intuitions in general.
Based on the Venn diagram usually given for left joins, how many rows does this query return? The number of rows that A has?
select *
from A
left join B on B.name = A.name
Nope, if the same name appears multiple times in B, we will have multiple (a, b) tuples for the same row a, so there can be more rows than in A. OK, you might say that I just misunderstood the Venn diagram. The result set has all the rows from A and a subset of the rows from B, so it’s not wrong.
What about this? Does the result contain the intersection of rows from A and B?
select *
from A
inner join B on B.ident <> A.ident
Nope, each row of A is combined with each row of B, where the ids don’t match. So actually, ids that only appear in either A or B are guaranteed to show up as long as A and B aren’t empty.
In the next example, the Boolean expression operates on columns from both rows, but doesn’t relate them to another.
select *
from A
inner join B on B.dt > '2022-09-01' and A.dt > '2022-10-01'
Here we relate them to another, but with an inequality.
select *
from A
inner join B on B.dt > A.dt
Again and again, the Venn diagram with intersecting circles is worse than useless.
I’m not the first one to notice, see for example this blog post from 2016 – not sold on their visualization either, though.↩︎
I ignore the
using
clause for clarity.↩︎In other words: all rows from the inner join together with all left rows missing from the inner join right-padded with
NULL
values.↩︎