SQL window functions LAG and LEAD
LAG
and LEAD
allow you to access data from previous or subsequent rows in the same result set without any joins.
Let there be some time series data: a table of city names, dates, and daily maximum temperatures. That’s what we see in the first three columns of the following table (not necessarily ordered).
city | dt | temp | prev_temp | next_temp | diff |
---|---|---|---|---|---|
Leipzig | 2024-01-01 | 0 | NULL | 5 | NULL |
Leipzig | 2024-01-02 | 5 | 0 | -3 | 5 |
Leipzig | 2024-01-03 | -3 | 5 | -1 | -8 |
Leipzig | 2024-01-04 | -1 | -3 | NULL | 2 |
Paris | 2024-01-01 | 8 | NULL | 3 | NULL |
Paris | 2024-01-02 | 3 | 8 | 12 | -5 |
Paris | 2024-01-03 | 12 | 3 | NULL | 9 |
The remaining columns are derived from the first three with the help of LAG
and LEAD
(What’s new in SQL:2011, 8.2.1).
SELECT *,
LAG(temp) OVER city_dt AS prev_temp,
LEAD(temp) OVER city_dt AS next_temp,
- LAG(temp) OVER city_dt AS diff
temp FROM temperature
AS
WINDOW city_dt PARTITION BY city ORDER BY dt); (
What happens here? We partition the table by city and order each partition by date. Now we pick the previous and the next temperature for each row within each partition. If there is no previous or next data point within the same partition, we get NULL
. One use case for this is computing differences between adjacent rows.
Why is it fast? The data can be partitioned in linear time using hashes of the PARTITION BY
values. After this first step, we can work on each partition in parallel, starting with ORDER BY
. If the number of partitions is much smaller than the number of threads, we might want to add some intra-partition parallelism, see Efficient Processing of Window Functions in Analytical SQL Queries (Leis et al., 2015).