lateral join: optimising subqueries (for some cases)

I’m not entirely sure how i found lateral joins. Maybe i was searching for reasons why my query was slow. Maybe google watches my incognito/private search history and realised i don’t know anything. Regardless, lateral joins are awesome and can help simplify and optimise queries.

From PostgreSQL docs:

allows (subqueries) to reference columns provided by preceding FROM items. (Without LATERAL, each subquery is evaluated independently and so cannot cross-reference any other FROM item.

which, in passing, is not entirely easy to decipher its meaning. I’ve also seen LATERAL joins referred to as a correlated subquery which gives a bit more insight.

To make it a bit more tangible, consider you have an INVENTORY table that captures how many items of each product you have each day. There is also a PRICE table that captures the price of each product on a given day because you’re a capitalist.

              Table "public.inventory"
 Column  |  Type   | Collation | Nullable | Default
---------+---------+-----------+----------+---------
 person  | integer |           | not null |
 date    | date    |           | not null |
 product | integer |           | not null |
 count   | integer |           | not null |
Indexes:
    "idx_date" btree (date)
    "uniq_prod" UNIQUE, btree (person, product, date)


                    Table "public.price"
 Column  |       Type       | Collation | Nullable | Default
---------+------------------+-----------+----------+---------
 product | integer          |           | not null |
 date    | date             |           | not null |
 price   | double precision |           | not null |
Indexes:
    "uniq_prod_price" UNIQUE, btree (product, date)

Now if we want to generate a result that shows the value per product on a given day, we would join the two tables together. The problem is that the price of the product may not change daily so the price table may not have a price for every date in the inventory table. There are many options to address this, you could leverage tsrange data types for example but that may increase complexity in the model. Typically, i’ve used a window function to figure out a range of dates a price is valid for similar to the following:

select i.*, p.price
from inventory i
join (
  select product, date "start",
         coalesce(lag(date) over (partition BY product order by date desc),
                 '2100-01-01')::date "end",
         price
  from price) p on p.product = i.product and p.start <= i.date and i.date < p.end
where i.date = '2021-05-20';
-- '2100-01-01' is some arbitrary time in future so we have end date for latest price

Notice the subquery is built with no knowledge of its relationship to the INVENTORY table so it queries across entire PRICE table. To avoid this, we could try duplicating conditions to the subquery but that would make the query less generic.

Luckily, the same query can be done where are subquery has knowledge of other tables and thus does not need to blindly process an entire table by using LATERAL joins

select i.*, p.price
from inventory i
join lateral (
  select price
  from price p
  where p.product = i.product and p.date <= i.date
  order by date desc
  limit 1) p on true
where i.date = '2021-05-20';

Both queries will give the same result but the latter is more readable (once you know what a LATERAL join is). It also has better performance. With an INVENTORY table of ~45M rows (~100 people, each with ~25 products over ~50 years) and a PRICE table of ~4.5M rows (over ~50 years), querying the value of each product an individual holds on a given day generates the following plans:

-- normal join

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=731848.06..1908773.98 rows=6576966 width=24) (actual time=2138.093..7967.517 rows=2500 loops=1)
   Merge Cond: (i.product = price.product)
   Join Filter: ((price.date <= i.date) AND (i.date < (COALESCE(lag(price.date) OVER (?), '2100-01-01'::date))))
   Rows Removed by Join Filter: 45655000
   ->  Sort  (cost=276.90..284.94 rows=3214 width=16) (actual time=0.677..1.140 rows=2500 loops=1)
         Sort Key: i.product
         Sort Method: quicksort  Memory: 214kB
         ->  Index Scan using idx_date on inventory i  (cost=0.44..89.69 rows=3214 width=16) (actual time=0.022..0.369 rows=2500 loops=1)
               Index Cond: (date = '2021-05-20'::date)
   ->  Materialize  (cost=731422.71..879809.58 rows=4565750 width=20) (actual time=1940.290..4673.601 rows=45657501 loops=1)
         ->  WindowAgg  (cost=731422.71..822737.71 rows=4565750 width=20) (actual time=1940.282..2154.701 rows=456576 loops=1)
               ->  Sort  (cost=731422.71..742837.08 rows=4565750 width=16) (actual time=1940.249..1990.298 rows=456577 loops=1)
                     Sort Key: price.product, price.date DESC
                     Sort Method: external merge  Disk: 116224kB
                     ->  Seq Scan on price  (cost=0.00..70337.50 rows=4565750 width=16) (actual time=0.016..352.202 rows=4565750 loops=1)
 Planning Time: 0.670 ms
 JIT:
   Functions: 18
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.427 ms, Inlining 44.187 ms, Optimization 84.182 ms, Emission 60.225 ms, Total 190.021 ms
 Execution Time: 8012.483 ms

-- lateral join
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.87..6662.25 rows=3214 width=24) (actual time=0.029..14.977 rows=2500 loops=1)
   ->  Index Scan using idx_date on inventory i  (cost=0.44..89.69 rows=3214 width=16) (actual time=0.010..0.397 rows=2500 loops=1)
         Index Cond: (date = '2021-05-20'::date)
   ->  Limit  (cost=0.43..2.02 rows=1 width=12) (actual time=0.005..0.006 rows=1 loops=2500)
         ->  Index Scan Backward using uniq_prod_price on price p  (cost=0.43..9695.44 rows=6088 width=12) (actual time=0.005..0.005 rows=1 loops=2500)
               Index Cond: ((product = i.product) AND (date <= i.date))
 Planning Time: 0.179 ms
 Execution Time: 15.130 ms

The LATERAL join query returns in a fraction of the time. Similarly, on a larger query that grabs the value of each product an individual holds at the end of each year, LATERAL join still provides marked performance gains:

-- regular join
explain analyze
select i.*, p.price
from inventory i
join (
  select product, date "start",
         coalesce(lag(date) over (partition BY product order by date desc),
                  '2100-01-01')::date "end",
         price
  from price) p on p.product = i.product and p.start <= i.date and i.date < p.end
where extract(month from "date") = 12 and extract(day from "date") = 31;
                                                                                                    QUERY PLAN                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1455046.60..1868480.99 rows=2334884 width=24) (actual time=5016.843..296495.753 rows=125000 loops=1)
   Merge Cond: (i.product = price.product)
   Join Filter: ((price.date <= i.date) AND (i.date < (COALESCE(lag(price.date) OVER (?), '2100-01-01'::date))))
   Rows Removed by Join Filter: 2282750000
   ->  Sort  (cost=723569.00..723571.85 rows=1141 width=16) (actual time=3081.470..3129.406 rows=125000 loops=1)
         Sort Key: i.product
         Sort Method: external merge  Disk: 3192kB
         ->  Gather  (cost=1000.00..723511.06 rows=1141 width=16) (actual time=211.294..3040.475 rows=125000 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Seq Scan on inventory i  (cost=0.00..722396.96 rows=475 width=16) (actual time=223.979..2935.220 rows=41667 loops=3)
                     Filter: ((date_part('month'::text, (date)::timestamp without time zone) = '12'::double precision) AND (date_part('day'::text, (date)::timestamp without time zone) = '31'::double precision))
                     Rows Removed by Filter: 15177500
   ->  Materialize  (cost=731422.71..879809.58 rows=4565750 width=20) (actual time=1927.052..137719.308 rows=2282875001 loops=1)
         ->  WindowAgg  (cost=731422.71..822737.71 rows=4565750 width=20) (actual time=1927.047..2143.752 rows=456576 loops=1)
               ->  Sort  (cost=731422.71..742837.08 rows=4565750 width=16) (actual time=1927.020..1977.479 rows=456577 loops=1)
                     Sort Key: price.product, price.date DESC
                     Sort Method: external merge  Disk: 116224kB
                     ->  Seq Scan on price  (cost=0.00..70337.50 rows=4565750 width=16) (actual time=0.011..332.127 rows=4565750 loops=1)
 Planning Time: 0.231 ms
 JIT:
   Functions: 22
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.537 ms, Inlining 215.568 ms, Optimization 106.636 ms, Emission 86.688 ms, Total 411.429 ms
 Execution Time: 296524.393 ms


-- lateral join
explain analyze
select i.*, p.price
from inventory i
join lateral (
  select price
  from price p
  where p.product = i.product and p.date <= i.date
  order by date desc
  limit 1) p on true
where extract(month from "date") = 12 and extract(day from "date") = 31;
                                                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1000.43..725844.38 rows=1141 width=24) (actual time=131.019..3501.278 rows=125000 loops=1)
   ->  Gather  (cost=1000.00..723511.06 rows=1141 width=16) (actual time=130.952..2388.902 rows=125000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on inventory i  (cost=0.00..722396.96 rows=475 width=16) (actual time=128.022..2826.574 rows=41667 loops=3)
               Filter: ((date_part('month'::text, (date)::timestamp without time zone) = '12'::double precision) AND (date_part('day'::text, (date)::timestamp without time zone) = '31'::double precision))
               Rows Removed by Filter: 15177500
   ->  Limit  (cost=0.43..2.02 rows=1 width=12) (actual time=0.008..0.008 rows=1 loops=125000)
         ->  Index Scan Backward using uniq_prod_price on price p  (cost=0.43..9695.44 rows=6088 width=12) (actual time=0.008..0.008 rows=1 loops=125000)
               Index Cond: ((product = i.product) AND (date <= i.date))
 Planning Time: 0.223 ms
 JIT:
   Functions: 14
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.765 ms, Inlining 160.967 ms, Optimization 58.548 ms, Emission 40.490 ms, Total 261.770 ms
 Execution Time: 3509.229 ms

Switching to LATERAL joins resulted in queries that took less than 1% of the original query with the added benefit of a more readable query.