How to Group Rows by Any Time Interval in Postgres?
Grouping by a time interval is a commonly needed SQL query. The time interval could be a day, week, or month. An example of this could be: finding the count of all the newly registered users for the past three months.
Usually, the suggested solution is to use a function to convert the timestamp column to the required time interval and do a GROUP BY
on it. For eg, In Postgres both of these queries will return the count of registered users grouped by the day with the newer days first:
SELECT
DATE_TRUNC('day', created) as day,
COUNT(*)
FROM requests
GROUP BY 1
ORDER BY 1 DESC;
SELECT
TO_CHAR(created, 'YYYY-MM-DD') as day,
COUNT(*)
FROM requests
GROUP BY 1
ORDER BY 1 DESC;
But if you need something more fine-grained? Let’s say you want to group rows in 15min intervals or six hours?
There is also another problem with converting existing timestamps. Suppose there was an hour when the site was down and no user could register on the site. Because no one could sign up, there won’t be any user whose created
value would be from that hour. In the rows of text, you’re likely to miss the gap between time intervals.
How can we solve this?
Generate the Time Series then Join
Postgres provides a handy function generate_series
to help us here. Instead of converting timestamps, we generate a list of timestamps and then JOIN rows on those timestamps.
For eg, we can generate a list of timestamps starting from the past week in the interval of 15min using this:
SELECT GENERATE_SERIES(NOW() - INTERVAL '7d', NOW(), INTERVAL '15m' ) as ts;
Now, to make our join easier, we should generate two columns that can represent the start and end of the interval like:
start | end |
---|---|
2022-07-09 00:00:00+00 | 2022-07-09 00:15:00+00 |
2022-07-09 00:15:00+00 | 2022-07-09 00:30:00+00 |
2022-07-09 00:30:00+00 | 2022-07-09 00:45:00+00 |
To do this, we can use WINDOW functions.
WITH timestamps as (
SELECT GENERATE_SERIES(NOW() - INTERVAL '7d', NOW(), INTERVAL '15m' ) as ts
)
SELECT
ts as start,
LEAD(ts) as end
OVER ()
FROM timestamps;
Now for the final JOIN, we need to check if the created
lies between start
and end
, which we can do by
WITH timestamps AS (
SELECT
GENERATE_SERIES(NOW() - INTERVAL '7d', NOW(), INTERVAL '15m') AS ts
), range_ts AS (
SELECT
ts AS start_ts,
LEAD(ts) OVER () AS end_ts
FROM
timestamps
)
SELECT
COUNT(requests.created),
start_ts,
end_ts
FROM
range_ts
LEFT JOIN requests ON requests.created BETWEEN start_ts AND end_ts
GROUP BY 2, 3;
But there’s a problem here. This query would be too slow. For the DB engine, it’s very hard to optimize non-equality-based JOIN clause like the BETWEEN clause we have. It would need to run “rows in timestamps X rows in requests” operations to perform the join.
Fortunately, we can optimize this by introducing an equality-based clause too. The only rows we are interested in are the rows that are from the same hour, so we can add a condition:
DATE_TRUNC('hour', start_ts) = DATE_TRUNC('hour', created)
Our final query becomes,
WITH timestamps AS (
SELECT
GENERATE_SERIES(NOW() - INTERVAL '7d', NOW(), INTERVAL '15m') AS ts
), range_ts AS (
SELECT
ts AS start_ts,
LEAD(ts) OVER () AS end_ts
FROM
timestamps
)
SELECT
COUNT(requests.created),
start_ts,
end_ts
FROM
range_ts
LEFT JOIN requests ON requests.created BETWEEN start_ts AND end_ts
AND DATE_TRUNC('hour', start_ts) = DATE_TRUNC('hour', created)
GROUP BY 2, 3;
Output
count | start_ts | end_ts |
---|---|---|
0 | 2022-07-09 16:50:54.39318+00 | 2022-07-09 17:05:54.39318+00 |
2 | 2022-07-09 17:05:54.39318+00 | 2022-07-09 17:20:54.39318+00 |
1 | 2022-07-09 17:20:54.39318+00 | 2022-07-09 17:35:54.39318+00 |
1 | 2022-07-09 17:35:54.39318+00 | 2022-07-09 17:50:54.39318+00 |
0 | 2022-07-09 17:50:54.39318+00 | 2022-07-09 18:05:54.39318+00 |
0 | 2022-07-09 18:05:54.39318+00 | 2022-07-09 18:20:54.39318+00 |
4 | 2022-07-09 18:20:54.39318+00 | 2022-07-09 18:35:54.39318+00 |
7 | 2022-07-09 18:35:54.39318+00 | 2022-07-09 18:50:54.39318+00 |
2 | 2022-07-09 18:50:54.39318+00 | 2022-07-09 19:05:54.39318+00 |
8 | 2022-07-09 19:05:54.39318+00 | 2022-07-09 19:20:54.39318+00 |
18 | 2022-07-09 19:20:54.39318+00 | 2022-07-09 19:35:54.39318+00 |
0 | 2022-07-09 19:35:54.39318+00 | 2022-07-09 19:50:54.39318+00 |
13 | 2022-07-09 19:50:54.39318+00 | 2022-07-09 20:05:54.39318+00 |
0 | 2022-07-09 20:05:54.39318+00 | 2022-07-09 20:20:54.39318+00 |