The problem can be rephrased as follows:
Calculate the number of daily processed unique pages while filtering pages which are less than the last day maximum page.
DB fiddle with step-by-step queries个
- 如果使用PostgreSQL 11+:
select day_date, count(distinct(page)) as page_count
from
(select
to_char(timestamp_updated, 'YYYY-MM-DD') as day_date,
page,
first_value(page) over (
order by to_char(timestamp_updated, 'YYYY-MM-DD')
groups between 1 preceding and current row
) as prev_day_max_page
from dictionary_word, unnest(dict_pages) page
) prev_day_page_data
where page >= prev_day_max_page
group by day_date
order by day_date;
GROUPS
窗口帮助聚合上一个日期组和当前日期组中的所有行.
文件100和101.
- Less elegant approach:个
with all_pages as (
select
to_char(timestamp_updated, 'YYYY-MM-DD') as day_date,
page
from dictionary_word dtw, unnest(dict_pages) page
)
select day_date, count(distinct(page)) as pages_count
from all_pages
join (
select
day_date,
coalesce(
lag(daily_max_page) over (order by day_date),
daily_max_page
) as prev_max_page
from (
select
to_char(timestamp_updated, 'YYYY-MM-DD') as day_date,
max(page::integer) as daily_max_page
from dictionary_word dtw, unnest(dict_pages) page
group by day_date
-- order by day_date
) s
) prev_day_max_page_stat using (day_date)
where page::integer >= prev_max_page
group by day_date
order by day_date;
Details:个
all_pages
表示unnest(dict_pages)
操作的所有页码.
prev_day_max_page_stat
查询计算当天之前的最大页数.
- 最终查询每天统计处理的唯一
pages_count
个.
它按day_date
字段对页面进行分组,并使用筛选器
where page::integer >= prev_max_page
筛选这些值.
Improved example:
I have improved insert statements to check the query for correctness. In the screenshot below you can see that pages with a value which is less than the previous day's max value are not included in the result.
Screenshot
day_date |
pages_count |
2023-07-26 |
1 |
2023-07-27 |
2 |
2023-07-28 |
1 |