这个SQL查询需要90分钟在DuckDB中运行我的百万行表.我使用的是自连接模式,因为这是我发现这个查询工作的唯一方法,但我确信一定有一个更快的查询模式.
-
查询从一个通用表表达式(CTE)开始,它 Select
main.openorder_tbl
表中的所有列,并根据"发票日期"列为每行分配行号. -
然后,主查询在CTE上执行自联接,其中它将每行(
self
)与属于同一"组"且满足与"发票日期"和"到期日"相关的某些条件的所有先前行(other
)联接起来. -
加入的条件是:
- 第
self
行的"发票日期"比第other
行的"到期日"晚了90天以上. - 第
other
行的"截止日期"比其"截止日期"晚了90天. - 第
other
行的"到期日"在第self
行的"发票日期"之前的200天内.
-
对于每
self
行,查询会统计满足这些条件的other
行的数量. -
结果是一个包括
self
行中的所有列以及每self
行other
行的计数的表. -
输出按第
self
行的行号排序.
此查询实质上计算同一组中满足某些日期相关条件的行的滚动计数.自连接是必要的,因为我将每行与同一组中的所有前面的行进行比较.
Example Input Table
row_number | Invoice Date | Due Date | Date Closed | days | amount | group |
---|---|---|---|---|---|---|
1 | 2022-10-22 | 2022-11-21 | 2024-03-01 | 466 | 111 | B |
2 | 2022-10-22 | 2022-11-21 | 2023-03-01 | 100 | 111 | B |
3 | 2022-10-24 | 2022-11-23 | 2023-03-03 | 100 | 150 | B |
4 | 2022-10-31 | 2022-11-30 | 2023-03-05 | 95 | 300 | A |
5 | 2022-11-10 | 2022-12-10 | 2023-02-02 | 54 | 180 | A |
6 | 2022-11-21 | 2022-12-21 | 2023-02-04 | 45 | 220 | B |
7 | 2022-12-04 | 2023-01-03 | 2024-01-04 | 366 | 210 | B |
8 | 2022-12-19 | 2023-01-18 | 2023-01-20 | 2 | 100 | A |
9 | 2023-01-01 | 2023-01-31 | 2023-02-20 | 20 | 200 | B |
10 | 2023-01-22 | 2023-02-21 | 2023-06-01 | 100 | 280 | B |
11 | 2023-02-28 | 2023-03-30 | 2023-04-02 | 3 | 250 | A |
12 | 2023-05-14 | 2023-06-13 | 2023-09-01 | 80 | 21 | A |
13 | 2023-06-18 | 2023-07-18 | 2023-10-01 | 75 | 456 | A |
14 | 2023-07-02 | 2023-08-01 | 2023-08-01 | 0 | 320 | B |
SQL
WITH cte AS (
SELECT
ROW_NUMBER() OVER (ORDER BY "Invoice Date") AS row_number,
"Invoice Date",
"Due Date",
"Date Closed",
"days",
"amount",
"group"
FROM main.openorder_tbl
)
select
first(
columns(self.*)
),
count(other."row_number")
from
cte self left join cte other
on
self."group" = other."group"
and
self."row_number" > other."row_number"
AND
(self."Invoice Date" - other."Due Date") > 90
AND
(other."Date Closed" - other."Due Date") > 90
AND
other."Due Date" > (self."Invoice Date" -200)
group by
self."row_number"
order by
self."row_number"
Output
first(self.row_number) | first(self."Invoice Date") | first(self."Due Date") | first(self."Date Closed") | first(self."days") | first(self.amount) | first(self."group") | count(other.row_number) |
---|---|---|---|---|---|---|---|
1 | 2022-10-22 | 2022-11-21 | 2024-03-01 | 466 | 111 | B | 0 |
2 | 2022-10-22 | 2022-11-21 | 2023-03-01 | 100 | 111 | B | 0 |
3 | 2022-10-24 | 2022-11-23 | 2023-03-03 | 100 | 150 | B | 0 |
4 | 2022-10-31 | 2022-11-30 | 2023-03-05 | 95 | 300 | A | 0 |
5 | 2022-11-10 | 2022-12-10 | 2023-02-02 | 54 | 180 | A | 0 |
6 | 2022-11-21 | 2022-12-21 | 2023-02-04 | 45 | 220 | B | 0 |
7 | 2022-12-04 | 2023-01-03 | 2024-01-04 | 366 | 210 | B | 0 |
8 | 2022-12-19 | 2023-01-18 | 2023-01-20 | 2 | 100 | A | 0 |
9 | 2023-01-01 | 2023-01-31 | 2023-02-20 | 20 | 200 | B | 0 |
10 | 2023-01-22 | 2023-02-21 | 2023-06-01 | 100 | 280 | B | 0 |
11 | 2023-02-28 | 2023-03-30 | 2023-04-02 | 3 | 250 | A | 0 |
12 | 2023-05-14 | 2023-06-13 | 2023-09-01 | 80 | 21 | A | 1 |
13 | 2023-06-18 | 2023-07-18 | 2023-10-01 | 75 | 456 | A | 0 |
14 | 2023-07-02 | 2023-08-01 | 2023-08-01 | 0 | 320 | B | 2 |
我试过使用窗口函数,但问题是如果我使用窗口函数,SQL不会允许我在它内部使用条件,这当然是它应该如何工作的.也许有一种更快的方法,那就是用窗口函数或类似的方法进行子查询.我绝对不是一个SQL专家.不幸的是我不知所措,我知道我需要在进行滚动计数之前尽可能地对表进行筛选,以便它运行得更快,但我不确定在这种情况下是否有方法做到这一点.
下面是一个更大的表的解释分析输出,这样时间更真实,列名略有不同,但逻辑是相同的:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE
---Start CTE with index
WITH cte
AS (SELECT ROW_NUMBER() OVER (ORDER BY "Invoice Date") AS "row_number",
"Doc|PayItem",
"Invoice Date",
"Date Closed",
"Due Date",
"DaysPastDue",
"Gross Amount",
"ParentName"
FROM main.df_CustLedger
)
---select the first value of
select first(columns(self."Doc|PayItem")) as "Doc|PayItem",
COALESCE(count(other."row_number"), 0) as "NumberOfPaidInvoices_std"
-- COALESCE(sum(other."Gross Amount"), 0) as "TotalGrossAmtPaid_std",
-- COALESCE(AVG(other."DaysPastDue"), 0) as "MeanDelay_std"
from cte self
left join cte other
on self."ParentName" = other."ParentName"
and self."row_number" > other."row_number"
and (self."Invoice Date" - other."Due Date") > 90
and (other."Date Closed" - other."Due Date") > 90
and (other."Due Date" > (self."Invoice Date" -200))
group by self."row_number"
order by self."row_number"
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 272.71s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ ORDER_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ORDERS: │
│ #3 ASC │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (0.39s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Doc|PayItem │
│ Doc|PayItem │
│ NumberOfPaidInvoices_std │
│ row_number │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (0.03s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ first(#1) │
│ first(#2) │
│ count(#3) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (10.87s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ row_number │
│ Doc|PayItem │
│ Doc|PayItem │
│ row_number │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 119091014 │
│ (0.88s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ BLOCKWISE_NL_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ LEFT │
│ (((Invoice Date - 200) < │
│ Due Date) AND ((row_n... │
│ row_number) AND (((Invoice│
│ Date - Due Date) > 90... ├──────────────┐
│(ParentName = ParentName)))│ │
│ ) │ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │
│ 119091014 │ │
│ (962.51s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ row_number ││ row_number │
│ Doc|PayItem ││ Doc|PayItem │
│ Invoice Date ││ Due Date │
│ ParentName ││ ParentName │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 ││ 116421 │
│ (0.00s) ││ (0.00s) │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 ││ #1 │
│ #1 ││ #2 │
│ #2 ││ #3 │
│ #3 ││ #4 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ #5 │
│ 1628109 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (0.00s) ││ 116421 │
│ ││ (0.00s) │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ WINDOW ││ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ROW_NUMBER() OVER (ORDER BY││((Date Closed - Due Date) >│
│ Invoice Date ASC NULLS ││ 90) │
│ LAST) ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ EC: 1628109 │
│ 1628109 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (0.74s) ││ 116421 │
│ ││ (0.01s) │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ df_CustLedger ││ #0 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ #1 │
│ Invoice Date ││ #2 │
│ Doc|PayItem ││ #3 │
│ ParentName ││ #4 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ #5 │
│ EC: 1628109 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ 1628109 │
│ 1628109 ││ (0.00s) │
│ (0.09s) ││ │
└───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ WINDOW │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ROW_NUMBER() OVER (ORDER BY│
│ Invoice Date ASC NULLS │
│ LAST) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (1.82s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ df_CustLedger │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Invoice Date │
│ Doc|PayItem │
│ Date Closed │
│ Due Date │
│ ParentName │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 1628109 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1628109 │
│ (0.14s) │
└───────────────────────────┘
我欢迎任何优化查询的 idea .