我有一个用户的表历史任务,看起来是这样的.1个用户有9个任务.

user_id     task    IS_COMPLETED updated_at
123       Task 1    1            2024-01-01
123       Task 2    1            2024-01-01
123       Task 3    0            2024-01-01
123       Task 4    0            2024-01-01
123       Task 5    0            2024-01-01
123       Task 6    1            2024-01-01
123       Task 7    1            2024-01-01
123       Task 8    1            2024-01-01
123       Task 9    1            2024-01-01

我想把可能的任务组合在一起,变成这样的表.所以我知道什么样的任务组合是用户快速完成的.

combination     total_user_completed
task1_task2         20
task1_task3         15 
task1_task_4        15
task1_task_5        14
:                    : (and so on for 2 combination) 
task1_task2_task3   10
task4_task5_task_6  11
:                   : (and so on for 3 combination) 

(有关组合示例的更多详细信息:从A、B、C我得到了A-B、B-C、A-C和A-B-C之间的组合.不需要按顺序)

我曾try 使用递归SQL,但效果不佳

"递归联接内存不足.请重新运行此查询 更大的仓库"

然后我也试着用Case When来定义,但我需要像这样一个一个地定义,这并不理想.

    with all_task as (
            select 
                distinct user_id,
                task_name as task, 
                is_completed,
                updated_at
            from task
            where 
                is_completed = 1
            group by all
          ),
        agg as (
            select 
                distinct user_id, 
                activated_at as period,
                max(case when task in ('Task 1') and IS_COMPLETED = 1 then date(updated_at) end) as task_1,
                max(case when task in ('Task 2') and IS_COMPLETED = 1 then date(updated_at) end) as task_2,
                max(case when task in ('Task 3') and IS_COMPLETED = 1 then date(updated_at) end) as task_3,
                max(case when task in ('Task 4') and IS_COMPLETED = 1 then date(updated_at) end) as task_4,
                max(case when task in ('Task 5') and IS_COMPLETED = 1 then date(updated_at) end) as task_5,
                max(case when task in ('Task 6') and IS_COMPLETED = 1 then date(updated_at) end) as task_6,
                max(case when task in ('Task 7') and IS_COMPLETED = 1 then date(updated_at) end) as task_7,
                max(case when task in ('Task 8') and IS_COMPLETED = 1 then date(updated_at) end) as task_8,
                max(case when task in ('Task 9') and IS_COMPLETED = 1 then date(updated_at) end) as task_9
            from all_task 
            group by all
            ),
            task_group as (
            select user_id, 
                case when task_1 is not null and task_2 is not null then user_id end as task_1_2
                case when task_2 not null and task_3 is not null then user_id end as task_2_3
                .........
            from agg
             )
            select 'Task1_task2' as combination, 
            count(distinct task_1_2) as total
            from task_group

            union all 

            select 'task2_task3' as combination, 
            count(distinct task_2_3) as total
            from task_group
.... (and so on)

有没有解决这一问题的建议?非常感谢!

*注意:目前我需要至少2/3的任务组合.谢谢

推荐答案

因此,对于测试数据的CTE:

with data(user_id, task, is_completed, updated_at) as (
    select * from values
    (123, 'Task 1', 1, '2024-01-01'::date),
    (123, 'Task 2', 1, '2024-01-01'::date),
    (123, 'Task 3', 0, '2024-01-01'::date),
    (123, 'Task 4', 0, '2024-01-01'::date),
    (123, 'Task 5', 0, '2024-01-01'::date),
    (123, 'Task 6', 1, '2024-01-01'::date),
    (123, 'Task 7', 1, '2024-01-01'::date),
    (123, 'Task 8', 1, '2024-01-01'::date),
    (123, 'Task 9', 1, '2024-01-01'::date)
)

和CTE的2:

), combinations_2 as (
    select
        a.user_id, 
        a.task || '_' || b.task as combi
    from data as a
    join data as b
        on a.user_id = b.user_id 
        --and a.updated_at = b.updated_at
        and a.is_completed = 1 and b.is_completed = 1
        and a.task < b.task
),

enter image description here

3‘S的CTE:

), combinations_3 as (
    select
        a.user_id, 
        a.task || '_' || b.task || '_' || c.task as combi
    from data as a
    join data as b
        on a.user_id = b.user_id 
        --and a.updated_at = b.updated_at
        and a.is_completed = 1 and b.is_completed = 1
        and a.task < b.task
    join data as c
        on a.user_id = c.user_id 
        --and a.updated_at = c.updated_at
        and a.is_completed = 1 and c.is_completed = 1
        and b.task < c.task
),

enter image description here

将它们结合在一起:

), merged as (
    select * from combinations_2
    union all 
    select * from combinations_3
)

然后数一数:

select 
    combi, 
    count(user_id) 
from merged
group by 1
order by 1;

enter image description here

现在,您可能希望使用CTE来过滤这is_completed = 0行,但这也可能导致使用之间的同步,并导致性能损失.另一种方法是建立2个组合,并保留b.task,然后对它进行第三轮连接,这可能会执行得更好.

我还假设UPDATED_AT是没有意义的,并且同一ID/任务没有重复的日期,因此,同样,对数据进行预处理以删除这些日期可能会更好.因此:

with data(user_id, task, is_completed, updated_at) as (
    select * from values
    (123, 'Task 1', 1, '2024-01-01'::date),
    (123, 'Task 2', 1, '2024-01-01'::date),
    (123, 'Task 3', 0, '2024-01-01'::date),
    (123, 'Task 4', 0, '2024-01-01'::date),
    (123, 'Task 5', 0, '2024-01-01'::date),
    (123, 'Task 6', 1, '2024-01-01'::date),
    (123, 'Task 7', 1, '2024-01-01'::date),
    (123, 'Task 8', 1, '2024-01-01'::date),
    (123, 'Task 9', 1, '2024-01-01'::date)
), cleaned as (
    select distinct user_id, task
    from data
    where is_completed = 1
), combinations_2 as (
    select
        a.user_id, 
        a.task || '_' || b.task as combi
    from cleaned as a
    join cleaned as b
        on a.user_id = b.user_id 
        and a.task < b.task
), combinations_3 as (
    select
        a.user_id, 
        a.task || '_' || b.task || '_' || c.task as combi
    from cleaned as a
    join cleaned as b
        on a.user_id = b.user_id 
        and a.task < b.task
    join cleaned as c
        on a.user_id = c.user_id 
        and b.task < c.task
), merged as (
    select * from combinations_2
    union all 
    select * from combinations_3
)
select 
    combi, 
    count(user_id) 
from merged
group by 1
order by 1;

Sql相关问答推荐

带有双引号的json在Presto中是否有区别对待?

如何在T—SQL中找到值更改之前的日期?

出现5次后,将所有正斜杠替换为连字符

关于Postgres横向联接的谓词

PostgreSQL-按距离阈值挤压相邻行的性能

使用同一个表,为什么IN、NOT IN、NOT EXISTS和EXISTS有不同的输出?

SQL Server - 复杂场景 - 比较状态并填充值到后续行

在SQL中实现表格数据透视类型报表

SQL Server: 将JSON对象数组转换为表格格式

带有数组输入参数的Snowflake UDF优化

Teradata 多个进程的最大进程结束时间捕获

为 sqlite 全文搜索 (fts) 创建触发器时出现虚拟表的不安全使用

雅典娜弄错了操作顺序

Django only() 和 values() 不适用于 prefetch_related()

基于源表的 SQL INSERT、UPDATE 和 DELETE

在 MySql 数据库中的两个日期之间搜索

连接表时避免重复

创建一个将层次 struct 级别放入列中的查询

SQL Group By 然后映射出是否存在值

删除具有相同 ID 的重复记录 - Postgresql