假设我有一个包含3列的表:ID、DATE_TIME、COLOR.数据如下所示:

id, date_time, color
1, 2023-10-01 12:15, green
1, 2023-10-01 12:16, yellow
1, 2023-10-01 12:17, yellow
1, 2023-10-01 12:18, red
1, 2023-10-01 12:19, yellow
1, 2023-10-01 12:20, yellow
1, 2023-10-01 12:21, red
1, 2023-10-01 12:22, red
1, 2023-10-01 12:23, green
1, 2023-10-01 12:24, yellow
1, 2023-10-01 12:25, yellow
1, 2023-10-01 12:26, red
2, 2023-10-01 12:27, red
2, 2023-10-01 12:28, green
2, 2023-10-01 12:29, green
2, 2023-10-01 12:30, yellow

我需要计算"黄色"值在"COLOR"列中出现的频率,按"id"列分组,按DATE_TIME排序.然而,我有具体的条件:

  1. 如果"黄色"出现在"绿色"之后,我想把它算作"黄色".
  2. 只有当它后面跟第一个"red",或者它是由"id"定义的组中的最后一个值时,我才想将"Huang"计算在内.

它看起来像一个组中的子窗口.

我在AWS Athena中使用SQL Presto,我认为我应该使用窗口函数,但我不确定如何指定这些条件.

事先感谢你的提示

因此,预期结果应该是:

For id=1: Count "yellow" = 4
For id=2: Count "yellow" = 1

我试过了,但没有一个在条件下重复黄色的累积计数器.

`with testdata(id, date_time, color) as (
  VALUES 
  (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
)
,t1 as (
  SELECT id,
    date_time,
    color,
    LAG(color) OVER (
      PARTITION BY id
      ORDER BY date_time
    ) AS prev_color,
    LEAD(color) OVER (
      PARTITION BY id
      ORDER BY date_time
    ) AS next_color
  FROM testdata
)
select id,
  SUM(
    CASE
      WHEN color = 'yellow'
      AND (
        prev_color = 'green'
        and (
          next_color IS NULL
          OR next_color = 'red'
          OR next_color = 'yellow'
        )
      ) THEN 1 ELSE 0
    END
  )  AS yellow_count
FROM t1
group by id`

我有自己的价值观

  • 对于id=1:计数"黄色"=2(不正确)
  • 对于id=2:计数"黄色"=1(正确)

推荐答案

如果存在间隙和孤岛问题,可以使用两个ROW_NUMBERS之间的差值 for each 连续的行组提供唯一的ID:

with data (id, date_time, color) as (
  VALUES 
  (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
),
cte AS (
  SELECT id, date_time, color, 
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time)
         - ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS rn
  FROM data
),
cte2 AS (
  SELECT id, rn, 
         MAX(color) AS color,
         MAX(date_time) AS date_time,
         SUM(CASE WHEN color = 'yellow' THEN 1 END) AS total
  FROM cte c1
  GROUP BY id, rn
  ORDER BY date_time
),
cte3 AS (
  SELECT *, LAG(color) OVER ( PARTITION BY id ORDER BY date_time) AS prev_color,
          LEAD(color) OVER ( PARTITION BY id ORDER BY date_time) AS next_color 
  FROM cte2
)
SELECT id, SUM(total) AS total
FROM cte3
WHERE color = 'yellow' AND prev_color = 'green' AND ( next_color = 'red' OR next_color is null)
GROUP BY id

解释:

  1. 使用两个ROW_NUMBERS之间的差异 for each 连续的行组提供唯一的ID.

  2. 按id和rn对数据进行分组,以便轻松获得每种 colored颜色 的上一个和下一个值.

  3. 然后将您的条件应用于CTE3的结果,以获得预期的数据.

结果:

ID
1 4
2 1

Demo on postgresql

Sql相关问答推荐

使用自动增量ID插入失败(无法将值空插入列ID)

连接特定行号

Android房间fts4匹配语法AND OR

从日期开始向前填充重复项

如何利用单列历史SQLsnowflake获得合并结果

查询页面推荐

我可以在SQLite3中使用BLOB作为主键吗?

IF NOT EXISTS子查询的性能瓶颈

明细表中没有记录如何更新主表的值为0

在WHERE EXISTS子查询中,列返回是否重要?

从给定数据中查找下一个工作日期

按行值出现的顺序对行值进行分组

SAS proc freq 或 proc sql 获取数据子集和整个数据的频率

按公司和产品查询最近发票的平均价格的SQL查询

使用对 nvarchar 列的多个 LIKE 操作优化 SQL 查询

Select 字段,除非另一个字段包含重复项

如何防止 SQL 中的负收入值并将其重新分配到接下来的月份?

SELECT 用于 Parent、Children 和 ORDER BY [Order] 列

交叉应用 OPENJSON / PIVOT - 错误的顺序

BigQuery 错误:SELECT 列表表达式引用 esthetician.LICENSE_TYPE,它既未在 [49:8] 分组也未聚合