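I'm having trouble optimizing this query, and I hope some of you database experts might have some insight. Here is the setup.

Using TimescaleDB as my database, I have a wide table containing sensor data that looks like this: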

time                   sensor_id  wind_speed  wind_direction
'2023-12-18 12:15:00'  '1'        NULL        176
'2023-12-18 12:13:00'  '1'        4           177
'2023-12-18 12:11:00'  '1'        3           NULL
'2023-12-18 12:09:00'  '1'        8           179

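I want to write a query that gives me the latest non-null value for each of a set of columns, filtered by sensor_id. For the data above (filtered on sensor_id 1), this query should return: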

wind_speed wind_direction
4 176
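
A minimal sketch to reproduce the sample above. The column types are assumed (the post only shows sample values), and a plain table stands in for the hypertable:

```sql
-- Column types are assumptions; the post only shows sample values.
CREATE TABLE sensor_data (
   "time"         timestamp NOT NULL
 , sensor_id      text      NOT NULL
 , wind_speed     integer
 , wind_direction integer
);

INSERT INTO sensor_data ("time", sensor_id, wind_speed, wind_direction) VALUES
   ('2023-12-18 12:15:00', '1', NULL, 176)
 , ('2023-12-18 12:13:00', '1', 4   , 177)
 , ('2023-12-18 12:11:00', '1', 3   , NULL)
 , ('2023-12-18 12:09:00', '1', 8   , 179);
```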

With that said, my query looks like the following (when querying sensor_ids in batches of 10):

SELECT
    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '1' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '1' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction,

    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '2' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed_two,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '2' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction_two,
    .
    .
    .
    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '10' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed_ten,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '10' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction_ten;

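The table I am querying has 1,000 unique sensor_ids, all of which report data at 2-minute intervals. We are talking 100s of millions of rows.

I created an index on (sensor_id, time DESC) to further optimize the query. With the index, this query takes about 400ms of planning time and about 50ms of execution time.

How can I write the query differently (or add indexes) to achieve the best planning and execution time?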
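
For reference, the index and the timing measurement described above might look like this. This is only a sketch: the index name is made up, and the EXPLAIN shows a single one of the twenty subqueries:

```sql
-- Index name is hypothetical; the post only gives the column list.
CREATE INDEX sensor_data_sensor_id_time_idx
ON     sensor_data (sensor_id, "time" DESC);

-- "Planning Time" and "Execution Time" are reported at the end of the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT wind_speed
FROM   sensor_data
WHERE  sensor_id = '1'
AND    "time" > now() - interval '7 days'
AND    wind_speed IS NOT NULL
ORDER  BY "time" DESC
LIMIT  1;
```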

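Recommended answer

Unfortunately, Postgres does not (as of Postgres 16) implement IGNORE NULLS for window functions. That would allow a simple first_value() call per value column.

Solutions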


There are various shorter and possibly (much) faster options.
You should at least have a (partial) index on (ts). Possibly on (sensor_id, ts). Or more. See below. All depending on undisclosed details.

I find the name "time" misleading for a timestamp column. Using "ts" instead.

first_value() + DISTINCT ON

A shorter drop-in replacement.

SELECT DISTINCT ON (sensor_id)
       sensor_id
     , first_value(wind_speed    ) OVER (w ORDER BY wind_speed     IS NULL, ts DESC) AS wind_speed
     , first_value(wind_direction) OVER (w ORDER BY wind_direction IS NULL, ts DESC) AS wind_direction
--   , ... more?
FROM   sensor_data
WHERE  ts > LOCALTIMESTAMP - interval '7 days'
WINDOW w AS (PARTITION BY sensor_id);


count() window function in subquery + filtered aggregate in main

SELECT sensor_id
     , min(wind_speed)     FILTER (WHERE ws_ct = 1) AS wind_speed
     , min(wind_direction) FILTER (WHERE wd_ct = 1) AS wind_direction
--   , ... more?
FROM  (
   SELECT *
        , count(wind_speed)     OVER w AS ws_ct
        , count(wind_direction) OVER w AS wd_ct
   --   ,  ... more?
   FROM   sensor_data
   WHERE  ts > LOCALTIMESTAMP - interval '7 days'
   WINDOW w AS (PARTITION BY sensor_id ORDER BY ts DESC)
   ) sub
GROUP  BY sensor_id;


Simpler based on a "sensor" table

If you also have a table "sensor" with one row per relevant sensor_id (which you probably should), then it gets simpler:

SELECT sensor_id
    , (SELECT wind_speed     FROM sensor_data WHERE sensor_id = s.sensor_id AND ts > t.ts_min AND wind_speed     IS NOT NULL ORDER BY ts DESC LIMIT 1) AS wind_speed
    , (SELECT wind_direction FROM sensor_data WHERE sensor_id = s.sensor_id AND ts > t.ts_min AND wind_direction IS NOT NULL ORDER BY ts DESC LIMIT 1) AS wind_direction
--  , ... more?
FROM   sensor s
    , (SELECT LOCALTIMESTAMP - interval '7 days') t(ts_min)
;

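The last query (like the verbose original) can use tailored indexes. Ideally partial indexes, since there are many rows per sensor, few value columns, many null values, and many outdated rows.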

CREATE INDEX sensor_data_wind_speed_idx     ON sensor_data (sensor_id, ts DESC, wind_speed)
WHERE  wind_speed IS NOT NULL
AND    ts > '2023-12-12 00:00';  -- constant!

CREATE INDEX sensor_data_wind_direction_idx ON sensor_data (sensor_id, ts DESC, wind_direction)
WHERE  wind_direction IS NOT NULL
AND    ts > '2023-12-12 00:00';  -- constant!

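Use a constant one week in the past at creation time. The index grows over time but stays applicable. Recreate it with a later cutoff from time to time to keep its size in check. (Not sure whether the timestamp bound pays off for your hypertables, though. A plain index may be good enough. I only have plain Postgres.)

Then run the same query, but with a constant timestamp: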
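
Refreshing the index could look roughly like this. This is a sketch for plain Postgres: the "_new" name and the later cutoff date are placeholders, and whether the CONCURRENTLY variants work unchanged on a hypertable is not tested here:

```sql
-- Build the replacement first, with a later cutoff (the date is only an example).
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY sensor_data_wind_speed_idx_new
ON     sensor_data (sensor_id, ts DESC, wind_speed)
WHERE  wind_speed IS NOT NULL
AND    ts > '2023-12-19 00:00';

-- Then drop the old index and take over its name.
DROP INDEX CONCURRENTLY sensor_data_wind_speed_idx;
ALTER INDEX sensor_data_wind_speed_idx_new RENAME TO sensor_data_wind_speed_idx;
```

The same steps apply to the wind_direction index. Queries must then filter with a constant at or after the new cutoff, or the partial index no longer qualifies.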

SELECT ...
FROM   sensor s
    , (SELECT timestamp '2023-12-12 03:47:16') t(ts_min)  -- MUST be a constant to use partial index!
;

Sorted subquery + first() aggregate function

If index support is not an option or not efficient, the most convenient query would be with the aggregate function first(). It is probably fastest, too, if you use the C version from the additional module first_last_agg.

Once per database:

CREATE EXTENSION first_last_agg;

SELECT sensor_id
     , first(wind_speed    ) FILTER (WHERE wind_speed IS NOT NULL)     AS wind_speed
     , first(wind_direction) FILTER (WHERE wind_direction IS NOT NULL) AS wind_direction
--   , ... more?
FROM   (
   SELECT * FROM sensor_data
   WHERE  ts > LOCALTIMESTAMP - interval '7 days'
   ORDER  BY ts DESC
   ) s
GROUP  BY 1;
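
If installing the extension is not possible, a plain-SQL aggregate can stand in. It is typically slower than the C version. A minimal sketch, where the names first_agg and first_sql are made up for illustration (chosen so they do not clash with an existing first()):

```sql
-- Transition function: always keep the state that is already there.
-- Because it is STRICT and the aggregate has no initial value, Postgres skips
-- NULL inputs and seeds the state with the first non-null input.
CREATE FUNCTION first_agg(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
AS 'SELECT $1';

CREATE AGGREGATE first_sql(anyelement) (
   SFUNC = first_agg
 , STYPE = anyelement
);

-- Same shape as the query above; NULLs are skipped by the aggregate itself,
-- so no FILTER clause is needed.
SELECT sensor_id
     , first_sql(wind_speed)     AS wind_speed
     , first_sql(wind_direction) AS wind_direction
FROM  (
   SELECT * FROM sensor_data
   WHERE  ts > LOCALTIMESTAMP - interval '7 days'
   ORDER  BY ts DESC
   ) s
GROUP  BY 1;
```

Like the query above, this relies on the aggregate receiving rows in the order produced by the sorted subquery.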
