我在PostgreSQL v13.10中运行了以下SQL:
WITH stuckable_statuses AS (
SELECT status_id FROM status_descriptions
WHERE (tags @> ARRAY['stuckable']::varchar[])
)
SELECT jobs.* FROM jobs
WHERE jobs.status = ANY(select status_id from stuckable_statuses)
当用ID数组EX替换ANY(select status_id from stuckable_statuses)
时,运行速度非常慢.(1,2,3)
真的跑得很快.
下面是该查询的解释分析:
Gather (cost=1005.64..5579003.45 rows=1563473 width=2518) (actual time=45.495..40138.515 rows=303 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Semi Join (cost=5.64..5421656.15 rows=651447 width=2518) (actual time=44.533..40126.793 rows=101 loops=3)
Hash Cond: (jobs.status = status_descriptions.status_id)
-> Parallel Seq Scan on jobs (cost=0.00..5378777.15 rows=13571815 width=2518) (actual time=0.892..38662.091 rows=10537079 loops=3)
-> Hash (cost=5.56..5.56 rows=6 width=4) (actual time=0.377..0.378 rows=11 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on status_descriptions (cost=0.00..5.56 rows=6 width=4) (actual time=0.310..0.370 rows=11 loops=3)
Filter: (tags @> '{stuckable}'::character varying[])
Rows Removed by Filter: 146
Planning Time: 0.711 ms
Execution Time: 40138.654 ms
以下是表格定义(摘自Rails‘schema.rb
):
create_table "jobs", id: :serial, force: :cascade do |t|
t.string "filename"
t.string "sandbox"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.integer "status", default: 0, null: false
t.integer "provider_id"
t.integer "lang_id"
t.integer "profile_id"
t.datetime "extra_date"
t.datetime "main_date"
t.datetime "performer_id"
t.index ["provider_id", "status", "extra_date"], name: "jobs_on_media_provider_id__status__extra_date"
t.index ["provider_id", "status", "main_date"], name: "jobs_on_media_provider_id_and_status_and_due_date"
t.index ["profile_id", "status", "extra_date"], name: "index_jobs_on_profile_id__status__extra_date"
t.index ["profile_id", "status", "main_date"], name: "index_transcription_jobs_on_profile_id_and_status_and_due_date"
t.index ["status", "sandbox", "lang_id", "extra_date"], name: "index_jobs_on_status__sandbox__lang_id__extra_date"
t.index ["status", "sandbox", "lang_id", "main_date"], name: "index_jobs_on_status_and_sandbox_and_lang_id_and_due_date"
t.index ["performer_id", "status", "extra_date"], name: "index_jobs_on_performer_id__status__extra_date"
t.index ["performer_id", "status", "main_date"], name: "index_jobs_on_performer_id_and_status_and_due_date"
end
create_table "status_descriptions", id: :serial, force: :cascade do |t|
t.integer "status_id"
t.string "title"
t.string "tags", array: true
t.index ["status_id"], name: "index_status_descriptions_on_status_id"
end
与使用ARRAY的相同SQL相比,我可以看到它没有使用INDEX BY JOBS.STATUS,这可能是因为JOBS表非常大(大约15kk行),而STATUS_DESCRIPTIONSION大约有200行.
如果可能的话,你能帮我优化一下这个SQL吗?
谢谢!
最新情况:
以下是具有硬编码数组的查询:
SELECT jobs.* FROM transcription_jobs
WHERE jobs.status IN (2, 3, 4, 291, 290, 46, 142, 260, 6, 7, 270)
下面是它的解释分析:
Index Scan using index_jobs_on_status__sandbox__lang_id__current_stage_due_date on jobs (cost=0.56..98661.05 rows=26541 width=2518) (actual time=0.032..63.266 rows=483 loops=1)
Index Cond: (status = ANY ('{2,3,4,291,290,46,142,260,6,7,270}'::integer[]))
Planning Time: 0.356 ms
Execution Time: 63.337 ms