EXISTS vs. LEFT JOIN performance comparison

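Today I ran into a performance problem. While tuning it, I found that the most expensive part of the plan was the EXISTS part, which involves three tables.

My plan was to replace the EXISTS with a LEFT JOIN. Most of the material I found says that EXISTS and LEFT JOIN perform about the same, so I ran the following experiment to check that claim.

Steps:

1. Create the test tables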

drop table if exists app_family;
CREATE TABLE app_family (
"family_id" character varying(32 char) NOT NULL,
"application_id" character varying(32 char) NULL,
"family_number" character varying(50 char) ,
"household_register_number" character varying(50 char),
"poverty_reason" character varying(32 char),
CONSTRAINT "pk_app_family_idpk" PRIMARY KEY (family_id));
insert into app_family select generate_series(1,1000000),generate_series(1,1000000),'aaaa','aaa','bbb' from dual ;

create table app_family2 as select * from app_family;

create table app_memeber as select * from app_family;

2. Compare JOIN and EXISTS on two tables

Statement 1: EXISTS on two tables

explain analyze select a1.application_id,a1.family_id from app_family a1 where
a1.family_id >1000 and
EXISTS(
SELECT
1
FROM
app_family2 a2
WHERE
a2.application_id=a1.application_id
and a2.family_id > 500000
)

Total time: 404 ms

Statement 2: JOIN on two tables

explain analyze select a1.application_id,a1.family_id from app_family a1 LEFT JOIN app_family2 a2 ON a2.application_id=a1.application_id
WHERE a1.family_id >1000 AND a2.family_id > 500000

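Total execution time: 416 ms

Summary of the two-table scenario

In the two-table comparison the JOIN was about 12 ms slower (416 ms vs. 404 ms), but overall the difference is small. So when two tables are involved, JOIN and EXISTS perform about the same.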
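
One caveat when reading this comparison: the two statements only return the same rows when app_family2.application_id is unique, as it is in this test data. EXISTS is a semi-join and returns each a1 row at most once, while a join returns one row per match. Also, because the WHERE clause filters on a2.family_id, the NULL-extended rows are discarded and the LEFT JOIN behaves like an inner join. The following is a minimal sketch of both points. It uses Python's built-in sqlite3 as a stand-in rather than the database used above, and the tiny tables and integer column types are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in (assumption: not the original DB).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_family  (family_id INTEGER, application_id INTEGER);
    CREATE TABLE app_family2 (family_id INTEGER, application_id INTEGER);
    INSERT INTO app_family  VALUES (1, 1), (2, 2), (3, 3);
    -- application_id 1 appears twice; application_id 3 has no match at all
    INSERT INTO app_family2 VALUES (10, 1), (11, 1), (12, 2);
""")

# Semi-join: each a1 row is returned at most once.
exists_rows = conn.execute("""
    SELECT a1.application_id FROM app_family a1
    WHERE EXISTS (SELECT 1 FROM app_family2 a2
                  WHERE a2.application_id = a1.application_id
                    AND a2.family_id > 5)
    ORDER BY a1.application_id
""").fetchall()

# LEFT JOIN with a WHERE filter on the right-hand table.
left_rows = conn.execute("""
    SELECT a1.application_id FROM app_family a1
    LEFT JOIN app_family2 a2 ON a2.application_id = a1.application_id
    WHERE a2.family_id > 5
    ORDER BY a1.application_id
""").fetchall()

# Plain inner join with the same filter.
inner_rows = conn.execute("""
    SELECT a1.application_id FROM app_family a1
    JOIN app_family2 a2 ON a2.application_id = a1.application_id
    WHERE a2.family_id > 5
    ORDER BY a1.application_id
""").fetchall()

print(exists_rows)              # [(1,), (2,)]
print(left_rows)                # [(1,), (1,), (2,)]
print(left_rows == inner_rows)  # True
```

With unique keys on both sides, as in the 1,000,000-row test tables, all three queries return the same rows, which is why the timings above can be compared directly.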

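3. Compare JOIN and EXISTS on three tables

Statement 1: EXISTS on three tables

When I first ran this scenario, the EXISTS statement took more than 6 s because of a sort step that was limited by work_mem. After adjusting work_mem to rule out the effect of that sort, the final execution time was 2036 ms.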

explain analyze select a1.application_id,a1.family_id from app_family a1 ,app_family2 a2 where
a1.family_id >1000 and a2.family_id < 900000 and
EXISTS(
SELECT
1
FROM
app_memeber m
WHERE
m.application_id=a1.application_id
and m.family_id=a2.family_id
)

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Gather  (cost=61282.11..88664.67 rows=111111 width=12) (actual time=1347.081..1976.824 rows=898999 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Hash Join  (cost=60282.11..76553.57 rows=46296 width=12) (actual time=1386.451..1859.819 rows=299666 loops=3)
        Hash Cond: ((m.family_id)::text = (a2.family_id)::text)
        ->  Hash Join  (cost=44898.00..60455.72 rows=138889 width=18) (actual time=1186.741..1486.233 rows=333000 loops=3)
              Hash Cond: ((a1.application_id)::text = (m.application_id)::text)
              ->  Parallel Seq Scan on app_family a1  (cost=0.00..13648.00 rows=138889 width=12) (actual time=0.074..76.516 rows=333000 loops=3)
                    Filter: ((family_id)::integer > 1000)
                    Rows Removed by Filter: 333
              ->  Hash  (cost=32398.00..32398.00 rows=1000000 width=12) (actual time=1181.785..1181.786 rows=1000000 loops=3)
                    Buckets: 1048576 Batches: 1 Memory Usage: 52897kB
                    ->  HashAggregate  (cost=22398.00..32398.00 rows=1000000 width=12) (actual time=618.475..910.329 rows=1000000 loops=3)
                          Group Key: (m.application_id)::text, (m.family_id)::text
                          ->  Seq Scan on app_memeber m  (cost=0.00..17398.00 rows=1000000 width=12) (actual time=0.070..158.845 rows=1000000 loops=3)
        ->  Parallel Hash  (cost=13648.00..13648.00 rows=138889 width=6) (actual time=198.377..198.377 rows=300000 loops=3)
              Buckets: 1048576 (originally 524288) Batches: 1 (originally 1) Memory Usage: 47552kB
              ->  Parallel Seq Scan on app_family2 a2  (cost=0.00..13648.00 rows=138889 width=6) (actual time=0.098..72.933 rows=300000 loops=3)
                    Filter: ((family_id)::integer < 900000)
                    Rows Removed by Filter: 33334
Planning Time: 0.495 ms
Execution Time: 2036.815 ms

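Statement 2: JOIN on three tables

To keep the two statements comparable, the three tables are joined in the same order as in the execution plan of statement 1. The JOIN took 897 ms in total.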

explain analyze select a1.application_id,a1.family_id from app_family a1
left join app_memeber m on a1.application_id = m.application_id LEFT JOIN app_family2 a2 ON m.family_id = a2.family_id
WHERE a1.family_id >1000 AND a2.family_id < 900000

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Gather  (cost=31768.22..57298.93 rows=111111 width=12) (actual time=386.163..865.284 rows=898999 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Hash Join  (cost=30768.22..45187.83 rows=46296 width=12) (actual time=381.211..765.177 rows=299666 loops=3)
        Hash Cond: ((m.family_id)::text = (a2.family_id)::text)
        ->  Parallel Hash Join  (cost=15384.11..29089.98 rows=138889 width=18) (actual time=191.233..413.917 rows=333000 loops=3)
              Hash Cond: ((m.application_id)::text = (a1.application_id)::text)
              ->  Parallel Seq Scan on app_memeber m  (cost=0.00..11564.67 rows=416667 width=12) (actual time=0.025..33.077 rows=333333 loops=3)
              ->  Parallel Hash  (cost=13648.00..13648.00 rows=138889 width=12) (actual time=189.919..189.920 rows=333000 loops=3)
                    Buckets: 1048576 (originally 524288) Batches: 1 (originally 1) Memory Usage: 59200kB
                    ->  Parallel Seq Scan on app_family a1  (cost=0.00..13648.00 rows=138889 width=12) (actual time=0.060..78.600 rows=333000 loops=3)
                          Filter: ((family_id)::integer > 1000)
                          Rows Removed by Filter: 333
        ->  Parallel Hash  (cost=13648.00..13648.00 rows=138889 width=6) (actual time=188.522..188.523 rows=300000 loops=3)
              Buckets: 1048576 (originally 524288) Batches: 1 (originally 1) Memory Usage: 47584kB
              ->  Parallel Seq Scan on app_family2 a2  (cost=0.00..13648.00 rows=138889 width=6) (actual time=0.052..81.396 rows=300000 loops=3)
                    Filter: ((family_id)::integer < 900000)
                    Rows Removed by Filter: 33334
Planning Time: 0.167 ms
Execution Time: 897.552 ms

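Summary of the three-table scenario

In the three-table scenario EXISTS took 2036 ms and JOIN took 897 ms, so the JOIN was clearly faster than EXISTS here (roughly 2.3 times).

The plans show why. When scanning the middle table app_memeber, the EXISTS statement adds a HashAggregate with a Group Key on (application_id, family_id), and that is what increases its execution time. This branch runs as a non-parallel Seq Scan, HashAggregate and Hash over all 1,000,000 rows in each of the three processes (loops=3), and the hash is not ready until about 1182 ms, whereas the JOIN plan uses Parallel Seq Scan and Parallel Hash throughout. If work_mem is not configured appropriately, the EXISTS statement takes even longer.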
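
The HashAggregate step is not arbitrary. EXISTS only asks whether at least one matching app_memeber row exists, so to run it as a regular hash join the planner first reduces app_memeber to distinct (application_id, family_id) pairs. The JOIN version skips that step, which is cheaper, but it only returns the same rows when those pairs are already unique, as they are in this test data. Here is a small sketch of the difference, using Python's built-in sqlite3 as a stand-in; the made-up rows and integer column types are assumptions, not the original setup.

```python
import sqlite3

# In-memory SQLite database as a stand-in (assumption: not the original DB).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_family  (family_id INTEGER, application_id INTEGER);
    CREATE TABLE app_family2 (family_id INTEGER, application_id INTEGER);
    CREATE TABLE app_memeber (family_id INTEGER, application_id INTEGER);
    INSERT INTO app_family  VALUES (1, 1), (2, 2);
    INSERT INTO app_family2 VALUES (1, 1), (2, 2);
    -- the pair (family_id=1, application_id=1) is duplicated on purpose
    INSERT INTO app_memeber VALUES (1, 1), (1, 1), (2, 2);
""")

# Same shape as the three-table EXISTS statement (filters omitted for brevity).
exists_rows = conn.execute("""
    SELECT a1.application_id, a1.family_id
    FROM app_family a1, app_family2 a2
    WHERE EXISTS (SELECT 1 FROM app_memeber m
                  WHERE m.application_id = a1.application_id
                    AND m.family_id = a2.family_id)
    ORDER BY 1, 2
""").fetchall()

# Plain join through app_memeber: duplicates in m multiply the output.
join_rows = conn.execute("""
    SELECT a1.application_id, a1.family_id
    FROM app_family a1
    JOIN app_memeber m ON a1.application_id = m.application_id
    JOIN app_family2 a2 ON m.family_id = a2.family_id
    ORDER BY 1, 2
""").fetchall()

# De-duplicating m first (what the HashAggregate does) restores equivalence.
dedup_rows = conn.execute("""
    SELECT a1.application_id, a1.family_id
    FROM app_family a1
    JOIN (SELECT DISTINCT application_id, family_id FROM app_memeber) m
      ON a1.application_id = m.application_id
    JOIN app_family2 a2 ON m.family_id = a2.family_id
    ORDER BY 1, 2
""").fetchall()

print(exists_rows)  # [(1, 1), (2, 2)]
print(join_rows)    # [(1, 1), (1, 1), (2, 2)]
print(dedup_rows)   # [(1, 1), (2, 2)]
```

So rewriting EXISTS as a JOIN is a safe optimization only when the join keys on the probed table are known to be unique; otherwise the extra rows have to be removed some other way, which gives back part of the saving.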
