A Summary of Hive Execution Optimization Issues - 526互联

A discussion of Hive's predicate pushdown and optimizer behavior

The core structure of the SQL is roughly as follows:

with user_table as (
	select 
		user_id
	from 
		user
)


select 
	t1.user_id
from 
(	
	select
		t1.user_id,
		....
	from 
	(
		select 
			user_id
		from 
			user_1
	) t1 
	left join table_1_1 on ....
	left join table_1_2 on ....
	left join table_1_3 on ....
	
	
	union all 
	
	select
		t1.user_id,
		...

	from 
	(
		select 
			user_id
		from 
			user_2
	) t1 
	left join table_2_1 on ....
	left join table_2_2 on ....
	left join table_2_3 on ....
		
	union all 
	
	select
		t1.user_id,
		...
	from
	(
		select 
			user_id
		from 
			user_3
	) t1 
	left join table_3_1 on ....
	left join table_3_2 on ....
	left join table_3_3 on ....

) t1 
left join (
	select 
		user_id
	from 
		user_table
) t2 on t1.user_id = t2.user_id
where 
	t2.user_id is null

Background:
user_1, user_2, user_3, and user are all very large tables.

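The intent is to take the user_id values from the union of the three tables and exclude those that appear in user_table. However, the execution plan shows that the join against user_table is still executed last.
Ideally, the optimizer would identify the driving table of each subquery and exclude the rows present in user before that table's left joins run, reducing the data shuffled by the subsequent left joins.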
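
The manual rewrite described above can be sketched for the user_1 branch as follows. This is only an illustration under assumptions: the join keys on table_1_1, table_1_2 and table_1_3 are elided in the original, so joining on user_id here is hypothetical, as are the aliases u, x, a, b and c. The table name user is quoted with backticks because some Hive versions may treat it as a reserved word.

```sql
with user_table as (
	select
		user_id
	from
		`user`
)

select
	t1.user_id
from
(
	-- Anti-join first: drop the users that exist in user_table
	-- before any of the wider left joins run.
	select
		u.user_id
	from
		user_1 u
	left join user_table x on u.user_id = x.user_id
	where
		x.user_id is null
) t1
-- Hypothetical join keys; the original conditions are elided.
left join table_1_1 a on t1.user_id = a.user_id
left join table_1_2 b on t1.user_id = b.user_id
left join table_1_3 c on t1.user_id = c.user_id
```

The user_2 and user_3 branches would follow the same pattern and be combined with union all. Because every later join is a left join that preserves the rows of t1, and user_id comes from the driving table, filtering early should return the same rows as filtering after the union, while shuffling less data through the later joins.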

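Conclusion:
My guess is that Hive's predicate pushdown mainly targets WHERE conditions: when the logical plan is optimized, the FilterOperator is moved as close to the data source as possible so that data is pruned early.
In the example above, however, with complex SQL and multiple joins, Hive did not recognize an optimization that would have left the query semantics unchanged, so Hive's optimizer is not fully intelligent either.
In day-to-day work, queries should still be optimized by hand as far as possible: based on data volumes and join logic, do the optimization in the code first and keep the structure clear, rather than relying entirely on predicate pushdown and the optimizer.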
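
The WHERE-clause case mentioned in the conclusion can be checked with EXPLAIN. The sketch below assumes that user_id is numeric and uses a made-up literal (10001). The hive.optimize.ppd property controls predicate pushdown and is normally enabled by default.

```sql
-- Make sure predicate pushdown is on (this is the usual default).
set hive.optimize.ppd=true;

-- Inspect the plan: the filter is written outside the subquery.
explain
select
	t.user_id
from
(
	select
		user_id
	from
		user_1
) t
where
	t.user_id = 10001;
```

If pushdown applies, the plan should show a Filter Operator directly beneath the TableScan of user_1 instead of after the subquery. Running EXPLAIN on the full query at the top of this article is how the late join against user_table was observed.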