Iceberg Metadata Cleanup: the metadata.json Files

Published 2024-01-02 10:22:47 · Author: 黑水滴

I. Background

Metadata files accumulate over time, which slows down queries. The table properties below cap the number of metadata.json files that are retained; once the limit is exceeded, older files are cleaned up automatically after each commit.

Each commit writes a new metadata.json file: it is a full version of the table metadata and, among other things, records the table's snapshots.

II. Solution

1. Add these properties when creating the table

'write.metadata.delete-after-commit.enabled'='true',
'write.metadata.previous-versions-max'='5'
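The semantics of the two properties can be illustrated with a small simulation (a hypothetical sketch of the retention rule, not Iceberg's actual code): after every commit the new metadata.json becomes current, and only the newest `write.metadata.previous-versions-max` previous versions are kept, so at most max + 1 metadata.json files exist at a time. The same properties can also be set on an existing table with `ALTER TABLE ... SET TBLPROPERTIES`.

```python
# Hypothetical simulation of write.metadata.delete-after-commit.enabled
# with write.metadata.previous-versions-max = 5. This is the retention
# rule the property implements, not Iceberg source code.

PREVIOUS_VERSIONS_MAX = 5

def commit(metadata_files, version):
    """Append the metadata.json written by this commit, then drop the
    oldest previous versions beyond the retention limit."""
    metadata_files.append(f"v{version}.metadata.json")
    # Keep the current file plus at most PREVIOUS_VERSIONS_MAX previous ones.
    excess = len(metadata_files) - (PREVIOUS_VERSIONS_MAX + 1)
    if excess > 0:
        del metadata_files[:excess]
    return metadata_files

files = []
for v in range(1, 11):   # 10 commits
    commit(files, v)

print(len(files))        # 6 = 5 previous versions + 1 current
print(files[0], files[-1])   # v5.metadata.json v10.metadata.json
```

This matches the test below: the metadata.json count climbs until it reaches 6, then stays there.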

2. CREATE TABLE statements

With automatic metadata cleanup:
CREATE TABLE iceberg_test.autoclean_usql_test (
  `id` BIGINT,
  `name_cn` STRING,
  `description` STRING,
  `status` INT
) USING iceberg
TBLPROPERTIES (
  'format-version'='2',
  'property-version'='2',
  'write.upsert.enabled'='true',
  'write.metadata.delete-after-commit.enabled'='true',
  'write.metadata.previous-versions-max'='5'
);
 
Without automatic metadata cleanup:
CREATE TABLE iceberg_test.notclean_usql_test (
  `id` BIGINT,
  `name_cn` STRING,
  `description` STRING,
  `status` INT
) USING iceberg
TBLPROPERTIES (
  'format-version'='2',
  'property-version'='2',
  'write.upsert.enabled'='true'
);
 

3. Storage path

/user/hive/warehouse/iceberg_test.db/autoclean_usql_test

III. Insert Tests

1. With cleanup enabled: since 5 previous versions are retained, cleanup does not kick in until the sixth insert

Insert rows one at a time and check the number of files generated after each insert:

hdfs dfs -ls /user/hive/warehouse/iceberg_test.db/autoclean_usql_test/metadata | grep 'Found'

insert into iceberg_test.autoclean_usql_test values(1,'name1','desc',1);    Found 4 items    2 metadata.json   1 *m0.avro    1 ***.avro
insert into iceberg_test.autoclean_usql_test values(2,'name2','desc',1);    Found 7 items    3 metadata.json   2 *m0.avro    2 ***.avro
insert into iceberg_test.autoclean_usql_test values(3,'name3','desc',1);    Found 10 items   4 metadata.json   3 *m0.avro    3 ***.avro
insert into iceberg_test.autoclean_usql_test values(4,'name4','desc',1);    Found 13 items   5 metadata.json   4 *m0.avro    4 ***.avro
insert into iceberg_test.autoclean_usql_test values(5,'name5','desc',1);    Found 16 items   6 metadata.json   5 *m0.avro    5 ***.avro
--- cleanup takes effect from here
insert into iceberg_test.autoclean_usql_test values(6,'name6','desc',1);    Found 18 items   6 metadata.json   6 *m0.avro    6 ***.avro
insert into iceberg_test.autoclean_usql_test values(7,'name7','desc',1);    Found 20 items   6 metadata.json   7 *m0.avro    7 ***.avro
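The counts above follow a simple pattern: each insert adds one manifest (*m0.avro) and one manifest list (*.avro) and would add one metadata.json, but with cleanup the metadata.json count is capped at previous-versions-max + 1 = 6 (CREATE TABLE itself wrote the first version, which is why there are already 2 after the first insert). A quick sanity check of the listing totals (my own arithmetic, not from the original test run):

```python
# Expected file count in metadata/ after the k-th insert, with cleanup
# enabled and write.metadata.previous-versions-max = 5 (cap of 6
# metadata.json files: 5 previous versions + the current one).
def files_with_cleanup(k, previous_max=5):
    metadata_json = min(k + 1, previous_max + 1)  # CREATE TABLE wrote v1
    manifests = k        # one *m0.avro per insert
    manifest_lists = k   # one *.avro manifest list per insert
    return metadata_json + manifests + manifest_lists

print([files_with_cleanup(k) for k in range(1, 8)])
# matches the listing above: [4, 7, 10, 13, 16, 18, 20]
```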

2. Without cleanup: every insert adds three files, and the count grows without bound

hdfs dfs -ls /user/hive/warehouse/iceberg_test.db/notclean_usql_test/metadata

insert into iceberg_test.notclean_usql_test values(1,'name1','desc',1); Found 4 items    2 metadata.json   1 *m0.avro    1 ***.avro
insert into iceberg_test.notclean_usql_test values(2,'name2','desc',1); Found 7 items    3 metadata.json   2 *m0.avro    2 ***.avro
insert into iceberg_test.notclean_usql_test values(3,'name3','desc',1); Found 10 items   4 metadata.json   3 *m0.avro    3 ***.avro
insert into iceberg_test.notclean_usql_test values(4,'name4','desc',1); Found 13 items   5 metadata.json   4 *m0.avro    4 ***.avro
insert into iceberg_test.notclean_usql_test values(5,'name5','desc',1); Found 16 items   6 metadata.json   5 *m0.avro    5 ***.avro
insert into iceberg_test.notclean_usql_test values(6,'name6','desc',1); Found 19 items   7 metadata.json   6 *m0.avro    6 ***.avro
insert into iceberg_test.notclean_usql_test values(7,'name7','desc',1); Found 22 items   8 metadata.json   7 *m0.avro    7 ***.avro
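Without cleanup, each insert adds three files (one metadata.json, one manifest, one manifest list), so after k inserts the metadata directory holds 3k + 1 files, counting the metadata.json written at table creation. Checking against the listing (again my own arithmetic):

```python
# Without automatic cleanup, file count in metadata/ after the k-th
# insert: 1 metadata.json from CREATE TABLE, plus 3 files per insert.
def files_without_cleanup(k):
    return 1 + 3 * k

print([files_without_cleanup(k) for k in range(1, 8)])
# matches the listing above: [4, 7, 10, 13, 16, 19, 22]
```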

IV. References

1. Practicing the Iceberg data lake: metadata compaction

https://blog.csdn.net/spark_dev/article/details/122876819