Integrating and Using the Iceberg Data Lake with an Existing Hive Data Warehouse

Published: 2024-01-02 09:47:21  Author: 黑水滴

I. Cluster Configuration

1. Versions Used

Technology    Version
iceberg       1.3.1
flink         1.16.1
spark         3.2.1
hive          2.3.7
dlc-presto    TBD

2. Cluster Configuration Changes

(1) To query through Hive, modify hive-site.xml on every HiveServer2 node, add the jar (see step (2)), add the configuration below, and then restart the service:

<property>
    <name>iceberg.engine.hive.enabled</name>
    <value>true</value>
</property>
 

(2) Place the Iceberg jar in Hive's auxlib directory so that ADD JAR does not have to be run in every session; jars placed in auxlib take effect without restarting Hive.

--1.3.1 version: copy the jar, then fix its permissions

cp iceberg-hive-runtime-1.3.1.jar /usr/local/service/hive/auxlib

chmod 755 /usr/local/service/hive/auxlib/iceberg-hive-runtime-1.3.1.jar

--0.13.1 version

cp iceberg-hive-runtime-0.13.1.jar /usr/local/service/hive/auxlib

II. Client Connections

1. Connecting via Spark (recommended)

(1) Log in to the glink01 server through the bastion host.

(2) Switch to the root user, which has permission to jump to the target servers: sudo su - root

        Test environment: ssh 10.224.194.147

        Staging environment (spark3.0.1): ssh 10.224.196.134

        Production environment: log in directly through the bastion host to 10.224.192.138

(3) After logging in, switch to the hadoop user:

        sudo su - hadoop

(4) Log in with spark-sql (be sure to quit with exit, otherwise the resources are not released).

IcebergSparkSessionExtensions is Iceberg's extension for Spark; it adds stored procedures, ALTER TABLE support, and similar features (a compaction example follows the commands below).
Test environment, new version (the parameters are already in the config file):
spark-sql --master local[*]
 
Test environment, local-mode login (version 1.3.1):
spark-sql --master local[*] \
--jars /usr/local/service/iceberg/tx-iceberg-spark-runtime-3.2_2.12-1.3.1.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.iceberg_catalog.type=hive
 
Single-line equivalent (note that this variant names the catalog spark_catalog, while the multi-line command above names it iceberg_catalog; pick one name and use it consistently in your SQL):
spark-sql --master local[*] --jars /usr/local/service/iceberg/tx-iceberg-spark-runtime-3.2_2.12-1.3.1.jar --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog --conf spark.sql.catalog.spark_catalog.type=hive
Test environment, local-mode login (old 0.13.1 version):
spark-sql --master local[*] --jars /usr/local/service/iceberg/iceberg-spark-runtime-3.2_2.12-0.13.1.jar --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog --conf spark.sql.catalog.spark_catalog.type=hive
Production environment, submitting to YARN (version 1.3.1; !!! be sure to quit with exit, otherwise the resources are not released):
spark-sql --master yarn \
--deploy-mode client \
--queue sailing \
--name wangshidatest \
--driver-memory 20G \
--num-executors 10 \
--executor-cores 1 \
--executor-memory 16G \
--jars /usr/local/service/iceberg/tx-iceberg-spark-runtime-3.2_2.12-1.3.1.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.iceberg_catalog.type=hive \
--conf spark.dynamicAllocation.maxExecutors=15
 
Additional parameter: when compacting small files, add the driver-memory parameter for large tables only (see the compaction sketch below):

--driver-memory 20G
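
Small-file compaction itself runs through Iceberg's rewrite_data_files stored procedure, one of the procedures enabled by IcebergSparkSessionExtensions. A minimal sketch, assuming a session started with the single-line login command above (catalog name spark_catalog) and the iceberg_test.test3 table from section III; the table name is illustrative:

CALL spark_catalog.system.rewrite_data_files(table => 'iceberg_test.test3');

On large tables this rewrite is the case where the --driver-memory parameter above matters.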

If timestamp-type queries fail in production, add this parameter:

set spark.sql.storeAssignmentPolicy=ANSI;

To kill a job running on YARN:

yarn application -kill <application_id>

2. Connecting via Hive (not recommended)

(1)-(3) The login steps are the same as for the Spark connection above: go through the bastion host to glink01, switch to root to reach the target server, then switch to the hadoop user.

(4) Option 1: log in with the Hive CLI (recommended).

        Run in the console: hive

        Add the Iceberg jar (can be skipped if the cluster's auxlib directory is configured): ADD JAR /usr/local/service/iceberg/iceberg-hive-runtime-1.3.1.jar

(5) Option 2: log in with beeline (not recommended; remember to quit with exit).

        Run beeline in the console.

        Once in the client, connect to HiveServer2: !connect jdbc:hive2://10.224.195.139:2181,10.224.195.50:2181,10.224.196.83:2181/;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=hiveserver2

        Enter the account and password: root|hadoop; see the internal group or the cluster configuration for the password.

        Query data: queries only work after adding the jar as in option 1 (a smoke test follows below).

        Quit when done: !quit
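
Whichever Hive login option is used, a quick smoke test confirms that Hive can read Iceberg tables. A minimal check, assuming the jar has been added and the iceberg_test.test3 table from section III exists:

select count(*) from iceberg_test.test3;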

III. Creating Databases and Tables (administrator operations)

1. Create the database (not needed if it already exists)

Using OFS storage (the common case):

create database iceberg_test location "/user/hive/warehouse/iceberg_test.db";

Using EMR core-node storage (!! for use in the test environment only):

create database iceberg_test;

2. Dropping a table (only with PURGE are the CHDFS data files deleted; without it, only the Hive metadata is removed):

use iceberg_test;

drop table iceberg_test.test3 PURGE;

3. Creating tables (switch into the database first):

use iceberg_test;
(1) Recommended create-table SQL (specifies format v2; only v2 supports row-level updates, see the DML sketch after this list):
CREATE TABLE iceberg_test.test3(
    id bigint COMMENT 'unique id',
    `empcode` STRING
)USING iceberg
TBLPROPERTIES('format-version'='2','table.drop.base-path.enabled'='true');
 
(2) Minimal create-table SQL; the default format version is v1:
CREATE TABLE spark_catalog.iceberg_test.test3 (
`id` int ,
`empcode` STRING
) USING iceberg;
 
(3) Changing a table's format version:
alter table spark_catalog.iceberg_test.test3 SET TBLPROPERTIES('format-version'='2');
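
Format v2 is what unlocks row-level changes. A sketch of the DML that becomes available once a table is on v2, assuming the spark-sql session from section II with IcebergSparkSessionExtensions loaded; the values are illustrative:

UPDATE spark_catalog.iceberg_test.test3 SET empcode = 'code1-v2' WHERE id = 1;

DELETE FROM spark_catalog.iceberg_test.test3 WHERE id = 1;

MERGE INTO spark_catalog.iceberg_test.test3 t
USING (SELECT 1 AS id, 'code1' AS empcode) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.empcode = s.empcode
WHEN NOT MATCHED THEN INSERT *;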

4. Insert test data:
insert into table spark_catalog.iceberg_test.test3 values (1,"code1");

IV. Querying Data

#Parameters used during testing, kept here commented out:
#set hive.execution.engine=spark;
#set engine.hive.enabled=true;
#set iceberg.engine.hive.enabled=true;
#set iceberg.mr.catalog=hive;
#set spark.driver.extraClassPath=/usr/local/service/hive/lib/antlr-runtime-3.5.2.jar;
#set spark.executor.extraClassPath=/usr/local/service/hive/lib/antlr-runtime-3.5.2.jar;
#set hive.vectorized.execution.enabled=false;
#set tez.mrreader.config.update.properties=hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids;
use iceberg_test;
select * from iceberg_test.test3;
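
Beyond plain SELECTs, every committed snapshot stays queryable. A hedged sketch: the metadata-table query below works on this Spark 3.2.1 setup, while the VERSION AS OF syntax requires Spark 3.3 or later; the snapshot id is the one from the sample metadata in section V and is only illustrative:

select * from spark_catalog.iceberg_test.test3.snapshots;

select * from spark_catalog.iceberg_test.test3 VERSION AS OF 2379142480402238200;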
 

V. File Analysis

1. After creating a table, you can inspect the underlying storage location:

hdfs dfs -ls /usr/hive/warehouse/iceberg_test.db/test3

The directory contains the following:

drwxr-xr-x - hadoop supergroup 0 2023-10-20 14:51 /usr/hive/warehouse/iceberg_test.db/test3/data
drwxr-xr-x - hadoop supergroup 0 2023-10-20 14:51 /usr/hive/warehouse/iceberg_test.db/test3/metadata

2. Analyzing the metadata files:

hdfs dfs -ls /usr/hive/warehouse/iceberg_test.db/test3/metadata

The directory contains the following:

-rw-r--r-- 2 hadoop supergroup 1202 2023-10-20 14:45 /usr/hive/warehouse/iceberg_test.db/test3/metadata/00000-3e5e7244-e7b0-43f4-a5e2-09f166cb0427.metadata.json
-rw-r--r-- 2 hadoop supergroup 2198 2023-10-20 14:51 /usr/hive/warehouse/iceberg_test.db/test3/metadata/00001-063d2d17-58df-41c4-af5f-899add03281c.metadata.json
-rw-r--r-- 2 hadoop supergroup 5799 2023-10-20 14:51 /usr/hive/warehouse/iceberg_test.db/test3/metadata/5183f5e9-7516-4ee8-a454-9c6263b7e624-m0.avro
-rw-r--r-- 2 hadoop supergroup 3776 2023-10-20 14:51 /usr/hive/warehouse/iceberg_test.db/test3/metadata/snap-2379142480402238200-1-5183f5e9-7516-4ee8-a454-9c6263b7e624.avro

3. Viewing a JSON metadata file:

hdfs dfs -cat /usr/hive/warehouse/iceberg_test.db/test3/metadata/00001-063d2d17-58df-41c4-af5f-899add03281c.metadata.json

{
  "format-version" : 1,
  "table-uuid" : "26a04890-7a11-4fec-a10f-1344a0e47843",
  "location" : "hdfs://HDFS8000006/usr/hive/warehouse/iceberg_test.db/test3",
  "last-updated-ms" : 1697784686845,
  "last-column-id" : 2,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 2,
      "name" : "empcode",
      "required" : false,
      "type" : "string"
    } ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 2,
      "name" : "empcode",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "partition-spec" : [ ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "owner" : "hadoop"
  },
  "current-snapshot-id" : 2379142480402238200,
  "snapshots" : [ {
    "snapshot-id" : 2379142480402238200,
    "timestamp-ms" : 1697784686845,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1697784671445",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "656",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "656",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "hdfs://HDFS8000006/usr/hive/warehouse/iceberg_test.db/test3/metadata/snap-2379142480402238200-1-5183f5e9-7516-4ee8-a454-9c6263b7e624.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1697784686845,
    "snapshot-id" : 2379142480402238200
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1697784337831,
    "metadata-file" : "hdfs://HDFS8000006/usr/hive/warehouse/iceberg_test.db/test3/metadata/00000-3e5e7244-e7b0-43f4-a5e2-09f166cb0427.metadata.json"
  } ]
}

 

 

4. Downloading the HDFS files to the local machine:

hadoop fs -get /usr/hive/warehouse/iceberg_test.db/test3/metadata/5183f5e9-7516-4ee8-a454-9c6263b7e624-m0.avro ~/wangshida/iceberg

hadoop fs -get /usr/hive/warehouse/iceberg_test.db/test3/metadata/snap-2379142480402238200-1-5183f5e9-7516-4ee8-a454-9c6263b7e624.avro ~/wangshida/iceberg

hadoop fs -get /usr/hive/warehouse/iceberg_test.db/test3/data/00000-0-d5a9981c-8476-4505-aa83-c9983ba467ef-00001.parquet ~/wangshida/iceberg

 

5. Viewing the avro files requires a separate jar, so download the files locally first:

java -jar ./avro-tools-1.8.1.jar tojson ./iceberg/snap-2379142480402238200-1-5183f5e9-7516-4ee8-a454-9c6263b7e624.avro | less

{
    "manifest_path": "hdfs://HDFS8000006/usr/hive/warehouse/iceberg_test.db/test3/metadata/5183f5e9-7516-4ee8-a454-9c6263b7e624-m0.avro",
    "manifest_length": 5799,
    "partition_spec_id": 0,
    "added_snapshot_id": {
        "long": 2379142480402238200
    },
    "added_data_files_count": {
        "int": 1
    },
    "existing_data_files_count": {
        "int": 0
    },
    "deleted_data_files_count": {
        "int": 0
    },
    "partitions": {
        "array": []
    },
    "added_rows_count": {
        "long": 1
    },
    "existing_rows_count": {
        "long": 0
    },
    "deleted_rows_count": {
        "long": 0
    }
}

java -jar ./avro-tools-1.8.1.jar tojson ./iceberg/5183f5e9-7516-4ee8-a454-9c6263b7e624-m0.avro | less

{
  "status": 1,
  "snapshot_id": {
    "long": 2379142480402238000
  },
  "data_file": {
    "file_path": "hdfs://HDFS8000006/usr/hive/warehouse/iceberg_test.db/test3/data/00000-0-d5a9981c-8476-4505-aa83-c9983ba467ef-00001.parquet",
    "file_format": "PARQUET",
    "partition": {},
    "record_count": 1,
    "file_size_in_bytes": 656,
    "block_size_in_bytes": 67108864,
    "column_sizes": {
      "array": [
        {
          "key": 1,
          "value": 47
        },
        {
          "key": 2,
          "value": 52
        }
      ]
    },
    "value_counts": {
      "array": [
        {
          "key": 1,
          "value": 1
        },
        {
          "key": 2,
          "value": 1
        }
      ]
    },
    "null_value_counts": {
      "array": [
        {
          "key": 1,
          "value": 0
        },
        {
          "key": 2,
          "value": 0
        }
      ]
    },
    "nan_value_counts": {
      "array": []
    },
    "lower_bounds": {
      "array": [
        {
          "key": 1,
          "value": "\u0001\u0000\u0000\u0000"
        },
        {
          "key": 2,
          "value": "code1"
        }
      ]
    },
    "upper_bounds": {
      "array": [
        {
          "key": 1,
          "value": "\u0001\u0000\u0000\u0000"
        },
        {
          "key": 2,
          "value": "code1"
        }
      ]
    },
    "key_metadata": null,
    "split_offsets": {
      "array": [
        4
      ]
    },
    "sort_order_id": {
      "int": 0
    }
  }
}
 

6. Viewing the parquet data files (as the hadoop user). Version compatibility and environment issues can make this step time-consuming.

Download the tool jar: https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.11.0/

(1) Viewing data

View the first few rows:

hadoop jar ./parquet-tools-1.11.0.jar head -n 10 /usr/hive/warehouse/iceberg_test.db/test3/data/00000-0-d5a9981c-8476-4505-aa83-c9983ba467ef-00001.parquet

View all rows:

hadoop jar ./parquet-tools-1.11.0.jar cat /usr/hive/warehouse/iceberg_test.db/test3/data/00000-0-d5a9981c-8476-4505-aa83-c9983ba467ef-00001.parquet

(2) Viewing the schema:

hadoop jar ./parquet-tools-1.11.0.jar schema /usr/hive/warehouse/iceberg_test.db/test3/data/00000-0-d5a9981c-8476-4505-aa83-c9983ba467ef-00001.parquet

(3) Viewing the meta information:

hadoop jar ./parquet-tools-1.11.0.jar meta /usr/hive/warehouse/iceberg_test.db/test3/data/00000-0-d5a9981c-8476-4505-aa83-c9983ba467ef-00001.parquet

 

7. Viewing metadata through SQL

(1) View the table's snapshot information:
select * from spark_catalog.data_lake_ods.dws_service_subclazz_lesson_user_learn_stat_rt_v2.snapshots;

Sample row (columns of the snapshots table):

committed_at: 2023-12-04 16:59:54.979
snapshot_id: 4270722947875310895
parent_id: 4387207514912842768
operation: overwrite
manifest_list: snap-4270722947875310895-1-886c8c53-4409-47c6-ac93-a107280ffb59.avro
summary: {"added-data-files":"16","added-delete-files":"16","added-equality-delete-files":"16","added-equality-deletes":"22754348","added-files-size":"871385292","added-records":"22754348","changed-partition-count":"1","flink.job-id":"000000001de734af0000000000000000","flink.max-committed-checkpoint-id":"2","flink.operator-id":"e883208d19e3c34f8aaf2a3168a63337","total-data-files":"32","total-delete-files":"32","total-equality-deletes":"45471042","total-files-size":"1738325852","total-position-deletes":"0","total-records":"45471042"}

(2) View the table's history:

select * from spark_catalog.data_lake_ods.test.history;

Sample row (columns of the history table):

made_current_at: 2023-12-04 16:59:54.979
snapshot_id: 4270722947875310895
parent_id: 4387207514912842768
is_current_ancestor: true

(3) View the table's data-file list:

select * from spark_catalog.data_lake_ods.test.files;

Sample row (the garbled -?= sequences are binary lower/upper bound values that do not render as text):

2 00002-0-09e082bc-4d7e-4f52-94c9-67e672e5c135-00002.parquet PARQUET 0 1419808 19356556 {1:7902810,2:2824871,4:8620639} {1:1419808,2:1419808,4:1419808} {1:0,2:0,4:0} {} {1:,2:,4:-?=-?=-?=b}} {1:-?=-?=T-?=G-?=,2:X-?=-?=-?=W,4:-?=c-?=-?=v} NULL [4] [1,2,4] 0 {"clazz_lesson_number":{"column_size":8620639,"value_count":1419808,"null_value_count":0,"nan_value_count":null,"lower_bound":1263763655281666,"upper_bound":393727528547897347},"clazz_number":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"companion_learn_duration":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"companion_lesson_length":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"is_black_user":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"is_combine_valid_learn_user":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"is_live_attend_user":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"is_should_attend_user":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"is_valid_learn":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"is_valid_live_learn":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"learn_combine_duration":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"learn_duration":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"lesson_length":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"lesson_live_length":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"lesson_live_property":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"lesson_standard_length":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"lesson_video_length":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"lesson_video_property":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"live_learn_duration":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"subclazz_number":{"column_size":2824871,"value_count":1419808,"null_value_count":0,"nan_value_count":null,"lower_bound":0,"upper_bound":24696091250786816},"user_number":{"column_size":7902810,"value_count":1419808,"null_value_count":0,"nan_value_count":null,"lower_bound":0,"upper_bound":247697329422028},"video_learn_duration":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null}}

(4) View the table's manifests information:

select * from spark_catalog.data_lake_ods.dws_service_subclazz_lesson_user_learn_stat_rt_v2.manifests ;

Sample output:

1  9462 0 4270722947875310895 0 0 0 16 0 0 []
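
The snapshot, history, and manifest listings above grow with every commit, and they are also the inputs to Iceberg's maintenance procedures. A sketch of two common follow-ups, assuming the same spark-sql session with extensions as in section II; the table name, snapshot id, and timestamp are illustrative:

-- roll the table back to a snapshot id taken from the history table
CALL spark_catalog.system.rollback_to_snapshot('data_lake_ods.test', 4387207514912842768);

-- expire old snapshots so unreferenced data and metadata files can be removed
CALL spark_catalog.system.expire_snapshots(table => 'data_lake_ods.test', older_than => TIMESTAMP '2023-12-01 00:00:00');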