Hudi Study Notes (2)

Published 2023-05-06 17:08:05, author: -见

https://hudi.apache.org/docs/configurations

Hudi configuration categories

  • Spark Datasource Configs

Configs for the Spark Datasource.

  • Flink Sql Configs

Configs for the Flink SQL source/sink connectors, such as index.type, write.tasks, write.operation, clean.policy, clean.retain_commits, clean.retain_hours, compaction.max_memory, hive_sync.db, hive_sync.table, hive_sync.metastore.uris, write.retry.times, and write.task.max.size.

  • Write Client Configs

Configs that control Hudi's RDD-based HoodieWriteClient API.

  • Metastore and Catalog Sync Configs

Configs for syncing table metadata to external metastores and catalogs.

  • Metrics Configs

Configs for metrics reporting.

  • Record Payload Config

Low-level customization configs, such as the payload setting hoodie.compaction.payload.class.

  • Kafka Connect Configs

Configs for writing Hudi tables through the Kafka Connect sink connector.

  • Amazon Web Services Configs

Configs for Amazon Web Services.

Spark Datasource Configs

  • Read options
Config Required Default Description
as.of.instant Y N/A Added in 0.9.0. The instant to query for time travel, in one of two formats: yyyyMMddHHmmss or yyyy-MM-dd HH:mm:ss. If unset, the latest snapshot is queried.
hoodie.file.index.enable N true
hoodie.schema.on.read.enable N false
hoodie.datasource.streaming.startOffset N earliest
hoodie.datasource.write.precombine.field N ts
hoodie.datasource.read.begin.instanttime Y N/A
hoodie.datasource.read.end.instanttime Y N/A
hoodie.datasource.read.paths Y N/A
hoodie.datasource.merge.type N payload_combine
hoodie.datasource.query.incremental.format N latest_state
hoodie.datasource.query.type N snapshot
hoodie.datasource.read.extract.partition.values.from.path N false
hoodie.datasource.read.file.index.listing.mode N lazy
hoodie.datasource.read.file.index.listing.partition-path-prefix.analysis.enabled N true
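
The read options above are plain string key/value pairs, so they can be assembled before being handed to Spark. The sketch below is one way to build them for a time-travel read and for an incremental read. The helper names are hypothetical, and the format check only covers the two as.of.instant layouts listed in the table; `incremental` is assumed to be the hoodie.datasource.query.type value for incremental queries.

```python
from datetime import datetime
from typing import Dict, Optional

# The two as.of.instant layouts listed in the read options table.
_INSTANT_FORMATS = ("%Y%m%d%H%M%S", "%Y-%m-%d %H:%M:%S")


def time_travel_options(instant: str) -> Dict[str, str]:
    """Build options for a time-travel read as of `instant` (hypothetical helper)."""
    for fmt in _INSTANT_FORMATS:
        try:
            datetime.strptime(instant, fmt)
        except ValueError:
            continue
        return {"as.of.instant": instant}
    raise ValueError("unsupported as.of.instant format: %r" % instant)


def incremental_options(begin: str, end: Optional[str] = None) -> Dict[str, str]:
    """Build options for an incremental read (hypothetical helper)."""
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin,
    }
    # The end instant is only set when the caller wants an upper bound.
    if end is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end
    return opts


# With a SparkSession available, the options would be applied roughly like:
#   spark.read.format("hudi").options(**time_travel_options("20230506170805")).load(base_path)
```

Keeping the options in a dict makes it easy to log or unit test them without starting a Spark session.
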
  • Write options
Config Required Default Description
hoodie.datasource.hive_sync.mode Y N/A
hoodie.datasource.write.partitionpath.field Y N/A
hoodie.datasource.write.precombine.field N ts
hoodie.datasource.write.recordkey.field Y N/A
hoodie.datasource.write.table.type N COPY_ON_WRITE
hoodie.sql.insert.mode N upsert
hoodie.sql.bulk.insert.enable N false
hoodie.datasource.write.table.name Y N/A
hoodie.datasource.write.operation N upsert
hoodie.datasource.write.payload.class N org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.datasource.write.partitionpath.urlencode N false
hoodie.datasource.hive_sync.partition_fields N N/A
hoodie.datasource.hive_sync.auto_create_database N true Automatically create the database if it does not exist
hoodie.datasource.hive_sync.database N default
hoodie.datasource.hive_sync.table N unknown
hoodie.datasource.hive_sync.username N hive
hoodie.datasource.hive_sync.password N hive
hoodie.datasource.hive_sync.enable N false
hoodie.datasource.hive_sync.ignore_exceptions N false
hoodie.datasource.hive_sync.use_jdbc N true
hoodie.datasource.hive_sync.jdbcurl N jdbc:hive2://localhost:10000 Hive JDBC url (HiveServer2)
hoodie.datasource.hive_sync.metastore.uris N thrift://localhost:9083 Hive metastore url
hoodie.datasource.hive_sync.base_file_format N PARQUET
hoodie.datasource.hive_sync.support_timestamp N false
hoodie.datasource.meta.sync.enable N false
hoodie.clustering.inline N false
hoodie.datasource.write.partitions.to.delete Y N/A Comma-separated list of partitions to delete; the * wildcard is supported
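
As a rough illustration of how the write options fit together, the sketch below builds the option map for an upsert and optionally layers Hive sync on top. The helper names are hypothetical. Setting hoodie.table.name alongside hoodie.datasource.write.table.name is an assumption taken from common quickstart usage, and `hms` is assumed to be an accepted hoodie.datasource.hive_sync.mode value.

```python
from typing import Dict


def upsert_options(table_name: str, record_key: str, partition_path: str,
                   precombine: str = "ts",
                   table_type: str = "COPY_ON_WRITE") -> Dict[str, str]:
    """Build Spark datasource options for an upsert (hypothetical helper)."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError("unknown table type: %r" % table_type)
    return {
        # Assumption: both table-name keys are set to the same value.
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
    }


def with_hive_sync(opts: Dict[str, str], database: str, table: str,
                   metastore_uris: str) -> Dict[str, str]:
    """Return a copy of `opts` with Hive sync through the metastore enabled."""
    synced = dict(opts)
    synced.update({
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.metastore.uris": metastore_uris,
    })
    return synced


# With a DataFrame `df`, the options would be applied roughly like:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```
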
  • PreCommit Validator configs
Config Required Default Description
hoodie.precommit.validators N
hoodie.precommit.validators.equality.sql.queries N
hoodie.precommit.validators.inequality.sql.queries N
hoodie.precommit.validators.single.value.sql.queries N
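
The validator keys above take a validator class list plus a query string. The sketch below assembles a single-value check. It rests on assumptions worth verifying against the Hudi configuration reference: that the validator class is named org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator, that queries are joined with `;`, that the expected result follows `#`, and that each query refers to the table as `<TABLE_NAME>`. The helper name is hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Assumption: class name as recalled from the Hudi docs; verify before use.
SINGLE_RESULT_VALIDATOR = (
    "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator"
)


def single_value_validator_options(
        checks: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Build pre-commit validator options from (sql, expected_result) pairs."""
    parts = []
    for sql, expected in checks:
        if "<TABLE_NAME>" not in sql:
            raise ValueError("query must reference <TABLE_NAME>: %r" % sql)
        if ";" in sql or "#" in sql:
            # These characters are used as separators in the option value.
            raise ValueError("query must not contain ';' or '#': %r" % sql)
        parts.append("%s#%s" % (sql, expected))
    return {
        "hoodie.precommit.validators": SINGLE_RESULT_VALIDATOR,
        "hoodie.precommit.validators.single.value.sql.queries": ";".join(parts),
    }
```

For example, passing `[("select count(*) from <TABLE_NAME> where uuid is null", "0")]` yields a check that the commit introduces no null keys, under the assumptions stated above.
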

Flink Sql Configs

  • Flink Options
Config Required Default Description
path Y N/A Hudi表的 base path,如果不存在会创建,否则应是一个已初始化成功的 hudi 表
read.end-commit Y N/A
read.start-commit Y N/A
read.tasks Y N/A
write.tasks Y N/A
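write.partition.format Y N/A Partition path format; only takes effect when write.datetime.partitioning is true. There are two defaults: 1. yyyyMMddHH, when the partition field type is TIMESTAMP(3) WITHOUT TIME ZONE, LONG, FLOAT, DOUBLE, or DECIMAL; 2. yyyyMMdd, when the partition field type is DATE or INT.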
write.bucket_assign.tasks Y N/A
archive.max_commits N 50
archive.min_commits N 40
cdc.enabled N false
changelog.enabled N false
clean.async.enabled N true
clean.policy N KEEP_LATEST_COMMITS Cleaning policy; one of KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS
clean.retain_commits N 30
clean.retain_file_versions N 5
clean.retain_hours N 24
clustering.async.enabled N false
clustering.delta_commits N 4
clustering.plan.partition.filter.mode N NONE One of NONE, RECENT_DAYS, SELECTED_PARTITIONS, DAY_ROLLING
clustering.plan.strategy.class N org.apache.hudi.client.clustering.plan.strategy.FlinkSizeBasedClusteringPlanStrategy
clustering.tasks Y N/A
clustering.schedule.enabled N false
compaction.async.enabled N true
compaction.delta_commits N 5
compaction.delta_seconds N 3600
compaction.max_memory N 100
compaction.schedule.enabled N true
compaction.target_io N 512000
compaction.timeout.seconds N 1200
compaction.trigger.strategy N num_commits One of num_commits, time_elapsed, num_or_time
hive_sync.conf.dir Y N/A
hive_sync.table_properties Y N/A
hive_sync.assume_date_partitioning N false Assume partitions are in yyyy/mm/dd format
hive_sync.auto_create_db N true Automatically create the database if it does not exist
hive_sync.db N default
hive_sync.table N unknown
hive_sync.table.strategy N ALL
hive_sync.enabled N false
hive_sync.file_format N PARQUET
hive_sync.jdbc_url N jdbc:hive2://localhost:10000
hive_sync.metastore.uris N '' Hive Metastore uris
hive_sync.mode N HMS
hive_sync.partition_fields N ''
hive_sync.password N hive
hive_sync.support_timestamp N true
hive_sync.use_jdbc N true
hive_sync.username N hive
hoodie.bucket.index.hash.field N
hoodie.bucket.index.num.buckets N 4
hoodie.datasource.merge.type N payload_combine
hoodie.datasource.query.type N snapshot
hoodie.datasource.write.hive_style_partitioning N false
hoodie.datasource.write.keygenerator.type N SIMPLE
hoodie.datasource.write.partitionpath.field N ''
hoodie.datasource.write.recordkey.field N uuid
hoodie.datasource.write.partitionpath.urlencode N false
hoodie.database.name Y N/A
hoodie.table.name Y N/A
hoodie.datasource.write.keygenerator.class Y N/A
index.bootstrap.enabled N false
index.global.enabled N true
index.partition.regex N *
index.state.ttl N 0.0
index.type N FLINK_STATE
metadata.enabled N false
metadata.compaction.delta_commits N 10
partition.default_name N HIVE_DEFAULT_PARTITION
payload.class N org.apache.hudi.common.model.EventTimeAvroPayload
precombine.field N ts
read.streaming.enabled N false
read.streaming.skip_compaction N false
read.streaming.skip_clustering N false
read.utc-timezone N true
record.merger.impls N org.apache.hudi.common.model.HoodieAvroRecordMerger
record.merger.strategy N eeb8d96f-b1e4-49fd-bbf8-28ac514178e5
table.type N COPY_ON_WRITE Table type; either COPY_ON_WRITE or MERGE_ON_READ
write.batch.size N 256.0
write.commit.ack.timeout N -1
write.ignore.failed N false
write.insert.cluster N false
write.log.max.size N 1024
write.log_block.size N 128
write.merge.max_memory N 100 Unit: MB
write.operation N upsert
write.precombine N false
write.parquet.block.size N 120
write.rate.limit N 0
write.retry.interval.ms N 2000
write.retry.times N 3
write.sort.memory N 128 Unit: MB
write.task.max.size N 1024.0 Unit: MB
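
In Flink SQL these options go into the WITH clause of the table definition. The sketch below renders such a statement from a dict, which can help keep the option set in one place. The function name, table name, columns, and path are all hypothetical; `'connector' = 'hudi'` is the connector identifier assumed here.

```python
from typing import Dict, Iterable, Optional, Sequence


def hudi_flink_ddl(table: str, columns: Iterable[str], path: str,
                   options: Optional[Dict[str, object]] = None,
                   partition_by: Sequence[str] = ()) -> str:
    """Render a Flink SQL CREATE TABLE statement for a Hudi table."""
    with_opts = {"connector": "hudi", "path": path}
    with_opts.update(options or {})
    column_sql = ",\n".join("  " + col for col in columns)
    option_sql = ",\n".join(
        # Single quotes inside values are doubled so the SQL literal stays valid.
        "  '{}' = '{}'".format(key, str(value).replace("'", "''"))
        for key, value in with_opts.items()
    )
    partition_sql = ""
    if partition_by:
        partition_sql = "PARTITIONED BY ({})\n".format(
            ", ".join("`{}`".format(col) for col in partition_by))
    return "CREATE TABLE {} (\n{}\n)\n{}WITH (\n{}\n)".format(
        table, column_sql, partition_sql, option_sql)


# Hypothetical table: names, columns, and path are placeholders.
ddl = hudi_flink_ddl(
    "trips",
    ["uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED", "ts TIMESTAMP(3)", "dt VARCHAR(10)"],
    "hdfs:///tmp/hudi/trips",
    options={
        "table.type": "MERGE_ON_READ",
        "write.operation": "upsert",
        "precombine.field": "ts",
        "compaction.trigger.strategy": "num_commits",
        "compaction.delta_commits": 5,
    },
    partition_by=["dt"],
)
print(ddl)
```
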

Write Client Configs

  • Layout Configs

  • Clean Configs

  • Memory Configurations

  • Archival Configs

  • Metadata Configs

  • Consistency Guard Configurations

  • FileSystem Guard Configurations

  • Write Configurations

  • Metastore Configs

  • Key Generator Options

  • Storage Configs

  • Compaction Configs

  • File System View Storage Configurations

  • Clustering Configs

  • Common Configurations

  • Bootstrap Configs

  • Commit Callback Configs

  • Lock Configs

  • Index Configs

Metastore and Catalog Sync Configs

  • Common Metadata Sync Configs

  • Global Hive Sync Configs

  • DataHub Sync Configs

  • BigQuery Sync Configs

  • Hive Sync Configs

Metrics Configs

  • Metrics Configurations for Datadog reporter

  • Metrics Configurations for Amazon CloudWatch

  • Metrics Configurations

  • Metrics Configurations for Jmx

  • Metrics Configurations for Prometheus

  • Metrics Configurations for Graphite

Record Payload Config

  • Payload Configurations
Config Required Default Description
hoodie.compaction.payload.class N org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.payload.event.time.field N ts
hoodie.payload.ordering.field N ts
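
To make the role of the ordering field concrete, here is a deliberately simplified model of "latest wins" merging: among records with the same key, the one with the largest ordering value is kept. This is not Hudi's payload implementation. The function name is hypothetical, and letting ties go to the later record is an assumption of this sketch.

```python
from typing import Dict, Iterable, List


def keep_latest(records: Iterable[Dict[str, object]],
                key_field: str = "uuid",
                ordering_field: str = "ts") -> List[Dict[str, object]]:
    """Keep, per key, the record with the largest ordering value."""
    latest: Dict[object, Dict[str, object]] = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        # Assumption: on equal ordering values the later record replaces the earlier one.
        if current is None or record[ordering_field] >= current[ordering_field]:
            latest[key] = record
    return list(latest.values())


batch = [
    {"uuid": "a", "ts": 1, "fare": 10.0},
    {"uuid": "a", "ts": 3, "fare": 12.5},
    {"uuid": "b", "ts": 2, "fare": 7.0},
    {"uuid": "a", "ts": 2, "fare": 11.0},
]
merged = keep_latest(batch)
```

Here `merged` holds one record for key `a` (the one with `ts` 3) and one for key `b`.
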

Kafka Connect Configs

  • Kafka Sink Connect Configurations
Config Required Default Description
hadoop.conf.dir Y N/A
hadoop.home Y N/A
bootstrap.servers N localhost:9092 The bootstrap.servers of the Kafka cluster
hoodie.kafka.control.topic N hudi-control-topic
hoodie.meta.sync.classes N org.apache.hudi.hive.HiveSyncTool
hoodie.meta.sync.enable N false
hoodie.schemaprovider.class N org.apache.hudi.schema.FilebasedSchemaProvider
hoodie.kafka.coordinator.write.timeout.secs N 300
hoodie.kafka.compaction.async.enable N true
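
Kafka Connect connectors are usually registered by posting a JSON document with a `name` and a `config` map to the Connect REST API. The sketch below builds such a document for the Hudi sink. The connector class name org.apache.hudi.connect.HoodieSinkConnector and the hoodie.table.name / hoodie.base.path keys are assumptions not taken from the table above, and the function name and argument values are hypothetical.

```python
import json
from typing import Dict


def hudi_sink_connector(name: str, topics: str, table_name: str, base_path: str,
                        bootstrap_servers: str = "localhost:9092") -> Dict[str, object]:
    """Build a Kafka Connect registration payload for a Hudi sink."""
    return {
        "name": name,
        "config": {
            # Assumption: connector class as recalled from the Hudi Kafka Connect module.
            "connector.class": "org.apache.hudi.connect.HoodieSinkConnector",
            "tasks.max": "1",
            "topics": topics,
            "bootstrap.servers": bootstrap_servers,
            # Assumption: table name and base path use the core Hudi keys.
            "hoodie.table.name": table_name,
            "hoodie.base.path": base_path,
            "hoodie.kafka.control.topic": "hudi-control-topic",
            "hoodie.kafka.coordinator.write.timeout.secs": "300",
            "hoodie.kafka.compaction.async.enable": "true",
            "hoodie.meta.sync.enable": "false",
        },
    }


payload = json.dumps(
    hudi_sink_connector("hudi-sink", "trips", "trips", "file:///tmp/hudi/trips"),
    indent=2,
)
```

Kafka Connect expects config values as strings, which is why numbers and booleans are quoted here.
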

Amazon Web Services Configs

Config Required Default Description
hoodie.aws.access.key Y N/A AWS access key id
hoodie.aws.secret.key Y N/A AWS secret key
hoodie.aws.session.token N N/A AWS session token
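
Since these values are credentials, one reasonable pattern is to read them from the environment rather than hard-coding them. The sketch below maps the standard AWS environment variables onto the three Hudi keys. The function name is hypothetical, and treating the access key and secret key as an all-or-nothing pair is a design choice of this sketch.

```python
import os
from typing import Dict, Mapping, Optional

# Hudi config key -> standard AWS environment variable.
_AWS_ENV = {
    "hoodie.aws.access.key": "AWS_ACCESS_KEY_ID",
    "hoodie.aws.secret.key": "AWS_SECRET_ACCESS_KEY",
    "hoodie.aws.session.token": "AWS_SESSION_TOKEN",
}


def aws_options(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the hoodie.aws.* options from environment variables."""
    env = os.environ if env is None else env
    opts = {key: env[var] for key, var in _AWS_ENV.items() if env.get(var)}
    has_access = "hoodie.aws.access.key" in opts
    has_secret = "hoodie.aws.secret.key" in opts
    # An access key without its secret (or the reverse) cannot authenticate.
    if has_access != has_secret:
        raise ValueError("access key and secret key must be set together")
    return opts
```

Passing an explicit mapping instead of relying on `os.environ` keeps the function easy to test.
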