Hudi Index Types

A Hudi index maps a hoodiekey to a file group (File Group) or file ID (File ID). A hoodiekey is made up of two parts: the recordkey and the partitionpath.
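To make the mapping concrete, here is a minimal sketch of a key-to-file-ID lookup backed by a hash map. The `Key` class and `KeyToFileGroupSketch` are hypothetical stand-ins written for this illustration; they are not Hudi's own classes, and the file IDs and partition paths are made-up values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

public class KeyToFileGroupSketch {
    // Hypothetical stand-in for a hoodiekey: record key plus partition path.
    public static final class Key {
        final String recordKey;
        final String partitionPath;

        public Key(String recordKey, String partitionPath) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return recordKey.equals(other.recordKey)
                && partitionPath.equals(other.partitionPath);
        }

        @Override
        public int hashCode() {
            return Objects.hash(recordKey, partitionPath);
        }
    }

    // The index itself: key -> ID of the file group that holds the record.
    private final Map<Key, String> keyToFileId = new HashMap<>();

    public void put(Key key, String fileId) {
        keyToFileId.put(key, fileId);
    }

    // Empty result means the key is new, so the record would be an insert.
    public Optional<String> lookup(Key key) {
        return Optional.ofNullable(keyToFileId.get(key));
    }

    public static void main(String[] args) {
        KeyToFileGroupSketch index = new KeyToFileGroupSketch();
        index.put(new Key("id-1", "2024/01/01"), "fg-0001");
        System.out.println(index.lookup(new Key("id-1", "2024/01/01")).isPresent());
        System.out.println(index.lookup(new Key("id-1", "2024/01/02")).isPresent());
    }
}
```

Because the partition path is part of the key here, the same record key in a different partition is treated as a different record. That is the per-partition uniqueness behavior; a global index would look up by record key alone.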

The index types are as follows:

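Type	Description
SIMPLE	Simple index; keys are unique within a partition. It joins the incoming records of update and delete operations against the existing data, so performance is relatively poor
GLOBAL_SIMPLE	Simple index; keys are unique globally
BUCKET	Bucket index; the bucket index variant is set through hoodie.index.bucket.engine
FLINK_STATE	Flink-only index; memory requirements grow in proportion to the number of index entries
INMEMORY	Uses a hashmap as the index
BLOOM	Bloom index; keys are unique only within a single partition
GLOBAL_BLOOM	Bloom index; keys are unique across all partitions
HBASE	Stores the index in an external HBase; a globally unique index
Custom index	Implement HoodieIndex and set hoodie.index.class to apply it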

@EnumDescription("Determines how input records are indexed, i.e., looked up based on the key "
    + "for the location in the existing table. Default is SIMPLE on Spark engine, and INMEMORY "
    + "on Flink and Java engines.")
public enum IndexType {
    HBASE,
    INMEMORY,
    BLOOM,
    GLOBAL_BLOOM,
    SIMPLE,
    GLOBAL_SIMPLE,
    BUCKET,
    FLINK_STATE
}
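As a rough illustration of how one of these values might be selected, the sketch below builds a `java.util.Properties` object carrying `hoodie.index.type`. The class `IndexTypeConfigSketch` and the method `indexProps` are hypothetical names for this example, the enum is a local copy of the constants listed above so the code compiles without Hudi on the classpath, and how the properties reach an actual writer is left out.

```java
import java.util.Locale;
import java.util.Properties;

public class IndexTypeConfigSketch {
    // Local copy of the constants from the enum above (not the Hudi class itself).
    enum IndexType {
        HBASE, INMEMORY, BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, BUCKET, FLINK_STATE
    }

    // Builds properties that select an index type.
    // valueOf throws IllegalArgumentException on an unknown name, so typos fail early.
    public static Properties indexProps(String indexType) {
        IndexType parsed = IndexType.valueOf(indexType.toUpperCase(Locale.ROOT));
        Properties props = new Properties();
        props.setProperty("hoodie.index.type", parsed.name());
        return props;
    }

    public static void main(String[] args) {
        Properties props = indexProps("global_bloom");
        System.out.println(props.getProperty("hoodie.index.type")); // prints GLOBAL_BLOOM
    }
}
```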

Starting with hudi-0.13.0, the BUCKET index type is further divided into two variants:

Variant	Description	Restriction
SIMPLE	Fixed number of buckets	None
CONSISTENT_HASHING	Dynamic number of buckets	MOR tables only
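The restriction in the table can be checked before any job is submitted. The sketch below is one way to do that under stated assumptions: `BucketIndexConfigSketch` and `bucketProps` are hypothetical names, and the table type strings `COPY_ON_WRITE` and `MERGE_ON_READ` are assumed to be how the caller labels COW and MOR tables. Only the two property keys come from this article.

```java
import java.util.Properties;

public class BucketIndexConfigSketch {
    // Builds bucket-index properties, rejecting CONSISTENT_HASHING on non-MOR tables.
    // tableType is assumed to be "COPY_ON_WRITE" or "MERGE_ON_READ".
    public static Properties bucketProps(String engine, String tableType) {
        boolean simple = "SIMPLE".equals(engine);
        boolean consistent = "CONSISTENT_HASHING".equals(engine);
        if (!simple && !consistent) {
            throw new IllegalArgumentException("Unknown bucket engine: " + engine);
        }
        if (consistent && !"MERGE_ON_READ".equals(tableType)) {
            throw new IllegalArgumentException(
                "CONSISTENT_HASHING is only available for MOR tables, got: " + tableType);
        }
        Properties props = new Properties();
        props.setProperty("hoodie.index.type", "BUCKET");
        props.setProperty("hoodie.index.bucket.engine", engine);
        return props;
    }

    public static void main(String[] args) {
        Properties props = bucketProps("CONSISTENT_HASHING", "MERGE_ON_READ");
        System.out.println(props.getProperty("hoodie.index.bucket.engine"));
    }
}
```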

Common configuration

Config key	Default	Description	Introduced in
hoodie.index.type	No default	Index type. Allowed values: HBASE, INMEMORY, BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, BUCKET, FLINK_STATE
hoodie.index.class	""	Index class to use; must be a subclass of HoodieIndex. Built-in implementations include SparkHoodieHBaseIndex, HoodieBloomIndex, FlinkInMemoryStateIndex, HoodieSimpleBucketIndex, and HoodieSparkConsistentBucketIndex

Bloom index configuration

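Config key	Default	Description
hoodie.index.bloom.num_entries	60000	Number of entries the Bloom filter is sized for
hoodie.index.bloom.fpp	0.000000001	Bloom filter false positive probability
hoodie.bloom.index.parallelism	0	Parallelism of the Bloom index lookup; 0 means it is chosen automatically based on the workload
hoodie.bloom.index.use.caching	true	Whether to cache the Bloom index computation
hoodie.bloom.index.filter.type	DYNAMIC_V0	Bloom filter type; either DYNAMIC_V0 or SIMPLE
hoodie.bloom.index.keys.per.bucket	10000000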
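To get a feel for what the two defaults num_entries = 60000 and fpp = 0.000000001 imply, the textbook Bloom filter sizing formulas can be applied: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below only evaluates those standard formulas; it is an assumption that they approximate Hudi's behavior, and Hudi's actual sizing (especially for the DYNAMIC_V0 type) may differ. `BloomSizingSketch` and its methods are hypothetical names.

```java
public class BloomSizingSketch {
    // Textbook optimal bit count for n entries at false positive probability p:
    // m = -n * ln(p) / (ln 2)^2
    public static long optimalBits(long entries, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-entries * Math.log(fpp) / (ln2 * ln2));
    }

    // Textbook optimal number of hash functions: k = (m / n) * ln 2
    public static int optimalHashes(long entries, long bits) {
        return Math.max(1, (int) Math.round((double) bits / entries * Math.log(2)));
    }

    public static void main(String[] args) {
        long entries = 60000;        // hoodie.index.bloom.num_entries default
        double fpp = 0.000000001;    // hoodie.index.bloom.fpp default
        long bits = optimalBits(entries, fpp);
        System.out.println(bits + " bits, about " + (bits / 8 / 1024) + " KiB, "
            + optimalHashes(entries, bits) + " hash functions");
    }
}
```

With the defaults this comes out to roughly 2.6 million bits, a few hundred kilobytes per filter, which shows the trade-off: a very low false positive probability costs about 43 bits per key.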

SIMPLE index configuration

Config key	Default	Description	Introduced in
hoodie.simple.index.use.caching	true
hoodie.simple.index.parallelism	0
hoodie.global.simple.index.parallelism	100
