Learning Hudi

Published: 2023-05-23 20:37:18 Author: 钱塘江畔

1. Background

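I want to manage all of my data (mostly unstructured) in one place. Data lakes came to mind, so these notes look at whether Hudi fits that need.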

2. Hudi overview

2.1 Hudi features

Mutability support for all data lake workloads
Quickly update & delete data with Hudi's fast, pluggable indexing. This includes streaming workloads, with full support for out-of-order data, bursty traffic & data deduplication.
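
To make the update and delete feature a bit more concrete, here is a minimal sketch of how a write is usually configured from PySpark. The option keys are the Hudi Spark datasource keys as documented for the 0.x releases (check them against your version); the helper names, the table name `my_table`, and the field names `uuid`, `ts` and `partitionpath` are assumptions made up for illustration.

```python
# Hypothetical helpers: only the "hoodie.*" option keys and the
# DataFrameWriter calls come from Hudi / PySpark; everything else is
# illustrative naming.

def hudi_write_options(table_name, record_key, precombine_field,
                       partition_field, operation="upsert"):
    """Build the option dict for a Hudi write through the Spark datasource."""
    if operation not in ("upsert", "insert", "bulk_insert", "delete"):
        raise ValueError(f"unsupported operation: {operation}")
    return {
        "hoodie.table.name": table_name,
        # upsert updates existing keys and inserts new ones; delete removes them
        "hoodie.datasource.write.operation": operation,
        # the record key identifies a row
        "hoodie.datasource.write.recordkey.field": record_key,
        # when several incoming rows share a key, the largest value wins
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


def write_to_hudi(df, base_path, options):
    """Append a Spark DataFrame to a Hudi table.

    Needs a Spark session started with the Hudi Spark bundle on the classpath.
    """
    df.write.format("hudi").options(**options).mode("append").save(base_path)


upsert_opts = hudi_write_options("my_table", "uuid", "ts", "partitionpath")
delete_opts = hudi_write_options("my_table", "uuid", "ts", "partitionpath",
                                 operation="delete")
```

With a real DataFrame `df`, `write_to_hudi(df, "/tmp/hudi/my_table", upsert_opts)` would apply the upsert. The precombine field is what lets late or duplicated records for the same key collapse into one row.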

Improved efficiency by incrementally processing new data
Replace old-school batch pipelines with incremental streaming on your data lake. Experience faster ingestion and lower processing times for analytical workloads.
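
Incremental processing shows up on the read side as an incremental query: instead of rescanning the table, a job asks only for records written after a given commit. A rough sketch, assuming the Spark datasource read options documented for Hudi 0.x (the helper names and the example instant are made up):

```python
# Hypothetical helpers around Hudi's incremental query options.

def hudi_incremental_read_options(begin_instant, end_instant=None):
    """Build read options that return only changes after begin_instant."""
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # optional upper bound; without it the read runs up to the latest commit
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


def read_incremental(spark, base_path, begin_instant, end_instant=None):
    """Load the changed records as a DataFrame (needs the Hudi Spark bundle)."""
    opts = hudi_incremental_read_options(begin_instant, end_instant)
    return spark.read.format("hudi").options(**opts).load(base_path)


incr_opts = hudi_incremental_read_options("20230523000000000")
```

A downstream job would typically remember the last instant it processed and pass it as `begin_instant` on the next run.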

ACID transactional guarantees for your data lake
Bring transactional guarantees to your data lake, with consistent, atomic writes and concurrency controls tailored for longer-running lake transactions.

Unlock historical data with time travel
Query historical data with the ability to roll back to a table version; debug data versions to understand what changed over time; audit data changes by viewing the commit history.
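
A time travel query reads the table as it looked at a past instant. The sketch below assumes the `as.of.instant` read option from the Hudi Spark datasource and the `yyyyMMddHHmmssSSS` instant format; the helper names are hypothetical, and the datetime is taken as already being in the timezone the table's timeline uses.

```python
from datetime import datetime


def to_hudi_instant(dt):
    """Format a datetime as a yyyyMMddHHmmssSSS instant string."""
    millis = dt.microsecond // 1000
    return dt.strftime("%Y%m%d%H%M%S") + f"{millis:03d}"


def read_as_of(spark, base_path, dt):
    """Load a snapshot of the table as of dt (needs the Hudi Spark bundle)."""
    return (spark.read.format("hudi")
            .option("as.of.instant", to_hudi_instant(dt))
            .load(base_path))


instant = to_hudi_instant(datetime(2023, 5, 23, 20, 37, 18))
```

Comparing the result of `read_as_of` for two different instants is one simple way to see what changed between two versions of the table.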

Interoperable multi-cloud ecosystem support
Extensive ecosystem support with plug-and-play options for popular data sources & query engines. Build future-proof architectures interoperable with your vendor of choice.

Comprehensive table services for high-performance analytics
Fully automated table services that continuously schedule & orchestrate clustering, compaction, cleaning, file sizing & indexing to ensure tables are always ready.

A rich platform to build your lakehouse faster
Effortlessly build your lakehouse with built-in tools for auto ingestion from services like Debezium and Kafka and auto catalog sync for easy discoverability & more.

Query acceleration through multi-modal indexes
Experience faster write transactions on huge/wide tables & faster query performance with a first-of-its-kind multi-modal indexing subsystem.

Resilient pipelines with schema evolution & enforcement
Easily change the current schema of a Hudi table to adapt to the data that is changing over time and ensure pipeline resilience by failing fast and avoiding data corruption.

2.2

3. References