Data Lake Hudi Basics: Integrating with Spark (Environment, spark-shell, Spark SQL, and Access from Spark Code)


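This post was collected during recent development work by 畅快棒球, a blogger on Kaopuke (靠谱客). It introduces the basics of integrating Hudi with Spark (environment setup, spark-shell, Spark SQL, and access from Spark code) and is shared here as a reference.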

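Overview

This post records how to integrate Hudi with Spark. The steps follow the publicly available Atguigu (尚硅谷) Hudi materials and the official Hudi documentation.

For details, see the official documentation: https://hudi.apache.org/docs/0.12.1/quick-start-guide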

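Table of contents

Environment
Using spark-shell
- Start spark-shell
- Set the table name, base path, and data generator
- Insert data
- Query data
- Time travel query
- Update data
- Incremental query
- Point-in-time query
- Delete data
- Overwrite data
Spark SQL
- Create tables
  - Create a non-partitioned table
  - Create a partitioned table
  - Create a table on top of an existing Hudi table
  - Create a table with CTAS (Create Table As Select)
- Insert data
  - Insert into a non-partitioned table
  - Insert into a partitioned table with dynamic partitions
  - Insert into a partitioned table with static partitions
  - Insert data with bulk_insert
- Query data
  - Query
  - Time travel query
- Update data
  - update
  - MergeInto
- Delete data
- Overwrite data
  - insert overwrite on a non-partitioned table
  - insert overwrite table on a partitioned table with dynamic partitions
  - insert overwrite with static partitions
- Alter table schema
- Alter partitions
- Stored procedures (Procedures)
Access from Spark code
- Maven project POM
- Insert data
- Query data
- Update data
- Point-in-time query
- Incremental query
- Delete data
- Overwrite data
- Submit and run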

Environment

Hudi	Supported Spark 3 versions
0.12.x	3.3.x, 3.2.x, 3.1.x
0.11.x	3.2.x (default build, Spark bundle only), 3.1.x
0.10.x	3.1.x (default build), 3.0.x
0.7.0-0.9.0	3.0.x
0.6.0 and prior	Not supported
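
The compatibility table above can be encoded as a small lookup, for example to fail fast before launching a job. The following is a minimal sketch in plain Scala: the object and method names are hypothetical, and the data is copied only from the table above (the 0.7.0-0.9.0 row is expanded into one entry per minor line).

```scala
// Hypothetical helper: encodes the Hudi / Spark 3 compatibility table from this post.
object HudiSparkCompat {
  // Hudi "major.minor" line -> supported Spark 3 "major.minor" lines (from the table above).
  private val supported: Map[String, Set[String]] = Map(
    "0.12" -> Set("3.3", "3.2", "3.1"),
    "0.11" -> Set("3.2", "3.1"),
    "0.10" -> Set("3.1", "3.0"),
    "0.9"  -> Set("3.0"),
    "0.8"  -> Set("3.0"),
    "0.7"  -> Set("3.0")
  )

  // Reduce a full version such as "3.2.2" to its "major.minor" prefix.
  private def minor(version: String): String =
    version.split('.').take(2).mkString(".")

  // Unknown or older Hudi lines (0.6.0 and prior) are reported as unsupported.
  def isSupported(hudiVersion: String, sparkVersion: String): Boolean =
    supported.getOrElse(minor(hudiVersion), Set.empty[String]).contains(minor(sparkVersion))
}
```

For the setup used in this post, `HudiSparkCompat.isSupported("0.12.0", "3.2.2")` evaluates to true.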

Install Spark 3.2.2 (this post was tested with Spark 3.2.2).
Start the Hadoop cluster.
Copy the compiled Hudi bundle jar into the jars directory under the Spark home.


Using spark-shell

All of the Scala code below was run successfully.

Start spark-shell

spark-shell \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

Set the table name, base path, and data generator

No separate CREATE TABLE step is needed: if the table does not exist, the first write creates it.

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator

Insert data

Generate records with the data generator, load them into a DataFrame, and write the DataFrame to the Hudi table.

val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)

mode(Overwrite) overwrites and recreates the table if it already exists. You can check whether data files were generated under /tmp/hudi_trips_cow:

[root@m3 sao_paulo]# pwd
/tmp/hudi_trips_cow/americas/brazil/sao_paulo
[root@m3 sao_paulo]# ls
4fb0565d-8c7c-4375-b038-c6068753bf24-0_0-28-34_20230116101745208.parquet

Data file naming rule: String.format("%s_%s_%s.%s", fileId, writeToken, instantTime, fileExtension)
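
To make the naming rule concrete, the sketch below splits a data file name back into its four parts. It is only an illustration in plain Scala, not Hudi's own parsing code: `HudiFileName` and `Parts` are hypothetical names, and it assumes that neither the file ID nor the write token contains an underscore, as in the listing above.

```scala
// Hypothetical helper: splits a Hudi data file name that follows the
// "<fileId>_<writeToken>_<instantTime>.<extension>" rule quoted above.
object HudiFileName {
  final case class Parts(fileId: String, writeToken: String, instantTime: String, extension: String)

  def parse(name: String): Option[Parts] = {
    val dot = name.lastIndexOf('.')
    if (dot <= 0) None
    else {
      val extension = name.substring(dot + 1)
      // Assumption: fileId and writeToken use '-' internally, never '_'.
      name.substring(0, dot).split('_') match {
        case Array(fileId, writeToken, instantTime) =>
          Some(Parts(fileId, writeToken, instantTime, extension))
        case _ => None
      }
    }
  }
}
```

Applied to the file listed above, the instant time part is `20230116101745208`, which is the commit timestamp that later queries expose as `_hoodie_commit_time`.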

Query data

val tripsSnapshotDF = spark.
read.
format("hudi").
load(basePath)
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()

Time travel query

spark.read.
format("hudi").
option("as.of.instant", "20210728141108100").
load(basePath)
spark.read.
format("hudi").
option("as.of.instant", "2021-07-28 14:11:08.200").
load(basePath)
// Equivalent to "as.of.instant = 2021-07-28 00:00:00"
spark.read.
format("hudi").
option("as.of.instant", "2021-07-28").
load(basePath)
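
The three `as.of.instant` values above describe the same kind of instant in different notations. As a rough illustration of how they relate, the sketch below converts each notation to the 17-digit form. This is not Hudi's implementation: `InstantFormats.normalize` is a hypothetical helper written with `java.time`, and it only handles the three notations shown in this section.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper: maps the three notations used above to yyyyMMddHHmmssSSS.
object InstantFormats {
  private val compact  = DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS")
  private val dateTime = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  def normalize(value: String): String = value.trim match {
    // Already in the compact 17-digit form.
    case v if v.matches("\\d{17}") => v
    // Date only: treated as midnight at the start of that day.
    case v if v.matches("\\d{4}-\\d{2}-\\d{2}") =>
      LocalDate.parse(v).atStartOfDay().format(compact)
    // Date and time with milliseconds.
    case v => LocalDateTime.parse(v, dateTime).format(compact)
  }
}
```

With this helper, `"2021-07-28"` and `"2021-07-28 00:00:00.000"` normalize to the same value, which matches the comment on the third read above.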

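Update data

Updating works like inserting: use the data generator to produce new versions of existing records, load them into a DataFrame, and write the DataFrame to the Hudi table.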

val updates = convertToStringList(dataGen.generateUpdates(10))
val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)

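Note: the save mode is now Append. In general, always use Append mode unless you are creating the table for the first time. Querying the data again now shows the updated trips. Every write operation produces a new commit identified by a timestamp. Look for changes in the _hoodie_commit_time, rider, and driver fields for the same _hoodie_record_key values seen in the previous commit.

To query the updated data, reload the Hudi table, or use a new DataFrame: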

val tripsSnapshotDF = spark.
read.
format("hudi").
load(basePath)
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
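
The save-mode rule from the note in this section (Overwrite only when the table is first created, Append afterwards) can be sketched as a small check. This is an illustration under stated assumptions: `SaveModeChooser` is a hypothetical helper, it only works for tables on the local file system, and it treats the presence of the `.hoodie` metadata directory under the base path as the sign that the table already exists. It returns the mode name as a string so that the sketch has no Spark dependency.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper: picks the save mode name for a table on the local file system.
object SaveModeChooser {
  // A Hudi table keeps its metadata in a ".hoodie" directory under the base path,
  // so its presence is used here as the "table already exists" signal.
  def choose(basePath: Path): String =
    if (Files.isDirectory(basePath.resolve(".hoodie"))) "Append" else "Overwrite"
}
```

In spark-shell the result could be passed to `mode(...)`, which also accepts the mode name as a string.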

Incremental query

An incremental query returns the stream of records changed since a given commit timestamp. You must specify beginTime for the incremental query; endTime is optional. If you want all changes after a given commit, which is the common case, you do not need to specify endTime.

// Reload the data
spark.
read.
format("hudi").
load(basePath).
createOrReplaceTempView("hudi_trips_snapshot")
// Pick the beginTime (here, the second-to-last commit)
val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
val beginTime = commits(commits.length - 2)
// Create the incremental view
val tripsIncrementalDF = spark.read.format("hudi").
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
load(basePath)
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
// Query the incremental view
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
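
Note that `commits(commits.length - 2)` throws an exception when the table has only one commit. A more defensive way to pick the begin time might look like the sketch below. `BeginTimePicker` is a hypothetical helper; it falls back to "000", the value this post uses elsewhere to mean the earliest commit time.

```scala
// Hypothetical helper: picks the second-to-last commit as the begin time.
object BeginTimePicker {
  // commits must be sorted in ascending order, as in the query above.
  def secondToLast(commits: Seq[String]): String =
    commits.dropRight(1).lastOption.getOrElse("000")
}
```

In spark-shell, `val beginTime = BeginTimePicker.secondToLast(commits)` would replace the direct array access.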

Point-in-time query

// To query the data as of a specific point in time, set endTime to that time
// and set beginTime to "000" (which stands for the earliest commit time)
val beginTime = "000"
val endTime = commits(commits.length - 2)
// Create a view bounded by the given times
val tripsPointInTimeDF = spark.read.format("hudi").
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
// Query
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
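
The time window used by incremental and point-in-time queries can be modeled on plain collections. The sketch below is a simplified model, not Hudi code: `IncrementalWindow` and `Row` are hypothetical, and it assumes the begin time is exclusive and the end time inclusive, which is how I read the Hudi option descriptions; verify this against the documentation for your version.

```scala
// Simplified model of the commit-time window of an incremental query.
object IncrementalWindow {
  final case class Row(commitTime: String, fare: Double)

  // Assumption: begin is exclusive, end is inclusive; None means "no upper bound".
  // Commit times are fixed-width digit strings, so string comparison orders them correctly.
  def select(rows: Seq[Row], begin: String, end: Option[String] = None): Seq[Row] =
    rows.filter(r => r.commitTime > begin && end.forall(r.commitTime <= _))
}
```

Passing `begin = "000"` with an `end` value gives the point-in-time behavior shown above, while omitting `end` gives the plain incremental query.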

Delete data

// Delete records by the HoodieKeys passed in (uuid + partitionpath).
// Deletes are only supported in Append mode.
// Get the total row count
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
// Pick 2 records to delete
val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
// Build a DataFrame from the 2 records to delete
val deletes = dataGen.generateDeletes(ds.collectAsList())
val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
// Run the delete
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(OPERATION_OPT_KEY,"delete").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)
// Count the rows after the delete to verify that it succeeded
val roAfterDeleteViewDF = spark.
read.
format("hudi").
load(basePath)
// createOrReplaceTempView replaces the deprecated registerTempTable
roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
// The total row count should be 2 fewer than before
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()

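Overwrite data

For a table or partition where most records change in every cycle, upsert or merge is inefficient. What we want is an operation like Hive's "insert overwrite" that ignores the existing data and creates a commit containing only the new data provided.

It is also useful for operational tasks such as repairing a specific bad partition: we can "insert overwrite" that partition with records from the source files. For some data sources this is much faster than restoring and replaying.

Insert overwrite can be faster than upsert for batch ETL jobs that recompute the whole target partition in every batch, because it skips the indexing, precombining, and other repartitioning steps that upsert performs.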
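
The difference between the two operations can be shown on plain Scala maps. The sketch below is a toy model under simplifying assumptions, not Hudi code: `OverwriteModel` is hypothetical, a table is modeled as partition path to (record key to value), and commits, file groups, and indexing are ignored.

```scala
// Toy model: upsert merges record by record, insert_overwrite replaces whole partitions.
object OverwriteModel {
  // partition path -> (record key -> value)
  type Table = Map[String, Map[String, Int]]

  // Upsert: incoming records are merged into the existing records of each partition.
  def upsert(table: Table, batch: Table): Table =
    batch.foldLeft(table) { case (acc, (partition, records)) =>
      acc.updated(partition, acc.getOrElse(partition, Map.empty[String, Int]) ++ records)
    }

  // Insert overwrite: every partition present in the batch is replaced wholesale;
  // partitions that the batch does not touch are left as they were.
  def insertOverwrite(table: Table, batch: Table): Table =
    table ++ batch
}
```

In this model an overwritten partition ends up holding only the records from the batch, which is why the keys of the overwritten partition change in the spark-shell example of this section.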

// Show the keys currently in the table
spark.
read.format("hudi").
load(basePath).
select("uuid","partitionpath").
sort("partitionpath","uuid").
show(100, false)
// Generate some new trip data
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.
read.json(spark.sparkContext.parallelize(inserts, 2)).
filter("partitionpath = 'americas/united_states/san_francisco'")
// Overwrite the selected partition
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(OPERATION.key(),"insert_overwrite").
option(PRECOMBINE_FIELD.key(), "ts").
option(RECORDKEY_FIELD.key(), "uuid").
option(PARTITIONPATH_FIELD.key(), "partitionpath").
option(TBL_NAME.key(), tableName).
mode(Append).
save(basePath)
// Query the keys again after the overwrite: they have changed
spark.
read.format("hudi").
load(basePath).
select("uuid","partitionpath").
sort("partitionpath","uuid").
show(100, false)

Spark SQL

Create tables

Start the Hive Metastore
```
nohup hive --service metastore &
```

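Start spark-sql

spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

Note: copy hive-site.xml into Spark's conf directory, and copy the MySQL driver jar from Hive's lib directory into Spark's jars directory.

Note: by default, Spark starts the executors and the driver with 1 GB of memory each.

Table creation parameters

Parameter	Default	Description
primaryKey	uuid	Primary key name(s) of the table; separate multiple fields with commas. Same as hoodie.datasource.write.recordkey.field
preCombineField		Pre-combine field of the table. Same as hoodie.datasource.write.precombine.field
type	cow	Type of table to create: type = 'cow' or type = 'mor'. Same as hoodie.datasource.write.table.type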

Create a non-partitioned table

Create a COW table that uses the default primaryKey 'uuid' and has no preCombineField:

create table hudi_cow_nonpcf_tbl (
uuid int,
name string,
price double
) using hudi;

Create a non-partitioned MOR table:

create table hudi_mor_tbl (
id int,
name string,
price double,
ts bigint
) using hudi
tblproperties (
type = 'mor',
primaryKey = 'id',
preCombineField = 'ts'
);

Create a partitioned table

Create a partitioned external COW table, specifying primaryKey and preCombineField:

create table hudi_cow_pt_tbl (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh)
location '/tmp/hudi_test/hudi_cow_pt_tbl';

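Create a table on top of an existing Hudi table

You do not need to specify the schema or any properties other than the partition columns (if any); Hudi detects the schema and configuration automatically.

Non-partitioned table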

create table hudi_existing_tbl0 using hudi
location 'file:///tmp/hudi_test/hudi_cow_pt_tbl';

Partitioned table

create table hudi_existing_tbl1 using hudi
partitioned by (dt, hh)
location 'file:///tmp/hudi_test/hudi_cow_pt_tbl';

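Create a table with CTAS (Create Table As Select)

To load data into a Hudi table faster, CTAS uses bulk insert as the write operation.

Create a non-partitioned COW table with CTAS, without specifying preCombineField: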

create table hudi_ctas_cow_nonpcf_tbl
using hudi
tblproperties (primaryKey = 'id')
as
select 1 as id, 'a1' as name, 10 as price;

Insert data

By default, insert into uses upsert as the write operation if a preCombineKey is provided; otherwise it uses insert.

Insert into a non-partitioned table

insert into hudi_cow_nonpcf_tbl select 1, 'a1', 20;
insert into hudi_mor_tbl select 1, 'a1', 20, 1000;

Insert into a partitioned table with dynamic partitions

insert into hudi_cow_pt_tbl partition (dt, hh)
select 1 as id, 'a1' as name, 1000 as ts, '2023-01-17' as dt, '10' as hh;

Insert into a partitioned table with static partitions

insert into hudi_cow_pt_tbl partition(dt = '2023-01-17', hh='11') select 2, 'a2', 1000;

Insert data with bulk_insert

Hudi supports bulk_insert as the write operation type. You only need to set two configs: hoodie.sql.bulk.insert.enable and hoodie.sql.insert.mode.

-- Insert into a table that has a preCombineKey: the write operation is upsert
insert into hudi_mor_tbl select 1, 'a1_1', 20, 1001;
select id, name, price, ts from hudi_mor_tbl;
1	a1_1	20.0	1001
Time taken: 0.907 seconds, Fetched 1 row(s)
-- Insert into a table that has a preCombineKey, with the write operation set to bulk_insert
set hoodie.sql.bulk.insert.enable=true;
set hoodie.sql.insert.mode=non-strict;
insert into hudi_mor_tbl select 1, 'a1_2', 20, 1002;
select id, name, price, ts from hudi_mor_tbl;
1	a1_2	20.0	1002
1	a1_1	20.0	1001
Time taken: 0.328 seconds, Fetched 2 row(s)
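
The two outputs above differ because upsert keeps one row per primary key while bulk_insert simply appends. The sketch below models that difference on an in-memory sequence. It is a deliberately simplified model, not Hudi's write path: `WriteOpsModel` and `Rec` are hypothetical, and the rule "the row with the larger ts wins" is an assumption that matches this example; the real outcome depends on the payload class the table is configured with.

```scala
// Simplified model of upsert versus bulk_insert on a table keyed by id.
object WriteOpsModel {
  final case class Rec(id: Int, name: String, price: Double, ts: Long)

  // upsert: one row per id; on a key collision the row with the larger ts is kept
  // (the incoming row wins ties because it comes later in the sequence).
  def upsert(table: Seq[Rec], incoming: Seq[Rec]): Seq[Rec] =
    (table ++ incoming)
      .groupBy(_.id)
      .values
      .map(_.reduce((a, b) => if (b.ts >= a.ts) b else a))
      .toSeq
      .sortBy(_.id)

  // bulk_insert: rows are appended without any key lookup, so duplicates can appear.
  def bulkInsert(table: Seq[Rec], incoming: Seq[Rec]): Seq[Rec] =
    table ++ incoming
}
```

Replaying the statements above against this model gives one row after the upsert and two rows with id 1 after the bulk insert, as in the query results.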

Query data

Query

select fare, begin_lon, begin_lat, ts from
hudi_trips_snapshot where fare > 20.0;

Time travel query

Hudi has supported time travel queries since 0.9.0. The Spark SQL syntax requires Spark 3.2 or later.

-- Turn off the bulk_insert setting enabled earlier
set hoodie.sql.bulk.insert.enable=false;
create table hudi_cow_pt_tbl1 (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl1';
-- Insert a record with id 1
insert into hudi_cow_pt_tbl1 select 1, 'a0', 1000, '2023-01-17', '10';
select * from hudi_cow_pt_tbl1;
-- Modify the record with id 1
insert into hudi_cow_pt_tbl1 select 1, 'a1', 1001, '2023-01-17', '10';
select * from hudi_cow_pt_tbl1;
-- Time travel based on the time of the first commit
select * from hudi_cow_pt_tbl1 timestamp as of '20230116143633359' where id = 1;
-- The same time travel query with other timestamp formats
select * from hudi_cow_pt_tbl1 timestamp as of '2023-01-16 14:36:33.359' where id = 1;
select * from hudi_cow_pt_tbl1 timestamp as of '2023-01-16' where id = 1;

Update data

update

An update requires preCombineField to be specified.

UPDATE tableIdentifier SET column = EXPRESSION(,column = EXPRESSION) [ WHERE boolExpression]

Update examples

update hudi_mor_tbl set price = price * 2, ts = 1111 where id = 1;
update hudi_cow_pt_tbl1 set name = 'a1_1', ts = 1001 where id = 1;
-- update using non-PK field
update hudi_cow_pt_tbl1 set ts = 1111 where name = 'a1_1';

MergeInto

MERGE INTO tableIdentifier AS target_alias
USING (sub_query | tableIdentifier) AS source_alias
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]

<merge_condition> = an equality boolean condition
<matched_action> =
  DELETE |
  UPDATE SET * |
  UPDATE SET column1 = expression1 [, column2 = expression2 ...]
<not_matched_action> =
  INSERT * |
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

MergeInto examples

-- 1. Prepare the source table: a non-partitioned Hudi table, and insert data
create table merge_source (id int, name string, price double, ts bigint) using hudi
tblproperties (primaryKey = 'id', preCombineField = 'ts');
insert into merge_source values (1, "old_a1", 22.22, 2900), (2, "new_a2", 33.33, 2000), (3, "new_a3", 44.44, 2000);
merge into hudi_mor_tbl as target
using merge_source as source
on target.id = source.id
when matched then update set *
when not matched then insert *
;
-- 2. Prepare a second source table: a parquet table with dt and hh columns, and insert data
create table merge_source2 (id int, name string, flag string, dt string, hh string) using parquet;
insert into merge_source2 values (1, "new_a1", 'update', '2021-12-09', '10'), (2, "new_a2", 'delete', '2021-12-09', '11'), (3, "new_a3", 'insert', '2021-12-09', '12');
merge into hudi_cow_pt_tbl1 as target
using (
select id, name, '2000' as ts, flag, dt, hh from merge_source2
) source
on target.id = source.id
when matched and flag != 'delete' then
update set id = source.id, name = source.name, ts = source.ts, dt = source.dt, hh = source.hh
when matched and flag = 'delete' then delete
when not matched then
insert (id, name, ts, dt, hh) values(source.id, source.name, source.ts, source.dt, source.hh)
;
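
The second merge statement above combines three clauses: update on match, delete on match when the flag is 'delete', and insert when there is no match. The sketch below replays that logic on an in-memory map to make the outcome easy to follow. It is a model only, not how Hudi executes MERGE INTO: `MergeModel`, `Row`, and `Src` are hypothetical, rows are reduced to id and name, and source ids are assumed to be unique.

```scala
// In-memory model of the three MERGE INTO clauses used in the example above.
object MergeModel {
  final case class Row(id: Int, name: String)
  final case class Src(row: Row, flag: String)

  def merge(target: Map[Int, Row], source: Seq[Src]): Map[Int, Row] =
    source.foldLeft(target) { (acc, s) =>
      // Matching is decided against the original target (ON target.id = source.id).
      target.get(s.row.id) match {
        case Some(_) if s.flag == "delete" => acc - s.row.id               // matched and flag = 'delete': delete
        case Some(_)                       => acc.updated(s.row.id, s.row) // matched and flag != 'delete': update
        case None                          => acc.updated(s.row.id, s.row) // not matched: insert
      }
    }
}
```

With a target holding ids 1 and 2 and the three source rows from merge_source2, the model ends with id 1 updated, id 2 deleted, and id 3 inserted.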

Delete data

DELETE FROM tableIdentifier [ WHERE BOOL_EXPRESSION]

delete from hudi_cow_nonpcf_tbl where uuid = 1;
delete from hudi_mor_tbl where id % 2 = 0;
-- Delete using a non-primary-key field
delete from hudi_cow_pt_tbl1 where name = 'a1_1';

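Overwrite data

The INSERT_OVERWRITE write operation is used to overwrite a partitioned table.
The INSERT_OVERWRITE_TABLE write operation is used to insert overwrite a non-partitioned table, or a partitioned table with dynamic partitions.

insert overwrite on a non-partitioned table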

insert overwrite hudi_mor_tbl select 99, 'a99', 20.0, 900;
insert overwrite hudi_cow_nonpcf_tbl select 99, 'a99', 20.0;

insert overwrite table on a partitioned table with dynamic partitions

insert overwrite table hudi_cow_pt_tbl1 select 10, 'a10', 1100, '2021-12-09', '11';

insert overwrite with static partitions

insert overwrite hudi_cow_pt_tbl1 partition(dt = '2021-12-09', hh='12') select 13, 'a13', 1100;

Alter table schema

-- Alter table name
ALTER TABLE oldTableName RENAME TO newTableName
-- Alter table add columns
ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)
-- Alter table column type
ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType
-- Alter table properties
ALTER TABLE tableIdentifier SET TBLPROPERTIES (key = 'value')

Examples

--rename to:
ALTER TABLE hudi_cow_nonpcf_tbl RENAME TO hudi_cow_nonpcf_tbl2;
--add column:
ALTER TABLE hudi_cow_nonpcf_tbl2 add columns(remark string);
--change column:
ALTER TABLE hudi_cow_nonpcf_tbl2 change column uuid uuid int;
--set properties;
alter table hudi_cow_nonpcf_tbl2 set tblproperties (hoodie.keep.max.commits = '10');

Alter partitions

-- Syntax
-- Drop Partition
ALTER TABLE tableIdentifier DROP PARTITION ( partition_col_name = partition_col_val [ , ... ] )
-- Show Partitions
SHOW PARTITIONS tableIdentifier

Examples

--show partition:
show partitions hudi_cow_pt_tbl1;
--drop partition:
alter table hudi_cow_pt_tbl1 drop partition (dt='2021-12-09', hh='10');

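Note: SHOW PARTITIONS results are derived from the table path on the file system, so they may be inaccurate after all data in a partition has been deleted or after a partition has been dropped directly.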

Stored procedures (Procedures)

-- Syntax
--Call procedure by positional arguments
CALL system.procedure_name(arg_1, arg_2, ... arg_n)
--Call procedure by named arguments
CALL system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1, ... arg_name_n => arg_n)

Example

Procedure reference: https://hudi.apache.org/docs/procedures/

--show commit's info
call show_commits(table => 'hudi_cow_pt_tbl1', limit => 10);

Access from Spark code

Besides working interactively in the shell, you can write your own Spark program, package it, and submit it.

Maven project POM

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>spark_learn</artifactId>
<groupId>org.example</groupId>
<version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>spark_hudi</artifactId>
<properties>
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.2.2</spark.version>
<hadoop.version>3.1.3</hadoop.version>
<hudi.version>0.12.0</hudi.version>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<!-- Scala language dependency -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Spark Core dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- Spark SQL dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- Hadoop Client dependency -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
<!-- hudi-spark3.2 -->
<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-spark3.2-bundle_${scala.binary.version}</artifactId>
<version>${hudi.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<!-- Assembly packaging plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<archive>
<manifest>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
<!-- Plugin Maven needs to compile Scala -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>

Insert data

// Package taken from the --class argument used in the submit section
package com.mym.spark.hudi

import org.apache.hudi.QuickstartUtils._
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

object InsertDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val tableName = "hudi_trips_cow"
    val basePath = "hdfs://m1:8020/tmp/hudi_trips_cow"
    val dataGen = new DataGenerator

    // Generate 10 sample trips and write them, (re)creating the table
    val inserts = convertToStringList(dataGen.generateInserts(10))
    val df = sparkSession.read.json(sparkSession.sparkContext.parallelize(inserts, 2))
    df.write.format("hudi").
      options(getQuickstartWriteConfigs).
      option(PRECOMBINE_FIELD.key(), "ts").
      option(RECORDKEY_FIELD.key(), "uuid").
      option(PARTITIONPATH_FIELD.key(), "partitionpath").
      option(TBL_NAME.key(), tableName).
      mode(Overwrite).
      save(basePath)
  }
}

Query data

// Package taken from the --class argument used in the submit section
package com.mym.spark.hudi

import org.apache.spark.SparkConf
import org.apache.spark.sql._

object QueryDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val basePath = "hdfs://m1:8020/tmp/hudi_trips_cow"
    val tripsSnapshotDF = sparkSession.
      read.
      format("hudi").
      load(basePath)

    // Time travel query, variant 1
    // sparkSession.read.
    //   format("hudi").
    //   option("as.of.instant", "20210728141108100").
    //   load(basePath)

    // Time travel query, variant 2
    // sparkSession.read.
    //   format("hudi").
    //   option("as.of.instant", "2021-07-28 14:11:08.200").
    //   load(basePath)

    // Time travel query, variant 3: equivalent to "as.of.instant = 2021-07-28 00:00:00"
    // sparkSession.read.
    //   format("hudi").
    //   option("as.of.instant", "2021-07-28").
    //   load(basePath)

    tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
    sparkSession
      .sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0")
      .show()
  }
}

Update data

// Package taken from the --class argument used in the submit section
package com.mym.spark.hudi

import org.apache.hudi.QuickstartUtils._
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

object UpdateDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val tableName = "hudi_trips_cow"
    val basePath = "hdfs://m1:8020/tmp/hudi_trips_cow"
    val dataGen = new DataGenerator

    // Generate updates and write them in Append mode
    val updates = convertToStringList(dataGen.generateUpdates(10))
    val df = sparkSession.read.json(sparkSession.sparkContext.parallelize(updates, 2))
    df.write.format("hudi").
      options(getQuickstartWriteConfigs).
      option(PRECOMBINE_FIELD.key(), "ts").
      option(RECORDKEY_FIELD.key(), "uuid").
      option(PARTITIONPATH_FIELD.key(), "partitionpath").
      option(TBL_NAME.key(), tableName).
      mode(Append).
      save(basePath)

    // Optional: reload the table and query it to see the updated rows
    // val tripsSnapshotDF = sparkSession.
    //   read.
    //   format("hudi").
    //   load(basePath)
    // tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
    //
    // sparkSession
    //   .sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0")
    //   .show()
  }
}

Point-in-time query

// Package taken from the --class argument used in the submit section
package com.mym.spark.hudi

import org.apache.hudi.DataSourceReadOptions._
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object PointInTimeQueryDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val basePath = "hdfs://m1:8020/tmp/hudi_trips_cow"
    import sparkSession.implicits._

    // Register the snapshot view first; the commit query below reads from it
    sparkSession.read.format("hudi").load(basePath).
      createOrReplaceTempView("hudi_trips_snapshot")

    val commits = sparkSession.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
    // "000" stands for the earliest commit time; endTime is the second-to-last commit
    val beginTime = "000"
    val endTime = commits(commits.length - 2)

    val tripsIncrementalDF = sparkSession.read.format("hudi").
      option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL).
      option(BEGIN_INSTANTTIME.key(), beginTime).
      option(END_INSTANTTIME.key(), endTime).
      load(basePath)
    tripsIncrementalDF.createOrReplaceTempView("hudi_trips_point_in_time")

    sparkSession.
      sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").
      show()
  }
}

Incremental query

// Package taken from the --class argument used in the submit section
package com.mym.spark.hudi

import org.apache.hudi.DataSourceReadOptions._
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object IncrementalQueryDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val basePath = "hdfs://m1:8020/tmp/hudi_trips_cow"
    import sparkSession.implicits._

    // Register the snapshot view first; the commit query below reads from it
    sparkSession.read.format("hudi").load(basePath).
      createOrReplaceTempView("hudi_trips_snapshot")

    val commits = sparkSession.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
    // Start from the second-to-last commit
    val beginTime = commits(commits.length - 2)

    val tripsIncrementalDF = sparkSession.read.format("hudi").
      option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL).
      option(BEGIN_INSTANTTIME.key(), beginTime).
      load(basePath)
    tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")

    sparkSession.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
  }
}

Delete data

// Package taken from the --class argument used in the submit section
package com.mym.spark.hudi

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql._
import scala.collection.JavaConversions._

object DeleteDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val tableName = "hudi_trips_cow"
    val basePath = "hdfs://m1:8020/tmp/hudi_trips_cow"
    val dataGen = new DataGenerator

    sparkSession.
      read.
      format("hudi").
      load(basePath).
      createOrReplaceTempView("hudi_trips_snapshot")
    // Total row count before the delete
    sparkSession.sql("select uuid, partitionpath from hudi_trips_snapshot").count()

    // Pick 2 records and build the delete DataFrame from their keys
    val ds = sparkSession.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
    val deletes = dataGen.generateDeletes(ds.collectAsList())
    val df = sparkSession.read.json(sparkSession.sparkContext.parallelize(deletes, 2))
    df.write.format("hudi").
      options(getQuickstartWriteConfigs).
      option(OPERATION.key(), "delete").
      option(PRECOMBINE_FIELD.key(), "ts").
      option(RECORDKEY_FIELD.key(), "uuid").
      option(PARTITIONPATH_FIELD.key(), "partitionpath").
      option(TBL_NAME.key(), tableName).
      mode(Append).
      save(basePath)

    val roAfterDeleteViewDF = sparkSession.
      read.
      format("hudi").
      load(basePath)
    roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
    // The total row count should be 2 fewer than before
    sparkSession.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
  }
}

Overwrite data

// Package taken from the --class argument used in the submit section
package com.mym.spark.hudi

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql._
import scala.collection.JavaConversions._

object InsertOverwriteDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val sparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val tableName = "hudi_trips_cow"
    val basePath = "hdfs://m1:8020/tmp/hudi_trips_cow"
    val dataGen = new DataGenerator

    // Show the keys currently in the table
    sparkSession.
      read.format("hudi").
      load(basePath).
      select("uuid", "partitionpath").
      sort("partitionpath", "uuid").
      show(100, false)

    // Generate new trips and keep only one partition
    val inserts = convertToStringList(dataGen.generateInserts(10))
    val df = sparkSession.read.json(sparkSession.sparkContext.parallelize(inserts, 2)).
      filter("partitionpath = 'americas/united_states/san_francisco'")

    // Overwrite that partition
    df.write.format("hudi").
      options(getQuickstartWriteConfigs).
      option(OPERATION.key(), "insert_overwrite").
      option(PRECOMBINE_FIELD.key(), "ts").
      option(RECORDKEY_FIELD.key(), "uuid").
      option(PARTITIONPATH_FIELD.key(), "partitionpath").
      option(TBL_NAME.key(), tableName).
      mode(Append).
      save(basePath)

    // Show the keys again: the overwritten partition has changed
    sparkSession.
      read.format("hudi").
      load(basePath).
      select("uuid", "partitionpath").
      sort("partitionpath", "uuid").
      show(100, false)
  }
}

Submit and run

Build the jar with mvn package, upload it to /opt/jars/spark, and run the submit commands:

bin/spark-submit \
--class com.mym.spark.hudi.InsertDemo \
/opt/jars/spark/spark_hudi-1.0-SNAPSHOT-jar-with-dependencies.jar
bin/spark-submit \
--class com.mym.spark.hudi.QueryDemo \
/opt/jars/spark/spark_hudi-1.0-SNAPSHOT-jar-with-dependencies.jar

Result (the commands above run the insert first and then the query):

+------------------+-------------------+-------------------+-------------+
|              fare|          begin_lon|          begin_lat|           ts|
+------------------+-------------------+-------------------+-------------+
| 93.56018115236618|0.14285051259466197|0.21624150367601136|1673893159737|
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1673323605737|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1673741221969|
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1673398228760|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|1673572403185|
|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|1673688335445|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1673848647775|
| 41.06290929046368| 0.8192868687714224|  0.651058505660742|1673523482687|
+------------------+-------------------+-------------------+-------------+


Category: Hudi
Published: 2023-12-07 01:15:04
Article link: https://www.kaopuke.com/article/k-p-k_13_u_23_o_6_fz_13_jk5.html
