Overview
Setting up and testing an HDFS + Spark + Hudi environment
As part of a detailed evaluation of Hudi, this article builds a Spark + Hudi environment from scratch and walks through basic usage.
1. Prerequisites
1) The environment is installed on Linux; the OS is Ubuntu 22.04 LTS.
2) Ubuntu's package sources are pointed at the Tsinghua mirror.
3) A JDK is already installed (currently 1.8, i.e. 1.8.0_333); a hedged install sketch follows.
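If a JDK still needs to be installed, one possible route on Ubuntu 22.04 is the OpenJDK 8 package (the original setup uses 1.8.0_333, but any JDK 8 should behave the same; the JAVA_HOME path below is an assumption that may differ on your machine):
sudo apt-get install openjdk-8-jdk
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # adjust to the actual install path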
2. Configure passwordless SSH login (required by the Hadoop scripts)
2.1 Install the openssh server
sudo apt-get install openssh-server
sudo service ssh restart
2.2 Passwordless login to localhost
ssh-keygen -t rsa
ssh-copy-id -i id_rsa <your user name>@localhost
ssh localhost
3. Hadoop installation
We install a single-node pseudo-distributed Hadoop; the version choice is tied to the Spark version chosen later.
For example: Hadoop 3.2.3
Hudi currently supports Spark 3.2, so Spark 3.2 (and a matching Hadoop build) is used here.
3.1 Download and extract
1) Download the Hadoop binary package from https://hadoop.apache.org/releases.html
2) Extract it to whatever installation directory is convenient, for example as sketched below.
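A minimal sketch (the archive mirror URL and the /opt target directory are assumptions; use any mirror and directory you prefer):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar -xzf hadoop-3.2.3.tar.gz -C /opt   # or any directory you have write access to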
3.2 Configure Hadoop and run HDFS
1) Add environment variables to ~/.profile:
vi ~/.profile
export HADOOP_HOME=<YOUR_HADOOP_DECOMPRESSD_PATH>
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"
source ~/.profile
Add two environment variable settings to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=<YOUR_JAVA_HOME_PATH>
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"
Oddly, the system-wide JAVA_HOME did not take effect here and has to be set again in hadoop-env.sh; otherwise starting DFS later fails with an error.
Run hadoop version to check that the command works:
$ hadoop version
Hadoop 3.2.3
Source code repository https://github.com/apache/hadoop -r abe5358143720085498613d399be3bbf01e0f131
Compiled by ubuntu on 2022-03-20T01:18Z
Compiled with protoc 2.5.0
From source with checksum 39bb14faec14b3aa25388a6d7c345fe8
This command was run using /<your path>/hadoop-3.2.3/share/hadoop/common/hadoop-common-3.2.3.jar
2) Create a few local directories for Hadoop's storage (one way to create them is sketched after the listing):
$ tree dfs
dfs
├── data
├── name
└── temp
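For example, assuming /opt/dfs as the base directory (matching the paths used in the configuration below):
mkdir -p /opt/dfs/data /opt/dfs/name /opt/dfs/temp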
3) Modify the configuration files
$HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/opt/dfs/temp</value>
<---- your own temporary directory
<description>Abase for other temporary directories.</description>
</property>
</configuration>
$HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<!-- local storage location of the HDFS NameNode -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/dfs/name</value>
<---- your own local NameNode storage directory
</property>
<!-- local storage location of the HDFS DataNode -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/dfs/data</value>
<---- your own local HDFS data directory
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
4) Format the NameNode
hdfs namenode -format
$ tree
.
├── data
├── name
│   └── current
│       ├── fsimage_0000000000000000000
│       ├── fsimage_0000000000000000000.md5
│       ├── seen_txid
│       └── VERSION
└── temp

4 directories, 4 files
5) Start Hadoop
start-all.sh
$ jps
13392 Jps
12363 NameNode
12729 SecondaryNameNode
12526 DataNode
12931 ResourceManager
13077 NodeManager
http://localhost:9870
Open this URL in a browser to see the HDFS web UI.
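As a quick, hedged sanity check that HDFS is up and writable (the directory name here is arbitrary):
hdfs dfs -mkdir -p /tmp/hdfs_smoke_test
hdfs dfs -ls /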
4. Hive installation
Spark SQL needs the Hive Metastore (HMS), so Hive has to be installed.
1) Download and extract
Download the latest release, currently 3.1.3, from http://hive.apache.org/downloads.html.
Extract it and set the HIVE_HOME environment variable.
vi ~/.profile
export HIVE_HOME=/<YOUR PATH>/apache-hive-3.1.3-bin
export PATH=$HIVE_HOME/bin:$PATH
2) Configure Hive. By default it connects to Hadoop and uses Derby for metadata; change it to use MySQL as the metadata store.
The default Hive configuration files live in $HIVE_HOME/conf and end in .template; copy them to the names that are actually read:
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
cp hive-env.sh.template hive-env.sh
cp hive-log4j2.properties.template hive-log4j2.properties
Then edit them.
Add to hive-site.xml:
<property>
<name>system:java.io.tmpdir</name>
<value>/<YOUR_PATH>/setup/hive</value>
<--- create a temporary directory
<description></description>
</property>
<property>
<name>system:user.name</name>
<value><YOUR_NAME></value>
<--- the user name used to access HDFS
<description></description>
</property>
If Hive fails to run, copy guava-27.0-jre.jar from Hadoop into $HIVE_HOME/lib and delete guava-19.0.jar there, as sketched below.
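A hedged sketch of the guava swap (the exact guava version under share/hadoop/common/lib can differ between Hadoop builds, so check the file name first):
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
rm $HIVE_HOME/lib/guava-19.0.jar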
In addition, line 3227 of hive-site.xml contains an invalid character that has to be deleted.
Changes to hive-env.sh:
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/<YOUR_PATH>/hadoop-3.2.3
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/<YOUR_PATH>/apache-hive-3.1.3-bin/conf
Also modify Hadoop's core-site.xml, replacing <your name> below with the user name that runs Hadoop:
<property>
<name>hadoop.proxyuser.<your name>.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.<your name>.groups</name>
<value>*</value>
</property>
Install MySQL and create the database used by the Hive Metastore (a minimal install sketch follows the SQL):
- Install MySQL (MariaDB server also works).
- Create the user and database and grant privileges:
create user 'hive' identified by '<#YOUR_PASSWORD>';
create database hive;
grant all privileges on hive.* to 'hive'@'localhost' identified by '<#YOUR_PASSWORD>';
flush privileges;
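For reference, one possible (hedged) way to get a MySQL server on Ubuntu and open a root client for the statements above:
sudo apt-get install mysql-server
sudo mysql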
Point Hive's metastore storage at MySQL in hive-site.xml:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1/hive?createDatabaseIfNotExist=true</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value><#YOUR_PASSWORD></value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
<description>Whether to print the names of the columns in query output.</description>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
<description>Whether to include the current database in the Hive prompt.</description>
</property>
Copy the JDBC driver mysql-connector-java-8.0.27.jar into $HIVE_HOME/lib.
Initialize the Hive metastore schema in MySQL:
schematool -initSchema -dbType mysql
3) Verify that Hive runs.
Start Hadoop:
cd $HADOOP_HOME/sbin
$ ./hadoop-daemon.sh start datanode
$ ./hadoop-daemon.sh start namenode
$ yarn-daemon.sh start nodemanager
$ yarn-daemon.sh start resourcemanager
Start Hive:
cd $HIVE_HOME/bin
./hiveserver2
Check the logs to make sure it started successfully.
Start the Hive client, beeline:
./beeline
!connect jdbc:hive2://127.0.0.1:10000
If the connection succeeds, you can run basic Hive DDL and queries (a small smoke test is sketched below).
If there are errors, check the permissions of Hadoop's files on disk.
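A hedged smoke test run non-interactively through beeline (the table name and values are made up for illustration):
beeline -u jdbc:hive2://127.0.0.1:10000 \
  -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT, name STRING)" \
  -e "INSERT INTO smoke_test VALUES (1, 'hello')" \
  -e "SELECT * FROM smoke_test"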
5. Spark installation
5.1 Install Scala
Download, extract, and set environment variables.
When downloading, make sure the Scala version matches Spark: Spark 3.2+ is pre-built with Scala 2.12.
https://www.scala-lang.org/download/scala2.html
vi ~/.profile
export SCALA_HOME=/home/redstar/setup/scala-2.12.16
export PATH=$SCALA_HOME/bin:$PATH
source ~/.profile
Run scala to check that the installation works:
$ scala
Welcome to Scala 2.12.16 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_333).
Type in expressions for evaluation. Or try :help.
scala>
5.2 Download and configure Spark
1) Download from http://spark.apache.org/downloads.html
Check the supported Hadoop version: for example, spark-3.2.1-bin-hadoop3.2.tgz means it is built for Hadoop 3.2 and later.
2) Add environment variables to ~/.profile:
vi ~/.profile
export SPARK_HOME=/<YOUR_SPARK_PATH>/spark-3.2.1-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH
source ~/.profile
3) Modify the Spark configuration
The distribution ships $SPARK_HOME/conf/spark-env.sh.template; rename it to spark-env.sh and set:
JAVA_HOME=<YOUR_JAVA_HOME>
SCALA_HOME=<YOUR_SCALA_HOME>
HADOOP_CONF_DIR=/<YOUR_HADOOP_HOME>/etc/hadoop
SPARK_MASTER_HOST=localhost
SPARK_WORKER_MEMORY=4g
4) Start Spark
$ cd $SPARK_HOME/sbin
$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /<your path>/spark-3.2.1-bin-hadoop3.2/logs/spark-xxxxxxx-org.apache.spark.deploy.master.Master-1-xxxxxxx-Precision-5520.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /<your path>/spark-3.2.1-bin-hadoop3.2/logs/spark-xxxxxxx-org.apache.spark.deploy.worker.Worker-1-xxxxxxx-Precision-5520.out
$ jps
12931 ResourceManager
13077 NodeManager
12729 SecondaryNameNode
12363 NameNode
14173 Jps
12526 DataNode
13974 Master
14101 Worker
http://localhost:8080
Open this URL in a browser to see the Spark web UI.
5) Run the bundled Pi example to check that the installation works:
$ spark-submit --class org.apache.spark.examples.SparkPi --master spark://localhost:7077 --num-executors 3 --driver-memory 1g --executor-memory 1g --executor-cores 1 ../examples/jars/spark-examples_2.12-3.2.1.jar 10
...
...
22/07/10 00:05:02 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 4.756610 s
Pi is roughly 3.143159143159143
22/07/10 00:05:02 INFO SparkUI: Stopped Spark web UI at http://10.0.0.13:4040
22/07/10 00:05:02 INFO StandaloneSchedulerBackend: Shutting down all executors
22/07/10 00:05:02 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
22/07/10 00:05:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/07/10 00:05:02 INFO MemoryStore: MemoryStore cleared
22/07/10 00:05:02 INFO BlockManager: BlockManager stopped
22/07/10 00:05:02 INFO BlockManagerMaster: BlockManagerMaster stopped
22/07/10 00:05:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/07/10 00:05:02 INFO SparkContext: Successfully stopped SparkContext
22/07/10 00:05:02 INFO ShutdownHookManager: Shutdown hook called
22/07/10 00:05:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-4a66d2a4-b0c3-4b0c-b9de-ca1ec61f745b
22/07/10 00:05:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-328e3ea2-5843-454d-9749-3af87223ef6a
6) Run a simple count in spark-shell:
hdfs dfs -put /<YOUR_PATH>/hadoop-3.2.3/README.txt /test_data
hdfs dfs -ls /test_data
spark-shell --master local[4]
scala> val datasRDD = sc.textFile("/test_data/README.txt")
scala> datasRDD.count
scala> datasRDD.first
scala> :quit
Copy mysql-connector-java-8.0.27.jar into the $SPARK_HOME/jars directory.
Start Spark SQL:
# Start the HMS first
hive --service metastore
cd $SPARK_HOME/bin
spark-sql
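If spark-sql starts but cannot see the databases and tables created through Hive, a common fix (a hedged suggestion, not something the original setup states explicitly) is to make the Hive configuration visible to Spark and retry:
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
spark-sql -e "show databases"   # should now list the metastore databases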
6. Hudi installation
1) Download, extract, and build
https://hudi.apache.org/releases/download
Download the latest source release, currently 0.11.1.
Extract it, enter the directory, and build:
$ mvn clean package -DskipTests -Dspark3.2 -Dscala-2.12
...
...
[INFO] hudi-examples-common ............................... SUCCESS [  2.279 s]
[INFO] hudi-examples-spark ................................ SUCCESS [  6.174 s]
[INFO] hudi-flink-datasource .............................. SUCCESS [  0.037 s]
[INFO] hudi-flink1.14.x ................................... SUCCESS [  0.143 s]
[INFO] hudi-flink ......................................... SUCCESS [  3.311 s]
[INFO] hudi-examples-flink ................................ SUCCESS [  1.859 s]
[INFO] hudi-examples-java ................................. SUCCESS [  2.403 s]
[INFO] hudi-flink1.13.x ................................... SUCCESS [  0.338 s]
[INFO] hudi-kafka-connect ................................. SUCCESS [  2.154 s]
[INFO] hudi-flink1.14-bundle_2.12 ......................... SUCCESS [ 23.147 s]
[INFO] hudi-kafka-connect-bundle .......................... SUCCESS [ 27.037 s]
[INFO] hudi-spark2_2.12 ................................... SUCCESS [ 12.065 s]
[INFO] hudi-spark2-common ................................. SUCCESS [  0.061 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  07:08 min
[INFO] Finished at: 2022-07-10T13:22:24+08:00
[INFO] ------------------------------------------------------------------------
Copy the built bundle jar into the spark-3.2.1-bin-hadoop3.2/jars directory (a hedged path example follows).
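The exact bundle location depends on the build profiles; in a Hudi 0.11.1 source tree built as above it is typically found under packaging/ (verify the file name in your own build output before copying):
cp packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.11.1.jar $SPARK_HOME/jars/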
Start the Spark SQL client:
spark-sql \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
Some SQL tests:
create table test_t1_hudi_cow (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh);
insert into test_t1_hudi_cow select 1, 'a0', 1000, '2021-12-09', '10';
select * from test_t1_hudi_cow;
-- record id=1 changes `name`
insert into test_t1_hudi_cow select 1, 'a1', 1001, '2021-12-09', '10';
select * from test_t1_hudi_cow;
-- time travel based on first commit time, assume `20220307091628793`
select * from test_t1_hudi_cow timestamp as of '20220307091628793' where id = 1;
-- time travel based on different timestamp formats
select * from test_t1_hudi_cow timestamp as of '2022-03-07 09:16:28.100' where id = 1;
select * from test_t1_hudi_cow timestamp as of '2022-03-08' where id = 1;
Now let's look at what data Hudi generates on disk:
#------------------------------------------------------
create table test_t2_hudi_cow (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh);
#------------------------------------------------------
test_t2_hudi_cow$ tree -a
.
└── .hoodie
    ├── archived
    ├── .aux
    │   └── .bootstrap
    │       ├── .fileids
    │       └── .partitions
    ├── hoodie.properties
    ├── .hoodie.properties.crc
    ├── .schema
    └── .temp
#------------------------------------------------------
insert into test_t2_hudi_cow select 1, 'a0', 1000, '2021-12-09', '10';
test_t2_hudi_cow$ tree -a
.
├── dt=2021-12-09
│   └── hh=10
│       ├── 6f3c5a3b-e562-4398-b01d-223d85165193-0_0-140-2521_20220710225229810.parquet
│       ├── .6f3c5a3b-e562-4398-b01d-223d85165193-0_0-140-2521_20220710225229810.parquet.crc
│       ├── .hoodie_partition_metadata
│       └── ..hoodie_partition_metadata.crc
└── .hoodie
    ├── 20220710225229810.commit
    ├── .20220710225229810.commit.crc
    ├── 20220710225229810.commit.requested
    ├── .20220710225229810.commit.requested.crc
    ├── 20220710225229810.inflight
    ├── .20220710225229810.inflight.crc
    ├── archived
    ├── .aux
    │   └── .bootstrap
    │       ├── .fileids
    │       └── .partitions
    ├── hoodie.properties
    ├── .hoodie.properties.crc
    ├── metadata
    │   ├── files
    │   │   ├── .files-0000_00000000000000.log.1_0-0-0
    │   │   ├── ..files-0000_00000000000000.log.1_0-0-0.crc
    │   │   ├── .files-0000_00000000000000.log.1_0-119-1312
    │   │   ├── ..files-0000_00000000000000.log.1_0-119-1312.crc
    │   │   ├── .files-0000_00000000000000.log.2_0-150-2527
    │   │   ├── ..files-0000_00000000000000.log.2_0-150-2527.crc
    │   │   ├── .hoodie_partition_metadata
    │   │   └── ..hoodie_partition_metadata.crc
    │   └── .hoodie
    │       ├── 00000000000000.deltacommit
    │       ├── .00000000000000.deltacommit.crc
    │       ├── 00000000000000.deltacommit.inflight
    │       ├── .00000000000000.deltacommit.inflight.crc
    │       ├── 00000000000000.deltacommit.requested
    │       ├── .00000000000000.deltacommit.requested.crc
    │       ├── 20220710225229810.deltacommit
    │       ├── .20220710225229810.deltacommit.crc
    │       ├── 20220710225229810.deltacommit.inflight
    │       ├── .20220710225229810.deltacommit.inflight.crc
    │       ├── 20220710225229810.deltacommit.requested
    │       ├── .20220710225229810.deltacommit.requested.crc
    │       ├── archived
    │       ├── .aux
    │       │   └── .bootstrap
    │       │       ├── .fileids
    │       │       └── .partitions
    │       ├── .heartbeat
    │       ├── hoodie.properties
    │       ├── .hoodie.properties.crc
    │       ├── .schema
    │       └── .temp
    ├── .schema
    └── .temp
#------------------------------------------------------
insert into test_t2_hudi_cow select 1, 'a1', 1001, '2021-12-09', '10';
select * from test_t2_hudi_cow;
test_t2_hudi_cow$ tree -a
.
├── dt=2021-12-09
│   └── hh=10
│       ├── 6f3c5a3b-e562-4398-b01d-223d85165193-0_0-140-2521_20220710225229810.parquet
│       ├── .6f3c5a3b-e562-4398-b01d-223d85165193-0_0-140-2521_20220710225229810.parquet.crc
│       ├── 6f3c5a3b-e562-4398-b01d-223d85165193-0_0-178-3798_20220710230154422.parquet
│       ├── .6f3c5a3b-e562-4398-b01d-223d85165193-0_0-178-3798_20220710230154422.parquet.crc
│       ├── .hoodie_partition_metadata
│       └── ..hoodie_partition_metadata.crc
├── .hoodie
│   ├── 20220710225229810.commit
│   ├── .20220710225229810.commit.crc
│   ├── 20220710225229810.commit.requested
│   ├── .20220710225229810.commit.requested.crc
│   ├── 20220710225229810.inflight
│   ├── .20220710225229810.inflight.crc
│   ├── 20220710230154422.commit
│   ├── .20220710230154422.commit.crc
│   ├── 20220710230154422.commit.requested
│   ├── .20220710230154422.commit.requested.crc
│   ├── 20220710230154422.inflight
│   ├── .20220710230154422.inflight.crc
│   ├── archived
│   ├── .aux
│   │   └── .bootstrap
│   │       ├── .fileids
│   │       └── .partitions
│   ├── hoodie.properties
│   ├── .hoodie.properties.crc
│   ├── metadata
│   │   ├── files
│   │   │   ├── .files-0000_00000000000000.log.1_0-0-0
│   │   │   ├── ..files-0000_00000000000000.log.1_0-0-0.crc
│   │   │   ├── .files-0000_00000000000000.log.1_0-119-1312
│   │   │   ├── ..files-0000_00000000000000.log.1_0-119-1312.crc
│   │   │   ├── .files-0000_00000000000000.log.2_0-150-2527
│   │   │   ├── ..files-0000_00000000000000.log.2_0-150-2527.crc
│   │   │   ├── .files-0000_00000000000000.log.3_0-188-3804
│   │   │   ├── ..files-0000_00000000000000.log.3_0-188-3804.crc
│   │   │   ├── .hoodie_partition_metadata
│   │   │   └── ..hoodie_partition_metadata.crc
│   │   └── .hoodie
│   │       ├── 00000000000000.deltacommit
│   │       ├── .00000000000000.deltacommit.crc
│   │       ├── 00000000000000.deltacommit.inflight
│   │       ├── .00000000000000.deltacommit.inflight.crc
│   │       ├── 00000000000000.deltacommit.requested
│   │       ├── .00000000000000.deltacommit.requested.crc
│   │       ├── 20220710225229810.deltacommit
│   │       ├── .20220710225229810.deltacommit.crc
│   │       ├── 20220710225229810.deltacommit.inflight
│   │       ├── .20220710225229810.deltacommit.inflight.crc
│   │       ├── 20220710225229810.deltacommit.requested
│   │       ├── .20220710225229810.deltacommit.requested.crc
│   │       ├── 20220710230154422.deltacommit
│   │       ├── .20220710230154422.deltacommit.crc
│   │       ├── 20220710230154422.deltacommit.inflight
│   │       ├── .20220710230154422.deltacommit.inflight.crc
│   │       ├── 20220710230154422.deltacommit.requested
│   │       ├── .20220710230154422.deltacommit.requested.crc
│   │       ├── archived
│   │       ├── .aux
│   │       │   └── .bootstrap
│   │       │       ├── .fileids
│   │       │       └── .partitions
│   │       ├── .heartbeat
│   │       ├── hoodie.properties
│   │       ├── .hoodie.properties.crc
│   │       ├── .schema
│   │       └── .temp
│   ├── .schema
│   └── .temp
└── .idea
    ├── .gitignore
    ├── misc.xml
    ├── modules.xml
    ├── runConfigurations.xml
    ├── test_t2_hudi_cow.iml
    ├── vcs.xml
    └── workspace.xml