Sqoop Tutorial


This article, by 靠谱客 blogger 不安羊, is a tutorial on using Sqoop, shared here as a reference.

Installing Sqoop (a working Hadoop environment is required)

1. Upload the package sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz and extract it

# tar -zxvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/

2. Install and configure

2.1 Add Sqoop to the environment variables in /etc/profile

export SQOOP_HOME=/opt/sqoop-1.4.4.bin__hadoop-2.0.4-alpha

export PATH=$PATH:$SQOOP_HOME/bin

2.2 Apply the configuration

source /etc/profile

3. Copy the database JDBC driver into $SQOOP_HOME/lib
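Before running any import, it can help to confirm that the driver jar actually landed in the lib directory. The sketch below is only an illustration: the function name has_mysql_driver is made up for this example, and the file pattern mysql-connector-*.jar assumes the usual MySQL Connector/J naming, which may differ for your driver.

```shell
# Hypothetical helper: succeeds if the given directory contains a jar
# whose name looks like a MySQL Connector/J driver (assumed naming pattern).
has_mysql_driver() {
  for jar in "$1"/mysql-connector-*.jar; do
    # If nothing matches, the pattern stays literal and -f fails.
    [ -f "$jar" ] && return 0
  done
  return 1
}

# Fall back to the install path used in this tutorial if SQOOP_HOME is unset.
lib_dir="${SQOOP_HOME:-/opt/sqoop-1.4.4.bin__hadoop-2.0.4-alpha}/lib"
if has_mysql_driver "$lib_dir"; then
  echo "MySQL driver found in $lib_dir"
else
  echo "No MySQL driver jar found in $lib_dir"
fi
```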

Using Sqoop

Type 1

Importing data from a database into HDFS

# sqoop import --connect jdbc:mysql://192.168.0.104:3306/mysql --username root --password 12334234 --table help_category --target-dir '/sqoop/td'

Specify the output path and the field delimiter

# sqoop import --connect jdbc:mysql://192.168.0.104:3306/mysql --username root --password 123456 --table help_category --target-dir '/sqoop/td' --fields-terminated-by '\t'

Specify the number of map tasks with -m

# sqoop import --connect jdbc:mysql://192.168.0.104:3306/mysql --username root --password 123456 --table help_category --target-dir '/sqoop/td1' --fields-terminated-by '\t' -m 1

Add a where condition. Note: the condition must be enclosed in quotes.

# sqoop import --connect jdbc:mysql://192.168.0.104:3306/mysql --username root --password 123456 --table help_category --where 'help_category_id>3' --target-dir '/sqoop/td2' --fields-terminated-by '\t' -m 1
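The reason for the quotes is the shell, not Sqoop: an unquoted > is treated as an output redirection. The small sketch below shows the effect in a scratch directory, using printf as a stand-in for the sqoop command; the variable names are illustrative only.

```shell
# Work in a throwaway directory so the stray file is easy to spot.
demo_dir=$(mktemp -d)

# Unquoted: the shell parses ">3" as "redirect stdout to a file named 3",
# so the program only receives "help_category_id" as its argument.
( cd "$demo_dir" && printf '%s\n' --where help_category_id>3 )
ls "$demo_dir"   # a stray file named "3" now exists

# Quoted: the whole condition reaches the program as a single argument.
quoted_arg=$(printf '%s' 'help_category_id>3')
echo "$quoted_arg"
```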

Use a query statement (use \ to continue the command on the next line)

sqoop import --connect jdbc:mysql://192.168.0.104:3306/mysql --username root --password 123456 \
--query 'SELECT * FROM help_category where help_category_id > 2 AND $CONDITIONS' --split-by help_category.help_category_id --target-dir '/sqoop/td3'

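Note: when using --query, pay attention to the where clause: the AND $CONDITIONS token must be included.

Single and double quotes also behave differently. If the --query value is wrapped in double quotes, put a \ before $CONDITIONS, that is, \$CONDITIONS.

If the number of map tasks is 1 (-m 1), --split-by ${tablename.column} is not needed; otherwise it must be given.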
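The quoting rule comes from how the shell expands variables inside double quotes. The following sketch compares the three spellings by assigning them to shell variables instead of calling sqoop; the variable names are illustrative, and CONDITIONS is set to an empty string only to make the unescaped case deterministic.

```shell
# Single quotes: the shell passes $CONDITIONS through literally.
single_quoted='SELECT * FROM help_category WHERE help_category_id > 2 AND $CONDITIONS'

# Double quotes with a backslash: also literal, same text as above.
double_escaped="SELECT * FROM help_category WHERE help_category_id > 2 AND \$CONDITIONS"

# Double quotes without a backslash: the shell substitutes its own
# variable (empty here), so Sqoop would never see the placeholder.
CONDITIONS=''
double_unescaped="SELECT * FROM help_category WHERE help_category_id > 2 AND $CONDITIONS"

printf '%s\n' "$single_quoted" "$double_escaped" "$double_unescaped"
```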

Importing selected columns

sqoop import --connect jdbc:mysql://192.168.0.104:3306/taxbook --username root --password 1231231 --columns "id,username,emall,mobile,emp_number,status,yxbz,online,userlevel,company_id,create_time,modify_time,create_ry,modify_ry,agent_no" --table sys_user --fields-terminated-by "\t" --lines-terminated-by "\n" --split-by id --delete-target-dir --target-dir hdfs://zsCluster/user/sqoop/ -m 2

--columns: the columns of the MySQL table to import

Type 2

1) Exporting data from HDFS into a database

sqoop export --connect jdbc:mysql://192.168.0.104:3306/mytest --username root --password 123455 --export-dir '/td3' --table td_bak -m 1 --fields-terminated-by '\t'

2) Exporting Hive data into a database

sqoop export --connect jdbc:mysql://192.168.0.104:3306/bigdata --username root --password 234144 --table user_tb_summary --fields-terminated-by '\001' --update-key date_str --update-mode allowinsert --export-dir /warehouse/tablespace/managed/hive/taxbook1.db/user_summary/delta_0000*

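Parameter explanations:

--update-key: the update key, that is, the column used to match rows for updating, for example id. Several columns can be given, separated by commas.

--update-mode: there are two modes:

1) allowinsert: if the column given in update-key is a primary key, rows that already exist in MySQL are updated and rows that do not exist are inserted; if it is not a primary key, rows are inserted;

2) updateonly: rows are only updated, never inserted;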
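The Hive export command uses '\001' as the field delimiter because Hive's default text format separates fields with the Ctrl-A character (octal 001), which is invisible in a terminal. The sketch below builds one sample row and makes the delimiter visible; the row contents and variable names are made up for illustration.

```shell
# One sample row as Hive's default text format would store it:
# three fields separated by Ctrl-A (octal 001).
row=$(printf '2021-12-20\001alice\001active')

# Replace the invisible delimiter with | so the fields can be seen.
visible=$(printf '%s' "$row" | tr '\001' '|')
echo "$visible"

# Extract the second field using Ctrl-A as the cut delimiter.
second=$(printf '%s' "$row" | cut -d "$(printf '\001')" -f 2)
echo "$second"
```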

Note: the tests above require MySQL to be configured for remote connections

GRANT ALL PRIVILEGES ON mytest.* TO 'root'@'192.168.0.104' IDENTIFIED BY '123446' WITH GRANT OPTION;
FLUSH PRIVILEGES;
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
FLUSH PRIVILEGES;

Type 3: Importing MySQL data into Hive (both methods below create the Hive table automatically)

1) Importing all columns

sqoop import \
--connect jdbc:mysql://192.168.0.100:3306/taxbook \
--username root \
--password 213445 \
--table sys_user \
--fields-terminated-by "\001" \
--hive-import \
--create-hive-table \
--hive-database taxbook2 \
--hive-table user_info \
--delete-target-dir \
--hive-drop-import-delims \
-m 2 \
--hive-partition-key dt \
--hive-partition-value 2021-12-20

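Note: the form --hive-table taxbook2.user_info cannot be used here. The Hive database and table must be given separately with --hive-database and --hive-table; otherwise the following error is reported: Table or database name may not contain dot(.) character 'taxbook2.user_info' (state=42000,code=40000)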
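If a script already carries the name in db.table form, one possible workaround is to split it in the shell before building the Sqoop arguments. This is a minimal sketch using POSIX parameter expansion; the variable names are illustrative and nothing here calls sqoop itself.

```shell
# A qualified name as it might arrive from another script or config file.
qualified='taxbook2.user_info'

# Everything before the first dot is the database,
# everything after the first dot is the table.
hive_db=${qualified%%.*}
hive_table=${qualified#*.}

# The two pieces can then be passed as separate options.
hive_args="--hive-database $hive_db --hive-table $hive_table"
printf '%s\n' "$hive_args"
```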

2) Importing selected columns

sqoop import --connect jdbc:mysql://192.168.0.104:3306/taxbook --username root --password 123456 --query "select id,username,emall,mobile,emp_number,status,yxbz,online,userlevel,company_id,create_time,modify_time,create_ry,modify_ry,agent_no from sys_user WHERE \$CONDITIONS" --fields-terminated-by "\001" --lines-terminated-by "\n" --split-by id --hive-import --hive-overwrite --create-hive-table --delete-target-dir --target-dir /user/sqoop/ --hive-drop-import-delims --hive-database taxbook2 --hive-table user_info -m 2

Or

sqoop import --connect jdbc:mysql://192.168.0.104:3306/taxbook --username root --password 123456 --columns "id,username,emall,mobile,emp_number,status,yxbz,online,userlevel,company_id,create_time,modify_time,create_ry,modify_ry,agent_no" --table sys_user --fields-terminated-by "\001" --split-by id --delete-target-dir --target-dir /user/sqoop/user_info/ --hive-drop-import-delims --hive-import --hive-overwrite --create-hive-table --hive-database taxbook2 --hive-table user_info -m 1

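Parameter explanations:

--table: the MySQL table to import

--fields-terminated-by: the character that separates fields (a comma by default)

--lines-terminated-by: the line separator

--split-by: the column used to split the work. Suppose there is a table test and the sqoop command uses --split-by 'id' with -m 10. Sqoop first sends a query to the relational database (MySQL, for example): select max(id),min(id) from test. It then divides the interval between min and max evenly into 10 ranges, and 10 parallel map tasks each query the database for one range. --split-by does not work well with non-numeric columns, so it is generally used on primary keys and numeric columns.

--hive-import: import into Hive

--hive-overwrite: overwrite the existing data in the Hive table

--create-hive-table: if this parameter is set, the job fails when the target Hive table already exists; the default is false (do not be misled by the name: the Hive table is created automatically even without this parameter)

--delete-target-dir: delete the target directory if it already exists

--target-dir: the target directory on HDFS

--hive-table: the target Hive table; give the database separately with --hive-database, because the dotted form target_db.target_table triggers the error described above

-m (--num-mappers): the number of map tasks used for the parallel import

--hive-drop-import-delims: drops special characters such as \n, \r, and \01 from the data fetched from MySQL

--hive-partition-key: the partition column

--hive-partition-value: the partition value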
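The range splitting described for --split-by can be sketched with plain shell arithmetic. This is a simplified illustration of the idea, not Sqoop's actual splitter code, and the function name split_ranges and its arguments are made up for this example.

```shell
# Hypothetical helper: print one WHERE fragment per map task.
# Arguments: column name, min value, max value, number of mappers.
# The body runs in a subshell so its variables stay local.
split_ranges() (
  column=$1; min=$2; max=$3; mappers=$4
  span=$(( max - min + 1 ))      # number of integer values to cover
  i=0
  while [ "$i" -lt "$mappers" ]; do
    lo=$(( min + span * i / mappers ))
    hi=$(( min + span * (i + 1) / mappers - 1 ))
    echo "$column >= $lo AND $column <= $hi"
    i=$(( i + 1 ))
  done
)

# Example: ids from 1 to 100 split across 4 map tasks.
split_ranges id 1 100 4
```

If the ids are unevenly distributed, ranges of equal width can still hold very different row counts, which is one reason a dense numeric primary key tends to be the better choice.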

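Using HCatalog

HCatalog is a table and storage management service for Hadoop that makes it easier for users of different data processing tools (Pig, MapReduce, and Hive) to read and write data on the grid. HCatalog's table abstraction gives users a relational view of data in the Hadoop Distributed File System (HDFS), so they do not need to worry about where or in what format the data is stored: RCFile, text files, or SequenceFiles. It is worth mentioning that tables generated by Hive itself (create.. as; create..like) are ROW FORMAT SERDE tables, meaning the data is serialized, so they cannot be exported directly; in that case HCatalog can be used instead.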

sqoop export \
--connect jdbc:mysql://192.57.132.33:3306/taxbook3_bigdata \
--username root \
--password 121212 \
--table year_dim_statistics \
--hcatalog-database ads \
--hcatalog-table year_dim_statistics \
--update-key date_value \
--update-mode allowinsert

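--hcatalog-database: the corresponding database in Hive

--hcatalog-table: the corresponding table in Hive

Note:

Since Hive 3.0, newly created tables are transactional by default, and Sqoop export does not support transactional tables, so the job fails with: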

ERROR tool.ExportTool: Encountered IOException running export job:

org.apache.hive.hcatalog.common.HCatException : 2016 : Error operation not supported : Store into a transactional table fxgl.fxgl_recnclt_result_upd from Pig/Mapreduce is not supported

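Solution:

Create the Hive table to be exported as an external table. External tables are not transactional, so the export works without problems: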

create external table xxxx;