Overview
1. Importing data from MySQL into HDFS on a Hadoop cluster:
(1) Preparation: install MySQL and create a database named hadooptest;
then, in hadooptest, create a table named employees with the following contents (the data comes from the book Hadoop Beginner's Guide):
Alice Engineering 50000 2009-03-12
Bob Sales 35000 2011-10-01
Camille Marketing 40000 2003-04-20
David Executive 75000 2001-03-20
Erica Support 34000 2011-07-07
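The exact table definition is not given above; the following is a minimal sketch (not from the original article) of how hadooptest and employees could be created and populated. The column names first_name, dept, salary and start_date match what Sqoop and Hive report later in this article; the column types are assumptions, with start_date assumed to be a DATE, which is consistent with the later Sqoop warning that it has to be cast to a less precise type in Hive.
[hadoop@Master ~]$ mysql -u root -p
mysql> CREATE DATABASE hadooptest;
mysql> USE hadooptest;
mysql> -- assumed column types; adjust as needed
mysql> CREATE TABLE employees (
    ->     first_name VARCHAR(20),
    ->     dept       VARCHAR(20),
    ->     salary     INT,
    ->     start_date DATE
    -> );
mysql> INSERT INTO employees VALUES
    ->     ('Alice',   'Engineering', 50000, '2009-03-12'),
    ->     ('Bob',     'Sales',       35000, '2011-10-01'),
    ->     ('Camille', 'Marketing',   40000, '2003-04-20'),
    ->     ('David',   'Executive',   75000, '2001-03-20'),
    ->     ('Erica',   'Support',     34000, '2011-07-07');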
(2) Download and install sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz; remember to set the corresponding environment variables in /etc/profile.
Then download mysql-connector-java-5.1.34-bin.jar and copy it into Sqoop's lib directory; again set the corresponding environment variables with vim /etc/profile:
#set sqoop path
export SQOOP_HOME=/usr/sqoop
export PATH=${SQOOP_HOME}/bin:${PATH}
export CLASSPATH=${SQOOP_HOME}/lib/mysql-connector-java-5.1.34-bin.jar:${CLASSPATH}
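After editing /etc/profile, a quick sanity check (a sketch, not part of the original steps) is to reload the profile and confirm that Sqoop and the JDBC driver are visible:
[hadoop@Master ~]$ source /etc/profile
[hadoop@Master ~]$ sqoop version        # should report Sqoop 1.4.5
[hadoop@Master ~]$ ls ${SQOOP_HOME}/lib/mysql-connector-java-5.1.34-bin.jar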
(3) Running the command sqoop import --connect jdbc:mysql://localhost/hadooptest --username hadoop --password 20082009 --table employees -m 1 produced the following error:
![](https://file2.kaopuke.com:8081/files_image/2023061017/202306101755406855863.png)
Checking the logs showed that the error occurred because the other slave nodes could not connect to the MySQL database on the Master.
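One way to confirm this diagnosis (a sketch; 172.16.2.17 is the Master's IP taken from the commands used later in this article) is to try connecting to the Master's MySQL from one of the slave nodes:
# run on a slave node; without the proper grants this fails with a
# "Host '...' is not allowed to connect to this MySQL server" error
mysql -h 172.16.2.17 -u hadoop -p hadooptest -e "SELECT COUNT(*) FROM employees;"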
To fix this, log in to MySQL on the Master as root and grant the slave nodes access, with no password required.
After that, run the command again with localhost replaced by the Master's IP address; this time it reported that the host Master.hadoop was not allowed to access the database.
So modify the access privileges once more, granting Master.hadoop access with an empty password:
![](https://file2.kaopuke.com:8081/files_image/2023061017/202306101755424772187.png)
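The exact statements are shown in the screenshot above; a minimal sketch of grants of this kind, run as the MySQL root user on the Master (the host patterns are assumptions, adjust them to your cluster), is:
[hadoop@Master ~]$ mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON hadooptest.* TO 'hadoop'@'Master.hadoop' IDENTIFIED BY '';
mysql> GRANT ALL PRIVILEGES ON hadooptest.* TO 'hadoop'@'%' IDENTIFIED BY '';
mysql> FLUSH PRIVILEGES;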
With the grants in place, execute the command again:
sqoop import --connect jdbc:mysql://172.16.2.17/hadooptest --username hadoop --table employees -m 1
(By default, Sqoop launches 4 mappers to read the data, even though this data set is tiny; -m sets the number of mappers.) The job now runs successfully:
![](https://file2.kaopuke.com:8081/files_image/2023061017/202306101755422685379.png)
Checking the output on HDFS shows that the MySQL data was imported into HDFS successfully:
![](https://file2.kaopuke.com:8081/files_image/2023061017/202306101755422933573.png)
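Besides the screenshot above, the result can also be inspected from the command line (a sketch; with -m 1 the single output file is normally part-m-00000, and Sqoop's default text field delimiter is a comma):
[hadoop@Master ~]$ hadoop fs -ls employees
[hadoop@Master ~]$ hadoop fs -cat employees/part-m-00000
# expect comma-separated rows such as: Alice,Engineering,50000,2009-03-12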
2. Importing MySQL data into Hive via Sqoop on the Hadoop cluster:
(1) First delete the previous output directory on HDFS: hadoop fs -rmr em*
(2) Run the import with Sqoop:
[hadoop@Master ~]$ sqoop import --connect jdbc:mysql://172.16.2.17/hadooptest --username hadoop --table employees --hive-import --hive-table employees
Warning: /usr/sqoop/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /usr/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
14/12/29 10:29:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
14/12/29 10:29:56 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
14/12/29 10:29:56 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
14/12/29 10:29:56 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
14/12/29 10:29:56 INFO tool.CodeGenTool: Beginning code generation
14/12/29 10:29:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
14/12/29 10:29:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
14/12/29 10:29:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hadoop
Note: /tmp/sqoop-hadoop/compile/934eea95d6f0f4b0d86d229dbc48686c/employees.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/12/29 10:29:57 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/934eea95d6f0f4b0d86d229dbc48686c/employees.jar
14/12/29 10:29:57 WARN manager.MySQLManager: It looks like you are importing from mysql.
14/12/29 10:29:57 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
14/12/29 10:29:57 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
14/12/29 10:29:57 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
14/12/29 10:29:57 INFO mapreduce.ImportJobBase: Beginning import of employees
14/12/29 10:29:57 INFO db.DBInputFormat: Using read commited transaction isolation
14/12/29 10:29:57 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`first_name`), MAX(`first_name`) FROM `employees`
14/12/29 10:29:57 WARN db.TextSplitter: Generating splits for a textual index column.
14/12/29 10:29:57 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
14/12/29 10:29:57 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
14/12/29 10:29:57 INFO mapred.JobClient: Running job: job_201412282237_0008
14/12/29 10:29:58 INFO mapred.JobClient: map 0% reduce 0%
14/12/29 10:30:12 INFO mapred.JobClient: map 50% reduce 0%
14/12/29 10:30:15 INFO mapred.JobClient: map 100% reduce 0%
14/12/29 10:30:20 INFO mapred.JobClient: Job complete: job_201412282237_0008
14/12/29 10:30:20 INFO mapred.JobClient: Counters: 18
14/12/29 10:30:20 INFO mapred.JobClient: Map-Reduce Framework
14/12/29 10:30:20 INFO mapred.JobClient: Spilled Records=0
14/12/29 10:30:20 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2141048832
14/12/29 10:30:20 INFO mapred.JobClient: Map input records=5
14/12/29 10:30:20 INFO mapred.JobClient: SPLIT_RAW_BYTES=513
14/12/29 10:30:20 INFO mapred.JobClient: Map output records=5
14/12/29 10:30:20 INFO mapred.JobClient: Physical memory (bytes) snapshot=280956928
14/12/29 10:30:20 INFO mapred.JobClient: CPU time spent (ms)=2510
14/12/29 10:30:20 INFO mapred.JobClient: Total committed heap usage (bytes)=216793088
14/12/29 10:30:20 INFO mapred.JobClient: File Input Format Counters
14/12/29 10:30:20 INFO mapred.JobClient: Bytes Read=0
14/12/29 10:30:20 INFO mapred.JobClient: FileSystemCounters
14/12/29 10:30:20 INFO mapred.JobClient: HDFS_BYTES_READ=513
14/12/29 10:30:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=121353
14/12/29 10:30:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=161
14/12/29 10:30:20 INFO mapred.JobClient: Job Counters
14/12/29 10:30:20 INFO mapred.JobClient: Launched map tasks=4
14/12/29 10:30:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/12/29 10:30:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/29 10:30:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=24231
14/12/29 10:30:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/12/29 10:30:20 INFO mapred.JobClient: File Output Format Counters
14/12/29 10:30:20 INFO mapred.JobClient: Bytes Written=161
14/12/29 10:30:20 INFO mapreduce.ImportJobBase: Transferred 161 bytes in 23.6772 seconds (6.7998 bytes/sec)
14/12/29 10:30:20 INFO mapreduce.ImportJobBase: Retrieved 5 records.
14/12/29 10:30:20 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
14/12/29 10:30:20 WARN hive.TableDefWriter: Column start_date had to be cast to a less precise type in Hive
14/12/29 10:30:20 INFO hive.HiveImport: Removing temporary files from import process: hdfs://172.16.2.17:9000/user/hadoop/employees/_logs
14/12/29 10:30:20 INFO hive.HiveImport: Loading uploaded data into Hive
14/12/29 10:30:21 INFO hive.HiveImport: Logging initialized using configuration in file:/usr/hive/hive/conf/hive-log4j.properties
14/12/29 10:30:21 INFO hive.HiveImport: Hive history file=/tmp/hadoop/hive_job_log_hadoop_201412291030_581129438.txt
14/12/29 10:30:25 INFO hive.HiveImport: OK
14/12/29 10:30:25 INFO hive.HiveImport: Time taken: 3.157 seconds
14/12/29 10:30:25 INFO hive.HiveImport: Loading data to table default.employees
14/12/29 10:30:25 INFO hive.HiveImport: OK
14/12/29 10:30:25 INFO hive.HiveImport: Time taken: 0.185 seconds
14/12/29 10:30:25 INFO hive.HiveImport: Hive import complete.
14/12/29 10:30:25 INFO hive.HiveImport: Export directory is empty, removing it.
(3) Check the results:
[hadoop@Master ~]$ hive -e "select * from employees"
Logging initialized using configuration in file:/usr/hive/hive/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201412291031_1993004175.txt
OK
Alice Engineering 50000 2009-03-12
Bob Sales 35000 2011-10-01
Camille Marketing 40000 2003-04-20
David Executive 75000 2001-03-20
Erica Support 34000 2011-07-07
Time taken: 2.586 seconds
[hadoop@Master ~]$ hive -e "describe employees"
Logging initialized using configuration in file:/usr/hive/hive/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201412291041_685213829.txt
OK
first_name string
dept string
salary int
start_date string
Time taken: 2.297 seconds
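The imported data now lives under Hive's warehouse directory on HDFS; assuming the default hive.metastore.warehouse.dir (an assumption, the original does not show this step), it can be listed with:
[hadoop@Master ~]$ hadoop fs -ls /user/hive/warehouse/employees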