Overview
In the previous post we got the Hadoop environment up and running. This time we will use the gis-tools-for-hadoop that Esri provides to get a hands-on feel for how Hadoop works and to watch Hadoop and GIS meet for the first time.
Related environment:
Esri/gis-tools-for-hadoop:https://github.com/Esri/gis-tools-for-hadoop
- Sample tools that demonstrate full stack implementations of all the resources provided to solve GIS problems using Hadoop
- Templates for building custom tools that solve specific problems
- Resources for building custom tools:
  - Spatial Framework for Hadoop
    - Java helper utilities for Hadoop developers
    - Hive spatial user-defined functions
  - Esri Geometry API Java - Java geometry library for spatial data processing
  - Geoprocessing Tools - ArcGIS Geoprocessing tools for Hadoop
hive-0.11.0:http://www.apache.org/dyn/closer.cgi/hive/
Before we can use the tools Esri provides, we need to install Hive.
Hive is a data-warehouse tool built on top of Hadoop. It maps structured data files onto database tables and offers a full SQL-style query capability, translating SQL statements into MapReduce jobs. Its big advantage is the low learning curve: simple MapReduce statistics can be written as SQL-like statements, with no need to develop a dedicated MapReduce application, which makes it very well suited to data-warehouse analysis.
Hive was contributed by Facebook, and its syntax is close to MySQL's, so anyone who writes SQL regularly will feel at home. That is precisely why the Hive project exists: the SQL user base is enormous.
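For example, an aggregation that would otherwise mean writing a MapReduce job by hand is a one-line statement in Hive (the table and column names here are purely illustrative, not from this article's data):
SELECT dept, COUNT(*) FROM employees GROUP BY dept;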
------------------------------------------------------------------
Installing Hive is just like installing Hadoop: a single unpack command, and it only needs to be installed on the namenode machine (the master).
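On the master that boils down to something like the following (a minimal sketch; the exact archive name depends on the download you picked):
tar -zxvf hive-0.11.0.tar.gz -C /home/hadoop/    # unpack Hive into the hadoop user's home directory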
Once it is unpacked, a few configuration files need to be edited.
1: Install Hive under the hadoop user and make sure that user has permissions on the hive directory
drwxrwxr-x. 10 hadoop hadoop   4096 Jun 25 18:55 hive-0.11.0
2: Add the Hadoop and JDK environment variables to the following files (a sample of the lines to append is shown after the paths):
/home/hadoop/hive-0.11.0/conf/hive-env.sh
/home/hadoop/hive-0.11.0/bin/hive-config.sh
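The lines appended to those two files look roughly like this (a hedged sketch; the paths match this article's setup, so adjust them to your own):
# appended to hive-env.sh and hive-config.sh
export JAVA_HOME=/home/jdk/jdk1.7.0_25
export HADOOP_HOME=/home/hadoop/hadoop-1.2.0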
3: Add the HIVE_HOME information to the hadoop user's environment variables
export JAVA_HOME=/home/jdk/jdk1.7.0_25
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib
#Hadoop
export HADOOP_HOME=/home/hadoop/hadoop-1.2.0
export HIVE_HOME=/home/hadoop/hive-0.11.0
export HADOOP_HOME_WARN_SUPPRESS=1
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib
4: By default $HIVE_HOME/conf contains only hive-default.xml.template, not hive-default.xml or hive-site.xml; just make copies with cp
[hadoop@namenode conf]$ cp hive-default.xml.template hive-default.xml
[hadoop@namenode conf]$ cp hive-default.xml.template hive-site.xml
[hadoop@namenode conf]$ ll
total 248
-rw-rw-r--. 1 hadoop hadoop 75005 Jun 26 02:19 hive-default.xml
-rw-rw-r--. 1 hadoop hadoop 75005 May 11 12:06 hive-default.xml.template
-rw-rw-r--. 1 hadoop hadoop  2714 Jun 26 02:34 hive-env.sh
-rw-rw-r--. 1 hadoop hadoop  2378 May 11 12:06 hive-env.sh.template
-rw-rw-r--. 1 hadoop hadoop  2465 May 11 12:06 hive-exec-log4j.properties.template
-rw-rw-r--. 1 hadoop hadoop  2941 Jun 26 02:38 hive-log4j.properties
-rw-rw-r--. 1 hadoop hadoop  2870 May 11 12:06 hive-log4j.properties.template
-rw-rw-r--. 1 hadoop hadoop 75005 Jun 26 02:19 hive-site.xml
5: You can change the default warehouse location by editing $HIVE_HOME/conf/hive-site.xml (the file is fairly large):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
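Once tables have been created, their data directories appear under this path in HDFS, which you can confirm with (path as configured above):
hadoop fs -ls /user/hive/warehouse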
Testing Hive
[hadoop@namenode ~]$ hive
Logging initialized using configuration in file:/home/hadoop/hive-0.11.0/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_4317@namenode.com_201306261028_23687925.txt
hive>
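A quick sanity check from the prompt (the table name here is arbitrary):
hive> CREATE TABLE test_tbl (id INT, name STRING);   -- creates a managed table under the warehouse directory
hive> SHOW TABLES;                                    -- should list test_tbl
hive> DROP TABLE test_tbl;                            -- clean up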
------------------------------------------------------------------
A look at Esri's gis-tools-for-hadoop
[hadoop@namenode samples]$ ll
total 20
drwxr-xr-x. 4 hadoop hadoop 4096 Jun 26 04:54 data
drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 04:54 lib
drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 05:30 point-in-polygon-aggregation-hive
drwxr-xr-x. 5 hadoop hadoop 4096 Jun 26 04:54 point-in-polygon-aggregation-mr
-rw-r--r--. 1 hadoop hadoop    98 Jun 26 04:54 README.md
There is a data folder containing an earthquake dataset (csv) and a US county boundary dataset (json)
[hadoop@namenode data]$ ll
total 556
drwxr-xr-x. 2 hadoop hadoop   4096 Jun 26 04:54 counties-data
drwxr-xr-x. 2 hadoop hadoop   4096 Jun 26 04:54 earthquake-data
-rw-r--r--. 1 hadoop hadoop 560721 Jun 26 04:54 samples.gdb.zip
The lib folder holds the two jar files that matter most
[hadoop@namenode lib]$ ll
total 908
-rw-r--r--. 1 hadoop hadoop 794165 Jun 26 04:54 esri-geometry-api.jar
-rw-r--r--. 1 hadoop hadoop 135121 Jun 26 04:54 spatial-sdk-hadoop.jar
The point-in-polygon-aggregation-hive folder contains a SQL file, and the steps for using it are described in detail
[hadoop@namenode point-in-polygon-aggregation-hive]$ ll
total 8
-rw-r--r--. 1 hadoop hadoop 3195 Jun 26 04:54 README.md
-rw-r--r--. 1 hadoop hadoop 1161 Jun 26 04:54 run-sample.sql
point-in-polygon-aggregation-mr holds the Java source code for developing against Hadoop directly
[hadoop@namenode point-in-polygon-aggregation-mr]$ ll
total 28
-rw-r--r--. 1 hadoop hadoop 4885 Jun 26 04:54 aggregation-sample.jar
-rw-r--r--. 1 hadoop hadoop 1012 Jun 26 04:54 build.xml
drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 04:54 cmd
drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 04:54 gp
-rw-r--r--. 1 hadoop hadoop 1913 Jun 26 04:54 README.md
drwxr-xr-x. 3 hadoop hadoop 4096 Jun 26 04:54 src
------------------------------------------------------------------
Before running the SQL we need to load the data into HDFS. We cannot do that with the Linux cp or mv commands; we have to use the tools that ship with hadoop.
[hadoop@namenode ~]$ hadoop fs
Usage: java FsShell
[-ls <path>]
[-lsr <path>]
[-du <path>]
[-dus <path>]
[-count[-q] <path>]
[-mv <src> <dst>]
[-cp <src> <dst>]
[-rm [-skipTrash] <path>]
[-rmr [-skipTrash] <path>]
[-expunge]
[-put <localsrc> ... <dst>]
[-copyFromLocal <localsrc> ... <dst>]
[-moveFromLocal <localsrc> ... <dst>]
[-get [-ignoreCrc] [-crc] <src> <localdst>]
[-getmerge <src> <localdst> [addnl]]
[-cat <src>]
[-text <src>]
[-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
[-moveToLocal [-crc] <src> <localdst>]
[-mkdir <path>]
[-setrep [-R] [-w] <rep> <path/file>]
[-touchz <path>]
[-test -[ezd] <path>]
[-stat [format] <path>]
[-tail [-f] <file>]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-chgrp [-R] GROUP PATH...]
[-help [cmd]]
Generic options supported are
-conf <configuration file>                      specify an application configuration file
-D <property=value>                             use value for given property
-fs <local|namenode:port>                       specify a namenode
-jt <local|jobtracker:port>                     specify a job tracker
-files <comma separated list of files>          specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>         specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
1: Load the data into HDFS
[hadoop@namenode ~]$ hadoop fs -put /home/hadoop/gis-tools-for-hadoop-master/samples/ /home/hadoop/hadoop-1.2.0/tmp/
The -put above takes what is physically stored under /home/hadoop/gis-tools-for-hadoop-master/samples/ and loads it into /home/hadoop/hadoop-1.2.0/tmp/ in HDFS.
Once the data has been loaded, you can see a pile of block fragments on any of the datanode machines
[hadoop@datanode1 current]$ pwd
/home/hadoop/hadoop-1.2.0/tmp/dfs/data/current
[hadoop@datanode1 current]$ ll
total 7052
-rw-rw-r--. 1 hadoop hadoop      98 Jun 26 06:12 blk_2562006058303613171
-rw-rw-r--. 1 hadoop hadoop      11 Jun 26 06:12 blk_2562006058303613171_1050.meta
-rw-rw-r--. 1 hadoop hadoop  560721 Jun 26 06:12 blk_3056013857537171121
-rw-rw-r--. 1 hadoop hadoop    4391 Jun 26 06:12 blk_3056013857537171121_1052.meta
-rw-rw-r--. 1 hadoop hadoop    2047 Jun 26 06:12 blk_3813361044238402711
-rw-rw-r--. 1 hadoop hadoop      23 Jun 26 06:12 blk_3813361044238402711_1063.meta
-rw-rw-r--. 1 hadoop hadoop    2060 Jun 26 06:12 blk_5126515286091847995
-rw-rw-r--. 1 hadoop hadoop      27 Jun 26 06:12 blk_5126515286091847995_1064.meta
-rw-rw-r--. 1 hadoop hadoop  794165 Jun 26 06:12 blk_5144324295121310544
-rw-rw-r--. 1 hadoop hadoop    6215 Jun 26 06:12 blk_5144324295121310544_1055.meta
-rw-rw-r--. 1 hadoop hadoop    1913 Jun 26 06:12 blk_7055687596152865845
-rw-rw-r--. 1 hadoop hadoop      23 Jun 26 06:12 blk_7055687596152865845_1062.meta
-rw-rw-r--. 1 hadoop hadoop 5742811 Jun 26 06:12 blk_7385460214599207016
-rw-rw-r--. 1 hadoop hadoop   44875 Jun 26 06:12 blk_7385460214599207016_1053.meta
-rw-rw-r--. 1 hadoop hadoop    1045 Jun 26 06:12 blk_-787033794569559952
-rw-rw-r--. 1 hadoop hadoop      19 Jun 26 06:12 blk_-787033794569559952_1056.meta
-rw-rw-r--. 1 hadoop hadoop    3195 Jun 26 06:12 blk_8646433984325059766
-rw-rw-r--. 1 hadoop hadoop      35 Jun 26 06:12 blk_8646433984325059766_1048.meta
-rw-rw-r--. 1 hadoop hadoop     772 Jun 26 06:12 dncp_block_verification.log.curr
-rw-rw-r--. 1 hadoop hadoop     159 Jun 26 03:30 VERSION
2: List the files in HDFS
[hadoop@namenode ~]$ hadoop fs -ls /home/hadoop/hadoop-1.2.0/tmp/
Found 6 items
-rw-r--r--   1 hadoop supergroup      98 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/README.md
drwxr-xr-x   - hadoop supergroup       0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/data
drwxr-xr-x   - hadoop supergroup       0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/lib
drwxr-xr-x   - hadoop supergroup       0 2013-06-26 09:38 /home/hadoop/hadoop-1.2.0/tmp/mapred
drwxr-xr-x   - hadoop supergroup       0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/point-in-polygon-aggregation-hive
drwxr-xr-x   - hadoop supergroup       0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/point-in-polygon-aggregation-mr
3: List the files inside a particular directory in HDFS
[hadoop@namenode ~]$ hadoop fs -ls input /home/hadoop/hadoop-1.2.0/tmp/data
ls: Cannot access input: No such file or directory.
Found 3 items
drwxr-xr-x   - hadoop supergroup       0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/data/counties-data
drwxr-xr-x   - hadoop supergroup       0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/data/earthquake-data
-rw-r--r--   1 hadoop supergroup  560721 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/data/samples.gdb.zip
4: View the contents of a file in an HDFS directory
hadoop fs -cat /home/hadoop/hadoop-1.2.0/tmp/data/earthquake-data/earthquake.csv
hadoop fs -rmr /home/hadoop/hadoop-1.2.0/tmp/data
Note: once you are dealing with the directory structure inside HDFS you can no longer fall back on the ordinary Linux commands; you cannot ls the contents of a directory or cd into a path, you have to go through hadoop fs.
The commands above are how you verify that your data really has been uploaded into HDFS.
------------------------------------------------------------------
Next we work through the SQL statements that Esri provides.
// add the jar files
add jar
/home/hadoop/esri-geometry-api.jar
/home/hadoop/spatial-sdk-hadoop.jar;
// create temporary functions; the SQL below uses ST_Point and ST_Contains; if you need others, create them the same way
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
// create an external table with the matching schema and point it at the csv file provided above
// note that the LOCATION must be the path the data was loaded to inside HDFS
// if the import succeeded, Select * from earthquakes1; will return records; if nothing comes back, the import failed
// Hive has managed tables and external tables; dropping an external table does not delete the actual data,
// it only removes the metadata pointing at the HDFS location; see the Hive documentation for details
CREATE EXTERNAL TABLE IF NOT EXISTS earthquakes1 (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/home/hadoop/hadoop-1.2.0/tmp/data/earthquake-data';
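A quick way to confirm the mapping worked without scanning the whole file (an illustrative check):
SELECT * FROM earthquakes1 LIMIT 5;   -- should print a few earthquake records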
// create a counties1 table in the same way
CREATE EXTERNAL TABLE IF NOT EXISTS counties1 (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/home/hadoop/hadoop-1.2.0/tmp/data/counties-data';
// then simply run the following SQL
SELECT counties1.name, count(*) cnt FROM counties1
JOIN earthquakes1
WHERE ST_Contains(counties1.boundaryshape, ST_Point(earthquakes1.longitude, earthquakes1.latitude))
GROUP BY counties1.name
ORDER BY cnt desc;
Note: Hive really is quite similar to the SQL we use every day; anyone who has learned MySQL in particular will find it familiar. SQL has its simple statements (Select * from table; insert into ...; delete from table; and so on) and its complex ones, which usually means users writing their own stored procedures. What the steps above demonstrate is a very typical Hive user-defined function (UDF), which you can think of as the Hive counterpart of a stored procedure: with UDFs we can write complex logic to get our job done. Because Hive itself is written in Java, UDFs must also be written in Java. There are actually three kinds of UDFs; see the Hive documentation for the details.
So how do we use a UDF once it is written? Following the example above (a minimal sketch of such a UDF is shown after the steps):
1: Package the Java UDF into a jar
2: Register that jar in Hive with add jar
3: Give the Java class an alias
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
Steps 2 and 3 have to be repeated every time Hive is started.
4: Create the tables in Hive and load the data
5: Run the query to get the result you want
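For reference, a minimal UDF looks roughly like this (a hedged sketch, not code from the Esri toolkit; the package, class and function names are purely illustrative):
package com.example.hive.udf;                        // hypothetical package

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial UDF: returns the input string in upper case.
public final class ToUpper extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toUpperCase());
    }
}
Packaged into my-udf.jar, it would be used exactly like the Esri functions:
add jar /home/hadoop/my-udf.jar;
create temporary function to_upper as 'com.example.hive.udf.ToUpper';
select to_upper(name) from counties1;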
If you are interested, you can browse the source code of these functions under
%Hadoop%\spatial-framework-for-hadoop-master\spatial-framework-for-hadoop-master\hive\src\com\esri\hadoop\hive
The toolkit Esri provides also includes spatial constructor and spatial relationship functions broadly comparable to Oracle's ST_Geometry;
see %Hadoop%\spatial-framework-for-hadoop-master\spatial-framework-for-hadoop-master\hive\function-ddl.sql
create temporary function ST_AsBinary as 'com.esri.hadoop.hive.ST_AsBinary';
create temporary function ST_AsGeoJSON as 'com.esri.hadoop.hive.ST_AsGeoJson';
create temporary function ST_AsJSON as 'com.esri.hadoop.hive.ST_AsJson';
create temporary function ST_AsText as 'com.esri.hadoop.hive.ST_AsText';
create temporary function ST_GeomFromJSON as 'com.esri.hadoop.hive.ST_GeomFromJson';
create temporary function ST_GeomFromGeoJSON as 'com.esri.hadoop.hive.ST_GeomFromGeoJson';
create temporary function ST_GeomFromText as 'com.esri.hadoop.hive.ST_GeomFromText';
create temporary function ST_GeomFromWKB as 'com.esri.hadoop.hive.ST_GeomFromWKB';
create temporary function ST_PointFromWKB as 'com.esri.hadoop.hive.ST_PointFromWKB';
create temporary function ST_LineFromWKB as 'com.esri.hadoop.hive.ST_LineFromWKB';
create temporary function ST_PolyFromWKB as 'com.esri.hadoop.hive.ST_PolyFromWKB';
create temporary function ST_MPointFromWKB as 'com.esri.hadoop.hive.ST_MPointFromWKB';
create temporary function ST_MLineFromWKB as 'com.esri.hadoop.hive.ST_MLineFromWKB';
create temporary function ST_MPolyFromWKB as 'com.esri.hadoop.hive.ST_MPolyFromWKB';
create temporary function ST_GeomCollection as 'com.esri.hadoop.hive.ST_GeomCollection';
create temporary function ST_GeometryType as 'com.esri.hadoop.hive.ST_GeometryType';
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_PointZ as 'com.esri.hadoop.hive.ST_PointZ';
create temporary function ST_LineString as 'com.esri.hadoop.hive.ST_LineString';
create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';
create temporary function ST_MultiPoint as 'com.esri.hadoop.hive.ST_MultiPoint';
create temporary function ST_MultiLineString as 'com.esri.hadoop.hive.ST_MultiLineString';
create temporary function ST_MultiPolygon as 'com.esri.hadoop.hive.ST_MultiPolygon';
create temporary function ST_SetSRID as 'com.esri.hadoop.hive.ST_SetSRID';
create temporary function ST_SRID as 'com.esri.hadoop.hive.ST_SRID';
create temporary function ST_IsEmpty as 'com.esri.hadoop.hive.ST_IsEmpty';
create temporary function ST_IsSimple as 'com.esri.hadoop.hive.ST_IsSimple';
create temporary function ST_Dimension as 'com.esri.hadoop.hive.ST_Dimension';
create temporary function ST_X as 'com.esri.hadoop.hive.ST_X';
create temporary function ST_Y as 'com.esri.hadoop.hive.ST_Y';
create temporary function ST_MinX as 'com.esri.hadoop.hive.ST_MinX';
create temporary function ST_MaxX as 'com.esri.hadoop.hive.ST_MaxX';
create temporary function ST_MinY as 'com.esri.hadoop.hive.ST_MinY';
create temporary function ST_MaxY as 'com.esri.hadoop.hive.ST_MaxY';
create temporary function ST_IsClosed as 'com.esri.hadoop.hive.ST_IsClosed';
create temporary function ST_IsRing as 'com.esri.hadoop.hive.ST_IsRing';
create temporary function ST_Length as 'com.esri.hadoop.hive.ST_Length';
create temporary function ST_GeodesicLengthWGS84 as 'com.esri.hadoop.hive.ST_GeodesicLengthWGS84';
create temporary function ST_Area as 'com.esri.hadoop.hive.ST_Area';
create temporary function ST_Is3D as 'com.esri.hadoop.hive.ST_Is3D';
create temporary function ST_Z as 'com.esri.hadoop.hive.ST_Z';
create temporary function ST_MinZ as 'com.esri.hadoop.hive.ST_MinZ';
create temporary function ST_MaxZ as 'com.esri.hadoop.hive.ST_MaxZ';
create temporary function ST_IsMeasured as 'com.esri.hadoop.hive.ST_IsMeasured';
create temporary function ST_M as 'com.esri.hadoop.hive.ST_M';
create temporary function ST_MinM as 'com.esri.hadoop.hive.ST_MinM';
create temporary function ST_MaxM as 'com.esri.hadoop.hive.ST_MaxM';
create temporary function ST_CoordDim as 'com.esri.hadoop.hive.ST_CoordDim';
create temporary function ST_NumPoints as 'com.esri.hadoop.hive.ST_NumPoints';
create temporary function ST_PointN as 'com.esri.hadoop.hive.ST_PointN';
create temporary function ST_StartPoint as 'com.esri.hadoop.hive.ST_StartPoint';
create temporary function ST_EndPoint as 'com.esri.hadoop.hive.ST_EndPoint';
create temporary function ST_ExteriorRing as 'com.esri.hadoop.hive.ST_ExteriorRing';
create temporary function ST_NumInteriorRing as 'com.esri.hadoop.hive.ST_NumInteriorRing';
create temporary function ST_InteriorRingN as 'com.esri.hadoop.hive.ST_InteriorRingN';
create temporary function ST_NumGeometries as 'com.esri.hadoop.hive.ST_NumGeometries';
create temporary function ST_GeometryN as 'com.esri.hadoop.hive.ST_GeometryN';
create temporary function ST_Centroid as 'com.esri.hadoop.hive.ST_Centroid';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
create temporary function ST_Crosses as 'com.esri.hadoop.hive.ST_Crosses';
create temporary function ST_Disjoint as 'com.esri.hadoop.hive.ST_Disjoint';
create temporary function ST_EnvIntersects as 'com.esri.hadoop.hive.ST_EnvIntersects';
create temporary function ST_Envelope as 'com.esri.hadoop.hive.ST_Envelope';
create temporary function ST_Equals as 'com.esri.hadoop.hive.ST_Equals';
create temporary function ST_Overlaps as 'com.esri.hadoop.hive.ST_Overlaps';
create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';
create temporary function ST_Relate as 'com.esri.hadoop.hive.ST_Relate';
create temporary function ST_Touches as 'com.esri.hadoop.hive.ST_Touches';
create temporary function ST_Within as 'com.esri.hadoop.hive.ST_Within';
create temporary function ST_Distance as 'com.esri.hadoop.hive.ST_Distance';
create temporary function ST_Boundary as 'com.esri.hadoop.hive.ST_Boundary';
create temporary function ST_Buffer as 'com.esri.hadoop.hive.ST_Buffer';
create temporary function ST_ConvexHull as 'com.esri.hadoop.hive.ST_ConvexHull';
create temporary function ST_Intersection as 'com.esri.hadoop.hive.ST_Intersection';
create temporary function ST_Union as 'com.esri.hadoop.hive.ST_Union';
create temporary function ST_Difference as 'com.esri.hadoop.hive.ST_Difference';
create temporary function ST_SymmetricDiff as 'com.esri.hadoop.hive.ST_SymmetricDiff';
create temporary function ST_SymDifference as 'com.esri.hadoop.hive.ST_SymmetricDiff';
create temporary function ST_Aggr_Union as 'com.esri.hadoop.hive.ST_Aggr_Union';
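To get a feel for these functions, here is a hedged ad-hoc example; it assumes the functions above have been registered and simply evaluates constant geometries against one row of an existing table (Hive 0.11 still requires a FROM clause):
-- distance between two points, expected result 5.0
SELECT ST_Distance(ST_Point(0, 0), ST_Point(3, 4)) FROM earthquakes1 LIMIT 1;
-- buffer a point by 0.5 and print the result as WKT
SELECT ST_AsText(ST_Buffer(ST_Point(1, 1), 0.5)) FROM earthquakes1 LIMIT 1;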
Once all the data is in place, running the SQL produces the following output
hive> SELECT counties1.name, count(*) cnt FROM counties1
> JOIN earthquakes1
> WHERE ST_Contains(counties1.boundaryshape, ST_Point(earthquakes1.longitude, earthquakes1.latitude))
> GROUP BY counties1.name
> ORDER BY cnt desc;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201306260649_0003, Tracking URL = http://namenode.com:50030/jobdetails.jsp?jobid=job_201306260649_0003
Kill Command = /home/hadoop/hadoop-1.2.0/libexec/../bin/hadoop job  -kill job_201306260649_0003
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2013-06-26 09:57:13,072 Stage-1 map = 0%,  reduce = 0%
2013-06-26 09:57:27,152 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 5.83 sec
2013-06-26 09:57:28,160 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 5.83 sec
2013-06-26 09:57:29,167 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 5.83 sec
2013-06-26 09:57:30,174 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 5.83 sec
2013-06-26 09:57:31,187 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.4 sec
2013-06-26 09:57:32,200 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.4 sec
2013-06-26 09:57:33,210 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.4 sec
2013-06-26 09:57:34,219 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.4 sec
2013-06-26 09:57:35,237 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.4 sec
2013-06-26 09:57:36,246 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.4 sec
2013-06-26 09:57:37,256 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 10.4 sec
2013-06-26 09:57:38,265 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 10.4 sec
2013-06-26 09:57:39,271 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 10.4 sec
2013-06-26 09:57:40,278 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 10.4 sec
2013-06-26 09:57:41,286 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 10.4 sec
2013-06-26 09:57:42,294 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 10.4 sec
2013-06-26 09:57:43,301 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 10.4 sec
2013-06-26 09:57:44,308 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 10.4 sec
2013-06-26 09:57:45,314 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 10.4 sec
2013-06-26 09:57:46,323 Stage-1 map = 100%,  reduce = 71%, Cumulative CPU 10.4 sec
2013-06-26 09:57:47,330 Stage-1 map = 100%,  reduce = 71%, Cumulative CPU 10.4 sec
2013-06-26 09:57:48,337 Stage-1 map = 100%,  reduce = 71%, Cumulative CPU 10.4 sec
2013-06-26 09:57:49,343 Stage-1 map = 100%,  reduce = 72%, Cumulative CPU 10.4 sec
2013-06-26 09:57:50,354 Stage-1 map = 100%,  reduce = 72%, Cumulative CPU 10.4 sec
2013-06-26 09:57:51,360 Stage-1 map = 100%,  reduce = 72%, Cumulative CPU 10.4 sec
2013-06-26 09:57:52,369 Stage-1 map = 100%,  reduce = 73%, Cumulative CPU 10.4 sec
2013-06-26 09:57:53,379 Stage-1 map = 100%,  reduce = 73%, Cumulative CPU 10.4 sec
2013-06-26 09:57:54,385 Stage-1 map = 100%,  reduce = 73%, Cumulative CPU 10.4 sec
2013-06-26 09:57:55,391 Stage-1 map = 100%,  reduce = 74%, Cumulative CPU 10.4 sec
2013-06-26 09:57:56,397 Stage-1 map = 100%,  reduce = 74%, Cumulative CPU 10.4 sec
2013-06-26 09:57:57,403 Stage-1 map = 100%,  reduce = 74%, Cumulative CPU 10.4 sec
2013-06-26 09:57:58,411 Stage-1 map = 100%,  reduce = 76%, Cumulative CPU 10.4 sec
2013-06-26 09:57:59,418 Stage-1 map = 100%,  reduce = 76%, Cumulative CPU 10.4 sec
2013-06-26 09:58:00,425 Stage-1 map = 100%,  reduce = 76%, Cumulative CPU 10.4 sec
2013-06-26 09:58:01,433 Stage-1 map = 100%,  reduce = 76%, Cumulative CPU 10.4 sec
2013-06-26 09:58:02,439 Stage-1 map = 100%,  reduce = 76%, Cumulative CPU 10.4 sec
2013-06-26 09:58:03,448 Stage-1 map = 100%,  reduce = 76%, Cumulative CPU 10.4 sec
2013-06-26 09:58:04,464 Stage-1 map = 100%,  reduce = 77%, Cumulative CPU 10.4 sec
2013-06-26 09:58:05,476 Stage-1 map = 100%,  reduce = 77%, Cumulative CPU 10.4 sec
2013-06-26 09:58:06,482 Stage-1 map = 100%,  reduce = 77%, Cumulative CPU 10.4 sec
2013-06-26 09:58:07,488 Stage-1 map = 100%,  reduce = 79%, Cumulative CPU 10.4 sec
2013-06-26 09:58:08,497 Stage-1 map = 100%,  reduce = 79%, Cumulative CPU 10.4 sec
2013-06-26 09:58:09,503 Stage-1 map = 100%,  reduce = 79%, Cumulative CPU 10.4 sec
2013-06-26 09:58:10,516 Stage-1 map = 100%,  reduce = 80%, Cumulative CPU 10.4 sec
2013-06-26 09:58:11,524 Stage-1 map = 100%,  reduce = 80%, Cumulative CPU 10.4 sec
2013-06-26 09:58:12,533 Stage-1 map = 100%,  reduce = 80%, Cumulative CPU 43.27 sec
2013-06-26 09:58:13,541 Stage-1 map = 100%,  reduce = 81%, Cumulative CPU 43.27 sec
2013-06-26 09:58:14,547 Stage-1 map = 100%,  reduce = 81%, Cumulative CPU 43.27 sec
2013-06-26 09:58:15,554 Stage-1 map = 100%,  reduce = 81%, Cumulative CPU 43.27 sec
2013-06-26 09:58:16,559 Stage-1 map = 100%,  reduce = 82%, Cumulative CPU 43.27 sec
2013-06-26 09:58:17,566 Stage-1 map = 100%,  reduce = 82%, Cumulative CPU 43.27 sec
2013-06-26 09:58:18,575 Stage-1 map = 100%,  reduce = 82%, Cumulative CPU 43.27 sec
2013-06-26 09:58:19,582 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 43.27 sec
2013-06-26 09:58:20,592 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 43.27 sec
2013-06-26 09:58:21,599 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 43.27 sec
2013-06-26 09:58:22,606 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 43.27 sec
2013-06-26 09:58:23,614 Stage-1 map = 100%,  reduce = 84%, Cumulative CPU 43.27 sec
2013-06-26 09:58:24,620 Stage-1 map = 100%,  reduce = 84%, Cumulative CPU 43.27 sec
2013-06-26 09:58:25,626 Stage-1 map = 100%,  reduce = 84%, Cumulative CPU 43.27 sec
2013-06-26 09:58:26,632 Stage-1 map = 100%,  reduce = 85%, Cumulative CPU 43.27 sec
2013-06-26 09:58:27,638 Stage-1 map = 100%,  reduce = 85%, Cumulative CPU 43.27 sec
2013-06-26 09:58:28,650 Stage-1 map = 100%,  reduce = 85%, Cumulative CPU 43.27 sec
2013-06-26 09:58:29,656 Stage-1 map = 100%,  reduce = 86%, Cumulative CPU 43.27 sec
2013-06-26 09:58:30,661 Stage-1 map = 100%,  reduce = 86%, Cumulative CPU 43.27 sec
2013-06-26 09:58:31,668 Stage-1 map = 100%,  reduce = 86%, Cumulative CPU 43.27 sec
2013-06-26 09:58:32,677 Stage-1 map = 100%,  reduce = 87%, Cumulative CPU 43.27 sec
2013-06-26 09:58:33,682 Stage-1 map = 100%,  reduce = 87%, Cumulative CPU 43.27 sec
2013-06-26 09:58:34,690 Stage-1 map = 100%,  reduce = 87%, Cumulative CPU 43.27 sec
2013-06-26 09:58:35,700 Stage-1 map = 100%,  reduce = 89%, Cumulative CPU 43.27 sec
2013-06-26 09:58:36,706 Stage-1 map = 100%,  reduce = 89%, Cumulative CPU 43.27 sec
2013-06-26 09:58:37,713 Stage-1 map = 100%,  reduce = 89%, Cumulative CPU 43.27 sec
2013-06-26 09:58:38,719 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 43.27 sec
2013-06-26 09:58:39,726 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 43.27 sec
2013-06-26 09:58:40,734 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 43.27 sec
2013-06-26 09:58:41,741 Stage-1 map = 100%,  reduce = 91%, Cumulative CPU 43.27 sec
2013-06-26 09:58:42,747 Stage-1 map = 100%,  reduce = 91%, Cumulative CPU 43.27 sec
2013-06-26 09:58:43,754 Stage-1 map = 100%,  reduce = 91%, Cumulative CPU 43.27 sec
2013-06-26 09:58:44,760 Stage-1 map = 100%,  reduce = 92%, Cumulative CPU 43.27 sec
2013-06-26 09:58:45,767 Stage-1 map = 100%,  reduce = 92%, Cumulative CPU 43.27 sec
2013-06-26 09:58:46,773 Stage-1 map = 100%,  reduce = 92%, Cumulative CPU 43.27 sec
2013-06-26 09:58:47,780 Stage-1 map = 100%,  reduce = 93%, Cumulative CPU 43.27 sec
2013-06-26 09:58:48,786 Stage-1 map = 100%,  reduce = 93%, Cumulative CPU 43.27 sec
2013-06-26 09:58:49,791 Stage-1 map = 100%,  reduce = 93%, Cumulative CPU 43.27 sec
2013-06-26 09:58:50,802 Stage-1 map = 100%,  reduce = 94%, Cumulative CPU 43.27 sec
2013-06-26 09:58:51,807 Stage-1 map = 100%,  reduce = 94%, Cumulative CPU 43.27 sec
2013-06-26 09:58:52,814 Stage-1 map = 100%,  reduce = 94%, Cumulative CPU 43.27 sec
2013-06-26 09:58:53,820 Stage-1 map = 100%,  reduce = 95%, Cumulative CPU 43.27 sec
2013-06-26 09:58:54,826 Stage-1 map = 100%,  reduce = 95%, Cumulative CPU 43.27 sec
2013-06-26 09:58:55,834 Stage-1 map = 100%,  reduce = 95%, Cumulative CPU 43.27 sec
2013-06-26 09:58:56,841 Stage-1 map = 100%,  reduce = 96%, Cumulative CPU 43.27 sec
2013-06-26 09:58:57,847 Stage-1 map = 100%,  reduce = 96%, Cumulative CPU 43.27 sec
2013-06-26 09:58:58,853 Stage-1 map = 100%,  reduce = 96%, Cumulative CPU 43.27 sec
2013-06-26 09:58:59,861 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 43.27 sec
2013-06-26 09:59:00,869 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 43.27 sec
2013-06-26 09:59:01,876 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 43.27 sec
2013-06-26 09:59:02,881 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 43.27 sec
2013-06-26 09:59:03,887 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 43.27 sec
2013-06-26 09:59:04,893 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 43.27 sec
2013-06-26 09:59:05,906 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 43.27 sec
2013-06-26 09:59:06,912 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 43.27 sec
2013-06-26 09:59:07,925 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 99.03 sec
2013-06-26 09:59:08,932 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 99.03 sec
2013-06-26 09:59:09,938 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 99.03 sec
2013-06-26 09:59:10,948 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 99.03 sec
MapReduce Total cumulative CPU time: 1 minutes 39 seconds 30 msec
Ended Job = job_201306260649_0003
Launching Job 2 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201306260649_0004, Tracking URL = http://namenode.com:50030/jobdetails.jsp?jobid=job_201306260649_0004
Kill Command = /home/hadoop/hadoop-1.2.0/libexec/../bin/hadoop job  -kill job_201306260649_0004
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2013-06-26 09:59:19,107 Stage-2 map = 0%,  reduce = 0%
2013-06-26 09:59:23,132 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:24,137 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:25,142 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:26,148 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:27,157 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:28,166 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:29,178 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:30,187 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:31,200 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2013-06-26 09:59:32,211 Stage-2 map = 100%,  reduce = 33%, Cumulative CPU 0.99 sec
2013-06-26 09:59:33,225 Stage-2 map = 100%,  reduce = 33%, Cumulative CPU 0.99 sec
2013-06-26 09:59:34,236 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.27 sec
2013-06-26 09:59:35,242 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.27 sec
2013-06-26 09:59:36,247 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.27 sec
2013-06-26 09:59:37,253 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.27 sec
2013-06-26 09:59:38,267 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.27 sec
MapReduce Total cumulative CPU time: 3 seconds 270 msec
Ended Job = job_201306260649_0004
Launching Job 3 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201306260649_0005, Tracking URL = http://namenode.com:50030/jobdetails.jsp?jobid=job_201306260649_0005
Kill Command = /home/hadoop/hadoop-1.2.0/libexec/../bin/hadoop job  -kill job_201306260649_0005
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2013-06-26 09:59:46,788 Stage-3 map = 0%,  reduce = 0%
2013-06-26 09:59:52,817 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 09:59:53,824 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 09:59:54,835 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 09:59:55,840 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 09:59:56,851 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 09:59:57,865 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 09:59:58,874 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 09:59:59,884 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 10:00:00,890 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2013-06-26 10:00:01,896 Stage-3 map = 100%,  reduce = 33%, Cumulative CPU 1.67 sec
2013-06-26 10:00:02,905 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.48 sec
2013-06-26 10:00:03,910 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.48 sec
2013-06-26 10:00:04,915 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.48 sec
2013-06-26 10:00:05,921 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.48 sec
2013-06-26 10:00:06,929 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.48 sec
MapReduce Total cumulative CPU time: 3 seconds 480 msec
Ended Job = job_201306260649_0005
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 99.03 sec   HDFS Read: 6771646 HDFS Write: 541 SUCCESS
Job 1: Map: 1  Reduce: 1   Cumulative CPU: 3.27 sec   HDFS Read: 1002 HDFS Write: 541 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 3.48 sec   HDFS Read: 1002 HDFS Write: 199 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 45 seconds 780 msec
OK
Kern            36
San Bernardino  35
Imperial        28
Inyo            20
Los Angeles     18
Monterey        14
Riverside       14
Santa Clara     12
Fresno          11
San Benito      11
San Diego       7
Santa Cruz      5
San Luis Obispo 3
Ventura         3
Orange          2
San Mateo       1
Time taken: 183.06 seconds, Fetched: 16 row(s)
From the output above we can see that this one SQL statement launched three jobs, each with its own map and reduce phases, and that analyzing the small sample dataset Esri provides under the Hadoop architecture took 183 seconds. That confirms Hadoop's limitation when it comes to small data volumes: every technology has the scenarios it is suited to, and here it is a bit of a hero with no place to show its strength.
Extension: anyone working in telecom or electric-utility GIS will find nothing unfamiliar here, because this is exactly the development pattern they are used to, much like driving ArcSDE's ST_Geometry with SQL. They are used to working directly in the database with SQL statements and the relational operators SDE provides; Hive offers an SQL-like way of doing the same thing, and the operators involved are largely the same. If the data volumes in those industries are very large and the analysis fairly complex, this model is well worth considering.
Esri also provides GP (geoprocessing) tools for Hadoop that can be driven from a visual interface; the functionality they offer is fairly basic.
下载地址:https://github.com/Esri/geoprocessing-tools-for-hadoop
1: Enable WebHDFS
Add the following to the hdfs-site.xml configuration file on the jobtracker machine
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
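After restarting HDFS you can check that WebHDFS is answering with a plain REST call (the hostname matches this article's setup; 50070 is the default namenode web port in Hadoop 1.x):
curl -i "http://namenode.com:50070/webhdfs/v1/?op=LISTSTATUS"    # lists the HDFS root directory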
2: Deploy the requests and webhdfs Python packages bundled with the GP toolkit
Copy those two folders into Python's site-packages folder:
C:\Python27\ArcGIS10.1\Lib\site-packages
For details on how to use the tools, see the help that ships with them.
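A quick, hedged way to confirm the two packages are importable from the ArcGIS Python installation (interpreter path and package names as given above):
C:\Python27\ArcGIS10.1\python.exe -c "import requests, webhdfs; print 'ok'"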