Comparing Vertica and Hive table interoperation methods (HDFS bulk load vs. shell pipe)



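Overview

1. Choosing a deployment model

2. Choosing a Vertica and Hive interoperation method

2.1 Method 1 and its performance: HDFS bulk load

2.2 Method 2 and its performance: shell pipe

2.3 Comparison of the two methods

3. Conclusion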

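1. Choosing a deployment model

Vertica and Hadoop clusters do not coexist well on the same machines: their disk layouts differ (Vertica uses RAID, Hadoop uses JBOD), so they need to be deployed separately.

Reference: https://www.cnblogs.com/harrychinese/p/vertica_hadoop_cooperation.html

Conclusion: run Vertica and Hadoop as separate clusters, and try Vertica's "Reading Directly from HDFS" feature to read data from the Hadoop cluster.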

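2. Choosing a Vertica and Hive interoperation method

2.1 Method 1 and its performance: HDFS bulk load

// Configure the HDFS path on the Vertica cluster

// Log in to the Vertica cluster machines and create a hadoop directory under /opt/vertica/config on every Vertica node

mkdir -p /opt/vertica/config/hadoop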

// Log in to a machine in the HDFS cluster

cd /opt/xx/hadoop/etc/hadoop

scp core-site.xml hdfs-site.xml mpp001:/opt/vertica/config/hadoop

scp core-site.xml hdfs-site.xml mpp002:/opt/vertica/config/hadoop

scp core-site.xml hdfs-site.xml mpp003:/opt/vertica/config/hadoop

...
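The scp commands above repeat one pattern per Vertica node, so they can be generated in a loop. The following is a minimal sketch, assuming the node list is just the three hosts shown (the real cluster may have more, as the "..." suggests); `print_scp_commands` is a hypothetical helper name, and it only prints the commands rather than running them.

```shell
# Hypothetical node list; extend it to match the real cluster.
NODES="mpp001 mpp002 mpp003"
CONF_DIR=/opt/vertica/config/hadoop

# Print one scp command per node (dry run).
# Pipe the output to `sh` to actually execute the copies.
print_scp_commands() {
  for node in $NODES; do
    echo "scp core-site.xml hdfs-site.xml ${node}:${CONF_DIR}"
  done
}

print_scp_commands
```

Printing first makes it easy to review the target hosts before anything is copied.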

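// Log in to the machine hosting the Vertica client

vsql -h mpp001 -U dbadmin -W

\c mytest

ALTER DATABASE mytest SET HadoopConfDir = '/opt/vertica/config/hadoop';

// Verify that the configuration is correct

SELECT VERIFY_HADOOP_CONF_DIR();

// Create the table; data is stored in Vertica's native format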

create table if not exists iotdata.t1 (

staxxx varchar,

roomcode_trans varchar,

…

l_date varchar not null,

staxxxxx varchar,

primary key (xx, staxxxxx,qu_id) disabled

) ORDER BY xx,xx,qu_id KSAFE 1 PARTITION BY l_date;

// HDFS bulk load: incremental import by partition, copying the HDFS data into the matching table partition

copy iotdata.t from 'hdfs://xx/path/db/table/l_date=2019-04-24/*/*' orc(hive_partition_cols='l_date,xxx');
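Because the load is incremental by partition, the same statement would presumably be reissued with a different l_date for each new partition. A minimal sketch of generating it follows; `build_copy_sql` is a hypothetical helper, and the elided path and column placeholders (xx, xxx) are kept exactly as they appear in the statement above.

```shell
# build_copy_sql DATE: print the COPY statement for one Hive partition.
# The hdfs path and the second partition column are placeholders from the
# original post, not real values.
build_copy_sql() {
  l_date="$1"
  printf "COPY iotdata.t FROM 'hdfs://xx/path/db/table/l_date=%s/*/*' ORC(hive_partition_cols='l_date,xxx');\n" "$l_date"
}

build_copy_sql 2019-04-24
```

The printed statement can then be passed to vsql with its -c option, in the same way as the COPY command in section 2.2.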

// Create an external table: the HDFS data is not moved. External tables do not support DDL features such as primary key and order by

drop table iotdata.tmp_t;

create external table if not exists iotdata.tmp_t (

staxxx varchar,

roomcode_trans varchar,

…

l_date varchar not null,

staxxxxx varchar,

primary key (xx, staxxxxx,qu_id)

) as copy from 'hdfs://xx/path/db/table/l_date=2019-04-24/*/*' orc(hive_partition_cols='l_date,xxx');

References:

1. https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/HadoopIntegrationGuide/libhdfs/HdfsURL.htm

2. https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/ExternalTables/UsingPartitions.htm?tocpath=Administrator%27s%20Guide%7CWorking%20with%20External%C2%A0Data%7CReading%20ORC%20and%20Parquet%20Formats%7C_____2

2.2 Method 2 and its performance: shell pipe

// Run on the machine hosting the Vertica client, which also has a Hive client installed; pv monitors throughput through the shell pipe

hive -e "select * FROM xx.xxx where l_date = '2019-04-24'" |
dd bs=1M | pv -lpetr -s 30518825 |
/opt/vertica/bin/vsql -h x.x.x.xx -U dbadmin -wxxx -d mytest -c \
"COPY iotdata.xxx FROM LOCAL STDIN DELIMITER E'\t' NULL 'NULL' DIRECT"

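The Hive CLI prints rows as tab-separated text with nulls written as the literal string NULL, which is what the DELIMITER and NULL options in the COPY statement are matched to. Since a stray tab or newline inside a string column would shift fields, a quick sanity check on a sample of the stream can help before loading. This is only a sketch: `count_bad_rows` is a hypothetical helper, and the expected column count is an assumption you would take from the target table.

```shell
# count_bad_rows N: read tab-separated rows on stdin and print how many
# of them do not have exactly N fields.
count_bad_rows() {
  expected="$1"
  awk -F '\t' -v n="$expected" 'NF != n { bad++ } END { print bad + 0 }'
}

# Two sample rows for a 2-column table; the second row is missing a field.
printf 'a\t1\nb\n' | count_bad_rows 2
```

In practice the sample would come from the same hive -e query with a limit clause, instead of the printf stand-in used here.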
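2.3 Comparison of the two methods

1. With HDFS bulk load, data types are not converted automatically, whether the data is copied into an internal table or read through an external table: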

The data types you specify for COPY or CREATE EXTERNAL TABLE AS COPY must exactly match the types in the ORC or Parquet data.

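Copy throughput was about 200,000 rows/s, and the load took 2.5 minutes.

Reference: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/ExternalTables/CreatingORCParquetTables.htm

2. With the shell pipe method, COPY converts field formats automatically (for example string to date, string to datetime, string to int). Columns in Hive are currently mostly stored as strings, but primary keys, sort columns, and partitions in Vertica have specific data type requirements for performance reasons, so automatic type conversion is an unavoidable requirement.

Copy throughput was about 70,000 rows/s; once the Hive query time is included, the effective throughput is only about 30,000 rows/s.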
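As a rough cross-check of these figures, the load time for each throughput can be computed from the row count. The sketch below assumes that the 30518825 passed to pv -s in section 2.2 is the row count of the partition used in both tests, which the post does not state explicitly; `load_seconds` is a hypothetical helper.

```shell
# Row count taken from the `pv -s` argument in section 2.2 (assumed to be
# the size of the partition loaded in both tests).
ROWS=30518825

# load_seconds RATE: whole seconds needed to load ROWS at RATE rows/s.
load_seconds() {
  echo $(( ROWS / $1 ))
}

for rate in 200000 70000 30000; do
  secs=$(load_seconds "$rate")
  echo "${rate} rows/s -> ${secs} s (about $(( secs / 60 )) min $(( secs % 60 )) s)"
done
```

Under that assumption, 200,000 rows/s works out to roughly 152 s, which matches the reported 2.5 minutes for bulk load, while the shell pipe takes about 7 minutes at 70,000 rows/s and about 17 minutes at the effective 30,000 rows/s.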