Overview
1. Background
HDFS audit logs are produced in high volume and at high speed. They land on the machine's system disk and must be persisted to HDFS promptly, otherwise the log files get rotated away or the disk fills up.
The collected logs feed data governance: detecting and cleaning up obsolete HDFS files and obsolete Hive tables.
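For reference, each hdfs-audit.log record is a single line; a typical line looks roughly like the one below (the exact field set varies by Hadoop version, and the values here are purely illustrative):

2024-01-15 10:23:45,678 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.0.0.12	cmd=open	src=/user/hive/warehouse/dw.db/t1/part-00000	dst=null	perm=null	proto=rpc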
2. Implementation
① Download the latest Flume release from the Apache Flume download page.
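For example, to fetch and unpack the 1.9.0 release into the install path used in the JVM settings below (the archive URL is one option; use whichever mirror the download page offers):

# download and unpack Flume 1.9.0 to /opt/app
wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
tar -xzf apache-flume-1.9.0-bin.tar.gz -C /opt/app/
cd /opt/app/apache-flume-1.9.0-bin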
② Configure audit_log_hdfs.conf
# One exec source, one memory channel, three HDFS sinks; multiple sinks drain the same channel in parallel for higher write throughput
a1.sources = r1
a1.sinks = k1 k2 k3
a1.channels = c1
# Source: run tail -F via an exec source to follow hdfs-audit.log, feeding channel c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/logs/hdfs/hdfs-audit.log
a1.sources.r1.channels = c1
# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
## Partition the output path by date (%y is the two-digit year, hence the literal "20" prefix)
a1.sinks.k1.hdfs.path = /user/hive/warehouse/ods.db/ods_hdfs_audit_log_d/stat_day=20%y%m%d
## File name prefix, plus an in-use prefix of '.': files still being written become hidden dot-files, so a Hive/Spark job reading the partition currently being written cannot list an in-progress temporary file that is then renamed to its final name, which would otherwise cause FileNotFoundException
a1.sinks.k1.hdfs.filePrefix = audit-log-sink1
a1.sinks.k1.hdfs.inUsePrefix = .
## With rollInterval and rollCount set to 0, files roll only when they reach rollSize; 134217728 bytes = 128 MB, matching the HDFS block size
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
## Timeout for HDFS calls; the files are large, so allow a generous 60 s
a1.sinks.k1.hdfs.callTimeout = 60000
a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1
a1.sinks.k2.hdfs.path = /user/hive/warehouse/ods.db/ods_hdfs_audit_log_d/stat_day=20%y%m%d
a1.sinks.k2.hdfs.filePrefix = audit-log-sink2
a1.sinks.k2.hdfs.inUsePrefix = .
a1.sinks.k2.hdfs.rollInterval = 0
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.callTimeout = 60000
a1.sinks.k3.type = hdfs
a1.sinks.k3.channel = c1
a1.sinks.k3.hdfs.path = /user/hive/warehouse/ods.db/ods_hdfs_audit_log_d/stat_day=20%y%m%d
a1.sinks.k3.hdfs.filePrefix = audit-log-sink3
a1.sinks.k3.hdfs.inUsePrefix = .
a1.sinks.k3.hdfs.rollInterval = 0
a1.sinks.k3.hdfs.rollSize = 134217728
a1.sinks.k3.hdfs.rollCount = 0
a1.sinks.k3.hdfs.useLocalTimeStamp = true
a1.sinks.k3.hdfs.fileType = DataStream
a1.sinks.k3.hdfs.callTimeout = 60000
# Channel configuration: event capacity, per-transaction size, and byte limits. byteCapacity = 100000000 (~95 MB); with byteCapacityBufferPercentage = 20, about 20% is reserved for event headers, leaving roughly 80 MB for event bodies.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacity = 100000000
a1.channels.c1.byteCapacityBufferPercentage = 20
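The sink path matches a Hive warehouse directory, so the collected logs can be queried through a partitioned external table. A minimal sketch, assuming each audit line is stored as a single string column (the column name and the partition date below are hypothetical):

hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_hdfs_audit_log_d (
  line STRING                          -- one raw audit-log line per record
)
PARTITIONED BY (stat_day STRING)
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/ods.db/ods_hdfs_audit_log_d';

-- register a day's partition once Flume has created it (date is illustrative)
ALTER TABLE ods.ods_hdfs_audit_log_d ADD IF NOT EXISTS PARTITION (stat_day='20240115');
"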
③ JVM configuration
Increase the JVM heap to prevent OOM and frequent GC pauses. Since the agent runs on the well-provisioned NameNode machine, the heap can be set fairly large.
Edit ./conf/flume-env.sh and add the following (note the GC-logging flags -XX:+PrintGCDetails / -XX:+PrintGCDateStamps are the JDK 8 forms):
export JAVA_OPTS="-Xms5g -Xmx20g -XX:+UseG1GC -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 -XX:G1ReservePercent=10 -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/opt/app/apache-flume-1.9.0-bin/bin/audit_log_gc_$(date +%Y%m%d-%H%M%S).log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:HeapDumpPath=/opt/app/apache-flume-1.9.0-bin/bin"
④ Startup
nohup ./flume-ng agent -c ../conf -f audit_log_hdfs.conf -n a1 -Dflume.root.logger=INFO,console > audit_log_hdfs.log 2>&1 &
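Once the agent is up, a quick sanity check that events are flowing (the jps class name is the standard one for a Flume agent; the partition path comes from the config above):

# the agent runs as org.apache.flume.node.Application
jps | grep Application
# check the agent log for errors
tail -n 50 audit_log_hdfs.log
# confirm today's partition is receiving files (in-progress files start with '.')
hdfs dfs -ls /user/hive/warehouse/ods.db/ods_hdfs_audit_log_d/stat_day=$(date +%Y%m%d)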