This post, collected by 靠谱客 blogger 大方大叔 during recent development work, gathers errors and failures encountered on the job. It is shared here as a reference.

Overview

Common OOM errors: off-heap memory overflow
●Error: ExecutorLostFailure (executor xxx exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-xxxx.
●Cause: The container used more memory than it requested, so YARN detected the overrun and killed it. Once the JVM heap size is set, the heap normally does not grow past that cap, so the excess almost always comes from off-heap memory.
●Fix: Set livy.session.conf.spark.executor.memoryOverhead (if an executor was killed) or livy.session.conf.spark.driver.memoryOverhead (if the driver was killed) to 2G.
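To see why raising memoryOverhead helps, it can be useful to estimate the limit YARN enforces on a container. The following is a minimal sketch, assuming Spark's documented defaults (overhead of 10% of the heap, with a 384 MiB minimum) and ignoring any rounding YARN applies to the request. The function names and the 10 GiB heap are hypothetical and chosen only for illustration.

```python
# Sketch: estimate the memory limit YARN enforces for one Spark container.
# Assumes Spark's documented defaults: overhead = max(384 MiB, 10% of heap).
# Rounding applied by the YARN scheduler is deliberately ignored here.

MIN_OVERHEAD_MIB = 384   # Spark's minimum memory overhead
OVERHEAD_FACTOR = 0.10   # Spark's default overhead factor


def default_overhead_mib(heap_mib: int) -> int:
    """Overhead Spark requests when memoryOverhead is not set explicitly."""
    return max(MIN_OVERHEAD_MIB, int(heap_mib * OVERHEAD_FACTOR))


def container_limit_mib(heap_mib: int, overhead_mib=None) -> int:
    """Heap plus overhead: roughly the limit the container is checked against."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(heap_mib)
    return heap_mib + overhead_mib


if __name__ == "__main__":
    heap = 10 * 1024  # hypothetical 10 GiB heap
    print(container_limit_mib(heap))        # with the default overhead
    print(container_limit_mib(heap, 2048))  # with overhead raised to 2G
```

With these assumed numbers, raising the overhead from the default to 2G adds about 1 GiB of off-heap headroom without changing the heap size.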

Common OOM errors: driver OOM
●Error:
Application application_1611814054852_5589113 failed 1 times due to AM Container for appattempt_1611814054852_5589113_000001 exited with exitCode: 137
For more detailed output, check application tracking page: http://bj1230:8088/proxy/application_1611814054852_5589113/ Then, click on links to logs of each attempt.
Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
Failing this attempt. Failing the application.
or:
Failing the application. java.lang.OutOfMemoryError: Java heap space
●Cause: The job reads too much data, or it broadcasts a large table.
●Fix:
1. Business level: filter on table partition columns wherever possible to reduce the number of input splits.
2. Code level: if the driver still runs out of memory after the amount of data read has been reduced, check the web UI for broadcast tables. If there are any, disable broadcasting and rerun the job, and remove the /*+broadcastjoin(a) */ hint from the SQL.
3. Parameters for reference:
3.1 How to add the parameters on the machine:
--hiveconf livy.session.driverMemory=10240M
--hiveconf livy.session.conf.spark.driver.memoryOverhead=4
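Removing the broadcast hint from step 2 by hand is error-prone when many queries are involved. Below is a minimal sketch of how that could be automated with a regular expression. The helper name is hypothetical, and the pattern only handles a hint comment that holds a single broadcast-style hint (broadcastjoin, broadcast, or mapjoin). The source does not say which setting it uses to disable broadcasting; in stock Spark, automatic broadcast joins are turned off by setting spark.sql.autoBroadcastJoinThreshold to -1.

```python
import re

# Matches a hint comment such as /*+broadcastjoin(a) */ or /*+ BROADCAST(a, b) */.
# Limitation: a comment combining several different hints is left untouched.
BROADCAST_HINT = re.compile(
    r"/\*\+\s*(?:broadcastjoin|broadcast|mapjoin)\s*\([^)]*\)\s*\*/\s*",
    re.IGNORECASE,
)


def strip_broadcast_hints(sql: str) -> str:
    """Return the SQL text with broadcast-style hint comments removed."""
    return BROADCAST_HINT.sub("", sql)


if __name__ == "__main__":
    # Hypothetical query, used only to illustrate the helper.
    query = "select /*+broadcastjoin(a) */ b.id, a.name from b join a on b.id = a.id"
    print(strip_broadcast_hints(query))
```

Other comments and non-broadcast hints pass through unchanged, so the helper is safe to run over queries that never used the hint.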

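Common OOM errors: executor OOM
●Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 40 in stage 5.0 failed 4 times, most recent failure: Lost task 40.3 in stage 5.0 (TID 444, cdh-bjht-xxxx, executor 61): ExecutorLostFailure (executor 61 exited caused by one of the running tasks) Reason: Container marked as failed: container_xxxx on host: cdh-bjht-xxxx. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
●Troubleshooting approach:
1. Business level:
1.1 Check whether the table contains many null values or meaningless rows, and filter them out.
1.2 Filter on table partition columns wherever possible to avoid full table scans.
1.3 Check for data skew.
2. Code level:
2.1 Check whether the code caches tables. Cached data occupies memory and can cause executor OOM, so try running the job without the cache first.
2.2 Broadcast the small table with /*+broadcastjoin(a) */ and put the small table on the right side of the join.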
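Items 1.1 and 1.3 are often the same problem: a large number of null or placeholder join keys ends up in a single task. As a rough illustration, the sketch below measures skew on a sample of join keys by comparing the most frequent key with the median key frequency. The function name, the sample data, and the choice of ratio are all assumptions made for this example, not something taken from the original post.

```python
from collections import Counter
from statistics import median


def skew_report(keys, top_n=3):
    """Return (ratio, top keys), where ratio = largest key count / median key count."""
    counts = Counter(keys)
    if not counts:
        return 0.0, []
    ratio = max(counts.values()) / median(counts.values())
    return ratio, counts.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical sample of join keys; None stands for a null key.
    sample = [None] * 90 + ["u1"] * 4 + ["u2"] * 3 + ["u3"] * 3
    ratio, top = skew_report(sample)
    print(f"max/median ratio: {ratio:.1f}")
    print("heaviest keys:", top)
```

A ratio far above 1 with a null key at the top of the list suggests filtering out the nulls before the join, as in item 1.1.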
