我是靠谱客的博主 糟糕樱桃,最近开发中收集的这篇文章主要介绍flink jobmanager高可用配置及故障演练1.配置2.故障演练结论&问题,觉得挺不错的,现在分享给大家,希望可以做个参考。

概述

1.配置

flink-conf.yaml添加配置:

high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/recovery
high-availability.zookeeper.quorum: xxx
high-availability.zookeeper.path.root: /flink-ha
yarn.application-attempts: 2

注意:
1)yarn模式下不需要配置high-availability.cluster-id,standalone模式才需要
2)yarn.application-attempts在ha没启用时默认为1,启用ha后默认为2。没有开启ha的情况下,设置yarn.application-attempts大于1 ,会导致任务从旧的checkpoint点重新启动,是一个比较危险的操作,所以非ha情况下最好不要设置yarn.application-attempts
3)yarn.application-attempts的配置受限于yarn的yarn.resourcemanager.am.max-attempts配置

2.故障演练

不开启HA情况下,NodeManager或者JobManager服务挂掉,会导致任务直接挂掉,不具备故障自动恢复能力
下面的故障演练都是基于HA的

(1)杀JobManager进程

JobManager和TaskManager会自动转移到新的节点重启,且是从最新的checkpoint恢复
整个过程是很快的,基本感知不到有发生故障转移

(2)杀JobManager所在机器的NodeManager进程

JobManager进程不受影响,任务也能正常执行
十几分钟之后,JobManager和TaskManager会自动转移到新的节点重启,且是从最新的checkpoint恢复

观察ResourceManager日志:

2022-05-18 11:30:39,662 INFO
util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(148)) - Expired:hdfs08-dev.yingzi.com:45454 Timed out after 600 secs
2022-05-18 11:30:39,666 INFO
rmnode.RMNodeImpl (RMNodeImpl.java:deactivateNode(1077)) - Deactivating Node hdfs08-dev.yingzi.com:45454 as it is now LOST
2022-05-18 11:30:39,667 INFO
rmnode.RMNodeImpl (RMNodeImpl.java:handle(671)) - hdfs08-dev.yingzi.com:45454 Node Transitioned from RUNNING to LOST
2022-05-18 11:30:39,675 INFO
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e142_1652757164854_0006_11_000001 Container Transitioned from RUNNING to KILLED
2022-05-18 11:30:39,680 INFO
capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1960)) - Removed node hdfs08-dev.yingzi.com:45454 clusterResource: <memory:143360, vCores:224>

默认10分钟(取决于yarn.nm.liveness-monitor.expiry-interval-ms配置)才会把NodeManager标记为LOST,然后移除节点和容器

期间如果手动启动NodeManager,则JobManager和TaskManager不会转移

如果不开启HA,则10分钟后任务会直接挂掉

(3)先杀JobManager所在机器的NodeManager进程,再杀JobManager进程

刚开始TaskManager进程还存活,但是任务无法正常执行,5分钟之后TaskManager进程挂掉
查看TaskManager日志,可以看出主要是与JobManager的连接断开导致:

2022-05-18 11:37:15,056 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job b5a0c61ab1b1495dd9c2f8e45d349957 from job leader monitoring.
2022-05-18 11:37:15,056 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-05-18 11:37:15,056 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/b5a0c61ab1b1495dd9c2f8e45d349957/job_manager_lock'}.
2022-05-18 11:37:15,056 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Close JobManager connection for job b5a0c61ab1b1495dd9c2f8e45d349957.
2022-05-18 11:37:15,079 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Close ResourceManager connection 27c66057868b763b46426beea8666578.
2022-05-18 11:37:25,014 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Cannot find task to fail for execution fc957691a7fbc7f3ac92f904b7840190 with exception:
java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.jobmaster.JobMasterGateway.updateTaskExecutionState(org.apache.flink.runtime.taskmanager.TaskExecutionState) timed out.
...
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@hdfs05-dev.yingzi.com:16897/user/rpc/jobmanager_2#-1091134734]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
...
2022-05-18 11:42:15,096 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner
[] - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.

十几分钟之后,JobManager和TaskManager会自动转移到新的节点重启,且是从最新的checkpoint恢复,任务也恢复正常

观察ResourceManager日志:

2022-05-17 17:06:08,398 INFO
util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(148)) - Expired:hdfs03-dev.yingzi.com:45454 Timed out after 600 secs
2022-05-17 17:06:08,399 INFO
rmnode.RMNodeImpl (RMNodeImpl.java:deactivateNode(1077)) - Deactivating Node hdfs03-dev.yingzi.com:45454 as it is now LOST
2022-05-17 17:06:08,399 INFO
rmnode.RMNodeImpl (RMNodeImpl.java:handle(671)) - hdfs03-dev.yingzi.com:45454 Node Transitioned from RUNNING to LOST

默认10分钟(取决于yarn.nm.liveness-monitor.expiry-interval-ms配置)才会把NodeManager标记为LOST,然后移除节点和容器

如果把yarn.nm.liveness-monitor.expiry-interval-ms或者yarn.am.liveness-monitor.expiry-interval-ms(默认均为10分钟)值其中一个调小,JobManager和TaskManager会提前重启
即NodeManager和JobManager同时挂掉,任务恢复时间由nn和am的超时时间共同决定

期间如果手动启动NodeManager,任务也能够自动恢复

(4)杀TaskManager进程

TaskManager转移到新的节点重启,但是任务无法正常执行,并且checkpoint操作也是失败的
看日志JobManager还在尝试与旧节点的TaskManager通信,每间隔10秒报一次connection refused的异常
直到10分钟(与heartbeat.timeout值相关)后日志显示

2022-05-18 14:39:05,553 WARN
akka.remote.ReliableDeliverySupervisor
[] - Association with remote system [akka.tcp://flink@hdfs05-dev.yingzi.com:23065] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2022-05-18 14:40:06,013 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - JobManager for job b5a0c61ab1b1495dd9c2f8e45d349957 with leader id 84a8fd94fa276bf476d3af016c1441be lost leadership.
2022-05-18 14:40:06,017 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Close JobManager connection for job b5a0c61ab1b1495dd9c2f8e45d349957.
2022-05-18 14:40:06,018 INFO
org.apache.flink.runtime.taskmanager.Task
[] - Attempting to fail task externally Source: TableSourceScan(table=[[default_catalog, default_database, anc_herd_property_val]], fields=[id, create_user, modify_user, create_time, modify_time, app_id, tenant_id, deleted, animal_id, property_id, property_val]) -> MiniBatchAssigner(interval=[2000ms], mode=[ProcTime]) -> DropUpdateBefore (1/1)#0 (9c5ac532a5f8459790af009cbc74fd3a).
2022-05-18 14:40:06,020 WARN
org.apache.flink.runtime.taskmanager.Task
[] - Source: TableSourceScan(table=[[default_catalog, default_database, anc_herd_property_val]], fields=[id, create_user, modify_user, create_time, modify_time, app_id, tenant_id, deleted,animal_id, property_id, property_val]) -> MiniBatchAssigner(interval=[2000ms], mode=[ProcTime]) -> DropUpdateBefore (1/1)#0 (9c5ac532a5f8459790af009cbc74fd3a) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for b5a0c61ab1b1495dd9c2f8e45d349957.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
...
Caused by: java.lang.Exception: Job leader for job id b5a0c61ab1b1495dd9c2f8e45d349957 lost leadership.
...
2022-05-18 14:40:16,042 INFO
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=837.120mb (877783849 bytes), taskOffHeapMemory=0 bytes, managedMemory=2.010gb (2158221148 bytes), networkMemory=343.040mb (359703515 bytes)}, allocationId: 85159a1bbf11b3993c17d8adcbc9694c, jobId: b5a0c61ab1b1495dd9c2f8e45d349957).
2022-05-18 14:40:16,050 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job b5a0c61ab1b1495dd9c2f8e45d349957 from job leader monitoring.
2022-05-18 14:40:16,050 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-05-18 14:40:16,050 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/b5a0c61ab1b1495dd9c2f8e45d349957/job_manager_lock'}.
2022-05-18 14:40:16,109 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Cannot find task to fail for execution 373d61b162d19ffda23ea2906f0cb7a9 with exception:
java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.jobmaster.JobMasterGateway.updateTaskExecutionState(org.apache.flink.runtime.taskmanager.TaskExecutionState) timed out.
at com.sun.proxy.$Proxy36.updateTaskExecutionState(Unknown Source) ~[?:?]
at org.apache.flink.runtime.taskexecutor.TaskExecutor.updateTaskExecutionState(TaskExecutor.java:1852) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
at org.apache.flink.runtime.taskexecutor.TaskExecutor.unregisterTaskAndNotifyFinalState(TaskExecutor.java:1885) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
...
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@hdfs05-dev.yingzi.com:23065/user/rpc/jobmanager_2#-613206152]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
...
2022-05-18 14:45:06,090 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Fatal error occurred in TaskExecutor akka.tcp://flink@hdfs04-dev.yingzi.com:32782/user/rpc/taskmanager_0.
1048 org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
1049
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1440) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
1050
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$17(TaskExecutor.java:1425) ~[flink-dist_2.12-1.13.2.jar:1.13.2
...
2022-05-18 14:45:06,106 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Stopping TaskExecutor akka.tcp://flink@hdfs04-dev.yingzi.com:32782/user/rpc/taskmanager_0.
1102 2022-05-18 14:45:06,110 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Stop job leader service.
1103 2022-05-18 14:45:06,110 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
1104 2022-05-18 14:45:06,110 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetri
evalDriver{retrievalPath='/leader/resource_manager_lock'}.
1105 2022-05-18 14:45:06,110 INFO
org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
1106 2022-05-18 14:45:06,114 INFO
org.apache.flink.runtime.io.disk.FileChannelManagerImpl
[] - FileChannelManager removed spill file directory /hadoop/yarn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-io-2f303ad9-eef4-4048-846e-85dc29da5038
1107 2022-05-18 14:45:06,114 INFO
org.apache.flink.runtime.io.network.NettyShuffleEnvironment
[] - Shutting down the network environment and its components.
1108 2022-05-18 14:45:06,114 INFO
org.apache.flink.runtime.io.network.netty.NettyClient
[] - Successful shutdown (took 0 ms).
1109 2022-05-18 14:45:06,117 INFO
org.apache.flink.runtime.io.network.netty.NettyServer
[] - Successful shutdown (took 2 ms).
1110 2022-05-18 14:45:06,125 INFO
org.apache.flink.runtime.io.disk.FileChannelManagerImpl
[] - FileChannelManager removed spill file directory /hadoop/yarn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-netty-shuffle-5b4f703f-8827-4139-8237-6b9bfdfe
fe6e
1111 2022-05-18 14:45:06,125 INFO
org.apache.flink.runtime.taskexecutor.KvStateService
[] - Shutting down the kvState service and its components.
1112 2022-05-18 14:45:06,125 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Stop job leader service.
1113 2022-05-18 14:45:06,129 INFO
org.apache.flink.runtime.filecache.FileCache
[] - removed file cache directory /hadoop/ya
rn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-dist-cache-f2a2be98-66ed-45c2-815b-a981e7a90d05

然后才恢复正常
可以看出是触发了心跳超时,进行触发故障转移,从checkpoint恢复job

(5)杀TaskManager所在机器的NodeManager进程

刚杀完TaskManager进程正常,任务也可以正常工作
十几分钟之后,TaskManager自动转移到新的节点重启
现象基本和上面(2)类似

(6)先杀TaskManager所在机器的NodeManager进程,再杀TaskManager进程

flink ui上显示TaskManager状态正常且还在原来节点,但是其实任务已经不正常,并且checkpoint操作失败
除了tm不会立刻转移,现象基本和上面(4)一致

结论&问题

开启HA的情况下,即使出现某个服务进程,甚至某台机器直接挂掉的情况,任务也能自动恢复。具体恢复的时间取决于挂掉的服务情况和几个超时时间参数的配置;

如果某台机器直接宕机,任务恢复可能要十几分钟,可以通过调整参数来缩短故障恢复时间,但是需要进一步研究超时时间缩短可能带来的副作用

最后

以上就是糟糕樱桃为你收集整理的flink jobmanager高可用配置及故障演练1.配置2.故障演练结论&问题的全部内容,希望文章能够帮你解决flink jobmanager高可用配置及故障演练1.配置2.故障演练结论&问题所遇到的程序开发问题。

如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。

本图文内容来源于网友提供,作为学习参考使用,或来自网络收集整理,版权属于原作者所有。
点赞(53)

评论列表共有 0 条评论

立即
投稿
返回
顶部