Analysis of Cluster Node Reboots Caused by a Switch Restart Breaking the Heartbeat Network

Overview

Problem description:
OS: Red Hat 6.9
DB: Oracle 11.2.0.4.6
At 18:10 a network switch was restarted. It was known in advance that this switch carried, among other links, node 1's heartbeat (private interconnect) network, so a reboot of node 1 was expected. During the switch restart, cross-node pings over the heartbeat network stayed normal with no packet loss, yet nodes 2 and 3 were evicted from the cluster and rebooted. The task was to analyze and identify the root cause.

Troubleshooting process:
Node 3 was the first to report, at 18:11:09.146, that it could not communicate with nodes 1 and 2; its CSS process stopped at 18:11:24.101.
Node 2 (at 18:11:25.205) and node 1 (at 18:11:25.138) then reported communication failures with the other nodes, and node 1 evicted nodes 2 and 3 from the cluster.
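To reconstruct this timeline, the eviction-related messages can be pulled straight out of each node's Clusterware alert log. A minimal sketch, assuming the default 11.2 Grid home and alert log locations used in this environment (run on each node, substituting its hostname):

[grid@orcl1 ~]$ # CRS-1607/1608/1609/1625 cover evictions and node shutdowns
[grid@orcl1 ~]$ grep -E 'CRS-16(07|08|09|25)' /u01/app/11.2.0/grid/log/orcl1/alertorcl1.log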

Node 1 log:
2020-07-23 18:11:39.935: [cssd(17821)]CRS-1607:Node orcl2 is being evicted in cluster incarnation 489868346; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/orcl1/cssd/ocssd.log.
2020-07-23 18:11:39.935: [cssd(17821)]CRS-1607:Node orcl3 is being evicted in cluster incarnation 489868346; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/orcl1/cssd/ocssd.log.
2020-07-23 18:11:41.948: [cssd(17821)]CRS-1625:Node orcl3, number 3, was shut down
2020-07-23 18:11:42.452: [cssd(17821)]CRS-1625:Node orcl2, number 2, was shut down
2020-07-23 18:11:42.462: [cssd(17821)]CRS-1601:CSSD Reconfiguration complete. Active nodes are orcl1 .

Node 2/3 log (CSS process stopped, CRSD resources cleaned up, node evicted by node 1):
2020-07-23 18:11:39.932: [cssd(15893)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/orcl2/cssd/ocssd.log.
2020-07-23 18:11:39.932: [cssd(15893)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/orcl2/cssd/ocssd.log
2020-07-23 18:11:39.996: [cssd(15893)]CRS-1652:Starting clean up of CRSD resources.
2020-07-23 18:11:40.665: [cssd(15893)]CRS-1608:This node was evicted by node 1, orcl1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/orcl2/cssd/ocssd.log.
2020-07-23 18:11:40.667: [cssd(15893)]CRS-1608:This node was evicted by node 1, orcl1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/orcl2/cssd/ocssd.log.
2020-07-23 18:11:40.673: [cssd(15893)]CRS-1608:This node was evicted by node 1, orcl1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/orcl2/cssd/ocssd.log.
2020-07-23 18:11:41.244: [/u01/app/11.2.0/grid/bin/oraagent.bin(16900)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/orcl2/agent/crsd/oraagent_grid//oraagent_grid.log"
2020-07-23 18:11:41.863: [/u01/app/11.2.0/grid/bin/oraagent.bin(16900)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/orcl2/agent/crsd/oraagent_grid//oraagent_grid.log"
2020-07-23 18:11:41.867: [cssd(15893)]CRS-1654:Clean up of CRSD resources finished successfully.
2020-07-23 18:11:41.868: [cssd(15893)]CRS-1655:CSSD on node orcl2 detected a problem and started to shutdown.

Node 1 log (node 2/3 information cleaned up; only node 1 remains in the cluster):
2020-07-23 18:11:42.466: [crsd(18668)]CRS-5504:Node down event reported for node 'orcl2'.
2020-07-23 18:11:42.466: [crsd(18668)]CRS-5504:Node down event reported for node 'orcl3'.
2020-07-23 18:11:45.745: [crsd(18668)]CRS-2773:Server 'orcl2' has been removed from pool 'Generic'.
2020-07-23 18:11:45.745: [crsd(18668)]CRS-2773:Server 'orcl2' has been removed from pool 'ora.iboc'.
2020-07-23 18:11:45.746: [crsd(18668)]CRS-2773:Server 'orcl3' has been removed from pool 'Generic'.
2020-07-23 18:11:45.746: [crsd(18668)]CRS-2773:Server 'orcl3' has been removed from pool 'ora.iboc'.

Node 2/3 log ('ora.cluster_interconnect.haip' has failed messages):
2020-07-23 18:11:41.883: [/u01/app/11.2.0/grid/bin/orarootagent.bin(16910)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:564} in /u01/app/11.2.0/grid/log/orcl2/agent/crsd/orarootagent_root//orarootagent_root.log.
2020-07-23 18:11:41.884: [/u01/app/11.2.0/grid/bin/oraagent.bin(16900)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:10} in /u01/app/11.2.0/grid/log/orcl2/agent/crsd/oraagent_grid//oraagent_grid.log.
2020-07-23 18:11:41.884: [/u01/app/11.2.0/grid/bin/oraagent.bin(27483)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:13:52444} in /u01/app/11.2.0/grid/log/orcl2/agent/crsd/oraagent_oracle//oraagent_oracle.log.
2020-07-23 18:11:41.886: [ohasd(15608)]CRS-2765:Resource 'ora.crsd' has failed on server 'orcl2'.
2020-07-23 18:11:41.945: [cssd(15893)]CRS-1625:Node orcl3, number 3, was shut down
2020-07-23 18:11:42.088: [cssd(15893)]CRS-1660:The CSS daemon shutdown has completed
2020-07-23 18:11:42.200: [ohasd(15608)]CRS-2765:Resource 'ora.evmd' has failed on server 'orcl2'.
2020-07-23 18:11:42.203: [ohasd(15608)]CRS-2765:Resource 'ora.ctssd' has failed on server 'orcl2'.
2020-07-23 18:11:42.910: [crsd(3559)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /u01/app/11.2.0/grid/log/orcl2/crsd/crsd.log.
2020-07-23 18:11:43.003: [ohasd(15608)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'orcl2'.
2020-07-23 18:11:43.008: [/u01/app/11.2.0/grid/bin/oraagent.bin(15807)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/orcl2/agent/ohasd/oraagent_grid//oraagent_grid.log"
2020-07-23 18:11:43.017: [ohasd(15608)]CRS-2765:Resource 'ora.crsd' has failed on server 'orcl2'.
2020-07-23 18:11:43.201: [/u01/app/11.2.0/grid/bin/oraagent.bin(15807)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/orcl2/agent/ohasd/oraagent_grid//oraagent_grid.log"
2020-07-23 18:11:43.202: [ohasd(15608)]CRS-2765:Resource 'ora.asm' has failed on server 'orcl2'.
2020-07-23 18:11:43.226: [ctssd(3570)]CRS-2402:The Cluster Time Synchronization Service aborted on host orcl2. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/orcl2/ctssd/octssd.log.
2020-07-23 18:11:43.389: [/u01/app/11.2.0/grid/bin/oraagent.bin(15807)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/orcl2/agent/ohasd/oraagent_grid//oraagent_grid.log"
2020-07-23 18:11:43.514: [ohasd(15608)]CRS-2765:Resource 'ora.cssd' has failed on server 'orcl2'.
2020-07-23 18:11:43.527: [ohasd(15608)]CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'orcl2'.
2020-07-23 18:11:43.581: [/u01/app/11.2.0/grid/bin/oraagent.bin(15807)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/orcl2/agent/ohasd/oraagent_grid//oraagent_grid.log"
2020-07-23 18:11:44.241: [ohasd(15608)]CRS-2878:Failed to restart resource 'ora.ctssd'

Question 1:
Throughout the incident, pings across the heartbeat network kept succeeding, yet cluster nodes were still rebooted. The error that stands out is:
CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'orcl2'.
Querying the gv$cluster_interconnects view:
SQL> select * from gv$cluster_interconnects;
   INST_ID NAME    IP_ADDRESS       IS_PUBLIC SOURCE
---------- ------- ---------------- --------- ------
         1 eth1:1  169.254.134.108  NO
         3 eth1:1  169.254.151.97   NO
         2 eth1:1  169.254.31.191   NO
These are the virtual heartbeat (HAIP) addresses of the three nodes.
The corresponding resource is also visible in the cluster:
[grid@orcl1 ~]$ crsctl stat res -t -init | grep -1 ha
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       orcl1                    Started
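The same HAIP addresses should also be visible at the OS level as sub-interfaces of the private NIC. A quick check (the interface name eth1 comes from the view output above):

[grid@orcl1 ~]$ # each active HAIP appears as a 169.254.* address on a sub-interface such as eth1:1
[grid@orcl1 ~]$ ifconfig | grep -B1 '169\.254\.'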

Examining the log $GRID_HOME/log/orcl1/agent/ohasd/orarootagent_root/orarootagent_root.log confirms errors about the 169.254.* addresses being unreachable. This explains why, even though the 1.1.* physical private addresses pinged fine, the heartbeat was effectively broken and nodes were still rebooted.
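This is the check that was missing during the change window: pinging the physical private addresses is not enough, because ASM and database interconnect traffic rides on the HAIP addresses. A hedged sketch of the extra verification, using the remote HAIP addresses from gv$cluster_interconnects above:

[grid@orcl1 ~]$ # from node 1, test the HAIP addresses of nodes 2 and 3, not just the 1.1.* physical IPs
[grid@orcl1 ~]$ ping -c 3 -I eth1 169.254.31.191   # node 2 HAIP
[grid@orcl1 ~]$ ping -c 3 -I eth1 169.254.151.97   # node 3 HAIP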
Starting with version 11.2.0.2, Oracle introduced HAIP (Highly Available IP) to replace OS-level NIC bonding for the private interconnect; it is more capable and more portable. With ifconfig you can see sub-interfaces such as eth0:1 carrying 169.254.* addresses, which provide high availability and load balancing for the cluster interconnect. For this reason, when installing RAC on 11.2.0.2 or later, make sure the 169.254.* address range is not already in use.
With one private NIC, Grid creates one HAIP; with two private NICs, two HAIPs; with more than two, four HAIPs. Grid supports at most four private NICs, and the number of HAIP addresses the cluster actually uses is determined by the number of private NICs active on the first node to start. If more than four NICs are designated as Oracle private networks, the extras beyond four are not activated.
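To see how the private interfaces are registered with the cluster, and therefore how many HAIPs to expect, oifcfg can be consulted; a minimal sketch (output varies by environment):

[grid@orcl1 ~]$ # interfaces registered with the cluster and their roles (public / cluster_interconnect)
[grid@orcl1 ~]$ oifcfg getif
[grid@orcl1 ~]$ # subnets actually present on this node, including 169.254.0.0 when HAIP is active
[grid@orcl1 ~]$ oifcfg iflist -p -n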
Reference: [MOS] Redundant Interconnect ora.cluster_interconnect.haip (Doc ID 1210883.1)

Question 2:
A second look at the switch's MAC address table showed that node 3's private NIC MAC address was also learned on this switch, which explains why all three nodes lost contact with one another, rather than only nodes 2 and 3 losing contact with node 1.
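The switch-side check would look something like the following on a Cisco-style switch (the exact commands depend on the vendor's CLI and are an assumption here); the idea is to match the MAC addresses learned on the rebooted switch against each node's private NIC:

switch# show mac address-table
[grid@orcl1 ~]$ # on each node, read the private NIC's MAC address for comparison
[grid@orcl1 ~]$ ip link show eth1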
Node 2 log (rejoining the cluster after reboot):
2020-07-23 18:11:45.415: [cssd(3623)]CRS-1713:CSSD daemon is started in clustered mode
2020-07-23 18:12:01.052: [cssd(3623)]CRS-1707:Lease acquisition for node orcl2 number 2 completed
2020-07-23 18:12:02.317: [cssd(3623)]CRS-1605:CSSD voting file is online: ORCL:IBOC_CRS3; details in /u01/app/11.2.0/grid/log/orcl2/cssd/ocssd.log.
2020-07-23 18:12:02.321: [cssd(3623)]CRS-1605:CSSD voting file is online: ORCL:IBOC_CRS2; details in /u01/app/11.2.0/grid/log/orcl2/cssd/ocssd.log.
2020-07-23 18:12:02.326: [cssd(3623)]CRS-1605:CSSD voting file is online: ORCL:IBOC_CRS1; details in /u01/app/11.2.0/grid/log/orcl2/cssd/ocssd.log.
2020-07-23 18:12:06.705: [cssd(3623)]CRS-1601:CSSD Reconfiguration complete. Active nodes are orcl1 orcl2 orcl3 .
2020-07-23 18:12:08.473: [ctssd(3933)]CRS-2403:The Cluster Time Synchronization Service on host orcl2 is in observer mode.
2020-07-23 18:12:08.739: [ctssd(3933)]CRS-2407:The new Cluster Time Synchronization Service reference node is host orcl1.
2020-07-23 18:12:08.740: [ctssd(3933)]CRS-2401:The Cluster Time Synchronization Service started on host orcl2.
2020-07-23 18:12:10.470: [ohasd(15608)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2020-07-23 18:12:15.968: [/u01/app/11.2.0/grid/bin/oraagent.bin(15807)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/orcl2/agent/ohasd/oraagent_grid//oraagent_grid.log"
2020-07-23 18:12:16.226: [/u01/app/11.2.0/grid/bin/oraagent.bin(15807)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/orcl2/agent/ohasd/oraagent_grid//oraagent_grid.log"
2020-07-23 18:12:30.924: [crsd(4208)]CRS-1012:The OCR service started on node orcl2.
2020-07-23 18:12:30.937: [evmd(3568)]CRS-1401:EVMD started on node orcl2.
2020-07-23 18:12:33.877: [crsd(4208)]CRS-1201:CRSD started on node orcl2.
At this point nodes 2 and 3 have rejoined the cluster after rebooting; following this restart, node 1 is the cluster master node.
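After the reboot, membership and daemon health can be confirmed from any node; a minimal check:

[grid@orcl1 ~]$ # CSS/CRS/EVM status on every node
[grid@orcl1 ~]$ crsctl check cluster -all
[grid@orcl1 ~]$ # node list with node numbers and active status
[grid@orcl1 ~]$ olsnodes -n -s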

Question 3:
Why did only node 1 stay up while nodes 2 and 3 both rebooted?
Check with ocrconfig -showbackup. From the Oracle documentation:
Use the ocrconfig -showbackup command to display the backup location, timestamp, and the originating node name of the backup files. By default, this command displays information for both automatic and manual backups unless you specify auto or manual.
auto: Displays information about automatic backups that Oracle Clusterware created in the past 4 hours, 8 hours, 12 hours, and in the last day and week.
manual: Displays information about manual backups that you invoke using the ocrconfig -manualbackup command.
The output shows that the OCR backups were taken on node 1, from which we can conclude that node 1 is the cluster master node.
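Besides ocrconfig -showbackup, the master can be cross-checked in the crsd log; a sketch, assuming the default log location (the exact message wording varies by version):

[grid@orcl1 ~]$ ocrconfig -showbackup auto
[grid@orcl1 ~]$ # the node whose crsd.log claims OCR mastership is the current master
[grid@orcl1 ~]$ grep -i 'OCR MASTER' /u01/app/11.2.0/grid/log/orcl1/crsd/crsd.log | tail -5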

Conclusion:
When all three nodes lose communication with one another, the cluster keeps the master node up and reboots the remaining nodes; when only one node loses communication with the other two, that node alone is evicted.
