19c RAC: CRSD Fails to Start Because of ASM Password Authentication After the OCR Disk Group Is Replaced

Published: 2023-09-13 13:30:27  Author: 石云华

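Preface

On a 19.19 RAC cluster, the ASM disk group holding the OCR was changed from +GRID to +DG_GRID, and the original +GRID disk group was then forcibly dropped. As a result, the cluster could no longer start.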

 

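Procedure

1. During startup, the CSS service started normally, but the CRS service failed to start. At that point the ASM alert log on node 2 (alertasm2.log) reported the following errors.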

2023-06-23T17:44:33.667188+08:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_13944.trc:

ORA-17503: ksfdopn:2 Failed to open file +GRID/orapwasm

ORA-15001: diskgroup "GRID" does not exist or is not mounted

ORA-06512: at line 4

ORA-06512: at "SYS.X$DBMS_DISKGROUP", line 679

ORA-06512: at line 2

2023-06-23T17:44:34.129085+08:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_13944.trc:

ORA-17503: ksfdopn:2 Failed to open file +GRID/orapwasm

ORA-15001: diskgroup "GRID" does not exist or is not mounted

ORA-06512: at line 4

ORA-06512: at "SYS.X$DBMS_DISKGROUP", line 679

ORA-06512: at line 2

ORA-01017: invalid username/password; logon denied

2023-06-23T17:44:34.490668+08:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_13944.trc:

ORA-17503: ksfdopn:2 Failed to open file +GRID/orapwasm

ORA-15001: diskgroup "GRID" does not exist or is not mounted

ORA-06512: at line 4

ORA-06512: at "SYS.X$DBMS_DISKGROUP", line 679

ORA-06512: at line 2

^C

[grid@19crac2 trace]$

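The log shows that cluster startup needs the ASM instance's password file, but the old password file was stored in the +GRID disk group, and that disk group has already been dropped.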
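The disk group that the failing password-file lookups point at can be pulled out of the alert log mechanically. The following is a minimal sketch, assuming the ORA-17503 message format seen in the log above; the function name `missing_pwfile_diskgroups` is made up for illustration.

```shell
# List the distinct disk groups named in
# "ORA-17503 ... Failed to open file +DG/orapw..." lines.
# Reads alert-log text on stdin; the function name is illustrative only.
missing_pwfile_diskgroups() {
    grep 'ORA-17503' \
        | sed -n 's/.*Failed to open file +\([^/]*\)\/orapw.*/\1/p' \
        | sort -u
}
```

It could be fed the ASM alert log on stdin, for example `missing_pwfile_diskgroups < alert.log`, where `alert.log` stands for whatever the alert log path is on the node being checked.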

 

2. Create a new password file for the ASM instance and update the related information in the OCR.

[root@19crac2 ~]# srvctl config asm

ASM home: <CRS home>

Password file: +grid/orapwASM

Backup of Password file: +grid/orapwASM_backup

ASM listener: LISTENER

ASM instance count: 3

Cluster ASM listener: ASMNET1LSNR_ASM

 

[root@19crac2 ~]#

[root@19crac2 ~]# orapwd file='+DG_GRID/orapwASM'  entries=5 password=welcome1

[root@19crac2 ~]# srvctl modify asm -pwfile +DG_GRID/orapwASM

[root@19crac2 ~]# srvctl modify asm -pwfilebackup +DG_GRID/orapwASM_backup
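After the change, it is worth confirming that the registered password file really points at the new disk group. A small sketch, assuming the `srvctl config asm` output format shown above; the helper names `pwfile_diskgroup` and `pwfile_in_diskgroup` are hypothetical.

```shell
# Print the disk group of the "Password file:" entry from
# `srvctl config asm` output read on stdin. The "Backup of Password file:"
# line is skipped because the pattern is anchored at the start of the line.
pwfile_diskgroup() {
    sed -n 's/^Password file: *+\([^/]*\)\/.*/\1/p'
}

# Succeed only if the registered password file lives in the expected
# disk group. The comparison is case-insensitive because the output
# above prints the disk group in lowercase ("+grid").
pwfile_in_diskgroup() {
    local expected actual
    expected=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    actual=$(pwfile_diskgroup | tr '[:lower:]' '[:upper:]')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

For example, `srvctl config asm | pwfile_in_diskgroup DG_GRID && echo "password file is in +DG_GRID"`.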

 

3. Try restarting the cluster again. The CRSD service still fails to start, and crsd.trc shows the following errors.

2023-06-24 06:50:11.232*:kgfn.c@6088: kgfnGetBeqData: kgfnTgtInit failed, inst=NULL flags=0x6000

2023-06-24 06:50:11.235 :   CLSNS:3425988352: clsns_SetTraceLevel:trace level set to 1.

2023-06-24 06:50:11.363 :  OCRRAW:3425988352: kgfnConnect2: kgfnGetBeqData failed

 

2023-06-24 06:50:11.363*:kgfn.c@5268: kgfnConnect2: kgfnGetBeqData failed

2023-06-24 06:50:11.423 :  OCRRAW:3425988352: kgfnConnect2Int: cstr=(DESCRIPTION=(TCP_USER_TIMEOUT=1)(CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(ADDRESS_LIST=(LOAD_BALANCE=ON)(ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.192)(PORT=1525)))(CONNECT_DATA=(SERVICE_NAME=+ASM)))

 

2023-06-24 06:50:11.423*:kgfn.c@7122: kgfnConnect2Int: cstr=(DESCRIPTION=(TCP_USER_TIMEOUT=1)(CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(ADDRESS_LIST=(LOAD_BALANCE=ON)(ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.192)(PORT=1525)))(CONNECT_DATA=(SERVICE_NAME=+ASM)))
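The `cstr=` entries in the trace show which ASM listener address CRSD is trying to connect to. As a rough sketch, the host and port can be extracted as follows, assuming each `(HOST=...)(PORT=...)` pair sits on a single line of the trace; the function name `asm_connect_endpoints` is made up for illustration.

```shell
# Print each distinct HOST:PORT pair found in connect descriptors
# read from stdin (for example the kgfnConnect2Int "cstr=" trace lines).
# Relies on `grep -o`, which GNU and BSD grep both provide.
asm_connect_endpoints() {
    grep -o '(HOST=[^)]*)(PORT=[0-9]*)' \
        | sed 's/(HOST=\([^)]*\))(PORT=\([0-9]*\))/\1:\2/' \
        | sort -u
}
```

Running `asm_connect_endpoints < crsd.trc` lists the endpoints, which can then be compared with the address the ASM listener is actually registered on.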

 

4. A search on MOS turned up Grid Infrastructure (GI) startup fails because crsd fails to start in a flex asm environment (Doc ID 2392762.1). The note lists three possible causes: (1) the SQLNET.AUTHENTICATION_SERVICES parameter in sqlnet.ora is set to none; (2) the ASM password does not match; (3) the ASM listener's network segment does not match.

In this incident the second cause applied. The fix followed the method in How to Recreate Shared ASM Password File in 19c Grid Infrastructure (GI) (Doc ID 2717306.1).
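Cause (1) from the MOS note is quick to rule out with a text check on sqlnet.ora. The sketch below assumes the parameter is written on one line, as in `SQLNET.AUTHENTICATION_SERVICES = (NONE)`; the function name `auth_services_is_none` is hypothetical, and the location of sqlnet.ora depends on the installation.

```shell
# Succeed if the given sqlnet.ora sets SQLNET.AUTHENTICATION_SERVICES
# to NONE. Matching is case-insensitive, parentheses are optional,
# and commented-out lines are ignored.
auth_services_is_none() {
    grep -v '^[[:space:]]*#' "$1" \
        | grep -i -q -E '^[[:space:]]*SQLNET\.AUTHENTICATION_SERVICES[[:space:]]*=[[:space:]]*\(?[[:space:]]*none[[:space:]]*\)?[[:space:]]*$'
}
```

For example, `auth_services_is_none /path/to/sqlnet.ora && echo "cause (1) applies"`, where the path is a placeholder for the Grid home's sqlnet.ora.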

 

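5. A new password file had already been created and registered while handling this failure, so why was an ASM password mismatch still reported? The main reason is that, starting with 19c RAC, the method for recreating the ASM password file differs from earlier releases. Starting with 19.8, asmcmd has a new feature that lets users run asmcmd credverify and asmcmd credfix to verify and fix the ASM password credentials.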

GI_HOME/bin/asmcmd --nocp credverify

GI_HOME/bin/asmcmd --nocp credfix

 

6. Once the ASM password mismatch was fixed, the GI cluster restarted successfully.