Exadata X6-2: RS-7445 [Serv CELLSRV hang detected] [It will be restarted]

Published: 2023-03-28 13:30:27 | Author: 石云华

1. An on-site colleague found that one storage node (cell) of the X6-2 had raised an RS-7445 error.

# cellcli -e list alerthistory

2023-03-27T23:01:44+08:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"

2. Check the alert log on that storage node:

2023-03-27T23:01:44.912828+08:00
[RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid: 16281) returned with error: 123
[RS] Service CELLSRV will be restarted.
Errors in file /opt/oracle/cell/log/diag/asm/cell/dm01celadm12/trace/rstrc_16269_omt.trc (incident=1):
RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []
Incident details in: /opt/oracle/cell/log/diag/asm/cell/dm01celadm12/incident/incdir_1/rstrc_16269_omt_i1.trc

2023-03-27T23:01:45.172217+08:00
State dump signal delivered to CELLSRV<16314> by pid - 16269, uid - 0
State dump signal delivered to CELLSRV<16314> by RS.
2023-03-27T23:01:45.947036+08:00
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d80be000, bioreq 0x600003dbf5d0 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 5584060416 bytes with size 131072 bytes membuf 0x601324800000, bioreq 0x6000042926c8 (errno: Input/output error [5])
Write Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 19931332608 bytes with size 512 bytes membuf 0x6001cbb51400, bioreq 0x600004647cf8 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d6ea2000, bioreq 0x6002cde5cb48 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d91de000, bioreq 0x600004526538 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 4483727360 bytes with size 16384 bytes membuf 0x6001d885a000, bioreq 0x600003e77518 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 33554432 bytes with size 512 bytes membuf 0x6001cbbece00, bioreq 0x6002cbe1c108 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 33554432 bytes with size 512 bytes membuf 0x6001cbad3400, bioreq 0x6002cc8b5ab8 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 5584060416 bytes with size 131072 bytes membuf 0x601379500000, bioreq 0x6002d0c7f578 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 4483727360 bytes with size 16384 bytes membuf 0x6001d7bc6000, bioreq 0x600003da6d60 (errno: Input/output error [5])
Max number of IO Error messages for FD_00_dm01celadm12 have been logged, further IO error messages for this device are temporary disabled
Mon Mar 27 23:01:45 2023 961 msec State dump completed for CELLSRV<16314>
2023-03-27T23:02:13.900399+08:00
[RS] Stopped Service CELLSRV
2023-03-27T23:02:13.911836+08:00
[RS] Started monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt with pid 12591
[RS] Previously detected 1 hang(s) for service CELLSRV. Using heartbeat timeout of 8 seconds.

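As the log shows, at the time RS-7445 was reported, the flash disk /dev/nvme3n1 (cell disk FD_00_dm01celadm12) was failing I/O requests with errno 5 (Input/output error): mostly reads, plus one write.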
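When the alert log is long, tallying the I/O error lines per device makes it easier to confirm that a single flash disk is responsible. The following is a minimal Python sketch, not an Oracle tool: the regular expression is modeled only on the "Read Error on Cell Disk" and "Write Error on Cell Disk" lines shown above, and the function name summarize_io_errors is hypothetical. Note that CELLSRV stops logging these messages after a per-device limit, so the counts are a lower bound.

```python
import re
from collections import Counter

# Pattern modeled on the "Read/Write Error on Cell Disk ..." lines in the
# alert log excerpt above; other log formats are not handled.
IO_ERROR_RE = re.compile(
    r"^(?P<op>Read|Write) Error on Cell Disk (?P<celldisk>\S+) "
    r"\((?P<device>[^)]+)\) at device offset (?P<offset>\d+) bytes "
    r"with size (?P<size>\d+) bytes"
)


def summarize_io_errors(lines):
    """Count I/O errors per (cell disk, device, operation).

    Returns a Counter of error counts and a dict mapping the same keys
    to the set of distinct device offsets that failed.
    """
    counts = Counter()
    offsets = {}
    for line in lines:
        m = IO_ERROR_RE.match(line.strip())
        if m is None:
            continue  # not an I/O error line
        key = (m["celldisk"], m["device"], m["op"])
        counts[key] += 1
        offsets.setdefault(key, set()).add(int(m["offset"]))
    return counts, offsets


# Usage with two lines copied from the log above.
sample = [
    "Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d80be000, bioreq 0x600003dbf5d0 (errno: Input/output error [5])",
    "Write Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 19931332608 bytes with size 512 bytes membuf 0x6001cbb51400, bioreq 0x600004647cf8 (errno: Input/output error [5])",
]
counts, offsets = summarize_io_errors(sample)
for key, n in sorted(counts.items()):
    print(key, n, "distinct offsets:", len(offsets[key]))
```

If every key in the result points at the same cell disk and device, as it does in this incident, the failure is isolated to that one flash disk.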

3. A search of My Oracle Support (MOS) turns up two matching documents: "Exadata: Cell Service crash with RS-7445 [SERV CELLSRV HANG DETECTED] during a flash disk failure" (Doc ID 2486713.1) and "Exadata: Database performance issues or outages after a flash disk failure, Cell Service may crash with RS-7445 [Serv CELLSRV hang detected]" (Doc ID 2584475.1).

In short, I/O failures on the flash disk caused the CELLSRV service to hang.
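The timestamps in the alert log from step 2 also give a rough idea of how long the restart sequence took. The sketch below subtracts the time at which RS reported the hang from the time at which it started the new monitoring process. The helper name seconds_between is hypothetical, and the log excerpt does not show when CELLSRV itself was serving I/O again, so the result is only a lower bound on the disruption.

```python
from datetime import datetime


def seconds_between(start_iso, end_iso):
    """Return the elapsed seconds between two ISO 8601 timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds()


# Timestamps copied from the alert log above.
hang_detected = "2023-03-27T23:01:44.912828+08:00"       # RS reports the hang
monitor_restarted = "2023-03-27T23:02:13.911836+08:00"   # RS starts cellrsomt again

elapsed = seconds_between(hang_detected, monitor_restarted)
print(f"{elapsed:.3f} seconds from hang detection to monitor restart")
```

For this incident the gap is roughly 29 seconds.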

 

4. As a follow-up, the storage server software needs to be upgraded to resolve the CELLSRV hang.