[pve] Checking RAID array status on Huawei servers

Published: 2023-11-04 23:28:17  Author: 呼长喜

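Last week a disk failed in one of our Dell servers, and the fault was reported to icinga2 through Dell's bundled OpenManage tooling. After replacing the disk, I remembered that another pve cluster runs on Huawei servers, and Huawei has no comparable hardware management software. So I installed the RAID controller vendor's utility and wrote a simple script to detect faults and raise alerts.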

Installing the RAID vendor's diagnostic utility

Identify the RAID controller

# lspci  | grep -i raid
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)

The RAID controller is an "LSI Logic MegaRAID SAS-3 3108".

Download and install MegaRAID Storage Manager (MSM)

LSI has been acquired by Broadcom, so the software is downloaded from Broadcom's support site:

https://www.broadcom.cn/support

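The downloaded archive contains only RPM packages, while PVE is based on Debian, so the RPMs have to be converted to deb packages with alien before they can be installed.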

apt install alien
tar zxvf 17.05.02.01_MSM_linux-x86.tar.gz
cd disk
alien --scripts *.rpm
dpkg --install lib-utils2_1.00-3_all.deb
dpkg --install megaraid-storage-manager_17.05.02-2_all.deb

By default, the utility is installed to /usr/local/MegaRAID Storage Manager/StorCLI/ (note the spaces in the path).
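
Because the install path contains spaces, it has to be quoted every time it is used in a shell. A minimal sketch of one way to handle that is to keep the path in a variable and check it before use. The variable name STORCLI and the helper storcli_available are illustrative and not part of the MSM package.

```shell
#!/bin/bash

# Default install location from the MSM package; note the spaces in the path.
STORCLI="/usr/local/MegaRAID Storage Manager/StorCLI/storcli64"

# Hypothetical helper: return 0 if the given path is an executable regular file.
storcli_available() {
    [[ -f $1 && -x $1 ]]
}
```

It could be used as `storcli_available "$STORCLI" || echo "storcli64 not found" >&2`, always with double quotes around "$STORCLI".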

Testing the utility

Show all information for every adapter. The output is very long.

# "/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL
Adapter #0
=====================================
                    Versions
                ================
Product Name    : SAS3108
Serial No       : 
FW Package Build: 24.16.0-0106

                    Mfg. Data
                ================
Mfg. Date       : 00/00/00
......

                Image Versions in Flash:
                ================
BIOS Version       : 6.32.02.0_4.17.08.00_0x06150500
......

                Pending Images in Flash
                ================
None
                PCI Info
                ================
Controller Id   : 0000
......

                HW Configuration
                ================
......
ROC temperature : 47  degree Celcius

                Settings
                ================
Current Time                     : 3:37:33 11/4, 2020
Predictive Fail Poll Interval    : 300sec
......

                Capabilities
                ================
RAID Level Supported             : RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
......
                Status
                ================
ECC Bucket Count                 : 0

                Limitations
                ================
Max Arms Per VD          : 32
......

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 3
  Disks           : 2
  Critical Disks  : 0
  Failed Disks    : 0

                Supported Adapter Operations
                ================
Rebuild Rate                    : Yes
......

                Supported VD Operations
                ================
Read Policy          : Yes
Write Policy         : Yes
......

                Supported PD Operations
                ================
Force Online                            : Yes
......
T10 Power State                         : No
                Error Counters
                ================
Memory Correctable Errors   : 0
Memory Uncorrectable Errors : 0

                High Availability Properties
                ================
Topology Type                 : None
                Cluster Information
                ================
Cluster Permitted     : No
Cluster Active        : No

                Default Settings
                ================
Phy Polarity                     : 0
Phy PolaritySplit                : 0
Background Rate                  : 30
......

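We only need the "Device Present" section. If "Degraded", "Offline", "Critical Disks", and "Failed Disks" are all "0", the array and its disks are healthy; any other value indicates a fault.

The eight lines that follow "Device Present" are the separator line plus seven counters, so the check passes when exactly four of those lines contain a 0. Note that this counts lines containing the character 0, so a value such as 10 in "Physical Devices" or "Disks" would also be counted.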

Extracting the status

A simple pipeline does the job:

"/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | grep -A 8 'Device Present' | grep 0 | wc -l
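
The pipeline above counts lines that contain a 0, which is compact but indirect. As an alternative sketch (not the method used in this post), the four counters can be read by name with awk. The function names below are illustrative, and the sketch assumes a single adapter and that the "Device Present" block ends at the first blank line, as in the output shown earlier.

```shell
#!/bin/bash

# Print "name=value" for each of the four failure counters found in the
# "Device Present" block of an adapter report read from stdin.
# Assumption: only the first "Device Present" block (one adapter) is read.
device_present_counters() {
    awk -F: '
        /Device Present/ { in_block = 1; next }
        # The block ends at the first blank line after the header.
        in_block && /^[ \t]*$/ { exit }
        in_block && /Degraded|Offline|Critical Disks|Failed Disks/ {
            name = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            print name "=" value
        }
    '
}

# Return 0 only when exactly four counters were found and all of them are 0.
raid_is_healthy() {
    local counters
    counters=$(device_present_counters)
    [[ $(grep -c '=' <<<"$counters") -eq 4 ]] &&
        [[ $(grep -c '=0$' <<<"$counters") -eq 4 ]]
}
```

For example: `"/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | raid_is_healthy && echo 'All are OK'`. Because the counters are matched by name, a value such as 10 in "Disks" does not affect the result.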

Putting it in a script

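#!/bin/bash

# Count the lines in the "Device Present" block that contain a 0.
# The path contains spaces, so it must be quoted.
PRESENT=$("/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | grep -A 8 "Device Present" | grep 0 | wc -l)
if [[ $PRESENT -eq 4 ]]; then
    # All four failure counters are 0.
    echo 'All are OK' && exit 0
else
    # Exit code 2 is treated as CRITICAL by nagios/icinga-style checks.
    echo 'RAID status is abnormal' && exit 2
fi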

Testing the script

# bash /mnt/pve/nfs199/pve/check_MegaRAID.sh
All are OK
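
The exit codes 0 and 2 match the plugin convention used by nagios and icinga (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). As a hedged sketch of how the result could be reported in more detail, the hypothetical function raid_status below takes the four counter values as arguments, names the non-zero counters, and returns UNKNOWN when a value is missing or not numeric. The function name and message wording are assumptions, not part of the original script.

```shell
#!/bin/bash

# Map the four failure counters to a status line and a plugin-style exit code.
# Arguments: degraded offline critical_disks failed_disks
raid_status() {
    local v bad=0
    if (( $# != 4 )); then
        echo "UNKNOWN - expected 4 counters, got $#"
        return 3
    fi
    for v in "$@"; do
        # Anything that is not a plain number means the report was not parsed.
        if ! [[ $v =~ ^[0-9]+$ ]]; then
            echo "UNKNOWN - could not read RAID counters"
            return 3
        fi
        # Any value other than zero marks the array as faulty.
        [[ $v =~ ^0+$ ]] || bad=1
    done
    if (( bad == 0 )); then
        echo "OK - all virtual drives and disks healthy"
        return 0
    fi
    echo "CRITICAL - degraded=$1 offline=$2 critical_disks=$3 failed_disks=$4"
    return 2
}
```

A check script would call `raid_status "$degraded" "$offline" "$critical" "$failed"` and then `exit $?`, so that a failure to run or parse the vendor utility is reported as UNKNOWN instead of being mistaken for a disk fault.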

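The script can now be combined with the DingTalk alert script described in an earlier post to send a warning through DingTalk when a fault occurs, or integrated into a monitoring platform such as nagios, zabbix, or icinga.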
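
The earlier DingTalk script is not reproduced here. As a rough sketch of what such an alert could look like, the functions below build a text message for a DingTalk custom robot webhook and post it with curl. The function names are illustrative, the webhook URL with its access_token must be supplied by you, and depending on the robot's security settings a keyword or signature may also be required.

```shell
#!/bin/bash

# Build the JSON body for a DingTalk text message.
# Only backslashes and double quotes are escaped; keep the message to one line.
dingtalk_payload() {
    local text
    text=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"msgtype":"text","text":{"content":"%s"}}' "$text"
}

# POST the message to the robot webhook.
# $1 = full webhook URL including access_token (placeholder, use your own)
# $2 = message text
dingtalk_send() {
    curl -sS -H 'Content-Type: application/json' \
        -d "$(dingtalk_payload "$2")" "$1"
}
```

One possible use is `bash check_MegaRAID.sh || dingtalk_send "$WEBHOOK_URL" "RAID fault on $(hostname)"`, where WEBHOOK_URL is a variable you define yourself.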