OSD fails to start after abnormal power loss

Published 2023-04-24 10:54:35, author: XU-NING

OSD fails to start

Problem description

The OSD status is down:

[root@node-1 ~]# ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.68958 root default
-2 2.17239     host node-4
 3 1.08620         osd.3        up  1.00000          1.00000
 4 1.08620         osd.4      down        0          1.00000
-3 2.17239     host node-2
 2 1.08620         osd.2        up  1.00000          1.00000
 6 1.08620         osd.6        up  1.00000          1.00000
-4 2.17239     host node-3
 1 1.08620         osd.1        up  1.00000          1.00000
 7 1.08620         osd.7      down        0          1.00000
-5 2.17239     host node-1
 0 1.08620         osd.0        up  1.00000          1.00000
 5 1.08620         osd.5        up  1.00000          1.00000
[root@node-1 ~]# 
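When the tree is long, the down OSDs can be filtered out directly. A minimal sketch, assuming the jewel-era `ceph osd tree` column layout shown above, where the up/down status is the 4th field of each osd row:

```shell
# List only the rows whose status column reads "down".
ceph osd tree | awk '$4 == "down"'
```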

Diagnosis steps

Log in to the node hosting the OSD and check the corresponding OSD service:

[root@node-3 ceph]# systemctl status  ceph-osd@7.service
● ceph-osd@7.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2021-10-29 13:26:40 CST; 4min 11s ago
  Process: 184542 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
  Process: 184488 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 184542 (code=killed, signal=ABRT)

Oct 29 13:26:20 node-3 systemd[1]: Unit ceph-osd@7.service entered failed state.
Oct 29 13:26:20 node-3 systemd[1]: ceph-osd@7.service failed.
Oct 29 13:26:40 node-3 systemd[1]: ceph-osd@7.service holdoff time over, scheduling restart.
Oct 29 13:26:40 node-3 systemd[1]: start request repeated too quickly for ceph-osd@7.service
Oct 29 13:26:40 node-3 systemd[1]: Failed to start Ceph object storage daemon.
Oct 29 13:26:40 node-3 systemd[1]: Unit ceph-osd@7.service entered failed state.
Oct 29 13:26:40 node-3 systemd[1]: ceph-osd@7.service failed.
[root@node-3 ceph]# 

Restarting the OSD service fails to bring it back up:

[root@node-3 ceph]# systemctl restart  ceph-osd@7.service
Job for ceph-osd@7.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@7.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@7.service" followed by "systemctl start ceph-osd@7.service" again.
[root@node-3 ceph]# 

# Run the command suggested by the error message
[root@node-3 ceph]# systemctl reset-failed ceph-osd@7.service

# After this, restarting no longer reports an error, but the OSD still does not come up; check the logs next
[root@node-3 ceph]# systemctl restart  ceph-osd@7.service

Check the OSD log, /var/log/ceph/ceph-osd.7.log

--- end dump of recent events ---
2021-10-29 13:50:07.117742 7ff76e664800  0 set uid:gid to 167:167 (ceph:ceph)
2021-10-29 13:50:07.117757 7ff76e664800  0 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185), process ceph-osd, pid 203828
2021-10-29 13:50:07.119187 7ff76e664800  0 pidfile_write: ignore empty --pid-file
2021-10-29 13:50:07.146080 7ff76e664800  0 filestore(/var/lib/ceph/osd/ceph-7) backend xfs (magic 0x58465342)
2021-10-29 13:50:07.146462 7ff76e664800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2021-10-29 13:50:07.146467 7ff76e664800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2021-10-29 13:50:07.146483 7ff76e664800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: splice is supported
2021-10-29 13:50:07.158475 7ff76e664800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2021-10-29 13:50:07.158520 7ff76e664800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
2021-10-29 13:50:07.159232 7ff76e664800  1 leveldb: Recovering log #767926
2021-10-29 13:50:07.175968 7ff76e664800  1 leveldb: Delete type=0 #767926

2021-10-29 13:50:07.176019 7ff76e664800  1 leveldb: Delete type=3 #767925

2021-10-29 13:50:07.176880 7ff76e664800  0 filestore(/var/lib/ceph/osd/ceph-7) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2021-10-29 13:50:07.196997 7ff76e664800 -1 journal Unable to read past sequence 1152396918 but header indicates the journal has committed up through 1152397338, journal is corrupt
2021-10-29 13:50:07.270433 7ff76e664800  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2021-10-29 13:50:07.270609 7ff76e664800  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2021-10-29 13:50:07.285389 7ff76e664800  0 osd.7 3200 crush map has features 2200130813952, adjusting msgr requires for clients
2021-10-29 13:50:07.285399 7ff76e664800  0 osd.7 3200 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2021-10-29 13:50:07.285403 7ff76e664800  0 osd.7 3200 crush map has features 2200130813952, adjusting msgr requires for osds
2021-10-29 13:50:15.583331 7fde2734f800  0 set uid:gid to 167:167 (ceph:ceph)
2021-10-29 13:50:15.583345 7fde2734f800  0 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185), process ceph-osd, pid 204015
2021-10-29 13:50:15.584786 7fde2734f800  0 pidfile_write: ignore empty --pid-file
2021-10-29 13:50:15.616141 7fde2734f800  0 filestore(/var/lib/ceph/osd/ceph-7) backend xfs (magic 0x58465342)
2021-10-29 13:50:15.616528 7fde2734f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2021-10-29 13:50:15.616533 7fde2734f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2021-10-29 13:50:15.616550 7fde2734f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: splice is supported
2021-10-29 13:50:15.623022 7fde2734f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2021-10-29 13:50:15.623068 7fde2734f800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
2021-10-29 13:50:15.623791 7fde2734f800  1 leveldb: Recovering log #767928
2021-10-29 13:50:15.623840 7fde2734f800  1 leveldb: Level-0 table #767930: started
2021-10-29 13:50:15.628917 7fde2734f800  1 leveldb: Level-0 table #767930: 139 bytes OK
2021-10-29 13:50:15.652033 7fde2734f800  1 leveldb: Delete type=0 #767928

The key log line: journal Unable to read past sequence 1152396918 but header indicates the journal has committed up through 1152397338, journal is corrupt
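In a long log file, this message can be pulled out with a quick grep. The path assumes the default log location for a jewel-era cluster; adjust it if your deployment logs elsewhere:

```shell
# Show the most recent journal-related error lines from the OSD log.
grep -iE 'journal|corrupt' /var/log/ceph/ceph-osd.7.log | tail -n 5
```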

This typically happens after a hard power loss: the journal header records commits up through sequence 1152397338, but entries are only readable up to 1152396918, so the tail of the journal was lost mid-write.

Fix

On the affected OSD node, edit the Ceph configuration file, then restart the service.

vim /etc/ceph/ceph.conf
[osd]
journal_ignore_corruption = true

# After the OSD restarts successfully, comment this option out again
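The whole recovery can be sketched as one sequence. Note that `journal_ignore_corruption = true` makes FileStore skip the unreadable journal tail on replay, which may lose the most recent writes on that OSD; remove the option once the OSD is back up so genuine corruption is not silently ignored later. This is a sketch, not a drop-in script:

```shell
# 1. Append the workaround to ceph.conf on the affected node
#    (or add it under an existing [osd] section with an editor).
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
journal_ignore_corruption = true
EOF

# 2. Clear systemd's start-limit counter, then restart the OSD.
systemctl reset-failed ceph-osd@7.service
systemctl restart ceph-osd@7.service

# 3. Verify the OSD has rejoined the cluster.
ceph osd tree | grep -w 'osd.7'

# 4. Once the OSD is up, comment the option out again.
sed -i 's/^journal_ignore_corruption = true/#journal_ignore_corruption = true/' /etc/ceph/ceph.conf
```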