OSD cannot rejoin the cluster after being automatically marked out-v1-20210308_124828

Published: 2023-04-25 10:52:01 Author: XU-NING

OSD cannot rejoin the cluster after being automatically marked out

Enterprise Cloud Platform Product Center Shared Knowledge Base

Exported on 03/08/2021

Table of Contents

  1. Problem description
  2. Root cause
  3. Solution
    1. Verification steps

Related download link:

OSD自然OUT之后无法再加入集群.pdf [1]

[1] https://iwiki.woa.com/download/attachments/119527381/OSD%E8%87%AA%E7%84%B6OUT%E4%B9%8B%E5%90%8E%E6%97%A0%E6%B3%95%E5%86%8D%E5%8A%A0%E5%85%A5%E9%9B%86%E7%BE%A4.pdf?api=v2&modificationDate=1586334532000&version=1

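Problem description

Throughout this article, replace osd.1 with the OSD that fails to start in your environment.

If an OSD server stays powered off for a long time, its OSD is marked out of the cluster. When the OSD tries to rejoin, its log /var/log/ceph/ceph-osd.1.log reports the following errors: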

2020-04-03 10:36:33.740785 7fcf4dab4d80 0 osd.1 74 crush map has features 288514051259236352, adjusting msgr requires for clients

2020-04-03 10:36:33.740792 7fcf4dab4d80 0 osd.1 74 crush map has features 288514051259236352 was 8705, adjusting msgr requires for mons

2020-04-03 10:36:33.740796 7fcf4dab4d80 0 osd.1 74 crush map has features 1009089991638532096, adjusting msgr requires for osds

2020-04-03 10:36:33.899665 7fcf4dab4d80 0 osd.1 74 load_pgs

2020-04-03 10:36:37.564629 7fcf4dab4d80 0 osd.1 74 load_pgs opened 6001 pgs

2020-04-03 10:36:37.566801 7fcf4dab4d80 0 osd.1 74 using weightedpriority op queue with priority op cut off at 64.

2020-04-03 10:36:37.568280 7fcf4dab4d80 -1 osd.1 74 log_to_monitors {default=true}

2020-04-03 10:36:39.973545 7fcf4dab4d80 -1 osd.1 74 init authentication failed: (22) Invalid argument

Key message: init authentication failed: (22) Invalid argument
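To confirm quickly that an OSD log contains this failure, the key message can be searched for directly. The following is a minimal sketch: the function name osd_auth_failed is hypothetical, and the log path in the usage comment simply follows the path quoted above.

```shell
# Hypothetical helper: exit 0 if the given OSD log contains the
# "init authentication failed" message, non-zero otherwise.
osd_auth_failed() {
    grep -q 'init authentication failed' "$1"
}

# Usage (not run here):
#   osd_auth_failed /var/log/ceph/ceph-osd.1.log && echo "auth failure found"
```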

Raising the monitor's auth debug level to at least 10/10 (the command below sets it to 20/20) makes the following messages visible:

ceph daemon mon.$HOSTNAME config set debug_auth 20/20

Monitor authentication error log:

2020-04-04 12:22:08.233215 7f4322b4e700 10 In get_auth_session_handler for protocol 0

2020-04-04 12:22:08.235600 7f4326b56700 10 cephx server osd.1: start_session server_challenge 1d0a94ecfe0b5ec9

2020-04-04 12:22:08.239489 7f4326b56700 10 cephx server osd.1: handle_request get_auth_session_key for osd.1

2020-04-04 12:22:08.239532 7f4326b56700 0 mon.openstack-con01@0(leader).auth v210 caught error when trying to handle auth request, probably malformed request
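Verbose auth logging makes the monitor log grow quickly, so it is worth recording the current level before raising it and restoring it afterwards. The wrappers below are a sketch only: their names are hypothetical, and they assume the monitor id equals the host name, as in the mon.$HOSTNAME form used above.

```shell
# Hypothetical wrappers around the admin-socket commands.
# Assumption: the mon id is the host name (mon.$HOSTNAME).

# Print the current debug_auth value so it can be restored later.
get_mon_auth_debug() {
    host="${HOSTNAME:-$(hostname)}"
    ceph daemon "mon.${host}" config get debug_auth
}

# Set debug_auth to the given level, e.g. 20/20 while debugging.
set_mon_auth_debug() {
    level="$1"
    host="${HOSTNAME:-$(hostname)}"
    ceph daemon "mon.${host}" config set debug_auth "$level"
}
```

A typical session would call get_mon_auth_debug first, then set_mon_auth_debug 20/20, reproduce the failure, and finally call set_mon_auth_debug again with the value recorded at the start.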

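Root cause

The gdb session itself is not described here.

Debugging with gdb shows that the OSD keyring is misconfigured: the keyring entry in the [global] section of /etc/ceph/ceph.conf also applies to the OSD, so the OSD authenticates with the wrong keyring file.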
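Without going through gdb, one quick check is to see whether a keyring entry is set in the [global] section at all. The helper below is a minimal sketch with a hypothetical name; it assumes the option is spelled exactly keyring and that sections are written as [name] on their own line.

```shell
# Hypothetical helper: print the value of "keyring" from the [global]
# section of a ceph.conf-style file. Commented-out lines are ignored.
global_keyring() {
    awk '
        # Track whether the current section is [global].
        /^[ \t]*\[/ { in_global = ($0 ~ /^[ \t]*\[global\][ \t]*$/); next }
        in_global && /^[ \t]*keyring[ \t]*=/ {
            sub(/^[ \t]*keyring[ \t]*=[ \t]*/, "")   # drop "keyring ="
            sub(/[ \t]+$/, "")                       # drop trailing blanks
            print
        }
    ' "$1"
}

# Usage (not run here):
#   global_keyring /etc/ceph/ceph.conf
```

If this prints a client keyring path, every daemon that reads the file, including the OSDs, inherits that setting unless its own section overrides it.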

Solution

On the Ceph cluster servers, comment out the following line in the [global] section of /etc/ceph/ceph.conf, then restart the OSD:

keyring=/etc/ceph/ceph.client.admin.keyring

systemctl restart ceph-osd@1
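When several servers need the same change, the edit can be scripted. The following is a sketch under stated assumptions: the function name is hypothetical, it writes the modified file to standard output rather than editing in place, and it only touches a keyring line inside [global].

```shell
# Hypothetical helper: print the config with the "keyring" line in
# [global] commented out. Lines in other sections are left untouched.
comment_global_keyring() {
    awk '
        /^[ \t]*\[/ { in_global = ($0 ~ /^[ \t]*\[global\][ \t]*$/) }
        in_global && /^[ \t]*keyring[ \t]*=/ { print "#" $0; next }
        { print }
    ' "$1"
}

# Usage (not run here), keeping a backup of the original file:
#   cp -p /etc/ceph/ceph.conf /etc/ceph/ceph.conf.bak
#   comment_global_keyring /etc/ceph/ceph.conf.bak > /etc/ceph/ceph.conf
```

Writing to standard output keeps the original file intact until the result has been reviewed.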

ceph osd tree|grep osd.1

# Sample output: the field after "up" shows the in state; 0 or empty means the OSD is not in

1 hdd 2.49750 osd.1 up 1.00000 1.00000

Wait a while, then check ceph osd tree again. If the OSD is up and in, nothing else needs to be done. If it is up but not in, run the following command:

ceph osd in osd.1
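The decision described above can be sketched as a small function that marks the OSD in only when it is up but not in. The function name ensure_osd_in is hypothetical. It matches the OSD name exactly, so osd.1 is not confused with osd.10, and it treats a zero or missing value after the status field as "not in", following the rule stated above.

```shell
# Hypothetical helper: mark osd.<id> in only when it is up but not in.
ensure_osd_in() {
    name="osd.$1"
    # Find the row for this OSD; print its status and in/out state.
    state=$(ceph osd tree | awk -v name="$name" '
        { for (i = 1; i < NF; i++)
              if ($i == name)
                  print $(i + 1), (($(i + 2) + 0 > 0) ? "in" : "out") }')
    case "$state" in
        "up in")  echo "$name is already up and in" ;;
        "up out") ceph osd in "$name" ;;
        *)        echo "$name is not up (state: ${state:-unknown})" >&2
                  return 1 ;;
    esac
}

# Usage (not run here):
#   ensure_osd_in 1
```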

Verification steps

systemctl status ceph-osd@1 # check the service status

ceph osd tree

# The OSD should be shown as up and in
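Both checks can be combined into a single pass/fail test, which is convenient in scripts. This is a minimal sketch with a hypothetical function name. It uses systemctl is-active instead of reading the status output by eye, and it applies the same "value after up is non-zero" rule for the in state.

```shell
# Hypothetical check: succeed only if the ceph-osd@<id> unit is active
# and `ceph osd tree` lists osd.<id> as up with a non-zero in weight.
verify_osd() {
    id="$1"
    systemctl is-active --quiet "ceph-osd@${id}" || return 1
    ceph osd tree | awk -v name="osd.${id}" '
        { for (i = 1; i < NF; i++)
              if ($i == name && $(i + 1) == "up" && $(i + 2) + 0 > 0)
                  ok = 1 }
        END { exit (ok ? 0 : 1) }
    '
}

# Usage (not run here):
#   verify_osd 1 && echo "osd.1 is up and in"
```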