K8s cluster: recovering from an etcd node failure

Published: 2023-04-24 23:46:42  Author: NavyW

1 Environment

k8s version: v1.20

The etcd node at 192.168.0.12 (member name etcd-3) has failed.

Error details, as logged by etcd on that node:

 4月 24 22:47:13 k8s-node2 etcd[9543]: {"level":"warn","ts":"2023-04-24T22:47:13.571+0800","caller":"etcdserver/server.go:2065","msg":"failed to publish local member to cluster through raft","local-member-id":"b8fffb7f5b2f26e","local-member-attributes":"{Name:etcd-3 ClientURLs:[https://192.168.0.12:2379]}","request-path":"/0/members/b8fffb7f5b2f26e/attributes","publish-timeout":"7s","error":"etcdserver: request timed out"}
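The member ID in this log line (the "local-member-id" field) is the value passed to member remove in step 3. A minimal sketch for pulling it out of the logs, assuming etcd writes the JSON format shown above; extract_member_id is a hypothetical helper name, not part of etcd:

```shell
# Print the "local-member-id" value from etcd JSON log lines read on stdin.
# Lines without that field produce no output.
extract_member_id() {
  sed -n 's/.*"local-member-id":"\([0-9a-f]*\)".*/\1/p'
}
```

On the failed node this could be used as: journalctl -u etcd --no-pager | extract_member_id | sort -u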

2 List the etcd cluster members

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member list
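Every etcdctl command in this post repeats the same TLS and endpoint flags. A small wrapper can cut down on that; this is only a sketch, and etcdctl_tls plus the three variable names are hypothetical (the paths and endpoints are the ones used above):

```shell
# Defaults taken from the commands in this post; override via the environment.
ETCDCTL="${ETCDCTL:-/opt/etcd/bin/etcdctl}"
ETCD_SSL_DIR="${ETCD_SSL_DIR:-/opt/etcd/ssl}"
ETCD_ENDPOINTS="${ETCD_ENDPOINTS:-https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379}"

# Run etcdctl with the shared TLS and endpoint flags, passing all other
# arguments through unchanged.
etcdctl_tls() {
  "$ETCDCTL" \
    --cacert="$ETCD_SSL_DIR/ca.pem" \
    --cert="$ETCD_SSL_DIR/server.pem" \
    --key="$ETCD_SSL_DIR/server-key.pem" \
    --endpoints="$ETCD_ENDPOINTS" \
    "$@"
}
```

With this loaded, "etcdctl_tls member list --write-out=table" prints the member list as a table, and "etcdctl_tls endpoint health" should report the failed endpoint as unhealthy.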

3 Remove the failed member

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member remove b8fffb7f5b2f26e
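If the member ID is not already known from the logs, it can be looked up from the member list output by peer URL. The sketch below assumes the default comma-separated output of etcdctl member list (ID, status, name, peer URLs, client URLs, learner flag) and a single peer URL per member; member_id_for_peer is a hypothetical helper:

```shell
# Read `etcdctl member list` output (default format) on stdin and print the
# ID of the member whose peer URL column equals $1.
member_id_for_peer() {
  awk -F', ' -v url="$1" '$4 == url { print $1 }'
}
```

For example, piping the member list command from step 2 into "member_id_for_peer https://192.168.0.12:2380" should print the ID to pass to member remove.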

4 Delete the old data on the failed node

rm -rf /var/lib/etcd/default.etcd/member/
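A more cautious variant is to stop etcd first and move the directory aside rather than deleting it, so the old data can still be inspected. This is a sketch under those assumptions, not what the original procedure does; set_aside_member_dir and the .bak naming are hypothetical:

```shell
# Move an etcd member directory aside instead of deleting it.
# Prints the backup path on success; returns 1 if the directory is missing.
set_aside_member_dir() {
  member_dir="${1%/}"    # strip a trailing slash, if any
  backup_dir="${member_dir}.bak.$(date +%Y%m%d%H%M%S)"
  [ -d "$member_dir" ] || { echo "no such directory: $member_dir" >&2; return 1; }
  mv "$member_dir" "$backup_dir" && echo "$backup_dir"
}
```

Usage on the failed node: systemctl stop etcd && set_aside_member_dir /var/lib/etcd/default.etcd/member/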

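5 Edit the etcd configuration file on the failed node

Change ETCD_INITIAL_CLUSTER_STATE from "new" to "existing", so the node joins the running cluster instead of bootstrapping a new one: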

#[Member]
ETCD_NAME="etcd-3"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.0.12:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.0.12:2379"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.12:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.12:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.0.5:2380,etcd-2=https://192.168.0.11:2380,etcd-3=https://192.168.0.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
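The same change can be scripted. The sketch below rewrites only the ETCD_INITIAL_CLUSTER_STATE line and keeps a .bak copy of the original file; set_cluster_state_existing is a hypothetical helper, and the config file path is not given in this post, so it is passed as an argument:

```shell
# Set ETCD_INITIAL_CLUSTER_STATE to "existing" in the given config file.
# The original file is preserved next to it with a .bak suffix.
set_cluster_state_existing() {
  conf="$1"
  cp "$conf" "$conf.bak" &&
  sed 's/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' \
    "$conf.bak" > "$conf"
}
```

Usage: set_cluster_state_existing followed by the path of the etcd config file on the failed node.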

6 Add the node back to the cluster

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member add etcd-3 --peer-urls=https://192.168.0.12:2380
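After member add, the member is registered under a newly generated ID and should be listed as "unstarted" until etcd is restarted on the node in step 7. A sketch for checking that from the member list output, assuming the default comma-separated format; member_status_for_peer is a hypothetical helper:

```shell
# Read `etcdctl member list` output (default format) on stdin and print the
# status column (for example "started" or "unstarted") of the member whose
# peer URL column equals $1.
member_status_for_peer() {
  awk -F', ' -v url="$1" '$4 == url { print $2 }'
}
```

Piping the member list command from step 2 into "member_status_for_peer https://192.168.0.12:2380" is expected to print "unstarted" at this point and "started" once step 7 has completed.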

 

7 Restart etcd on the failed node

systemctl restart etcd

Then check the etcd service status (for example, with systemctl status etcd).
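The restart may take a few seconds while the node catches up with the cluster. A minimal polling sketch, assuming the systemd unit is named etcd as in the restart command above; wait_for_unit is a hypothetical helper:

```shell
# Poll `systemctl is-active` once per second until the unit is active or the
# number of tries (default 10) is used up. Returns 0 when active, 1 otherwise.
wait_for_unit() {
  unit="$1"
  tries="${2:-10}"
  while [ "$tries" -gt 0 ]; do
    systemctl is-active --quiet "$unit" && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

Usage: wait_for_unit etcd 30 || journalctl -u etcd -n 50 --no-pager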

8 Check the health of the k8s cluster
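The original post does not list the commands for this step. One possible set of read-only checks is sketched below, assuming kubectl is on PATH and configured for this cluster; check_k8s_health is a hypothetical helper. Note that componentstatuses is deprecated since v1.19 but still available in v1.20.

```shell
# Run a few read-only health checks; stops at the first command that fails.
check_k8s_health() {
  kubectl get nodes -o wide &&           # every node should be Ready
  kubectl get componentstatuses &&       # includes one entry per etcd endpoint
  kubectl get --raw='/readyz?verbose'    # API server readiness, incl. etcd check
}
```

Usage: check_k8s_health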