tidb断电pd数据被破坏修复-526互联

思路：找到cluster id 和分配的id -----启动单点的pd-------pd-recover修复pd-------恢复成pd集群-----强制删除多个pd-------启动tidb集群

1、pd的日志报错

[root@localhost log]# pwd /tidb-deploy/pd-2379/log

[root@localhost log]# tail -100f pd.log

["failed to recover v3 backend from snapshot"] [error="failed to find database snapshot file (snap: snapshot file doesn't exist)"]

2、一.获取 Cluster ID

从 PD 日志获取 Cluster ID

cd /tidb-deploy/pd-2379/log/

cat pd.log | grep "init cluster id"

[2022/04/20 12:23:07.079 +08:00] [INFO] [server.go:358] ["init cluster id"] [cluster-id=7195834538672150139]

二.获取已分配 ID 在指定已分配 ID 时，需指定一个比当前最大的已分配 ID 更大的值。可以从监控中获取已分配 ID，也可以直接在服务器上查看日志。

cat pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1

3000

3、修改116的pd 改为单节点pd （最好找数据最齐全的pd节点，看到小等方式）

[root@localhost bin]# more /etc/systemd/system/pd-2379.service

[Unit]
Description=pd service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/tidb-deploy/pd-2379/scripts/run_pd.sh
Restart=always

RestartSec=15s

[Install]
WantedBy=multi-user.target
[root@localhost bin]# cd /tidb-deploy/pd-2379/scripts/
[root@localhost scripts]# ls
run_pd.sh run_pd.sh.danjiedian
[root@localhost scripts]# more run_pd.sh.danjiedian
#!/bin/bash
set -e

# WARNING: This file was auto-generated. Do not edit!
# All your edit might be overwritten!
DEPLOY_DIR=/tidb-deploy/pd-2379

cd "${DEPLOY_DIR}" || exit 1
exec bin/pd-server \
--name="pd-192.168.10.116-2379" \
--client-urls="http://0.0.0.0:2379" \
--advertise-client-urls="http://192.168.10.116:2379" \
--peer-urls="http://0.0.0.0:2380" \
--advertise-peer-urls="http://192.168.10.116:2380" \
--data-dir="/tidb-data/pd-2379" \
# --initial-cluster="pd-192.168.10.116-2379=http://192.168.10.116:2380,pd-192.168.10.117-2379=http://192.168.10.117:2380,pd-192.168.10.118-2379=http://192.168.10.118:2380" \
--config=conf/pd.toml \
--log-file="/tidb-deploy/pd-2379/log/pd.log" 2>> "/tidb-deploy/pd-2379/log/pd_stderr.log"

删除pd的数据
[root@localhost tidb-data]# pwd
/tidb-data
[root@localhost tidb-data]# mv pd-2379 pd-2379bak/

[tidb@localhost tidb-data]# mkdir pd-2379

启动pd
systemctl start pd-2379

4、下载pd-recover 进行pd修复（集群相同版本工具）

工具包下载地址：https://download.pingcap.org/tidb-community-toolkit-v5.4.1-linux-amd64.tar.gz

2.使用 pd-recover

cd /tidb-deploy/pd-2379/bin

./pd-recover -endpoints http://192.168.10.166:2379 -cluster-id 7195834538672150139 -alloc-id 3000

5、停止pd，还原成集群，查看tidb集群中pd是否启动

systemctl stop pd-2379

mv run_pd.sh.bak run_pd.sh

systemctl start pd-2379