tidb断电pd数据被破坏修复

发布时间 2023-12-06 10:48:32作者: 苍茫宇宙
思路:找到cluster id 和 分配的id -----启动单点的pd-------pd-recover修复pd-------恢复成pd集群-----强制删除多个pd-------启动tidb集群
1、pd的日志报错
[root@localhost log]# pwd /tidb-deploy/pd-2379/log
[root@localhost log]# tail -100f pd.log
["failed to recover v3 backend from snapshot"] [error="failed to find database snapshot file (snap: snapshot file doesn't exist)"]
2、一.获取 Cluster ID
从 PD 日志获取 Cluster ID
cd /tidb-deploy/pd-2379/log/
cat pd.log | grep "init cluster id"
[2022/04/20 12:23:07.079 +08:00] [INFO] [server.go:358] ["init cluster id"] [cluster-id=7195834538672150139]
二.获取已分配 ID 在指定已分配 ID 时,需指定一个比当前最大的已分配 ID 更大的值。可以从监控中获取已分配 ID,也可以直接在服务器上查看日志。
cat pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1
3000
3、修改116的pd 改为单节点pd (最好找数据最齐全的pd节点,看到小等方式)
[root@localhost bin]# more /etc/systemd/system/pd-2379.service

[Unit]
Description=pd service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/tidb-deploy/pd-2379/scripts/run_pd.sh
Restart=always

RestartSec=15s

[Install]
WantedBy=multi-user.target
[root@localhost bin]# cd /tidb-deploy/pd-2379/scripts/
[root@localhost scripts]# ls
run_pd.sh run_pd.sh.danjiedian
[root@localhost scripts]# more run_pd.sh.danjiedian
#!/bin/bash
set -e

# WARNING: This file was auto-generated. Do not edit!
# All your edit might be overwritten!
DEPLOY_DIR=/tidb-deploy/pd-2379

cd "${DEPLOY_DIR}" || exit 1
exec bin/pd-server \
--name="pd-192.168.10.116-2379" \
--client-urls="http://0.0.0.0:2379" \
--advertise-client-urls="http://192.168.10.116:2379" \
--peer-urls="http://0.0.0.0:2380" \
--advertise-peer-urls="http://192.168.10.116:2380" \
--data-dir="/tidb-data/pd-2379" \
# --initial-cluster="pd-192.168.10.116-2379=http://192.168.10.116:2380,pd-192.168.10.117-2379=http://192.168.10.117:2380,pd-192.168.10.118-2379=http://192.168.10.118:2380" \
--config=conf/pd.toml \
--log-file="/tidb-deploy/pd-2379/log/pd.log" 2>> "/tidb-deploy/pd-2379/log/pd_stderr.log"

删除pd的数据
[root@localhost tidb-data]# pwd
/tidb-data
[root@localhost tidb-data]# mv pd-2379 pd-2379bak/

[tidb@localhost tidb-data]# mkdir pd-2379

启动pd
systemctl start pd-2379

 
4、下载pd-recover 进行pd修复(集群相同版本工具)
工具包下载地址:https://download.pingcap.org/tidb-community-toolkit-v5.4.1-linux-amd64.tar.gz
2.使用 pd-recover
cd /tidb-deploy/pd-2379/bin
./pd-recover -endpoints http://192.168.10.166:2379 -cluster-id 7195834538672150139 -alloc-id 3000
5、停止pd,还原成集群,查看tidb集群中pd是否启动
systemctl stop pd-2379
mv run_pd.sh.bak run_pd.sh
systemctl start pd-2379
#查看集群pd是否启动
tiup cluster display tidb
 
0
5、强制剔除其他的pd,形成单接单pd运行
[root@localhost scripts]# tiup cluster scale-in tidb --node 192.168.10.117:2379 --force
#注意输入:
Are you sure to continue?
(Type "Yes, I know my data might be lost." to continue)
: Yes, I know my data might be lost.
6、启动验证集群,连接数据库查看数据
[root@localhost scripts]# tiup cluster start tidb
[root@localhost scripts]# tiup cluster display tidb
 
0