KingbaseES Bad Block Repair

Published: 2023-09-19 18:46:05 | Author: KINGBASE研究院 (KINGBASE Research Institute)

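1. Introduction to Automatic Bad Block Repair

When the primary database accesses system table data and indexes, or persistent user table data and indexes, it reads data blocks from disk into the shared buffers. If it detects a bad block during that read, it automatically fetches a copy of the block from a standby node and repairs the bad block.

Parameters related to bad block repair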

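Parameter | Default | Description
auto_bmr.auto_bmr_max_sess | 5 | Maximum number of sessions that may perform automatic bad block repair
auto_bmr.auto_bmr_req_timeout | 60 | Timeout for an automatic repair request; once it is exceeded, the automatic repair is abandoned
auto_bmr.auto_bmr_sess_threshold | 100 | Per-session bad block count threshold; once it is exceeded, automatic repair is not started
auto_bmr.auto_bmr_sys_threshold | 1024 | System-wide bad block count threshold; once it is exceeded, automatic repair is not started
auto_bmr.enable_auto_bmr | on | Whether automatic bad block repair is enabled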

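Enabling bad block repair

First, create the bad block repair extension:

kingbase=# create extension auto_bmr;
CREATE EXTENSION
kingbase=# show auto_bmr.enable_auto_bmr;
AUTO_BMR.ENABLE_AUTO_BMR

on
(1 row)

As shown, automatic repair is already enabled by default once the extension has been created.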

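Note that automatic bad block repair is subject to limits, which are set by auto_bmr.auto_bmr_req_timeout, auto_bmr.auto_bmr_sess_threshold, and auto_bmr.auto_bmr_sys_threshold.

auto_bmr.auto_bmr_req_timeout controls the timeout of a bad block repair.

auto_bmr.auto_bmr_sess_threshold and auto_bmr.auto_bmr_sys_threshold control the threshold on the number of pages repaired at the session level and at the system level, respectively.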

If a count limit has been reached, the following two functions reset the corresponding counters:

select reset_auto_bmr_sys_bad_blk();

select reset_auto_bmr_sess_bad_blk();
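The way the two thresholds gate automatic repair can be pictured with a small model. This is only an illustrative sketch, not KingbaseES code: the function name `decide` and the counters passed to it are hypothetical, and it assumes that "exceeded" means strictly greater than the threshold, which the article does not specify. The default values 100 and 1024 come from the parameter list above.

```shell
#!/bin/sh
# Illustrative model only: NOT KingbaseES source code.
set -eu

SESS_THRESHOLD=100    # default of auto_bmr.auto_bmr_sess_threshold
SYS_THRESHOLD=1024    # default of auto_bmr.auto_bmr_sys_threshold

# decide SESS_COUNT SYS_COUNT
# Prints "repair" while both bad block counters are within their thresholds,
# and "skip" once either one exceeds its threshold.
# Assumption: "exceeds" is modelled as strictly greater than.
decide() {
    sess_count=$1
    sys_count=$2
    if [ "$sess_count" -gt "$SESS_THRESHOLD" ] || [ "$sys_count" -gt "$SYS_THRESHOLD" ]; then
        echo skip
    else
        echo repair
    fi
}

r1=$(decide 3 40)       # both counters low
r2=$(decide 101 200)    # session counter over its threshold
r3=$(decide 5 1025)     # system counter over its threshold
r4=$(decide 0 0)        # counters back at zero, as after a reset
echo "$r1 $r2 $r3 $r4"
```

In this model, calling reset_auto_bmr_sess_bad_blk() or reset_auto_bmr_sys_bad_blk() corresponds to setting the matching counter back to zero, after which repair is attempted again.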

Automatic bad block repair: hands-on steps

Check the cluster status

[kingbase@localhost bin]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+---------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 3 | | host=192.168.56.101 user=esrep dbname=esrep port=54321 connect_timeout=10
2 | node2 | standby | running | node1 | default | 100 | 3 | 0 bytes | host=192.168.56.103 user=esrep dbname=esrep port=54321 connect_timeout=10

Create a test table and insert data

[kingbase@localhost bin]$ ksql -U system kingbase
ksql (V8.0)
Type "help" for help.

kingbase=# drop table t1;
DROP TABLE
kingbase=# create table t1 (id int,name varchar2(100));
CREATE TABLE
kingbase=# insert into t1 values(generate_series(1,100000),now());
INSERT 0 100000

Find the physical file that backs the test table

kingbase=# select * from sys_relation_filepath('t1');
sys_relation_filepath

base/12178/74951

Run sh sys_monitor.sh stop to stop the cluster

[kingbase@localhost bin]$ sh sys_monitor.sh stop
2023-06-16 10:04:57 Ready to stop all DB ...
Service process "node_export" was killed at process 4623
Service process "postgres_ex" was killed at process 4624
Service process "node_export" was killed at process 3588
Service process "postgres_ex" was killed at process 3589
2023-06-16 10:05:01 begin to stop repmgrd on "[192.168.56.101]".
2023-06-16 10:05:02 repmgrd on "[192.168.56.101]" stop success.
2023-06-16 10:05:02 begin to stop repmgrd on "[192.168.56.103]".
2023-06-16 10:05:02 repmgrd on "[192.168.56.103]" stop success.
2023-06-16 10:05:02 begin to stop DB on "[192.168.56.103]".
waiting for server to shut down .... done
server stopped
2023-06-16 10:05:03 DB on "[192.168.56.103]" stop success.
2023-06-16 10:05:03 begin to stop DB on "[192.168.56.101]".
waiting for server to shut down .... done
server stopped
2023-06-16 10:05:04 DB on "[192.168.56.101]" stop success.
2023-06-16 10:05:04 Done.
[kingbase@localhost bin]$

Once the cluster has stopped successfully, use the dd command to create two bad blocks

[kingbase@localhost bin]$ dd bs=8192 count=2 seek=1 of=/data/V8cluster/base/12178/74944 if=./kingbase conv=notrunc
2+0 records in
2+0 records out
16384 bytes (16 kB) copied, 0.000178303 s, 91.9 MB/s
[kingbase@localhost bin]$
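To see why these dd flags damage exactly blocks 1 and 2 and leave the rest of the file alone, the same flags can be tried on a throwaway file. The sketch below is a demonstration on scratch files only and never touches a database file; `scratch` and `garbage` are hypothetical stand-ins for the relation file and for the ./kingbase binary used as junk input, and the 8192-byte block size is taken from the bs value in the command above.

```shell
#!/bin/sh
# Scratch-file demonstration only; it never touches a database file.
set -eu

BLOCK_SIZE=8192
scratch=$(mktemp)    # hypothetical stand-in for a relation file
garbage=$(mktemp)    # hypothetical stand-in for the junk input file

# Build a 4-block file of zero bytes (blocks 0..3).
dd if=/dev/zero of="$scratch" bs=$BLOCK_SIZE count=4 2>/dev/null
# Build 2 blocks of non-zero junk bytes.
dd if=/dev/zero bs=$BLOCK_SIZE count=2 2>/dev/null | tr '\0' 'X' > "$garbage"

# Same flags as in the article: seek past 1 output block, write 2 blocks,
# and keep the rest of the output file (conv=notrunc).
dd if="$garbage" of="$scratch" bs=$BLOCK_SIZE count=2 seek=1 conv=notrunc 2>/dev/null

# Print the number of non-zero bytes in block $1 of the scratch file.
nonzero_in_block() {
    dd if="$scratch" bs=$BLOCK_SIZE skip="$1" count=1 2>/dev/null | tr -d '\0' | wc -c | tr -d ' '
}

size=$(wc -c < "$scratch" | tr -d ' ')
b0=$(nonzero_in_block 0)
b1=$(nonzero_in_block 1)
b2=$(nonzero_in_block 2)
b3=$(nonzero_in_block 3)
echo "size=$size block0=$b0 block1=$b1 block2=$b2 block3=$b3"

rm -f "$scratch" "$garbage"
```

The file size stays at 4 blocks, blocks 0 and 3 remain all zeros, and only blocks 1 and 2 are overwritten. Without conv=notrunc, dd would truncate the output file after the last block written.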

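Run sh sys_monitor.sh start to start the cluster

Connect to the database and query the table. Because the table now contains bad blocks, the query fails with the following errors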

kingbase=# select * from t1;
WARNING: page is invalid: base/12178/74944, blockNum: 1
WARNING: Exec get buffer page failed,errMsg:ERROR: function public.get_lsn_reached_page(integer, integer, integer, integer, integer, integer) does not exist
LINE 1: select public.get_lsn_reached_page(1663, 12178, 74944, 0, 1,...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.

WARNING: repair invalid page: base/12178/74944, block: 1 failed.
ERROR: invalid page in block 1 of relation base/12178/74944
kingbase=#

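The "function public.get_lsn_reached_page ... does not exist" message above suggests that the auto_bmr extension has not yet been created in this database. Connect to the database and create the auto_bmr extension: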

create extension auto_bmr ;

Run the query again. This time the bad blocks are repaired automatically and the query returns its results

kingbase=# select * from t1;
WARNING: page is invalid: base/12178/74951, blockNum: 1
WARNING: repair invalid page:base/12178/74951, blockNum: 1 successfully.
WARNING: page verification failed, calculated checksum 57120 but expected 17157
WARNING: page is invalid: base/12178/74951, blockNum: 2
WARNING: repair invalid page:base/12178/74951, blockNum: 2 successfully.
id | name
--------+-------------------------------
1 | 2023-06-16 10:12:50.601743+08
2 | 2023-06-16 10:12:50.601743+08
3 | 2023-06-16 10:12:50.601743+08