KingbaseES V8R6集群案例---一主二备架构单个备库宕机事务影响测试

发布时间 2023-09-18 16:14:00作者: KINGBASE研究院
      KingbaseES V8R6集群案例---一主二备架构单个备库宕机事务影响测试

案例说明:
对于KingbaseES V8R6集群,在sync模式下,对于一主一备架构,如果备库宕机时,主库事务commit,会被hang住,在wal_sender_timeout参数(默认60s)超过阈值后,sync转为async后,事务才能完成commit;本案例在测试一主二备架构下,同步备库宕机后,主库事务处理是否有hang住现象。

适用版本:
KingbaseES V8R6

集群架构:

  ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node1 | primary | * running |          | running | 25697 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 582   | no      | 1 second(s) ago
 3  | node3 | standby |   running | node1    | running | 12571 | no      | 1 second(s) ago

测试脚本:

[kingbase@node101 shell]$ cat 1.sh
#!/bin/bash

while true
do
/home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/ksql -U system prod <<EOF
insert into test1 values (now());
EOF
done


[kingbase@node101 shell]$ cat 2.sh
#!/bin/bash

while true
do
/home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/ksql -U system prod <<EOF
select * from test1;
EOF
done

一、测试一(synchronous='sync')

1、流复制同步配置

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep sync
synchronous='sync'

[kingbase@node101 bin]$ cat /data/kingbase/hac7/data/es_rep.conf |grep _commit
synchronous_commit = remote_apply

2、集群流复制状态

test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_
start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush
_lag | replay_lag | sync_priority | sync_state |          reply_time
------+----------+---------+------------------+---------------+-----------------+-------------+-----------------
--------------+--------------+-----------+------------+------------+------------+------------+-----------+------
-----+------------+---------------+------------+-------------------------------
 2840 |    16385 | esrep   | node2            | 192.168.1.102 |                 |       44199 | 2023-06-21 14:47
:49.668343+08 |              | streaming | 1/A5000680 | 1/A5000680 | 1/A5000680 | 1/A5000680 |           |
     |            |             1 | sync       | 2023-06-21 16:51:00.439757+08
 2852 |    16385 | esrep   | node3            | 192.168.1.103 |                 |       62335 | 2023-06-21 14:47
:51.035563+08 |              | streaming | 1/A5000680 | 1/A5000680 | 1/A5000680 | 1/A5000680 |           |
     |            |             2 | potential  | 2023-06-21 16:50:54.103476+08
(2 rows)

---如上所示,node2为流复制中的sync节点。

3、主库执行测试脚本

session 1:
[kingbase@node101 shell]$ sh 1.sh &

session 2:
[kingbase@node101 shell]$ sh 2.sh >1.log 2>&1

4、关闭同步备库数据库服务
node2节点宕机时间:
2023-06-25 13:52:01.037 CST,,,11890,,6497d534.2e72,5,,2023-06-25 13:48:36 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,"

5、查看数据插入记录

如下所示,在同步备库宕机的时间点,并没有造成事务处理的中断(事务被hang住):

.......
 2023-06-25 13:52:00.907013
 2023-06-25 13:52:00.937465
 2023-06-25 13:52:00.966735
 2023-06-25 13:52:00.995530
 2023-06-25 13:52:01.026226
 2023-06-25 13:52:01.055470
 2023-06-25 13:52:01.082933
 2023-06-25 13:52:01.168686
 2023-06-25 13:52:01.238983
 2023-06-25 13:52:01.324240
 2023-06-25 13:52:01.374047
 2023-06-25 13:52:01.427150
 2023-06-25 13:52:01.464301
 2023-06-25 13:52:01.519301
 2023-06-25 13:52:01.573772
 2023-06-25 13:52:01.628080
 2023-06-25 13:52:01.735475
 2023-06-25 13:52:01.804554
 2023-06-25 13:52:01.891160
 2023-06-25 13:52:01.957588
 2023-06-25 13:52:02.070638
 2023-06-25 13:52:02.166344
 2023-06-25 13:52:02.219404
 2023-06-25 13:52:02.264405
 2023-06-25 13:52:02.304738
 .....

二、测试二(synchronous='quorum')

1、流复制同步配置

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep sync
synchronous='quorum'

[kingbase@node101 bin]$ cat /data/kingbase/hac7/data/es_rep.conf |grep _commit
synchronous_commit = remote_apply

2、集群流复制状态

test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 4598 |    16385 | esrep   | node2            | 192.168.1.102 |                 |       41421 | 2023-06-21 14:36:45.278823+08 |              | streaming | 1/A30009F8 | 1/A30009F8 | 1/A30009F8 | 1/A30009F8 |           |     |            |             1 | quorum     | 2023-06-21 15:16:03.583408+08
 8945 |    16385 | esrep   | node3            | 192.168.1.103 |                 |       61295 | 2023-06-21 15:14:39.573214+08 |              | streaming | 1/A30009F8 | 1/A30009F8 | 1/A30009F8 | 1/A30009F8 |           |     |            |             1 | quorum     | 2023-06-21 15:16:02.032845+08
(2 rows)

3、主库执行测试脚本

session 1:
[kingbase@node101 shell]$ sh 1.sh &

session 2:
[kingbase@node101 shell]$ sh 2.sh >1.log 2>&1

4、关闭同步备库数据库服务
node2节点宕机时间:
2023-06-25 14:04:55.740 CST,,,18910,,6497d80f.49de,5,,2023-06-25 14:00:47 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,"

5、查看数据插入记录
如下所示,在同步备库宕机的时间点,并没有造成事务处理的中断(事务被hang住):

 ........
2023-06-25 14:04:54.955449
 2023-06-25 14:04:54.986434
 2023-06-25 14:04:55.027062
 2023-06-25 14:04:55.058955
 2023-06-25 14:04:55.092413
 2023-06-25 14:04:55.123352
 2023-06-25 14:04:55.157153
 2023-06-25 14:04:55.193440
 2023-06-25 14:04:55.227700
 2023-06-25 14:04:55.264961
 2023-06-25 14:04:55.301582
 2023-06-25 14:04:55.336236
 2023-06-25 14:04:55.380388
 2023-06-25 14:04:55.419770
 2023-06-25 14:04:55.462793
 2023-06-25 14:04:55.496123
 2023-06-25 14:04:55.530339
 2023-06-25 14:04:55.563342
 2023-06-25 14:04:55.597350
 2023-06-25 14:04:55.632114
 2023-06-25 14:04:55.659785
 2023-06-25 14:04:55.695025
 2023-06-25 14:04:55.726024
 2023-06-25 14:04:55.754980
 2023-06-25 14:04:55.788829
 2023-06-25 14:04:55.818428
 2023-06-25 14:04:55.850334
 2023-06-25 14:04:55.886098
 2023-06-25 14:04:55.916974
 2023-06-25 14:04:55.949975
 2023-06-25 14:04:55.985593
 2023-06-25 14:04:56.017278
 2023-06-25 14:04:56.062645
 2023-06-25 14:04:56.094456
 2023-06-25 14:04:56.129149
 .......

三、测试三(synchronous='all')

1、流复制同步配置

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep sync
synchronous='all'

[kingbase@node101 bin]$ cat /data/kingbase/hac7/data/es_rep.conf |grep _commit
synchronous_commit = remote_apply

2、集群流复制状态

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 25468 |    16385 | esrep   | node2            | 192.168.1.102 |                 |       42097 | 2023-06-21 16:28:11.888318+08 |              | streaming | 1/A4000508 | 1/A4000508 | 1/A4000508 | 1/A4000508 |           |      |            |             1 | sync       | 2023-06-21 16:28:51.499723+08
 25480 |    16385 | esrep   | node3            | 192.168.1.103 |                 |       61309 | 2023-06-21 16:28:13.300779+08 |              | streaming | 1/A4000508 | 1/A4000508 | 1/A4000508 | 1/A4000508 |           |      |            |             1 | sync       | 2023-06-21 16:28:54.034186+08
(2 rows)

---如上所示,node2为流复制中的sync节点。

3、主库执行测试脚本

session 1:
[kingbase@node101 shell]$ sh 1.sh &

session 2:
[kingbase@node101 shell]$ sh 2.sh >1.log 2>&1

4、关闭同步备库数据库服务
node2节点宕机时间:
2023-06-25 14:12:06.618 CST,,,24878,,6497da69.612e,5,,2023-06-25 14:10:49 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,"

5、查看数据插入记录
如下所示,在同步备库宕机的时间点,并没有造成事务处理的中断(事务被hang住):

......
 2023-06-25 14:12:06.910576
 2023-06-25 14:12:06.944394
 2023-06-25 14:12:06.977977
 2023-06-25 14:12:07.006902
 2023-06-25 14:12:07.027061
 2023-06-25 14:12:07.058800
 2023-06-25 14:12:07.090143
 2023-06-25 14:12:07.121745
 2023-06-25 14:12:07.159289
 2023-06-25 14:12:07.192978
 2023-06-25 14:12:07.216982
 2023-06-25 14:12:07.244713
 2023-06-25 14:12:07.275212
 2023-06-25 14:12:07.305201
 2023-06-25 14:12:07.333306
 2023-06-25 14:12:07.365943
 2023-06-25 14:12:07.396400
 2023-06-25 14:12:07.430124
 2023-06-25 14:12:07.457527
 2023-06-25 14:12:07.487357
 2023-06-25 14:12:07.518364
 2023-06-25 14:12:07.551942
 2023-06-25 14:12:07.580497
 2023-06-25 14:12:07.615095
 2023-06-25 14:12:07.642226
 2023-06-25 14:12:07.667587
 2023-06-25 14:12:07.705081
 2023-06-25 14:12:07.738520
 2023-06-25 14:12:07.777657
 2023-06-25 14:12:07.816077
 2023-06-25 14:12:07.847995
 2023-06-25 14:12:07.888306
 2023-06-25 14:12:07.917725
 2023-06-25 14:12:07.952045
 2023-06-25 14:12:07.984344
 2023-06-25 14:12:08.021621
 2023-06-25 14:12:08.051968
 2023-06-25 14:12:08.082699
 2023-06-25 14:12:08.114964
 2023-06-25 14:12:08.145950
 2023-06-25 14:12:08.175424
 ......

四、总结
在一主多备的架构下,同步流复制模式中,当同步备库节点宕机时,不会影响主库事务操作的连续性,建议对事务处理性能要求高的业务环境,采用一主多备的集群架构。如Oracle的DataGuard,在最大保护模式,建议采用一主多备的架构。