KingbaseES V8R3 Cluster Operations Case: Diagnosing Missing cluster.log Output

Published: 2023-09-20 14:20:26 | Author: KINGBASE Research Institute (KINGBASE研究院)

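Case description:
While a KingbaseES V8R3 cluster was running normally, staff on site noticed that cluster.log was no longer receiving any output. A later on-site check found that the cluster.log file had been deleted. This article reproduces the problem and presents a solution.

Applicable version: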
KingbaseES V8R3

I. Check the cluster service status

1. Cluster node status

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | standby | 0          | true              | 0
(2 rows)

2. Streaming replication status

TEST=# select * from sys_stat_replication;
 PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 1366 |       10 | SYSTEM  | node2            | 192.168.1.102 |                 |       38968 | 2023-04-12 14:29:09.881587+08 |              | streaming | 1/2E0001B0    | 1/2E0001B0     | 1/2E0001B0     | 1/2E0001B0      |             0 | async
(1 row)

II. Use lsof to view the files opened by the processes

1. View the log file opened by the kingbasecluster processes

[root@node101 ~]# cd /home/kingbase/cluster/HAR3/db/bin

[root@node101 bin]# lsof -c kingbasecluster |grep cluster.log
kingbasec 1689 root    1w      REG              253,2 29453500  34403567 /home/kingbase/cluster/HAR3/log/cluster.log
kingbasec 1689 root    2w      REG              253,2 29453500  34403567 /home/kingbase/cluster/HAR3/log/cluster.log
kingbasec 1724 root    1w      REG              253,2 29453500  34403567 /home/kingbase/cluster/HAR3/log/cluster.log

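---As shown above, once the kingbasecluster service has started, it keeps cluster.log open for writing (file descriptors 1 and 2, that is, stdout and stderr).

2. Check which processes have cluster.log open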

[root@node101 bin]# lsof /home/kingbase/cluster/HAR3/log/cluster.log
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
kingbasec 1689 root    1w   REG  253,2 29455673 34403567 /home/kingbase/cluster/HAR3/log/cluster.log
kingbasec 1689 root    2w   REG  253,2 29455673 34403567 /home/kingbase/cluster/HAR3/log/cluster.log
kingbasec 1724 root    1w   REG  253,2 29455673 34403567 /home/kingbase/cluster/HAR3/log/cluster.log

# View the corresponding processes
[root@node101 bin]# ps -ef |grep 1689
root      1689     1  0 14:29 ?        00:00:00 ./kingbasecluster -n
root      1724  1689  0 14:29 ?        00:00:00 kingbasecluster: watchdog
root      1788  1689  0 14:29 ?        00:00:00 kingbasecluster: lifecheck
root      1794  1689  0 14:29 ?        00:00:00 kingbasecluster: wait for connection request
root      1795  1689  0 14:29 ?        00:00:00 kingbasecluster: wait for connection request
root      1796  1689  0 14:29 ?        00:00:00 kingbasecluster: wait for connection request

---As shown above, cluster.log is held open by the kingbasecluster processes.
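The same information that lsof reports can also be read directly from /proc. The following is a minimal sketch, assuming a Linux system with /proc mounted and sufficient privileges to inspect the target process; the function name `show_log_targets` is hypothetical and not part of KingbaseES.

```shell
#!/bin/sh
# Hypothetical helper (not part of KingbaseES): print the file that a
# process's stdout (fd 1) and stderr (fd 2) currently point to.
# Assumes a Linux /proc filesystem and permission to inspect the process.
show_log_targets() {
    pid="$1"
    for fd in 1 2; do
        # Each /proc/<pid>/fd/<n> entry is a symlink to the open file.
        target=$(readlink "/proc/$pid/fd/$fd") || target="(unreadable)"
        printf 'fd %s -> %s\n' "$fd" "$target"
    done
}

# Inspect the PID given as the first argument, or this shell itself.
show_log_targets "${1:-$$}"
```

Run against the kingbasecluster parent PID from the ps output (1689 in this environment), it should report the same path that lsof lists for descriptors 1w and 2w.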

III. Simulate deletion of the cluster.log file

1. Rename the cluster.log file

[root@node101 log]# ls -lh cluster*

-rw-r--r-- 1 root root 2.4M Apr 12 14:33 cluster_restart.log
-rw-r--r-- 1 root root  110 Apr 12 14:09 clusterstop

[root@node101 log]# mv cluster.log cluster.log.bk

[root@node101 log]# ls -lh cluster*
-rw-r--r-- 1 root root  29M Apr 12 14:33 cluster.log.bk
-rw-r--r-- 1 root root 2.4M Apr 12 14:33 cluster_restart.log
-rw-r--r-- 1 root root  110 Apr 12 14:09 clusterstop

2. Use lsof to view the log file opened by the kingbasecluster processes

# As shown below, the open file follows the rename: kingbasecluster now writes to cluster.log.bk
[root@node101 log]# lsof -c kingbasecluster |grep cluster.log
kingbasec 1689 root    1w      REG              253,2 29464573  34403567 /home/kingbase/cluster/HAR3/log/cluster.log.bk
kingbasec 1689 root    2w      REG              253,2 29464573  34403567 /home/kingbase/cluster/HAR3/log/cluster.log.bk
kingbasec 1724 root    1w      REG              253,2 29464573  34403567 /home/kingbase/cluster/HAR3/log/cluster.log.bk

# Delete the cluster.log.bk file
[root@node101 log]# rm cluster.log.bk

[root@node101 log]# lsof -c kingbasecluster |grep cluster.log
kingbasec 1689 root    1w      REG              253,2 29467057  34403567 /home/kingbase/cluster/HAR3/log/cluster.log.bk (deleted)
kingbasec 1689 root    2w      REG              253,2 29467057  34403567 /home/kingbase/cluster/HAR3/log/cluster.log.bk (deleted)
kingbasec 1724 root    1w      REG              253,2 29467057  34403567 /home/kingbase/cluster/HAR3/log/cluster.log.bk (deleted)

# Manually create a cluster.log file
[root@node101 log]# touch cluster.log
[root@node101 log]# lsof /home/kingbase/cluster/HAR3/log/cluster.log

---As shown above, the manually created cluster.log is not opened by any kingbasecluster process (lsof returns nothing for it).

As the lsof output above shows, once the log file has been deleted, lsof marks the file still held by the processes as "(deleted)".
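The behaviour in this section follows from how Linux handles open files: a process writes to the inode it opened, not to the path name, so removing the name and creating a new file with the same name does not redirect its output. The sketch below reproduces this in a scratch directory. It assumes a Linux system with /proc; the background writer and all variable names are illustrative and have nothing to do with kingbasecluster itself.

```shell
#!/bin/sh
# Generic demonstration in a scratch directory; nothing here touches a real cluster.
dir=$(mktemp -d)
log="$dir/cluster.log"

# Background writer: opens the log once (via redirection) and keeps the descriptor.
( while :; do echo "tick"; sleep 1; done ) >>"$log" 2>&1 &
writer=$!
sleep 1

rm "$log"       # remove the name; the writer's open descriptor still holds the inode
touch "$log"    # a new file with the same name is a different inode
sleep 2

new_size=$(wc -c <"$log")                    # stays 0: the writer never opened this file
fd_target=$(readlink "/proc/$writer/fd/1")   # path ends with " (deleted)" on Linux

echo "new file size: $new_size"
echo "writer stdout: $fd_target"

kill "$writer"
rm -rf "$dir"
```

The writer keeps producing output the whole time, but it goes to the unlinked inode, which matches the "(deleted)" entries lsof shows for kingbasecluster.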

IV. Test failover

1. Cluster node status before failover

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | standby | 0          | true              | 0
(2 rows)

2. Simulate the primary database service going down

[kingbase@node101 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped

3. Check the new primary after failover

TEST=# select sys_is_in_recovery();
 SYS_IS_IN_RECOVERY
--------------------
 f
(1 row)

4. Cluster node status after failover

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | standby | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | primary | 0          | true              | 0
(2 rows)

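---As shown above, accidental deletion of cluster.log does not affect cluster failover.

V. Restore cluster.log output
Tips:
To make kingbasecluster write to cluster.log again, the kingbasecluster service must be restarted. This can be done by running restartcluster.sh as root, but the existing kingbasecluster process must be killed first and the script run afterwards. This restarts the cluster service without affecting the database service.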

1. Find the kingbasecluster process

[root@node101 ~]# ps -ef |grep kingbasecluster
root      1689     1  0 14:29 ?        00:00:00 ./kingbasecluster -n
.......

2. Kill the process (note: do not use kill -9)

[root@node101 ~]# kill -2 1689
[root@node101 ~]# ps -ef |grep kingbasecluster
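Because the restart script must only run after the old process has gone, it can help to wait for the PID to disappear rather than checking ps by hand. The following is a minimal sketch; the function name `wait_for_exit` and the 30-second default are assumptions made for illustration, not something shipped with KingbaseES.

```shell
#!/bin/sh
# Hypothetical helper: wait until a PID has exited, or give up after a timeout.
# Returns 0 if the process is gone, 1 if it is still alive after $2 seconds.
wait_for_exit() {
    pid="$1"
    timeout="${2:-30}"
    elapsed=0
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage sketch (PID taken from the ps output above; adjust to your environment):
#   kill -2 1689 && wait_for_exit 1689 30 && echo "old kingbasecluster has exited"
```

If the function returns 1, the process did not stop within the timeout and the situation should be investigated before restartcluster.sh is run.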

3. Run restartcluster.sh

[root@node101 ~]# /home/kingbase/cluster/HAR3/kingbasecluster/bin/restartcluster.sh
# After the restart, the process has a new PID
[root@node101 ~]# ps -ef |grep kingbasecluster
root     14943     1  0 14:50 pts/0    00:00:00 ./kingbasecluster -n
.......

4. Check the cluster.log output

[root@node101 ~]# tail -f  /home/kingbase/cluster/HAR3/log/cluster.log
2023-04-12 14:50:46: pid 14943: LOG:  Setting up socket for :::9999
2023-04-12 14:50:46: pid 14943: LOG:  kingbasecluster successfully started. version 3.6.7 (release)
2023-04-12 14:50:47: pid 14966: LOG:  creating socket for sending heartbeat
2023-04-12 14:50:47: pid 14966: DETAIL:  bind send socket to device: enp0s3
.......

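VI. Summary
Do not casually delete the log files of the cluster or database services by hand. If the files keep growing and take up disk space, use the Linux logrotate tool to rotate them and automatically remove old logs.
Restarting the kingbasecluster service with restartcluster.sh does not affect normal access to the cluster, but it does cause the cluster VIP to move to another node. In a production environment, do it during off-peak hours.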
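As an illustration of the logrotate approach, the sketch below writes a sample rule to a scratch file for review. The log path is the one seen in the lsof output in this article; the schedule and retention values are assumptions chosen for illustration, not vendor recommendations. `copytruncate` is used because, as shown above, kingbasecluster keeps its log descriptor open and does not reopen the file: logrotate copies the log and then truncates it in place, so the process keeps writing to the same inode. The logrotate documentation notes that a few lines can be lost between the copy and the truncate, and how well truncation works also depends on how the process opened the file.

```shell
#!/bin/sh
# Write a sample logrotate rule to a scratch file for review.
# The path comes from the lsof output in this article; the schedule and
# retention values are illustrative assumptions, not vendor recommendations.
conf=$(mktemp)
cat >"$conf" <<'EOF'
/home/kingbase/cluster/HAR3/log/cluster.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF
echo "sample rule written to $conf"
# To dry-run it (assuming logrotate is installed): logrotate -d "$conf"
```

After review, a rule like this would normally be placed under /etc/logrotate.d/ and tested on a non-production node first.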