openGauss 5.0 Primary-Standby Cluster: Routine Maintenance

Published: 2023-05-07 10:22:11  Author: 耀阳居士

In an earlier post we walked through setting up an openGauss primary-standby cluster:

openGauss 5.0 One-Primary, Two-Standby Replication Environment Setup Guide
https://www.cndba.cn/dave/article/116528

In this post we look at how to maintain that cluster.

 

1 Checking cluster status

View all nodes in the cluster:

[dave@www.cndba.cn ~]$ gs_om -t status --detail
[  CMServer State   ]

node       node_ip         instance                                     state
-------------------------------------------------------------------------------
1  oracle  192.168.56.105  1    /data/openGauss/data/cmserver/cm_server Primary
2  oracle2 192.168.56.106  2    /data/openGauss/data/cmserver/cm_server Standby
3  oracle3 192.168.56.107  3    /data/openGauss/data/cmserver/cm_server Standby

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node       node_ip         instance                             state            
---------------------------------------------------------------------------------
1  oracle  192.168.56.105  6001 /data/openGauss/install/data/dn P Standby Normal
2  oracle2 192.168.56.106  6002 /data/openGauss/install/data/dn S Primary Normal
3  oracle3 192.168.56.107  6003 /data/openGauss/install/data/dn S Standby Normal
[dave@www.cndba.cn ~]$
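For scripting, the current primary datanode can be pulled out of this output. Below is a minimal sketch, assuming the Datanode State table keeps the column layout shown above (node id, hostname, IP, instance id, data path, initial role letter, current role, status). The function name primary_dn is hypothetical and not part of openGauss.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "hostname ip" for every datanode whose current
# role is Primary, reading `gs_om -t status --detail` output on stdin.
primary_dn() {
    awk '
        /Datanode State/ { in_dn = 1; next }   # start of the datanode table
        /^\[/            { in_dn = 0 }          # any other bracketed line ends it
        in_dn && $1 ~ /^[0-9]+$/ && $7 == "Primary" { print $2, $3 }
    '
}

# Usage (on a cluster node, as omm):
#   gs_om -t status --detail | primary_dn
```

Note that the single letter in column 6 (P or S) is not the current role; the word after it is, which is why the sketch tests column 7.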

View a single node:

[dave@www.cndba.cn ~]$ gs_om -t status -h oracle
-----------------------------------------------------------------------

cluster_state             : Normal
redistributing            : No
balanced                  : No

-----------------------------------------------------------------------

node                      : 1
node_name                 : oracle

node                      : 1
instance_id               : 1
node_ip                   : 192.168.56.105
data_path                 : /data/openGauss/data/cmserver/cm_server
type                      : CMServer
instance_state            : Primary

node                      : 1
instance_id               : 6001
node_ip                   : 192.168.56.105
data_path                 : /data/openGauss/install/data/dn
type                      : Datanode
instance_state            : Standby
dcf_role                  : FOLLOWER
static_connections        : 2
HA_state                  : Normal
reason                    : Normal
sender_sent_location      : 0/6011EA8
sender_write_location     : 0/6011EA8
sender_flush_location     : 0/6011EA8
sender_replay_location    : 0/6011EA8
receiver_received_location: 0/6011EA8
receiver_write_location   : 0/6011EA8
receiver_flush_location   : 0/6011EA8
receiver_replay_location  : 0/6011E08
sync_state                : Async

node                      : 1
node_name                 : oracle

node                      : 1
instance_id               : 1
node_ip                   : 192.168.56.105
data_path                 : /data/openGauss/data/cmserver/cm_server
type                      : CMServer
instance_state            : Primary

node                      : 1
node_ip                   : 192.168.56.105
type                      : Fenced UDF
state                     : Normal

-----------------------------------------------------------------------

node_state                : Normal
-----------------------------------------------------------------------
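The location fields in this output can be used to estimate how far replay lags behind what the standby has received. The sketch below assumes the locations use the PostgreSQL-style "HI/LO" hexadecimal format, where HI is the upper 32 bits of the WAL position; the helper names lsn_to_bytes and lsn_diff are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: convert a WAL location written as "HI/LO" (hex)
# into a byte offset, and subtract two such locations.
lsn_to_bytes() {
    local hi="${1%%/*}" lo="${1##*/}"
    echo $(( (0x$hi << 32) + 0x$lo ))
}

lsn_diff() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# receiver_received_location vs. receiver_replay_location from the output above
lsn_diff 0/6011EA8 0/6011E08
```

For the values shown above, the difference is 160 bytes, so replay on this standby is essentially caught up.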

2 Starting and stopping the cluster

Run these operations as the omm user on any node of the cluster.

 


[dave@www.cndba.cn ~]$ gs_om -t stop
Stopping cluster.
=========================================
Successfully stopped cluster.
=========================================
End stop cluster.


[dave@www.cndba.cn ~]$ gs_om -t start
Starting cluster.
======================================================================
Successfully started primary instance. Wait for standby instance.
======================================================================
.
Successfully started cluster.
======================================================================
cluster_state      : Normal
redistributing     : No
node_count         : 3
Datanode State
    primary           : 1
    standby           : 2
    secondary         : 0
    cascade_standby   : 0
    building          : 0
    abnormal          : 0
    down              : 0

Successfully started cluster.


[dave@www.cndba.cn ~]$ gs_om -t status --detail
[  CMServer State   ]

node       node_ip         instance                                     state
-------------------------------------------------------------------------------
1  oracle  192.168.56.105  1    /data/openGauss/data/cmserver/cm_server Primary
2  oracle2 192.168.56.106  2    /data/openGauss/data/cmserver/cm_server Standby
3  oracle3 192.168.56.107  3    /data/openGauss/data/cmserver/cm_server Standby

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node       node_ip         instance                             state            
---------------------------------------------------------------------------------
1  oracle  192.168.56.105  6001 /data/openGauss/install/data/dn P Standby Normal
2  oracle2 192.168.56.106  6002 /data/openGauss/install/data/dn S Primary Normal
3  oracle3 192.168.56.107  6003 /data/openGauss/install/data/dn S Standby Normal
[dave@www.cndba.cn ~]$
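When a start is scripted, it can help to wait until the cluster actually reports Normal before moving on. The following is a rough sketch that polls the status command and looks for the cluster_state line shown above. The function name wait_cluster_normal and the STATUS_CMD override are hypothetical and exist only to make the sketch testable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a status command until it reports
# "cluster_state : Normal", or give up after a number of attempts.
# STATUS_CMD defaults to the gs_om call used in this post.
wait_cluster_normal() {
    local tries="${1:-30}" delay="${2:-2}" i=0
    while [ "$i" -lt "$tries" ]; do
        if ${STATUS_CMD:-gs_om -t status} 2>/dev/null |
            grep -Eq '^cluster_state[[:space:]]*:[[:space:]]*Normal'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage (as omm):
#   gs_om -t start && wait_cluster_normal 30 2 && echo "cluster is Normal"
```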

3 Switchover

First check the cluster status:

[dave@www.cndba.cn ~]$ gs_om -t status --detail
……
cluster_state   : Normal
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node       node_ip         instance                             state            
---------------------------------------------------------------------------------
1  oracle  192.168.56.105  6001 /data/openGauss/install/data/dn P Standby Normal
2  oracle2 192.168.56.106  6002 /data/openGauss/install/data/dn S Primary Normal
3  oracle3 192.168.56.107  6003 /data/openGauss/install/data/dn S Standby Normal
[dave@www.cndba.cn ~]$

Here the primary is 192.168.56.106. To promote 192.168.56.105 to primary, run the following as omm on 192.168.56.105:

[dave@www.cndba.cn ~]$ gs_ctl switchover -D /data/openGauss/install/data/dn
[2023-04-07 17:55:53.995][16727][][gs_ctl]: gs_ctl switchover ,datadir is /data/openGauss/install/data/dn 
[2023-04-07 17:55:53.995][16727][][gs_ctl]: switchover term (1)
[2023-04-07 17:55:54.008][16727][][gs_ctl]: waiting for server to switchover........
[2023-04-07 17:55:59.069][16727][][gs_ctl]: done
[2023-04-07 17:55:59.069][16727][][gs_ctl]: switchover completed (/data/openGauss/install/data/dn)

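For a given database, a new switchover cannot be started until the previous one has finished. If a switchover is issued while the workload is active, threads on the primary may fail to stop, and the switchover reports a timeout even though it is still running in the background; it completes once those threads stop. For example, while the primary is dropping a large partitioned table, it may not respond to the signal sent by the switchover.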
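Given these constraints, one plausible guard before issuing a switchover is to confirm that the cluster is Normal and that the target instance is currently a healthy standby. The sketch below is an assumption on our part rather than an official check; the function name can_switchover is hypothetical, and it relies on the status table layout shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: succeed only if cluster_state is Normal and the
# datanode with the given IP is currently "Standby Normal".
# Reads `gs_om -t status --detail` output on stdin.
can_switchover() {
    awk -v ip="$1" '
        /^cluster_state/ && $NF == "Normal" { normal = 1 }
        /Datanode State/ { in_dn = 1; next }
        /^\[/            { in_dn = 0 }
        in_dn && $3 == ip && $7 == "Standby" && $8 == "Normal" { standby = 1 }
        END { exit !(normal && standby) }
    '
}

# Usage (as omm, on the standby to be promoted):
#   gs_om -t status --detail | can_switchover 192.168.56.105 \
#       && gs_ctl switchover -D /data/openGauss/install/data/dn
```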

After a successful switchover or failover, run the following command to record the current primary and standby information:

[dave@www.cndba.cn ~]$ gs_om -t refreshconf
Generating dynamic configuration file for all nodes.
Successfully generated dynamic configuration file.

[dave@www.cndba.cn ~]$ gs_om -t status --detail
[  CMServer State   ]

node       node_ip         instance                                     state
-------------------------------------------------------------------------------
1  oracle  192.168.56.105  1    /data/openGauss/data/cmserver/cm_server Primary
2  oracle2 192.168.56.106  2    /data/openGauss/data/cmserver/cm_server Standby
3  oracle3 192.168.56.107  3    /data/openGauss/data/cmserver/cm_server Standby

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[  Datanode State   ]

node       node_ip         instance                             state            
---------------------------------------------------------------------------------
1  oracle  192.168.56.105  6001 /data/openGauss/install/data/dn P Primary Normal
2  oracle2 192.168.56.106  6002 /data/openGauss/install/data/dn S Standby Normal
3  oracle3 192.168.56.107  6003 /data/openGauss/install/data/dn S Standby Normal
[dave@www.cndba.cn ~]$

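One detail worth noting: while the cluster is healthy, killing the gaussdb process or stopping the primary with gs_ctl triggers an automatic change of primary, and the stopped instance is brought back up automatically as a standby: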

[dave@www.cndba.cn ~]$ ps -ef|grep openG
omm       4154     1  2 10:54 ?        00:11:15 /data/openGauss/install/app/bin/om_monitor -L /var/log/omm/omm/cm/om_monitor
omm      13943  4154 25 17:50 ?        00:03:07 /data/openGauss/install/app/bin/cm_agent
omm      13963     1 15 17:50 ?        00:01:50 /data/openGauss/install/app/bin/cm_server
omm      21983     1 15 18:02 ?        00:00:05 /data/openGauss/install/app/bin/gaussdb -D /data/openGauss/install/data/dn -M pending
omm      22761     1  0 18:03 ?        00:00:00 python3 /data/openGauss/install/om/script/local/CheckSshAgent.py
omm      22805  4878  0 18:03 pts/1    00:00:00 grep --color=auto openG
[dave@www.cndba.cn ~]$ kill -9 21983
[dave@www.cndba.cn ~]$ gs_ctl stop -D /data/openGauss/install/data/dn
[2023-04-07 18:06:25.733][24541][][gs_ctl]: gs_ctl stopped ,datadir is /data/openGauss/install/data/dn 
waiting for server to shut down..... done
server stopped
[dave@www.cndba.cn ~]$
[dave@www.cndba.cn ~]$ ps -ef|grep openG
omm       4154     1  2 10:54 ?        00:11:15 /data/openGauss/install/app/bin/om_monitor -L /var/log/omm/omm/cm/om_monitor
omm      13943  4154 25 17:50 ?        00:03:11 /data/openGauss/install/app/bin/cm_agent
omm      13963     1 15 17:50 ?        00:01:52 /data/openGauss/install/app/bin/cm_server
omm      22968     1 57 18:03 ?        00:00:01 /data/openGauss/install/app/bin/gaussdb -D /data/openGauss/install/data/dn -M pending
omm      22991  4878  0 18:03 pts/1    00:00:00 grep --color=auto openG
[dave@www.cndba.cn ~]$ 

[dave@www.cndba.cn2 ~]$ gs_om -t status --detail
[  CMServer State   ]

node       node_ip         instance                                     state
-------------------------------------------------------------------------------
1  oracle  192.168.56.105  1    /data/openGauss/data/cmserver/cm_server Primary
2  oracle2 192.168.56.106  2    /data/openGauss/data/cmserver/cm_server Standby
3  oracle3 192.168.56.107  3    /data/openGauss/data/cmserver/cm_server Standby

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node       node_ip         instance                             state            
---------------------------------------------------------------------------------
1  oracle  192.168.56.105  6001 /data/openGauss/install/data/dn P Standby Normal
2  oracle2 192.168.56.106  6002 /data/openGauss/install/data/dn S Primary Normal
3  oracle3 192.168.56.107  6003 /data/openGauss/install/data/dn S Standby Normal
[dave@www.cndba.cn2 ~]$

4 Failover

The previous section covered the normal case. If the primary has failed, run the failover command on a standby instead.

If the original primary is still healthy, the failover command also succeeds, and high availability is restored automatically.

[dave@www.cndba.cn ~]$ gs_ctl failover -D /data/openGauss/install/data/dn
[2023-04-07 18:37:21.152][9364][][gs_ctl]: gs_ctl failover ,datadir is /data/openGauss/install/data/dn 
[2023-04-07 18:37:21.152][9364][][gs_ctl]: failover term (1)
[2023-04-07 18:37:21.163][9364][][gs_ctl]:  waiting for server to failover...
.[2023-04-07 18:37:22.193][9364][][gs_ctl]:  done
[2023-04-07 18:37:22.193][9364][][gs_ctl]:  failover completed (/data/openGauss/install/data/dn)

[dave@www.cndba.cn ~]$ gs_om -t refreshconf
Generating dynamic configuration file for all nodes.
Successfully generated dynamic configuration file.

[dave@www.cndba.cn ~]$ gs_om -t status --detail
[  CMServer State   ]

node       node_ip         instance                                     state
-------------------------------------------------------------------------------
1  oracle  192.168.56.105  1    /data/openGauss/data/cmserver/cm_server Standby
2  oracle2 192.168.56.106  2    /data/openGauss/data/cmserver/cm_server Primary
3  oracle3 192.168.56.107  3    /data/openGauss/data/cmserver/cm_server Standby

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[  Datanode State   ]

node       node_ip         instance                             state            
---------------------------------------------------------------------------------
1  oracle  192.168.56.105  6001 /data/openGauss/install/data/dn P Primary Normal
2  oracle2 192.168.56.106  6002 /data/openGauss/install/data/dn S Standby Normal
3  oracle3 192.168.56.107  6003 /data/openGauss/install/data/dn S Standby Normal
[dave@www.cndba.cn ~]$

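When the cluster is running normally, the primary-standby relationship is restored automatically after the switch. If a node shows Standby Need repair(Disconnected) and does not recover on its own, that node has to be rebuilt.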

Run the rebuild command on the node whose standby instance needs to be rebuilt:

[dave@www.cndba.cn3 ~]$ gs_ctl build -b auto -D /data/openGauss/install/data/dn
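To find out which nodes need this treatment, the status output can be filtered for the Need repair state. This is a small sketch, assuming the state text appears in the Datanode State table exactly as quoted above; the function name needs_rebuild is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "hostname ip" for every datanode whose state
# contains "Need repair", reading `gs_om -t status --detail` output on stdin.
needs_rebuild() {
    awk '
        /Datanode State/ { in_dn = 1; next }
        /^\[/            { in_dn = 0 }
        in_dn && /Need repair/ { print $2, $3 }
    '
}

# Usage (as omm):
#   gs_om -t status --detail | needs_rebuild
# then run the gs_ctl build command above on each node that is listed.
```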

5 Handling a dual-primary condition

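If the connection between the primary and standby instances is lost during a switch, for example because of a network failure or a full disk, and two primaries appear, the following steps can be used to recover: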

1. Check the current instance status of the database:

gs_om -t status --detail

If the output shows two instances in the Primary state, the cluster is in an abnormal state.

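2. Decide which node to demote to standby, then run the following command on that node to stop the service (substitute your own data directory for the example path).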

gs_ctl stop -D /home/omm/cluster/dn1/

3. Run the following command to start that node in standby mode.

gs_ctl start -D /home/omm/cluster/dn1/ -M standby

4. Save the primary and standby information.

gs_om -t refreshconf

 

5. Check the database status and confirm that the instance states have recovered.
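The steps above can be sketched as a small script. This is only an illustration under stated assumptions: count_primary_dn, run and demote_to_standby are hypothetical names, the parser assumes the Datanode State layout shown earlier in this post, and DRY_RUN defaults to 1 so the commands are only printed for review.

```shell
#!/usr/bin/env bash
# Hypothetical helper for steps 1 and 5: count datanodes whose current role
# is Primary, reading `gs_om -t status --detail` output on stdin.
count_primary_dn() {
    awk '
        /Datanode State/ { in_dn = 1; next }
        /^\[/            { in_dn = 0 }
        in_dn && $1 ~ /^[0-9]+$/ && $7 == "Primary" { n++ }
        END { print n + 0 }
    '
}

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 2 to 4, run on the node chosen for demotion.
demote_to_standby() {
    local datadir="$1"
    run gs_ctl stop -D "$datadir" &&
    run gs_ctl start -D "$datadir" -M standby &&
    run gs_om -t refreshconf
}

# Usage (as omm):
#   [ "$(gs_om -t status --detail | count_primary_dn)" -gt 1 ] \
#       && demote_to_standby /home/omm/cluster/dn1/
```

Set DRY_RUN=0 only after reviewing the printed commands, and rerun the count afterwards to confirm that a single primary remains.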