Monitoring a ZooKeeper Cluster with Prometheus (1)

Published: 2023-04-14 16:19:13 | Author: 記憶や空白

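The ZooKeeper version here is old (3.4.x), which predates the built-in Prometheus metrics provider introduced in ZooKeeper 3.6, so metrics are collected with zookeeper_exporter instead.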

1. Download the zookeeper_exporter collector

https://github.com/carlpett/zookeeper_exporter/releases/download/v1.1.0/zookeeper_exporter

2. Upload it to the /opt directory on the Linux host; create the directory if it does not exist

3. Make the binary executable

chmod 755 zookeeper_exporter

4. Write a systemd unit for zookeeper_exporter (run it on every node in the cluster)

vim /lib/systemd/system/zkexporter.service

    [Unit]
    Description=zookeeper_exporter
    After=network.target
    [Service]
    Type=simple
    User=root
    ExecStart=/opt/zookeeper_exporter -zookeeper 10.249.0.63:2181 -bind-addr :9143
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target

5. Run the following commands on each node to start the service

systemctl daemon-reload
systemctl start zkexporter.service
systemctl enable zkexporter.service
systemctl status zkexporter.service

6. Check the zookeeper_exporter status (if the output shows Active: active (running), the service started successfully)

7. View the collected metrics

curl localhost:9143/metrics
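The exporter returns plain Prometheus text format, so a single metric can be picked out for a quick sanity check. The helper below is a minimal sketch, not part of zookeeper_exporter: `zk_metric` is a hypothetical function name, and it assumes samples carry no trailing timestamp.

```shell
# zk_metric NAME: read Prometheus text-format metrics on stdin and print the
# value of every sample whose metric name is exactly NAME.
# Hypothetical helper for illustration; assumes no trailing timestamps.
zk_metric() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP/TYPE comment lines
    NF < 2 { next }               # skip blank or malformed lines
    {
      metric = $1
      sub(/[{].*/, "", metric)    # drop the label set, if any
      if (metric == name) print $NF
    }
  '
}

# Usage on a node running the exporter:
#   curl -s localhost:9143/metrics | zk_metric zk_outstanding_requests
```

An empty result most likely means the metric name is misspelled or the exporter could not reach ZooKeeper.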

8. Edit the Prometheus configuration file (prometheus.yml) to add a scrape job for the exporters
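The original post does not show the configuration itself, so the following is only a sketch of what the scrape job might look like. It assumes the job is named `prd_zookeeper` (the name the `up{job="prd_zookeeper"}` alert in step 10 expects) and that each exporter listens on port 9143 as in the systemd unit. The output file name `zk_scrape_job.yml` is hypothetical. The snippet writes the fragment to a scratch file for review; merge it by hand under the existing `scrape_configs:` key.

```shell
# Write an example scrape job to a scratch file (hypothetical name).
# Review it, then paste it under `scrape_configs:` in prometheus.yml.
cat > zk_scrape_job.yml <<'EOF'
  - job_name: prd_zookeeper          # must match the job label used by the alert rules
    static_configs:
      - targets:
          - 10.249.0.63:9143         # exporter address from the systemd unit
          # add one host:9143 entry for each remaining ZooKeeper node
EOF

cat zk_scrape_job.yml
```

The alert rule file from step 10 also has to be listed under `rule_files:` in the same prometheus.yml before Prometheus will evaluate it.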

9. Restart Prometheus and open http://localhost:9090

On the Targets page, a State of UP means the zookeeper_exporter service has been integrated successfully.
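The same check can be made from the command line through the Prometheus HTTP API endpoint `/api/v1/targets`, which reports a `health` field for each active target. The function below is a rough sketch under stated assumptions: `count_up_targets` is a hypothetical name, it counts healthy targets across all jobs rather than only the ZooKeeper job, and it relies on simple text matching instead of a real JSON parser.

```shell
# count_up_targets: read the JSON body of /api/v1/targets on stdin and print
# how many targets report "health":"up". Hypothetical helper; counts every
# job, not only the ZooKeeper one.
count_up_targets() {
  grep -o '"health"[[:space:]]*:[[:space:]]*"up"' | wc -l | tr -d ' '
}

# Usage on the Prometheus host:
#   curl -s http://localhost:9090/api/v1/targets | count_up_targets
```

If the count is lower than the number of configured targets, the Targets page shows the scrape error for each failing instance.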

10. Alert rule file (for reference only):

    groups:
    - name: zookeeperStatsAlert
      rules:
      - alert: ZookeeperOutstandingRequestsHigh
        expr: avg(zk_outstanding_requests) by (instance) > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }}"
          description: "Too many outstanding (queued) requests"
      - alert: ZookeeperPendingSyncsHigh
        expr: avg(zk_pending_syncs) by (instance) > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }}"
          description: "Too many pending syncs"
      - alert: ZookeeperAvgLatencyHigh
        expr: avg(zk_avg_latency) by (instance) > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }}"
          description: "Average response latency is too high"
      - alert: ZookeeperOpenFileDescriptorsHigh
        expr: zk_open_file_descriptor_count > zk_max_file_descriptor_count * 0.85
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }}"
          description: "Open file descriptors exceed 85% of the configured limit"
      - alert: ZookeeperDown
        expr: up{job="prd_zookeeper"} == 0
        for: 5s
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }}"
          description: "ZooKeeper server is down"
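      - alert: ZookeeperLeaderMissing
        # absent() returns 1 only when no matching series exists, so no comparison is needed
        expr: absent(zk_server_state{state="leader"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No instance reports state=leader"
          description: "ZooKeeper leader is missing"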
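Before Prometheus loads the rule file, it can be validated with `promtool check rules`, which ships with Prometheus. The wrapper below is a small sketch: `check_zk_rules` is a hypothetical function name, and the file path is whatever location the rules were saved to (the original post does not name it).

```shell
# check_zk_rules FILE: validate an alert rule file with promtool.
# Hypothetical wrapper; FILE is the path the rules were saved under.
check_zk_rules() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 127
  fi
  promtool check rules "$1"
}

# Usage:
#   check_zk_rules /path/to/zookeeper_rules.yml
```

A non-zero exit status points to a syntax or expression error in the rules, which is cheaper to catch here than after a Prometheus restart.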

11. Configure Grafana

Grafana dashboard ID: 11442