Prometheus + Alertmanager: Email Alert Notifications

Published 2023-05-24 17:34:09 by 技术颜良

Overview
Prometheus alerting requires the Alertmanager component. Alertmanager and Prometheus are two separate components, so Alertmanager must be installed and configured on its own. You define alerting rules (AlertRule) in Prometheus; Prometheus evaluates these rules periodically and, whenever a rule's trigger condition is met, sends the alert to Alertmanager.

In Prometheus, an alerting rule consists of two main parts:
Alert name: the name the user gives the rule.
Alert rule: defined by a PromQL expression; the alert fires once the expression has evaluated to true for a given duration (the for field).
The Prometheus server sends alerts to Alertmanager according to the alerting rules; Alertmanager then applies silencing, inhibition, and grouping/aggregation before sending notifications via Email, DingTalk, and so on.
Alertmanager features
Grouping: merges alerts of the same type into a single notification. Useful when an outage triggers a flood of alerts at once; grouping combines them into one notification instead of sending them all individually.
Silencing: a simple mechanism to mute alerts based on their labels, so that no notifications are sent for them during a given time window (see the amtool sketch after this list).
Inhibition: suppresses alerts that are merely consequences of another alert that has already fired, e.g. when a network outage triggers connection alerts from every dependent service.
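
For example, a silence can be created from the command line with amtool, which ships with Alertmanager. A sketch, assuming Alertmanager on localhost:9093; the alertname/instance values are illustrative, not from the original post:

# Mute NodeAgentStatus alerts for one host for two hours
./amtool silence add alertname=NodeAgentStatus instance=10.20.1.63:9100 --alertmanager.url=http://localhost:9093 --duration=2h --comment="planned maintenance"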
Alertmanager installation and configuration
Download
Download URL:
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz

Install and start Alertmanager
Extract it into a directory; here it lives alongside Prometheus under /data/prometheus, with a symlink for a stable path.
cd /data/prometheus
tar xvf alertmanager-0.19.0.linux-amd64.tar.gz
ln -s /data/prometheus/alertmanager-0.19.0.linux-amd64 alertmanager

Start it:

/bin/nohup /data/prometheus/alertmanager/alertmanager --config.file=/data/prometheus/alertmanager/alertmanager.yml &

Check that it is listening:

netstat -tnlp | grep 9093
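
Running under nohup is fine for a quick test; for longer-term use a systemd unit is more robust. A minimal sketch, assuming the /data/prometheus paths above (the unit name alertmanager.service is our choice, not from the original post):

# /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
After=network.target

[Service]
ExecStart=/data/prometheus/alertmanager/alertmanager --config.file=/data/prometheus/alertmanager/alertmanager.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable and start it with:

systemctl daemon-reload
systemctl enable --now alertmanager.service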

Alertmanager configuration
The Alertmanager configuration file is /data/prometheus/alertmanager/alertmanager.yml.

The **global** section configures the resolve timeout and the email (SMTP) delivery settings.
global:
  resolve_timeout: 5m    # how long to wait before marking an alert as resolved after it stops firing; default 5m

  # Email notifications
  smtp_smarthost: 'smtp.xxx.com:25'
  smtp_from: 'alter@xxx.com'
  smtp_auth_username: 'alter@xxx.com'
  smtp_auth_password: 'xxx123'
  smtp_require_tls: false
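
Before relying on these settings, it can save debugging time to confirm the SMTP host is reachable from the Alertmanager machine. A quick sketch, using the placeholder host from the config above:

# Check that the SMTP port accepts connections (nc is from the netcat package)
nc -vz smtp.xxx.com 25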

The route section defines the routing tree. Every alert enters the tree at the configured top-level route, which matches all alerts; from there each alert is dispatched to a receiver.

route:
  group_by: ['monitor_base','MySQLStatusAlert']  # label names to group alerts by; here they mirror the rule group names in the Prometheus rule files
  group_wait: 10s      # how long to wait after a new alert group is created before sending its first notification
  group_interval: 10s  # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 1h  # how long to wait before re-sending a notification that was already delivered successfully
  receiver: email-receiver  # default notification channel; must match a receiver name defined below
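
The tree can be extended with sub-routes that override the default receiver for a subset of alerts. A sketch, assuming the level label attached by the rules later in this post; ops-email is a hypothetical second receiver, not from the original config:

route:
  group_by: ['alertname', 'instance']
  receiver: email-receiver      # default for anything not matched below
  routes:
  - match:
      level: serious            # alerts labelled level: serious ...
    receiver: ops-email         # ... go to a hypothetical second receiver
    repeat_interval: 30m        # ... and are re-sent more often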

The receivers section defines the alert receivers; here, the destination email address.

receivers:
- name: 'email-receiver'
  email_configs:
  - to: 'yuhh@xxx.com'
    send_resolved: true  # set to true if this recipient should also be notified when an alert is resolved

The inhibit_rules section defines alert inhibition rules.

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
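
With this rule, while a critical alert is firing, any warning alert carrying the same alertname, dev, and instance label values is suppressed instead of sent. Note that this example matches on a severity label, whereas the rules below attach a level label; for inhibition to actually apply, the label name here must match the one your rules set.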

Restart the Alertmanager service so the new configuration takes effect.
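
With the nohup setup there is no unit file to restart, so either reload the configuration in place (Alertmanager re-reads alertmanager.yml on SIGHUP) or kill and relaunch the process. A sketch, assuming the paths used earlier:

# Option 1: hot-reload the configuration
kill -HUP $(pidof alertmanager)

# Option 2: full restart
pkill alertmanager
/bin/nohup /data/prometheus/alertmanager/alertmanager --config.file=/data/prometheus/alertmanager/alertmanager.yml &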

Prometheus alerting rule setup
Alert handling and notification are done by Alertmanager; the alerting rules themselves are configured on the Prometheus side.
See the official Prometheus alerting rules documentation for details.
Edit the Prometheus configuration file prometheus.yml:
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/usr/local/prometheus/alert-rules-base.yml"
  - "/usr/local/prometheus/alert-rules-mysql.yml"

The base alerting rules file /usr/local/prometheus/alert-rules-base.yml (as referenced above):

groups:
- name: monitor_base
  rules:
  - alert: CpuUsageAlert_warning
    expr: sum(avg(irate(node_cpu_seconds_total{mode!='idle'}[5m])) without (cpu)) by (instance) > 0.60
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 60% (current value: {{ $value }})"
  - alert: CpuUsageAlert_serious
    #expr: sum(avg(irate(node_cpu_seconds_total{mode!='idle'}[5m])) without (cpu)) by (instance) > 0.85
    expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{job=~".*",mode="idle"}[5m])) * 100)) > 85
    for: 3m
    labels:
      level: serious
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: MemUsageAlert_warning
    expr: avg by(instance) ((1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100) > 70
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }}: MEM usage is above 70% (current value is: {{ $value }})"
  - alert: MemUsageAlert_serious
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.90
    for: 3m
    labels:
      level: serious
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 90% (current value: {{ $value }})"
  - alert: DiskUsageAlert_warning
    expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 80
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Disk usage high"
      description: "{{ $labels.instance }}: Disk usage is above 80% (current value is: {{ $value }})"
  - alert: DiskUsageAlert_serious
    expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 90
    for: 3m
    labels:
      level: serious
    annotations:
      summary: "Instance {{ $labels.instance }} Disk usage high"
      description: "{{ $labels.instance }}: Disk usage is above 90% (current value is: {{ $value }})"
  - alert: NodeFileDescriptorUsage
    expr: avg by (instance) (node_filefd_allocated{} / node_filefd_maximum{}) * 100 > 60
    for: 2m
    labels:
      level: warning
    annotations:
      description: "{{ $labels.instance }}: File Descriptor usage is above 60% (current value is: {{ $value }})"
  - alert: NodeLoad15
    expr: avg by (instance) (node_load15{}) > 80
    for: 2m
    labels:
      level: warning
    annotations:
      description: "{{ $labels.instance }}: Load15 is above 80 (current value is: {{ $value }})"
  - alert: NodeAgentStatus
    expr: avg by (instance) (up{}) == 0
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "{{ $labels.instance }}: has been down"
      description: "{{ $labels.instance }}: Node_Exporter Agent is down (current value is: {{ $value }})"
  - alert: NodeProcsBlocked
    expr: avg by (instance) (node_procs_blocked{}) > 10
    for: 2m
    labels:
      level: warning
    annotations:
      description: "{{ $labels.instance }}: Node Blocked Procs detected! (current value is: {{ $value }})"
  - alert: NetworkTransmitRate
    #expr: avg by (instance) (floor(irate(node_network_transmit_bytes_total{device="ens192"}[2m]) / 1024 / 1024)) > 50
    expr: avg by (instance) (floor(irate(node_network_transmit_bytes_total{}[2m]) / 1024 / 1024)) > 20
    for: 2m
    labels:
      level: warning
    annotations:
      description: "{{ $labels.instance }}: Node Transmit Rate is above 20MB/s (current value is: {{ $value }})"
  - alert: NetworkReceiveRate
    #expr: avg by (instance) (floor(irate(node_network_receive_bytes_total{device="ens192"}[2m]) / 1024 / 1024)) > 50
    expr: avg by (instance) (floor(irate(node_network_receive_bytes_total{}[2m]) / 1024 / 1024)) > 30
    for: 2m
    labels:
      level: warning
    annotations:
      description: "{{ $labels.instance }}: Node Receive Rate is above 30MB/s (current value is: {{ $value }})"
  - alert: DiskReadRate
    expr: avg by (instance) (floor(irate(node_disk_read_bytes_total{}[2m]) / 1024)) > 100
    for: 2m
    labels:
      level: warning
    annotations:
      description: "{{ $labels.instance }}: Node Disk Read Rate is above 100KB/s (current value is: {{ $value }})"
  - alert: DiskWriteRate
    expr: avg by (instance) (floor(irate(node_disk_written_bytes_total{}[2m]) / 1024 / 1024)) > 20
    for: 2m
    labels:
      level: warning
    annotations:
      description: "{{ $labels.instance }}: Node Disk Write Rate is above 20MB/s (current value is: {{ $value }})"
Before putting an expr into a rule, test the expression in Prometheus's PromQL console to make sure it is syntactically valid and returns results. For example, to check network receive bandwidth:

avg by (instance) (floor(irate(node_network_receive_bytes_total{}[2m]) / 1024 / 1024)) > 1
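
The same expression can also be run against Prometheus's HTTP query API, which is convenient for scripting. A sketch, assuming Prometheus listens on localhost:9090:

# Run the expression via the query API; results appear as JSON under data.result
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=avg by (instance) (floor(irate(node_network_receive_bytes_total{}[2m]) / 1024 / 1024)) > 1'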

In a rule file, a set of related rules is defined in a group, and each group can contain multiple alerting rules (rule). An alerting rule consists of the following parts:
alert: the name of the alerting rule.
expr: the trigger condition, a PromQL expression; the rule fires for each time series that satisfies it.
for: optional evaluation wait time, e.g. 5m for 5 minutes; the alert is only sent once the condition has held for this long. While waiting, newly raised alerts are in the pending state.
labels: custom labels attached to the alert, e.g. status: warning to mark the severity, or service_name: name to identify the affected service.
annotations: additional information sent to Alertmanager along with the alert; summary carries a short description of the alert and description the details.
To check that alertmanager.yml is valid, run amtool from the alertmanager directory:

./amtool check-config alertmanager.yml

Restart the Prometheus service, then open http://localhost:9090/alerts to see the state of each configured alerting rule: green means inactive, yellow means pending (no notification sent yet), and red means firing (a notification has been sent).

Open the Alertmanager UI at http://10.20.1.63:9093/#/status to see the service status and the loaded config, including the email settings and receiver information.
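
Active alerts can also be listed from the command line with amtool (assuming the Alertmanager address above):

./amtool alert query --alertmanager.url=http://10.20.1.63:9093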


Check that the Prometheus service itself is running:

systemctl status prometheus.service

Testing
Take a node offline by stopping its node_exporter:

systemctl stop node_exporter.service
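
Once the for duration has elapsed, the alert moves from pending to firing; both states are visible through Prometheus's alerts API (assuming localhost:9090):

# List active alerts together with their state (pending/firing)
curl -s http://localhost:9090/api/v1/alerts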

Alternatively, drive up CPU usage by hand:

cat /dev/zero > /dev/null

After a short while, the alert notification arrives by email.
