Deploying Prometheus, Grafana monitoring, and DingTalk alerts with docker-compose (Part 4)

Published 2023-04-26 17:27:29 · Author: Nine4酷

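Part 4: Prometheus DingTalk alerts

The DingTalk integration belongs to the Alertmanager side of the setup: Alertmanager sends its alerts to a webhook service, which forwards them to a DingTalk group robot.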

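  1. Create a DingTalk group and add a robot
  • Create a test group: start a new group with two other people, then remove them, which leaves you with a private test group;

  • Open Group Settings -> Robots

  • Add a custom robot


  • Enable the "Sign" (加签) security option; the secret it generates is the `secret` value used in config.yml below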
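
To show what the signing secret is for, here is a minimal Python sketch of the signature as I understand DingTalk's custom-robot scheme: HMAC-SHA256 over "timestamp, newline, secret", keyed with the secret, then Base64 and URL encoding. prometheus-webhook-dingtalk does this itself when `secret` is set in config.yml, so the sketch is illustrative only. The function names are hypothetical, and the token and secret are the placeholders from this post.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    """Return the URL-encoded HMAC-SHA256 signature for one request."""
    # String to sign: the millisecond timestamp, a newline, then the secret.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        digestmod=hashlib.sha256,
    ).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))


def signed_webhook_url(webhook_url: str, secret: str,
                       timestamp_ms: Optional[int] = None) -> str:
    """Append the timestamp and sign query parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    signature = dingtalk_sign(secret, timestamp_ms)
    # The webhook URL already carries ?access_token=..., so append with "&".
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={signature}"


# Placeholder token and secret, not real credentials.
url = signed_webhook_url(
    "https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx",
    "SEC0xxxxxxxx",
    timestamp_ms=1682330163000,
)
print(url)
```

Because the timestamp is part of the signed string, every request carries a different signature, which is why only the secret (not a precomputed signature) is stored in the configuration.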

  2. Create the directory
[root@128-255-96 prometheus]# pwd
/home/prometheus/docker/prometheus
mkdir -p ./alertmanager/dingtalk && cd ./alertmanager/dingtalk && vim config.yml
  3. Write the config.yml configuration file
## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
templates:
#  - contrib/templates/legacy/template.tmpl
   - /root/contrib/templates/*.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    # secret for signature
    secret: SEC0xxxxxxxx
    message:
      title: '{{ template "_ding.link.title" . }}'
      text: '{{ template "_ding.link.content" . }}'
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    secret: SEC0xxxxxxxx
    message:
      title: '{{ template "ding.link.title" . }}'
      text: '{{ template "ding.link.content" . }}'
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    secret: SEC0xxxxxxxx
    # Customize template content
    message:
      # Use legacy template
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "legacy.content" . }}'
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    secret: SEC0xxxxxxxx
    mention:
      all: true #@ALL
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    secret: SEC0xxxxxxxx
    mention:
      mobiles: ['186****7521'] # @ specific users
  4. Write the dingtalk.tmpl file
mkdir -p contrib/templates/ && cd contrib/templates
vim dingtalk.tmpl
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
{{ end }}{{ end }}

{{ define "___text_alert_list_with_help" }}
{{ template "___text_alert_list" .Alerts.Firing }}
---
**Help:** {{ (index .Alerts.Firing 0).Annotations.description | markdown | html }}
{{ end }}
{{ define "___text_alert_list" }}{{ range . }}
---
**Alert:** {{ .Labels.alertname | upper }}    
**Severity:** {{ .Labels.severity | upper }}    
**Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}    
**Summary:** {{ .Annotations.summary | markdown | html }}    
**Labels:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{ define "___text_alertresovle_list" }}{{ range . }}
---
**Alert:** {{ .Labels.alertname | upper }}    
**Severity:** {{ .Labels.severity | upper }}    
**Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}    
**Ended at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}    
**Summary:** {{ .Annotations.summary | markdown | html }}    
**Labels:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}

{{/* Default */}}
{{ define "_default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "_default.content" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ if gt (len .Alerts.Firing) 0 -}}

![Alert icon](https://duojia-lemei.oss-cn-beijing.aliyuncs.com/ERROR.jpg)
**========Alert firing========**
{{/* template "___text_alert_list" .Alerts.Firing */}}
{{ template "___text_alert_list_with_help" . }}
{{- end }}

{{ if gt (len .Alerts.Resolved) 0 -}}
![Resolved icon](https://duojia-lemei.oss-cn-beijing.aliyuncs.com/OK.jpg)
**========Alert resolved========**
{{ template "___text_alertresovle_list" .Alerts.Resolved }}


{{- end }}
{{- end }}

{{/* Legacy */}}
{{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
{{ define "legacy.content" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}

{{/* Following names for compatibility */}}
{{ define "_ding.link.title" }}{{ template "_default.title" . }}{{ end }}
{{ define "_ding.link.content" }}{{ template "_default.content" . }}{{ end }}
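
Two pieces of the template are easy to misread: the status tag at the start of `_default.content`, and the Go time layout `2006.01.02 15:04:05` passed to `dateInZone`. The Python sketch below mirrors both. It is not part of the deployment; the function names are hypothetical, the UTC input time is an assumed value, and a fixed UTC+8 offset stands in for the `Asia/Shanghai` zone.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset used as a stand-in for the "Asia/Shanghai" zone (assumption).
SHANGHAI = timezone(timedelta(hours=8), "Asia/Shanghai")


def status_tag(status: str, firing_count: int) -> str:
    """Mirror: [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]"""
    tag = status.upper()
    if status == "firing":
        # The number of firing alerts is appended only while the group is firing.
        tag += f":{firing_count}"
    return f"[{tag}]"


def format_starts_at(starts_at: datetime) -> str:
    """Mirror: dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai"."""
    # Go's reference layout 2006.01.02 15:04:05 corresponds to %Y.%m.%d %H:%M:%S.
    return starts_at.astimezone(SHANGHAI).strftime("%Y.%m.%d %H:%M:%S")


print(status_tag("firing", 2))      # [FIRING:2]
print(status_tag("resolved", 0))    # [RESOLVED]
# Hypothetical StartsAt value in UTC.
print(format_starts_at(datetime(2023, 4, 24, 9, 56, 3, tzinfo=timezone.utc)))
# 2023.04.24 17:56:03
```

In the template itself the closing bracket is written as `\]`, a Markdown escape that DingTalk renders as a plain `]`.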

  5. Write the docker-compose-dingtalk.yml file
cd ../../../../
vim docker-compose-dingtalk.yml
version: '3'
services:
  dingtalk-alert:
    image: timonwong/prometheus-webhook-dingtalk
    container_name: dingtalk-alert
    restart: always
    ports:
      - "8060:8060"
    volumes:
      - /home/prometheus/docker/prometheus/alertmanager/dingtalk:/root
    command: --config.file=/root/config.yml
    networks:
      - prometheus

networks:
  prometheus:
    name: prometheus
  6. Start the container
docker-compose -f docker-compose-dingtalk.yml up -d
  7. Verify the deployment

Use Postman to send a test DingTalk message and confirm that it arrives in the group.
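
As an alternative to Postman, the same check can be scripted. The sketch below builds an Alertmanager-style webhook payload and posts it to the container started above. It assumes the service listens on localhost:8060 (the port mapped in the compose file) and exposes each target at `/dingtalk/<target>/send`, which is how I understand prometheus-webhook-dingtalk routes requests. The alert name, label values, and helper names are made up for the test.

```python
import json
import urllib.request


def webhook_url(target: str, host: str = "localhost", port: int = 8060) -> str:
    """URL of one target from config.yml (path layout is an assumption)."""
    return f"http://{host}:{port}/dingtalk/{target}/send"


def build_test_payload() -> dict:
    """Build a minimal firing notification in Alertmanager's webhook format."""
    labels = {"alertname": "TestAlert", "severity": "warning"}
    annotations = {
        "summary": "Test message from the webhook check",
        "description": "Safe to ignore",  # read by the custom template's help line
    }
    return {
        "version": "4",
        "status": "firing",
        "receiver": "dingtalk",
        "groupLabels": {"alertname": "TestAlert"},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://localhost:9093",  # hypothetical Alertmanager URL
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": annotations,
                "startsAt": "2023-04-24T09:56:03Z",
                "endsAt": "0001-01-01T00:00:00Z",
                "generatorURL": "http://localhost:9090/graph",
            }
        ],
    }


def send(target: str, payload: dict):
    """POST the payload to the webhook and return (HTTP status, response body)."""
    request = urllib.request.Request(
        webhook_url(target),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `send("webhook1", build_test_payload())` on the Docker host should produce a message in the test group if the token and secret in config.yml are valid.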


  8. Add alerting rules
vim prometheus/rules/mssql_rules.yml
  • Edit mssql_rules.yml
groups:
- name: MSSQL告警规则
  rules:
  - alert: '待处理队列超长'
    expr: mssql_current_exec_num > 10
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "{{$labels.exported_instance}}: Too many scripts executing.(Current value is: {{$value}})"
      description: "Run the following statement for details:
        SELECT
         der.[session_id],der.[blocking_session_id],
         sp.lastwaittype,sp.hostname,sp.program_name,sp.loginame,
         der.[start_time] AS '开始时间',
         der.[status] AS '状态',
         dest.[text] AS 'sql语句',
         DB_NAME(der.[database_id]) AS '数据库名',
         der.[wait_type] AS '等待资源类型',
         der.[wait_time] AS '等待时间',
         der.[wait_resource] AS '等待的资源',
         der.[logical_reads] AS '逻辑读次数'
        FROM sys.[dm_exec_requests] AS der
        INNER JOIN master.dbo.sysprocesses AS sp ON der.session_id=sp.spid
        CROSS APPLY  sys.[dm_exec_sql_text](der.[sql_handle]) AS dest
        WHERE [session_id]>50 AND session_id<>@@SPID
        ORDER BY der.[session_id];"

  - alert: '数据库状态异常'
    expr: mssql_database_state != 0
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "{{$labels.exported_instance}}: {{$labels.database}} is not online.(Current state is {{$value}})."
      description: "0=ONLINE 1=RESTORING 2=RECOVERING 3=RECOVERY_PENDING 4=SUSPECT 5=EMERGENCY 6=OFFLINE 7=COPYING 10=OFFLINE_SECONDARY \r\nRun the following statement for details:
        SELECT [name] AS [database],[state] FROM sys.databases;"

  - alert: '数据库产生死锁'
    expr: mssql_current_deadlocks != 0
    for: 3m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "{{$labels.exported_instance}}: deadlocks occurs.(Current count is {{$value}})"
      description: "Run the following statement for details:
        SELECT
         request_session_id spid,
         DB_NAME(resource_database_id) [DataBase],
         OBJECT_NAME(resource_associated_entity_id) TableName
        FROM sys.dm_tran_locks
        WHERE resource_type='OBJECT';
        DBCC INPUTBUFFER(spid);"

  - alert: '脚本执行耗时过长'
    expr: mssql_long_elapsed_count != 0
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "{{$labels.exported_instance}}: Sql scripts execute for long time.(Current count is {{$value}})"
      description: "Run the following statement for details:
        SELECT
          (total_elapsed_time / execution_count)/1000 N'平均时间ms'
          ,total_elapsed_time/1000 N'总花费时间ms'
          ,total_worker_time/1000 N'所用的CPU总时间ms'
          ,total_physical_reads N'物理读取总次数'
          ,total_logical_reads/execution_count N'每次逻辑读次数'
          ,total_logical_reads N'逻辑读取总次数'
          ,total_logical_writes N'逻辑写入总次数'
          ,execution_count N'执行次数'
          ,SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
          ((CASE statement_end_offset
          WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset END
          - qs.statement_start_offset)/2) + 1) N'执行语句'
          ,db_name(st.dbid) N'数据库名'
          ,creation_time N'语句编译时间'
          ,last_execution_time N'上次执行时间'
        FROM sys.dm_exec_query_stats AS qs
          CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
        WHERE creation_time > DATEADD(S, -15, GETDATE()) /* BETWEEN '2023-04-20 00:00:00' AND '2023-04-22 00:00:00' */
        AND (total_elapsed_time / execution_count)/(1000) > 1000
        ORDER BY total_elapsed_time / execution_count DESC;"


  - alert: mssql引擎服务宕机
    expr:  windows_service_state{state="running",exported_name="mssqlserver"}!=1
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"

  - alert: mssql代理服务宕机
    expr: windows_service_state{exported_name="sqlserveragent",state="running"}!=1
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"

  - alert: mssql引擎服务重启
    expr: mssql_db_uptime < 3600
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"
      description: "The mssql engine service restarted within the last hour; uptime since restart: {{ $value }} seconds"

  - alert: mssql数据库不可用/不可访问
    expr: mssql_current_state_dbState !=0
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"
      description: "db:{{ $labels.db }}\n value:{{ $labels.value }}={{ $value }} "

  - alert: mssql阻塞
    expr: sum(mssql_current_state_blocking)>5
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"
      description: "mssql blocked requests > 5, current: {{ $value }} "

  - alert: mssql请求过多
    expr: sum(mssql_current_state_requests)>100
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"
      description: "mssql requests > 100, current: {{ $value }} "

  - alert: mssql死锁产生
    expr: increase(mssql_counter{type_object="SQLServer:Locks",type_counter="Number of Deadlocks/sec",type_instance="_Total"}[5m])>0
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"
      description: "mssql deadlocks in the last 5 minutes: {{ $value }} "

  - alert: mssql作业执行错误
    expr: increase(mssql_job_state_today[5m])>0
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"
      description: "mssql job execution errors today: {{ $value }} "

  - alert: mssql镜像状态变化
    expr: increase(mssql_mirror_sync{value="status"}[5m])!=0
    for: 1m
    labels:
      severity: warning
      notify_type: dingtalk
    annotations:
      summary: "Details: {{ $labels }}"
      description: "db:{{ $labels.db }}\n value:{{ $labels.value }}={{ $value }} "
  • Restart prometheus
docker-compose -f docker-compose-prometheus.yml restart
  • Check the DingTalk message

[FIRING:2] 数据库状态异常

========Alert firing========

Alert: 数据库状态异常
Severity: WARNING
Started at: 2023.04.24 17:56:03
Summary: 128.0.23.17:1433: IISLogDB is not online.(Current state is 6).
Labels:

alertname: 数据库状态异常
database: IISLogDB
exported_instance: 128.0.23.17:1433
exporter_type: prom-mssql-exporter
host: 128.0.23.17:1433
instance: 128.0.255.96:14001
job: prometheus-mssql-exporter
monitor: line-monitor
notify_type: dingtalk

Alert: 数据库状态异常
Severity: WARNING
Started at: 2023.04.24 17:56:03
Summary: sqlserver_xulq: IISLogDB is not online.(Current state is 6).
Labels:

alertname: 数据库状态异常
database: IISLogDB
exported_instance: sqlserver_xulq
exported_job: sql-exporter
exporter_type: sql-exporter
instance: sql-exporter:9399
job: sql-exporter
monitor: line-monitor
notify_type: dingtalk

Help: 0=ONLINE 1=RESTORING 2=RECOVERING 3=RECOVERY_PENDING 4=SUSPECT 5=EMERGENCY 6=OFFLINE 7=COPYING 10=OFFLINE_SECONDARY
Run the following statement for details: SELECT [name] AS [database],[state] FROM sys.databases;
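
A message like the one above is sent only after the rule's expression has stayed true for the whole `for:` window (1m for this rule); until then the alert is pending. The Python sketch below simulates that state machine for a threshold rule such as `mssql_current_exec_num > 10`. It is a simplified model, not Prometheus code: the 30-second evaluation interval and the sample values are assumptions.

```python
from datetime import timedelta
from typing import List


def alert_states(samples: List[float], threshold: float,
                 hold_for: timedelta, interval: timedelta) -> List[str]:
    """Return inactive / pending / firing for each rule evaluation."""
    states = []
    active_for = None  # how long the condition has held; None while it is false
    for value in samples:
        if value > threshold:
            # First true evaluation starts the clock; later ones extend it.
            active_for = timedelta(0) if active_for is None else active_for + interval
            states.append("firing" if active_for >= hold_for else "pending")
        else:
            # A single false evaluation resets the clock.
            active_for = None
            states.append("inactive")
    return states


# Hypothetical metric values, one per evaluation, 30 seconds apart.
states = alert_states(
    samples=[5, 12, 14, 15, 11, 3],
    threshold=10,
    hold_for=timedelta(minutes=1),
    interval=timedelta(seconds=30),
)
print(states)
# ['inactive', 'pending', 'pending', 'firing', 'firing', 'inactive']
```

A short spike that clears before the window elapses never leaves the pending state, so no DingTalk message is sent for it.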