Prometheus环境部署-526互联

一、Prometheus简介

Prometheus是最初在SoundCloud上构建的开源系统监视和警报工具包。自2012年成立以来，许多公司和组织都采用了Prometheus，该项目拥有非常活跃的开发人员和用户社区。Prometheus 于2016年加入了 Cloud Native Computing Foundation，这是继Kubernetes之后的第二个托管项目

Exporter是一个采集监控数据并通过Prometheus监控规范对外提供数据的组件，能为Prometheus提供监控的接口。

Exporter将监控数据采集的端点通过HTTP服务的形式暴露给Prometheus Server，Prometheus Server通过访问该

Exporter提供的Endpoint端点，即可获取到需要采集的监控数据。不同的Exporter负责不同的业务。

二、环境准备

系统	ip	角色	主机名
centos7.5	192.168.109.116	prometheus、grafana	pro01
centos7.5	192.168.109.116	node1	pro02

文档：https://prometheus.io/docs/introduction/overview/

下载prometheus各组件：

https://prometheus.io/download/

三、部署Prometheus

1、下载最新版本Prometheus解压安装

[root@pro01 ~]# ll
总用量 64732
-rw-------. 1 root root     1245 1月  11 22:53 anaconda-ks.cfg
-rw-r--r--. 1 root root 66280932 2月  19 13:54 prometheus-2.25.0.linux-amd64.tar.gz
[root@pro01 ~]# tar xzvf prometheus-2.25.0.linux-amd64.tar.gz 
prometheus-2.25.0.linux-amd64/
prometheus-2.25.0.linux-amd64/consoles/
prometheus-2.25.0.linux-amd64/consoles/index.html.example
prometheus-2.25.0.linux-amd64/consoles/node-cpu.html
prometheus-2.25.0.linux-amd64/consoles/node-disk.html
prometheus-2.25.0.linux-amd64/consoles/node-overview.html
prometheus-2.25.0.linux-amd64/consoles/node.html
prometheus-2.25.0.linux-amd64/consoles/prometheus-overview.html
prometheus-2.25.0.linux-amd64/consoles/prometheus.html
prometheus-2.25.0.linux-amd64/console_libraries/
prometheus-2.25.0.linux-amd64/console_libraries/menu.lib
prometheus-2.25.0.linux-amd64/console_libraries/prom.lib
prometheus-2.25.0.linux-amd64/prometheus.yml
prometheus-2.25.0.linux-amd64/LICENSE
prometheus-2.25.0.linux-amd64/NOTICE
prometheus-2.25.0.linux-amd64/prometheus
prometheus-2.25.0.linux-amd64/promtool
[root@pro01 ~]# mv prometheus-2.25.0.linux-amd64 /usr/local/prometheus
[root@pro01 ~]# ll /usr/local/prometheus/
总用量 167984
drwxr-xr-x. 2 3434 3434       38 2月  18 00:11 console_libraries
drwxr-xr-x. 2 3434 3434      173 2月  18 00:11 consoles
drwxr-xr-x. 6 root root      138 2月  20 09:27 data
-rw-r--r--. 1 3434 3434    11357 2月  18 00:11 LICENSE
-rw-r--r--. 1 3434 3434     3420 2月  18 00:11 NOTICE
-rwxr-xr-x. 1 3434 3434 91044140 2月  17 22:19 prometheus
-rw-r--r--. 1 3434 3434     1015 2月  19 14:56 prometheus.yml
-rwxr-xr-x. 1 3434 3434 80948693 2月  17 22:21 promtool

2、启动Prometheus

[root@pro01 prometheus]# ./prometheus 
level=info ts=2021-02-20T01:27:13.415Z caller=main.go:366 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2021-02-20T01:27:13.415Z caller=main.go:404 msg="Starting Prometheus" version="(version=2.25.0, branch=HEAD, revision=a6be548dbc17780d562a39c0e4bd0bd4c00ad6e2)"
level=info ts=2021-02-20T01:27:13.415Z caller=main.go:409 build_context="(go=go1.15.8, user=root@615f028225c9, date=20210217-14:17:24)"
level=info ts=2021-02-20T01:27:13.415Z caller=main.go:410 host_details="(Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 pro01 (none))"
level=info ts=2021-02-20T01:27:13.415Z caller=main.go:411 fd_limits="(soft=1024, hard=4096)"
level=info ts=2021-02-20T01:27:13.415Z caller=main.go:412 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-02-20T01:27:13.419Z caller=web.go:532 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-02-20T01:27:13.420Z caller=main.go:779 msg="Starting TSDB ..."
level=info ts=2021-02-20T01:27:13.420Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1613715656312 maxt=1613721600000 ulid=01EYWSF7XTDZVVT2BVARJZBTFA
level=info ts=2021-02-20T01:27:13.420Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1613721611311 maxt=1613728800000 ulid=01EYYGGVN4PHHTCFHNKME9FVB4
level=info ts=2021-02-20T01:27:13.424Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-02-20T01:27:13.425Z caller=head.go:668 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-02-20T01:27:13.425Z caller=head.go:682 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=215.709µs
level=info ts=2021-02-20T01:27:13.425Z caller=head.go:688 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-02-20T01:27:13.433Z caller=head.go:714 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2021-02-20T01:27:13.469Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=2 maxSegment=5
level=info ts=2021-02-20T01:27:13.477Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=3 maxSegment=5
level=info ts=2021-02-20T01:27:13.479Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=4 maxSegment=5
level=info ts=2021-02-20T01:27:13.479Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=5 maxSegment=5
level=info ts=2021-02-20T01:27:13.479Z caller=head.go:745 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=7.556609ms wal_replay_duration=46.502062ms total_replay_duration=54.292801ms
level=info ts=2021-02-20T01:27:13.481Z caller=main.go:799 fs_type=XFS_SUPER_MAGIC
level=info ts=2021-02-20T01:27:13.481Z caller=main.go:802 msg="TSDB started"
level=info ts=2021-02-20T01:27:13.481Z caller=main.go:928 msg="Loading configuration file" filename=prometheus.yml
level=info ts=2021-02-20T01:27:13.512Z caller=main.go:959 msg="Completed loading of configuration file" filename=prometheus.yml totalDuration=30.867346ms remote_storage=1.197µs web_handler=357ns query_engine=3.069µs scrape=30.310217ms scrape_sd=136.14µs notify=62.312µs notify_sd=15.155µs rules=4.274µs
level=info ts=2021-02-20T01:27:13.512Z caller=main.go:751 msg="Server is ready to receive web requests."
level=info ts=2021-02-20T01:27:26.336Z caller=compact.go:498 component=tsdb msg="write block resulted in empty block" mint=1613728800000 maxt=1613736000000 duration=19.585124ms
level=info ts=2021-02-20T01:27:26.337Z caller=head.go:847 component=tsdb msg="Head GC completed" duration=1.132169ms
level=info ts=2021-02-20T01:27:26.338Z caller=checkpoint.go:95 component=tsdb msg="Creating checkpoint" from_segment=2 to_segment=3 mint=1613736000000
level=info ts=2021-02-20T01:27:26.364Z caller=head.go:944 component=tsdb msg="WAL checkpoint complete" first=2 last=3 duration=26.464449ms

3、查看9090端口是否存在

4、在浏览器输入http://192.168.109.116:9090出现如下界面代表安装成功

四、部署node节点

1、下载组件node_exporter到node服务器

[root@pro02 ~]# ll
总用量 9032
-rw-------. 1 root root    1245 1月  11 22:53 anaconda-ks.cfg
-rw-r--r--. 1 root root 9242615 2月  19 15:09 node_exporter-1.1.1.linux-amd64.tar.gz
[root@pro02 ~]# tar xzvf node_exporter-1.1.1.linux-amd64.tar.gz 
node_exporter-1.1.1.linux-amd64/
node_exporter-1.1.1.linux-amd64/LICENSE
node_exporter-1.1.1.linux-amd64/NOTICE
node_exporter-1.1.1.linux-amd64/node_exporter
[root@pro02 ~]# mv node_exporter-1.1.1.linux-amd64 /usr/local/node_exporter
[root@pro02 ~]# ll /usr/local/node_exporter/
总用量 18996
-rw-r--r--. 1 3434 3434    11357 2月  13 21:42 LICENSE
-rwxr-xr-x. 1 3434 3434 19170241 2月  13 21:31 node_exporter
-rw-------. 1 root root   134269 2月  20 09:53 nohup.out
-rw-r--r--. 1 3434 3434      463 2月  13 21:42 NOTICE

2、启动node_exporter

[root@pro02 node_exporter]# nohup ./node_exporter &
[1] 16916

3、查看端口9100

4、在Prometheus中加入node节点

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'nodes' 
    static_configs:
     - targets:
       - 192.168.109.117:9100 #多个节点以列表形式排列
       - 192.168.109.118:9100

5，重启Prometheus 并在浏览器中查看结果

6，浏览器输入http://192.168.109.117:9100/metrics 查看node exporter展示了prometheus可以拉取的指标，包括在输出中更下方的各种系统指标（带有前缀node_）。要查看这些指标（以及帮助和类型信息）

五、通过PromQL聚合操作验证

1、node节点机器cpu在5分钟内的平均使用率

(1-avg (irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))*100

2、绘图

六：基于文件的服务发现

基于文件的服务发现是仅仅优于静态配置的服务发现方式，它不依赖于任何平台或第三方服务，因而也是最简单和通用的实现方式

prometheus Server定期从文件中加载target信息，文件可以使用json和yaml格式，包含定义的target列表以及可选的标签信息

案例：

1、创建目录 mkdir -p /usr/local/prometheus/file-sd/targets 详情如下

[root@pro01 file-sd]# ll
总用量 4
-rw-r--r--. 1 root root 1079 2月  20 16:26 prometheus.yml
drwxr-xr-x. 2 root root   61 2月  20 16:26 targets
[root@pro01 file-sd]# pwd
/usr/local/prometheus/file-sd
[root@pro01 file-sd]# ll targets/
总用量 8
-rw-r--r--. 1 root root 83 2月  20 16:26 nodes-linux.yaml
-rw-r--r--. 1 root root 87 2月  20 16:25 prometheus-servers.yaml

2、编辑prometheus服务启动文件 prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    file_sd_configs:
    - files:
      - targets/prometheus-*.yaml
      refresh_interval: 2m
  - job_name: 'nodes' #新增job
    file_sd_configs: #代表使用文件发现功能
    - files:
      - targets/nodes-*.yaml  #加载targets目录下面的所有以nodes开头的yaml文件
      refresh_interval: 2m  #每隔2分钟轮询一次

3、编辑prometheus-servers.yaml文件

- targets:
  - 192.168.109.116:9090
  labels:
    app: parmetheus
    job: parmetheus

4、编辑node节点文件nodes-linux.yaml

- targets:
  - 192.168.109.117:9100
  labels:
    app: node-exporter
    job: node

5、通过指定配置文件方式重启prometheus服务

[root@pro01 prometheus]# ./prometheus --config.file=./file-sd/prometheus.yml
level=info ts=2021-02-20T08:27:34.151Z caller=main.go:366 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2021-02-20T08:27:34.151Z caller=main.go:404 msg="Starting Prometheus" version="(version=2.25.0, branch=HEAD, revision=a6be548dbc17780d562a39c0e4bd0bd4c00ad6e2)"
level=info ts=2021-02-20T08:27:34.151Z caller=main.go:409 build_context="(go=go1.15.8, user=root@615f028225c9, date=20210217-14:17:24)"
level=info ts=2021-02-20T08:27:34.151Z caller=main.go:410 host_details="(Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 pro01 (none))"/6，

6、验证浏览器输入http://192.168.109.116:9090/targets

七：基于consul的服务发现

consul简介：

一款基于golang开发的开源工具，面向分布式，服务化的系统提供服务注册，服务发现和配置管理的功能

提供服务注册/发现、健康检查、key/value存储，多数据中心和分布式一致性保证等功能

1，下载地址

https://www.consul.io/downloads

2，安装

下载的consul_1.9.3_linux_amd64.zip包直接解压到/usr/local/bin 目录

3，服务启动

[root@pro01 ~]# mkdir -p /consul/data/
[root@pro01 ~]# mkdir /etc/consul/
[root@pro01 ~]# ./consul agent -dev -ui -data-dir=/consul/data/ -config-dir=/etc/consul/ -client=0.0.0.0
-bash: ./consul: 没有那个文件或目录
[root@pro01 ~]# consul agent -dev -ui -data-dir=/consul/data/ -config-dir=/etc/consul/ -client=0.0.0.0
==> Starting Consul agent...
           Version: '1.9.3'
           Node ID: '81c2539d-a878-226d-8ee9-d6dffd810238'
         Node name: 'pro01'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
      Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

4,在浏览器输入url：http://192.168.109.116:8500/ui 出现如下界面表示安装成功

5，配置通过自发现，发现prometheus node节点

创建prometheus启动加载文件prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    consul_sd_configs:
    - server: "192.168.109.116:8500"
      tags:
      - "prometheus"
      refresh_interval: 2m
  - job_name: 'nodes'
    consul_sd_configs: #由file变为consul
    - server: "192.168.109.116:8500"
      tags: #consul通过该tags去自发现
      - "nodes"
      refresh_interval: 2m

创建nodes.json

多个node节点配置如下

{
  "services": [
   {
     "id": "node-exporter-node02",
     "name": "node02",
     "address": "192.168.109.117",
     "port": 9100,
     "tags": ["nodes", "linux"],
      "check": [{
     "http":"http://192.168.109.117:9100/metrics",
         "interval": "5s"  
    }]     
  },
  {
     "id": "node-exporter-node01",
     "name": "node01",
     "address": "192.168.109.116",
     "port": 9100,
     "tags": ["nodes", "linux"],
      "check": [{
         "http":"http://192.168.109.116:9100/metrics",
         "interval": "5s"  
    }]
  }      
 ]
}

单个node节点配置如下

{
  "service": {
     "id": "node-exporter-node01",
     "name": "node01",
     "address": "192.168.109.117",
     "port": 9100,
     "tags": ["nodes", "linux"], #该列表需要包含prometheus.yml里面对应的标签
      "check": [{
     "http":"http://192.168.109.117:9100/metrics",
         "interval": "5s"  
    }]     
  }
}

把nodes.json拷贝到/etc/consul目录下面并重新加载consul服务

[root@pro01 consul]# consul reload
Configuration reload triggered

通过web查看

6，启动prometheus

[root@pro01 prometheus]# ./prometheus --config.file=./consul_sd/prometheus.yml
level=info ts=2021-02-23T03:48:36.254Z caller=main.go:366 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2021-02-23T03:48:36.254Z caller=main.go:404 msg="Starting Prometheus" version="(version=2.25.0, branch=HEAD, revision=a6be548dbc17780d562a39c0e4bd0bd4c00ad6e2)"
level=info ts=2021-02-23T03:48:36.254Z caller=main.go:409 build_context="(go=go1.15.8, user=root@615f028225c9, date=20210217-14:17:24)"
level=info ts=2021-02-23T03:48:36.254Z caller=main.go:410 host_details="(Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 pro01 (none))"
level=info ts=2021-02-23T03:48:36.254Z caller=main.go:411 fd_limits="(soft=1024, hard=4096)"
level=info ts=2021-02-23T03:48:36.254Z caller=main.go:412 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-02-23T03:48:36.262Z caller=web.go:532 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-02-23T03:48:36.264Z caller=main.go:779 msg="Starting TSDB ..."

浏览器输入http://192.168.109.116:9090 查看

八：部署grafana

1，下载包

通过地址https://mirrors.bfsu.edu.cn/grafana/yum/rpm/下载最新的grafana的rpm包

2，部署

执行安装命令

[root@pro01 ~]# yum install -y grafana-7.4.2-1.x86_64.rpm

查看安装的grafana包

[root@pro01 ~]# rpm -ql grafana

默认配置grafana.ini保持默认不需要更改

[root@pro01 grafana]# pwd
/etc/grafana
[root@pro01 grafana]# ll
总用量 40
-rw-r-----. 1 root grafana 33410 2月  23 13:50 grafana.ini
-rw-r-----. 1 root grafana  2270 2月  23 13:50 ldap.toml
drwxr-xr-x. 6 root grafana    75 2月  23 13:50 provisioning

启动grafana

[root@pro01 ~]# systemctl start grafana-server

查看端口默认3000

3，登陆

浏览器访问：http://192.168.109.116:3000

默认账密admin/admin 初次登陆需要修改密码

4，添加数据源

5，导入grafana的dashboards模板

官网地址:https://grafana.com/grafana/dashboards?dataSource=prometheus&search=node

在grafana中粘贴id会自动联网下载模板

应用模板

模板里面的panel会根据node节点采集到的数据进行自动填充

九：邮件告警

1、安装alertmanager

下载地址https://prometheus.io/download/

解压修改配置文件既可使用：

[root@pro01 local]# cd alertmanager/
[root@pro01 alertmanager]# ll
总用量 51648
-rwxr-xr-x. 1 3434 3434 28871879 6月  17 2020 alertmanager
-rw-r--r--. 1 root root      659 2月  26 17:38 alertmanager.yml
-rw-r--r--. 1 3434 3434      380 6月  17 2020 alertmanager.ymlbak
-rwxr-xr-x. 1 3434 3434 23987848 6月  17 2020 amtool
drwxr-xr-x. 2 root root       35 3月   1 21:17 data
-rw-r--r--. 1 3434 3434    11357 6月  17 2020 LICENSE
-rw-r--r--. 1 3434 3434      457 6月  17 2020 NOTICE
[root@pro01 alertmanager]#

2,编写邮件告警外发文件

[root@pro01 alertmanager]# pwd
/usr/local/alertmanager
[root@pro01 alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1h
  receiver: 'email-me'
receivers:
- name: 'email-me'
  email_configs:
  - to: 'wangxu13968@hundsun.com'
    from: 'hanxn@hundsun.cn'
    smarthost: 'smtp.hundsun.cn:80'
    auth_username: 'hanxn@hundsun.cn'
    auth_identity: 'hanxn@hundsun.cn'
    auth_password: 'GoodBoy123'
    require_tls: false
  #- url: 'http://127.0.0.1:5001/'
inhibit_rules:  #告警抑制机制，避免同类告警多次发送
  - source_match: #原告警内容
      alertname: 'InstanceDown'
      severity: 'critical'
    target_match: #新告警 和 原告警比较需要满足的条件
      alertname: 'InstanceDown'
      severity: 'critical'
    equal: #标签值相等
      - instance

3，启动alertmanager服务

 #默认加载alertmanager.yml
[root@pro01 alertmanager]# ./alertmanager  
level=info ts=2021-03-01T12:47:59.005Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-03-01T12:47:59.005Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-03-01T12:47:59.007Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=192.168.109.116 port=9094
level=info ts=2021-03-01T12:47:59.013Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2021-03-01T12:47:59.077Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=alertmanager.yml
level=info ts=2021-03-01T12:47:59.078Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=alertmanager.yml
level=info ts=2021-03-01T12:47:59.084Z caller=main.go:485 msg=Listening address=:9093
level=info ts=2021-03-01T12:48:01.013Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000454316s
level=info ts=2021-03-01T12:48:09.015Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.001984594s

4，编辑Prometheus启动文件

[root@pro01 alter-config]# cat prometheus.yml 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - file_sd_configs:
    - files:
      - "targets/alertmanagers*.yaml"  #基于文件发现规则，将告警内容对接可以发送告警信息的alertmanager
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files: 
  # - "first_rules.yml"
  # - "second_rules.yml"
  #- "rules/*.yaml"
  - "alert_rules/*.yaml"   #指向告警规则

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    file_sd_configs:
    - files:
      - targets/prometheus-*.yaml
      refresh_interval: 2m
  - job_name: 'nodes'
    file_sd_configs:
    - files:
      - targets/nodes-*.yaml
      refresh_interval: 2m
  - job_name: 'alertmanager'
    file_sd_configs:
    - files:
      - targets/alertmanagers*.yaml
      refresh_interval: 2m

5,定义alertmanager.yaml文件

[root@localhost targets]# cat alertmanager.yaml 
- targets:
  - 192.168.101.116:9093
  labels:
    app: alertmanager

6,自定义告警规则

[root@pro01 alter-config]# cat alert_rules/instance_down.yaml 
groups:
- name: AllInstances
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 20s  #表示延迟20秒告警
    annotations:
      title: 'down'
      description: this is test
    labels:
      severity: 'critical'

在告警规则文件中，我们可以将一组相关的规则设置定义在一个group下。在每一个group中我们可以定义多个告警规则(rule)。一条告警规则主要由以下几部分组成：

alert：告警规则的名称。
expr：基于PromQL表达式告警触发条件，用于计算是否有时间序列满足该条件。

内存使用率：(node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80

磁盘使用率：(node_filesystem_size_bytes{fstype="xfs",mountpoint="/"} - node_filesystem_free_bytes{fstype="xfs",mountpoint="/"}) / node_filesystem_size_bytes{fstype="xfs",mountpoint="/"} * 100 > 80

cpu5分钟负载：avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

cpu 使用率： 100 - ((avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100)
for：评估等待时间，可选参数。用于表示只有当触发条件持续一段时间后才发送告警。在等待期间新产生告警的状态为pending。
labels：自定义标签，允许用户指定要附加到告警上的一组附加标签。
annotations：用于指定一组附加信息，比如用于描述告警详细信息的文字等，annotations的内容在告警产生时会一同作为参数发送到Alertmanager。

7，启动Prometheus

[root@pro01 prometheus]# ./prometheus --config.file=./alter-config/prometheus.yml 
level=info ts=2021-03-01T12:55:35.388Z caller=main.go:366 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2021-03-01T12:55:35.388Z caller=main.go:404 msg="Starting Prometheus" version="(version=2.25.0, branch=HEAD, revision=a6be548dbc17780d562a39c0e4bd0bd4c00ad6e2)"
level=info ts=2021-03-01T12:55:35.388Z caller=main.go:409 build_context="(go=go1.15.8, user=root@615f028225c9, date=20210217-14:17:24)"
level=info ts=2021-03-01T12:55:35.388Z caller=main.go:410 host_details="(Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 pro01 (none))"
level=info ts=2021-03-01T12:55:35.388Z caller=main.go:411 fd_limits="(soft=1024, hard=4096)"
level=info ts=2021-03-01T12:55:35.388Z caller=main.go:412 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-03-01T12:55:35.393Z caller=web.go:532 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-03-01T12:55:35.395Z caller=main.go:779 msg="Starting TSDB ..."

7，通过web查看

十：设置钉钉机器人告警

1，首先需要创建一个钉钉项目组添加自定义钉钉机器人获取Webhook和secret

2，安装prometheus-webhook-dingtalk

下载dingtalk

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

解压后修改配置文件config.yml

## Request timeout
# timeout: 5s

## Customizable templates path
# templates:
#   - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
default_message: #告警模板 可网上下载
   title: '{{ template "legacy.title" . }}'
   text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=c74358f6bb9da11cb9f119980f8945de3a0838d0480eb8db085412f2f754eb4a  #通过钉钉机器人上面获取
    # secret for signature
    secret: SEC47b20c9b8a1cce4fe542ceb3352ee3420ca6b0eed67736e75e3ac6c96c0573ef

3，启动dingtalk

[root@pro01 prometheus-webhook-dingtalk]# ./prometheus-webhook-dingtalk 
level=info ts=2021-03-02T03:04:40.903Z caller=main.go:62 msg="Starting prometheus-webhook-dingtalk" version="(version=1.4.0, branch=HEAD, revision=02fe8265a98ab4caaa78ebbed209d3f06b87b4a6)"
level=info ts=2021-03-02T03:04:40.903Z caller=main.go:63 msg="Build context" (gogo1.13.5,userroot@eb9f8d8f0437,date20191211-03:00:38)=(MISSING)
level=info ts=2021-03-02T03:04:40.903Z caller=coordinator.go:89 component=configuration file=config.yml msg="Loading configuration file"
level=info ts=2021-03-02T03:04:40.905Z caller=coordinator.go:97 component=configuration file=config.yml msg="Completed loading of configuration file"
level=info ts=2021-03-02T03:04:40.905Z caller=main.go:117 component=configuration msg="Loading templates" templates=
ts=2021-03-02T03:04:40.907Z caller=main.go:133 component=configuration msg="Webhook urls for prometheus alertmanager" urls="http://localhost:8060/dingtalk/webhook_mention_users/send http://localhost:8060/dingtalk/webhook1/send http://localhost:8060/dingtalk/webhook2/send http://localhost:8060/dingtalk/webhook_legacy/send http://localhost:8060/dingtalk/webhook_mention_all/send"
level=info ts=2021-03-02T03:04:40.908Z caller=web.go:210 component=web msg="Start listening for connections" address=:8060

4，编辑dingtalk的alertmanager配置文件alertmanager.yml

[root@pro01 alertmanager]# pwd
/usr/local/alertmanager
[root@pro01 alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1h
  receiver: 'dingding.webhook1'
  routes:
  - receiver: 'dingding.webhook1'  #webhook1名字要对应dingtak配置文件的targets值
    match:
      team: devops
    group_wait: 10s
    group_interval: 15s
    repeat_interval: 3h
receivers:
- name: 'dingding.webhook1'
  webhook_configs:
  - url: 'http://192.168.109.116:8060/dingtalk/webhook1/send'  #请求地址
    send_resolved: true

5，启动alertmanager

[root@pro01 alertmanager]# ./alertmanager 
level=info ts=2021-03-02T03:04:58.367Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-03-02T03:04:58.367Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-03-02T03:04:58.369Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=192.168.109.116 port=9094
level=info ts=2021-03-02T03:04:58.375Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2021-03-02T03:04:58.428Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=alertmanager.yml
level=info ts=2021-03-02T03:04:58.432Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=alertmanager.yml
level=info ts=2021-03-02T03:04:58.436Z caller=main.go:485 msg=Listening address=:9093

6，验证 kill一个node节点的node_exporter进程，查看dingtalk日志

level=info ts=2021-03-02T03:07:16.459Z caller=entry.go:22 component=web http_scheme=http http_proto=HTTP/1.1 http_method=POST remote_addr=192.168.109.116:55402 user_agent=Alertmanager/0.21.0 uri=http://192.168.109.116:8060/dingtalk/webhook1/send resp_status=200 resp_bytes_length=2 resp_elapsed_ms=285.121773 msg="request complete"

查看钉钉机器人告警输出

告警恢复

Prometheus

环境

prometheus环境

node_exporter prometheus exporter环境

mysqld_exporter prometheus exporter环境

prometheus环境linux

prometheus grafana环境linux

prometheus grafana环境k8s

prometheus

prometheus-operator

prometheus实战

prometheus-alertmanager