Prometheus

https://prometheus.io/

From metrics to insight

Power your metrics and alerting with the leading
open-source monitoring solution.

Architecture

https://juejin.cn/post/7201757033321267258

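Prometheus Server: scrapes and stores time series data.

Client Library: instruments application code. When Prometheus scrapes the instance's HTTP endpoint, the client library returns the current state of all tracked metrics to the Prometheus server.

Exporters: Prometheus supports many exporters. An exporter collects metrics from a system and exposes them for the Prometheus server to scrape; any program that provides monitoring data to the Prometheus server can be called an exporter.

Alertmanager: after receiving alerts from the Prometheus server, it deduplicates and groups them, routes them to the appropriate receiver, and sends the notification. Common receivers include email, WeChat, DingTalk, and Slack.

Grafana: monitoring dashboards that visualize the monitoring data.

Pushgateway: target hosts can push their data to the Pushgateway, and the Prometheus server then pulls it from the Pushgateway in one place.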

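As the architecture diagram shows, the Prometheus ecosystem mainly consists of the Prometheus server, exporters, Pushgateway, Alertmanager, Grafana, and the web UI. The Prometheus server itself has three parts: Retrieval, Storage, and PromQL.

Retrieval scrapes metrics from the active targets.

Storage writes the collected data to disk.

PromQL is the query language module that Prometheus provides.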

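Prometheus workflow

1) The Prometheus server periodically pulls metrics from active (up) targets. Targets are made known to the server through statically configured jobs or through service discovery; pull is the default collection mode. Data can also be pushed to a Pushgateway, which the Prometheus server then scrapes, and components that ship their own exporter can be scraped through that exporter.

2) The Prometheus server saves the collected metrics to local disk or to a database.

3) The collected metrics are stored as time series. Alerting rules configured on the server are evaluated against them, and alerts that fire are sent to Alertmanager.

4) Alertmanager sends the alerts to the configured receivers, such as email, WeChat, or DingTalk.

5) The web UI built into Prometheus accepts PromQL queries for exploring the monitoring data.

6) Grafana can use Prometheus as a data source and display the monitoring data graphically.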
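The deduplicate-and-group behaviour attributed to Alertmanager above can be pictured with a toy model. This is only a sketch under assumptions: alerts are plain label dictionaries, `group_alerts` is a hypothetical helper, and the alert names and instances are invented. It is not how Alertmanager is implemented.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then bucket the rest by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        identity = frozenset(alert.items())
        if identity in seen:
            continue  # same label set already received: deduplicate
        seen.add(identity)
        key = tuple(alert.get(label, '') for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical alerts; the second one is an exact duplicate of the first.
alerts = [
    {'alertname': 'InstanceDown', 'instance': 'web01:9100'},
    {'alertname': 'InstanceDown', 'instance': 'web01:9100'},
    {'alertname': 'InstanceDown', 'instance': 'web02:9100'},
    {'alertname': 'HighLoad', 'instance': 'web01:9100'},
]

groups = group_alerts(alerts, ['alertname'])
for key, members in groups.items():
    print(key, len(members))
```

Each group would then become one notification to a receiver, which is why grouping reduces alert noise.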

Understanding time series

https://www.prometheus.wang/promql/what-is-prometheus-metrics-and-labels.html

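In section 1.2 of the linked book, the HTTP service exposed by Node Exporter lets Prometheus collect samples of every monitored metric on the current host. For example: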
# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7890625
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 3.0703125
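Each line that does not start with # is one sample collected by Node Exporter: node_cpu and node_load1 are the metric names, the labels in braces describe the sample's characteristics and dimensions, and the floating-point number is the sample's value.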
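To make the structure of these lines concrete, here is a minimal sketch that splits a sample line into name, labels, and value. The `parse_sample` helper and its regular expressions are assumptions made for illustration; they ignore escaped quotes in label values and optional timestamps, so this is not a full parser for the exposition format.

```python
import re

# Simplified pattern: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_sample(line):
    """Return (name, labels, value), or None for comment and blank lines."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f'not a sample line: {line!r}')
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ''))
    return name, labels, float(value)

text = '''# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7890625
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 3.0703125
'''

samples = [s for s in map(parse_sample, text.splitlines()) if s is not None]
for name, labels, value in samples:
    print(name, labels, value)
```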

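Samples

Prometheus stores all collected samples as time series in an in-memory database and periodically persists them to disk. A time series is a sequence of timestamp-value pairs stored in time order, which we call a vector. Each time series is identified by its metric name and a set of labels (labelset). As shown below, a time series can be pictured as a numeric matrix with time along the horizontal axis: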
  ^
  │   . . . . . . . . . . . . . . . . .   . .   node_cpu{cpu="cpu0",mode="idle"}
  │     . . . . . . . . . . . . . . . . . . .   node_cpu{cpu="cpu0",mode="system"}
  │     . . . . . . . . . .   . . . . . . . .   node_load1{}
  │     . . . . . . . . . . . . . . . .   . .  
  v
    <------------------ time ---------------->
Each point in a time series is called a sample. A sample has three parts:

Metric: the metric name plus the labelset that describes the sample;

Timestamp: a timestamp with millisecond precision;

Value: a float64 value representing the sample.
<--------------- metric ---------------------><-timestamp -><-value->
http_request_total{status="200", method="GET"}@1434417560938 => 94355
http_request_total{status="200", method="GET"}@1434417561287 => 94334

http_request_total{status="404", method="GET"}@1434417560938 => 38473
http_request_total{status="404", method="GET"}@1434417561287 => 38544

http_request_total{status="200", method="POST"}@1434417560938 => 4748
http_request_total{status="200", method="POST"}@1434417561287 => 4785
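One way to picture this model is a mapping from series identity to a list of (timestamp, value) pairs. The sketch below is only an illustration: `series_key`, `append`, and the `store` dictionary are hypothetical names, and Prometheus's actual storage engine is far more elaborate.

```python
from collections import defaultdict

def series_key(name, labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, frozenset(labels.items()))

store = defaultdict(list)  # series key -> list of (timestamp_ms, value)

def append(name, labels, timestamp_ms, value):
    """Add one sample to the series it belongs to."""
    store[series_key(name, labels)].append((timestamp_ms, float(value)))

# Samples taken from the listing above.
append('http_request_total', {'status': '200', 'method': 'GET'}, 1434417560938, 94355)
append('http_request_total', {'status': '200', 'method': 'GET'}, 1434417561287, 94334)
append('http_request_total', {'status': '404', 'method': 'GET'}, 1434417560938, 38473)
append('http_request_total', {'status': '200', 'method': 'POST'}, 1434417560938, 4748)

print(len(store), 'series')
print(store[series_key('http_request_total', {'method': 'GET', 'status': '200'})])
```

Label order does not matter for identity, which is why the key uses a frozenset of label pairs.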

Label and relabel

https://grafana.com/blog/2022/03/21/how-relabeling-in-prometheus-works/#available-actions

Prometheus labels

Labels are sets of key-value pairs that allow us to characterize and organize what’s actually being measured in a Prometheus metric.

For example, when measuring HTTP latency, we might use labels to record the HTTP method and status returned, which endpoint was called, and which server was responsible for the request.

Each unique combination of label key-value pairs is stored as a separate time series in Prometheus, so labels determine the data's cardinality, and labels with unbounded sets of values should be avoided.
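A rough way to see why unbounded labels hurt: the number of possible series for one metric is at most the product of the number of distinct values per label. The sketch below assumes made-up label values and a hypothetical `series_upper_bound` helper.

```python
from math import prod

def series_upper_bound(label_values):
    """Upper bound on series count: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())

# Hypothetical, bounded label values for an HTTP latency metric.
bounded = {
    'method': ['GET', 'POST', 'PUT', 'DELETE'],
    'status': ['200', '404', '500'],
    'endpoint': ['/api/comments', '/api/users'],
}

# Adding a label such as a user ID multiplies the bound by the number of users.
unbounded = dict(bounded, user_id=[str(i) for i in range(10_000)])

print(series_upper_bound(bounded))
print(series_upper_bound(unbounded))
```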

Internal labels

But what about metrics with no labels? Prometheus also provides some internal labels for us. These begin with two underscores and are removed after all relabeling steps are applied, so they will not be available afterwards unless we explicitly configure relabeling to keep them.

Some of these special labels available to us are

Label name	Description
__name__	The scraped metric’s name
__address__	host:port of the scrape target
__scheme__	URI scheme of the scrape target
__metrics_path__	Metrics endpoint of the scrape target
__param_<name>	The value of the first URL parameter <name> passed to the target
__scrape_interval__	The target’s scrape interval (experimental)
__scrape_timeout__	The target’s timeout (experimental)
__meta_	Prefix of special labels set by the Service Discovery mechanism
__tmp	Special prefix used to temporarily store label values before discarding them

So now that we understand what the input is for the various relabel_config rules, how do we create one? And what can they actually be used for?

The base <relabel_config> block

A <relabel_config> consists of seven fields. These are:

source_labels

separator (default = ;)

target_label

regex (default = (.*))

modulus

replacement (default = $1)

action (default = replace)

A Prometheus configuration may contain an array of relabeling steps; they are applied to the label set in the order they’re defined in. Omitted fields take on their default value, so these steps will usually be shorter.

source_labels and separator

Let’s start off with source_labels. It expects an array of one or more label names, which are used to select the respective label values. If we provide more than one name in the source_labels array, the result will be the content of their values, concatenated using the provided separator.

As an example, consider the following two metrics
my_custom_counter_total{server="webserver01",subsystem="kata"} 192  1644075044000
my_custom_counter_total{server="sqldatabase",subsystem="kata"} 147  1644075044000
The following relabel_config
source_labels: [subsystem, server]
separator: "@"
would extract these values.
kata@webserver01
kata@sqldatabase
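The concatenation step can be sketched in a few lines. `join_source_labels` is a hypothetical helper written for this note, not a Prometheus API; it treats a label that is absent from the series as an empty string and defaults the separator to `;`, matching the default listed above.

```python
def join_source_labels(labels, source_labels, separator=';'):
    """Concatenate the values of the selected labels, in the order given."""
    # A label that is absent from the series contributes an empty string.
    return separator.join(labels.get(name, '') for name in source_labels)

series = [
    {'__name__': 'my_custom_counter_total', 'server': 'webserver01', 'subsystem': 'kata'},
    {'__name__': 'my_custom_counter_total', 'server': 'sqldatabase', 'subsystem': 'kata'},
]

for labels in series:
    print(join_source_labels(labels, ['subsystem', 'server'], separator='@'))
```

The joined string is what the `regex` field is then matched against in later stages of the relabeling step.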

PromQL

https://prometheus.io/docs/prometheus/latest/querying/examples/

Simple time series selection

Return all time series with the metric http_requests_total:
http_requests_total
Return all time series with the metric http_requests_total and the given job and handler labels:
http_requests_total{job="apiserver", handler="/api/comments"}
Return a whole range of time (in this case 5 minutes up to the query time) for the same vector, making it a range vector:
http_requests_total{job="apiserver", handler="/api/comments"}[5m]
Note that an expression resulting in a range vector cannot be graphed directly, but it can be viewed in the tabular ("Console") view of the expression browser.
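Outside the expression browser, the same expressions can be sent to the HTTP API's `/api/v1/query` endpoint. The sketch below only builds the request URL and makes no network call; the base URL `http://localhost:9090` is an assumption (a local server on the default port), and `instant_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlparse, parse_qs

PROMETHEUS_URL = 'http://localhost:9090'  # assumption: local server, default port

def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"

expr = 'http_requests_total{job="apiserver", handler="/api/comments"}[5m]'
url = instant_query_url(expr)
print(url)

# Round-trip check: decoding the query string gives back the original expression.
decoded = parse_qs(urlparse(url).query)['query'][0]
print(decoded == expr)
```

URL-encoding matters here because PromQL selectors contain braces, quotes, and brackets that are not safe in a raw query string.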

https://prometheus.io/docs/guides/node-exporter/

Simple Demo

https://github.com/fanqingsong/docker-prometheus

Prometheus Monitoring

This repository contains minimal Prometheus Server, NodeExporter, BlackBoxExporter, AlertManager and Grafana implementation for monitoring various services. You can use this repository to monitor a bare-metal Linux instance or to monitor Apache, NGINX or other HTTP based services using Prometheus.

Monitoring a Bare-Metal Linux Server

To monitor a stand-alone Linux server, check out the v1.0 tag of the repository, which contains all the configuration for monitoring a stand-alone Linux server. Just run docker-compose up -d and you're good to go. (With tag v1.0 you have to map alerts manually.)

Monitoring HTTP-based Web Services

The v1.1 tag of the repository monitors 2 HTTP-based web services by default: an Apache httpd server and an NGINX server, both running in Docker containers. If either or both of them go down, Prometheus will fire alerts, delivered as emails as specified in the config.yml file in the AlertManager folder.

https://github.com/prometheus/blackbox_exporter

Checking the results

Visiting http://localhost:9115/probe?target=google.com&module=http_2xx will return metrics for an HTTP probe against google.com. The probe_success metric indicates if the probe succeeded. Adding a debug=true parameter will return debug information for that probe.
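Checking probe_success by script could look like the sketch below. It works on a response body passed in as text, so it makes no network call; `probe_succeeded` is a hypothetical helper and the example body is invented and trimmed to two metrics.

```python
def probe_succeeded(metrics_text):
    """Return True if the probe_success sample in the text has value 1."""
    for line in metrics_text.splitlines():
        fields = line.split()
        # Sample lines have exactly two fields here: metric name and value.
        if len(fields) == 2 and fields[0] == 'probe_success':
            return float(fields[1]) == 1.0
    raise ValueError('no probe_success sample found')

# Hypothetical response body for illustration only.
example_body = '''# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.123
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
'''

print(probe_succeeded(example_body))
```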

https://www.cnblogs.com/cyleon/p/12876897.html

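HTTP probes: define request headers; check the HTTP status, HTTP response headers, and HTTP body content
TCP probes: monitor the port status of service components; define and monitor application-layer protocols
ICMP probes: host liveness checks
POST probes: API connectivity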

https://github.com/prometheus/node_exporter

If you are new to Prometheus and node_exporter there is a simple step-by-step guide.

The node_exporter listens on HTTP port 9100 by default. See the --help output for more options.

Custom data exporter

https://github.com/prometheus/client_python#counter

from prometheus_client import start_http_server, Summary
import random
import time

# Create a metric to track time spent and requests made.
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')

# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

if __name__ == '__main__':
    # Start up the server to expose the metrics.
    start_http_server(8000)
    # Generate some requests.
    while True:
        process_request(random.random())