Grafana Study Notes (8): Introduction to Alerting

Published: 2023-11-22 15:44:34 | Author: 钱塘江畔
Whether you’re just starting out or you’re a more experienced user of Grafana Alerting, this introduction covers the fundamentals and available features that help you create, manage, and respond to alerts, and that improve your team’s ability to resolve issues quickly.

Principles
In Prometheus-based alerting systems, you have an alert generator that creates alerts and an alert receiver that receives alerts. For example, Prometheus is an alert generator and is responsible for evaluating alert rules, while Alertmanager is an alert receiver and is responsible for grouping, inhibiting, silencing, and sending notifications about firing and resolved alerts.

Grafana Alerting is built on the Prometheus model of designing alerting systems. It has an internal alert generator responsible for scheduling and evaluating alert rules, as well as an internal alert receiver responsible for grouping, inhibiting, silencing, and sending notifications. Grafana doesn’t use Prometheus as its alert generator because Grafana Alerting needs to work with many other data sources in addition to Prometheus. However, it does use Alertmanager as its alert receiver.

Alerts are sent to the alert receiver, where they are routed, grouped, inhibited, and silenced before notifications are sent. In Grafana Alerting, the default alert receiver is the Alertmanager embedded inside Grafana, referred to as the Grafana Alertmanager. You can also use other Alertmanagers, which are referred to as External Alertmanagers.

The following diagram gives you an overview of Grafana Alerting and introduces you to some of the fundamental features that are the principles of how Grafana Alerting works.

Fundamentals
Alert rules
An alert rule is a set of criteria that determine when an alert should fire. It consists of one or more queries and expressions, a condition which needs to be met, an interval which determines how often the alert rule is evaluated, and a duration over which the condition must be met for an alert to fire.

Alert rules are evaluated once per interval, and each alert rule can have zero, one, or any number of alerts firing at a time. The state of the alert rule is determined by its most “severe” alert, and can be one of Normal, Pending, or Firing. For example, if at least one of an alert rule’s alerts is firing, then the alert rule is also firing. The health of an alert rule is determined by the status of its most recent evaluation, and can be one of OK, Error, or NoData.

A very important feature of alert rules is that they support custom annotations and labels. These allow you to instrument alerts with additional metadata such as summaries and descriptions, and add additional labels to route alerts to specific notification policies.
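The relationship between the interval, the duration, and the rule state can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Grafana's implementation: the names `AlertInstance`, `pending_for`, and `rule_state` are hypothetical, timestamps are plain seconds, and the NoData and Error cases are ignored.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class State(IntEnum):
    # Ordered by "severity" so that max() picks the most severe state.
    NORMAL = 0
    PENDING = 1
    FIRING = 2


@dataclass
class AlertInstance:
    # Time (in seconds) at which the condition was first met, or None.
    breached_since: Optional[float] = None
    state: State = State.NORMAL

    def evaluate(self, condition_met: bool, now: float, pending_for: float) -> State:
        """Update the state after one evaluation of the rule's condition."""
        if not condition_met:
            # Condition no longer met: the alert goes back to Normal.
            self.breached_since = None
            self.state = State.NORMAL
        else:
            if self.breached_since is None:
                self.breached_since = now
            held = now - self.breached_since
            # Assumption: the alert fires once the condition has held
            # for at least the configured duration.
            self.state = State.FIRING if held >= pending_for else State.PENDING
        return self.state


def rule_state(alerts: Iterable[AlertInstance]) -> State:
    """The rule takes the state of its most severe alert."""
    return max((a.state for a in alerts), default=State.NORMAL)
```

With an evaluation interval of 60 seconds and a duration of 120 seconds, an alert whose condition is met at every evaluation would be Pending at 0 and 60 seconds and Firing from 120 seconds on, at which point `rule_state` reports the whole rule as Firing even if its other alerts are Normal.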

Alerts
Alerts are uniquely identified by sets of key/value pairs called labels. Each key is a label name and each value is a label value. For example, one alert might have the label foo=bar and another alert might have the label foo=baz. An alert can have many labels, such as foo=bar,bar=baz, but it cannot have the same label name twice, as in foo=bar,foo=baz. Two alerts cannot have the same label set either: if two alerts both have the labels foo=bar,bar=baz, then one of them will be discarded. Alerts are resolved when the condition in the alert rule is no longer met, or when the alert rule is deleted.

In Grafana Managed Alerts, alerts can be in the Normal, Pending, Alerting, NoData, or Error states. In Datasource Managed Alerts, such as those from Mimir and Loki, alerts can be Normal, Pending, or Alerting, but not NoData or Error.
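The label rules above map naturally onto a dictionary (one value per label name) and a set of label sets (one alert per label set). The following Python sketch illustrates this; the function names `identity` and `deduplicate` are hypothetical, and keeping the first of two identical alerts is an arbitrary choice made here, since the text only says that one of them is discarded.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Labels = Dict[str, str]


def identity(labels: Labels) -> FrozenSet[Tuple[str, str]]:
    """An order-independent key built from the full label set."""
    return frozenset(labels.items())


def deduplicate(alerts: List[Labels]) -> List[Labels]:
    """Keep one alert per distinct label set (here: the first one seen)."""
    seen: Set[FrozenSet[Tuple[str, str]]] = set()
    unique: List[Labels] = []
    for labels in alerts:
        key = identity(labels)
        if key in seen:
            continue  # same label set as an earlier alert: discard
        seen.add(key)
        unique.append(labels)
    return unique
```

For example, `deduplicate([{"foo": "bar", "bar": "baz"}, {"bar": "baz", "foo": "bar"}, {"foo": "baz"}])` keeps two alerts, because the first two entries carry the same label set regardless of the order in which the labels are written.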

Contact points
Contact points determine where notifications are sent. For example, you might have a contact point that sends notifications to an email address, to Slack, to an incident response and management (IRM) system such as Grafana OnCall or PagerDuty, or to a webhook.

The notifications that are sent from contact points can be customized using notification templates. You can use notification templates to change the title, message, and structure of the notification. Notification templates are not specific to individual integrations or contact points.

Notification policies
Notification policies group alerts and then route them to contact points. They determine when notifications are sent, and how often notifications should be repeated.

Alerts are matched to notification policies using label matchers. These are human-readable expressions that assert whether the alert’s labels exactly match, do not exactly match, contain, or do not contain some expected text. For example, the matcher foo=bar matches alerts with the label foo=bar, while the matcher foo=~[a-zA-Z]+ matches alerts that have a label called foo whose value matches the regular expression [a-zA-Z]+.

By default, an alert can only match one notification policy. However, with the continue feature, alerts can be made to match any number of notification policies at the same time. For more information on notification policies, see the fundamentals of Notification Policies.
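Matching and the continue feature can be illustrated with a small Python sketch. It is a simplified model rather than Grafana's routing code: policies are treated as a flat ordered list instead of a tree, the operator set (=, !=, =~, !~) and the fully anchored regular expressions follow the Prometheus and Alertmanager matcher convention, a missing label is treated as an empty value, and the names `Policy`, `continue_matching`, and `route` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (label name, operator, expected value or regular expression)
Matcher = Tuple[str, str, str]


def matches(labels: Dict[str, str], matcher: Matcher) -> bool:
    """Evaluate one label matcher against an alert's labels."""
    name, op, expected = matcher
    actual = labels.get(name, "")  # assumption: missing label == empty value
    if op == "=":
        return actual == expected
    if op == "!=":
        return actual != expected
    if op == "=~":
        return re.fullmatch(expected, actual) is not None
    if op == "!~":
        return re.fullmatch(expected, actual) is None
    raise ValueError(f"unknown operator: {op}")


@dataclass
class Policy:
    contact_point: str
    matchers: List[Matcher]
    continue_matching: bool = False  # stands in for the continue feature


def route(labels: Dict[str, str], policies: List[Policy], default: str) -> List[str]:
    """Return the contact points that should be notified for this alert."""
    selected: List[str] = []
    for policy in policies:
        if all(matches(labels, m) for m in policy.matchers):
            selected.append(policy.contact_point)
            if not policy.continue_matching:
                break  # default behaviour: stop at the first match
    # Nothing matched: fall back to the default contact point.
    return selected or [default]
```

Given a first policy `Policy("slack", [("foo", "=", "bar")], continue_matching=True)` followed by `Policy("email", [("foo", "=~", "[a-zA-Z]+")])`, an alert labelled foo=bar reaches both contact points, whereas without `continue_matching` it would stop at the first one.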

Silences and mute timings
Silences and mute timings allow you to pause notifications for specific alerts or even entire notification policies. Use a silence to pause notifications on an ad-hoc basis, such as during a maintenance window, and use mute timings to pause notifications at recurring times, such as evenings and weekends.
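The difference between the two can be sketched in Python: a silence is a one-off time window tied to a set of labels, while a mute timing is a recurring schedule. This is a minimal model under stated assumptions, not Grafana's data model: the classes `Silence` and `MuteTiming` and their fields are hypothetical, silences use only exact-match label matchers, and a mute timing is reduced to a set of weekdays plus one daily time range.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Silence:
    """One-off pause for alerts carrying specific labels."""
    matchers: Dict[str, str]  # exact-match label requirements
    starts_at: datetime
    ends_at: datetime

    def suppresses(self, labels: Dict[str, str], now: datetime) -> bool:
        in_window = self.starts_at <= now < self.ends_at
        return in_window and all(
            labels.get(name) == value for name, value in self.matchers.items()
        )


@dataclass(frozen=True)
class MuteTiming:
    """Recurring pause, e.g. every weekend."""
    weekdays: FrozenSet[int]  # Monday=0 ... Sunday=6, as in datetime.weekday()
    start: time
    end: time

    def is_muted(self, now: datetime) -> bool:
        return now.weekday() in self.weekdays and self.start <= now.time() < self.end
```

A maintenance window would be a `Silence` with fixed start and end timestamps, while "every weekend" would be a `MuteTiming` with `weekdays=frozenset({5, 6})` that applies again each week without being recreated.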

Provisioning
You can create your alerting resources (alert rules, notification policies, and so on) in the Grafana UI; through ConfigMaps, files, and configuration management systems using file-based provisioning; and in Terraform using API-based provisioning.
