Learning Grafana (7): Alerting

Published: 2023-11-22 15:14:37 | Author: 钱塘江畔

Grafana Alerting allows you to learn about problems in your systems moments after they occur.

Monitor your incoming metrics data or log entries and set up your Alerting system to watch for specific events or circumstances and then send notifications when those things are found.

In this way, you eliminate the need for manual monitoring and provide a first line of defense against system outages or changes that could turn into major incidents.

Using Grafana Alerting, you create queries and expressions from multiple data sources — no matter where your data is stored — giving you the flexibility to combine your data and alert on your metrics and logs in new and unique ways. You can then create, manage, and take action on your alerts from a single, consolidated view, and improve your team’s ability to identify and resolve issues quickly.

Grafana Alerting is available for Grafana OSS, Grafana Enterprise, or Grafana Cloud. With Mimir and Loki alert rules you can run alert expressions closer to your data and at massive scale, all managed by the Grafana UI you are already familiar with.

Key features and benefits
One page for all alerts
A single Grafana Alerting page consolidates both Grafana-managed alerts and alerts that reside in your Prometheus-compatible data source in one single place.

Multi-dimensional alerts
Alert rules can create multiple individual alert instances per alert rule, known as multi-dimensional alerts, giving you the power and flexibility to gain visibility into your entire system with just a single alert rule. You do this by adding labels to your query to specify which component is being monitored and generate multiple alert instances for a single alert rule. For example, if you want to monitor each server in a cluster, a multi-dimensional alert will alert on each CPU, whereas a standard alert will alert on the overall server.
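
The idea of one rule producing several alert instances can be illustrated with a minimal Python sketch. This is not Grafana's implementation: the `AlertInstance` class, the `evaluate_rule` function, the sample label sets, and the 90% threshold are all hypothetical, and only the "Normal" and "Alerting" states are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertInstance:
    rule: str
    labels: tuple  # sorted (key, value) pairs that identify one dimension
    state: str     # simplified: "Alerting" or "Normal"


def evaluate_rule(rule_name, series, threshold):
    """Return one alert instance per labelled series returned by the query."""
    instances = []
    for labels, value in series:
        state = "Alerting" if value > threshold else "Normal"
        instances.append(
            AlertInstance(rule_name, tuple(sorted(labels.items())), state)
        )
    return instances


# Hypothetical query result: latest CPU usage (%) per instance/cpu label set.
cpu_usage = [
    ({"instance": "server-1", "cpu": "0"}, 95.0),
    ({"instance": "server-1", "cpu": "1"}, 40.0),
    ({"instance": "server-2", "cpu": "0"}, 91.5),
]

instances = evaluate_rule("HighCPU", cpu_usage, threshold=90.0)
firing = [dict(i.labels) for i in instances if i.state == "Alerting"]
print(firing)
```

A single rule definition yields three instances here, one per label set, and each instance changes state independently of the others.
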

Route alerts
Route each alert instance to a specific contact point based on labels you define. Notification policies are the set of rules for where, when, and how the alerts are routed to contact points.
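
Label-based routing can be sketched roughly as follows. This is a simplified illustration and not how Grafana stores or evaluates notification policies (real policies form a tree with more matcher types and options). The `Policy` class, the `route` function, and the contact point names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    contact_point: str
    matchers: dict = field(default_factory=dict)  # label name -> required value


def route(labels, policies, default_contact_point):
    """Return the contact point of the first policy whose matchers all match."""
    for policy in policies:
        if all(labels.get(k) == v for k, v in policy.matchers.items()):
            return policy.contact_point
    return default_contact_point


policies = [
    Policy("pagerduty-oncall", {"severity": "critical"}),
    Policy("slack-db-team", {"team": "database"}),
]

critical = route({"alertname": "HighCPU", "severity": "critical"}, policies, "email-default")
database = route({"alertname": "SlowQuery", "team": "database"}, policies, "email-default")
fallback = route({"alertname": "DiskFull"}, policies, "email-default")
print(critical, database, fallback)
```

In this sketch an alert instance that matches no policy falls back to a default contact point, so every instance has a destination.
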
Silence alerts
Silences stop notifications from getting created and last for only a specified window of time. Silences allow you to stop receiving persistent notifications from one or more alert rules. You can also partially pause an alert based on certain criteria. Silences have their own dedicated section for better organization and visibility, so that you can scan your paused alert rules without cluttering the main alerting view.
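
A silence can be thought of as a set of label matchers plus a fixed time window. The sketch below assumes exact-match labels only. The `Silence` class and its `suppresses` method are hypothetical names used for illustration, not part of any Grafana API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Silence:
    matchers: dict       # label name -> required value
    starts_at: datetime
    ends_at: datetime

    def suppresses(self, labels, now):
        """True if the notification for these labels should be dropped at `now`."""
        in_window = self.starts_at <= now < self.ends_at
        matches = all(labels.get(k) == v for k, v in self.matchers.items())
        return in_window and matches


start = datetime(2023, 11, 22, 15, 0, tzinfo=timezone.utc)
silence = Silence({"instance": "server-1"}, start, start + timedelta(hours=2))

during = start + timedelta(minutes=30)
after = start + timedelta(hours=3)

muted = silence.suppresses({"instance": "server-1", "alertname": "HighCPU"}, during)
other = silence.suppresses({"instance": "server-2", "alertname": "HighCPU"}, during)
expired = silence.suppresses({"instance": "server-1", "alertname": "HighCPU"}, after)
print(muted, other, expired)
```

Only the matching instance is silenced, and only until the window ends. Other instances of the same rule keep notifying, which is the "partial pause" described above.
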

Mute timings
A mute timing is a recurring interval of time when no new notifications for a policy are generated or sent. Use mute timings to suppress notifications during a specific, recurring period, for example, a regular maintenance window.
Similar to silences, mute timings do not prevent alert rules from being evaluated, nor do they stop alert instances from being shown in the user interface. They only prevent notifications from being created.
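
The difference between evaluating a rule and sending a notification can be sketched as follows, assuming a weekly maintenance window on Saturdays from 02:00 to 04:00. The `MuteTiming` class, the `process` function, and the threshold are hypothetical and only illustrate the behavior described above.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class MuteTiming:
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday
    start: time
    end: time

    def is_muted(self, now):
        """True if `now` falls inside the recurring interval."""
        return now.weekday() in self.weekdays and self.start <= now.time() < self.end


def process(value, threshold, now, mute):
    """Evaluate the rule in all cases; only the notification is suppressed."""
    firing = value > threshold
    notify = firing and not mute.is_muted(now)
    return firing, notify


maintenance = MuteTiming(frozenset({5}), time(2, 0), time(4, 0))

saturday = process(95.0, 90.0, datetime(2023, 11, 25, 3, 0), maintenance)
monday = process(95.0, 90.0, datetime(2023, 11, 27, 3, 0), maintenance)
print(saturday, monday)
```

In both cases the rule is evaluated and the instance is firing. During the Saturday window only the notification is withheld.
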

Design your Alerting system
Monitoring complex IT systems and understanding whether everything is up and running correctly is a difficult task. Setting up an effective alert management system is therefore essential to inform you when things are going wrong before they start to impact your business outcomes.
Designing and configuring an alert management setup that works takes time.
Here are some tips on how to create an effective alert management setup for your business:

Which are the key metrics for your business that you want to monitor and alert on?

  • Find events that are important to know about, but not so trivial or frequent that recipients ignore them.
  • Create alerts only for significant events that require immediate attention or intervention.
  • Prioritize quality over quantity.

Which type of Alerting do you want to use?

  • Choose Grafana-managed alerting, Grafana Mimir or Loki-managed alerting, or both.

How do you want to organize your alerts and notifications?

  • Be selective about who receives alerts. Consider sending them to whoever is on call or to a specific Slack channel.
  • Automate as far as possible using the Alerting API or alerts as code (Terraform).

How can you reduce alert fatigue?

  • Avoid noisy, unnecessary alerts by using silences, mute timings, or pausing alert rule evaluation.
  • Continually review and tune your alert rules for effectiveness. Remove duplicate or ineffective alert rules.
  • Think carefully about priority and severity levels.
  • Continually review your thresholds and evaluation rules.
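
As one possible starting point for the automation tip above, the following Python sketch builds a request that lists alert rules over HTTP. It assumes the alerting provisioning endpoint `/api/v1/provisioning/alert-rules` and a service account token sent as a Bearer header. Verify both against the documentation for your Grafana version. The base URL, the token, and the function names are hypothetical.

```python
import json
import urllib.request


def build_list_rules_request(base_url, token):
    """Build (but do not send) a GET request for the provisioned alert rules."""
    # Assumed endpoint path; check it against your Grafana version.
    url = base_url.rstrip("/") + "/api/v1/provisioning/alert-rules"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
        method="GET",
    )


def list_rule_titles(base_url, token):
    """Send the request and return rule titles (needs a reachable Grafana)."""
    with urllib.request.urlopen(build_list_rules_request(base_url, token)) as resp:
        rules = json.load(resp)
    # Assumes the response is a JSON array of rule objects with a "title" field.
    return [rule.get("title") for rule in rules]


request = build_list_rules_request("http://localhost:3000/", "hypothetical-token")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it lets you inspect the URL and headers without a running Grafana instance. `list_rule_titles` performs the actual call.
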
