Alerting Rules · Prometheus Official Documentation (Chinese Translation)

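# **Alerting Rules**

Alerting rules let you define alert conditions based on Prometheus expressions and send notifications about firing alerts to an external service. Whenever the alert expression produces one or more vector elements at a given point in time, the alert counts as active for the label sets of those elements.

## **Defining Alerting Rules**

Alerting rules are configured in Prometheus in the same way as recording rules. An example rules file with an alert looks like this:

~~~
groups:
- name: example
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
~~~

The optional `for` clause makes Prometheus wait for a given duration between first encountering a new expression output vector element and counting an alert as firing for that element. In this example, Prometheus checks on every evaluation that the alert is still active, and fires it only after it has stayed active for 10 minutes. Elements that are active but not yet firing are in the pending state.

The `labels` clause specifies a set of additional labels to attach to the alert. Any existing conflicting labels are overwritten. Label values can be templated.

The `annotations` clause specifies a set of informational labels that can store longer additional information, such as an alert description or a runbook link. Annotation values can be templated.

## **Templating**

Label and annotation values can be templated using [console templates](https://prometheus.io/docs/visualization/consoles). The `$labels` variable holds the label key/value pairs of an alert instance. The configured external labels can be accessed through the `$externalLabels` variable. The `$value` variable holds the evaluated value of an alert instance.

~~~
# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
~~~

Examples:

~~~
groups:
- name: example
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
~~~

## **Inspecting Alerts During Runtime**

To manually inspect which alerts are active (pending or firing), navigate to the "Alerts" tab of your Prometheus instance. It shows the exact label sets for which each defined alert is currently active.

For pending and firing alerts, Prometheus also stores synthetic time series of the form `ALERTS{alertname="<alert name>", alertstate="<pending or firing>", <additional alert labels>}`. The sample value is set to 1 for as long as the alert is in the indicated active (pending or firing) state, and the series is marked stale once that is no longer the case.

## **Sending Alert Notifications**

Prometheus's alerting rules are good at determining what is broken right now, but they are not a complete notification solution. Another layer is needed on top of the simple alert definitions to add summarization, notification rate limiting, silencing, and alert dependencies. In the Prometheus ecosystem, the [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) takes on this role. Prometheus can therefore be configured to periodically send information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications.

Prometheus can be [configured](https://prometheus.io/docs/prometheus/latest/configuration/configuration/) to automatically discover available Alertmanager instances through its service discovery mechanisms.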
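
As a rough illustration of how the pieces above fit together, the following fragment of a Prometheus configuration file loads a rules file and points Prometheus at a single, statically configured Alertmanager. This is a minimal sketch, not a recommended setup: the file name `alert_rules.yml` and the target address `localhost:9093` are assumptions made for this example, and a real deployment might use one of the service discovery mechanisms instead of `static_configs`.

~~~
# prometheus.yml (fragment)

# Load the alerting rules. The file name is hypothetical.
rule_files:
  - "alert_rules.yml"

# Tell Prometheus where to send alert state.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Assumed address; 9093 is the default Alertmanager port.
            - "localhost:9093"
~~~

With a configuration along these lines, the rules in the file are evaluated by Prometheus, while grouping, silencing, and delivery of the resulting notifications are left to the Alertmanager.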