Integrating Prometheus with Alertmanager · Kubernetes

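[TOC]

![](https://img.kancloud.cn/cf/6e/cf6e6d8b2b5d53cdb52fd8ea79ca9b4b_1036x148.png)

To quote the official documentation: **we recommend continuing to deploy rules inside the relevant local Prometheus servers.** If you do need to install the rule component, see the [Rule article](Ruler.md).

This article walks through the **main steps** for connecting Prometheus to Alertmanager to set up alerts and notifications:

- Install and configure Alertmanager
- Connect Prometheus to Alertmanager
- Create alerting rules in Prometheus

## Install and configure Alertmanager

1. Install Alertmanager

Follow the [previous chapter](alertmanager.md).

2. Configure Alertmanager email alerts

```yaml
global:
  # SMTP settings
  smtp_from: 'ecloudz@126.com'
  smtp_smarthost: 'smtp.126.com:25'
  smtp_auth_username: 'ecloudz@126.com'
  smtp_auth_password: 'FHWBDWBEUMQExxxx'  # mailbox authorization code
route:
  # The root route must name a receiver. The text below refers to a receiver
  # called "default" that the original listing does not show, so it is assumed here.
  receiver: default
  # After a new alert group is created, wait at least group_wait before sending
  # the first notification, so that several alerts for the same group can be
  # collected and sent together.
  group_wait: 1m
  # How long to wait before re-sending a notification that was already sent successfully
  repeat_interval: 4h
  # Interval between notifications for the same group
  group_interval: 15m
  # Grouping keys; these correspond to the labels of the Prometheus alerting rules
  group_by: ["cluster", "team"]
  # Child routes
  # Alerts carrying the label team=hosts (passed in from Prometheus) are sent by email.
  # Alerts that match no child route go to the default receiver.
  routes:
    - receiver: email
      matchers:
        - team = hosts
receivers:
  - name: default  # assumed placeholder receiver, see the note on the root route
  - name: email
    email_configs:
      - to: "jiaxzeng@126.com"  # recipient address
        html: '{{ template "email.to.html" . }}'  # email body
        headers: { Subject: '{{ if eq .Status "firing" }}[Alert firing]{{ else if eq .Status "resolved" }}[Alert resolved]{{ end }} {{ .CommonLabels.alertname }}' }  # email subject
        send_resolved: true  # also notify when an alert is resolved
templates:
  - "/data/alertmanager/email.tmpl"  # template path
```

3. Add the template

```shell
# The heredoc delimiter is quoted so the shell passes the template through unchanged
cat <<-'EOF' | sudo tee /data/alertmanager/email.tmpl > /dev/null
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts.Firing }}
=========start==========
Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
=========end==========
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts.Resolved }}
=========start==========
Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
=========end==========
{{ end }}{{ end -}}
{{- end }}
EOF
```

> - The name in the `define` on the first line must match the template name referenced by `receivers.email_configs.html` in the Alertmanager configuration; otherwise the alert email body is empty.
> - Each block iterates only over its own subset (`.Alerts.Firing` or `.Alerts.Resolved`), so a group that contains both firing and resolved alerts does not list every alert twice.
> - `Add 28800e9` shifts the UTC timestamp by 8 hours (28800 seconds expressed in nanoseconds) so that times are displayed in UTC+8.

4. Check that the configuration file is valid

```shell
$ amtool check-config /data/alertmanager/alertmanager.yml
Checking '/data/alertmanager/alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 0 inhibit rules
 - 2 receivers
 - 1 templates
  SUCCESS
```

5. Hot-reload Alertmanager

```shell
systemctl reload alertmanager
```

## Connect Prometheus to Alertmanager

```yaml
alerting:
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
  alertmanagers:
    - path_prefix: "/alertmanager"
      static_configs:
        - targets:
            - "192.168.31.103:9093"
```

> Note the following three points:
> - Every Prometheus node needs this configuration.
> - `alert_relabel_configs` is set because Prometheus adds an extra `replica` label. If that label is not dropped before alerts are sent, the same alert from different replicas arrives with different label sets, Alertmanager cannot deduplicate it, and duplicate alert emails are sent.
> - `path_prefix` is set because Alertmanager is served under a sub-path. If yours is not, omit this line.

## Create alerting rules in Prometheus

1. Configure the alerting rule path in Prometheus

```yaml
rule_files:
  - "rules/*.yml"
```

2. Create the alerting rules

```shell
sudo mkdir -p /data/prometheus/rules
# Quote the heredoc delimiter so the shell leaves $labels and $value untouched
cat <<-'EOF' | sudo tee /data/prometheus/rules/hosts.yml > /dev/null
groups:
- name: hosts
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      team: hosts
    annotations:
      summary: "Node memory usage is too high"
      description: "Memory usage on node {{$labels.instance}} is above 80% (current value: {{ $value }})"
  - alert: NodeCpuUsage
    expr: (1 - (sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total[1m])) by(instance))) * 100 > 80
    for: 1m
    labels:
      team: hosts
    annotations:
      summary: "Node CPU usage is too high"
      description: "CPU usage on node {{$labels.instance}} over the last minute is above 80% (current value: {{ $value }})"
  - alert: NodeDiskUsage
    expr: ((node_filesystem_size_bytes{fstype !~ "tmpfs|rootfs"} - node_filesystem_free_bytes{fstype !~ "tmpfs|rootfs"}) / node_filesystem_size_bytes{fstype !~ "tmpfs|rootfs"})*100 > 80
    for: 1m
    labels:
      team: hosts
    annotations:
      summary: "Node disk partition usage is too high"
      description: "Partition {{$labels.mountpoint}} on node {{$labels.instance}} is above 80% (current value: {{ $value }})"
EOF
```

## Hot-reload the alerting rules

> The commands in this section and the next use the Thanos Rule paths and unit name from the [Rule article](Ruler.md). For the rules created above under `/data/prometheus/rules`, substitute that path and the unit name of your Prometheus service.

```shell
promtool check rules /data/thanos/rule/rules/hosts.yml
sudo systemctl reload thanos-rule.service
```

## Sync the files to the other nodes

```shell
# Copy the rules directory
scp -r /data/thanos/rule/rules ops@k8s-master02:/data/thanos/rule

# Validate the rule file
ssh ops@k8s-master02 "promtool check rules /data/thanos/rule/rules/hosts.yml"

# Hot-reload the configuration
ssh ops@k8s-master02 "sudo systemctl reload thanos-rule.service"
```

## Verification

If Prometheus does not expose a reachable web address, verify through the API instead:

```shell
# Names of the loaded alerting rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'

# Alerts that are currently firing
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[].labels'
```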
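
One more check worth running before reloading Alertmanager concerns the template step: the email body is empty when the name in `define` does not match the name referenced by `html`. The following is a minimal sketch of such a check, not part of the original setup. The function name `check_template_name` is hypothetical, and it assumes the config references templates in the single-line form `{{ template "NAME" . }}` and the template file declares them as `{{ define "NAME" }}` with a single space after `define`.

```shell
# Hypothetical helper: report whether every template referenced in the
# Alertmanager config ($1) has a matching define in the template file ($2).
# Returns 0 when all referenced names are defined, 1 otherwise.
check_template_name() {
  ctn_config="$1"
  ctn_tmpl="$2"
  ctn_missing=0
  # Collect the quoted names that follow `{{ template` in the config.
  for ctn_name in $(grep -o '{{ *template *"[^"]*"' "$ctn_config" \
                      | sed 's/.*"\([^"]*\)"$/\1/' | sort -u); do
    # Fixed-string match, so dots in the name are not treated as wildcards.
    if grep -qF "define \"$ctn_name\"" "$ctn_tmpl"; then
      echo "ok: $ctn_name"
    else
      echo "missing: $ctn_name"
      ctn_missing=1
    fi
  done
  return "$ctn_missing"
}
```

With the paths used in this article it could be chained in front of the reload, for example `check_template_name /data/alertmanager/alertmanager.yml /data/alertmanager/email.tmpl && sudo systemctl reload alertmanager`. It only compares names and does not replace `amtool check-config`.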