[TOC]
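The previous chapter covered the AlertManager API. In this section we use Postman to play the role of Prometheus and send alerts to AlertManager, then receive the notifications that AlertManager emits with a program we wrote ourselves.
The experiments here verify how grouping works in AlertManager, and the effect of its three configuration parameters: `group_wait`, `group_interval`, and `repeat_interval`.
## **Starting the custom program**
Start our own program, which listens on port 10000 and exposes a `POST /webhook` API to receive notifications from AlertManager. The source code is in the appendix.
## **The Group mechanism**
Configure AlertManager as follows, then start it: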
```
route:
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'webhook'
receivers:
- name: webhook
  webhook_configs:
  - url: http://192.168.2.101:10000/webhook
    send_resolved: true
```
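Next, use Postman to call the AlertManager API `POST /api/v2/alerts` and send an alert with the request body below. Set the timestamps carefully: `StartsAt` may be a time in the past, while `EndsAt` should be an hour or more after the moment you send the request.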
```
[
  {
    "Labels": {
      "alertname": "NodeCpuPressure",
      "IP": "192.168.2.101"
    },
    "Annotations": {
      "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
    },
    "StartsAt": "2020-02-17T23:00:00.000+08:00",
    "EndsAt": "2020-02-18T23:00:00.000+08:00"
  }
]
```
Then we query the alerts (`GET /api/v2/alerts`) and the groups (`GET /api/v2/alerts/groups`) through the AlertManager API, either directly in a browser or with curl on the command line.
The alerts query returns:
```
[
  {
    "annotations": {
      "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
    },
    "endsAt": "2020-02-18T23:00:00.000+08:00",
    "fingerprint": "27e1a08813b1ec3b",
    "receivers": [
      {
        "name": "webhook"
      }
    ],
    "startsAt": "2020-02-17T23:00:00.000+08:00",
    "status": {
      "inhibitedBy": [],
      "silencedBy": [],
      "state": "active"
    },
    "updatedAt": "2020-02-17T23:38:38.610+08:00",
    "labels": {
      "IP": "192.168.2.101",
      "alertname": "NodeCpuPressure"
    }
  }
]
```
The groups query returns:
```
[
  {
    "alerts": [
      {
        "annotations": {
          "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "endsAt": "2020-02-18T23:00:00.000+08:00",
        "fingerprint": "27e1a08813b1ec3b",
        "receivers": [
          {
            "name": "webhook"
          }
        ],
        "startsAt": "2020-02-17T23:00:00.000+08:00",
        "status": {
          "inhibitedBy": [],
          "silencedBy": [],
          "state": "active"
        },
        "updatedAt": "2020-02-17T23:38:38.610+08:00",
        "labels": {
          "IP": "192.168.2.101",
          "alertname": "NodeCpuPressure"
        }
      }
    ],
    "labels": {
      "alertname": "NodeCpuPressure"
    },
    "receiver": {
      "name": "webhook"
    }
  }
]
```
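We can see that AlertManager automatically created a group with the labels `{alertname=NodeCpuPressure}`, and that it contains the alert we just sent.
Next, we send another alert with the following content: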
```
[
  {
    "Labels": {
      "alertname": "NodeMemoryPressure",
      "IP": "192.168.2.101"
    },
    "Annotations": {
      "summary": "NodeMemoryPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
    },
    "StartsAt": "2020-02-17T23:00:00.000+08:00",
    "EndsAt": "2020-02-18T23:00:00.000+08:00"
  }
]
```
Querying the groups again gives the result below, which shows that a second group was created, with the labels `{alertname=NodeMemoryPressure}`:
```
[
  {
    "alerts": [
      {
        "annotations": {
          "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "endsAt": "2020-02-18T23:00:00.000+08:00",
        "fingerprint": "27e1a08813b1ec3b",
        "receivers": [
          {
            "name": "webhook"
          }
        ],
        "startsAt": "2020-02-17T23:00:00.000+08:00",
        "status": {
          "inhibitedBy": [],
          "silencedBy": [],
          "state": "active"
        },
        "updatedAt": "2020-02-17T23:38:38.610+08:00",
        "labels": {
          "IP": "192.168.2.101",
          "alertname": "NodeCpuPressure"
        }
      }
    ],
    "labels": {
      "alertname": "NodeCpuPressure"
    },
    "receiver": {
      "name": "webhook"
    }
  },
  {
    "alerts": [
      {
        "annotations": {
          "summary": "NodeMemoryPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "endsAt": "2020-02-18T23:00:00.000+08:00",
        "fingerprint": "1a354c7333c5b062",
        "receivers": [
          {
            "name": "webhook"
          }
        ],
        "startsAt": "2020-02-17T23:00:00.000+08:00",
        "status": {
          "inhibitedBy": [],
          "silencedBy": [],
          "state": "active"
        },
        "updatedAt": "2020-02-17T23:41:27.790+08:00",
        "labels": {
          "IP": "192.168.2.101",
          "alertname": "NodeMemoryPressure"
        }
      }
    ],
    "labels": {
      "alertname": "NodeMemoryPressure"
    },
    "receiver": {
      "name": "webhook"
    }
  }
]
```
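The grouping seen in these results can be modeled in a few lines. The sketch below is a simplified illustration, not AlertManager's actual implementation: the hypothetical `groupKey` function builds a key from the `group_by` labels only, so alerts that differ in other labels (such as `IP`) still land in the same group.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey builds a key from the values of the group_by labels only.
// Labels that are not listed in groupBy do not influence the key.
func groupKey(labels map[string]string, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+labels[name])
	}
	sort.Strings(parts)
	return "{" + strings.Join(parts, ",") + "}"
}

func main() {
	groupBy := []string{"alertname"} // same as group_by in the config above
	alerts := []map[string]string{
		{"alertname": "NodeCpuPressure", "IP": "192.168.2.101"},
		{"alertname": "NodeMemoryPressure", "IP": "192.168.2.101"},
		{"alertname": "NodeCpuPressure", "IP": "192.168.2.102"},
	}

	// Collect the IPs of all alerts that share a group key.
	groups := map[string][]string{}
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a["IP"])
	}

	// Print the groups in a stable order.
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, groups[k])
	}
}
```

With `group_by: ["alertname"]`, the two `NodeCpuPressure` alerts share one key, while `NodeMemoryPressure` gets a group of its own.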
Now we send the following "resolved" alert, that is, the same alert with `EndsAt` set to a time in the past:
```
[
  {
    "Labels": {
      "alertname": "NodeCpuPressure",
      "IP": "192.168.2.101"
    },
    "Annotations": {
      "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
    },
    "StartsAt": "2020-02-17T23:00:00.000+08:00",
    "EndsAt": "2020-02-17T23:01:00.000+08:00"
  }
]
```
Checking the alerts and groups again, only one of each is left:
```
[
  {
    "alerts": [
      {
        "annotations": {
          "summary": "NodeMemoryPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "endsAt": "2020-02-18T23:00:00.000+08:00",
        "fingerprint": "1a354c7333c5b062",
        "receivers": [
          {
            "name": "webhook"
          }
        ],
        "startsAt": "2020-02-17T23:00:00.000+08:00",
        "status": {
          "inhibitedBy": [],
          "silencedBy": [],
          "state": "active"
        },
        "updatedAt": "2020-02-17T23:41:27.790+08:00",
        "labels": {
          "IP": "192.168.2.101",
          "alertname": "NodeMemoryPressure"
        }
      }
    ],
    "labels": {
      "alertname": "NodeMemoryPressure"
    },
    "receiver": {
      "name": "webhook"
    }
  }
]
```
## **group_wait**
Stop alertmanager, clear its data directory, and start it again with the same configuration as above. At this point alertmanager holds no alerts and no groups.
Next, send AlertManager an alert with the following content:
```
[
  {
    "Labels": {
      "alertname": "NodeCpuPressure",
      "IP": "192.168.2.101"
    },
    "Annotations": {
      "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
    },
    "StartsAt": "2020-02-17T23:00:00.000+08:00",
    "EndsAt": "2020-02-18T23:00:00.000+08:00"
  }
]
```
Then, within 30 seconds, call the API again to send the following alert:
```
[
  {
    "Labels": {
      "alertname": "NodeCpuPressure",
      "IP": "192.168.2.102"
    },
    "Annotations": {
      "summary": "NodeCpuPressure, IP: 192.168.2.102, Value: 95%, Threshold: 85%"
    },
    "StartsAt": "2020-02-17T23:00:00.000+08:00",
    "EndsAt": "2020-02-18T23:00:00.000+08:00"
  }
]
```
Then, 30 seconds after the first alert was sent, the notification arrives at our own program, with the following content:
```
```
## **Appendix**
webhook-receiver.go
```
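package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// MyHandler prints every request body it receives, together with the arrival time.
type MyHandler struct{}

func (am *MyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		fmt.Printf("read body err, %v\n", err)
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	// Print the arrival time so the delay added by group_wait can be observed.
	fmt.Println(time.Now())
	fmt.Printf("%s\n\n", string(body))
}

func main() {
	http.Handle("/webhook", &MyHandler{})
	// ListenAndServe only returns on failure, e.g. when port 10000 is already in use.
	log.Fatal(http.ListenAndServe(":10000", nil))
}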
```
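The receiver above prints the raw body. If you prefer structured output, the body can be decoded first. The sketch below is one possible approach under stated assumptions: the `webhookMessage` and `webhookAlert` types and the `decodeMessage` helper are hypothetical names that cover only a subset of the fields in AlertManager's webhook payload, and the sample JSON in `main` is hand-written for illustration, not output captured from AlertManager.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhookAlert holds a subset of the per-alert fields in the webhook payload.
type webhookAlert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

// webhookMessage holds a subset of the top-level fields in the webhook payload.
type webhookMessage struct {
	Status      string            `json:"status"`
	Receiver    string            `json:"receiver"`
	GroupLabels map[string]string `json:"groupLabels"`
	Alerts      []webhookAlert    `json:"alerts"`
}

// decodeMessage parses a webhook request body into a webhookMessage.
func decodeMessage(body []byte) (webhookMessage, error) {
	var m webhookMessage
	err := json.Unmarshal(body, &m)
	return m, err
}

func main() {
	// Hand-written sample for illustration only; NOT captured from AlertManager.
	sample := []byte(`{
  "status": "firing",
  "receiver": "webhook",
  "groupLabels": {"alertname": "NodeCpuPressure"},
  "alerts": [
    {"status": "firing", "labels": {"alertname": "NodeCpuPressure", "IP": "192.168.2.101"}},
    {"status": "firing", "labels": {"alertname": "NodeCpuPressure", "IP": "192.168.2.102"}}
  ]
}`)

	msg, err := decodeMessage(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("group %v is %s with %d alert(s)\n", msg.GroupLabels, msg.Status, len(msg.Alerts))
	for _, a := range msg.Alerts {
		fmt.Printf("  %s %s\n", a.Status, a.Labels["IP"])
	}
}
```

Inside `ServeHTTP`, `decodeMessage(body)` could be called in place of printing the raw string, which makes it easier to see how many alerts each notification carries.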