[TOC]
## **Configuration**
Pod resource usage is exposed through the kubelet's `/metrics/cadvisor` endpoint, so add the following job to the Prometheus configuration file:
```
- job_name: cadvisor
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - action: replace
      target_label: __address__
      replacement: kubernetes.default.svc:443
    - action: replace
      source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
```
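The second relabel rule substitutes the node name captured by `regex: (.+)` for `${1}` in the replacement, producing a per-node cadvisor path that is scraped through the apiserver proxy. A minimal illustrative sketch of that substitution (not Prometheus code, just the string logic):

```python
import re

# Illustrative sketch: how the relabel rule maps a node name to the
# __metrics_path__ that is scraped through the apiserver proxy.
def relabel_metrics_path(node_name: str) -> str:
    pattern = r"(.+)"  # regex from the relabel rule
    template = "/api/v1/nodes/${1}/proxy/metrics/cadvisor"
    match = re.fullmatch(pattern, node_name)
    return template.replace("${1}", match.group(1))

print(relabel_metrics_path("node-1"))
# /api/v1/nodes/node-1/proxy/metrics/cadvisor
```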
## **Common PromQL for Pod Monitoring**
#### **CPU Usage**
* cAdvisor metrics
```
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed in seconds.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container_name="",cpu="total",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice",image="",name="",namespace="kube-system",pod="prometheus-5fd68f657-nrbc6"} 1.787001816 1597991130928
container_cpu_usage_seconds_total{container_name="prometheus",cpu="total",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice/docker-c2eff3f787388ec05b9b306a0c33320e9342cf2fb0db07e1e091a2e3730493a5.scope",image="sha256:61bf337f29560d2c3bc5c73168014eba58eb14fdefa2e05e78a877eae29548cd",name="k8s_prometheus_prometheus-5fd68f657-nrbc6_kube-system_f2e26cd6-32b7-450c-b522-54663184104c_0",namespace="kube-system",pod="prometheus-5fd68f657-nrbc6"} 1.782126542 1597991135141
container_cpu_usage_seconds_total{container_name="POD",cpu="total",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice/docker-edd7d354485b291e83e3c57ce29506905a99f819b04223390de3f66b83f9486a.scope",image="k8s.gcr.io/pause:3.1",name="k8s_POD_prometheus-5fd68f657-nrbc6_kube-system_f2e26cd6-32b7-450c-b522-54663184104c_0",namespace="kube-system",pod="prometheus-5fd68f657-nrbc6"} 0.022122064 1597991141008
```
* PromQL
```
# CPU usage of the whole pod
irate(container_cpu_usage_seconds_total{container_name="",pod_name="prometheus-xxxxx-xxxx",namespace="kube-system"}[2m])
# CPU usage of the pause container in the pod
irate(container_cpu_usage_seconds_total{container_name="POD",pod_name="prometheus-xxxxx-xxxx",namespace="kube-system"}[2m])
# CPU usage of the prometheus container in the pod
irate(container_cpu_usage_seconds_total{container_name="prometheus",pod_name="prometheus-xxxxx-xxxx",namespace="kube-system"}[2m])
```
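`irate()` looks at the last two samples in the range window and divides the counter increase by the elapsed time; on `container_cpu_usage_seconds_total` this yields CPU cores consumed per second. A minimal Python sketch, using invented sample values:

```python
# Sketch of irate() on a counter such as container_cpu_usage_seconds_total:
# the per-second rate is derived from the two most recent samples only.
def irate(samples):
    """samples: list of (timestamp_seconds, counter_value), ascending."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# The counter advanced 1.5 CPU-seconds over the last 15s:
# the container used ~0.1 cores on average.
print(irate([(1000, 1.0), (1015, 2.5), (1030, 4.0)]))  # 0.1
```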
#### **Memory Usage**
* cAdvisor metrics
```
# HELP container_memory_working_set_bytes Current working set in bytes.
# TYPE container_memory_working_set_bytes gauge
container_memory_working_set_bytes{container_name="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice",image="",name="",namespace="kube-system",pod_name="prometheus-5fd68f657-nrbc6"} 3.530752e+07 1597991130928
container_memory_working_set_bytes{container_name="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice/docker-edd7d354485b291e83e3c57ce29506905a99f819b04223390de3f66b83f9486a.scope",image="k8s.gcr.io/pause:3.1",name="k8s_POD_prometheus-5fd68f657-nrbc6_kube-system_f2e26cd6-32b7-450c-b522-54663184104c_0",namespace="kube-system",pod_name="prometheus-5fd68f657-nrbc6"} 1.130496e+06 1597991141008
container_memory_working_set_bytes{container_name="prometheus",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice/docker-c2eff3f787388ec05b9b306a0c33320e9342cf2fb0db07e1e091a2e3730493a5.scope",image="sha256:61bf337f29560d2c3bc5c73168014eba58eb14fdefa2e05e78a877eae29548cd",name="k8s_prometheus_prometheus-5fd68f657-nrbc6_kube-system_f2e26cd6-32b7-450c-b522-54663184104c_0",namespace="kube-system",pod_name="prometheus-5fd68f657-nrbc6"} 3.4177024e+07 1597991135141
# HELP container_memory_usage_bytes Current memory usage in bytes, including all memory regardless of when it was accessed
# TYPE container_memory_usage_bytes gauge
...
# HELP container_spec_memory_limit_bytes Memory limit for the container.
# TYPE container_spec_memory_limit_bytes gauge
container_spec_memory_limit_bytes{container_name="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice",image="",name="",namespace="kube-system",pod_name="prometheus-5fd68f657-nrbc6"} 0
container_spec_memory_limit_bytes{container_name="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice/docker-edd7d354485b291e83e3c57ce29506905a99f819b04223390de3f66b83f9486a.scope",image="k8s.gcr.io/pause:3.1",name="k8s_POD_prometheus-5fd68f657-nrbc6_kube-system_f2e26cd6-32b7-450c-b522-54663184104c_0",namespace="kube-system",pod_name="prometheus-5fd68f657-nrbc6"} 1.073741824e+09
container_spec_memory_limit_bytes{container_name="prometheus",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf2e26cd6_32b7_450c_b522_54663184104c.slice/docker-c2eff3f787388ec05b9b306a0c33320e9342cf2fb0db07e1e091a2e3730493a5.scope",image="sha256:61bf337f29560d2c3bc5c73168014eba58eb14fdefa2e05e78a877eae29548cd",name="k8s_prometheus_prometheus-5fd68f657-nrbc6_kube-system_f2e26cd6-32b7-450c-b522-54663184104c_0",namespace="kube-system",pod_name="prometheus-5fd68f657-nrbc6"} 1.073741824e+09
```
Note: if a container in the Pod does not set `limits.memory`, that container's `container_spec_memory_limit_bytes` is 0; the Pod-level series (the one with an empty `container_name`) is the sum of all containers' limits.
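The note above can be checked with a toy calculation (all values invented): containers without `limits.memory` contribute 0, and the pod-level series sums the per-container limits.

```python
# Invented per-container limits for a hypothetical pod: a container with no
# limits.memory reports container_spec_memory_limit_bytes = 0, and the
# pod-level series (empty container_name) is the sum over containers.
container_limits = {
    "POD": 1 * 1024**3,         # pause container, 1 GiB limit
    "prometheus": 1 * 1024**3,  # 1 GiB limit
    "sidecar": 0,               # no limits.memory set -> reported as 0
}
pod_limit_bytes = sum(container_limits.values())
print(pod_limit_bytes)  # 2147483648, i.e. 2 GiB
```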
* PromQL (excluding cache)
If a container's working set exceeds its memory limit, the container is OOM-killed.
```
# Note: set limits.memory on every container of the Pod so that container_spec_memory_limit_bytes is non-zero
# Memory usage ratio of the whole pod
container_memory_working_set_bytes{container_name="",pod_name="prometheus-xxxxx-xxxx",namespace="kube-system"} / container_spec_memory_limit_bytes
# Memory usage ratio of the pause container in the pod:
# not meaningful, since the pause container's container_spec_memory_limit_bytes is always 0
# Memory usage ratio of the prometheus container in the pod
container_memory_working_set_bytes{container_name="prometheus",pod_name="prometheus-xxxxx-xxxx",namespace="kube-system"} / container_spec_memory_limit_bytes
```
* PromQL (including cache)
```
# Note: set limits.memory on every container of the Pod so that container_spec_memory_limit_bytes is non-zero
# Memory usage ratio of the whole pod (cache included)
container_memory_usage_bytes{container_name="",pod_name="prometheus-xxxxx-xxxx",namespace="kube-system"} / container_spec_memory_limit_bytes
# Memory usage ratio of the pause container in the pod:
# not meaningful, since the pause container's container_spec_memory_limit_bytes is always 0
# Memory usage ratio of the prometheus container in the pod
container_memory_usage_bytes{container_name="prometheus",pod_name="prometheus-xxxxx-xxxx",namespace="kube-system"} / container_spec_memory_limit_bytes
```
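Plugging the sample values from the cAdvisor output above into the division gives the ratio directly; a quick hand check in Python:

```python
# Hand check of the memory-usage division, using the prometheus container's
# sample values from the cAdvisor output shown earlier.
working_set_bytes = 3.4177024e+07  # container_memory_working_set_bytes
limit_bytes = 1.073741824e+09      # container_spec_memory_limit_bytes (1 GiB)
usage_ratio = working_set_bytes / limit_bytes
print(round(usage_ratio * 100, 2))  # 3.18 (percent)
```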
#### **Network Traffic**
* cAdvisor metrics
```
# HELP container_network_receive_bytes_total Cumulative count of bytes received
# TYPE container_network_receive_bytes_total counter
container_network_receive_bytes_total{container_name="POD",id="/kubepods.slice/kubepods-podb83a7b78_085b_42d2_ba6b_77088cf64743.slice/docker-5f80fcfd575cfa556ee91b213a5b95422b3c0ae3602ecea30148361f049c7338.scope",image="k8s.gcr.io/pause:3.1",interface="eth0",name="k8s_POD_prometheus-745dc86965-6kb79_kube-system_b83a7b78-085b-42d2-ba6b-77088cf64743_0",namespace="kube-system",pod="prometheus-745dc86965-6kb79"} 4.13377e+06 1597997600588
container_network_receive_bytes_total{container_name="POD",id="/kubepods.slice/kubepods-podb83a7b78_085b_42d2_ba6b_77088cf64743.slice/docker-5f80fcfd575cfa556ee91b213a5b95422b3c0ae3602ecea30148361f049c7338.scope",image="k8s.gcr.io/pause:3.1",interface="tunl0",name="k8s_POD_prometheus-745dc86965-6kb79_kube-system_b83a7b78-085b-42d2-ba6b-77088cf64743_0",namespace="kube-system",pod="prometheus-745dc86965-6kb79"} 0 1597997600588
# HELP container_network_transmit_bytes_total Cumulative count of bytes transmitted
# TYPE container_network_transmit_bytes_total counter
container_network_transmit_bytes_total{container_name="POD",id="/kubepods.slice/kubepods-podb83a7b78_085b_42d2_ba6b_77088cf64743.slice/docker-5f80fcfd575cfa556ee91b213a5b95422b3c0ae3602ecea30148361f049c7338.scope",image="k8s.gcr.io/pause:3.1",interface="eth0",name="k8s_POD_prometheus-745dc86965-6kb79_kube-system_b83a7b78-085b-42d2-ba6b-77088cf64743_0",namespace="kube-system",pod="prometheus-745dc86965-6kb79"} 897123 1597997600588
container_network_transmit_bytes_total{container_name="POD",id="/kubepods.slice/kubepods-podb83a7b78_085b_42d2_ba6b_77088cf64743.slice/docker-5f80fcfd575cfa556ee91b213a5b95422b3c0ae3602ecea30148361f049c7338.scope",image="k8s.gcr.io/pause:3.1",interface="tunl0",name="k8s_POD_prometheus-745dc86965-6kb79_kube-system_b83a7b78-085b-42d2-ba6b-77088cf64743_0",namespace="kube-system",pod="prometheus-745dc86965-6kb79"} 0 1597997600588
```
Note: our Calico network plugin runs in tunnel (IPIP) mode, which is why the prometheus Pod above exposes two series; `eth0` is the Pod's real interface.
```
container_network_receive_bytes_total{interface="eth0",container_name="POD",pod_name="prometheus-xxxx-xxx"} 4.13377e+06 1597997600588
container_network_receive_bytes_total{interface="tunl0",container_name="POD",pod_name="prometheus-xxxx-xxx"} 0 1597997600588
```
This is not specific to prometheus: every Pod (hostNetwork and non-hostNetwork alike) has a series with `interface="tunl0"` (verified). Moreover, after switching Calico from IPIP mode to direct routing, the host's tunl0 interface remains (only its routes disappear), and so does the metric (verified).
For hostNetwork Pods there are additional series, one per interface, such as:
```
container_network_receive_errors_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode94ef178_85fb_4f12_991d_51e90bed9926.slice/docker-387d09bdfe3357df9a9e153202c95d880f8b54ac1c96bd37ecbc17fa1a067505.scope",image="k8s.gcr.io/pause:3.1",interface="calidb50ed3a51d",name="k8s_POD_calico-node-p8tzb_kube-system_e94ef178-85fb-4f12-991d-51e90bed9926_0",namespace="kube-system",pod_name="calico-node-p8tzb"} 0 1598000409548
container_network_receive_errors_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode94ef178_85fb_4f12_991d_51e90bed9926.slice/docker-387d09bdfe3357df9a9e153202c95d880f8b54ac1c96bd37ecbc17fa1a067505.scope",image="k8s.gcr.io/pause:3.1",interface="calif656e9eeedc",name="k8s_POD_calico-node-p8tzb_kube-system_e94ef178-85fb-4f12-991d-51e90bed9926_0",namespace="kube-system",pod_name="calico-node-p8tzb"} 0 1598000409548
container_network_receive_errors_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode94ef178_85fb_4f12_991d_51e90bed9926.slice/docker-387d09bdfe3357df9a9e153202c95d880f8b54ac1c96bd37ecbc17fa1a067505.scope",image="k8s.gcr.io/pause:3.1",interface="ens33",name="k8s_POD_calico-node-p8tzb_kube-system_e94ef178-85fb-4f12-991d-51e90bed9926_0",namespace="kube-system",pod_name="calico-node-p8tzb"} 0 1598000409548
container_network_receive_errors_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode94ef178_85fb_4f12_991d_51e90bed9926.slice/docker-387d09bdfe3357df9a9e153202c95d880f8b54ac1c96bd37ecbc17fa1a067505.scope",image="k8s.gcr.io/pause:3.1",interface="tunl0",name="k8s_POD_calico-node-p8tzb_kube-system_e94ef178-85fb-4f12-991d-51e90bed9926_0",namespace="kube-system",pod_name="calico-node-p8tzb"} 0 1598000409548
```
A hostNetwork Pod does not report data for every host interface. For example, given the host interfaces listed below, calico-node only has the four series above, with no traffic for the `lo` or `docker0` interfaces.
```
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:5b:02:9b brd ff:ff:ff:ff:ff:ff
inet 192.168.2.102/24 brd 192.168.2.255 scope global noprefixroute ens33
valid_lft forever preferred_lft forever
inet6 fd15::1a36:bad1:b207:da5b/64 scope global deprecated noprefixroute dynamic
valid_lft 65745sec preferred_lft 0sec
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:2b:91:54:46 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
6: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
inet 172.26.14.192/32 brd 172.26.14.192 scope global tunl0
valid_lft forever preferred_lft forever
13: calif656e9eeedc@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
16: calidb50ed3a51d@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
```
* PromQL
```
# Download rate (KB/s) of a non-hostNetwork Pod
irate(container_network_receive_bytes_total{interface="eth0",pod_name="prometheus-xxxxx-xxxx",namespace="kube-system"}[2m]) / 1024
# Download rate of a hostNetwork Pod:
# with multiple interface series, no single interface can represent the Pod's network rate on a dashboard
```
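The download-rate query above is again an `irate()` over a counter, divided by 1024 for KB/s; a small sketch with invented sample values:

```python
# Sketch of the download-rate query: irate over the eth0 receive-bytes
# counter, divided by 1024 to convert bytes/s to KB/s.
def receive_rate_kb_s(samples):
    """samples: list of (timestamp_seconds, rx_bytes_counter), ascending."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0) / 1024

# The counter grew by 512 KiB over 16 seconds -> 32 KB/s.
print(receive_rate_kb_s([(100, 0), (116, 512 * 1024)]))  # 32.0
```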