[TOC]
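Beyond what was covered earlier, there is much more to monitor in Kubernetes: for example, whether each Node in the cluster is Ready, and whether the containers of the network plugin and the DNS plugin are all Running.
These metrics are exposed by an exporter called kube-state-metrics.
## **Installation**
#### **Installing kube-state-metrics**
kube-state-metrics watches the Kubernetes API; for its compatibility with Kubernetes versions, see Reference 1. Here the cluster runs Kubernetes v1.17, so we use kube-state-metrics v1.9.
The page https://github.com/kubernetes/kube-state-metrics/tree/release-1.9/examples/standard contains five files that create the required resources. Below, we have merged the contents of those files into three and made a few minor adjustments.
Create the file kube-state-metrics-rbac.yaml with the following content: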
```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
```
Create the file kube-state-metrics-deployment.yaml with the following content:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
      - image: quay.io/coreos/kube-state-metrics:v1.9.6
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
```
Create the file kube-state-metrics-service.yaml with the following content:
```
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  type: ClusterIP
  selector:
    app: kube-state-metrics
  ports:
  - name: http-metrics
    port: 8080
    targetPort: 8080
```
Then create the resources defined above:
```
$ kubectl apply -f kube-state-metrics-rbac.yaml
$ kubectl apply -f kube-state-metrics-deployment.yaml
$ kubectl apply -f kube-state-metrics-service.yaml
```
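After that, from any host in the cluster, you can view the metrics at the following URL, where `{serviceIP}` is the ClusterIP of the kube-state-metrics Service: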
```
$ curl {serviceIP}:8080/metrics
```
#### **Updating the Prometheus Configuration**
Add the following under `scrape_configs` in the Prometheus configuration file:
```
- job_name: kube-state-metrics
  static_configs:
  - targets: ['kube-state-metrics.kube-system.svc:8080']
```
Here we configure the target statically instead of using Service-based service discovery.
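Because the job above has a single static target, Prometheus always records an `up` series for it, so an outage of the exporter itself can be detected. The rule below is only a minimal sketch: the group name, alert name, `for` duration, and `severity` label are assumptions for illustration, not part of the original setup.

```yaml
groups:
- name: kube-state-metrics-availability    # hypothetical group name
  rules:
  - alert: KubeStateMetricsDown            # hypothetical alert name
    # `up` is set to 0 by Prometheus whenever a scrape of the target fails.
    expr: up{job="kube-state-metrics"} == 0
    for: 5m                                # assumed grace period
    labels:
      severity: critical                   # assumed label convention
    annotations:
      summary: "kube-state-metrics target {{ $labels.instance }} cannot be scraped"
```

Without such a rule, the alerts built on kube-state-metrics data would simply stop firing if the exporter went down.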
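## **Monitoring Node Status**
The metrics exposed by kube-state-metrics include the ones below. Each condition has one series per status (`true`, `false`, `unknown`), and the series whose status currently applies has the value 1. From these metrics we can tell whether a Node is Ready, whether it is under MemoryPressure, and so on.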
```
# HELP kube_node_status_condition The condition of a cluster node.
# TYPE kube_node_status_condition gauge
kube_node_status_condition{node="peng01",condition="NetworkUnavailable",status="true"} 0
kube_node_status_condition{node="peng01",condition="NetworkUnavailable",status="false"} 1
kube_node_status_condition{node="peng01",condition="NetworkUnavailable",status="unknown"} 0
kube_node_status_condition{node="peng01",condition="MemoryPressure",status="true"} 0
kube_node_status_condition{node="peng01",condition="MemoryPressure",status="false"} 1
kube_node_status_condition{node="peng01",condition="MemoryPressure",status="unknown"} 0
kube_node_status_condition{node="peng01",condition="DiskPressure",status="true"} 0
kube_node_status_condition{node="peng01",condition="DiskPressure",status="false"} 1
kube_node_status_condition{node="peng01",condition="DiskPressure",status="unknown"} 0
kube_node_status_condition{node="peng01",condition="PIDPressure",status="true"} 0
kube_node_status_condition{node="peng01",condition="PIDPressure",status="false"} 1
kube_node_status_condition{node="peng01",condition="PIDPressure",status="unknown"} 0
kube_node_status_condition{node="peng01",condition="Ready",status="true"} 1
kube_node_status_condition{node="peng01",condition="Ready",status="false"} 0
kube_node_status_condition{node="peng01",condition="Ready",status="unknown"} 0
```
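The metric above could be turned into alerting rules roughly as follows. This is a sketch under stated assumptions: the alert names, the 5-minute `for` durations, and the `severity` labels are illustrative choices; only the metric name and its `node`, `condition`, and `status` labels come from the output shown above.

```yaml
groups:
- name: kubernetes-nodes                   # hypothetical group name
  rules:
  - alert: KubeNodeNotReady                # hypothetical alert name
    # The Ready/true series is 0 whenever the node is NotReady or Unknown.
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m                                # assumed grace period
    labels:
      severity: critical                   # assumed label convention
    annotations:
      summary: "Node {{ $labels.node }} is not Ready"
  - alert: KubeNodeUnderPressure           # hypothetical alert name
    # Fires when any of the pressure conditions is currently true.
    expr: kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure",status="true"} == 1
    for: 5m                                # assumed grace period
    labels:
      severity: warning                    # assumed label convention
    annotations:
      summary: "Node {{ $labels.node }} reports {{ $labels.condition }}"
```

Matching on `status="true"` and comparing against 0 covers both the `false` and the `unknown` case in a single expression.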
## **Monitoring Whether Critical Add-on Pods Are Running or Ready**
The following two metrics show whether a Pod is Running or Ready:
```
# HELP kube_pod_status_phase The pods current phase.
# TYPE kube_pod_status_phase gauge
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Pending"} 0
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Succeeded"} 0
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Failed"} 0
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Running"} 1
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Unknown"} 0
# HELP kube_pod_status_ready Describes whether the pod is ready to serve requests.
# TYPE kube_pod_status_ready gauge
kube_pod_status_ready{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",condition="true"} 1
kube_pod_status_ready{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",condition="false"} 0
kube_pod_status_ready{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",condition="unknown"} 0
```
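As an illustration, a rule over these two metrics might look like the sketch below. It selects by namespace rather than by Pod name; the alert names, durations, and `severity` labels are assumptions. Note that a Pod which has completed (phase `Succeeded`) also reports as not ready, so the first rule may need an extra filter in namespaces that run Jobs.

```yaml
groups:
- name: kube-system-pods                   # hypothetical group name
  rules:
  - alert: KubeSystemPodNotReady           # hypothetical alert name
    # The condition="true" series is 0 while the Pod is not ready.
    expr: kube_pod_status_ready{namespace="kube-system",condition="true"} == 0
    for: 10m                               # assumed grace period
    labels:
      severity: warning                    # assumed label convention
    annotations:
      summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is not ready"
  - alert: KubeSystemPodNotRunning         # hypothetical alert name
    # Any phase other than Running or Succeeded is treated as abnormal here.
    expr: kube_pod_status_phase{namespace="kube-system",phase=~"Pending|Failed|Unknown"} == 1
    for: 10m                               # assumed grace period
    labels:
      severity: warning                    # assumed label convention
    annotations:
      summary: "Pod {{ $labels.pod }} is in phase {{ $labels.phase }}"
```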
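In practice, however, a Pod that is recreated gets a new name, so we usually monitor with the metrics below instead: for example, send an alert when a Deployment's unavailable replica count is not 0, or when its available replica count does not equal its replica count.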
```
# HELP kube_deployment_status_replicas The number of replicas per deployment.
# TYPE kube_deployment_status_replicas gauge
kube_deployment_status_replicas{namespace="default",deployment="reviews-v1"} 1
# HELP kube_deployment_status_replicas_available The number of available replicas per deployment.
# TYPE kube_deployment_status_replicas_available gauge
kube_deployment_status_replicas_available{namespace="liuxh-test",deployment="dpdemo"} 1
# HELP kube_deployment_status_replicas_unavailable The number of unavailable replicas per deployment.
# TYPE kube_deployment_status_replicas_unavailable gauge
kube_deployment_status_replicas_unavailable{namespace="chenjh-test",deployment="newlocalpvc"} 1
```
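The two conditions described above can be sketched as alerting rules like these. The alert names, `for` durations, and `severity` labels are assumptions for illustration; the metric names and the `namespace` and `deployment` labels are taken from the sample output.

```yaml
groups:
- name: kubernetes-deployments             # hypothetical group name
  rules:
  - alert: KubeDeploymentReplicasUnavailable   # hypothetical alert name
    expr: kube_deployment_status_replicas_unavailable > 0
    for: 5m                                # assumed grace period, tolerates rollouts
    labels:
      severity: warning                    # assumed label convention
    annotations:
      summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"
  - alert: KubeDeploymentReplicasMismatch      # hypothetical alert name
    # Both metrics carry the same namespace and deployment labels,
    # so the comparison matches the series one-to-one.
    expr: kube_deployment_status_replicas_available != kube_deployment_status_replicas
    for: 5m                                # assumed grace period
    labels:
      severity: warning                    # assumed label convention
    annotations:
      summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} available replicas differ from replicas"
```

The `for` clause matters here: during a normal rolling update both conditions hold briefly, and a grace period keeps those transient states from paging anyone.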
## **Reference**
* https://github.com/kubernetes/kube-state-metrics