[TOC]

Beyond what was covered earlier, there is much more in k8s worth monitoring, for example whether every Node in the cluster is Ready, and whether the containers of the network plugin and the DNS plugin are all Running.

These metrics are exposed by an exporter called kube-state-metrics.

## **Installation**

#### **Installing kube-state-metrics**

kube-state-metrics watches the k8s API, so its version has to be compatible with the k8s version; see the project page listed under Reference for the compatibility details. Here k8s is v1.17 and we use kube-state-metrics v1.9.

The directory https://github.com/kubernetes/kube-state-metrics/tree/release-1.9/examples/standard contains 5 manifest files that create the required resources. Below, we merge their contents into three files and make a few minor adjustments.

Create the file kube-state-metrics-rbac.yaml with the following content:

```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
# Read-only (list/watch) access to the objects kube-state-metrics reports on
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - list
  - watch
---
# Bind the ClusterRole to the ServiceAccount used by the Deployment
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
```

Create the file kube-state-metrics-deployment.yaml with the following content:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
      - image: quay.io/coreos/kube-state-metrics:v1.9.6
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        # 8080 serves the cluster-state metrics
        - containerPort: 8080
          name: http-metrics
        # 8081 serves kube-state-metrics' own telemetry
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
```

Create the file kube-state-metrics-service.yaml with the following content:

```
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  type: ClusterIP
  selector:
    app: kube-state-metrics
  ports:
  - name: http-metrics
    port: 8080
    targetPort: 8080
```

Then create the resources defined above:

```
$ kubectl apply -f kube-state-metrics-rbac.yaml
$ kubectl apply -f kube-state-metrics-deployment.yaml
$ kubectl apply -f kube-state-metrics-service.yaml
```

After that, the metrics can be viewed from any host in the cluster at the following URL, where {serviceIP} is the ClusterIP of the Service:

```
$ curl {serviceIP}:8080/metrics
```

#### **Updating the Prometheus configuration**

Add the following entry under scrape_configs in the Prometheus configuration file:

```
- job_name: kube-state-metrics
  static_configs:
  - targets: ['kube-state-metrics.kube-system.svc:8080']
```

Here the target is configured statically; Service discovery is not used.

## **Monitoring Node status**

The metrics exposed by kube-state-metrics include the following. With them we can tell whether a Node is Ready, whether it is under MemoryPressure, and so on. For each condition, exactly one of the three status series has the value 1.

```
# HELP kube_node_status_condition The condition of a cluster node.
# TYPE kube_node_status_condition gauge
kube_node_status_condition{node="peng01",condition="NetworkUnavailable",status="true"} 0
kube_node_status_condition{node="peng01",condition="NetworkUnavailable",status="false"} 1
kube_node_status_condition{node="peng01",condition="NetworkUnavailable",status="unknown"} 0
kube_node_status_condition{node="peng01",condition="MemoryPressure",status="true"} 0
kube_node_status_condition{node="peng01",condition="MemoryPressure",status="false"} 1
kube_node_status_condition{node="peng01",condition="MemoryPressure",status="unknown"} 0
kube_node_status_condition{node="peng01",condition="DiskPressure",status="true"} 0
kube_node_status_condition{node="peng01",condition="DiskPressure",status="false"} 1
kube_node_status_condition{node="peng01",condition="DiskPressure",status="unknown"} 0
kube_node_status_condition{node="peng01",condition="PIDPressure",status="true"} 0
kube_node_status_condition{node="peng01",condition="PIDPressure",status="false"} 1
kube_node_status_condition{node="peng01",condition="PIDPressure",status="unknown"} 0
kube_node_status_condition{node="peng01",condition="Ready",status="true"} 1
kube_node_status_condition{node="peng01",condition="Ready",status="false"} 0
kube_node_status_condition{node="peng01",condition="Ready",status="unknown"} 0
```

## **Monitoring whether the Pods of important plugins are all Running or Ready**

The following two metrics show whether a Pod is Running or Ready:

```
# HELP kube_pod_status_phase The pods current phase.
# TYPE kube_pod_status_phase gauge
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Pending"} 0
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Succeeded"} 0
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Failed"} 0
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Running"} 1
kube_pod_status_phase{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",phase="Unknown"} 0
# HELP kube_pod_status_ready Describes whether the pod is ready to serve requests.
# TYPE kube_pod_status_ready gauge
kube_pod_status_ready{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",condition="true"} 1
kube_pod_status_ready{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",condition="false"} 0
kube_pod_status_ready{namespace="kube-system",pod="coredns-cb88f5dd5-9kllm",condition="unknown"} 0
```

In practice, however, Pods get recreated, and a recreated Pod has a new name, so alerting on a specific Pod name is fragile. We therefore usually monitor the Deployment-level metrics below instead. For example, send an alert when a Deployment's unavailable count is not 0, or when its available count does not equal its replicas count.

```
# HELP kube_deployment_status_replicas The number of replicas per deployment.
# TYPE kube_deployment_status_replicas gauge
kube_deployment_status_replicas{namespace="default",deployment="reviews-v1"} 1
# HELP kube_deployment_status_replicas_available The number of available replicas per deployment.
# TYPE kube_deployment_status_replicas_available gauge
kube_deployment_status_replicas_available{namespace="liuxh-test",deployment="dpdemo"} 1
# HELP kube_deployment_status_replicas_unavailable The number of unavailable replicas per deployment.
# TYPE kube_deployment_status_replicas_unavailable gauge
kube_deployment_status_replicas_unavailable{namespace="chenjh-test",deployment="newlocalpvc"} 1
```

## **Reference**

* https://github.com/kubernetes/kube-state-metrics
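
The alert conditions described in the monitoring sections (a Node that is not Ready, a Deployment whose unavailable count is not 0 or whose available count differs from its replicas count) could be written as Prometheus alerting rules. The following is only a minimal sketch: the group name, alert names, the `for: 5m` duration, the `severity` labels and the annotation texts are all assumptions chosen for illustration and are not taken from the setup above. The file would also need to be listed under `rule_files` in the Prometheus configuration, and an Alertmanager is needed to actually deliver notifications.

```yaml
groups:
- name: kube-state-metrics-example        # hypothetical group name
  rules:
  # The Ready/status="true" series drops to 0 when the Node is not Ready
  - alert: NodeNotReady                   # hypothetical alert name
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m                               # assumed duration, tune as needed
    labels:
      severity: critical                  # assumed label
    annotations:
      summary: "Node {{ $labels.node }} is not Ready"
  # Any unavailable replica in a Deployment
  - alert: DeploymentReplicasUnavailable  # hypothetical alert name
    expr: kube_deployment_status_replicas_unavailable > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"
  # Available replicas differ from the replicas reported in the status
  - alert: DeploymentReplicasMismatch     # hypothetical alert name
    expr: kube_deployment_status_replicas_available != kube_deployment_status_replicas
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} available replicas do not match replicas"
```

The `for` clause keeps a short-lived mismatch, such as during a rolling update, from firing immediately; whether 5 minutes is appropriate depends on how long rollouts normally take in the cluster.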