ETCD Monitoring Metrics · prometheus

[TOC]

### **Server**

### **Disk**

The disk-related performance metrics are described below:

![](https://img.kancloud.cn/d5/7b/d57b56764491dc87d6a2783573b741b8_840x422.png)

We can monitor these two metrics with the following queries:

```
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m]))
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m]))
```

These two values typically stay below 30 ms (based on my own observation of historical data, so not necessarily representative).

### **Network**

All network-related metrics start with `etcd_network_`. The following metrics are available:

![](https://img.kancloud.cn/05/30/05302378fc9d9a3f697408b505ffa728_851x480.png)

`peer_sent_bytes_total` is the amount of data an etcd node has sent to the other etcd nodes. In general, the leader sends more data than the followers.

`peer_received_bytes_total` is the amount of data received from the other etcd nodes. In general, a follower only receives data from the leader.

We can monitor them with the following PromQL:

```
# Send and receive rate between nodes (MiB/s)
rate(etcd_network_peer_sent_bytes_total[1m]) / 1024 / 1024
rate(etcd_network_peer_received_bytes_total[1m]) / 1024 / 1024

# Send and receive failure rate between nodes (failures/s)
rate(etcd_network_peer_sent_failures_total[1m])
rate(etcd_network_peer_received_failures_total[1m])

# Round-trip time between etcd nodes (99th percentile)
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m]))

# Rate of data sent to and received from gRPC clients such as kube-apiserver (MiB/s)
rate(etcd_network_client_grpc_sent_bytes_total[1m]) / 1024 / 1024
rate(etcd_network_client_grpc_received_bytes_total[1m]) / 1024 / 1024
```

### **File descriptors**

There are two metrics: the maximum number of file descriptors the etcd process may open, and the number of file descriptors currently open.

```
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 65536
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 124
```

### **Memory**

There are two metrics: the resident memory size of the etcd process, and the virtual memory size of the etcd process.

```
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.264686592e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.105934336e+10
```

### **Etcd_debugging namespace metrics**

Debugging metrics are not stable and may change at any time. One of them is related to `snapshot_save` and reflects disk performance.

![](https://img.kancloud.cn/b7/48/b74830422d1c458f62acc9c8f0137aa6_850x200.png)

However, in a k8s-1.23.3 cluster, etcd exposes this metric under the name `etcd_debugging_snap_save_total_duration_seconds`, so we can monitor it with the following query:

```
histogram_quantile(0.99, rate(etcd_debugging_snap_save_total_duration_seconds_bucket[1m]))
```

This value typically stays below 30 ms (based on my own observation of historical data, so not necessarily representative).

### **References**

* https://etcd.io/docs/v3.5/metrics/
* https://juejin.cn/post/6844904105257730055#ETCD-2
* https://v2-0.docs.kubesphere.io/docs/zh-CN/api-reference/monitoring-metrics/
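
As a supplementary note on the `histogram_quantile(0.99, ...)` queries used throughout this page: the function estimates a quantile from cumulative `_bucket` counters by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified illustration of that idea, not Prometheus's actual implementation. The function only borrows the PromQL name, the bucket boundaries and counts are made-up sample data, and edge cases such as `q = 0` or negative bucket bounds are deliberately left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 < q <= 1, non-negative bucket bounds,
    and cumulative counts, as in a Prometheus classic histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    # Rank of the observation we are looking for.
    rank = q * total

    # First bucket whose cumulative count reaches the rank.
    # The +Inf bucket always qualifies, so the loop always breaks.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break

    if upper == math.inf:
        # The quantile lies beyond the last finite bound; fall back to
        # the highest finite bound, if there is one.
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    # Lower edge of the selected bucket (0 for the first bucket).
    lower, prev = (0.0, 0.0) if i == 0 else buckets[i - 1]

    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


if __name__ == "__main__":
    # Hypothetical per-bucket rates for a duration histogram (seconds).
    sample = [
        (0.001, 10), (0.002, 40), (0.004, 90),
        (0.008, 99), (0.016, 100), (math.inf, 100),
    ]
    print(histogram_quantile(0.99, sample))
```

Because the result is interpolated, its precision is bounded by the width of the bucket the quantile falls into. A reported p99 is therefore an estimate and not an exact measurement.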