**Components of Prometheus:** The Prometheus ecosystem consists of multiple components, most of which are optional:

* the **Prometheus server**, which scrapes and stores time-series data;
* client libraries (**Client Library**) for instrumenting application code;
* a **Push Gateway** for supporting short-lived jobs;
* special-purpose **exporters**, which expose a monitored component's metrics over an HTTP interface; ready-made exporters exist for services such as HAProxy, StatsD, MySQL, Nginx and Graphite;
* the **Alertmanager** for handling alerts;
* various support tools.

**Overall architecture of Prometheus**

*(figure: Prometheus architecture diagram)*

The overall workflow of Prometheus:

1) The Prometheus server periodically scrapes metrics from the configured jobs or exporters, or receives metrics pushed over from the Push Gateway.
2) The Prometheus server stores the collected metrics locally and aggregates them.
3) It runs the defined alert.rules, records new time series, or pushes alerts to the Alertmanager.
4) The Alertmanager processes the alerts it receives according to its configuration and sends out notifications via email and other channels.
5) Graphing tools such as Grafana fetch the monitoring data and present it graphically.

**Data model**

Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions. Besides the stored time series, Prometheus may generate temporary derived time series as the result of queries.

**Metric names and labels**: every time series is uniquely identified by its metric name and a set of key-value pairs, also known as labels. The **metric name** specifies the feature of the system being measured (e.g. `http_requests_total`, the total number of HTTP requests received). It may contain ASCII letters and digits, as well as underscores and colons, and must match the regex `[a-zA-Z_:][a-zA-Z0-9_:]*`. **Labels** enable Prometheus's dimensional data model: for a given metric name, any combination of labels identifies a particular dimensional instantiation of that metric. The query language allows filtering and aggregation based on these dimensions. Changing any label value, including adding or removing a label, creates a new time series. Label names may contain ASCII letters, digits, and underscores, and must match the regex `[a-zA-Z_][a-zA-Z0-9_]*`. Label names beginning with `__` are reserved for internal use.

**Samples**: the actual time-series data. Each sample consists of a float64 value and a millisecond-precision timestamp.

**Notation:** given a metric name and a set of labels, a time series is commonly identified as: `<metric name>{<label name>=<label value>, ...}`

**Metric types**

Prometheus client libraries provide four main metric types: Counter, Gauge, Histogram, and Summary.

* **Counter**: a cumulative metric whose value can only increase, or be reset to zero on restart.
* **Gauge**: a single numerical value that can arbitrarily go up and down.
* **Histogram**: samples observations (e.g. request durations or response sizes) and counts them in configurable buckets; it also provides a sum of all observed values.
* **Summary**: similar to a Histogram, a Summary samples observations (typically request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.

There are two ways of getting data: pull and push.

* pull: an exporter is installed on the monitored host; the exporter collects the data, Prometheus accesses the exporter with an HTTP GET, and the exporter returns the data.
* push: the Pushgateway is installed on the client side; scripts you develop yourself organize the data into key/value form and send it to the Pushgateway, and Prometheus then scrapes the Pushgateway.
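To make the two modes concrete, here is a minimal `prometheus.yml` sketch; the target addresses are hypothetical placeholders, not part of the cluster setup described later.

```yaml
# Minimal sketch: one pull-style job and one Pushgateway job.
scrape_configs:
  # pull: Prometheus scrapes the exporter's /metrics endpoint directly.
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']   # hypothetical node-exporter address
  # push: short-lived jobs push to the Pushgateway, and Prometheus scrapes it.
  - job_name: 'pushgateway'
    honor_labels: true                # keep the labels set by the pushing job
    static_configs:
      - targets: ['localhost:9091']   # hypothetical Pushgateway address
```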
**PromQL examples**

1) `(sum(increase(node_cpu{mode="idle"}[1m])) by (instance) / sum(increase(node_cpu[1m])) by (instance)) * 100`

* The metric being queried is `node_cpu` (renamed to `node_cpu_seconds_total` as of node-exporter 0.16).
* `increase()` computes the increase over a time range; `[1m]` means the increase within one minute.
* `{mode="idle"}` restricts that one-minute increase to idle CPU time.
* `sum()` adds the values together.
* `by (instance)` splits the summed value back out by the given label; `instance` identifies the machine.
* `/` is division; PromQL supports the mathematical operators `+ - * / % ^`.
* The PromQL above is the expression for computing CPU utilization (strictly speaking it yields the idle percentage; subtract it from 100 to get the utilization).
* `rate()` computes the average per-second increase over a time range; it is made to be paired with counters.
* `{exported_instance=~"XXX"}` filters with fuzzy (regex) matching.
* `topk()` returns the top few highest values, generally used for ad-hoc console queries.
* `count()` counts how many series match a condition, e.g. the total number of pods rather than which pods exist.
* `predict_linear()` computes the rate of change of a series and extrapolates where it will be in the future.

Only the more commonly used functions are listed here; the official site documents many more.

**Installing Prometheus**

Prometheus can be created in container form inside Kubernetes, or installed outside the cluster from a binary package.

`# vim prometheus.yaml`

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    name: prometheus-deployment
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - image: prom/prometheus:v2.0.0
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 2500Mi
      serviceAccountName: prometheus
      imagePullSecrets:
      - name: regsecret
      volumes:
      - name: data
        emptyDir: {}
      - name: config-volume
        configMap:
          name: prometheus-config
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  type: ClusterIP
  ports:
  - port: 9090
    targetPort: 9090
  selector:
    app: prometheus
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
spec:
  rules:
  - host: prometheus.pkbeta.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus
          servicePort: 9090
```

This single prometheus.yaml creates a Namespace, the RBAC objects, a Deployment, a Service, and an Ingress.

**!!! The most important part, the ConfigMap holding Prometheus's configuration file, is listed separately below.** The inline remarks are kept as `#` comments.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s      # scrape data every 15s
      evaluation_interval: 15s  # evaluate rules every 15s
    scrape_configs:
    - job_name: 'kubernetes-apiservers'  # name of the job
      kubernetes_sd_configs:             # discover targets by Kubernetes role
      - role: endpoints                  # get the apiserver data from endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      # relabel_configs modifies any target and its labels before scraping;
      # source_labels selects which labels to match on, and 'keep' retains
      # only the endpoints whose joined source labels match the regex.
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
```
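To make the relabeling above concrete before listing the remaining jobs: Prometheus joins the values of `source_labels` with `;` and tests the result against `regex`; with `action: keep`, only matching targets survive. Here is a minimal sketch of the same mechanism as a hypothetical namespace filter (not part of the configuration used here):

```yaml
# Sketch: keep only targets discovered in selected namespaces.
# The namespace list is hypothetical - adjust it to your cluster.
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
  action: keep
  regex: default|monitoring
```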
The remaining scrape jobs, continuing the `prometheus.yml` content of the ConfigMap above:

```yaml
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: 'kubernetes-services'
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module: [http_2xx]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
    - job_name: 'kubernetes-ingresses'
      kubernetes_sd_configs:
      - role: ingress
      relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
        regex: (.+);(.+);(.+)
        replacement: ${1}://${2}${3}
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_ingress_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_ingress_name]
        target_label: kubernetes_name
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
```
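The `kubernetes-pods` job above keys entirely on pod annotations. As a minimal sketch (the app name, image, and port are hypothetical), a pod like this would be discovered, kept, and scraped at `<pod-ip>:8080/metrics`:

```yaml
# Sketch: a pod that the 'kubernetes-pods' job would pick up.
apiVersion: v1
kind: Pod
metadata:
  name: my-app                       # hypothetical name
  annotations:
    prometheus.io/scrape: 'true'     # matched by the 'keep' rule
    prometheus.io/path: '/metrics'   # rewrites __metrics_path__
    prometheus.io/port: '8080'       # rewrites the port in __address__
spec:
  containers:
  - name: my-app
    image: my-app:latest             # hypothetical image
    ports:
    - containerPort: 8080
```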
The jobs above define exactly what data you want to scrape: the kubernetes-apiservers job collects apiserver performance metrics, cadvisor collects container performance metrics, and so on.

So far only the Prometheus server itself has been installed; something still has to collect the data on the monitored side:

* cAdvisor collects container performance metrics and is already integrated into the kubelet;
* prometheus-node-exporter collects host performance metrics; it needs to run on every node, so it **must** be deployed as a DaemonSet;
* kube-state-metrics collects health and state metrics for Kubernetes resource objects and Kubernetes components; it does not need a copy on every node, and the manifest below runs it as an ordinary Deployment.

`# vim node-exporter.yaml`

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    k8s-app: node-exporter
spec:
  template:
    metadata:
      labels:
        k8s-app: node-exporter
    spec:
      containers:
      - image: prom/node-exporter:v0.16.0
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
          name: http
        volumeMounts:
        - name: time
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: time
        hostPath:
          path: /etc/localtime
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/app-metrics: 'true'
    prometheus.io/app-metrics-path: '/metrics'
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 9100
    targetPort: 9100
    protocol: TCP
  selector:
    k8s-app: node-exporter
```

`# vim kube-state-metrics.yaml`

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: gcr.io/google_containers/kube-state-metrics:v0.5.0
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
  selector:
    app: kube-state-metrics
```
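Both Services above carry `prometheus.io/scrape: 'true'`, which is what the `kubernetes-service-endpoints` job matches on. For reference, a sketch of a Service using the full set of annotations that job understands (the name and port are hypothetical):

```yaml
# Sketch: a Service advertising its metrics endpoint to the
# 'kubernetes-service-endpoints' job.
apiVersion: v1
kind: Service
metadata:
  name: my-metrics-svc              # hypothetical name
  namespace: monitoring
  annotations:
    prometheus.io/scrape: 'true'    # opt in (matched by the 'keep' rule)
    prometheus.io/scheme: 'http'    # rewrites __scheme__
    prometheus.io/path: '/metrics'  # rewrites __metrics_path__
    prometheus.io/port: '8080'      # rewrites the port in __address__
spec:
  ports:
  - port: 8080
  selector:
    app: my-metrics-svc             # hypothetical selector
```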
The same file also ships a `node-directory-size-metrics` DaemonSet:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-directory-size-metrics
  namespace: monitoring
  annotations:
    description: |
      This `DaemonSet` provides metrics in Prometheus format about disk usage on the nodes.
      The container `read-du` reads in sizes of all directories below /mnt and writes that to `/tmp/metrics`.
      It only reports directories larger than `100M` for now.
      The other container `caddy` just hands out the contents of that file on request via `http` on `/metrics` at port `9102` which are the defaults for Prometheus.
      These are scheduled on every node in the Kubernetes cluster.
      To choose directories from the node to check, just mount them on the `read-du` container below `/mnt`.
spec:
  template:
    metadata:
      labels:
        app: node-directory-size-metrics
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9102'
        description: |
          This `Pod` provides metrics in Prometheus format about disk usage on the node.
          The container `read-du` reads in sizes of all directories below /mnt and writes that to `/tmp/metrics`. It only reports directories larger than `100M` for now.
          The other container `caddy` just hands out the contents of that file on request on `/metrics` at port `9102` which are the defaults for Prometheus.
          This `Pod` is scheduled on every node in the Kubernetes cluster.
          To choose directories from the node to check just mount them on `read-du` below `/mnt`.
    spec:
      containers:
      - name: read-du
        image: giantswarm/tiny-tools
        imagePullPolicy: Always
        command:
        - fish
        - --command
        - |
          touch /tmp/metrics-temp
          while true
            for directory in (du --bytes --separate-dirs --threshold=100M /mnt)
              echo $directory | read size path
              echo "node_directory_size_bytes{path=\"$path\"} $size" \
                >> /tmp/metrics-temp
            end
            mv /tmp/metrics-temp /tmp/metrics
            sleep 300
          end
        volumeMounts:
        - name: host-fs-var
          mountPath: /mnt/var
          readOnly: true
        - name: metrics
          mountPath: /tmp
      - name: caddy
        image: dockermuenster/caddy:0.9.3
        command:
        - "caddy"
        - "-port=9102"
        - "-root=/var/www"
        ports:
        - containerPort: 9102
        volumeMounts:
        - name: metrics
          mountPath: /var/www
      volumes:
      - name: host-fs-var
        hostPath:
          path: /var
      - name: metrics
        emptyDir:
          medium: Memory
```

Now install Prometheus:

`# kubectl create -f .`

We set up an Ingress for Prometheus earlier with the domain prometheus.pkbeta.com, so the web UI can now be opened in a browser at that domain.

*(screenshot: the Prometheus web UI)*

1. Enter a PromQL query for the data you need.
2. The button that executes the query.
3. A drop-down listing the metrics available for querying.
4. The query result, displayed as data.
5. The query result, displayed as a graph.

**Installing Grafana**

**Basic concepts**

1. Data source — Grafana is only a time-series presentation tool; the time-series data it displays is provided by data sources.
2. Organization — Grafana supports multiple organizations; a single instance can serve several organizations that do not trust each other.
3. User — a user can belong to one or more organizations, and the same user can be assigned different permission levels in different organizations.
4. Row — within a dashboard, a row is a divider used to group panels.
5. Panel — the most basic display unit; each panel provides a query editor.
6. Query editor — exposes the capabilities of the data source; different data sources have different query editors.
7. Dashboard — where the various components are combined into the final display.
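One note on data sources before installing: Grafana 5.0 and later can provision them declaratively from a YAML file, whereas the grafana:4.2.0 image used below predates that feature, so there the data source is added through the UI as shown later. A provisioning sketch for the newer versions (the file path and display name are your choice):

```yaml
# Sketch: Grafana 5.0+ data source provisioning
# (e.g. /etc/grafana/provisioning/datasources/prometheus.yaml).
apiVersion: 1
datasources:
  - name: prometheus        # display name, arbitrary
    type: prometheus
    access: proxy           # the Grafana backend proxies the queries
    url: http://prometheus.monitoring.svc.cluster.local:9090   # in-cluster Service DNS
    isDefault: true
```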
`# vim grafana.yaml`

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana-core
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - image: grafana/grafana:4.2.0
        name: grafana
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        env:
        - name: GF_INSTALL_PLUGINS
          value: "alexanderzobnin-zabbix-app"
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "false"
        readinessProbe:
          httpGet:
            path: /login
            port: 3000
        volumeMounts:
        - name: grafana-persistent-storage
          mountPath: /var
      volumes:
      - name: grafana-persistent-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  type: ClusterIP
  ports:
  - port: 3000
  selector:
    app: grafana
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
  - host: grafana.pkbeta.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
```

`# kubectl create -f grafana.yaml`

The manifest above gives Grafana the domain grafana.pkbeta.com. Open it in a browser to reach the Grafana login page; the default credentials are admin/admin.

At this point there is no data source and no dashboard yet. First, add a data source:

* Name: a name for the data source, of your own choosing.
* Type: the type of the data source; choose prometheus.
* Url: preferably the in-cluster Prometheus DNS name plus port (with the manifests above, `http://prometheus.monitoring.svc.cluster.local:9090`).

Then add the data source.

With the data source in place, all that is missing is a dashboard. You can import a template someone else has made, or build your own.

Clicking a panel's title pops up its options menu. Switch the data source from default to prometheus, then change the Query to whatever PromQL you want to chart.

Alternatively, import a ready-made dashboard template:

* the first option imports a template file;
* or enter the number (ID) of the template to import;
* or paste the template's JSON directly.

Name: the name of the dashboard template. Prometheus: select the data source.