Grafana · PHP Notes

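## Grafana Visual Monitoring

**author: xiak**
**last update: 2022-10-16 20:12:22**

----

[TOC=3,8]

----

### Introduction

![](https://grafana.com/static/img/screenshots/Modal_dashboards.png)

[Prometheus](https://prometheus.io/) is a straightforward way to store **time-series metrics**. It gives users the tools needed to collect, store, inspect, and query metrics.

[Grafana](https://www.grafana.com/) provides powerful, flexible visualization pages for displaying metrics. It lets users add Prometheus as a data source and **visualize the metrics as graphs and dashboards**.

By analogy with the real world, Grafana and Prometheus are a car's dashboard and an aircraft's black box. As a dashboard, they show temperature, friction, tire pressure, and so on: you want everything under control, fast and furious. As a black box, you hope you will never need it, but on the day you really do, the whole world hopes it can be found.

----

#### A visit to the rocket launch control center

[Full record of the SpaceX Crew Dragon launch control room](https://www.bilibili.com/video/av413373882/)

> In 2016, the Falcon 9 rocket was launched for the first time. In the SpaceX control center, staff monitored system behavior data and fluctuations in real time during the launch. Even though a large amount of data was produced, a single visualization let them quickly judge whether the system was running normally and which anomalies needed attention. That tool was Grafana. [Grafana: SpaceX's data monitoring tool, the Tableau of the cloud-native world](https://mp.weixin.qq.com/s/zgd8KjpGoqwPGC6b1I9owg)

----

#### What they say

> "Prometheus and Grafana are now the recognized standard in the Kubernetes world, and that is the standard we use. Our services all export Prometheus metrics, which are then collected and read into Grafana Cloud." - Zach Pallin, Senior DevOps Engineer at Grail

> "I really couldn't live without Prometheus and Grafana. I really like being able to see everything that happens in my application." - Austin Adams, Senior Software Engineer at Ygrene

> At 卡拉搜索 we use Grafana to monitor the status of every service, from the engine to the index. Thorough monitoring lets us follow 卡拉's search latency, slow searches, Docker status, and more in real time. - 卡拉搜索

----

### Installation

Related ports:

```
prometheus: 9090
node_exporter: 9100
mysqld_exporter: 9104
Grafana: 3000

http://212.64.100.122:9090/metrics prometheus
http://212.64.100.122:9100/metrics node_exporter
http://212.64.100.122:9104/metrics mysqld_exporter
```

#### Installing Prometheus

```shell
wget https://github.com/prometheus/prometheus/releases/download/v2.39.0-rc.0/prometheus-2.39.0-rc.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.39.0-rc.0.linux-amd64.tar.gz
cd prometheus-2.39.0-rc.0.linux-amd64

# Validate the config, then start the server
./promtool check config prometheus.yml
./prometheus --config.file="/root/prometheus-2.39.0-rc.0.linux-amd64/prometheus.yml"

# Stop the server
pkill prometheus

# Metrics endpoint: http://212.64.100.122:9090/metrics
```

```shell
# Start with the admin API and lifecycle API enabled, keeping 180 days of data
./prometheus \
  --config.file="/opt/grafana/prometheus-2.39.0-rc.0.linux-amd64/prometheus.yml" \
  --web.enable-admin-api \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=180d

# Delete all series (requires --web.enable-admin-api), then clean up tombstones
curl -X POST -g \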
  'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/clean_tombstones'
```

[Deleting specific data in Prometheus (51CTO blog)](https://blog.51cto.com/jschu/3728968)

[linuxea: Cleaning up kube-prometheus historical data - LinuxEA](http://myapp.linuxea.com/2590.html)

[Prometheus configuration and commands (personal notes), --web.enable-admin-api (CSDN blog)](https://blog.csdn.net/ChenShiAi/article/details/108833617)

----

#### Installing node_exporter

```shell
wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
tar -xvzf node_exporter-1.4.0.linux-amd64.tar.gz
cd node_exporter-1.4.0.linux-amd64

# Run in the background
nohup ./node_exporter &

# Metrics endpoint: http://212.64.100.122:9100/metrics
```

----

#### Installing mysqld_exporter

```shell
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.14.0/mysqld_exporter-0.14.0.linux-amd64.tar.gz
tar -xvzf mysqld_exporter-0.14.0.linux-amd64.tar.gz
cd mysqld_exporter-0.14.0.linux-amd64

# Create the credentials file (contents shown below)
vi my.cnf

# Run in the background
nohup ./mysqld_exporter --config.my-cnf=/root/mysqld_exporter-0.14.0.linux-amd64/my.cnf &

# Metrics endpoint: http://212.64.100.122:9104/metrics
```

vi my.cnf

~~~cnf
[client]
user=root
password=****
~~~

----

#### Installing php-fpm_exporter

```shell
wget https://github.com/hipages/php-fpm_exporter/releases/download/v2.2.0/php-fpm_exporter_2.2.0_linux_amd64.tar.gz
tar -xvzf php-fpm_exporter_2.2.0_linux_amd64.tar.gz
cd php-fpm_exporter_2.2.0_linux_amd64

# Quote the scrape URI: an unquoted ';' would end the shell command early
sudo -u www /root/php-fpm_exporter get --phpfpm.scrape-uri "unix:/dev/shm/php-cgi.sock;/status"
```

----

#### Installing Grafana

https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1

~~~
# OSS edition
wget https://dl.grafana.com/oss/release/grafana-9.2.0-1.x86_64.rpm
sudo yum install grafana-9.2.0-1.x86_64.rpm
~~~

```shell
# Enterprise edition
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-9.2.0-1.x86_64.rpm
sudo yum install grafana-enterprise-9.2.0-1.x86_64.rpm

systemctl start grafana-server
systemctl enable grafana-server

# Web UI: http://212.64.100.122:3000/
netstat -nlpt | grep grafana
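```

https://grafana.com/grafana/dashboards/

----

### Usage

~~~
### Monitoring metrics

What do we monitor? Simply put, [metrics]: whatever [metrics] you want to know.

Take a car: you need to know its speed, fuel level, temperature, wear, and so on at all times. So for your application as a whole, its architecture, business layer, and lower layers, which [metrics] do you want to understand?

----

#### node_exporter

Monitors server node status

----

#### nginx_exporter

Monitors nginx load and related status

----

#### phpfpm_exporter

Monitors phpfpm load and related status

----

#### mysqld_exporter

Monitors mysql load, usage, and related status

----

#### redis_exporter

Monitors redis load, usage, and related status

----

#### elasticsearch_exporter

Monitors ElasticSearch load, indexes, usage, and related status

----

#### pulsar_exporter

Monitors pulsar load, usage, and related status

----

#### Application metrics

##### Application daemon monitoring app-daemon

- Number of started process groups
- Number of started processes
- Process memory consumption
- Process CPU consumption
- Process IO consumption

----

##### Gateway application monitoring app-gatewayworker

- Number of gateway client connections
- Gateway outbound traffic
- Gateway inbound traffic
- Number of idle workers

----

##### Parking lot application app-parkinglot_exporter

- pt_alilot_amqp_msg
  * Messages reported by Alibaba IoT devices today {all} (real-time, timeline)
  * Messages reported by Alibaba IoT devices today {type1} (timeline)
  * Messages reported by Alibaba IoT devices today {type2} (timeline)
- pt_request_log
  * API response time (mean of the latest 100 requests)
  * Device-side API requests today {all} (real-time, timeline)
  * Device-side API requests today {device side - billing API} (timeline)
  * Device-side API requests today {device side - exit API} (timeline)
  * Device-side API requests today {device side - other APIs} (timeline)
- pt_stoping
  * Total vehicles currently parked (real-time, timeline)
- pt_rrpc_log
  * Commands dispatched today {all} (real-time, timeline)
  * Commands dispatched today {succeeded} (real-time, timeline)
  * Commands dispatched today {failed} (real-time, timeline)
- pt_rrpc_fail_queue
  * Total count {all} (real-time, timeline)
  * Total count {3 retries} (real-time, timeline)
  * Total count {5 retries} (real-time, timeline)
- pt_waiter_passageway_report
  * Operator-desk lane report records today {all} (real-time, timeline)
  * Operator-desk lane report records today {with plate} (real-time, timeline)
  * Operator-desk lane report records today {without plate} (real-time, timeline)
- pt_report_log
  * Operator-desk exception report event logs today {all} (real-time, timeline)
  * Operator-desk exception report event logs today {pending} (real-time, timeline)
  * Operator-desk exception report event logs today {handled} (real-time, timeline)
- pt_operation_log
  * Device-side operation logs today {all} (real-time, timeline)
  * Device-side operation logs today {failed} (real-time, timeline)
  * Device-side operation logs today {succeeded} (real-time, timeline)
- pt_consume
  * Parking records today {all} (real-time, timeline)
  * Parking records today {abnormal} (real-time, timeline)
  * Parking records today {awaiting exit} (timeline)
  * Parking records today {exited} (timeline)
- pt_consume_orders
  * Parking orders today {all paid} (real-time, timeline)
  * Parking orders today {paid via WeChat} (timeline)
  * Parking orders today {paid via Alipay} (timeline)
  * Parking orders today {paid via other channels} (timeline)
- pt_parking_log
  * Parking-space camera logs today {all} (timeline)
  * Parking-space camera logs today {plate recognized} (timeline)
  * Parking-space camera logs today {plate not recognized} (timeline)
- pt_recharge
  * Fixed-vehicle renewal orders this month {all paid} (real-time, timeline)
  * Fixed-vehicle renewal orders this month {paid via WeChat} (timeline)
  * Fixed-vehicle renewal orders this month {paid via Alipay} (timeline)
  * Fixed-vehicle renewal orders this month {paid via other channels} (timeline)
- pt_passageway_log
  * Lane logs today {entry} (timeline)
  * Lane logs today {unlicensed vehicle exit} (timeline)
  * Lane logs today {billing} (timeline)
  * Lane logs today {exit} (timeline)
- pt_sync_heart
  * Online parking lots (real-time, timeline)
  * Offline parking lots (real-time, timeline)
- pt_gateway
  * Gateways (real-time, timeline)
  * Online (real-time, timeline)
  * Offline (real-time, timeline)
  * Other (real-time, timeline)

----
~~~

----

#### What is a metric

A metric (indicator) is the combination of a concept that describes a quantitative characteristic of a population and its numeric value, which is why it is also called a composite indicator. In practical statistical work and in statistical theory, the concept describing the quantitative characteristic is often itself simply called the indicator. https://baike.baidu.com/item/%E6%8C%87%E6%A0%87/19950696?fr=aladdin

Depending on what they mean, the metrics to collect can be divided into four dimensions:

- **Business metrics**: business-level values, such as order counts and payment channel analysis
- **System metrics**: operating system resource analysis, such as CPU/memory jitter, disk/network IO, process scheduling, and other operating system values
- **Technical metrics**: application technology analysis, such as OSS usage, SMS usage, API traffic IO distribution, daemon analysis, queue task throughput, RRPC calls, API errors, and values from base components
- **Performance metrics**: application performance analysis, such as request latency, queue consumption throughput, and db query latency bottlenecks

In Prometheus, a metric is identified simply by its name.

----

#### How to choose a metric type

Prometheus has four metric types:

##### counter

A counter is a cumulative metric that represents a single monotonically increasing counter (it can go up but never down). Its value can only increase, or be reset to zero on restart. For example, you can use a counter to represent the cumulative number of requests served, tasks completed, or errors.

##### gauge

A gauge is a metric that represents a single numeric value that can go up and down arbitrarily. It is the most commonly used metric type, typically used for measured values such as temperature or current memory usage, or for the number of concurrent requests.

##### histogram

> A histogram, also called a quality distribution chart, is a statistical report chart that shows the distribution of data with a series of vertical bars or line segments of varying height. The horizontal axis usually represents the data type and the vertical axis the distribution. https://baike.baidu.com/item/%E7%9B%B4%E6%96%B9%E5%9B%BE/1103834?fr=aladdin

>[tip] **Note that a histogram is not a bar chart.** A histogram shows how data is distributed, while a bar chart compares the size of data. That is the fundamental difference between the two. https://zhuanlan.zhihu.com/p/61433510

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides the total count of observations and the sum of all observed values. https://www.xhyonline.com/?p=1594

##### summary

Similar to a histogram, a summary samples observations (usually things like request durations and response sizes) and provides the total count of observations and the sum of all observed values, but it calculates configurable quantiles over a sliding time window.

Summary and histogram mainly exist to deal with the long-tail problem when measuring and analyzing how samples are distributed. (If most API requests finish within 100 ms but a few take 5 s, those few outliers distort the mean, so it no longer matches the typical (median) request and cannot objectively reflect the overall situation. This is known as the long-tail problem.)

> Note: metric values are always numeric, such as integers or floats.

----

#### How to use metric labels

Labels are normally used to distinguish different cases of a metric that carries the same business meaning, for example:

- When measuring API response time, use labels to distinguish API modules
- When counting orders, use labels to distinguish order types
- When measuring API traffic, use labels to distinguish input and output values

The labels differ, but they all describe the same business attribute, so these are simply values of one metric along different label dimensions.

When the business attributes are clearly different, labels should not be used to tell them apart, for example:

- Measuring system load, using labels to distinguish CPU and memory ❌
- Measuring disks, using labels to distinguish rotation speed and write rate ❌
- Measuring redis information, using labels to distinguish the number of keys and memory usage ❌

These are obviously metrics with completely different business meanings, and they should be separate, independent metrics.

----

#### Security: authorization

...

----

#### Using it in a PHP project

```shell
composer require promphp/prometheus_client_php
```

##### 1. Monitor average API response time

##### 2. Monitor the distribution of API response times

##### 3. Monitor real-time API IO traffic

##### 4. Monitor real-time API request volume

##### 5. Monitor business metrics

----

#### prometheus_client_php

https://github.com/PromPHP/prometheus_client_php

```php
$registry = new \Prometheus\CollectorRegistry(new \Prometheus\Storage\InMemory());
// $registry = \Prometheus\CollectorRegistry::getDefault(); // redis

// doc: https://prometheus.io/docs/concepts/metric_types/

/**
 * 1. Counter (for cumulative counts, etc.)
 *
 * A counter is a cumulative metric that represents a single monotonically
 * increasing counter whose value can only increase, or be reset to zero on restart.
 * For example, use a counter for the number of requests served, tasks completed, or errors.
 */
$counter = $registry->getOrRegisterCounter('app_parkinglot', 'api_request_total', 'it increases', ['type', 'curr_url']);
$counter->incBy(1, ['client', 'join']);
$counter->incBy(0, ['client', 'noplateLeaveRequest']);
$counter->incBy(3, ['client', 'recordConsume']);
$counter->incBy(2, ['client', 'leave']);

$counter2 = $registry->getOrRegisterCounter('app_parkinglot', 'smartpark_total', 'it increases');
$counter2->incBy(100);

/**
 * 2. Gauge (for timelines, line charts, etc.)
 *
 * A gauge is a metric that represents a single numeric value that can go up and down arbitrarily.
 * Gauges are typically used for measured values such as temperature or current memory usage,
 * but also for "counts" that can go up and down, such as the number of concurrent requests.
 */
$gauge = $registry->getOrRegisterGauge('app_parkinglot', 'today_consume_orders', 'it sets', ['payway']);
$gauge->set(5, ['all']);
$gauge->set(2, ['alipay']);
$gauge->set(1, ['wxpay']);
$gauge->set(2, ['other']);

/**
 * 3. Histogram (for showing distributions)
 *
 * A histogram samples observations (usually request durations or response sizes)
 * and counts them in configurable buckets. It also provides the sum of all observed values.
 */
$histogram = $registry->getOrRegisterHistogram('app_parkinglot', 'api_request_time1', 'it observes', ['type'], [0.1, 1, 2, 3.5, 4, 5, 6, 7, 8, 9]);
$histogram->observe(0.1, ['client']);
$histogram->observe(1, ['client']);
$histogram->observe(1, ['client']);
$histogram->observe(3.5, ['client']);

/**
 * 4. Summary (similar to a histogram)
 *
 * A summary samples observations (usually request durations and response sizes).
 * It also provides the total count and the sum of all observed values,
 * but it calculates configurable quantiles over a sliding time window.
 */
// Fifth argument: maximum age of the sliding window, in seconds
$summary = $registry->getOrRegisterSummary('app_parkinglot', 'api_request_time2', 'it observes a sliding window', ['type'], 84600, [0.01, 0.05, 0.5, 0.95, 0.99]);
$summary->observe(5, ['client']);

// Render the output in the Prometheus text format
$renderer = new \Prometheus\RenderTextFormat();
$result = $renderer->render($registry->getMetricFamilySamples());
header('Content-type: ' .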
    \Prometheus\RenderTextFormat::MIME_TYPE);
echo $result;
```

~~~text/plain
# HELP app_parkinglot_api_request_time1 it observes
# TYPE app_parkinglot_api_request_time1 histogram
app_parkinglot_api_request_time1_bucket{type="client",le="0.1"} 1
app_parkinglot_api_request_time1_bucket{type="client",le="1"} 3
app_parkinglot_api_request_time1_bucket{type="client",le="2"} 3
app_parkinglot_api_request_time1_bucket{type="client",le="3.5"} 4
app_parkinglot_api_request_time1_bucket{type="client",le="4"} 4
app_parkinglot_api_request_time1_bucket{type="client",le="5"} 4
app_parkinglot_api_request_time1_bucket{type="client",le="6"} 4
app_parkinglot_api_request_time1_bucket{type="client",le="7"} 4
app_parkinglot_api_request_time1_bucket{type="client",le="8"} 4
app_parkinglot_api_request_time1_bucket{type="client",le="9"} 4
app_parkinglot_api_request_time1_bucket{type="client",le="+Inf"} 4
app_parkinglot_api_request_time1_count{type="client"} 4
app_parkinglot_api_request_time1_sum{type="client"} 5.6
# HELP app_parkinglot_api_request_time2 it observes a sliding window
# TYPE app_parkinglot_api_request_time2 summary
app_parkinglot_api_request_time2{type="client",quantile="0.01"} 5
app_parkinglot_api_request_time2{type="client",quantile="0.05"} 5
app_parkinglot_api_request_time2{type="client",quantile="0.5"} 5
app_parkinglot_api_request_time2{type="client",quantile="0.95"} 5
app_parkinglot_api_request_time2{type="client",quantile="0.99"} 5
app_parkinglot_api_request_time2_count{type="client"} 1
app_parkinglot_api_request_time2_sum{type="client"} 5
# HELP app_parkinglot_api_request_total it increases
# TYPE app_parkinglot_api_request_total counter
app_parkinglot_api_request_total{type="client",curr_url="join"} 1
app_parkinglot_api_request_total{type="client",curr_url="leave"} 2
app_parkinglot_api_request_total{type="client",curr_url="noplateLeaveRequest"} 0
app_parkinglot_api_request_total{type="client",curr_url="recordConsume"} 3
# HELP app_parkinglot_smartpark_total it increases
# TYPE app_parkinglot_smartpark_total counter
app_parkinglot_smartpark_total 100
# HELP app_parkinglot_today_consume_orders it sets
# TYPE app_parkinglot_today_consume_orders gauge
app_parkinglot_today_consume_orders{payway="alipay"} 2
app_parkinglot_today_consume_orders{payway="all"} 5
app_parkinglot_today_consume_orders{payway="other"} 2
app_parkinglot_today_consume_orders{payway="wxpay"} 1
# HELP php_info Information about the PHP environment.
# TYPE php_info gauge
php_info{version="7.2.1"} 1
~~~

----

#### Access by domain name: nginx proxy

/usr/local/nginx/conf/vhost/grafana.domain.cn.conf

~~~
# this is required to proxy Grafana Live WebSocket connections.
map $http_upgrade $connection_upgrade {
  default upgrade;
  '' close;
}

upstream grafana {
  server localhost:3000;
}

server {
  listen 80;
  server_name grafana.yf5g.cn;
  root /usr/share/nginx/html;
  index index.html index.htm;

  location / {
    proxy_set_header Host $http_host;
    proxy_pass http://grafana;
  }

  # Proxy Grafana Live WebSocket connections.
  location /api/live/ {
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $http_host;
    proxy_pass http://grafana;
  }
}
~~~

/usr/local/nginx/conf/vhost/prom.domain.cn.conf

~~~
server {
  listen 80;
  server_name prom.yf5g.cn;
  root /usr/share/nginx/html;
  index index.html index.htm;

  location / {
    proxy_set_header Host $http_host;
    proxy_pass http://127.0.0.1:9090;
  }
}
~~~

prometheus.yml

~~~
# my global config
global:
  scrape_interval: 5s # Scrape targets every 5 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "apps"
    static_configs:
      # test kf master parkinglot-saas
      - targets: ["47.100.138.203:9222", "47.103.43.36:9222", "221.234.40.8:9222", "106.14.113.22:9222"]
~~~

----

### Installing Loki

https://grafana.com/docs/loki/latest/installation/local/

```shell
# Download the Loki server binary (the zip contains a single binary, not a directory)
wget https://github.com/grafana/loki/releases/download/v2.6.1/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
chmod a+x loki-linux-amd64

# Sample configs for Loki and promtail
wget https://raw.githubusercontent.com/grafana/loki/main/cmd/loki/loki-local-config.yaml
wget https://raw.githubusercontent.com/grafana/loki/main/clients/cmd/promtail/promtail-local-config.yaml

./loki-linux-amd64 -config.file=loki-local-config.yaml
```

```shell
# Assumes systemd unit files for loki and promtail have already been created
systemctl start loki && systemctl enable loki
systemctl start promtail && systemctl enable promtail
```

----

### Related resources

[Prometheus: read this and you have the basics - Zhihu](https://zhuanlan.zhihu.com/p/267966193)

[Prometheus monitoring system in practice](https://mp.weixin.qq.com/s/Y1wj8UjTxQfBikr6I2zD-w)

[Grafana beginner tutorial in Chinese | Build your first dashboard](https://mp.weixin.qq.com/s/IKdEBTP2E3juXkaCicdaYw)

[Metric types | Prometheus](https://prometheus.io/docs/concepts/metric_types/)

[PromQL fully explained - Zhihu](https://zhuanlan.zhihu.com/p/477177336)

[[Translation] Prometheus best practices: Summary and Histogram - Jianshu](https://www.jianshu.com/p/ccffd6b9e3d1)

https://grafana.com/tutorials/run-grafana-behind-a-proxy/

https://blog.csdn.net/weixin_42393272/article/details/112838170

~~~
Histogram and Summary are mainly used to measure and analyze how samples are distributed.

In most cases people tend to use the average of some quantitative indicator, such as average CPU usage or average page response time. The problem with this approach is obvious. Take the average response time of system API calls as an example: if most API requests stay within a 100 ms response time but a few requests take 5 s, those few requests pull the average away from what a typical (median) request experiences. This is known as the long-tail problem.

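To tell whether the system is slow on average or slow only in the long tail, the simplest approach is to group requests by latency range. For example, count how many requests had a latency between 0 and 10 ms and how many between 10 and 20 ms. This makes it possible to quickly analyze why the system is slow. Histogram and Summary both exist to solve this kind of problem: with Histogram and Summary metrics we can quickly understand how the monitored samples are distributed.
~~~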
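
The long-tail effect and the grouping by latency range described above can be sketched in plain PHP. This is only an illustration under stated assumptions: the sample latencies are made up, and `bucketCounts` is a hypothetical helper, not part of Prometheus or prometheus_client_php. It mimics the cumulative `le` ("less than or equal") buckets that a histogram exposes.

```php
<?php
// Hypothetical sample data: 98 fast requests (0.1 s) and 2 slow ones (5 s).
$latencies = array_merge(array_fill(0, 98, 0.1), array_fill(0, 2, 5.0));

/**
 * Count observations per cumulative bucket (value <= upper bound),
 * the same "le" semantics that Prometheus histograms use.
 * Illustrative helper only, not a library function.
 */
function bucketCounts(array $values, array $bounds): array
{
    $counts = [];
    foreach ($bounds as $bound) {
        $counts[(string) $bound] = 0;
    }
    // The +Inf bucket always contains every observation.
    $counts['+Inf'] = count($values);

    foreach ($values as $value) {
        foreach ($bounds as $bound) {
            if ($value <= $bound) {
                $counts[(string) $bound]++;
            }
        }
    }
    return $counts;
}

$mean = array_sum($latencies) / count($latencies);
$buckets = bucketCounts($latencies, [0.1, 0.5, 1, 5]);

printf("mean = %.3f s\n", $mean);
foreach ($buckets as $le => $count) {
    printf("le=%s: %d\n", $le, $count);
}
```

With this sample the mean comes out at about 0.198 s, nearly twice the typical 0.1 s request, while the buckets show directly that 98 of the 100 requests finished within 0.1 s and only 2 sit in the long tail.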