[TOC]

### **Copying and Modifying the Prometheus Manifests**

Copy the Prometheus [manifests directory](https://github.com/kubernetes/perf-tests/tree/release-1.23/clusterloader2/pkg/prometheus/manifests) to a directory on the clusterloader2 host; here we assume it is placed under `/home/docker/clusterloader2/prometheus/`. A few of the files then need to be modified:

* `0ssd-storage-class.yaml` — this file uses GCE cloud disks; we switch it to a local PV by changing its content to:

```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ssd
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```

* `0pv.yaml` — create a new `0pv.yaml` file under the `manifests/` directory with the content below. Note that `<master-name>` must be replaced with the name of an actual master in the cluster (use `kubectl get node` to look it up), and the directory `/prometheus/data` must then be created manually on that master; a short helper sketch follows this list.

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-data
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ssd
  local:
    path: /prometheus/data
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - <master-name>
```

* `prometheus-service.yaml` — edit this file and comment out the `app: prometheus` line:

```
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
  selector:
    # app: prometheus    # comment out this line
    prometheus: k8s
  sessionAffinity: ClientIP
```
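The two manual steps called out for `0pv.yaml` above can be done from the clusterloader2 host roughly as follows. This is a minimal sketch: it assumes root SSH access to the chosen master (here 10.35.20.2, one of the masters used later in this article), so adjust the user and address to your environment.

```
# Look up a master node name to substitute for <master-name> in 0pv.yaml
kubectl get nodes -o wide

# Create the directory that the local PV points at
# (assumption: root SSH access to the master; adjust as needed)
ssh root@10.35.20.2 'mkdir -p /prometheus/data'
```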
### **Test Procedure**

We run the test from 10.35.20.1. Copy the clusterloader2 binary to this host, then copy the kubemark cluster's admin.conf to the same directory and rename it to kubemark-kubeconfig.

Next, create a config.yaml file in the same directory with the following content (based on https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/density/config.yaml, with some measurements commented out and a few parameter values changed):

```
# ASSUMPTIONS:
# - Underlying cluster should have 100+ nodes.
# - Number of nodes should be divisible by NODES_PER_NAMESPACE (default 100).

#Constants
{{$DENSITY_RESOURCE_CONSTRAINTS_FILE := DefaultParam .DENSITY_RESOURCE_CONSTRAINTS_FILE ""}}
# Cater for the case where the number of nodes is less than nodes per namespace. See https://github.com/kubernetes/perf-tests/issues/887
# 100 nodes per namespace and 30 pods per node, i.e. 3,000 pods per namespace
{{$NODES_PER_NAMESPACE := MinInt .Nodes (DefaultParam .NODES_PER_NAMESPACE 100)}}
{{$PODS_PER_NODE := DefaultParam .PODS_PER_NODE 30}}
{{$DENSITY_TEST_THROUGHPUT := DefaultParam .DENSITY_TEST_THROUGHPUT 20}}
{{$SCHEDULER_THROUGHPUT_THRESHOLD := DefaultParam .CL2_SCHEDULER_THROUGHPUT_THRESHOLD 0}}
# LATENCY_POD_MEMORY and LATENCY_POD_CPU are calculated for 1-core 4GB node.
# Increasing allocation of both memory and cpu by 10%
# decreases the value of priority function in scheduler by one point.
# This results in decreased probability of choosing the same node again.
{{$LATENCY_POD_CPU := DefaultParam .LATENCY_POD_CPU 100}}
{{$LATENCY_POD_MEMORY := DefaultParam .LATENCY_POD_MEMORY 350}}
{{$MIN_LATENCY_PODS := DefaultParam .MIN_LATENCY_PODS 500}}
{{$MIN_SATURATION_PODS_TIMEOUT := 180}}
{{$ENABLE_CHAOSMONKEY := DefaultParam .ENABLE_CHAOSMONKEY false}}
{{$ENABLE_SYSTEM_POD_METRICS:= DefaultParam .ENABLE_SYSTEM_POD_METRICS true}}
{{$ENABLE_CLUSTER_OOMS_TRACKER := DefaultParam .CL2_ENABLE_CLUSTER_OOMS_TRACKER true}}
{{$CLUSTER_OOMS_IGNORED_PROCESSES := DefaultParam .CL2_CLUSTER_OOMS_IGNORED_PROCESSES ""}}
{{$USE_SIMPLE_LATENCY_QUERY := DefaultParam .USE_SIMPLE_LATENCY_QUERY false}}
{{$ENABLE_RESTART_COUNT_CHECK := DefaultParam .ENABLE_RESTART_COUNT_CHECK true}}
{{$RESTART_COUNT_THRESHOLD_OVERRIDES:= DefaultParam .RESTART_COUNT_THRESHOLD_OVERRIDES ""}}
{{$ALLOWED_SLOW_API_CALLS := DefaultParam .CL2_ALLOWED_SLOW_API_CALLS 0}}
{{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT true}}

#Variables
{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}
{{$podsPerNamespace := MultiplyInt $PODS_PER_NODE $NODES_PER_NAMESPACE}}
{{$totalPods := MultiplyInt $podsPerNamespace $namespaces}}
{{$latencyReplicas := DivideInt (MaxInt $MIN_LATENCY_PODS .Nodes) $namespaces}}
{{$totalLatencyPods := MultiplyInt $namespaces $latencyReplicas}}
{{$saturationDeploymentTimeout := DivideFloat $totalPods $DENSITY_TEST_THROUGHPUT | AddInt $MIN_SATURATION_PODS_TIMEOUT}}
# saturationDeploymentHardTimeout must be at least 20m to make sure that ~10m node
# failure won't fail the test. See https://github.com/kubernetes/kubernetes/issues/73461#issuecomment-467338711
# Empirically the scheduler handles roughly 20 pods per second; with 5,000 nodes there
# are 150,000 pods, which takes about 7,500s to schedule, so raise this to 7,500+
{{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 12000}}

{{$saturationDeploymentSpec := DefaultParam .SATURATION_DEPLOYMENT_SPEC "deployment.yaml"}}
{{$latencyDeploymentSpec := DefaultParam .LATENCY_DEPLOYMENT_SPEC "deployment.yaml"}}

# Probe measurements shared parameter
{{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT := DefaultParam .CL2_PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT "15m"}}

name: density
namespace:
  number: {{$namespaces}}
tuningSets:
- name: Uniform5qps
  qpsLoad:
    # create 5 objects per second; in this test an object is a deployment
    qps: 5
# ENABLE_CHAOSMONKEY is false above, i.e. node failures are not simulated
{{if $ENABLE_CHAOSMONKEY}}
chaosMonkey:
  nodeFailure:
    failureRate: 0.01
    interval: 1m
    jitterFactor: 10.0
    simulatedDowntime: 10m
{{end}}
steps:
- name: Starting measurements
  # start monitoring API calls
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  # TODO(oxddr): figure out how many probers to run in function of cluster
  # Per the source code, kubemark clusters do not support InClusterNetworkLatency
  # and DnsLookupLatency, so they are commented out here
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # Not yet clear what TestMetrics is for, so comment it out for now
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: start
  #     resourceConstraints: {{$DENSITY_RESOURCE_CONSTRAINTS_FILE}}
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     clusterOOMsIgnoredProcesses: {{$CLUSTER_OOMS_IGNORED_PROCESSES}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}
- name: Starting saturation pod measurements
  # start monitoring pod startup latency
  measurements:
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = saturation
      threshold: {{$saturationDeploymentTimeout}}s
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = saturation
      operationTimeout: {{$saturationDeploymentHardTimeout}}s
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: start
      labelSelector: group = saturation
# create the saturation pods, i.e. 30*N pods, where N is the number of nodes
- name: Creating saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    # number of objects (deployments) to create per namespace
    replicasPerNamespace: 1
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
      # the parameters below fill the variables in deployment.yaml; per the Variables
      # section above, podsPerNamespace is 3000, i.e. each namespace gets one
      # deployment with 3000 pods
      templateFillMap:
        Replicas: {{$podsPerNamespace}}
        Group: saturation
        CpuRequest: 1m
        MemoryRequest: 10M
# wait until all saturation pods are Running
- name: Waiting for saturation pods to be running
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
- name: Collecting saturation pod measurements
  measurements:
  # collect the startup latency of the saturation pods
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
  # collect the scheduling throughput of the saturation pods, i.e. pods scheduled per
  # second; the measurement fails if it falls below threshold, which defaults to 0
  # above, so it cannot fail here
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: gather
      enableViolations: {{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT}}
      threshold: {{$SCHEDULER_THROUGHPUT_THRESHOLD}}
# After the 30*N pods are created, create another 500 latency pods (the count is set
# by the parameters above) to check whether pods can still be scheduled normally once
# the cluster is already "saturated"
# start monitoring latency pod startup latency
- name: Starting latency pod measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = latency
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = latency
      operationTimeout: 15m
# create the latency pods
- name: Creating latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$latencyReplicas}}
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
      templateFillMap:
        Replicas: 1
        Group: latency
        CpuRequest: {{$LATENCY_POD_CPU}}m
        MemoryRequest: {{$LATENCY_POD_MEMORY}}M
# wait until the latency pods are Running
- name: Waiting for latency pods to be running
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
# delete the latency pods
- name: Deleting latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
# wait until the latency pods are deleted
- name: Waiting for latency pods to be deleted
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
# collect the latency pod startup latency
- name: Collecting pod startup latency
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
# delete the saturation pods
- name: Deleting saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
# wait until the saturation pods are deleted
- name: Waiting for saturation pods to be deleted
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
- name: Collecting measurements
  measurements:
  # APIResponsivenessPrometheusSimple reports API call latency using Histogram-type metrics
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      enableViolations: true
      useSimpleLatencyQuery: true
      summaryName: APIResponsivenessPrometheus_simple
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
  # APIResponsivenessPrometheus reports API call latency using Summary-type metrics,
  # which are more accurate and are usually the ones to go by
  {{if not $USE_SIMPLE_LATENCY_QUERY}}
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
  {{end}}
  # these three are commented out
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: gather
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: gather
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: gather
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}
```
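To make the template arithmetic concrete, here is a quick sanity check of what the Variables block above evaluates to for the 500-node run used later in this article. This is a plain-arithmetic sketch for illustration only, not part of the test:

```
NODES=500; NODES_PER_NAMESPACE=100; PODS_PER_NODE=30
MIN_LATENCY_PODS=500; DENSITY_TEST_THROUGHPUT=20; MIN_SATURATION_PODS_TIMEOUT=180

NAMESPACES=$(( NODES / NODES_PER_NAMESPACE ))                   # 5
PODS_PER_NAMESPACE=$(( PODS_PER_NODE * NODES_PER_NAMESPACE ))   # 3000
TOTAL_SATURATION_PODS=$(( PODS_PER_NAMESPACE * NAMESPACES ))    # 15000
LATENCY_REPLICAS=$(( (MIN_LATENCY_PODS > NODES ? MIN_LATENCY_PODS : NODES) / NAMESPACES ))        # 100
SATURATION_TIMEOUT=$(( TOTAL_SATURATION_PODS / DENSITY_TEST_THROUGHPUT + MIN_SATURATION_PODS_TIMEOUT ))  # 930s

echo "$NAMESPACES namespaces, $TOTAL_SATURATION_PODS saturation pods," \
     "$(( LATENCY_REPLICAS * NAMESPACES )) latency pods, ${SATURATION_TIMEOUT}s saturation timeout"
```

So a 500-node run produces 5 namespaces, 15,000 saturation pods, and 500 latency pods, which matches the numbers referenced in the rest of this article.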
Then, in the same directory, create a deployment.yaml file with the following content (identical to https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/density/deployment.yaml, no changes):

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{.Name}}
  labels:
    group: {{.Group}}
spec:
  replicas: {{.Replicas}}
  selector:
    matchLabels:
      name: {{.Name}}
  template:
    metadata:
      labels:
        name: {{.Name}}
        group: {{.Group}}
    spec:
      containers:
      - image: k8s.gcr.io/pause:3.1
        imagePullPolicy: IfNotPresent
        name: {{.Name}}
        ports:
        resources:
          requests:
            cpu: {{.CpuRequest}}
            memory: {{.MemoryRequest}}
      # Add not-ready/unreachable tolerations for 15 minutes so that node
      # failure doesn't trigger pod deletion.
      tolerations:
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 900
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 900
```

Note that the deployment.yaml above carries no toleration for the master taints, so its pods will not be scheduled onto the kubemark cluster masters, only onto the hollow nodes.

Next, scale the hollow nodes out to 500 and run the following command to start the load test. Note that `--nodes=500` must be specified explicitly; otherwise clusterloader2 also counts the three masters, treats the cluster as 503 nodes, and creates 30*503 pods:

```
$ ./clusterloader2 --testconfig=config.yaml --provider=kubemark \
    --provider-configs=ROOT_KUBECONFIG=./kubemark-kubeconfig \
    --kubeconfig=./kubemark-kubeconfig --v=2 --enable-exec-service=false \
    --enable-prometheus-server=true --tear-down-prometheus-server=false \
    --prometheus-manifest-path /home/docker/clusterloader2/prometheus/manifest \
    --nodes=500 2>&1 | tee output.txt
```
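Optionally, while the test runs you can watch the saturation pods come up from another terminal. The `group=saturation` label comes from the templateFillMap in config.yaml; this is just a convenience sketch, not something clusterloader2 requires:

```
# Count Running saturation pods across all test namespaces every 30s
watch -n 30 'kubectl --kubeconfig=./kubemark-kubeconfig get pods -A -l group=saturation --no-headers | grep -c Running'
```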
### **Test Results**

The command above produces logs like the following. A few points worth noting:

1. Before the test starts, clusterloader2 installs the Prometheus stack in the kubemark cluster, because the two API-call-latency SLIs depend on Prometheus scraping kube-apiserver metrics. By default, clusterloader2 looks for the Prometheus YAML files under `$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests` on the host.

```
I0816 19:56:50.070635 14229 network_performance_measurement.go:87] Registering Network Performance Measurement
I0816 19:56:50.134943 14229 clusterloader.go:157] ClusterConfig.MasterName set to 10.35.20.2
E0816 19:56:50.178965 14229 clusterloader.go:168] Getting master external ip error: didn't find any ExternalIP master IPs
I0816 19:56:50.221833 14229 clusterloader.go:175] ClusterConfig.MasterInternalIP set to [10.35.20.2 10.35.20.3 10.35.20.4]
I0816 19:56:50.221890 14229 clusterloader.go:267] Using config: {ClusterConfig:{KubeConfigPath:./kubemark-kubeconfig RunFromCluster:false Nodes:500 Provider:0xc00069aae0 EtcdCertificatePath:/etc/srv/kubernetes/pki/etcd-apiserver-server.crt EtcdKeyPath:/etc/srv/kubernetes/pki/etcd-apiserver-server.key EtcdInsecurePort:2382 MasterIPs:[] MasterInternalIPs:[10.35.20.2 10.35.20.3 10.35.20.4] MasterName:10.35.20.2 DeleteStaleNamespaces:false DeleteAutomanagedNamespaces:true APIServerPprofByClientEnabled:true KubeletPort:10250 K8SClientsNumber:5} ReportDir: EnableExecService:false ModifierConfig:{OverwriteTestConfig:[] SkipSteps:[]} PrometheusConfig:{EnableServer:true TearDownServer:false ScrapeEtcd:false ScrapeNodeExporter:false ScrapeKubelets:false ScrapeKubeProxy:true ScrapeKubeStateMetrics:false ScrapeMetricsServerMetrics:false ScrapeNodeLocalDNS:false ScrapeAnet:false APIServerScrapePort:6443 SnapshotProject: ManifestPath:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests CoreManifests:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/*.yaml DefaultServiceMonitors:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/default/*.yaml MasterIPServiceMonitors:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/master-ip/*.yaml KubeStateMetricsManifests:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/exporters/kube-state-metrics/*.yaml MetricsServerManifests:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/exporters/metrics-server/*.yaml NodeExporterPod:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/exporters/node_exporter/node-exporter.yaml StorageClassProvisioner:kubernetes.io/gce-pd StorageClassVolumeType:pd-ssd ReadyTimeout:15m0s} OverridePaths:[]}
I0816 19:56:50.267717 14229 cluster.go:74] Listing cluster nodes:
I0816 19:56:50.267730 14229 cluster.go:86] Name: 10.35.20.2, clusterIP: 10.35.20.2, externalIP: , isSchedulable: true
I0816 19:56:50.267733 14229 cluster.go:86] Name: 10.35.20.3, clusterIP: 10.35.20.3, externalIP: , isSchedulable: true
I0816 19:56:50.267735 14229 cluster.go:86] Name: 10.35.20.4, clusterIP: 10.35.20.4, externalIP: , isSchedulable: true
I0816 19:56:50.267737 14229 cluster.go:86] Name: k8s-worker-0, clusterIP: 10.10.221.197, externalIP: , isSchedulable: true
I0816 19:56:50.267739 14229 cluster.go:86] Name: k8s-worker-1, clusterIP: 10.10.242.201, externalIP: , isSchedulable: true
I0816 19:56:50.267741 14229 cluster.go:86] Name: k8s-worker-10, clusterIP: 10.10.221.200, externalIP: , isSchedulable: true
...
...
I0816 19:56:50.317586 14229 framework.go:72] Creating framework with 5 clients and "./kubemark-kubeconfig" kubeconfig.
I0816 19:56:50.328376 14229 framework.go:72] Creating framework with 1 clients and "./kubemark-kubeconfig" kubeconfig.
I0816 19:56:50.329488 14229 prometheus.go:406] Using internal master ips ([10.35.20.2 10.35.20.3 10.35.20.4]) to monitor master's components
I0816 19:56:50.329508 14229 prometheus.go:186] Setting up prometheus stack
I0816 19:56:50.352537 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0alertmanagerConfigCustomResourceDefinition.yaml
I0816 19:56:50.407743 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
I0816 19:56:50.473379 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0podmonitorCustomResourceDefinition.yaml
I0816 19:56:50.487049 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0probeCustomResourceDefinition.yaml
I0816 19:56:50.496292 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0prometheusCustomResourceDefinition.yaml
I0816 19:56:50.582050 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
I0816 19:56:50.603479 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
I0816 19:56:50.621977 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0thanosrulerCustomResourceDefinition.yaml
I0816 19:56:50.677669 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-clusterRole.yaml
I0816 19:56:50.684537 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-clusterRoleBinding.yaml
I0816 19:56:50.690106 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-deployment.yaml
I0816 19:56:50.698910 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-service.yaml
I0816 19:56:50.704062 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-serviceAccount.yaml
I0816 19:56:50.709162 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-serviceMonitor.yaml
I0816 19:56:50.742573 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0ssd-storage-class.yaml
I0816 19:56:50.746934 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-dashboardDatasources.yaml
I0816 19:56:50.753183 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-dashboardDefinitions.yaml
I0816 19:56:50.843207 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-dashboardSources.yaml
I0816 19:56:50.847023 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-deployment.yaml
I0816 19:56:50.854351 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-service.yaml
I0816 19:56:50.864135 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-serviceAccount.yaml
I0816 19:56:50.869317 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-serviceMonitor.yaml
I0816 19:56:50.874331 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-clusterRole.yaml
I0816 19:56:50.878233 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-clusterRoleBinding.yaml
I0816 19:56:50.882729 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-prometheus.yaml
I0816 19:56:50.960848 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-roleBindingConfig.yaml
I0816 19:56:50.965558 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-roleConfig.yaml
I0816 19:56:50.969571 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-rules.yaml
I0816 19:56:51.002266 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-service.yaml
I0816 19:56:51.011344 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-serviceAccount.yaml
I0816 19:56:51.015258 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-serviceMonitor.yaml
I0816 19:56:51.019967 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-windows-scrape-configs.yaml
I0816 19:56:51.025094 14229 prometheus.go:294] Exposing kube-apiserver metrics in the cluster
I0816 19:56:51.047622 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/master-ip/master-endpoints.yaml
I0816 19:56:51.054637 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/master-ip/master-service.yaml
I0816 19:56:51.061353 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/master-ip/master-serviceMonitor.yaml
I0816 19:56:51.067330 14229 prometheus.go:341] Waiting for Prometheus stack to become healthy...
I0816 19:57:21.089535 14229 util.go:104] All 11 expected targets are ready
I0816 19:57:21.089554 14229 prometheus.go:238] Prometheus stack set up successfully
W0816 19:57:21.089586 14229 imagepreload.go:87] No images specified. Skipping image preloading
I0816 19:57:21.095293 14229 clusterloader.go:408] Test config successfully dumped to: generatedConfig_density.yaml
I0816 19:57:21.095368 14229 clusterloader.go:221] --------------------------------------------------------------------------------
I0816 19:57:21.095378 14229 clusterloader.go:222] Running config.yaml
I0816 19:57:21.095386 14229 clusterloader.go:223] --------------------------------------------------------------------------------
```

2. When the run completes, detailed metric data is printed. Searching for the keyword `SchedulingThroughput` shows the following scheduling throughput (corresponding to the measurement with Identifier `SchedulingThroughput` in config.yaml):

```
I0816 20:26:20.733126 14229 simple_test_executor.go:83] SchedulingThroughput: {
  "perc50": 20,
  "perc90": 20,
  "perc99": 20.2,
  "max": 24
}
```

3. Searching for the keyword `pod_startup` shows the startup latency reported as `PodStartupLatency_SaturationPodStartupLatency`, along with `StatelessPodStartupLatency_SaturationPodStartupLatency`. Since all pods created here are stateless, those two metrics report identical data. There is also `StatefulPodStartupLatency_SaturationPodStartupLatency`, but as there are no stateful pods its values are 0. (These correspond to the measurement with Identifier `SaturationPodStartupLatency` in config.yaml.)

```
I0816 20:26:20.733129 14229 simple_test_executor.go:83] PodStartupLatency_SaturationPodStartupLatency: {
  "version": "1.0",
  "dataItems": [
    ...
    {
      "data": {
        "Perc50": 1323.250478,
        "Perc90": 1878.221624,
        "Perc99": 2184.178124
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
    ...
I0816 20:26:20.733137 14229 simple_test_executor.go:83] StatelessPodStartupLatency_SaturationPodStartupLatency: {
  "version": "1.0",
  "dataItems": [
    ...
    {
      "data": {
        "Perc50": 1323.250478,
        "Perc90": 1878.221624,
        "Perc99": 2184.178124
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
    ...
I0816 20:26:20.733142 14229 simple_test_executor.go:83] StatefulPodStartupLatency_SaturationPodStartupLatency: {
  "version": "1.0",
  "dataItems": [
    ...
    {
      "data": {
        "Perc50": 0,
        "Perc90": 0,
        "Perc99": 0
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
    ...
```

In addition, the startup latency of the latency pods is reported as well (corresponding to the measurement with Identifier `PodStartupLatency` in config.yaml):

```
I0816 20:26:20.733148 14229 simple_test_executor.go:83] PodStartupLatency_PodStartupLatency: {
  "version": "1.0",
  "dataItems": [
    ...
    {
      "data": {
        "Perc50": 1350.344608,
        "Perc90": 1943.066452,
        "Perc99": 2169.727106
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    }
    ...
I0816 20:26:20.733152 14229 simple_test_executor.go:83] StatelessPodStartupLatency_PodStartupLatency: {
  "version": "1.0",
  "dataItems": [
    ...
    {
      "data": {
        "Perc50": 1350.344608,
        "Perc90": 1943.066452,
        "Perc99": 2169.727106
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
    ...
I0816 20:26:20.733156 14229 simple_test_executor.go:83] StatefulPodStartupLatency_PodStartupLatency: {
  "version": "1.0",
  "dataItems": [
    ...
    {
      "data": {
        "Perc50": 0,
        "Perc90": 0,
        "Perc99": 0
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
    ...
```
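Because the run was captured with `tee output.txt`, these summaries can be located again later without scrolling through the console. A small convenience sketch:

```
# Jump straight to the scheduling throughput and pod startup summaries in the captured log
grep -n 'SchedulingThroughput\|PodStartupLatency_' output.txt
```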
4. Finally, the API call latency results are printed, listing a result for every (resource, verb) pair. These correspond to the measurement with Identifier `APIResponsivenessPrometheus` in config.yaml.

```
I0816 20:26:20.733160 14229 simple_test_executor.go:83] APIResponsivenessPrometheus: {
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "Perc50": 500,
        "Perc90": 580,
        "Perc99": 598
      },
      "unit": "ms",
      "labels": {
        "Count": "3",
        "Resource": "events",
        "Scope": "cluster",
        "SlowCount": "0",
        "Subresource": "",
        "Verb": "LIST"
      }
    },
    {
      "data": {
        "Perc50": 26.017594,
        "Perc90": 46.83167,
        "Perc99": 167.899999
      },
      "unit": "ms",
      "labels": {
        "Count": "1008",
        "Resource": "pods",
        "Scope": "namespace",
        "SlowCount": "0",
        "Subresource": "",
        "Verb": "LIST"
      }
    },
    ...
```

The results of the `APIResponsivenessPrometheus_simple` measurement are also available:

```
I0816 20:26:20.735165 14229 simple_test_executor.go:83] APIResponsivenessPrometheus_simple: {
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "Perc50": 450,
        "Perc90": 570,
        "Perc99": 597
      },
      "unit": "ms",
      "labels": {
        "Count": "3",
        "Resource": "events",
        "Scope": "cluster",
        "SlowCount": "0",
        "Subresource": "",
        "Verb": "LIST"
      }
    },
    {
      "data": {
        "Perc50": 33.333333,
        "Perc90": 80,
        "Perc99": 98
      },
    ...
```

As you can see, every measurement in config.yaml prints detailed information. At the end, clusterloader2 evaluates each measurement; if all of them pass, the last few lines report Success, as below. (To understand how each measurement decides pass or fail, you need to read the clusterloader2 source code.)

```
I0816 20:26:20.743052 14229 simple_test_executor.go:98]
I0816 20:26:35.762300 14229 simple_test_executor.go:395] Resources cleanup time: 15.019234743s
I0816 20:26:35.762329 14229 clusterloader.go:231] --------------------------------------------------------------------------------
I0816 20:26:35.762337 14229 clusterloader.go:232] Test Finished
I0816 20:26:35.762344 14229 clusterloader.go:233] Test: config.yaml
I0816 20:26:35.762350 14229 clusterloader.go:234] Status: Success
I0816 20:26:35.762356 14229 clusterloader.go:238] --------------------------------------------------------------------------------
```

Finally, coming back to the three SLIs introduced earlier: how are they realized in this test? The first two, the API call latencies, are implemented by the `APIResponsivenessPrometheus` method; the third, pod startup latency, is implemented by the `PodStartupLatency` method.

### **Grafana**

The command above automatically installs Prometheus, Grafana, and related components, and four dashboards come preconfigured in Grafana:

* DNS: DNS latency metrics
* Master dashboard: metrics for the kubemark cluster masters
* Network: network-related metrics
* SLO: API call latency metrics against their thresholds

![](https://img.kancloud.cn/e5/38/e538c8387267208279b7d41a13abc678_2930x1035.png)

Looking at the SLO dashboard below, we can see the API call latencies together with their thresholds. In the mutating API call latency panel, one series has already exceeded its threshold while the others have not. So why did our test case still pass? For the details, see the "APIResponsivenessPrometheus" article under the Measurement section.

![](https://img.kancloud.cn/b4/a6/b4a68a94f3a870a559b200fd3ff82e12_3839x1679.png)

![](https://img.kancloud.cn/65/44/6544e5ea69c829222bd6b26df8e1afe8_3838x1298.png)
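Since we passed `--tear-down-prometheus-server=false`, the monitoring stack stays up after the run and these dashboards can be reached with a port-forward from the clusterloader2 host. The Prometheus service name and port are confirmed by the prometheus-service.yaml shown earlier; the Grafana service name and port are assumed from the default kube-prometheus manifests, so verify them with `kubectl -n monitoring get svc`:

```
# Prometheus: service name/port as in prometheus-service.yaml above
kubectl --kubeconfig=./kubemark-kubeconfig -n monitoring port-forward svc/prometheus-k8s 9090:9090 &

# Grafana: assumed service name/port from the default kube-prometheus manifests
kubectl --kubeconfig=./kubemark-kubeconfig -n monitoring port-forward svc/grafana 3000:3000 &
```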
### **Step Timings**

After the clusterloader2 command finishes, it generates a junit.xml file in the working directory that records how long each step of the test took:

```
$ cat junit.xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="1754.772">
  <testcase name="density overall (config.yaml)" classname="ClusterLoaderV2" time="1754.767000937"></testcase>
  <testcase name="density: [step: 01] Starting measurements" classname="ClusterLoaderV2" time="0.000106709"></testcase>
  <testcase name="density: [step: 02] Starting saturation pod measurements" classname="ClusterLoaderV2" time="0.100424074"></testcase>
  <testcase name="density: [step: 03] Creating saturation pods" classname="ClusterLoaderV2" time="1.007091926"></testcase>
  <testcase name="density: [step: 04] Waiting for saturation pods to be running" classname="ClusterLoaderV2" time="758.146062481"></testcase>
  <testcase name="density: [step: 05] Collecting saturation pod measurements" classname="ClusterLoaderV2" time="0.596172704"></testcase>
  <testcase name="density: [step: 06] Starting latency pod measurements" classname="ClusterLoaderV2" time="0.101019002"></testcase>
  <testcase name="density: [step: 07] Creating latency pods" classname="ClusterLoaderV2" time="100.284284775"></testcase>
  <testcase name="density: [step: 08] Waiting for latency pods to be running" classname="ClusterLoaderV2" time="19.857551787"></testcase>
  <testcase name="density: [step: 09] Deleting latency pods" classname="ClusterLoaderV2" time="100.305020092"></testcase>
  <testcase name="density: [step: 10] Waiting for latency pods to be deleted" classname="ClusterLoaderV2" time="5.0046168"></testcase>
  <testcase name="density: [step: 11] Collecting pod startup latency" classname="ClusterLoaderV2" time="0.535112677"></testcase>
  <testcase name="density: [step: 12] Deleting saturation pods" classname="ClusterLoaderV2" time="1.00338642"></testcase>
  <testcase name="density: [step: 13] Waiting for saturation pods to be deleted" classname="ClusterLoaderV2" time="751.552869757"></testcase>
  <testcase name="density: [step: 14] Collecting measurements" classname="ClusterLoaderV2" time="1.157418612"></testcase>
</testsuite>
```
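A small convenience sketch for reading this file: list each step with its duration, longest first. It assumes GNU grep/sed and that junit.xml sits in the current directory:

```
# Extract "<duration>  <step name>" pairs from junit.xml and sort by duration, descending
grep -o '<testcase name="[^"]*"[^>]*time="[^"]*"' junit.xml \
  | sed -E 's/<testcase name="([^"]+)".*time="([^"]+)"/\2  \1/' \
  | sort -rn
```

For this run, the two waiting steps (04 and 13) dominate, which is expected: they cover scheduling and tearing down the 15,000 saturation pods at roughly 20 pods per second.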