[TOC]
### **Copying and Modifying the Prometheus Manifests**
Copy the Prometheus [manifests directory](https://github.com/kubernetes/perf-tests/tree/release-1.23/clusterloader2/pkg/prometheus/manifests) to a directory of your choice on the clusterloader2 host; in this article we assume it is placed under `/home/docker/clusterloader2/prometheus/`.
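For example, the manifests can be fetched and copied roughly like this (a sketch; the clone location is arbitrary, and the target directory only has to match what is later passed to `--prometheus-manifest-path`):
```
# Fetch the release-1.23 branch of perf-tests and copy the prometheus manifests
git clone --depth 1 --branch release-1.23 https://github.com/kubernetes/perf-tests.git
mkdir -p /home/docker/clusterloader2/prometheus/
cp -r perf-tests/clusterloader2/pkg/prometheus/manifests /home/docker/clusterloader2/prometheus/
```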
We then need to modify a few of the files:
* 0ssd-storage-class.yaml
This file uses a GCE persistent disk; we replace it with a local PV by changing the content to the following:
```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ssd
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```
* 0pv.yaml
Create a new file `0pv.yaml` under the `manifests/` directory with the following content. Note that `<master-name>` below must be replaced with the name of an actual master node in the cluster (you can look it up with `kubectl get node`).
Then, on that master node, manually create the directory `/prometheus/data` (see the commands after the manifest below).
```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-data
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ssd
  local:
    path: /prometheus/data
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - <master-name>
```
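The two manual steps above can be done roughly like this (a sketch; `<master-ip>` and the SSH user are placeholders for your own environment):
```
# Pick a master node name to substitute for <master-name> in 0pv.yaml
kubectl get node

# On that master, create the directory backing the local PV
ssh root@<master-ip> "mkdir -p /prometheus/data"
```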
* prometheus-service.yaml
Modify this file by commenting out the `app: prometheus` line:
```
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
  selector:
    # app: prometheus  # this line is commented out
    prometheus: k8s
  sessionAffinity: ClientIP
```
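Once clusterloader2 has applied these manifests during the run described below, you can check that the Prometheus pod landed on the chosen master and that its PVC bound to the local PV. A quick check, assuming the `monitoring` namespace used by these manifests:
```
kubectl -n monitoring get pods -o wide
kubectl -n monitoring get pvc,pv
```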
### **Test Procedure**
We run the test from 10.35.20.1. Copy the clusterloader2 binary onto this host, then copy the kubemark cluster's admin.conf into the same directory and rename it to kubemark-kubeconfig.
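A sketch of this preparation (paths are placeholders; `/etc/kubernetes/admin.conf` is the usual kubeadm location on a kubemark master such as 10.35.20.2):
```
# Put the binary and kubeconfig side by side in the working directory
cp /path/to/clusterloader2 .
scp root@10.35.20.2:/etc/kubernetes/admin.conf ./kubemark-kubeconfig
```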
Next, create a config.yaml file in the same directory with the following content (this file is based on https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/density/config.yaml, with some measurements commented out and some parameter values changed):
```
# ASSUMPTIONS:
# - Underlying cluster should have 100+ nodes.
# - Number of nodes should be divisible by NODES_PER_NAMESPACE (default 100).
#Constants
{{$DENSITY_RESOURCE_CONSTRAINTS_FILE := DefaultParam .DENSITY_RESOURCE_CONSTRAINTS_FILE ""}}
# Cater for the case where the number of nodes is less than nodes per namespace. See https://github.com/kubernetes/perf-tests/issues/887
# 100 nodes per namespace and 30 pods per node, so each namespace holds 3000 pods
{{$NODES_PER_NAMESPACE := MinInt .Nodes (DefaultParam .NODES_PER_NAMESPACE 100)}}
{{$PODS_PER_NODE := DefaultParam .PODS_PER_NODE 30}}
{{$DENSITY_TEST_THROUGHPUT := DefaultParam .DENSITY_TEST_THROUGHPUT 20}}
{{$SCHEDULER_THROUGHPUT_THRESHOLD := DefaultParam .CL2_SCHEDULER_THROUGHPUT_THRESHOLD 0}}
# LATENCY_POD_MEMORY and LATENCY_POD_CPU are calculated for 1-core 4GB node.
# Increasing allocation of both memory and cpu by 10%
# decreases the value of priority function in scheduler by one point.
# This results in decreased probability of choosing the same node again.
{{$LATENCY_POD_CPU := DefaultParam .LATENCY_POD_CPU 100}}
{{$LATENCY_POD_MEMORY := DefaultParam .LATENCY_POD_MEMORY 350}}
{{$MIN_LATENCY_PODS := DefaultParam .MIN_LATENCY_PODS 500}}
{{$MIN_SATURATION_PODS_TIMEOUT := 180}}
{{$ENABLE_CHAOSMONKEY := DefaultParam .ENABLE_CHAOSMONKEY false}}
{{$ENABLE_SYSTEM_POD_METRICS:= DefaultParam .ENABLE_SYSTEM_POD_METRICS true}}
{{$ENABLE_CLUSTER_OOMS_TRACKER := DefaultParam .CL2_ENABLE_CLUSTER_OOMS_TRACKER true}}
{{$CLUSTER_OOMS_IGNORED_PROCESSES := DefaultParam .CL2_CLUSTER_OOMS_IGNORED_PROCESSES ""}}
{{$USE_SIMPLE_LATENCY_QUERY := DefaultParam .USE_SIMPLE_LATENCY_QUERY false}}
{{$ENABLE_RESTART_COUNT_CHECK := DefaultParam .ENABLE_RESTART_COUNT_CHECK true}}
{{$RESTART_COUNT_THRESHOLD_OVERRIDES:= DefaultParam .RESTART_COUNT_THRESHOLD_OVERRIDES ""}}
{{$ALLOWED_SLOW_API_CALLS := DefaultParam .CL2_ALLOWED_SLOW_API_CALLS 0}}
{{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT true}}
#Variables
{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}
{{$podsPerNamespace := MultiplyInt $PODS_PER_NODE $NODES_PER_NAMESPACE}}
{{$totalPods := MultiplyInt $podsPerNamespace $namespaces}}
{{$latencyReplicas := DivideInt (MaxInt $MIN_LATENCY_PODS .Nodes) $namespaces}}
{{$totalLatencyPods := MultiplyInt $namespaces $latencyReplicas}}
{{$saturationDeploymentTimeout := DivideFloat $totalPods $DENSITY_TEST_THROUGHPUT | AddInt $MIN_SATURATION_PODS_TIMEOUT}}
# saturationDeploymentHardTimeout must be at least 20m to make sure that ~10m node
# failure won't fail the test. See https://github.com/kubernetes/kubernetes/issues/73461#issuecomment-467338711
# Empirically about 20 pods can be scheduled per second; with 5000 nodes there are 150000 pods, which takes 7500 seconds to schedule, so this value needs to be raised above 7500
{{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 12000}}
{{$saturationDeploymentSpec := DefaultParam .SATURATION_DEPLOYMENT_SPEC "deployment.yaml"}}
{{$latencyDeploymentSpec := DefaultParam .LATENCY_DEPLOYMENT_SPEC "deployment.yaml"}}
# Probe measurements shared parameter
{{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT := DefaultParam .CL2_PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT "15m"}}
name: density
namespace:
  number: {{$namespaces}}
tuningSets:
- name: Uniform5qps
  qpsLoad:
    # Create 5 objects per second; in this article an object is a Deployment
    qps: 5
# This parameter is false above, i.e. node failures are not simulated
{{if $ENABLE_CHAOSMONKEY}}
chaosMonkey:
  nodeFailure:
    failureRate: 0.01
    interval: 1m
    jitterFactor: 10.0
    simulatedDowntime: 10m
{{end}}
steps:
- name: Starting measurements
  # Start monitoring API calls
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  # TODO(oxddr): figure out how many probers to run in function of cluster
  # According to the source code, a kubemark cluster does not support InClusterNetworkLatency and DnsLookupLatency, so they are commented out
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # It is not yet clear what TestMetrics is used for, so it is commented out for now
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: start
  #     resourceConstraints: {{$DENSITY_RESOURCE_CONSTRAINTS_FILE}}
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     clusterOOMsIgnoredProcesses: {{$CLUSTER_OOMS_IGNORED_PROCESSES}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}
- name: Starting saturation pod measurements
  # Start monitoring pod startup latency
  measurements:
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = saturation
      threshold: {{$saturationDeploymentTimeout}}s
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = saturation
      operationTimeout: {{$saturationDeploymentHardTimeout}}s
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: start
      labelSelector: group = saturation
# Start creating the saturation pods, i.e. 30*N pods, where N is the number of nodes
- name: Creating saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    # How many objects (i.e. Deployments) to create per namespace
    replicasPerNamespace: 1
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
      # The parameters below fill the variables in deployment.yaml; given the variables above, podsPerNamespace is 3000, i.e. each namespace gets one Deployment with 3000 pods
      templateFillMap:
        Replicas: {{$podsPerNamespace}}
        Group: saturation
        CpuRequest: 1m
        MemoryRequest: 10M
# Wait until all saturation pods are Running
- name: Waiting for saturation pods to be running
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
- name: Collecting saturation pod measurements
  measurements:
  # Collect the startup latency of the saturation pods
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
  # Collect the scheduling throughput of the saturation pods, i.e. how many pods are scheduled per second; if it is below the threshold, this measurement fails. The threshold defaults to 0 above, so it will not fail.
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: gather
      enableViolations: {{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT}}
      threshold: {{$SCHEDULER_THROUGHPUT_THRESHOLD}}
# After the 30*N saturation pods have been created, create another 500 latency pods (the exact count is determined by the parameters above) to check whether pods can still be scheduled normally once the cluster is already "saturated"
# Start monitoring the startup latency of the latency pods
- name: Starting latency pod measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = latency
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = latency
      operationTimeout: 15m
# Create the latency pods
- name: Creating latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$latencyReplicas}}
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
      templateFillMap:
        Replicas: 1
        Group: latency
        CpuRequest: {{$LATENCY_POD_CPU}}m
        MemoryRequest: {{$LATENCY_POD_MEMORY}}M
# Wait until the latency pods are Running
- name: Waiting for latency pods to be running
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
# Delete the latency pods
- name: Deleting latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
# Wait until the latency pods are deleted
- name: Waiting for latency pods to be deleted
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
# Collect the startup latency of the latency pods
- name: Collecting pod startup latency
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
# Delete the saturation pods
- name: Deleting saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
# Wait until the saturation pods are deleted
- name: Waiting for saturation pods to be deleted
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
- name: Collecting measurements
  measurements:
  # APIResponsivenessPrometheusSimple measures API call latency using Histogram-type metrics
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      enableViolations: true
      useSimpleLatencyQuery: true
      summaryName: APIResponsivenessPrometheus_simple
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
  # APIResponsivenessPrometheus measures API call latency using Summary-type metrics, which are more accurate, so this one is usually taken as authoritative
{{if not $USE_SIMPLE_LATENCY_QUERY}}
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
{{end}}
  # The following three are commented out
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: gather
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: gather
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: gather
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}
```
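To make the template arithmetic concrete for the 500-node run used below, here is a small illustrative sketch that mirrors the variable definitions above (the numbers follow from the defaults; the script itself is not part of the test):
```
# Reproduce the config.yaml variable arithmetic for Nodes=500 (illustrative only)
NODES=500; NODES_PER_NAMESPACE=100; PODS_PER_NODE=30; MIN_LATENCY_PODS=500
namespaces=$((NODES / NODES_PER_NAMESPACE))                 # 5
podsPerNamespace=$((PODS_PER_NODE * NODES_PER_NAMESPACE))   # 3000
totalPods=$((podsPerNamespace * namespaces))                # 15000 saturation pods
latencyReplicas=$(( (MIN_LATENCY_PODS > NODES ? MIN_LATENCY_PODS : NODES) / namespaces ))  # 100
totalLatencyPods=$((namespaces * latencyReplicas))          # 500 latency pods
echo "$namespaces namespaces, $totalPods saturation pods, $totalLatencyPods latency pods"
```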
Then, create a deployment.yaml file in the same directory with the following content (identical to https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/density/deployment.yaml, with no changes):
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{.Name}}
  labels:
    group: {{.Group}}
spec:
  replicas: {{.Replicas}}
  selector:
    matchLabels:
      name: {{.Name}}
  template:
    metadata:
      labels:
        name: {{.Name}}
        group: {{.Group}}
    spec:
      containers:
      - image: k8s.gcr.io/pause:3.1
        imagePullPolicy: IfNotPresent
        name: {{.Name}}
        ports:
        resources:
          requests:
            cpu: {{.CpuRequest}}
            memory: {{.MemoryRequest}}
      # Add not-ready/unreachable tolerations for 15 minutes so that node
      # failure doesn't trigger pod deletion.
      tolerations:
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 900
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 900
```
Note that the deployment.yaml above has no toleration for the master taint, so these pods will not be scheduled onto the kubemark cluster's masters, only onto the hollow (virtual) nodes.
Next, we scale the hollow nodes out to 500 and run the following command to start the load test (note that `--nodes=500` must be specified explicitly; otherwise clusterloader2 also counts the three masters, treats the cluster as having 503 nodes, and creates 30*503 pods):
```
$ ./clusterloader2 --testconfig=config.yaml --provider=kubemark --provider-configs=ROOT_KUBECONFIG=./kubemark-kubeconfig --kubeconfig=./kubemark-kubeconfig --v=2 --enable-exec-service=false \
--enable-prometheus-server=true --tear-down-prometheus-server=false --prometheus-manifest-path /home/docker/clusterloader2/prometheus/manifest --nodes=500 2>&1 | tee output.txt
```
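Once the hollow nodes are scaled up, it is worth double-checking the node count that clusterloader2 will see; with 500 hollow nodes plus 3 masters the total is 503, which is exactly why `--nodes=500` is passed explicitly. A quick check:
```
# Should print 503 (500 hollow nodes + 3 masters); clusterloader2 is told to use only 500
kubectl --kubeconfig=./kubemark-kubeconfig get nodes --no-headers | wc -l
```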
### **Test Results**
The command above produces logs like the following. A few points to note:
1. Before the test starts, a Prometheus stack is installed in the kubemark cluster, because the two API call latency SLIs rely on Prometheus scraping kube-apiserver metrics. By default, clusterloader2 looks for the Prometheus YAML files under `$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests` on the host.
```
I0816 19:56:50.070635 14229 network_performance_measurement.go:87] Registering Network Performance Measurement
I0816 19:56:50.134943 14229 clusterloader.go:157] ClusterConfig.MasterName set to 10.35.20.2
E0816 19:56:50.178965 14229 clusterloader.go:168] Getting master external ip error: didn't find any ExternalIP master IPs
I0816 19:56:50.221833 14229 clusterloader.go:175] ClusterConfig.MasterInternalIP set to [10.35.20.2 10.35.20.3 10.35.20.4]
I0816 19:56:50.221890 14229 clusterloader.go:267] Using config: {ClusterConfig:{KubeConfigPath:./kubemark-kubeconfig RunFromCluster:false Nodes:500 Provider:0xc00069aae0 EtcdCertificatePath:/etc/srv/kubernetes/pki/etcd-apiserver-server.crt EtcdKeyPath:/etc/srv/kubernetes/pki/etcd-apiserver-server.key EtcdInsecurePort:2382 MasterIPs:[] MasterInternalIPs:[10.35.20.2 10.35.20.3 10.35.20.4] MasterName:10.35.20.2 DeleteStaleNamespaces:false DeleteAutomanagedNamespaces:true APIServerPprofByClientEnabled:true KubeletPort:10250 K8SClientsNumber:5} ReportDir: EnableExecService:false ModifierConfig:{OverwriteTestConfig:[] SkipSteps:[]} PrometheusConfig:{EnableServer:true TearDownServer:false ScrapeEtcd:false ScrapeNodeExporter:false ScrapeKubelets:false ScrapeKubeProxy:true ScrapeKubeStateMetrics:false ScrapeMetricsServerMetrics:false ScrapeNodeLocalDNS:false ScrapeAnet:false APIServerScrapePort:6443 SnapshotProject: ManifestPath:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests CoreManifests:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/*.yaml DefaultServiceMonitors:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/default/*.yaml MasterIPServiceMonitors:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/master-ip/*.yaml KubeStateMetricsManifests:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/exporters/kube-state-metrics/*.yaml MetricsServerManifests:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/exporters/metrics-server/*.yaml NodeExporterPod:$GOPATH/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/exporters/node_exporter/node-exporter.yaml StorageClassProvisioner:kubernetes.io/gce-pd StorageClassVolumeType:pd-ssd ReadyTimeout:15m0s} OverridePaths:[]}
I0816 19:56:50.267717 14229 cluster.go:74] Listing cluster nodes:
I0816 19:56:50.267730 14229 cluster.go:86] Name: 10.35.20.2, clusterIP: 10.35.20.2, externalIP: , isSchedulable: true
I0816 19:56:50.267733 14229 cluster.go:86] Name: 10.35.20.3, clusterIP: 10.35.20.3, externalIP: , isSchedulable: true
I0816 19:56:50.267735 14229 cluster.go:86] Name: 10.35.20.4, clusterIP: 10.35.20.4, externalIP: , isSchedulable: true
I0816 19:56:50.267737 14229 cluster.go:86] Name: k8s-worker-0, clusterIP: 10.10.221.197, externalIP: , isSchedulable: true
I0816 19:56:50.267739 14229 cluster.go:86] Name: k8s-worker-1, clusterIP: 10.10.242.201, externalIP: , isSchedulable: true
I0816 19:56:50.267741 14229 cluster.go:86] Name: k8s-worker-10, clusterIP: 10.10.221.200, externalIP: , isSchedulable: true
...
...
I0816 19:56:50.317586 14229 framework.go:72] Creating framework with 5 clients and "./kubemark-kubeconfig" kubeconfig.
I0816 19:56:50.328376 14229 framework.go:72] Creating framework with 1 clients and "./kubemark-kubeconfig" kubeconfig.
I0816 19:56:50.329488 14229 prometheus.go:406] Using internal master ips ([10.35.20.2 10.35.20.3 10.35.20.4]) to monitor master's components
I0816 19:56:50.329508 14229 prometheus.go:186] Setting up prometheus stack
I0816 19:56:50.352537 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0alertmanagerConfigCustomResourceDefinition.yaml
I0816 19:56:50.407743 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
I0816 19:56:50.473379 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0podmonitorCustomResourceDefinition.yaml
I0816 19:56:50.487049 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0probeCustomResourceDefinition.yaml
I0816 19:56:50.496292 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0prometheusCustomResourceDefinition.yaml
I0816 19:56:50.582050 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
I0816 19:56:50.603479 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
I0816 19:56:50.621977 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-0thanosrulerCustomResourceDefinition.yaml
I0816 19:56:50.677669 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-clusterRole.yaml
I0816 19:56:50.684537 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-clusterRoleBinding.yaml
I0816 19:56:50.690106 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-deployment.yaml
I0816 19:56:50.698910 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-service.yaml
I0816 19:56:50.704062 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-serviceAccount.yaml
I0816 19:56:50.709162 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0prometheus-operator-serviceMonitor.yaml
I0816 19:56:50.742573 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/0ssd-storage-class.yaml
I0816 19:56:50.746934 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-dashboardDatasources.yaml
I0816 19:56:50.753183 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-dashboardDefinitions.yaml
I0816 19:56:50.843207 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-dashboardSources.yaml
I0816 19:56:50.847023 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-deployment.yaml
I0816 19:56:50.854351 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-service.yaml
I0816 19:56:50.864135 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-serviceAccount.yaml
I0816 19:56:50.869317 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/grafana-serviceMonitor.yaml
I0816 19:56:50.874331 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-clusterRole.yaml
I0816 19:56:50.878233 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-clusterRoleBinding.yaml
I0816 19:56:50.882729 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-prometheus.yaml
I0816 19:56:50.960848 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-roleBindingConfig.yaml
I0816 19:56:50.965558 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-roleConfig.yaml
I0816 19:56:50.969571 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-rules.yaml
I0816 19:56:51.002266 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-service.yaml
I0816 19:56:51.011344 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-serviceAccount.yaml
I0816 19:56:51.015258 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-serviceMonitor.yaml
I0816 19:56:51.019967 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/prometheus-windows-scrape-configs.yaml
I0816 19:56:51.025094 14229 prometheus.go:294] Exposing kube-apiserver metrics in the cluster
I0816 19:56:51.047622 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/master-ip/master-endpoints.yaml
I0816 19:56:51.054637 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/master-ip/master-service.yaml
I0816 19:56:51.061353 14229 framework.go:274] Applying /home/docker/gopath/src/k8s.io/perf-tests/clusterloader2/pkg/prometheus/manifests/master-ip/master-serviceMonitor.yaml
I0816 19:56:51.067330 14229 prometheus.go:341] Waiting for Prometheus stack to become healthy...
I0816 19:57:21.089535 14229 util.go:104] All 11 expected targets are ready
I0816 19:57:21.089554 14229 prometheus.go:238] Prometheus stack set up successfully
W0816 19:57:21.089586 14229 imagepreload.go:87] No images specified. Skipping image preloading
I0816 19:57:21.095293 14229 clusterloader.go:408] Test config successfully dumped to: generatedConfig_density.yaml
I0816 19:57:21.095368 14229 clusterloader.go:221] --------------------------------------------------------------------------------
I0816 19:57:21.095378 14229 clusterloader.go:222] Running config.yaml
I0816 19:57:21.095386 14229 clusterloader.go:223] --------------------------------------------------------------------------------
```
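Since the full log is captured in output.txt, the per-measurement summaries discussed in the following points can be located quickly with grep (a sketch):
```
# Jump to the measurement summaries in the captured output
grep -nE "SchedulingThroughput|pod_startup|APIResponsivenessPrometheus" output.txt | head
```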
2. After the run finishes, detailed metric data is printed. Searching for the keyword `SchedulingThroughput` shows the scheduling throughput below (corresponding to the measurement with Identifier `SchedulingThroughput` in config.yaml):
```
I0816 20:26:20.733126 14229 simple_test_executor.go:83] SchedulingThroughput: {
"perc50": 20,
"perc90": 20,
"perc99": 20.2,
"max": 24
}
```
3. Searching for the keyword `pod_startup` finds the startup latency reported as `PodStartupLatency_SaturationPodStartupLatency`, as well as `StatelessPodStartupLatency_SaturationPodStartupLatency`. Since all pods created here are stateless, these two metrics report the same data. There is also `StatefulPodStartupLatency_SaturationPodStartupLatency`, but since there are no stateful pods its values are 0. (These correspond to the measurement with Identifier `SaturationPodStartupLatency` in config.yaml.)
```
I0816 20:26:20.733129 14229 simple_test_executor.go:83] PodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1323.250478,
"Perc90": 1878.221624,
"Perc99": 2184.178124
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733137 14229 simple_test_executor.go:83] StatelessPodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1323.250478,
"Perc90": 1878.221624,
"Perc99": 2184.178124
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733142 14229 simple_test_executor.go:83] StatefulPodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 0,
"Perc90": 0,
"Perc99": 0
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
```
We can also see the startup latency of the latency pods, as below (corresponding to the measurement with Identifier `PodStartupLatency` in config.yaml):
```
I0816 20:26:20.733148 14229 simple_test_executor.go:83] PodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1350.344608,
"Perc90": 1943.066452,
"Perc99": 2169.727106
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
}
...
I0816 20:26:20.733152 14229 simple_test_executor.go:83] StatelessPodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1350.344608,
"Perc90": 1943.066452,
"Perc99": 2169.727106
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733156 14229 simple_test_executor.go:83] StatefulPodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 0,
"Perc90": 0,
"Perc99": 0
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
```
4. We can also see the API call latency results, listed for every (resource, verb) pair. This corresponds to the measurement with Identifier `APIResponsivenessPrometheus` in config.yaml.
```
I0816 20:26:20.733160 14229 simple_test_executor.go:83] APIResponsivenessPrometheus: {
"version": "v1",
"dataItems": [
{
"data": {
"Perc50": 500,
"Perc90": 580,
"Perc99": 598
},
"unit": "ms",
"labels": {
"Count": "3",
"Resource": "events",
"Scope": "cluster",
"SlowCount": "0",
"Subresource": "",
"Verb": "LIST"
}
},
{
"data": {
"Perc50": 26.017594,
"Perc90": 46.83167,
"Perc99": 167.899999
},
"unit": "ms",
"labels": {
"Count": "1008",
"Resource": "pods",
"Scope": "namespace",
"SlowCount": "0",
"Subresource": "",
"Verb": "LIST"
}
},
...
```
The results of the `APIResponsivenessPrometheus_simple` measurement are also shown:
```
I0816 20:26:20.735165 14229 simple_test_executor.go:83] APIResponsivenessPrometheus_simple: {
"version": "v1",
"dataItems": [
{
"data": {
"Perc50": 450,
"Perc90": 570,
"Perc99": 597
},
"unit": "ms",
"labels": {
"Count": "3",
"Resource": "events",
"Scope": "cluster",
"SlowCount": "0",
"Subresource": "",
"Verb": "LIST"
}
},
{
"data": {
"Perc50": 33.333333,
"Perc90": 80,
"Perc99": 98
},
...
```
As you can see, every measurement in config.yaml prints detailed information. At the end, clusterloader2 evaluates each measurement; if all of them pass, the last few lines report Success as below (to understand how each measurement decides success or failure, you need to read the clusterloader2 source code):
```
I0816 20:26:20.743052 14229 simple_test_executor.go:98]
I0816 20:26:35.762300 14229 simple_test_executor.go:395] Resources cleanup time: 15.019234743s
I0816 20:26:35.762329 14229 clusterloader.go:231] --------------------------------------------------------------------------------
I0816 20:26:35.762337 14229 clusterloader.go:232] Test Finished
I0816 20:26:35.762344 14229 clusterloader.go:233] Test: config.yaml
I0816 20:26:35.762350 14229 clusterloader.go:234] Status: Success
I0816 20:26:35.762356 14229 clusterloader.go:238] --------------------------------------------------------------------------------
```
Finally, back to the three SLIs introduced earlier: how are they realized in this test? The first two, the API call latency SLIs, are implemented by the `APIResponsivenessPrometheus` method, and the third, pod startup latency, by the `PodStartupLatency` method.
### **Grafana**
The command above automatically installs Prometheus, Grafana, and related components, and four dashboards come preconfigured in Grafana:
* DNS: DNS latency metrics
* Master dashboard: metrics for the masters of the kubemark cluster
* Network: network-related metrics
* SLO: API call latency metrics
![](https://img.kancloud.cn/e5/38/e538c8387267208279b7d41a13abc678_2930x1035.png)
Open the SLO dashboard, shown below: it displays the API call latencies together with their thresholds. Notice that one metric on the mutating API call latency panel has already exceeded its threshold, while the others have not. So why does our test case still pass? For the details, see the APIResponsivenessPrometheus article under the Measurement chapter.
![](https://img.kancloud.cn/b4/a6/b4a68a94f3a870a559b200fd3ff82e12_3839x1679.png)
![](https://img.kancloud.cn/65/44/6544e5ea69c829222bd6b26df8e1afe8_3838x1298.png)
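A convenient way to reach these dashboards from the test host is to port-forward the Grafana service created by the manifests (a sketch, assuming the default `monitoring` namespace and `grafana` service name on port 3000):
```
# Forward Grafana locally, then open http://localhost:3000 in a browser
kubectl --kubeconfig=./kubemark-kubeconfig -n monitoring port-forward svc/grafana 3000:3000
```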
### **Time Breakdown**
After the clusterloader2 command finishes, a junit.xml file is generated in the working directory, showing how the time was spent in each step of the test:
```
$ cat junit.xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="1754.772">
<testcase name="density overall (config.yaml)" classname="ClusterLoaderV2" time="1754.767000937"></testcase>
<testcase name="density: [step: 01] Starting measurements" classname="ClusterLoaderV2" time="0.000106709"></testcase>
<testcase name="density: [step: 02] Starting saturation pod measurements" classname="ClusterLoaderV2" time="0.100424074"></testcase>
<testcase name="density: [step: 03] Creating saturation pods" classname="ClusterLoaderV2" time="1.007091926"></testcase>
<testcase name="density: [step: 04] Waiting for saturation pods to be running" classname="ClusterLoaderV2" time="758.146062481"></testcase>
<testcase name="density: [step: 05] Collecting saturation pod measurements" classname="ClusterLoaderV2" time="0.596172704"></testcase>
<testcase name="density: [step: 06] Starting latency pod measurements" classname="ClusterLoaderV2" time="0.101019002"></testcase>
<testcase name="density: [step: 07] Creating latency pods" classname="ClusterLoaderV2" time="100.284284775"></testcase>
<testcase name="density: [step: 08] Waiting for latency pods to be running" classname="ClusterLoaderV2" time="19.857551787"></testcase>
<testcase name="density: [step: 09] Deleting latency pods" classname="ClusterLoaderV2" time="100.305020092"></testcase>
<testcase name="density: [step: 10] Waiting for latency pods to be deleted" classname="ClusterLoaderV2" time="5.0046168"></testcase>
<testcase name="density: [step: 11] Collecting pod startup latency" classname="ClusterLoaderV2" time="0.535112677"></testcase>
<testcase name="density: [step: 12] Deleting saturation pods" classname="ClusterLoaderV2" time="1.00338642"></testcase>
<testcase name="density: [step: 13] Waiting for saturation pods to be deleted" classname="ClusterLoaderV2" time="751.552869757"></testcase>
<testcase name="density: [step: 14] Collecting measurements" classname="ClusterLoaderV2" time="1.157418612"></testcase>
</testsuite>
```