1. Download the kube-prometheus-stack Helm chart
Deploy the kube-prometheus-stack chart with Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm search repo kube-prometheus-stack
helm pull prometheus-community/kube-prometheus-stack
tar xf kube-prometheus-stack-41.4.1.tgz && cd kube-prometheus-stack
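helm pull fetches the latest published chart, which may be newer than the 41.4.1 archive unpacked above; to keep the steps reproducible, the chart version can be pinned explicitly:
helm pull prometheus-community/kube-prometheus-stack --version 41.4.1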
2. Edit the values.yaml file
Before deploying Prometheus, the following prerequisites were put in place (quick checks just below):
- A StorageClass named nfs-client has been created
- An ingress controller has been deployed in the ingress-nginx namespace
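Both can be confirmed before proceeding:
kubectl get storageclass nfs-client
kubectl get pods -n ingress-nginx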
## Edit values.yaml and adjust the following settings
alertmanager:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - alertmanager.local
    paths:
      - /
  alertmanagerSpec:
    retention: 720h
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
---
grafana:
  adminPassword: 1qaz2wsx
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.local
---
prometheus:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.local
    paths:
      - /
  prometheusSpec:
    retention: 360d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 300Gi
## Change the image repository
prometheusOperator:
  admissionWebhooks:
    patch:
      image:
        repository: registry.aliyuncs.com/google_containers/kube-webhook-certgen
---
## charts/grafana/values.yaml
persistence:
  enabled: true
  storageClassName: nfs-client
  size: 100Gi
## charts/kube-state-metrics/values.yaml
## Change the image repository
image:
  repository: bitnami/kube-state-metrics
  tag: 2.6.0
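Before installing, it is worth checking that the edited values still render (helm template renders the manifests locally without touching the cluster):
helm template prometheus . -n monitoring > /dev/null && echo 'render ok'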
3. Deploy
## Deploy the services into the monitoring namespace
kubectl create ns monitoring
helm install prometheus . -n monitoring
## Check that everything is healthy
kubectl get all -n monitoring
NAME READY STATUS RESTARTS AGE
pod/alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 1 (2m37s ago) 102s
pod/prometheus-grafana-7c466d88c5-tq9zh 3/3 Running 0 17m
pod/prometheus-kube-prometheus-operator-67b84b5d9b-z7cws 1/1 Running 0 17m
pod/prometheus-kube-state-metrics-77d5757f57-chrnx 1/1 Running 0 17m
pod/prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 17m
pod/prometheus-prometheus-node-exporter-gj6rr 1/1 Running 0 17m
pod/prometheus-prometheus-node-exporter-rkl6q 1/1 Running 0 17m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 17m
service/prometheus-grafana ClusterIP 172.24.140.186 <none> 80/TCP 17m
service/prometheus-kube-prometheus-alertmanager ClusterIP 172.24.60.136 <none> 9093/TCP 17m
service/prometheus-kube-prometheus-operator ClusterIP 172.24.106.230 <none> 443/TCP 17m
service/prometheus-kube-prometheus-prometheus ClusterIP 172.24.114.84 <none> 9090/TCP 17m
service/prometheus-kube-state-metrics ClusterIP 172.24.250.206 <none> 8080/TCP 17m
service/prometheus-operated ClusterIP None <none> 9090/TCP 17m
service/prometheus-prometheus-node-exporter ClusterIP 172.24.74.178 <none> 9100/TCP 17m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/prometheus-prometheus-node-exporter 2 2 2 2 2 <none> 17m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-grafana 1/1 1 1 17m
deployment.apps/prometheus-kube-prometheus-operator 1/1 1 1 17m
deployment.apps/prometheus-kube-state-metrics 1/1 1 1 17m
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-grafana-7c466d88c5 1 1 1 17m
replicaset.apps/prometheus-kube-prometheus-operator-67b84b5d9b 1 1 1 17m
replicaset.apps/prometheus-kube-state-metrics-77d5757f57 1 1 1 17m
NAME READY AGE
statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager 1/1 17m
statefulset.apps/prometheus-prometheus-kube-prometheus-prometheus 1/1 17m
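The Ingress objects enabled in values.yaml should also have been created (names vary with the release name):
kubectl get ingress -n monitoring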
Troubleshooting:
Error 1: 'The CustomResourceDefinition "prometheuses.monitoring.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes'
Fix: kubectl apply stores the full object in the last-applied-configuration annotation, and this CRD is too large for it, so create the CRD directly instead: cd ./kube-prometheus-stack/crds/ && kubectl create -f crd-prometheuses.yaml
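An alternative with the same effect is server-side apply, which likewise avoids the oversized client-side annotation:
kubectl apply --server-side -f crd-prometheuses.yaml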
Error 2: 'failed calling webhook "prometheusrulemutate.monitoring.coreos.com"'
Fix: a different version of Prometheus was most likely installed before, and its webhook resources were not fully removed on uninstall. Use kubectl get mutatingwebhookconfigurations and kubectl get validatingwebhookconfigurations to locate the objects named in the error and delete them, e.g. kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io prometheus-kube-prometheus-admission, then redeploy.
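To redeploy, an idempotent upgrade works whether or not the first install partially succeeded (assuming the release name and chart directory used above):
helm upgrade --install prometheus . -n monitoring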
4. Configuration adjustments
When you open prometheus.local and go to the Status -> Targets page, you will find that Prometheus cannot scrape metrics from some components. Most Kubernetes components expose metrics over HTTP/HTTPS at a /metrics endpoint; components that only listen on localhost by default can be opened up with the --bind-address flag (or its equivalent setting).
Add local resolution for prometheus.local / alertmanager.local / grafana.local to the hosts file.
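For example (an assumption for illustration: the Ingress controller is reachable through the node IP 10.49.18.103 used elsewhere in this post):
## /etc/hosts
10.49.18.103 prometheus.local alertmanager.local grafana.local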
4.1 kube-controller-manager
- Edit the configuration
kube-controller-manager exposes metrics on port 10257, but testing it fails with "curl: (7) Failed connect to 10.49.18.103:10257; Connection refused". Following the kube-controller-manager reference documentation, adjust the component's bind-address flag:
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
containers:
- command:
  - kube-controller-manager
  - --allocate-node-cidrs=true
  - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
  - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
  ### change
  #- --bind-address=127.0.0.1
  - --bind-address=0.0.0.0
...omitted...
Reference: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/#options
--bind-address string    Default: 0.0.0.0
The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used.
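After the manifest is saved, the kubelet re-creates the static pod on its own; on a kubeadm cluster the restart can be confirmed via the component label:
kubectl get pod -n kube-system -l component=kube-controller-manager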
- Verify
lsof -i:10257
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kube-cont 13515 root 7u IPv6 38954551 0t0 TCP *:10257 (LISTEN)
kube-cont 13515 root 34u IPv6 38968508 0t0 TCP master01.pl.hpc:10257->node1:61771 (ESTABLISHED)
### 10257 is the secure port; it serves HTTPS
curl https://10.49.18.103:10257/metrics --cacert /etc/kubernetes/pki/ca.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds [ALPHA] Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
...omitted...
4.2 kube-proxy
- Edit the configuration
kubectl edit cm -n kube-system kube-proxy
...omitted...
kind: KubeProxyConfiguration
### change
# metricsBindAddress: ""
metricsBindAddress: 0.0.0.0:10249
mode: ipvs
...omitted...
Note: the kube-proxy configuration is mounted into the container from a ConfigMap, so do not add a --metrics-bind-address flag directly on the kube-proxy DaemonSet; a flag added that way will not take effect.
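Because kube-proxy only reads its configuration at startup, editing the ConfigMap does not affect running pods; restart the DaemonSet so the change takes effect:
kubectl rollout restart ds/kube-proxy -n kube-system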
- Verify
lsof -i:10249
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kube-prox 36749 root 13u IPv6 39030529 0t0 TCP master01.pl.hpc:10249->node1:38685 (ESTABLISHED)
kube-prox 36749 root 14u IPv6 39100619 0t0 TCP *:10249 (LISTEN)
curl 10.49.18.103:10249/metrics
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter
go_gc_cycles_automatic_gc_cycles_total 13
# HELP go_gc_cycles_forced_gc_cycles_total Count of completed GC cycles forced by the application.
# TYPE go_gc_cycles_forced_gc_cycles_total counter
...omitted...
Reference: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/
--metrics-bind-address ipport    Default: 127.0.0.1:10249
The IP address with port for the metrics server to serve on (set to '0.0.0.0:10249' for all IPv4 interfaces and '[::]:10249' for all IPv6 interfaces). Set empty to disable. This parameter is ignored if a config file is specified by --config.
4.3 kube-scheduler
- Edit the configuration
vim /etc/kubernetes/manifests/kube-scheduler.yaml
...omitted...
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    ### change
    #- --bind-address=127.0.0.1
    - --bind-address=0.0.0.0
...omitted...
- Verify
lsof -i:10259
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kube-sche 24404 root 7u IPv6 38957456 0t0 TCP *:10259 (LISTEN)
kube-sche 24404 root 10u IPv6 39009914 0t0 TCP master01.pl.hpc:10259->node1:29900 (ESTABLISHED)
### 10259 is the secure port; it serves HTTPS
curl https://10.49.18.103:10259/metrics --cacert /etc/kubernetes/pki/ca.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key --insecure
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds [ALPHA] Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="3600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="7200"} 0
...omitted...
Reference: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
--bind-address string    Default: 0.0.0.0
The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used.
4.4 etcd
vim /etc/kubernetes/manifests/etcd.yaml
...omitted...
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://10.49.18.103:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --initial-advertise-peer-urls=https://10.49.18.103:2380
    - --initial-cluster=master01=https://10.49.18.103:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://10.49.18.103:2379
    ### change
    # - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-metrics-urls=http://0.0.0.0:2381
...omitted...
- Verify
### lsof lists port 2381 by its /etc/services name, compaq-https
lsof -i:2381
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
etcd 40671 root 14u IPv6 42749738 0t0 TCP *:compaq-https (LISTEN)
etcd 40671 root 87u IPv6 42863814 0t0 TCP master01.pl.hpc:compaq-https->node1:35348 (ESTABLISHED)
curl 10.49.18.103:2381/metrics
# TYPE etcd_cluster_version gauge
etcd_cluster_version{cluster_version="3.5"} 1
# HELP etcd_debugging_auth_revision The current revision of auth store.
# TYPE etcd_debugging_auth_revision gauge
etcd_debugging_auth_revision 1
# HELP etcd_debugging_disk_backend_commit_rebalance_duration_seconds The latency distributions of commit.rebalance called by bboltdb backend.
# TYPE etcd_debugging_disk_backend_commit_rebalance_duration_seconds histogram
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_bucket{le="0.001"} 4126
etcd_debugging_disk_backend_commit_rebalance_duration_seconds_bucket{le="0.002"} 4126
...omitted...
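With all four components adjusted, revisit Status -> Targets on prometheus.local: every target should now report UP. The same check can be done against the Prometheus HTTP API (assumes jq is installed and prometheus.local resolves as configured above):
curl -s http://prometheus.local/api/v1/targets | jq -r '.data.activeTargets[] | select(.health != "up") | .scrapeUrl'
## empty output means all targets are healthy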