Problem description:
Autoscaling in a K8S cluster stopped working, and the HPA reported the following errors:
Warning FailedGetResourceMetric 57m (x2401 over 5d) horizontal-pod-autoscaler unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Warning FailedComputeMetricsReplicas 31m (x14 over 51m) horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Warning FailedGetResourceMetric 15m (x29 over 51m) horizontal-pod-autoscaler unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Troubleshooting:
At first glance this looks like a metrics-server failure, so let's calm down and run a command first
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  creationTimestamp: 2018-08-15T06:59:36Z
  name: v1beta1.metrics.k8s.io
  resourceVersion: "11864395"
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.metrics.k8s.io
  uid: c537dcaa-a058-11e8-9f63-005056840d15
spec:
  caBundle: null
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100
status:
  conditions:
  - lastTransitionTime: 2018-08-15T07:47:08Z
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available
That looks fine, so let's try top next
kubectl top node
NAME            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
192.168.13.31   1393m        8%     31583Mi         56%
192.168.13.34   2333m        14%    26049Mi         46%
192.168.13.36   3220m        20%    20927Mi         37%
192.168.13.48   322m         4%     2049Mi          12%
192.168.13.49   153m         1%     1979Mi          12%
192.168.13.50   144m         1%     1885Mi          11%
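The APIService check above can also be reduced to a single value, which is handy when you want to poll it in a loop instead of reading the full YAML each time. This is only a minimal sketch: the function names apiservice_status and apiservice_available are made up for illustration, and it assumes kubectl is already configured for the cluster.

```shell
# Print the status ("True" / "False" / "Unknown") of the Available
# condition of an APIService. The jsonpath filter picks the condition
# whose type is "Available".
apiservice_status() {
  kubectl get apiservice "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}

# Hypothetical helper: succeeds only when the APIService reports
# Available=True, so it can be used directly in if/while.
apiservice_available() {
  [ "$(apiservice_status "$1")" = "True" ]
}
```

For example, `apiservice_available v1beta1.metrics.k8s.io || echo "metrics API not available"`. Keep in mind that in this incident the condition read "True" at the moment of the check, so a single passing check does not rule out an intermittent problem.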
Next, check the logs of the kube-apiserver process
journalctl -xeu kube-apiserver
Aug 15 15:47:16 centos7-13-32 kube-apiserver[3874063]: E0815 15:47:16.322434 3874063 available_controller.go:295] v1beta1.custom.metrics.k8s.io failed with: Get https://172.20.198.84:6443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Aug 15 15:47:16 centos7-13-32 kube-apiserver[3874063]: E0815 15:47:16.322896 3874063 available_controller.go:295] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Aug 15 16:17:10 centos7-13-32 kube-apiserver[3198546]: E0815 16:17:10.934046 3198546 controller.go:111] loading OpenAPI spec for "v1beta1.custom.metrics.k8s.io" failed with: OpenAPI spec does not exists
Aug 15 16:17:09 centos7-13-32 kube-apiserver[3198546]: E0815 16:17:09.950588 3198546 controller.go:111] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: OpenAPI spec does not exists
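The log above mixes several kinds of errors, so it can help to filter out just the aggregated APIs whose availability check failed. The sketch below assumes the log lines look like the ones shown here (an available_controller.go entry followed by "<name> failed with:"); the function name failing_apiservices is hypothetical.

```shell
# Read kube-apiserver log lines on stdin and print the unique names of
# the APIServices whose availability check failed.
failing_apiservices() {
  # Keep only availability-controller lines, then capture the token
  # between "] " and " failed with:".
  grep 'available_controller.go' \
    | sed -n 's/.*\] \([^ ]*\) failed with:.*/\1/p' \
    | sort -u
}
```

It could be used as `journalctl -u kube-apiserver --no-pager | failing_apiservices`. Lines from other sources, such as the OpenAPI spec errors, are ignored on purpose.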
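Pulling out the key errors, the OpenAPI spec failures are what stand out at first.
After a series of experiments, though, the root cause turned out to be a network failure, which matches the Client.Timeout error in the log.
Specifically, the Master node (that is, the ApiServer) could not communicate with metrics-server or custom-metrics-apiserver.
To verify this, ping the IP of the metrics-server pod directly from the ApiServer node and see whether it is reachable.
In my case the ping failed. Because the cluster runs in HA mode, the fix this time was to switch the ApiServer over to a standby node. Solved!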
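The ping check described above could be scripted roughly as follows and run on the ApiServer node. This is only a sketch under a few assumptions: the label k8s-app=metrics-server is what the upstream manifests use but may differ in your deployment, the function names are made up for illustration, and ICMP must not be blocked by network policy.

```shell
# List the pod IPs of metrics-server.
# ASSUMPTION: pods carry the label k8s-app=metrics-server; adjust as needed.
metrics_server_ips() {
  kubectl -n kube-system get pods -l k8s-app=metrics-server \
    -o jsonpath='{.items[*].status.podIP}'
}

# Ping every metrics-server pod IP from the current node.
# Returns non-zero if no pod was found or any pod is unreachable.
check_metrics_server_reachable() {
  ips="$(metrics_server_ips)"
  if [ -z "$ips" ]; then
    echo "no metrics-server pods found"
    return 1
  fi
  failed=0
  for ip in $ips; do
    # 3 probes, 2 second timeout for each reply
    if ping -c 3 -W 2 "$ip" >/dev/null 2>&1; then
      echo "reachable: $ip"
    else
      echo "UNREACHABLE: $ip"
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_metrics_server_reachable` on each master in turn shows which ApiServer node has lost connectivity to the pod network, which is the node to fail away from.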
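====== Additional notes
Clusters using the Calico network plugin sometimes run into this kind of problem. The exact cause is still under investigation; one workaround is to restart all k8s services (this may or may not fix it)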
systemctl stop kube-apiserver
systemctl stop kube-scheduler
systemctl stop kube-controller-manager
systemctl stop kubelet
systemctl stop kube-proxy
systemctl start kube-apiserver
systemctl start kube-scheduler
systemctl start kube-controller-manager
systemctl start kubelet
systemctl start kube-proxy
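The same stop-everything-then-start-everything sequence can be written as a loop so the service list lives in one place. A minimal sketch, assuming the five unit names above exist on the node and that it runs as root; the names K8S_SERVICES and restart_k8s_services are hypothetical.

```shell
# Service list taken from the commands above, in the same order.
K8S_SERVICES="kube-apiserver kube-scheduler kube-controller-manager kubelet kube-proxy"

# Stop all services first, then start them again in the same order.
restart_k8s_services() {
  for svc in $K8S_SERVICES; do
    systemctl stop "$svc"
  done
  for svc in $K8S_SERVICES; do
    systemctl start "$svc"
  done
}
```

Call `restart_k8s_services` on the affected node. On worker nodes that only run kubelet and kube-proxy, the list would need trimming accordingly.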