A. Pod status: Pending
The pod has not been scheduled onto a node yet; the reason can almost always be found in the pod events (kubectl describe pod).
Below are a few common errors.
a. Insufficient resources
kubectl apply -f https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/demo-resources-limit.yaml
# kubectl get pod
NAME READY STATUS RESTARTS AGE
demo-resources-limit-7698bb955f-ldtgk 0/1 Pending 0 2m7s
# kubectl describe pod demo-resources-limit-7698bb955f-ldtgk
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m3s default-scheduler 0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.
# How to read this: there is 1 node in total, and 0 nodes can satisfy the request; 1 node has insufficient CPU and 1 has insufficient memory.
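The cause lives in the pod template's resource requests; a fragment roughly like the one below (a sketch only: the numbers are made-up for illustration and deliberately larger than the single node can offer, the actual values in demo-resources-limit.yaml may differ). The fix is to lower the requests to something a node can actually satisfy, or add capacity:
# Fragment of the pod spec (sketch): requests exceed the node's allocatable resources
containers:
- name: tomcat
  image: tomcat
  resources:
    requests:
      cpu: "8"        # assumed value, more CPU than the node has allocatable
      memory: 16Gi    # assumed value, more memory than the node has allocatable
# Compare against what the node can actually offer:
# kubectl describe node ubuntu | grep -A 6 Allocatable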
b. Node selector matches no node
kubectl apply -f https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/demo-node-selector.yaml
# kubectl get pod
NAME READY STATUS RESTARTS AGE
demo-node-selector-6cd7c5474f-whct6 0/1 Pending 0 72s
# kubectl describe pod demo-node-selector-6cd7c5474f-whct6
Node-Selectors: nodetest=yeyeye
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m50s (x2 over 2m50s) default-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
# The deployment has a nodeSelector configured, but no node carries the matching label (a fragment of the selector is sketched after the output below)
# Fix: either change the nodeSelector in the deployment, or add the label to a node
# kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ubuntu Ready master 223d v1.13.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=ubuntu,node-role.kubernetes.io/master=
# kubectl label nodes ubuntu nodetest=yeyeye
node/ubuntu labeled
# kubectl get pod
NAME READY STATUS RESTARTS AGE
demo-node-selector-6cd7c5474f-whct6 1/1 Running 0 5m58s
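For reference, the selector that has to match lives in the pod template; a fragment roughly like this (a sketch, using the label from above; the container details are placeholders):
# Fragment of the pod spec (sketch): scheduling requires a node labelled nodetest=yeyeye
nodeSelector:
  nodetest: yeyeye
containers:
- name: tomcat
  image: tomcat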
c. Volume cannot be mounted (the pod may also sit in ContainerCreating instead of Pending)
kubectl apply -f https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/demo-pv-pvc.yaml
kubectl apply -f https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/demo-volume.yaml
# Check the pod events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 20s default-scheduler Successfully assigned troubleshot/demo-volume-5f974bf75c-vmpmp to ubuntu
Warning FailedMount 19s kubelet, ubuntu MountVolume.SetUp failed for volume "demo-pv" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/4ae14be1-b1f4-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv --scope -- mount -t nfs 192.168.4.130:/opt/add-dev/nfs/ /var/lib/kubelet/pods/4ae14be1-b1f4-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv
Output: Running scope as unit run-r7d15a31341454888b9a3d471611e827f.scope.
mount: wrong fs type, bad option, bad superblock on 192.168.4.130:/opt/add-dev/nfs/,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)
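This particular failure ("wrong fs type ... you might need a /sbin/mount.<type> helper program") means the node is missing the NFS client utilities, so kubelet cannot run mount -t nfs. Installing them on every node that may run the pod fixes it (the package name depends on the distribution):
# Debian/Ubuntu nodes
apt-get install -y nfs-common
# CentOS/RHEL nodes
yum install -y nfs-utils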
# kubectl get pod
NAME READY STATUS RESTARTS AGE
demo-volume-5f974bf75c-tpkxv 0/1 ContainerCreating 0 3m48s
# kubectl describe pod demo-volume-5f974bf75c-tpkxv
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m19s default-scheduler Successfully assigned troubleshot/demo-volume-5f974bf75c-tpkxv to ubuntu
Warning FailedMount 16s kubelet, ubuntu Unable to mount volumes for pod "demo-volume-5f974bf75c-tpkxv_troubleshot(45985214-b1f6-11e9-8cb3-001c4209f822)": timeout expired waiting for volumes to attach or mount for pod "troubleshot"/"demo-volume-5f974bf75c-tpkxv". list of unmounted volumes=[share]. list of unattached volumes=[share default-token-2msph]
Warning FailedMount 13s kubelet, ubuntu MountVolume.SetUp failed for volume "demo-pv" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/45985214-b1f6-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv --scope -- mount -t nfs 192.168.4.130:/opt/add-dev/nfs/ /var/lib/kubelet/pods/45985214-b1f6-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv
Output: Running scope as unit run-r179d447154fc42fa9f1108382cf846df.scope.
mount.nfs: Connection timed out
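Here the mount helper exists but the NFS server never answers ("Connection timed out"): check the server address in the PV, the firewall, and whether the export is actually visible from the node, for example:
# Run on the node: is the export visible at all?
showmount -e 192.168.4.130
# Reproduce the mount by hand, outside of kubelet
mount -t nfs 192.168.4.130:/opt/add-dev/nfs/ /mnt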
B. Pod status: ImagePullBackOff or ErrImagePull
This error is self-explanatory: the image pull failed. Use kubectl get pod -owide to find the node the pod was scheduled to, then try a manual docker pull on that node. Images on registries such as gcr.io may only be reachable through a proxy.
Also note: if you run your own registry without HTTPS, yet the image address in the pod events starts with https://, your docker configuration is missing the Insecure Registries setting.
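If that is the case, add the registry to insecure-registries in /etc/docker/daemon.json on every node and restart docker; the registry address below is only a placeholder:
# /etc/docker/daemon.json (placeholder registry address)
{
  "insecure-registries": ["192.168.4.200:5000"]
}
# apply the change
systemctl restart docker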
C. Pod events contain "start container"
Once the events show the container being started, the pod has been scheduled, the image has been pulled, and the volumes have been mounted (which does not guarantee that read/write permissions are correct).
The container may still fail to start and restart repeatedly because of a misconfigured health check, a faulty startup script, storage permissions, or runtime permissions.
As always, start with kubectl describe pod and read the events; if there is nothing useful there, check the pod logs.
A. Liveness probe failures
First understand what liveness does: once the liveness probe has failed the configured number of times, the container is restarted. Restarts caused this way are spelled out clearly in the events, and the fix is simply to adjust the settings: lengthen the initial delay, or correct the probe parameters or port. This is easy to resolve and comes up relatively rarely.
# kubectl get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
demo-liveness-fail-6b47f5bc74-n8qj4 1/1 Running 2 3m54s 10.68.0.56 ubuntu <none> <none>
# kubectl describe pod demo-liveness-fail-6b47f5bc74-n8qj4
Normal Pulling 2m42s (x4 over 4m49s) kubelet, ubuntu pulling image "tomcat"
Warning Unhealthy 2m42s (x3 over 2m48s) kubelet, ubuntu Liveness probe failed: HTTP probe failed with statuscode: 404
Normal Killing 2m42s kubelet, ubuntu Killing container with id docker://demo-liveness-fail:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 2m22s (x2 over 3m30s) kubelet, ubuntu Successfully pulled image "tomcat"
Normal Created 2m22s (x2 over 3m30s) kubelet, ubuntu Created container
Normal Started 2m21s (x2 over 3m30s) kubelet, ubuntu Started container
Warning Unhealthy 92s (x3 over 98s) kubelet, ubuntu Liveness probe failed: Get http://10.68.0.56:8080/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
# Update the yaml: the application has no /healthz path, so change the probe URL from /healthz to /docs (the probe fragment is sketched after the output below)
kubectl apply -f https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/demo-liveness-fail-update.yaml
# kubectl get pod
NAME READY STATUS RESTARTS AGE
demo-liveness-fail-766f9df949-b68wr 1/1 Running 0 10m
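The piece being changed is the livenessProbe in the pod template; roughly like the fragment below (a sketch: the exact delays and thresholds in demo-liveness-fail-update.yaml may differ):
# Fragment of the container spec (sketch): probe a path that actually exists
livenessProbe:
  httpGet:
    path: /docs            # was /healthz, which this tomcat does not serve
    port: 8080
  initialDelaySeconds: 30  # give the app time to start before the first probe
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3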
B. Pod fails at startup and restarts repeatedly
This is very common and is almost always caused by the startup script, permissions, and the like. The basic approach is to describe the pod and watch the events while also following the logs. Start with kubectl describe pod: the events and the Last State / Message fields usually carry useful information.
a. Check the Exit Code. If it is 0, the container exited normally: either no startup command was set, or the command was wrong and finished immediately. Once the process exits, the container is considered done and is killed; with the restart policy restartPolicy: Always the pod then restarts over and over. For example:
# No startup command is set, and the centos:7 image has no long-running default entrypoint
kubectl run demo-centos --image=centos:7
# Or give it a command that finishes almost immediately; it will also restart over and over
kubectl run demo-centos2 --image=centos:7 --command ls /
# Switch to this and it will not restart for quite a while
kubectl run demo-centos3 --image=centos:7 --command sleep 36000
# This one will essentially never restart; the point is to keep the process busy
kubectl run demo-centos4 --image=centos:7 --command tailf /var/log/lastlog
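In a real manifest the same idea is expressed by giving the container a long-running command; a fragment roughly like this (a sketch, names are placeholders):
# Fragment of the pod spec (sketch): keep PID 1 alive so the container never "finishes"
containers:
- name: demo
  image: centos:7
  command: ["/bin/sh", "-c", "tail -f /dev/null"]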
b. Check the Exit Code. If it is not 0, the startup command itself is likely wrong, for example:
kubectl run demo-centos5 --image=centos:7 --command wahaha
kubectl describe pod demo-centos5-7fc7f8bccc-zpzct
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"wahaha\": executable file not found in $PATH": unknown
Exit Code: 127
c. Or the storage permissions are wrong; in this case the volume is an NFS mount
# NFS server export options; root_squash is on by default (a possible fix is sketched after the exports snippet)
# cat /etc/exports
/opt/add-dev/nfs/ *(rw,sync)
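If the pod really has to write to the share as root, one option is to turn off root squashing on that export and re-export it (trade-off: root on any client then acts as root on the share):
# /etc/exports on the NFS server
/opt/add-dev/nfs/ *(rw,sync,no_root_squash)
# re-read /etc/exports without restarting the NFS service
exportfs -r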
d. Or the process lacks the runtime permissions it needs; on OpenShift, for example, containers run as a random non-root UID by default (possible fixes are sketched after the logs below)
# oc new-project test1
# oc run nginx --image=nginx --port=80
# oc get pod
NAME READY STATUS RESTARTS AGE
nginx-1-hx8gj 0/1 Error 0 10s
# oc logs nginx-1-hx8gj
2019/07/31 14:33:26 [warn] 1#1: the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /etc/nginx/nginx.conf:2
nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /etc/nginx/nginx.conf:2
2019/07/31 14:33:26 [emerg] 1#1: mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
nginx: [emerg] mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
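Two common ways out: use an image built to run as non-root (for example nginxinc/nginx-unprivileged), or, if acceptable for your security requirements, let the project's default service account use the anyuid SCC so the pod may run as root:
# Allow pods using the default service account in test1 to run with any UID
oc adm policy add-scc-to-user anyuid -z default -n test1
# Roll out again (assuming oc run created a DeploymentConfig, as the pod name nginx-1-... suggests)
oc rollout latest dc/nginx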
D. Deployment is created but no pod appears
# kubectl -n demo-test run tomtest --image=tomcat
# kubectl -n demo-test get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
tomtest 0/1 0 0 3m3s
# kubectl -n demo-test get pod
No resources found.
# kubectl -n demo-test describe deployments tomtest
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 4m2s deployment-controller Scaled up replica set tomtest-865b47b7df to 1
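The deployment's own events look clean here, so dig one level deeper: the ReplicaSet is what actually creates the pods, and its events (or the namespace events) usually name the real blocker, such as a quota, an admission policy, or an unreachable webhook:
kubectl -n demo-test describe rs tomtest-865b47b7df
# or list recent events in the namespace
kubectl -n demo-test get events --sort-by=.metadata.creationTimestamp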
How to diagnose
# Check whether ipv4 forwarding is enabled
sysctl net.ipv4.ip_forward
# 0 means forwarding is disabled
net.ipv4.ip_forward = 0
How to fix
# this will turn things back on a live server
sysctl -w net.ipv4.ip_forward=1
# on Centos this will make the setting apply after reboot
echo net.ipv4.ip_forward=1 >> /etc/sysctl.d/10-ipv4-forwarding-on.conf
# reload all sysctl configuration files and verify the value
sysctl --system