A. Pod stuck in Pending
The pod has not been scheduled onto a node; the cause can almost always be found by inspecting the pod events.
A few common errors are listed below.
a. Insufficient resources
```
kubectl apply -f https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/demo-resources-limit.yaml

# kubectl get pod
NAME                                    READY   STATUS    RESTARTS   AGE
demo-resources-limit-7698bb955f-ldtgk   0/1     Pending   0          2m7s

# kubectl describe pod demo-resources-limit-7698bb955f-ldtgk
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  3m3s  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.
```
How to read this: there is 1 node in total and 0 nodes satisfy the pod's resource requests; 1 node is short on CPU and 1 is short on memory.
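If the node genuinely lacks capacity, either add nodes or shrink the pod's requests so the scheduler can place it. A minimal sketch of the relevant part of the Deployment spec (the values below are illustrative, not taken from the demo yaml):

```yaml
# hypothetical snippet: keep requests within the node's allocatable CPU/memory
spec:
  containers:
  - name: demo
    image: tomcat
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "1"
        memory: "1Gi"
```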
Another common case: the volume cannot be mounted, so the pod never starts (it stays in ContainerCreating). Example with an NFS-backed PV:

```
kubectl apply -f https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/demo-pv-pvc.yaml
kubectl apply -f https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/demo-volume.yaml

# check the pod events
Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    20s   default-scheduler  Successfully assigned troubleshot/demo-volume-5f974bf75c-vmpmp to ubuntu
  Warning  FailedMount  19s   kubelet, ubuntu    MountVolume.SetUp failed for volume "demo-pv" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/4ae14be1-b1f4-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv --scope -- mount -t nfs 192.168.4.130:/opt/add-dev/nfs/ /var/lib/kubelet/pods/4ae14be1-b1f4-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv
Output: Running scope as unit run-r7d15a31341454888b9a3d471611e827f.scope.
mount: wrong fs type, bad option, bad superblock on 192.168.4.130:/opt/add-dev/nfs/,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
        need a /sbin/mount.<type> helper program)
```
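The "wrong fs type ... need a /sbin/mount.&lt;type&gt; helper program" message usually means the node has no NFS client installed. A quick way to confirm and fix this on the node (package names depend on the distribution):

```bash
# Debian/Ubuntu nodes
apt-get install -y nfs-common
# CentOS/RHEL nodes
yum install -y nfs-utils
# verify the export can be mounted from the node itself
mount -t nfs 192.168.4.130:/opt/add-dev/nfs/ /mnt && umount /mnt
```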
If the error is a timeout instead, the NFS client on the node is fine but the server cannot be reached:

```
# kubectl get pod
NAME                           READY   STATUS              RESTARTS   AGE
demo-volume-5f974bf75c-tpkxv   0/1     ContainerCreating   0          3m48s

# kubectl describe pod demo-volume-5f974bf75c-tpkxv
Events:
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    2m19s  default-scheduler  Successfully assigned troubleshot/demo-volume-5f974bf75c-tpkxv to ubuntu
  Warning  FailedMount  16s    kubelet, ubuntu    Unable to mount volumes for pod "demo-volume-5f974bf75c-tpkxv_troubleshot(45985214-b1f6-11e9-8cb3-001c4209f822)": timeout expired waiting for volumes to attach or mount for pod "troubleshot"/"demo-volume-5f974bf75c-tpkxv". list of unmounted volumes=[share]. list of unattached volumes=[share default-token-2msph]
  Warning  FailedMount  13s    kubelet, ubuntu    MountVolume.SetUp failed for volume "demo-pv" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/45985214-b1f6-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv --scope -- mount -t nfs 192.168.4.130:/opt/add-dev/nfs/ /var/lib/kubelet/pods/45985214-b1f6-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv
Output: Running scope as unit run-r179d447154fc42fa9f1108382cf846df.scope.
mount.nfs: Connection timed out
```
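"Connection timed out" points at the server side: a wrong server address, a wrong export path, or a network/firewall problem. A few checks worth running from the node (using the server IP from the event above):

```bash
# is the NFS server reachable from the node?
ping -c 3 192.168.4.130
# does it actually export the path the PV points at?
showmount -e 192.168.4.130
# try the mount by hand to see the raw error
mount -t nfs 192.168.4.130:/opt/add-dev/nfs/ /mnt && umount /mnt
```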
```
# kubectl get pod
NAME                          READY   STATUS    RESTARTS   AGE
demo-volume-78d654c4d-2cwdx   1/1     Running   0          6m22s

# kubectl exec -it demo-volume-78d654c4d-2cwdx bash
root@demo-volume-78d654c4d-2cwdx:/usr/local/tomcat# df -h
Filesystem                       Size  Used  Avail  Use%  Mounted on
overlay                           48G   17G    32G   35%  /
tmpfs                             64M     0    64M    0%  /dev
tmpfs                             16G     0    16G    0%  /sys/fs/cgroup
192.168.4.133:/opt/add-dev/nfs   100G   62G    39G   62%  /tmp
/dev/mapper/centos-root           48G   17G    32G   35%  /etc/hosts
shm                               64M     0    64M    0%  /dev/shm
tmpfs                             16G   12K    16G    1%  /run/secrets/kubernetes.io/serviceaccount
tmpfs                             16G     0    16G    0%  /proc/acpi
tmpfs                             16G     0    16G    0%  /proc/scsi
tmpfs                             16G     0    16G    0%  /sys/firmware
```
Everything is normal now; exec into the container and run df -h to confirm the mount is OK.
B. Pod stuck in ImagePullBackOff or ErrImagePull
This error is self-explanatory: the image pull failed. Use kubectl get pod -owide to find the node the pod was scheduled to, then try a manual docker pull on that node. For images hosted on registries blocked in your network, such as gcr.io, you may need to pull through a proxy.
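For example (pod, node and image names are placeholders):

```bash
# find the node the pod was scheduled to
kubectl get pod <pod-name> -owide
# then, on that node, pull the image by hand to see the real error message
docker pull <registry>/<image>:<tag>
```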
Also note: if you are using a self-hosted registry that does not serve HTTPS, yet the image address in the pod events starts with https://, your Docker configuration is missing the Insecure Registries (insecure-registries) setting.
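A sketch of the fix, assuming a private registry at registry.example.com:5000 (a placeholder address); merge the key into the node's existing /etc/docker/daemon.json rather than replacing other settings:

```
# /etc/docker/daemon.json  (on every node that pulls from this registry)
{
  "insecure-registries": ["registry.example.com:5000"]
}
```
Restart docker afterwards (systemctl restart docker) for the change to take effect.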
C. Pod events contain "Started container"
If the events include "Started container", the pod has been scheduled successfully, the image has been pulled, and the volumes have been mounted (although this does not guarantee that read/write permissions on the volume are correct).
The container may nevertheless fail to start and restart repeatedly because of a misconfigured health check, a broken startup command or script, storage permissions, or runtime (user) permissions.
As before, start with kubectl describe pod and look at the events; if there is nothing suspicious there, check the pod logs.
a. Liveness probe failure
First understand what the liveness probe does: when the number of failed liveness checks reaches the configured threshold, the container is restarted. Restarts caused this way are clearly visible in the events, and the fix is simply to adjust the relevant settings: increase the initial delay, or tune the probe parameters or port (see the probe sketch after the example below). This is not hard to solve, and it comes up relatively rarely.
```
# kubectl get pod -owide
NAME                                  READY   STATUS    RESTARTS   AGE     IP           NODE     NOMINATED NODE   READINESS GATES
demo-liveness-fail-6b47f5bc74-n8qj4   1/1     Running   2          3m54s   10.68.0.56   ubuntu   <none>           <none>

# kubectl describe pod demo-liveness-fail-6b47f5bc74-n8qj4
  Normal   Pulling    2m42s (x4 over 4m49s)  kubelet, ubuntu  pulling image "tomcat"
  Warning  Unhealthy  2m42s (x3 over 2m48s)  kubelet, ubuntu  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    2m42s                  kubelet, ubuntu  Killing container with id docker://demo-liveness-fail: Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled     2m22s (x2 over 3m30s)  kubelet, ubuntu  Successfully pulled image "tomcat"
  Normal   Created    2m22s (x2 over 3m30s)  kubelet, ubuntu  Created container
  Normal   Started    2m21s (x2 over 3m30s)  kubelet, ubuntu  Started container
  Warning  Unhealthy  92s (x3 over 98s)      kubelet, ubuntu  Liveness probe failed: Get http://10.68.0.56:8080/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```
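A hedged sketch of the probe parameters that typically need tuning (the path, port and values below are illustrative; point the probe at something the application actually serves):

```yaml
livenessProbe:
  httpGet:
    path: /            # probe a path the app really serves, not a missing /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to start before the first probe
  periodSeconds: 10
  timeoutSeconds: 5         # the event above shows a client timeout, so don't keep this too tight
  failureThreshold: 3
```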
b. The start command or script is wrong. For example, running an image with a command that does not exist inside it:

```
kubectl run demo-centos2 --image=centos:7 --command wahaha

# kubectl describe pod demo-centos5-7fc7f8bccc-zpzct
    Last State:  Terminated
      Reason:    ContainerCannotRun
      Message:   OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"wahaha\": executable file not found in $PATH": unknown
      Exit Code: 127
```
Exit code 127 together with "executable file not found in $PATH" means the configured command or entrypoint does not exist in the image; fix the command (or use an image that contains it).
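If you just need the pod to run, replace the bogus command with something that exists inside the image; a minimal sketch:

```bash
# "wahaha" is not a binary in the centos:7 image; use something on $PATH instead
kubectl run demo-centos2 --image=centos:7 --command -- sleep 3600
```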
c. Or the storage permissions are wrong; in this example the volume is an NFS share.
```
# NFS server export options; root_squash is enabled by default
# cat /etc/exports
/opt/add-dev/nfs/ *(rw,sync)
```
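With root_squash enabled, root inside the container is mapped to an unprivileged user on the NFS server, so writes from the pod can fail. One option, assuming you control the NFS server, is to relax the export (alternatively, chown the exported directory to the squashed UID/GID):

```bash
# /etc/exports on the NFS server
/opt/add-dev/nfs/ *(rw,sync,no_root_squash)

# reload the exports without restarting the NFS service
exportfs -ra
```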
On OpenShift there is another common permission trap: the default restricted SCC runs containers as a random non-root UID, so images that expect to run as root (such as the official nginx image) fail with permission errors:

```
# oc new-project test1
# oc run nginx --image=nginx --port=80
# oc get pod
NAME            READY   STATUS   RESTARTS   AGE
nginx-1-hx8gj   0/1     Error    0          10s
# oc logs nginx-1-hx8gj
2019/07/31 14:33:26 [warn] 1#1: the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /etc/nginx/nginx.conf:2
nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /etc/nginx/nginx.conf:2
2019/07/31 14:33:26 [emerg] 1#1: mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
nginx: [emerg] mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
```
One fix: create a dedicated service account, grant it the anyuid SCC, and patch the DeploymentConfig to use it:

```
oc new-project test2
oc run nginx2 --image=nginx --port=80
oc get pod
oc create serviceaccount ngroot
oc adm policy add-scc-to-user anyuid -z ngroot
oc patch dc/nginx2 --patch '{"spec":{"template":{"spec":{"serviceAccountName": "ngroot"}}}}'

# oc get pod
NAME             READY   STATUS    RESTARTS   AGE
nginx2-2-mkrdp   1/1     Running   0          12s
```
Solution 3: make a copy of the restricted SCC and add to the copy only the permissions nginx actually needs.
First export the restricted SCC to a yaml file and make a copy:

```
oc get scc restricted --export -o yaml > restricted.yaml
cp restricted.yaml restricted-ng.yaml
vim restricted-ng.yaml
```

Edit restricted-ng.yaml:
- name: change restricted to restricted-ng
- runAsUser: change MustRunAsRange to RunAsAny
- groups: delete the system:authenticated line, otherwise this SCC would replace the default SCC for every project
- priority: change null to 5, so its priority is higher than the default restricted

My finished restricted-ng.yaml is at https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/restricted-ng.yaml

Import the new SCC:

```
oc apply -f restricted-ng.yaml
oc get scc
```

Create a new project and application for testing:

```
oc new-project test3
oc run nginx3 --image=nginx --port=80
```

Create a new service account and bind it to the SCC created above:

```
oc create serviceaccount ng3root
oc adm policy add-scc-to-user restricted-ng -z ng3root
oc patch dc/nginx3 --patch '{"spec":{"template":{"spec":{"serviceAccountName": "ng3root"}}}}'

[root@origin311 ~]# oc get pod
NAME             READY   STATUS    RESTARTS   AGE
nginx3-2-wsk97   1/1     Running   0          20s
[root@origin311 ~]# oc logs nginx3-2-wsk97
2019/07/31 15:52:46 [emerg] 7#7: setgid(101) failed (1: Operation not permitted)
2019/07/31 15:52:46 [alert] 1#1: worker process 7 exited with fatal code 2 and cannot be respawned
```

There is still one more error: the container has no setgid permission.

```
oc edit scc restricted-ng
```

Edit the requiredDropCapabilities: section, which lists the capabilities that are dropped. Delete both - SETUID and - SETGID; if you delete only SETGID, the pod will fail next with the corresponding setuid error.

```
oc rollout latest nginx3

[root@origin311 ~]# oc get pod -owide
NAME             READY   STATUS    RESTARTS   AGE   IP             NODE                    NOMINATED NODE
nginx3-4-5bdpj   1/1     Running   0          1m    10.128.0.126   origin311.localpd.com   <none>
[root@origin311 ~]# oc logs nginx3-4-5bdpj
[root@origin311 ~]#
```

No errors in the log means it is working; a test request also succeeds:

```
[root@origin311 ~]# curl 10.128.0.126
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
```
FAQ:
1. If you accidentally mess up the default SCCs, you can reset them with: oc adm policy reconcile-sccs --confirm
2. SCC priority: a service account user can be added to multiple SCCs. When a service account user has several SCCs available, they are ordered by the following rules:
   - The SCC with the highest priority comes first. By default, for pods of users with the cluster-admin role, the anyuid SCC is given the highest priority and sorts first; this lets cluster administrators run pods as any user without setting the RunAsUser field in the pod's SecurityContext.
   - If priorities are equal, the SCCs are sorted from most restrictive to least restrictive.
   - If both priority and restrictions are equal, they are sorted by name.
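To see which SCC a running pod was actually admitted under, check the pod's openshift.io/scc annotation, e.g. for the nginx3 pod from the example above:

```bash
oc get pod nginx3-4-5bdpj -o yaml | grep 'openshift.io/scc'
# expected output for this example: openshift.io/scc: restricted-ng
```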
If pod networking is broken (pods cannot reach other pods or services), also check whether IPv4 forwarding is enabled on the node.

How to diagnose:

```
# check whether ipv4 forwarding is enabled
sysctl net.ipv4.ip_forward
# 0 means it is disabled
net.ipv4.ip_forward = 0
```

How to fix:

```
# this will turn things back on for a live server
sysctl -w net.ipv4.ip_forward=1
# on CentOS this will make the setting persist across reboots
echo net.ipv4.ip_forward=1 >> /etc/sysctl.d/10-ipv4-forwarding-on.conf
# reload all sysctl config files and verify
sysctl --system
```