# kubectl get pod
NAME                           READY   STATUS              RESTARTS   AGE
demo-volume-5f974bf75c-tpkxv   0/1     ContainerCreating   0          3m48s

# kubectl describe pod demo-volume-5f974bf75c-tpkxv
Events:
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    2m19s  default-scheduler  Successfully assigned troubleshot/demo-volume-5f974bf75c-tpkxv to ubuntu
  Warning  FailedMount  16s    kubelet, ubuntu    Unable to mount volumes for pod "demo-volume-5f974bf75c-tpkxv_troubleshot(45985214-b1f6-11e9-8cb3-001c4209f822)": timeout expired waiting for volumes to attach or mount for pod "troubleshot"/"demo-volume-5f974bf75c-tpkxv". list of unmounted volumes=[share]. list of unattached volumes=[share default-token-2msph]
  Warning  FailedMount  13s    kubelet, ubuntu    MountVolume.SetUp failed for volume "demo-pv" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/45985214-b1f6-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv --scope -- mount -t nfs 192.168.4.130:/opt/add-dev/nfs/ /var/lib/kubelet/pods/45985214-b1f6-11e9-8cb3-001c4209f822/volumes/kubernetes.io~nfs/demo-pv
Output: Running scope as unit run-r179d447154fc42fa9f1108382cf846df.scope.
mount.nfs: Connection timed out
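One quick way to narrow this down is to reproduce the mount on the affected node itself; a minimal sketch, assuming nfs-utils is installed on the node (the server address and export path are taken from the mount log above):

# run on the node that reported FailedMount (ubuntu in this case)
showmount -e 192.168.4.130                         # is the export visible from this node?
mount -t nfs 192.168.4.130:/opt/add-dev/nfs /mnt   # can the node mount it manually?
# a "Connection timed out" here points at the NFS server address, a firewall,
# or the nfs service itself rather than at Kubernetes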
# kubectl get pod
NAME                          READY   STATUS    RESTARTS   AGE
demo-volume-78d654c4d-2cwdx   1/1     Running   0          6m22s

# kubectl exec -it demo-volume-78d654c4d-2cwdx bash
root@demo-volume-78d654c4d-2cwdx:/usr/local/tomcat# df -h
Filesystem                       Size  Used  Avail  Use%  Mounted on
overlay                          48G   17G   32G    35%   /
tmpfs                            64M   0     64M    0%    /dev
tmpfs                            16G   0     16G    0%    /sys/fs/cgroup
192.168.4.133:/opt/add-dev/nfs   100G  62G   39G    62%   /tmp
/dev/mapper/centos-root          48G   17G   32G    35%   /etc/hosts
shm                              64M   0     64M    0%    /dev/shm
tmpfs                            16G   12K   16G    1%    /run/secrets/kubernetes.io/serviceaccount
tmpfs                            16G   0     16G    0%    /proc/acpi
tmpfs                            16G   0     16G    0%    /proc/scsi
tmpfs                            16G   0     16G    0%    /sys/firmware
Everything is normal now; entering the container and running df -h confirms the volume is mounted.
B. Pod stuck in ImagePullBackOff or ErrImagePull
This error is self-explanatory: the image pull failed. Use kubectl get pod -owide to see which node the pod is scheduled on, then try a manual docker pull on that node. For images hosted outside the firewall, such as gcr.io, you may need to set up a proxy yourself.
Also note: if you are using a self-hosted registry that is not served over HTTPS, but the image address in the pod event starts with https://, your Docker configuration is missing the Insecure Registries setting, as sketched below.
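A minimal sketch of that setting, assuming a hypothetical registry address registry.example.com:5000; edit the file on every node and restart docker afterwards:

# /etc/docker/daemon.json  (the registry address is a placeholder, use your own)
{
  "insecure-registries": ["registry.example.com:5000"]
}

# apply the change, then delete the pod so the image is pulled again
systemctl restart docker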
Pod events show start container
If start container appears in the events, the pod was scheduled successfully, the image was pulled, and the volumes were mounted (which does not necessarily mean read/write permissions are correct).
The container may still fail to start and restart repeatedly because of a misconfigured health check, a broken startup script, or missing storage or runtime permissions.
First, as always, run kubectl describe pod and read the events; if nothing there looks wrong, check the pod log.
A. Liveness probe failure
First understand what liveness does: when the probe fails as many times as configured, the container is restarted. Restarts caused by this are stated explicitly in the events, so the fix is simply to adjust the settings: delay the initial probe, or tune the probe parameters or port. It is easy to solve and relatively rare; a configuration sketch follows the example below.
# kubectl get pod -owide
NAME                                  READY   STATUS    RESTARTS   AGE     IP           NODE     NOMINATED NODE   READINESS GATES
demo-liveness-fail-6b47f5bc74-n8qj4   1/1     Running   2          3m54s   10.68.0.56   ubuntu   <none>           <none>

# kubectl describe pod demo-liveness-fail-6b47f5bc74-n8qj4
  Normal   Pulling    2m42s (x4 over 4m49s)  kubelet, ubuntu  pulling image "tomcat"
  Warning  Unhealthy  2m42s (x3 over 2m48s)  kubelet, ubuntu  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    2m42s                  kubelet, ubuntu  Killing container with id docker://demo-liveness-fail:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled     2m22s (x2 over 3m30s)  kubelet, ubuntu  Successfully pulled image "tomcat"
  Normal   Created    2m22s (x2 over 3m30s)  kubelet, ubuntu  Created container
  Normal   Started    2m21s (x2 over 3m30s)  kubelet, ubuntu  Started container
  Warning  Unhealthy  92s (x3 over 98s)      kubelet, ubuntu  Liveness probe failed: Get http://10.68.0.56:8080/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
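The knobs mentioned above live in the container spec of the deployment; a minimal sketch, with illustrative path, port, and values rather than the demo's actual settings:

livenessProbe:
  httpGet:
    path: /healthz           # must be a URL the application really serves, otherwise you get the 404 above
    port: 8080
  initialDelaySeconds: 30    # give the application time to start before the first probe
  periodSeconds: 10
  timeoutSeconds: 5          # raise this if you see "Client.Timeout exceeded while awaiting headers"
  failureThreshold: 3        # the container is restarted only after this many consecutive failures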
# oc new-project test1
# oc run nginx --image=nginx --port=80
# oc get pod
NAME            READY   STATUS   RESTARTS   AGE
nginx-1-hx8gj   0/1     Error    0          10s

# oc logs nginx-1-hx8gj
2019/07/31 14:33:26 [warn] 1#1: the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /etc/nginx/nginx.conf:2
nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /etc/nginx/nginx.conf:2
2019/07/31 14:33:26 [emerg] 1#1: mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
nginx: [emerg] mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
oc new-project test2
oc run nginx2 --image=nginx --port=80
oc get pod
oc create serviceaccount ngroot
oc adm policy add-scc-to-user anyuid -z ngroot
oc patch dc/nginx2 --patch '{"spec":{"template":{"spec":{"serviceAccountName": "ngroot"}}}}'

# oc get pod
NAME             READY   STATUS    RESTARTS   AGE
nginx2-2-mkrdp   1/1     Running   0          12s
Solution 3: instead of granting full anyuid, copy the restricted SCC and add only the permissions nginx needs.
First export the restricted SCC to a yaml file and make a copy:

oc get scc restricted --export -o yaml > restricted.yaml
cp restricted.yaml restricted-ng.yaml
vim restricted-ng.yaml

Edit restricted-ng.yaml:
- name: change restricted to restricted-ng
- runAsUser: change MustRunAsRange to RunAsAny
- groups: delete the system:authenticated line, otherwise this SCC would replace the default SCC for every project
- priority: change null to 5, so it ranks above the default restricted SCC

My finished restricted-ng.yaml is at https://raw.githubusercontent.com/cai11745/k8s-ocp-yaml/master/yaml-file/troubleshooting/restricted-ng.yaml
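For orientation, a fragment showing only the fields changed from the exported restricted SCC (illustrative; see the link above for the complete file):

# fragment of restricted-ng.yaml -- unchanged fields are omitted
kind: SecurityContextConstraints
metadata:
  name: restricted-ng         # renamed from "restricted"
priority: 5                   # was null; must rank above the default restricted SCC
runAsUser:
  type: RunAsAny              # was MustRunAsRange
groups: []                    # system:authenticated removed so other projects keep their default SCC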
Import the new SCC:

oc apply -f restricted-ng.yaml
oc get scc

Create a new project and application for testing:

oc new-project test3
oc run nginx3 --image=nginx --port=80

Create a new serviceaccount and bind it to the SCC created above:

oc create serviceaccount ng3root
oc adm policy add-scc-to-user restricted-ng -z ng3root
oc patch dc/nginx3 --patch '{"spec":{"template":{"spec":{"serviceAccountName": "ng3root"}}}}'

[root@origin311 ~]# oc get pod
NAME             READY   STATUS    RESTARTS   AGE
nginx3-2-wsk97   1/1     Running   0          20s
[root@origin311 ~]# oc logs nginx3-2-wsk97
2019/07/31 15:52:46 [emerg] 7#7: setgid(101) failed (1: Operation not permitted)
2019/07/31 15:52:46 [alert] 1#1: worker process 7 exited with fatal code 2 and cannot be respawned

There is still one error: the process is not allowed to setgid.

oc edit scc restricted-ng

Edit requiredDropCapabilities: this list holds the capabilities that are dropped. Delete both - SETUID and - SETGID; if you delete only SETGID, the pod will next fail with the corresponding setuid error. The resulting list is sketched after this walkthrough.

oc rollout latest nginx3

[root@origin311 ~]# oc get pod -owide
NAME             READY   STATUS    RESTARTS   AGE   IP             NODE                    NOMINATED NODE
nginx3-4-5bdpj   1/1     Running   0          1m    10.128.0.126   origin311.localpd.com   <none>
[root@origin311 ~]# oc logs nginx3-4-5bdpj
[root@origin311 ~]#

No errors means it is working; a test request confirms it:

[root@origin311 ~]# curl 10.128.0.126
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
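For reference, after the edit described above the requiredDropCapabilities list in restricted-ng would typically look like this (the stock restricted SCC usually drops KILL, MKNOD, SETUID and SETGID; only the first two remain):

requiredDropCapabilities:
- KILL
- MKNOD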
FAQ:
1. If you accidentally mess up the default SCCs, you can reset them with: oc adm policy reconcile-sccs --confirm
2. SCC priority: a service account user can be added to multiple SCCs. When several SCCs are available to a service account user, they are ordered as follows:
   - The SCC with the highest priority comes first. By default, for pods of users with the cluster admin role, the anyuid SCC is given the highest priority and sorts first; this lets cluster administrators run pods as any user without setting the RunAsUser field in the pod's SecurityContext.
   - If priorities are equal, the SCCs are sorted from most restrictive to least restrictive.
   - If both priority and restrictions are equal, they are sorted by name.
How to diagnose:

# check whether ipv4 forwarding is enabled
sysctl net.ipv4.ip_forward
# 0 means it is disabled
net.ipv4.ip_forward = 0

How to fix:

# this will turn things back on for a live server
sysctl -w net.ipv4.ip_forward=1
# on CentOS this will make the setting apply after reboot
echo net.ipv4.ip_forward=1 >> /etc/sysctl.d/10-ipv4-forwarding-on.conf
# verify and apply
sysctl -p /etc/sysctl.d/10-ipv4-forwarding-on.conf