Kubernetes Troubleshooting in Practice: A Handbook for Pods That Won't Start, OOM Kills, and Image Pull Failures

Lao Zhang · 2026/4/30 · about 6 minutes

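Audience: engineers who hit Pod problems on K8s and don't know where to start | Reading time: about 17 minutes | Key takeaway: a systematic troubleshooting approach plus direct fixes for the most common problems

It was 1:23 a.m. and I was still staring at a Pod that simply would not start.

Its status was stuck in CrashLoopBackOff, the error in the logs was not clear enough, and three passes through kubectl describe turned up nothing new. I suspect many people know the feeling: you have read a pile of output and still cannot find the root cause.

Afterwards I put together a troubleshooting framework. Almost every Pod problem I have hit since can be located quickly with this flow.

The framework: status first, then events, then logs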

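# Step 1: check the Pod status
kubectl get pods -n <namespace> -o wide

# Step 2: check the Pod's detailed events
kubectl describe pod <pod-name> -n <namespace>

# Step 3: check the container logs
kubectl logs <pod-name> -n <namespace>                    # current logs
kubectl logs <pod-name> -n <namespace> --previous         # logs from the previous crash
kubectl logs <pod-name> -n <namespace> -c <container>     # pick a container in a multi-container Pod

# Step 4: debug inside the container (it must be running and ship a shell)
kubectl exec -it <pod-name> -n <namespace> -- sh

# Step 5: check the node status
kubectl get nodes
kubectl describe node <node-name>

Remember this order and don't jump straight to the logs. Many problems never reach the point of producing logs at all, such as image pull failures or scheduling failures caused by insufficient resources.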

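Scenario 1: Pod stuck in Pending

Pending usually means the Pod has not been scheduled onto a node yet, so none of its containers have started. (The phase also covers time spent pulling images after scheduling.)

kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events:"

Common Events and their fixes:

1. Insufficient resources

0/8 nodes are available: 8 Insufficient cpu, 8 Insufficient memory

# Check the resources already allocated on each node
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top nodes    # requires metrics-server

# Check the Pod's resource requests
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

Fixes:

Add nodes to the cluster
Lower the Pod's resources.requests (if they are set too high)
Check whether a ResourceQuota caps the namespace's total resources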

2. PVC binding failure

0/8 nodes are available: 8 node(s) didn't find available persistent volumes

kubectl get pvc -n <namespace>
# If STATUS is Pending, the PVC found no matching PV, or the StorageClass is misconfigured
kubectl describe pvc <pvc-name> -n <namespace>

3. Node affinity / taints not satisfied

0/8 nodes are available: 3 node(s) had taint that pod didn't tolerate, 5 node(s) didn't match node selector

Check that the Pod's nodeSelector/affinity/tolerations match the labels and taints on the existing nodes.

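Scenario 2: CrashLoopBackOff

The container crashes right after it starts, and K8s keeps restarting it with a growing delay between restarts (exponential backoff).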
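To make the exponential backoff concrete, here is a minimal sketch of how the wait between restarts grows. It assumes the kubelet defaults as I understand them (a 10-second initial delay that doubles each time and is capped at 300 seconds); backoff_delay is a made-up helper name for illustration, not a kubectl feature.

```shell
# backoff_delay N: print the assumed wait (seconds) before restart number N (1-based).
# Assumption: 10s initial delay, doubling per restart, capped at 300s.
backoff_delay() (
    attempt=$1
    delay=10
    i=1
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt 300 ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt 300 ]; then
        delay=300
    fi
    echo "$delay"
)

for n in 1 2 3 4 5 6 7; do
    echo "restart $n: wait $(backoff_delay "$n")s"
done
```

This is why a Pod that has been crash-looping for a while seems to "do nothing" for minutes at a time: it is simply waiting out the capped delay before the next attempt.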

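# Key: read the logs from the previous crash
kubectl logs <pod-name> -n <namespace> --previous

# Check the exit code
kubectl describe pod <pod-name> -n <namespace> | grep "Last State" -A5
# Exit Code 1: the application exited with an error, e.g. os.Exit(1) in the code or an unhandled exception
# Exit Code 137: killed by SIGKILL (128 + 9); usually an OOM kill, confirmed when Reason is OOMKilled
# Exit Code 139: segmentation fault (SIGSEGV, 128 + 11)
# Exit Code 143: terminated by SIGTERM (128 + 15)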
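
The exit codes above follow one rule: anything over 128 is 128 plus the number of the signal that killed the process. A small sketch of that decoding follows; describe_exit_code is a hypothetical helper, and the signal numbers assume Linux.

```shell
# describe_exit_code CODE: explain a container exit code.
# Assumption: Linux signal numbering (9 = SIGKILL, 11 = SIGSEGV, 15 = SIGTERM).
describe_exit_code() (
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "exit 0: clean exit"
    elif [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        case "$sig" in
            9)  name=SIGKILL ;;
            11) name=SIGSEGV ;;
            15) name=SIGTERM ;;
            *)  name="signal $sig" ;;
        esac
        echo "exit $code: killed by $name (128 + $sig)"
    else
        echo "exit $code: the process exited on its own with an error status"
    fi
)

describe_exit_code 137
describe_exit_code 1
```

A 137 only tells you the process received SIGKILL. Whether the kernel's OOM killer sent it still has to be confirmed from the Reason field in kubectl describe.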

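Pitfall 1: the application cannot find its config file and fails to start

Symptom: CrashLoopBackOff; the logs show Config file not found: /app/config/application.yml

Cause: the ConfigMap mount path is wrong, or the ConfigMap is missing. (A missing ConfigMap that is not marked optional usually shows up as a Pod stuck in ContainerCreating with a FailedMount event rather than as a crash loop.)

Investigation:

# Check that the ConfigMap exists
kubectl get configmap app-config -n <namespace>

# Check the mount configuration
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A10 "volumeMounts"
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A10 "volumes"

# Confirm inside the container that the file is mounted (only works while the container is up)
kubectl exec -it <pod-name> -n <namespace> -- ls /app/config/

Fix: correct the ConfigMap name or the mount path.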

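Pitfall 2: Java application OOM, Exit Code 137

Symptom: the Pod shows OOMKilled and restarts several times a day, especially at traffic peaks.

Investigation:

# Current memory usage (needs metrics-server); for usage history, query a monitoring system such as Prometheus
kubectl top pod <pod-name> -n <namespace>

# Check the JVM heap settings (no -it: the output is piped, so no TTY is needed)
kubectl exec <pod-name> -n <namespace> -- \
    java -XX:+PrintFlagsFinal -version 2>&1 | grep -i heapsize

Common causes:

limits.memory is set too low
The JVM is not container-aware, so it sizes the heap from the host's memory and overshoots the container limit. -XX:+UseContainerSupport is on by default since JDK 10 and 8u191; older JVMs lack it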

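Fix:

env:
# Only takes effect if the container's start command passes $JAVA_OPTS to java
- name: JAVA_OPTS
  value: >-
    -XX:+UseContainerSupport
    -XX:MaxRAMPercentage=75.0
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/tmp/heapdump.hprof

With HeapDump enabled, the next time the JVM throws OutOfMemoryError you can analyze the heap snapshot to find the leak. Note that a kernel OOM kill (Exit Code 137) ends the process with SIGKILL before any dump is written, and /tmp is lost on container restart unless it sits on a mounted volume.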
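
To see what MaxRAMPercentage=75.0 leaves for everything that is not heap (metaspace, thread stacks, direct buffers), here is a rough back-of-the-envelope sketch. heap_budget is a hypothetical helper, and the 2048 MiB limit is an example value I picked, not one from the incident above.

```shell
# heap_budget LIMIT_MIB PERCENT: split a container memory limit into
# the maximum heap (PERCENT of the limit) and what remains for non-heap memory.
# Integer arithmetic only; PERCENT is a whole number such as 75.
heap_budget() (
    limit_mib=$1
    percent=$2
    heap=$((limit_mib * percent / 100))
    rest=$((limit_mib - heap))
    echo "heap=${heap}MiB rest=${rest}MiB"
)

heap_budget 2048 75   # heap=1536MiB rest=512MiB
```

If the application's non-heap usage is larger than the "rest" figure, the container can still be OOM-killed even though the heap itself never fills up, so either raise limits.memory or lower the percentage.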

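Scenario 3: ImagePullBackOff / ErrImagePull

The image pull failed.

kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Failed"

Common errors and fixes:

1. Image does not exist

Failed to pull image "myapp:v99": rpc error: ... manifest unknown

The image tag is wrong, or the image has not been pushed to the registry yet. Check:

# Verify locally that the image exists in the registry
docker pull registry.example.com/myapp:v99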

2. Private registry authentication failure

Failed to pull image: unauthorized: access denied

# Create a registry credential Secret
kubectl create secret docker-registry regcred \
    --docker-server=registry.example.com \
    --docker-username=username \
    --docker-password=password \
    -n <namespace>

# Reference it in the Pod spec
spec:
  imagePullSecrets:
  - name: regcred

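Pitfall 3: a stale image cached on the node

Symptom: the image was updated, but the Pod still runs the old code. kubectl describe pod shows the right tag, yet the behavior is the old version.

Cause: the node already holds an older image under the same tag, and imagePullPolicy: IfNotPresent (the default for any tag other than :latest) does not pull again.

Fixes:

Use a different tag for every build (commit hash or timestamp) instead of a fixed tag
Temporary fix: manually delete the old image on the node

# On the affected node
crictl images | grep myapp
crictl rmi <image-id>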
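
Which pull policy a container gets when imagePullPolicy is omitted depends on the image reference, which is easy to forget. The sketch below encodes the defaulting rules as I understand them from the Kubernetes documentation; default_pull_policy is a hypothetical helper and does not query the cluster.

```shell
# default_pull_policy IMAGE: print the policy Kubernetes is assumed to apply
# when imagePullPolicy is omitted.
#   digest (@sha256:...)  -> IfNotPresent
#   tag :latest or no tag -> Always
#   any other tag         -> IfNotPresent
default_pull_policy() (
    image=$1
    name=${image##*/}   # drop the registry/repo path so a registry port (:5000) is not read as a tag
    case "$name" in
        *@*)      echo IfNotPresent ;;
        *:latest) echo Always ;;
        *:*)      echo IfNotPresent ;;
        *)        echo Always ;;
    esac
)

default_pull_policy registry.example.com/myapp:v99
default_pull_policy registry.example.com:5000/myapp
```

So a fixed tag such as myapp:v99 is exactly the case where a rebuilt image is silently ignored on nodes that already cached it.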

Scenario 4: Pod is Running but unreachable

The Pod is up, but the Service cannot be reached, or the application returns errors.

Check that the Service selector matches the Pod labels

# Show the Service's selector
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'

# Show the Pod's labels
kubectl get pod <pod-name> -n <namespace> --show-labels

# Show the Service's Endpoints (should list Pod IPs)
kubectl get endpoints <service-name> -n <namespace>
# If Endpoints is empty, the selector matches no Pod, or no matching Pod is Ready
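
The matching rule itself is simple: every key=value pair in the Service selector must appear in the Pod's labels, and extra Pod labels are ignored. Here is a minimal sketch of that check on comma-separated strings in the style of --show-labels output; selector_matches and the sample labels are made up for illustration.

```shell
# selector_matches SELECTOR LABELS: succeed if every key=value pair in SELECTOR
# also appears in LABELS. Both arguments are comma-separated key=value lists.
selector_matches() (
    selector=$1
    labels=",$2,"       # wrap in commas so each pair is matched whole
    IFS=,
    for pair in $selector; do
        case "$labels" in
            *",$pair,"*) ;;       # this pair is present, keep checking
            *) exit 1 ;;          # one missing pair means no match
        esac
    done
    exit 0
)

if selector_matches "app=web,tier=api" "app=web,version=v2"; then
    echo "match"
else
    echo "no match"
fi
```

A single typo in one label value is enough to leave Endpoints empty, which is why comparing the two lists side by side is worth the thirty seconds.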

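Check the Pod's readiness probe

kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Readiness"
# If the readiness probe keeps failing, the Pod is not added to the Service's Endpoints

Verify the service locally from inside the container

# Exec into the container and curl the application (the image must include curl)
kubectl exec -it <pod-name> -n <namespace> -- curl http://localhost:8080/health

# Test from another Pod in the cluster (to rule out network problems)
kubectl run debug --image=curlimages/curl --rm -it --restart=Never \
    -n <namespace> -- curl http://<service-name>.<namespace>:8080/health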

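Scenario 5: node problems causing many Pods to fail

# Check node status
kubectl get nodes
# Pods on a node that stays NotReady get evicted (by default after about 5 minutes)

# Check node events
kubectl describe node <node-name> | tail -30

# List the Pods running on the node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# Check usage when a node is under heavy load
kubectl top nodes

Common node problems:

Node memory pressure: kubelet evicts lower-priority Pods
Node disk full: Pod logs or the overlay filesystem fill it up
kubelet cannot reach the API Server: the node status becomes NotReady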

Common debugging toolbox

# Temporary debug container (does not affect the running application)
kubectl debug -it <pod-name> \
    --image=busybox \
    --target=<container-name> \
    -n <namespace>

# Cluster events (sorted by time)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Warning events across all namespaces
kubectl get events -A --field-selector type=Warning

# Full YAML of a Pod (including runtime status)
kubectl get pod <pod-name> -n <namespace> -o yaml

# Force-delete a stuck Pod (use with caution)
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

A troubleshooting flowchart

Pod has a problem
    │
    ├─ Pending ─────────────────► kubectl describe, read Events
    │                                  ├─ Insufficient resources → scale out / lower requests
    │                                  ├─ PVC problem → check PVC/StorageClass
    │                                  └─ Scheduling constraints → check node selector/taint
    │
    ├─ CrashLoopBackOff ────────► kubectl logs --previous
    │                                  ├─ Exit 137 → OOM, add memory / check JVM settings
    │                                  ├─ Exit 1   → application error, read the detailed logs
    │                                  └─ File not found → check ConfigMap/mount
    │
    ├─ ImagePullBackOff ────────► kubectl describe, read Events
    │                                  ├─ Tag does not exist → check the image tag
    │                                  └─ Auth failure → check imagePullSecrets
    │
    └─ Running, unreachable ────► kubectl get endpoints, look for Pod IPs
                                       ├─ Endpoints empty → check the Service selector
                                       └─ Endpoints present → curl locally inside the container
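
The first column of the flowchart can also be written down as a tiny lookup, which is handy to paste into a runbook. This is only a sketch of the chart above: first_check is a hypothetical helper that prints the command to run first and does not touch the cluster.

```shell
# first_check STATUS: print the first command to run for a given Pod status,
# following the flowchart. Placeholders in angle brackets are left for the reader to fill in.
first_check() (
    case "$1" in
        Pending)
            echo 'kubectl describe pod <pod-name> -n <namespace>' ;;
        CrashLoopBackOff)
            echo 'kubectl logs <pod-name> -n <namespace> --previous' ;;
        ImagePullBackOff|ErrImagePull)
            echo 'kubectl describe pod <pod-name> -n <namespace>' ;;
        Running)
            echo 'kubectl get endpoints <service-name> -n <namespace>' ;;
        *)
            # Unknown status: fall back to the events, as the framework suggests
            echo 'kubectl describe pod <pod-name> -n <namespace>' ;;
    esac
)

first_check CrashLoopBackOff
```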

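K8s troubleshooting is a craft built on accumulated experience. But if you follow the order "status → events → logs → in-container debugging", in my experience about 80% of problems can be located within 10 minutes.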