Kubernetes: Scenario-Based Questions
5. A Kubernetes pod is stuck in a CrashLoopBackOff state. How do you investigate and fix it?
CrashLoopBackOff indicates a pod is failing repeatedly and being restarted by Kubernetes. It’s a symptom of unhandled failures within the containerized application or misconfigurations in the pod definition.
🔍 Initial Investigation Steps
- Check Pod Events: Use `kubectl describe pod <pod-name>` to view failure reasons, container statuses, and recent events.
- Inspect Logs: Run `kubectl logs <pod-name>`, or `kubectl logs <pod-name> -c <container>` for multi-container pods; the sketch after this list walks through the sequence.
- Validate Liveness/Readiness Probes: Misconfigured health checks often trigger restarts.
- Resource Limits: A memory limit that is too low leads to OOMKilled restarts, and tight CPU limits can throttle slow-starting apps until their probes fail.
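A minimal investigation sketch, assuming a hypothetical pod named `payments-api-xyz` in a `prod` namespace (both names are placeholders):

```bash
# Container state, restart count, and recent events (probe failures, OOMKilled, image errors)
kubectl describe pod payments-api-xyz -n prod

# Logs from the current (or most recently started) container
kubectl logs payments-api-xyz -n prod

# For multi-container pods, target a specific container
kubectl logs payments-api-xyz -n prod -c app

# Logs from the previous, crashed instance -- usually where the real error is
kubectl logs payments-api-xyz -n prod --previous
```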
🛠 Common Causes
- Application Exit Code ≠ 0: Entry point script or server crashing with an error.
- Bad Configuration: Environment variable not set, incorrect DB URL, or missing file paths.
- Probe Misfires: Liveness probe fails due to incorrect endpoint or aggressive timing.
- Init Container Failures: The pod won’t start if its init containers don’t complete. The commands below help confirm which of these causes applies.
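The pod status itself records why the last container died, which quickly separates these causes. A hedged sketch (pod and namespace names are placeholders; the `[0]` index assumes the first container is the one crashing):

```bash
# Reason for the last termination: "OOMKilled", "Error", "Completed", etc.
kubectl get pod payments-api-xyz -n prod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Exit code of the last termination (e.g. 1 = app error, 137 = SIGKILL/OOM, 143 = SIGTERM)
kubectl get pod payments-api-xyz -n prod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Probe and init-container problems show up in the event stream
kubectl describe pod payments-api-xyz -n prod | grep -iE 'unhealthy|probe|init'
```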
🧪 Diagnostic Tools
- `kubectl get pods -A`: Identify all failing pods across namespaces; the sketch below chains these commands together.
- `kubectl describe pod`: Details about restarts, status, and events.
- `kubectl logs --previous`: Logs from the last failed container state.
- Enable metrics-server and use Lens or Prometheus dashboards for deeper container-level metrics.
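A small sketch that combines these tools to sweep the cluster for crash-looping pods and print the logs of each one's last failed run (nothing here is specific to a real workload; it reads names straight from the cluster):

```bash
#!/usr/bin/env bash
# List every pod currently in CrashLoopBackOff across all namespaces,
# then dump the previous (crashed) container's last 50 log lines.
kubectl get pods -A --no-headers \
  | awk '$4 == "CrashLoopBackOff" {print $1, $2}' \
  | while read -r ns pod; do
      echo "=== ${ns}/${pod} ==="
      kubectl logs "${pod}" -n "${ns}" --previous --tail=50 || true
    done
```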
✅ Remediation
- Fix the actual app issue (e.g., null pointer, port binding, DB unreachable).
- Update probe paths, delays, and thresholds for graceful startup.
- Temporarily disable probes using `kubectl patch` if needed for deeper debugging; a hedged example follows this list.
- Use ephemeral containers or `kubectl debug` for live diagnosis (K8s 1.18+).
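A sketch of both remediation aids, assuming a hypothetical Deployment named `payments-api` whose first container carries the misbehaving liveness probe (the names, namespace, and busybox image are placeholders, and the patch should be reverted by re-applying the real manifest):

```bash
# Temporarily remove the liveness probe from the first container so the pod
# stays up long enough to inspect it
kubectl patch deployment payments-api -n prod --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'

# Attach an ephemeral debug container that targets the crashing container's
# process namespace for live diagnosis
kubectl debug -it payments-api-xyz -n prod \
  --image=busybox:1.36 --target=app -- sh
```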
🚫 Mistakes to Avoid
- Assuming the pod is fine without reading logs.
- Force-restarting the pod repeatedly without understanding the root cause.
- Modifying core deployment YAMLs in production without version control.
📌 Real-World Insight
In production clusters, CrashLoopBackOff is often caught via alerting systems. Mature teams pair logs with metrics (e.g., Prometheus + Loki) and standardize health checks to quickly diagnose and minimize downtime.