Kubernetes: Scenario-Based Questions
5. A Kubernetes pod is stuck in a CrashLoopBackOff state. How do you investigate and fix it?
CrashLoopBackOff indicates a pod is failing repeatedly and being restarted by Kubernetes. It’s a symptom of unhandled failures within the containerized application or misconfigurations in the pod definition.
🔍 Initial Investigation Steps
- Check Pod Events: Use `kubectl describe pod <pod-name>` to view failure reasons, container statuses, and recent events.
- Inspect Logs: Run `kubectl logs <pod-name>`, or `kubectl logs <pod-name> -c <container>` for multi-container pods; the sketch after this list walks through the sequence.
- Validate Liveness/Readiness Probes: Misconfigured health checks often trigger restarts.
- Resource Limits: A memory limit that is too low leads to OOMKilled restarts, and tight CPU limits can throttle slow-starting apps until their probes fail.
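A minimal investigation sketch, assuming a hypothetical pod named `payments-api-xyz` in a `prod` namespace (both names are placeholders):

```bash
# Container state, restart count, and recent events (probe failures, OOMKilled, image errors)
kubectl describe pod payments-api-xyz -n prod

# Logs from the current (or most recently started) container
kubectl logs payments-api-xyz -n prod

# For multi-container pods, target a specific container
kubectl logs payments-api-xyz -n prod -c app

# Logs from the previous, crashed instance -- usually where the real error is
kubectl logs payments-api-xyz -n prod --previous
```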
🛠 Common Causes
- Application Exit Code ≠ 0: Entry point script or server crashing with an error.
- Bad Configuration: Environment variable not set, incorrect DB URL, or missing file paths.
- Probe Misfires: Liveness probe fails due to incorrect endpoint or aggressive timing.
- Init Container Failures: The pod won’t start if its init containers don’t complete. The commands below help confirm which of these causes applies.
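The pod status itself records why the last container died, which quickly separates these causes. A hedged sketch (pod and namespace names are placeholders; the `[0]` index assumes the first container is the one crashing):

```bash
# Reason for the last termination: "OOMKilled", "Error", "Completed", etc.
kubectl get pod payments-api-xyz -n prod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Exit code of the last termination (e.g. 1 = app error, 137 = SIGKILL/OOM, 143 = SIGTERM)
kubectl get pod payments-api-xyz -n prod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Probe and init-container problems show up in the event stream
kubectl describe pod payments-api-xyz -n prod | grep -iE 'unhealthy|probe|init'
```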
🧪 Diagnostic Tools
- `kubectl get pods -A`: Identify all failing pods across namespaces; the sketch below chains these commands together.
- `kubectl describe pod`: Details about restarts, status, and events.
- `kubectl logs --previous`: Logs from the last failed container state.
- Enable metrics-server and use Lens or Prometheus dashboards for deeper container-level metrics.
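A small sketch that combines these tools to sweep the cluster for crash-looping pods and print the logs of each one's last failed run (nothing here is specific to a real workload; it reads names straight from the cluster):

```bash
#!/usr/bin/env bash
# List every pod currently in CrashLoopBackOff across all namespaces,
# then dump the previous (crashed) container's last 50 log lines.
kubectl get pods -A --no-headers \
  | awk '$4 == "CrashLoopBackOff" {print $1, $2}' \
  | while read -r ns pod; do
      echo "=== ${ns}/${pod} ==="
      kubectl logs "${pod}" -n "${ns}" --previous --tail=50 || true
    done
```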
✅ Remediation
- Fix the actual app issue (e.g., null pointer, port binding, DB unreachable).
- Update probe paths, delays, and thresholds for graceful startup.
- Temporarily disable probes using `kubectl patch` if needed for deeper debugging; a hedged example follows this list.
- Use ephemeral containers or `kubectl debug` for live diagnosis (K8s 1.18+).
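A sketch of both remediation aids, assuming a hypothetical Deployment named `payments-api` whose first container carries the misbehaving liveness probe (the names, namespace, and busybox image are placeholders, and the patch should be reverted by re-applying the real manifest):

```bash
# Temporarily remove the liveness probe from the first container so the pod
# stays up long enough to inspect it
kubectl patch deployment payments-api -n prod --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'

# Attach an ephemeral debug container that targets the crashing container's
# process namespace for live diagnosis
kubectl debug -it payments-api-xyz -n prod \
  --image=busybox:1.36 --target=app -- sh
```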
🚫 Mistakes to Avoid
- Assuming the pod is fine without reading logs.
- Force-restarting the pod repeatedly without understanding the root cause.
- Modifying core deployment YAMLs in production without version control.
📌 Real-World Insight
In production clusters, CrashLoopBackOff is often caught via alerting systems. Mature teams pair logs with metrics (e.g., Prometheus + Loki) and standardize health checks to quickly diagnose and minimize downtime.