Orchestration platforms like Kubernetes make distributed systems manageable—until they don't. The scenarios that rarely happen in development—a node losing network connectivity mid-deployment, a container that passes liveness probes but fails to serve traffic, or a ConfigMap update that corrupts downstream services—are the ones that cause the longest outages in production. This guide is for teams who want to move beyond basic orchestration and systematically address the edge cases that erode reliability.
Where Edge Cases Actually Show Up in Production
Most orchestration edge cases don't appear during initial setup. They emerge after weeks or months of operation, often triggered by scaling events, upgrades, or routine maintenance. One common scenario is a multi-tenant cluster where one team's deployment accidentally exhausts a shared resource, like node ports or IP addresses, causing other services to fail. Another is the classic "rolling update that never finishes": a new pod fails its readiness check, so the rollout stalls. If the update strategy allowed old pods to be terminated before their replacements were ready (for example, through an aggressive maxUnavailable setting), the service can be left with zero healthy replicas.
The Network Partition That Wasn't a Partition
A team I read about faced a puzzling issue: their application would intermittently time out on database calls, but only during peak traffic. After days of debugging, they discovered that the application was working from stale DNS answers: pods were resolving the database service to an old IP address that pointed to a decommissioned node. The orchestrator's service discovery worked correctly, but the application's DNS resolver cached results for longer than the orchestrator expected. This mismatch between layers is a classic edge case: neither the orchestrator nor the application is wrong, but together they create a failure mode that's hard to reproduce in test environments.
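One way to narrow this kind of mismatch is to cap how long the application trusts a cached answer, regardless of the TTL it was given. The following is a minimal sketch, not the fix that team used: `TtlCappedCache`, the `resolve` callback, and the injected `clock` are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCappedCache:
    """Caches name -> address lookups, never for longer than max_ttl seconds."""

    def __init__(
        self,
        resolve: Callable[[str], Tuple[str, float]],
        max_ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve  # hypothetical resolver: returns (address, ttl_seconds)
        self._max_ttl = max_ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (address, expires_at)

    def lookup(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]
        address, ttl = self._resolve(name)
        # Honor the record's TTL, but never trust it beyond the cap.
        self._entries[name] = (address, now + min(ttl, self._max_ttl))
        return address
```

With a cap of 30 seconds, a record advertised with a 300-second TTL is looked up again after 30 seconds, which bounds how long a pod can keep talking to a decommissioned address.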
Zombie Pods and Stale Endpoints
Another frequent issue is the "zombie pod"—a container that is still running but no longer registered with the orchestrator's control plane. This can happen when a node experiences a transient network failure: the control plane marks the pod as lost and schedules a replacement, but the original pod continues running and serving traffic. The result is duplicate responses, data corruption, or load balancer confusion. The edge case isn't the node failure itself; it's the race condition between the orchestrator's reconciliation loop and the actual state of the container.
Foundations That Teams Often Misunderstand
Many reliability problems in orchestration stem from a few core concepts that are easy to grasp but hard to internalize. The most common is the difference between liveness, readiness, and startup probes. Teams often use the same endpoint for all three, which defeats their purpose. A liveness probe should indicate whether the application is stuck (e.g., a deadlock), while a readiness probe should indicate whether it can serve traffic (e.g., still loading data). Using a single endpoint means a slow startup can trigger a restart loop, or a temporary overload can cause the orchestrator to kill the pod unnecessarily.
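To make the distinction concrete, here is a small sketch of the state an application might keep behind three separate probe endpoints. The class name `HealthState`, its fields, and the heartbeat-based stall detection are assumptions made for illustration; how you wire these methods to HTTP handlers depends on your framework.

```python
import time
from typing import Callable


class HealthState:
    """Tracks the three separate questions that the three probe types ask."""

    def __init__(self, stall_after: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._stall_after = stall_after
        self._last_heartbeat = clock()
        self.started = False            # set once initialization has finished
        self.accepting_traffic = False  # cleared during overload or while reloading data

    def heartbeat(self) -> None:
        """Called from the main work loop to show the process is not stuck."""
        self._last_heartbeat = self._clock()

    def startup_ok(self) -> bool:
        # Startup probe: has initialization completed?
        return self.started

    def liveness_ok(self) -> bool:
        # Liveness probe: is the process making progress (no deadlock)?
        return self._clock() - self._last_heartbeat < self._stall_after

    def readiness_ok(self) -> bool:
        # Readiness probe: can the pod serve traffic right now?
        return self.started and self.accepting_traffic
```

Because overload only clears `accepting_traffic`, it takes the pod out of rotation without failing liveness, so the orchestrator does not restart a pod that is merely busy.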
Resource Requests vs. Limits
Another foundational misunderstanding is the relationship between resource requests and limits. Requests guarantee resources for scheduling, while limits cap usage. Many teams set requests equal to limits, which makes behavior predictable but reserves peak capacity for every pod even while it sits idle, so the scheduler cannot place other workloads on that capacity and overall utilization drops. Worse, teams forget that limits are enforced via throttling (for CPU) or OOM kills (for memory). A container that bursts memory above its limit gets killed without warning, even if the node has plenty of free memory. The edge case here is a pod that normally uses 200 MB but occasionally spikes to 500 MB during a batch job: setting limits too tight causes crashes, while setting them too loose risks noisy neighbors.
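One rough way to pick starting values for a workload like this is to derive the request from typical usage and the limit from the observed peak plus some headroom. The sketch below is only a heuristic under stated assumptions: the function name, the use of the median, and the 20% default headroom are illustrative choices, not a recommendation from any orchestrator.

```python
import math
from statistics import median
from typing import Sequence, Tuple


def suggest_memory_settings(samples_mb: Sequence[int], headroom_percent: int = 20) -> Tuple[int, int]:
    """Return (request_mb, limit_mb) from observed memory usage samples.

    Assumption: the request tracks typical usage (median), and the limit
    tracks the observed peak plus a safety margin.
    """
    if not samples_mb:
        raise ValueError("need at least one usage sample")
    request_mb = math.ceil(median(samples_mb))
    limit_mb = math.ceil(max(samples_mb) * (100 + headroom_percent) / 100)
    return request_mb, limit_mb


# Mostly ~200 MB with one batch-job spike to 500 MB.
print(suggest_memory_settings([190, 200, 205, 210, 500]))  # (205, 600)
```

The gap between the two numbers is the burst range: the scheduler plans around 205 MB, while the container is only killed if it exceeds 600 MB.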
Pod Disruption Budgets
Pod Disruption Budgets (PDBs) are another area where teams get tripped up. A PDB specifies how many pods of a deployment can be unavailable during voluntary disruptions (like node drains). But many teams set minAvailable to a high number, like 100%, which prevents any voluntary disruption from proceeding. This sounds safe, but it means that if a node needs to be drained for a security patch, the drain will hang because no pods can be evicted. The edge case is a cluster that becomes effectively unmaintainable because the PDBs are too restrictive. The fix is to allow some disruption headroom, like 90% or even 80%, and rely on the orchestrator to reschedule pods quickly. Keep in mind that Kubernetes rounds percentages up to whole pods, so with only three replicas a 90% budget still requires all three to stay available.
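The arithmetic behind a percentage-based budget is worth checking before a drain, because rounding matters at small replica counts. This sketch assumes Kubernetes' documented behavior of rounding a minAvailable percentage up to the next whole pod; the function name is hypothetical.

```python
def allowed_disruptions(healthy_pods: int, expected_pods: int, min_available_percent: int) -> int:
    """Number of pods that may be evicted right now under a minAvailable percentage."""
    # Ceiling division in integer arithmetic: percentages round UP to whole pods.
    desired_healthy = -(-expected_pods * min_available_percent // 100)
    return max(0, healthy_pods - desired_healthy)


for replicas, percent in [(10, 100), (10, 90), (10, 80), (3, 90)]:
    print(replicas, percent, allowed_disruptions(replicas, replicas, percent))
# 10 100 0
# 10 90 1
# 10 80 2
# 3 90 0
```

The last row is the trap: a budget that looks like it leaves headroom allows zero evictions once the replica count is small enough.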
Patterns That Usually Work
Over time, the community has converged on a set of patterns that handle common edge cases gracefully. These aren't silver bullets, but they raise the floor significantly.
Graceful Shutdown with PreStop Hooks
A well-configured PreStop hook is one of the simplest yet most effective patterns. It runs before the orchestrator sends SIGTERM, which gives the container time to finish in-flight requests and close connections while load balancers stop routing new traffic to it. Without it, clients can experience connection resets and data loss. The pattern works best when the hook's duration aligns with the application's typical request duration, usually 10–30 seconds. The edge case to watch for is when the hook itself fails or hangs (e.g., a script that never returns). The grace period clock starts before the hook runs and covers both the hook and the application's SIGTERM handling, so always set a terminationGracePeriodSeconds that exceeds the hook's worst-case duration plus the time the application needs to shut down.
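Because the grace period has to cover both the hook and the application's own shutdown, it can help to check the budget explicitly, for example in a CI lint step. The function below is a sketch with hypothetical names; the 5-second default margin is an arbitrary assumption.

```python
from typing import List


def check_shutdown_budget(
    pre_stop_seconds: int,
    drain_seconds: int,
    grace_period_seconds: int,
    margin_seconds: int = 5,
) -> List[str]:
    """Return a list of problems with a pod's shutdown timing (empty means OK)."""
    problems: List[str] = []
    if pre_stop_seconds >= grace_period_seconds:
        problems.append("preStop hook alone uses the whole grace period")
    needed = pre_stop_seconds + drain_seconds + margin_seconds
    if needed > grace_period_seconds:
        problems.append(f"need {needed}s but grace period is {grace_period_seconds}s")
    return problems


print(check_shutdown_budget(pre_stop_seconds=10, drain_seconds=20, grace_period_seconds=30))
# ['need 35s but grace period is 30s']
```

A 10-second hook plus a 20-second drain looks like it fits in 30 seconds, but leaves no margin before the container is forcibly killed.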
Idempotency and Retry with Backoff
Applications that are designed to handle duplicate events and transient failures are naturally more resilient in orchestrated environments. This means using idempotency keys for API calls, implementing exponential backoff for retries, and avoiding state that depends on exact ordering. The edge case that this pattern addresses is the "at-least-once" delivery semantics that many orchestrators use for events like ConfigMap updates or pod evictions. If the application can't handle duplicates, a simple network retry can cause data corruption.
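A minimal sketch of both halves of this pattern follows: capped exponential backoff with jitter on the caller's side, and deduplication by idempotency key on the receiver's side. All names are hypothetical, and the in-memory dictionary stands in for whatever durable store a real service would need.

```python
import random
from typing import Callable, Dict, Iterator, Optional


def backoff_delays(
    base: float, cap: float, attempts: int, rng: Optional[random.Random] = None
) -> Iterator[float]:
    """Yield one delay per retry: exponential growth, capped, with full jitter."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, ceiling)


class IdempotentHandler:
    """Applies each idempotency key at most once and replays the stored result."""

    def __init__(self, apply: Callable[[dict], str]) -> None:
        self._apply = apply
        self._results: Dict[str, str] = {}  # assumption: in-memory only, lost on restart

    def handle(self, idempotency_key: str, payload: dict) -> str:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # duplicate delivery: no side effects
        result = self._apply(payload)
        self._results[idempotency_key] = result
        return result
```

The jitter spreads retries from many pods over time so they don't arrive in synchronized waves, and the key lookup makes a retried request harmless even if the first attempt actually succeeded.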
Circuit Breakers for External Dependencies
When a pod depends on an external service (like a database or third-party API), a circuit breaker pattern prevents cascading failures. If the external service is slow or failing, the circuit breaker trips and returns a fallback response immediately, rather than letting all pods pile up requests and exhaust resources. The edge case is when the external service recovers but the circuit breaker stays open—this requires a half-open state that probes the service periodically. Implementing this correctly requires careful tuning of thresholds and timeouts.
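The closed, open, and half-open states can be sketched in a few lines. This is a simplified, single-threaded illustration with hypothetical names; it counts consecutive failures only and lets every call through while half-open, where a production breaker would typically limit trial calls and add locking.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed or open."""

    def __init__(
        self,
        failure_threshold: int,
        reset_after: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_after:
            return "half-open"  # enough time has passed to try the dependency again
        return "open"

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # fail fast without touching the dependency
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call re-opens; enough failures while closed opens.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the breaker
        return result
```

The two tuning knobs from the text map directly onto `failure_threshold` (how quickly the breaker trips) and `reset_after` (how long it waits before probing the dependency again).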
Anti-Patterns and Why Teams Revert
Not every popular practice is a good one. Some patterns that seem like best practices actually introduce fragility, and teams often revert them after an incident.
Overloading Readiness Probes
One anti-pattern is making readiness probes check deep dependencies, like database connectivity. The idea is to prevent traffic from reaching a pod that can't serve requests. But in practice, this causes cascading failures: if the database has a brief hiccup, all pods fail their readiness probes, the orchestrator removes them from the service, and the entire service becomes unavailable—even though the application code itself is fine. The correct approach is to use readiness probes for shallow checks (like whether the application has loaded its configuration) and handle deep dependency failures with retries and circuit breakers within the application.
Setting CPU Limits Too Low
Another common mistake is setting CPU limits to a fraction of what the application needs during bursts, based on average usage. When the application hits a spike, the kernel throttles CPU, which slows down request processing, which causes more requests to queue up, which increases CPU usage further—a vicious cycle that can lead to timeouts and cascading failures. Teams that experience this often remove CPU limits entirely, which works but can lead to noisy neighbors in multi-tenant clusters. A better approach is to set limits high enough to cover burst scenarios and rely on resource requests for scheduling guarantees.
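The effect of throttling on latency is easy to underestimate. On Linux, CPU limits are enforced as a quota per scheduling period, 100 ms by default. The sketch below estimates the wall-clock time of a CPU-bound request under simplifying assumptions: a single thread, nothing else running in the container, and work that starts at the beginning of a period. The function name is hypothetical.

```python
import math


def wall_time_under_cpu_limit(cpu_ms_needed: float, limit_millicores: int, period_ms: float = 100.0) -> float:
    """Estimate wall-clock ms for a single-threaded task under a CPU limit."""
    quota_ms = limit_millicores * period_ms / 1000  # CPU time allowed per period
    if quota_ms >= period_ms or cpu_ms_needed <= quota_ms:
        return cpu_ms_needed  # a single thread never exhausts the quota
    # Each fully used period burns quota_ms of CPU, then the thread is throttled
    # until the next period starts.
    full_periods = math.ceil(cpu_ms_needed / quota_ms) - 1
    return full_periods * period_ms + (cpu_ms_needed - full_periods * quota_ms)


print(wall_time_under_cpu_limit(200, 500))  # 350.0
print(wall_time_under_cpu_limit(200, 250))  # 725.0
```

A request that needs 200 ms of CPU takes 350 ms under a 500m limit and 725 ms under a 250m limit, which is how a limit sized for average usage turns a short spike into timeouts.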
Ignoring Pod Topology Spread Constraints
Many teams deploy pods without any topology spread constraints, assuming the scheduler will distribute them evenly. In Kubernetes, the default scheduler does try to spread replicas, but only as a soft preference that other scoring factors can outweigh, and it never rebalances pods that are already running. After scale-up events or node replacements, a large share of replicas can end up on the same node or in the same zone, so a single failure can take down a large portion of a service. The anti-pattern is to rely on luck rather than explicit constraints. The fix is to use pod topology spread constraints to enforce distribution across zones, nodes, or even racks. The edge case is when the cluster doesn't have enough nodes to satisfy a hard constraint (whenUnsatisfiable: DoNotSchedule): then the pods will remain pending, which is better than having them all on one node but can cause confusion if not monitored.
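To see why pods can end up pending under a hard constraint, it helps to compute the skew by hand. The sketch below follows the documented definition for Kubernetes topology spread constraints, where skew is the pod count in a domain (after placing the new pod) minus the lowest count across domains. The function name is hypothetical, and the model ignores node capacity, taints, and other filters.

```python
from typing import Dict, List


def schedulable_domains(pods_per_domain: Dict[str, int], max_skew: int) -> List[str]:
    """Return the domains where one more pod would keep skew within max_skew."""
    if not pods_per_domain:
        return []
    lowest = min(pods_per_domain.values())
    return sorted(
        domain
        for domain, count in pods_per_domain.items()
        if (count + 1) - lowest <= max_skew
    )


print(schedulable_domains({"zone-a": 2, "zone-b": 1}, max_skew=1))  # ['zone-b']
print(schedulable_domains({"zone-a": 1, "zone-b": 1}, max_skew=1))  # ['zone-a', 'zone-b']
```

In the first case only zone-b is eligible. If zone-b has no node with free capacity, the new pod has nowhere to go and stays pending until capacity appears there.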
Maintenance, Drift, and Long-Term Costs
Edge cases don't stay solved. Over time, configuration drift, software updates, and team changes reintroduce old problems. A cluster that was carefully tuned six months ago may have accumulated small changes—a developer adjusting a probe timeout for a local test, a new version of a sidecar proxy with different default behavior, or a node upgrade that changes kernel parameters. These incremental drifts are hard to detect because nothing breaks immediately; the system just becomes slightly less resilient.
Config Drift in Helm Charts and Operators
Helm charts and operators are meant to standardize deployments, but they can also hide drift. A team might install a chart with default values that work initially, then later override a few values via a ConfigMap or direct edit. When the chart is upgraded, the customizations can be overwritten, or worse, the chart's new version changes the default behavior for a setting the team relied on. The edge case is a silent regression: the application still runs, but a previously handled failure mode (like a slow startup) now causes crashes because a timeout was reset to a lower value.
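A cheap guard against this kind of silent regression is to diff the effective configuration before and after an upgrade and review every changed path. The sketch below compares two nested dictionaries, such as values files loaded into Python; the function name, the `<missing>` marker, and the example keys are all hypothetical.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for keys present on only one side


def diff_values(
    baseline: Dict[str, Any], current: Dict[str, Any], prefix: str = ""
) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted.path: (old, new)} for every value that differs."""
    changes: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(baseline) | set(current)):
        path = f"{prefix}.{key}" if prefix else key
        old = baseline.get(key, MISSING)
        new = current.get(key, MISSING)
        if isinstance(old, dict) and isinstance(new, dict):
            changes.update(diff_values(old, new, path))  # recurse into nested sections
        elif old != new:
            changes[path] = (old, new)
    return changes


before = {"probe": {"timeoutSeconds": 10, "path": "/healthz"}, "replicas": 3}
after = {"probe": {"timeoutSeconds": 1, "path": "/healthz"}, "replicas": 3}
print(diff_values(before, after))  # {'probe.timeoutSeconds': (10, 1)}
```

Run against the values in effect before and after a chart upgrade, a report like this surfaces a reset timeout at review time rather than during the next slow startup.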
The Cost of Over-Engineering
There's also a long-term cost to adding too many safeguards. Every readiness probe, PreStop hook, and circuit breaker adds complexity and maintenance burden. Teams that implement every pattern they read about often find themselves debugging the safety nets rather than the actual application. The edge case here is the "safety net that becomes a snare": a poorly tuned circuit breaker that trips during normal traffic, or a PreStop hook that takes so long that the orchestrator kills the pod before it completes. The key is to start with a minimal set of patterns and add more only when data shows they're needed.
When Not to Use This Approach
Not every cluster needs deep edge-case handling. For small, single-team clusters running low-criticality applications, the overhead of tuning probes, PDBs, and topology constraints may not be worth the effort. The risk of an edge case causing downtime is low, and the time spent on orchestration tuning could be better spent on feature development. Similarly, for ephemeral environments like CI/CD pipelines, where pods are short-lived and failures are acceptable, many of these patterns add unnecessary latency.
When the Application Is Stateless and Simple
If your application is truly stateless (no local storage, no long-lived connections) and has simple dependencies (a single database or API), the edge cases are fewer and easier to handle. A simple deployment with a single liveness probe and no readiness probe might be sufficient. The orchestrator's default behavior—restarting failed pods and rescheduling them on healthy nodes—covers most failure scenarios. Adding advanced patterns like circuit breakers or topology constraints would be overkill.
When the Team Lacks Operational Maturity
Another situation where less is more is when the team doesn't have the operational maturity to maintain complex orchestration configurations. A team that struggles with basic monitoring and incident response will only compound their problems by adding intricate probe logic and disruption budgets. In this case, it's better to keep configurations simple and invest in fundamentals: good logging, alerting, and a reliable deployment pipeline. As the team matures, they can gradually adopt more sophisticated patterns.
Open Questions and Common Misconceptions
Even experienced practitioners debate some aspects of orchestration edge cases. Here are a few questions that come up frequently.
Should readiness probes check external dependencies?
As mentioned earlier, the consensus is no—readiness probes should be shallow. But there are exceptions. For example, if the application is a caching layer that is useless without a working backend, a readiness probe that checks the backend might make sense, as long as it has a short timeout and doesn't cause cascading failures. The key is to understand the trade-off: you're trading a faster recovery for a higher risk of total service outage.
Is it safe to use a single endpoint for liveness and readiness?
In simple cases, yes. If the application is a straightforward HTTP service with no startup delay and no overload sensitivity, a single endpoint works. But as soon as you have any of those characteristics, separate probes become necessary. The misconception is that separate probes are always required; they're not, but they're a good default.
How do you test for edge cases?
Chaos engineering is the most common answer, but it's not always practical. Many teams start with fault injection in staging environments: killing pods, simulating network partitions, or introducing latency. The challenge is that edge cases often involve timing and race conditions that are hard to reproduce. A more pragmatic approach is to analyze incident postmortems and add automated tests for the specific failure modes that have occurred in production.
Summary and Next Experiments
Edge cases in orchestration are not random—they follow patterns. Network partitions, resource exhaustion, probe misconfiguration, and configuration drift are the most common categories. The key to handling them is not to predict every failure, but to build systems that degrade gracefully and recover automatically. Start with the basics: separate liveness and readiness probes, set realistic resource requests and limits, use PreStop hooks for graceful shutdown, and define pod disruption budgets with some headroom.
Next, experiment with one pattern at a time. If you don't have circuit breakers, add them to one service and monitor the impact. If you've never used topology spread constraints, apply them to a stateless workload and observe scheduling behavior. The goal is to build confidence through small, reversible changes. And always, document the edge cases you encounter—they are the best guide for what to handle next.