Aleksandro Matejic

Mastering Kubernetes Health Checks: Probe Configurations with Valkyrie - Part 3 out of 3

Kubernetes Probes: Hands-On Experiments and Real-World Configuration

Kubernetes probes are the foundation of application health monitoring in containerized environments. This guide takes you through practical experiments using the Valkyrie application to understand how startup, liveness, and readiness probes behave in real scenarios.

In Part 1, we explored probe theory, and in Part 2, we deployed the Valkyrie application. Now it's time to get hands-on and see how Kubernetes probes respond to real application behavior.

Prerequisites and Setup

Before starting these experiments, ensure Valkyrie is running in your cluster.

The implementation details are covered in Part 2.

The code and the Kubernetes manifests can be downloaded from the following GitHub repository:

https://github.com/devoriales/app-health-probes

Check if Valkyrie is running:

kubectl get pods -n valkyrie

# If not running, apply the manifests
kubectl apply -f valkyrie-manifests.yaml

If you don't have an ingress controller installed, use port forwarding to access the application:

# Start port-forward to access the UI
kubectl port-forward -n valkyrie svc/critical-app-clusterip 8080:80

Open http://localhost:8080 in your browser. You should see:

  • Liveness and Readiness status indicators
  • Toggle switches for simulating failures
  • Links to health endpoints

Understanding Valkyrie's Testing Features

Valkyrie provides several features specifically designed for probe testing:

  • Configurable startup time via PRIME_NUMBER_COUNT environment variable (Simulates long startup time)
  • Health endpoints that return appropriate HTTP status codes:
    • /liveness-health - Returns 200 (healthy) or 500/503 (unhealthy)
    • /readiness-health - Returns 200 (ready) or 503 (not ready)
  • Interactive failure simulation through web UI toggles
  • Timestamp tracking at /timestamps showing when probes were first executed

Experiment 1: Understanding Startup Probes

Startup probes determine when an application has successfully started and prevent premature liveness and readiness checks. This experiment demonstrates why startup probes are essential for slow-starting applications.

The Startup Simulation

Valkyrie simulates CPU-intensive startup processes by calculating prime numbers. This mimics real-world scenarios like loading configurations, establishing database connections, or warming caches.

Here's how the simulation works:

func simulateLongStartup(limit int) {
    start := time.Now()
    count := 0
    
    // Calculate prime numbers to consume CPU
    for num := 2; count < limit; num++ {
        if isPrime(num) {
            count++
        }
        if time.Since(start) > 120*time.Second {
            fmt.Println("Startup process took too long, exiting.")
            break
        }
    }
    
    // Mark startup as complete
    atomic.StoreInt32(&startupComplete, 1)
    
    // Create completion indicator file
    os.WriteFile("/tmp/startup-file", 
        []byte(`Startup complete at ` + time.Now().Format("2006-01-02T15:04:05")), 
        0644)
}

The simulateLongStartup() function calculates prime numbers, which is CPU intensive. The higher the limit we set, the longer the calculation takes.

Step 1: Deploy Without Proper Startup Probe Configuration

First, configure Valkyrie with an intensive startup process:

env:
- name: PRIME_NUMBER_COUNT
  value: "1000000"  # This creates a 45+ second startup time

Set up the startup probe with insufficient time allowance:

# Startup Probe - Determines when the application has successfully started
startupProbe:
  exec:
    command:
    - sh
    - -c
    - "test -f /tmp/startup-file"  # Checks if a file exists to indicate startup completion
  initialDelaySeconds: 10  # How long to wait before running the first probe check
  periodSeconds: 10        # How often (in seconds) to perform the probe
  timeoutSeconds: 10       # Number of seconds after which the probe times out
  failureThreshold: 2      # How many consecutive failures before the container is restarted
  successThreshold: 1      # For startup probes, this must be 1
  terminationGracePeriodSeconds: 30  # Grace period before the container is forcefully killed after the probe fails

The way we can calculate this is using the following formula:

initialDelaySeconds + (failureThreshold × periodSeconds) ≥ maximum startup time

In this particular example, that means 10 + (2 × 10) = 30 seconds. Anything beyond that will make the startup probe fail.

Deploy and observe the failure:

kubectl apply -f manifests.yaml
kubectl get pods -n valkyrie -w

Expected Result:

NAME                            READY   STATUS    RESTARTS     AGE
critical-app-588c6b8785-mdnxz   0/1     Running   2 (4s ago)   44s

The pod continuously restarts because the startup probe fails before the application completes initialization.

We can also see the events by describing the pod:

kubectl describe pod -n valkyrie -l app=critical-app | grep -A 20 "Events:"


Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  25s               default-scheduler  Successfully assigned valkyrie/critical-app-5bc54f5968-t4nqz to k3d-devoriales-cluster-server-0
  Normal   Pulled     6s (x2 over 26s)  kubelet            Container image "registry.localhost:5000/valkyrie-app:1.4" already present on machine
  Normal   Created    6s (x2 over 26s)  kubelet            Created container critical-app
  Normal   Started    6s (x2 over 26s)  kubelet            Started container critical-app
  Warning  Unhealthy  6s (x2 over 16s)  kubelet            Startup probe failed:
  Normal   Killing    6s                kubelet            Container critical-app failed startup probe, will be restarted

Step 2: Diagnose the Root Cause

Before adjusting probe settings, verify whether the issue is with the application or with the probe configuration. Remove the startup probe temporarily, giving the application a chance to finish starting.

Comment out the startupProbe section in manifests.yaml:

        # startupProbe:
        #   exec:
        #     command:
        #     - sh
        #     - -c
        #     - "test -f /tmp/startup-file"  # Checks if a file exists to indicate startup completion
        #   initialDelaySeconds: 10  # How long to wait before running the first probe check
        #   periodSeconds: 10        # How often (in seconds) to perform the probe
        #   timeoutSeconds: 10       # Number of seconds after which the probe times out
        #   failureThreshold: 5     # How many times the probe can fail before the pod is marked as Unhealthy
        #   successThreshold: 1      # Minimum number of consecutive successes before marking the container as started. For startup probes, this is set to 1.
        #   terminationGracePeriodSeconds: 30  # Time to wait before forcefully terminating the container if it doesn't start in time

Re-apply the manifests.yaml file:

kubectl apply -f manifests.yaml
kubectl logs -n valkyrie -l app=critical-app --tail=20

Expected Output:

2025-06-08 06:28:30, Count: 999998
2025-06-08 06:28:30, Count: 999999
2025-06-08 06:28:30, Count: 1000000
Startup complete at 2025-06-08T06:28:30, took 43.836445224s

This confirms the application can complete startup but needs more time than the probe allows.

Step 3: Fix the Startup Probe Configuration

Since we know the application can start but needs more than 30 seconds (around 45), we can conclude that our startup probe configuration did not give the application a chance to start.

A tip: also add some buffer time, since the kubelet needs a few seconds to react, usually 3-4 seconds in my experience.

Calculate proper timing based on observed behavior:

  • Application needs: ~45 seconds
  • Buffer time: ~10 seconds
  • Total required: ~55 seconds

This means we could set the values as follows:

initialDelaySeconds + (failureThreshold × periodSeconds) = 10 + (5 × 10) = 60 seconds ≥ maximum startup time

Configure the probe to allow sufficient time:

startupProbe:
  exec:
    command:
    - sh
    - -c
    - "test -f /tmp/startup-file"
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 5  # Allows 60 seconds total: 10 + (5×10)
  successThreshold: 1

Again, we use the formula: initialDelaySeconds + (failureThreshold × periodSeconds) ≥ maximum startup time

❗It's always good to add some buffer time for the probe to succeed. Here we give around 15 seconds.

Apply the corrected configuration:

kubectl apply -f manifests.yaml
kubectl get pods -n valkyrie -l app=critical-app -w

Success Indicator:

NAME                           READY   STATUS    RESTARTS   AGE
critical-app-68d947576-7z55k   1/1     Running   0          2m41s

We see that the pod is ready (1/1 containers running).

Be aware that the startup probe will still fail several times (about 4 with a ~45-second startup), since it checks for the file every 10 seconds. The container won't be restarted, though, because failureThreshold: 5 allows up to 5 consecutive failures before the startup probe is considered failed and the container is restarted.

Experiment 2: Readiness Probe Behavior

Readiness probes determine when a pod is ready to receive traffic. Unlike liveness probes, readiness failures don't restart containers but remove them from service endpoints.

Step 1: Monitor Multiple Components

Open three terminal windows to observe different aspects:

# Terminal 1: Watch pod status
kubectl get pods -n valkyrie -l app=critical-app -w

# Terminal 2: Monitor service endpoints
kubectl get endpoints -n valkyrie critical-app-clusterip -w

# Terminal 3: Track events (events are recorded against the pod name, not the deployment)
kubectl get events -n valkyrie --field-selector involvedObject.name=<pod-name> -w

We want to see how pods are removed from or added to the endpoints when the readiness probe status changes.

Step 2: Trigger Readiness Failure

The readiness probe is controlled by the /readiness-health endpoint.

The following function in Valkyrie returns either 200 OK or 503 Service Unavailable:

// Handler to check the readiness of the application
func readinessHealthHandler(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt32(&simulateReadinessFailure) == 1 {
		http.Error(w, `not ready`, http.StatusServiceUnavailable)
		return
	}
	if atomic.LoadInt32(&startupComplete) == 0 {
		http.Error(w, `not ready`, http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintf(w, `ready`)
}

func toggleLivenessFailureHandler(w http.ResponseWriter, r *http.Request) {
	toggleFailure(&simulateLivenessFailure, w)
}

And that's exactly the endpoint we have configured in our manifests.yaml:

readinessProbe:
  httpGet:
    path: /readiness-health
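
For reference, a complete readinessProbe block would look something like the sketch below. The endpoint and port come from the experiments above, but the timing values are illustrative, not taken from the repository manifests:

readinessProbe:
  httpGet:
    path: /readiness-health   # Valkyrie's readiness endpoint
    port: 8080                # container port seen in the endpoints output later in this section
  periodSeconds: 10           # illustrative values; tune them for your application
  timeoutSeconds: 3
  failureThreshold: 3         # pod leaves the endpoints after ~30s of consecutive failures
  successThreshold: 1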

Now, access the Valkyrie UI at http://localhost:8080 and make the readiness probe fail:

Valkyrie app - Readiness failure simulation

  1. Note the "Readiness Status" shows "ready"
  2. Click the "Simulate Readiness Failure" checkbox
  3. Observe the terminals

Observed Behavior:

  • Pod remains in "Running" state
  • Pod IP removed from service endpoints
  • No container restart occurs

Verify endpoint removal (might take some seconds):

kubectl describe endpoints -n valkyrie critical-app-clusterip

Output:

Subsets:
  Addresses:          <none>
  NotReadyAddresses:  10.42.1.11
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  8080  TCP

When the readiness probe fails, the pod is removed from the service endpoints. Above, we can see that there is no pod address in the Addresses list, which means the pod will not receive traffic.

Step 3: Restore Readiness

Unfortunately, you won't be able to simply uncheck the "Simulate Readiness Failure" checkbox.

The reason is that the pod has been removed from the endpoint list, so you can no longer reach the Valkyrie application through the service.

Just delete the pod so you get a new one:

kubectl delete pod -n valkyrie <pod-name>

When a new Valkyrie pod comes up, it will start serving traffic again.

Experiment 3: Liveness Probe Behavior

Liveness probes detect when applications become unresponsive and need a container restart. This experiment shows how liveness failures trigger container restarts.

The Valkyrie function responsible for liveness is the following:

func livenessHealthHandler(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt32(&simulateLivenessFailure) == 1 {
		// sleep for 2 seconds to simulate a slow response
		time.Sleep(2 * time.Second)
		http.Error(w, `down`, http.StatusInternalServerError)
		return
	}
	// If the application is still starting up, return a 503
	if atomic.LoadInt32(&startupComplete) == 0 {
		http.Error(w, `down`, http.StatusServiceUnavailable)
		return
	}
	// If the application has started up, return a 200
	setProbeTimestamp("livenessProbe") // Capture liveness timestamp
	// return 200
	fmt.Fprintf(w, `up`)
}

Very similar to the readiness probe, we have an endpoint we can query directly:

curl -i http://valkyrie.local:8080/liveness-health

HTTP/1.1 200 OK
Content-Length: 2
Content-Type: text/plain; charset=utf-8
Date: Sun, 08 Jun 2025 07:08:35 GMT

up


or

HTTP/1.1 500 Internal Server Error
Content-Length: 5
Content-Type: text/plain; charset=utf-8
Date: Sun, 08 Jun 2025 07:09:55 GMT
X-Content-Type-Options: nosniff

down

It will return either a 200 OK (up) or a 500 Internal Server Error (down) status.

The liveness probe is configured to check exactly that endpoint:

livenessProbe:
  httpGet:
    path: /liveness-health

The liveness probe will essentially cause the container to be restarted once it receives the 500 error failureThreshold times in a row.
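
For completeness, a fuller liveness block might look like the sketch below; the values are illustrative, and the comments show how long Kubernetes waits before restarting the container:

livenessProbe:
  httpGet:
    path: /liveness-health   # Valkyrie's liveness endpoint
    port: 8080
  periodSeconds: 10          # probe every 10 seconds (illustrative)
  timeoutSeconds: 3
  failureThreshold: 3        # restart after 3 consecutive failures
  # worst case before the restart: failureThreshold x periodSeconds = 3 x 10 = 30 seconds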

Trigger Liveness Failure

  1. Access the Valkyrie UI at http://localhost:8080
  2. Note the "Liveness Status" shows "alive"
  3. Click the "Simulate Liveness Failure" checkbox
  4. Monitor pod events

Valkyrie app - Liveness failure simulation

Expected Events:

kubectl describe pod -n valkyrie -l app=critical-app | grep -A5 "Events:"

Output:

Events:
  Normal   Killing         0s                     kubelet  Container critical-app failed liveness probe, will be restarted

We can also see the pod's restart count by running:

kubectl get pods -n valkyrie
NAME                           READY   STATUS    RESTARTS      AGE
critical-app-7588d6467-js4vf   0/1     Running   3 (47s ago)   145m

Pod Status After Restart:

kubectl get pods -n valkyrie

Output:

NAME                            READY   STATUS    RESTARTS      AGE
critical-app-6c587d54f9-tlbtp   1/1     Running   1 (7m56s ago) 11m

We can conclude that when the liveness probe fails, Kubernetes restarts the container.

❗Be very careful with the liveness probe: if it's not well implemented, it can cause cascading failures (as the official documentation also points out).

Understanding How Parameters Work Together

Understanding how timing parameters interact is crucial for production deployments because misconfigured probe timing can cause seemingly healthy applications to become unavailable or restart unnecessarily.

In production environments, these timing decisions directly impact user experience, application stability, and operational overhead.

Poor probe timing configuration causes common production issues like:

  • Applications appearing "flaky" due to aggressive failure thresholds
  • Slow deployment rollouts because readiness takes too long
  • Unnecessary container restarts during normal application pauses (GC, high load)
  • Complete service outages when all pods fail readiness simultaneously
  • Extended recovery times after infrastructure issues
 

Understanding successThreshold Across Probe Types

The successThreshold parameter behaves differently across probe types.

Startup Probes: Always successThreshold: 1

For startup probes, successThreshold must always be 1, and Kubernetes enforces this restriction. Here's why:

startupProbe:
  exec:
    command:
    - sh
    - -c
    - "test -f /tmp/startup-file"
  successThreshold: 1  # MUST be 1 - Kubernetes requirement

Startup is a binary state - either your application has started or it hasn't. There's no concept of "partially started" that would benefit from multiple consecutive successes. Once the startup probe succeeds once, it has served its purpose: confirming the application has initialized and preventing premature liveness/readiness checks.

Liveness Probes: Always successThreshold: 1

Liveness probes also must have successThreshold: 1:

 
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  successThreshold: 1  # MUST be 1 - Kubernetes requirement

Liveness represents whether the application process is alive and responsive. Like startup, this is binary - the process either responds or it doesn't. If a liveness probe succeeds, it immediately indicates the container is healthy and doesn't need restart.

Readiness Probes: successThreshold Can Be > 1

Readiness probes are the only probe type where successThreshold can be greater than 1:

 
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  successThreshold: 2  # Can be > 1 for readiness probes
  failureThreshold: 1

Readiness represents the ability to handle traffic, which can be more nuanced. Applications might have intermittent readiness states due to:

  • Warm-up periods after deployment
  • Cache initialization that affects performance
  • Connection pool establishment
  • JIT compilation in Java applications
  • Resource allocation that varies over time

Practical Example of successThreshold > 1:

Consider a Java application that needs time to warm up:

 
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  successThreshold: 3  # Requires 3 consecutive successes
  failureThreshold: 2

Timeline:

  • t=5s: SUCCESS (count: 1/3) → Still not ready
  • t=10s: SUCCESS (count: 2/3) → Still not ready
  • t=15s: SUCCESS (count: 3/3) → Pod becomes ready

This prevents the pod from receiving traffic during the initial warm-up period when response times might be unpredictable, even though the application responds to health checks.

Why Higher successThreshold Can Be Beneficial:

  1. Prevents Premature Traffic - Ensures applications are truly ready to handle production load
  2. Reduces User Impact - Prevents routing traffic to pods that might have slow response times
  3. Handles Transient States - Accounts for applications that might briefly appear ready during initialization

When NOT to Use Higher successThreshold:

  • Simple stateless applications that are immediately ready when they respond
  • Applications with consistent startup behavior
  • When you need rapid scaling and can't afford extended readiness delays

Important Limitation:

❗If you set successThreshold: 3 and failureThreshold: 1, one failure resets the success count back to zero. This means the pod needs 3 consecutive successes without any failures to become ready, which can significantly delay traffic routing in unstable environments.
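
To make the reset behavior concrete, here is a sketch of such a configuration (using the same generic endpoint as the examples above):

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  successThreshold: 3   # needs 3 consecutive successes to become ready...
  failureThreshold: 1   # ...and a single failure resets the success count to zero
# probe results: success, success, failure, success, success, success
# the pod only becomes ready after the final three uninterrupted successes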

Production-Ready Probe Configurations

Configuring probes for production environments requires careful consideration of your application's behavior, infrastructure constraints, and external dependencies. The configurations below are good starting patterns, but remember that every application is unique and requires testing in your specific environment.

Startup Probe Configuration Strategy

The startup probe is your first line of defense against premature health checks. It should accommodate your application's worst-case startup scenario while accounting for infrastructure variability, as we already saw in the experiments above.

 
startupProbe:
  httpGet:
    path: /health/startup  # Dedicated lightweight endpoint
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 5       # Allow time for response under load
  failureThreshold: 30    # 5 minutes total allowance
  successThreshold: 1

This configuration provides a 5-minute window for application startup (0 + 30 × 10 = 300 seconds). The dedicated /health/startup endpoint should only verify that core application components are initialized, not external dependencies. The timeout is set to 5 seconds to handle scenarios where the application is under CPU pressure during startup.

❗In my experience, be careful about including external service checks in startup probes. If your database is temporarily unavailable, the startup probe will keep restarting your containers.

Liveness Probe Configuration Strategy

The liveness probe determines if your application process is healthy and responsive. It should be the most conservative of all probes to avoid unnecessary restarts.

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 0  # Startup probe ensures readiness
  periodSeconds: 30       # Conservative frequency
  timeoutSeconds: 5
  failureThreshold: 3     # 90 seconds before restart
  successThreshold: 1

The 30-second period with 3 failure threshold provides a 90-second grace period before restart. This accounts for temporary application pauses due to garbage collection, high CPU load, or brief resource contention. The liveness endpoint should only verify that the application process is responsive and core functionality is working.

Avoid including database connectivity, cache availability, or external API checks in liveness probes. These dependencies can cause healthy application instances to restart unnecessarily when external services experience issues, amplifying the impact of infrastructure problems.

Readiness Probe Configuration Strategy

The readiness probe controls traffic flow and can include dependency checks since failures only remove pods from load balancing without restarting them. 

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10       # Responsive to changes
  timeoutSeconds: 3
  failureThreshold: 2     # Quick response to issues
  successThreshold: 1

The readiness probe can be more aggressive than the liveness probe since failures don't trigger restarts. However, be cautious about including external dependency checks in readiness probes. While it might seem logical to verify database connectivity, cache availability, and external services, this creates a dangerous all-or-nothing scenario.

If your readiness probe checks database connectivity and you have 5 replicas, when the database fails, all 5 pods will be removed from service endpoints simultaneously. This creates a complete service outage for your application, even though the pods are healthy and might be able to serve cached responses, static content, or degraded functionality.

I'd still recommend limiting the readiness check to internal service health; it's usually better to implement graceful degradation in the code.

Also, in my opinion, even if many might find this radical, I'd rather monitor dependency health separately and alert on those dependency issues.
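
If you go that route, dependency health can be tracked by a separate alert instead of the probe. A minimal sketch, assuming the Prometheus Operator is installed and that the application exports a hypothetical app_dependency_up gauge:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dependency-health
  namespace: valkyrie
spec:
  groups:
  - name: dependencies
    rules:
    - alert: DatabaseUnreachable
      # app_dependency_up is a hypothetical gauge your application would export
      expr: app_dependency_up{dependency="database"} == 0
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Database unreachable; pods keep serving degraded responses"

This way the pods stay in the endpoints and keep serving what they can, while the dependency outage still pages someone.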

Ask yourself:

  1. Can my service provide ANY meaningful functionality without this dependency?
    • YES → Don't include in readiness, handle gracefully
    • NO → Include in readiness
  2. Is serving degraded responses better than serving no response?
    • YES → Don't include in readiness
    • NO → Include in readiness
  3. What's worse for users: degraded functionality or complete outage?
    • Degraded is acceptable → Don't include in readiness
    • Complete outage is preferable → Include in readiness

Understanding the Complexity of Probe Failures

When probes start failing in production, the root cause analysis can be surprisingly complex. Probe failures are often symptoms of deeper issues rather than simple configuration problems. Understanding this complexity is crucial for effective troubleshooting.

Resource Constraint Impact

CPU throttling is one of the most common hidden causes of probe failures. When containers hit CPU limits, several things happen simultaneously:

  • CPU Throttling Effects: Your application's response times increase dramatically when the kernel throttles CPU usage. A health check that normally responds in 50ms might take 3-5 seconds under throttling. If your probe timeout is set to 3 seconds, you'll see intermittent failures that seem random but correlate with application load (see the configuration sketch after this list).
  • Memory Pressure Symptoms: When containers approach memory limits, garbage collection becomes more aggressive, causing application pauses that can trigger probe timeouts. Additionally, the kernel's out-of-memory killer might terminate processes unexpectedly, causing probe failures that look like application bugs.
  • I/O Wait Impact: Storage performance issues affect probe response times. When persistent volumes experience high latency or throughput issues, applications may appear unresponsive to probes while actually waiting for disk operations to complete.
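
To illustrate the CPU throttling point, here is a hedged sketch of how tight CPU limits and probe timeouts interact in a container spec; the numbers are illustrative, not a recommendation:

containers:
- name: critical-app
  image: registry.localhost:5000/valkyrie-app:1.4   # image used in the experiments above
  resources:
    requests:
      cpu: "250m"
    limits:
      cpu: "500m"          # a tight limit: health endpoints slow down when throttled
  livenessProbe:
    httpGet:
      path: /liveness-health
      port: 8080
    periodSeconds: 30
    timeoutSeconds: 5      # leave headroom above the normal response time,
                           # otherwise throttling alone can fail the probe
    failureThreshold: 3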

Code Quality vs Probe Configuration

Distinguishing between application issues and probe misconfiguration requires systematic analysis:

  • Application Performance Problems: If your health endpoints include database queries, cache lookups, or complex calculations, slow performance in these areas will manifest as probe failures. The question becomes whether to optimize the application code or adjust probe timeouts.
  • Memory Leaks and Resource Usage: Applications with memory leaks might pass probes initially but fail over time as available memory decreases. This creates a pattern where probes succeed after restarts but fail after running for extended periods.
  • Concurrency Issues: Applications with race conditions, deadlocks, or thread pool exhaustion might intermittently fail probes. These issues are often load-dependent and difficult to reproduce in testing environments.

The External Dependency Trap

External dependencies create the most dangerous probe failure scenarios because they can cascade across your entire infrastructure:

  • Database Failure Amplification: Consider a scenario where your database becomes temporarily unavailable. If startup and liveness probes check database connectivity, every healthy application instance will restart when the database fails. When the database recovers, you'll have a thundering herd of containers starting simultaneously, potentially overwhelming the database again and creating a cycle of failures.
  • Network Partition Effects: During network partitions, applications might lose connectivity to external services while remaining otherwise healthy. Probes that check external dependencies will mark these instances as failed, removing them from service even though they could handle requests not requiring external services.
  • Shared Resource Contention: When multiple applications probe the same external service, failures in that service can cascade across unrelated applications. This is particularly dangerous with shared databases, message queues, or API gateways.

Advanced Troubleshooting Methodology

When probe failures occur, systematic analysis is essential to identify the actual root cause:

Correlation Analysis

Start by correlating probe failures with infrastructure metrics. Examine CPU utilization, memory usage, network latency, and storage performance during failure periods. Often, probe failures correlate with resource saturation that isn't immediately obvious.

# Check resource usage during probe failures
kubectl top pod -n <namespace> <pod-name>

# Examine CPU throttling (cgroup v1 path; on cgroup v2 the file is /sys/fs/cgroup/cpu.stat)
kubectl exec -n <namespace> <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat

# Check memory pressure (PSI, available on kernel 4.20+)
kubectl exec -n <namespace> <pod-name> -- cat /proc/pressure/memory

Application-Level Diagnostics

Probe failures might indicate application-level issues that require code changes rather than configuration adjustments:

# Monitor GC patterns in Java applications
kubectl exec -n <namespace> <pod-name> -- jstat -gc <pid>

# Check for deadlocks or thread contention
kubectl exec -n <namespace> <pod-name> -- jstack <pid>

# Analyze connection pool status
kubectl exec -n <namespace> <pod-name> -- curl http://localhost:8080/actuator/metrics/hikaricp.connections

Infrastructure Dependencies

Examine the health of infrastructure components that support your application:

# Check persistent volume performance
kubectl describe pv <volume-name>

# Monitor network connectivity to external services
kubectl exec -n <namespace> <pod-name> -- ping -c 5 <external-service>

# Test DNS resolution performance
kubectl exec -n <namespace> <pod-name> -- nslookup <service-name>

Production Troubleshooting Examples 

Understanding real-world troubleshooting scenarios helps prepare for complex production issues:

Scenario 1: Intermittent Liveness Failures

Scenario: Containers restart randomly during peak traffic periods with liveness probe failures.

Investigation Process: Check CPU throttling metrics, garbage collection logs, and connection pool statistics. Often reveals resource limits set too low for actual application needs during high load.

Possible Resolution: Increase CPU requests/limits, optimize garbage collection settings, or implement circuit breakers for external dependencies.

Scenario 2: Startup Probe Timeout During Deployments

Scenario: New deployments fail consistently with startup probe timeouts, but existing pods run fine.

Investigation Process: Compare resource usage between old and new code versions, examine initialization dependencies, and check for configuration changes affecting startup time.

Possible Resolution: Might require code optimization, configuration adjustments, or infrastructure scaling to handle initialization load.

Scenario 3: Readiness Probe Flapping

Scenario: Pods continuously move in and out of service endpoints, causing request failures.

Investigation Process: Examine external dependency health, network latency patterns, and application connection handling.

Possible Resolution: Often requires implementing proper circuit breakers, connection pooling optimization, or adjusting probe sensitivity.

Key Insights from Production Experience

Through production troubleshooting, several critical insights emerge:

Probe Configuration is Application-Specific: Generic probe configurations fail in production. Every application requires careful analysis of its startup patterns, resource usage, and dependency relationships.

Infrastructure Affects Probe Behavior: Kubernetes cluster configuration, network policies, resource quotas, and storage performance all impact probe reliability. These factors must be considered during probe configuration.

External Dependencies Require Circuit Breakers: Applications with external dependencies need proper circuit breaker patterns implemented at the application level, not just in probe configurations.

Monitoring is Essential: Effective probe troubleshooting requires comprehensive monitoring of application metrics, infrastructure resources, and dependency health.

Testing Under Load: Probe configurations that work under light load often fail during peak traffic. Load testing should include probe behavior validation.

Gradual Rollouts Reduce Risk: When changing probe configurations, use gradual rollouts with careful monitoring to detect issues before they affect the entire application fleet.
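
One way to do this is a conservative rolling update strategy, so only a fraction of the pods picks up the new probe settings at a time. A minimal sketch, reusing the Valkyrie deployment from the experiments (the replica count and surge values are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
  namespace: valkyrie
spec:
  replicas: 5                  # illustrative fleet size
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # at most one pod is replaced at a time
      maxSurge: 1
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      containers:
      - name: critical-app
        image: registry.localhost:5000/valkyrie-app:1.4
        # updated probe configuration goes here, rolled out one pod at a time

Watching the rollout with kubectl rollout status and pausing it if restart counts start climbing keeps a bad probe change from reaching the whole fleet.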

Conclusion

Kubernetes probes are your application's health monitoring system. Configure them based on measured behavior, not assumptions. Test thoroughly in staging environments that mirror production conditions before deploying.

The key to successful probe configuration is understanding your application's actual behavior under various conditions and configuring probes that accurately reflect its health status.

Remember: probes should enhance application reliability, not create additional failure points through overly aggressive configurations.
