The Kubernetes Paradox: Why Your Pod Is Throttled Despite Idle CPU

You've been there. Your Kubernetes pod shows low average CPU usage, yet its application performance is abysmal. Latency spikes, requests time out, and your logs are full of errors. The frustrating part? Your cluster's nodes have plenty of idle CPU capacity. This isn't a "noisy neighbor" issue; it's a "silent killer" of performance rooted in how Kubernetes, the Linux kernel, and Cgroups interact.
This deep dive will uncover why this happens and provide actionable steps to solve it.
The Core Mechanism: Quotas and the 100ms Tick
When you define a CPU limit for a pod in Kubernetes, you're not setting a simple hard cap. Instead, Kubernetes translates your limit into two specific parameters for the container's control group (cgroup):
cpu.cfs_period_us: This is the time window over which CPU usage is measured. By default it is 100,000 microseconds (100ms); the period can be tuned at the kubelet level, but virtually all clusters run with the default.
cpu.cfs_quota_us: This is the total amount of CPU time (in microseconds) your container is allowed to use within each 100ms period. This value is calculated based on your specified CPU limit.
The relationship is simple:
cpu.cfs_quota_us = cpu_limit × cpu.cfs_period_us
Let's use an example. If you set a CPU limit of 500m (500 millicores), which is equivalent to 0.5 CPU:
cpu.cfs_quota_us = 0.5 × 100,000µs = 50,000µs
This means your pod is allocated 50ms of CPU time in every 100ms period. If your application's threads burn through that 50ms quota in the first 20ms of the period, the CFS bandwidth controller suspends them. The pod is "throttled" and must wait until the next 100ms period begins to resume execution, regardless of how much free CPU is available on the node.
This is the central paradox: the low average CPU usage you see is a direct result of the pod being throttled for most of the time.
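You can check this translation from inside a running container. The snippet below is a minimal sketch, assuming a cgroup v1 node with the cpu controller mounted at /sys/fs/cgroup/cpu (cgroup v2 paths are covered later); it reads the two files the kubelet populates and derives the effective limit.
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// readInt reads a cgroup file that holds a single integer value.
func readInt(path string) (int64, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
    // Assumed cgroup v1 paths; inside the container they resolve to this pod's cgroup.
    quota, err := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
    if err != nil {
        panic(err)
    }
    period, err := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    if err != nil {
        panic(err)
    }

    if quota < 0 { // -1 means no limit was set
        fmt.Println("no CPU limit configured")
        return
    }
    fmt.Printf("quota=%dµs period=%dµs -> limit=%.0fm\n",
        quota, period, float64(quota)/float64(period)*1000)
}
For the 500m example above, it should print quota=50000µs period=100000µs -> limit=500m.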
Detailed Example: A Multi-threaded Golang Microservice
Let's illustrate this with a concrete example. We have a microservice that performs CPU-intensive calculations. It's written in Go, and it uses multiple goroutines to parallelize the workload.
The deployment manifest for this service is configured with a low CPU limit:
apiVersion: apps/v1
kind: Deployment
...
spec:
  template:
    spec:
      containers:
        - name: cpu-intensive-app
          image: your-app-image:v1
          resources:
            requests:
              cpu: "250m"
            limits:
              cpu: "500m"
The Go application has a function that spins up 4 goroutines, each designed to burn CPU for a short duration.
package main

import (
    "fmt"
    "runtime"
)

// busyWork spins in a tight loop to simulate heavy CPU usage.
func busyWork() {
    for i := 0; i < 1e9; i++ {
    }
}

func main() {
    numWorkers := 4
    runtime.GOMAXPROCS(numWorkers)

    fmt.Println("Starting CPU intensive work...")

    done := make(chan bool)
    for i := 0; i < numWorkers; i++ {
        go func() {
            busyWork()
            done <- true
        }()
    }

    // Wait for all workers to finish.
    for i := 0; i < numWorkers; i++ {
        <-done
    }
    fmt.Println("Work finished.")
}
What happens?
The Start: The container starts, and the cpu.cfs_quota_us is set to 50,000µs (50ms).
Concurrency: The Go program starts 4 goroutines. These goroutines can run in parallel on up to 4 logical CPUs.
Quota Burn: Each of the 4 goroutines immediately starts consuming CPU. With the work spread evenly, the container burns CPU time roughly four times as fast as a single thread would.
Throttling: The total 50ms quota is consumed very quickly, in just a fraction of the 100ms period.
With four threads running flat out, the 50ms quota is exhausted after roughly 12.5ms of wall-clock time (4 threads × 12.5ms = 50ms of CPU time).
For the remaining ~87.5ms of the period, the entire pod is throttled. Its processes are suspended, and they can't run even if the node's CPU is completely idle.
The Cycle Repeats: After the 100ms period ends, the quota is refilled, the goroutines resume, and the cycle repeats.
The consequence is that the "Work finished" message is delayed significantly. What should have taken a few dozen milliseconds ends up taking hundreds of milliseconds or even seconds. A monitoring tool would show a sawtooth pattern in CPU usage, with brief spikes followed by flat lines, and a high count for throttling metrics.
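You do not even need a metrics stack to confirm this cycle: the kernel keeps its own counters. The following sketch, again assuming cgroup v1 paths inside the container, reads cpu.stat, whose nr_periods, nr_throttled and throttled_time fields record exactly the behavior described above.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    // cgroup v1: cpu.stat reports nr_periods, nr_throttled and throttled_time (nanoseconds).
    f, err := os.Open("/sys/fs/cgroup/cpu/cpu.stat")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    stats := map[string]int64{}
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Each line has the form "<key> <value>", e.g. "nr_throttled 1337".
        fields := strings.Fields(scanner.Text())
        if len(fields) != 2 {
            continue
        }
        v, err := strconv.ParseInt(fields[1], 10, 64)
        if err != nil {
            continue
        }
        stats[fields[0]] = v
    }

    if stats["nr_periods"] > 0 {
        fmt.Printf("throttled in %d of %d periods (%.0f%%), %.2fs spent throttled\n",
            stats["nr_throttled"], stats["nr_periods"],
            float64(stats["nr_throttled"])/float64(stats["nr_periods"])*100,
            float64(stats["throttled_time"])/1e9)
    }
}
On a heavily throttled pod, nr_throttled climbs almost as fast as nr_periods, mirroring the sawtooth you would see on a dashboard.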
Cgroups v2 and Modern Kubernetes
While Cgroups v1 uses cpu.cfs_period_us and cpu.cfs_quota_us, Cgroups v2 introduces a unified hierarchy and a more streamlined interface: a single cpu.max file that specifies both the CPU quota and the period.
The format is cpu.max: $QUOTA $PERIOD.
For a 500m CPU limit, this would be:
cpu.max: 50000 100000
The underlying behavior remains the same: a time-based quota is enforced and consuming it early leads to throttling. The shift to Cgroups v2 is a technical improvement for the kernel, but it does not magically solve the throttling problem caused by a low CPU limit.
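On a cgroup v2 node the same check is a single file read. Here is a minimal sketch, assuming the unified hierarchy is visible at /sys/fs/cgroup from inside the container:
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    // cgroup v2: cpu.max holds "<quota> <period>" in microseconds, or "max <period>" when no limit is set.
    data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
    if err != nil {
        panic(err)
    }
    fields := strings.Fields(string(data))
    if len(fields) != 2 {
        panic("unexpected cpu.max format")
    }
    if fields[0] == "max" {
        fmt.Println("no CPU limit configured")
        return
    }
    quota, _ := strconv.ParseFloat(fields[0], 64)
    period, _ := strconv.ParseFloat(fields[1], 64)
    fmt.Printf("quota=%.0fµs period=%.0fµs -> limit=%.0fm\n", quota, period, quota/period*1000)
}
With the 500m limit above, this prints quota=50000µs period=100000µs -> limit=500m, just as in the v1 case.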
Monitoring for Throttling
The most effective way to diagnose this issue is to look at the CFS throttling metrics that cAdvisor (built into the kubelet) exposes for every container.
The key metrics are:
container_cpu_cfs_throttled_periods_total: The number of enforcement periods in which the container was throttled.
container_cpu_cfs_periods_total: The total number of enforcement periods (100ms each by default) that have elapsed for the container.
container_cpu_cfs_throttled_seconds_total: The total time (in seconds) that the container has spent throttled.
A high value for throttled_periods_total relative to periods_total is a clear indicator of a problem.
Here's a sample Prometheus query to get the throttling rate:
rate(container_cpu_cfs_throttled_periods_total{namespace="your-namespace"}[5m]) / rate(container_cpu_cfs_periods_total{namespace="your-namespace"}[5m])
A value approaching 1 indicates severe throttling; a ratio of 0.9, for instance, means the container hit its quota in nine out of every ten periods in which it ran.
Actionable Solutions
Requests with No Limits (The "Burstable" Approach): For most stateless microservices and web applications, this is the best approach. Set a CPU request to guarantee a minimum resource allocation, but omit the limit entirely. This allows your pod to "burst" and use all available CPU on the node during spiky workloads without being artificially capped. This is the most resource-efficient model.
resources:
  requests:
    cpu: "250m" # No limits key!
Requests Equal to Limits (The "Guaranteed" Approach): For critical, performance-sensitive applications like databases, message queues, or in-memory caches, you may want predictable performance. Setting requests equal to limits (for both CPU and memory) places the pod in the "Guaranteed" QoS class, giving it a fixed, predictable CPU budget; combined with the kubelet's static CPU manager policy and whole-CPU values, it can even be granted exclusive cores.
resources:
  requests:
    cpu: "2"
  limits:
    cpu: "2"
This guarantees the pod the equivalent of two full CPUs in every period, and because the quota matches the request, throttling only occurs if the application tries to use more than it was sized for.
Monitoring and Optimization: If you must use limits, set them significantly higher than the observed average usage and monitor the throttling metrics closely. Prefer the Horizontal Pod Autoscaler (HPA), driven by CPU utilization, to add replicas rather than raising the limits; spreading load across more pods distributes resources more evenly.
Conclusion
The "idle but throttled" problem is a classic example of a leaky abstraction. What appears to be a simple, intuitive setting hides a complex and punitive time-based quota system. By understanding the underlying mechanism of CFS and Cgroups, you can avoid this common pitfall. The key takeaway is to rely on CPU requests to drive the scheduler's fair sharing and take a flexible approach to limits, either removing them or setting them high enough to allow for bursty behavior. This leads to more efficient cluster utilization and prevents hidden performance bottlenecks.
Uncover the Hidden Bottlenecks in Your Kubernetes Setup
Think your cluster is fine? Think again. Many teams face silent throttling issues that degrade performance and inflate costs.
👥 Speak with Ananta Cloud experts and let us walk you through what’s really happening in your cluster — and how to fix it.
Email: hello@anantacloud.com | LinkedIn: @anantacloud | Schedule Meeting