CPU usage is among the most popular performance metrics, but it can be quite misleading if you do not know how it is measured, doubly so in the case of the detailed components: user, system, idle, iowait, etc. Below I try to explain how the Linux kernel measures it and how to interpret it correctly.
We will use the numbers as presented by top as a running example of how all Linux monitoring tools end up presenting the different kinds of CPU usage.
What is CPU usage? What is CPU time?
CPU usage is derived from CPU time, which on modern
systems is measured using the APIC
timer present on each CPU
core. The kernel programs the timer to produce an interrupt at a
specific frequency, which is configurable at compile time and
accessible in the kernel code as the
HZ constant. Each occurrence of
the timer interrupt is called a tick. The handler for the timer
interrupt examines what the given CPU was doing right before the
interrupt and accounts the time period between two ticks, in
nanoseconds, to one class of system activity: user, system,
idle, and so on. CPU time, as measured by the Linux kernel, is the
number of nanoseconds accounted by the timer interrupt, in total or
for a given class. It is exposed in the /proc/stat file as a set of class totals since the system booted, for each CPU. To calculate CPU usage, tools like top periodically re-read this file and keep track of how the CPU time changes from one read to the next. CPU usage for a given class is the increase in CPU time, for the class and in the period between the re-reads, divided by the total CPU time increase in that period. For example, on my system
HZ is set to 250 and
top defaults to an
update every 3 seconds, so 750 ticks make up the CPU usage I see at any given time. A value of 10 for us in the system-level summary means that 10% of the ticks in the 3-second interval, 75 out of 750, interrupted a process executing in user mode.
CPU usage, given as a single number without further qualification, is the sum of the CPU time increases in the non-idle system activity classes (user, nice, system, softirq, hardirq) divided by the total increase in the sampling period.
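As a concrete illustration (a hedged sketch, not how top is actually implemented), the single-number CPU usage can be computed from two reads of /proc/stat:

```shell
# Read the aggregate "cpu" line twice, 3 seconds apart, and compute
# non-idle usage as 1 - (idle delta / total delta). Field layout:
# cpu user nice system idle iowait irq softirq steal guest guest_nice
# (here iowait is counted together with idle)
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 3
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
total=$(( (u2+n2+s2+i2+w2+q2+sq2+st2) - (u1+n1+s1+i1+w1+q1+sq1+st1) ))
idle=$(( (i2+w2) - (i1+w1) ))
echo "CPU usage: $(( 100 * (total - idle) / total ))%"
```

Since we only take a ratio of deltas, the units of the raw counters do not matter here.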
Finally, CPU usage of a specific process is the increase in its CPU
time divided by the total increase in CPU time, so that a CPU usage of
50% means that 375 out of the 750 ticks interrupted the process. The
per-process CPU time is tracked by the same kernel code that tracks
the system CPU time and is exposed in
/proc/[PID]/stat as a total
since process start.
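The per-process calculation can be sketched the same way (field positions per proc(5); this is an illustration, not how monitoring tools are implemented):

```shell
# Sample a process's utime+stime (fields 14 and 15 of /proc/[PID]/stat,
# in clock ticks of 1/CLK_TCK seconds) twice, 3 seconds apart
pid=$$   # our own shell, as an example target
t1=$(awk '{print $14 + $15}' "/proc/$pid/stat")
sleep 3
t2=$(awk '{print $14 + $15}' "/proc/$pid/stat")
hz=$(getconf CLK_TCK)
echo "process CPU usage: $(( 100 * (t2 - t1) / (3 * hz) ))%"
```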
When is CPU usage a good metric?
CPU usage, as the name indicates, but as not everyone might have reflected upon, is a utilization metric: it measures whether good use is being made of the CPUs in the system, and interpreted this way it tends to make the most sense. You can use it to tell whether the system could be downgraded without degrading the performance of the services it handles, or alternatively whether it could handle more workload. This requires additional verification; for example, a server handling a low-latency application might need an expensive CPU, even if it is idle most of the time, just to keep the latency low.
There are all kinds of other things you can sometimes infer from CPU usage, but it is better to use a tool dedicated to the job. For example, if you are looking into the performance of a specific program, a profiler will be less likely to produce misleading results than just looking at the CPU usage.
For a specific example of CPU usage being confusing, consider fully saturating the CPU for a period of 1 second. With the 3 second update interval of top, the observed CPU usage will be at most 33.(3)%, due to the usage averaging out over the whole 3 second period. It will be this high only if all of the workload fell into the same update interval, and the value can be as low as 16.(6)% if it equally overlapped two intervals. You can verify this by looking at top -d 3, where the -d 3 makes sure your top uses our example update interval, and by intentionally running the 1 second workload at different times within the interval.
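The original command is not preserved here; a hedged stand-in for the 1-second burst is a busy loop cut off by timeout:

```shell
# Saturate one CPU for exactly 1 second; run this at different offsets
# within top's 3-second update interval and watch the reported usage vary
timeout 1 sh -c 'while :; do :; done'
```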
Averaging the CPU usage over multiple CPUs/cores can create another
layer of confusion. In
top, you can press
1 to get the CPU usages
for individual CPUs, usually making the results much easier to
interpret. If you are configuring a dashboard, consider displaying the
CPU usages for each CPU separately. With a lot of CPUs, displaying the minimum, maximum and standard deviation across CPUs, for each of the user, system, … classes, might be better than just showing averages.
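As a sketch of the idea (assuming a Linux /proc; a real dashboard would use its own collector), the spread of per-CPU usage over one interval can be computed directly from the per-CPU lines of /proc/stat:

```shell
# Two snapshots of the per-CPU lines, 3 seconds apart, then the min/max
# of non-idle usage across CPUs (idle + iowait counted as idle)
grep '^cpu[0-9]' /proc/stat > /tmp/cpu1
sleep 3
grep '^cpu[0-9]' /proc/stat > /tmp/cpu2
paste /tmp/cpu1 /tmp/cpu2 | awk '{
    # fields 2-9: first snapshot counters, fields 13-20: second snapshot
    d = ($13+$14+$15+$16+$17+$18+$19+$20) - ($2+$3+$4+$5+$6+$7+$8+$9)
    if (d <= 0) next
    i = ($16+$17) - ($5+$6)
    u = 100 * (d - i) / d
    if (n++ == 0 || u < min) min = u
    if (n == 1 || u > max) max = u
}
END { printf "usage across %d CPUs: min %.1f%% max %.1f%%\n", n, min, max }'
rm -f /tmp/cpu1 /tmp/cpu2
```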
CPU usage will also completely fail to be useful if there is a dependency of the workload on the timer tick. The CPU could have switched many times between different processes during the period of one tick, could have switched between user mode and system mode multiple times, processes could have migrated between CPUs, and so on. Nevertheless, on each CPU, one process and one class of system activity, which happened to be interrupted by the timer, will be accounted for the whole tick. If the work happens to be done precisely between timer interrupts, Linux will report all the CPU time as idle, despite potentially a nearly fully saturated CPU. This is discussed in the Linux kernel documentation in Documentation/cpu-load.txt, which was created as a result of this LKML conversation.
What do user, system, idle, iowait, etc. mean?
We will now look at each of the 10 CPU time classes the kernel tracks in detail. Monitoring tools universally base their CPU usage reporting on those classes, but they might differ in how they aggregate some of the classes together, so you might want to check how your tool of choice does this.
User and Nice CPU time
User CPU time is accounted when a process is interrupted while executing its own code in user mode, and its niceness is less than or equal to 0. For processes with niceness greater than 0, Nice CPU time is used instead. You can verify this by monitoring each of the three workloads below in top.
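The original three commands are not preserved; these busy loops are plausible stand-ins (the last one needs root, since lowering niceness is privileged):

```shell
# niceness 0: shows up as user time ("us" in top)
timeout 10 sh -c 'while :; do :; done'
# niceness 10: shows up as nice time ("ni" in top)
nice -n 10 timeout 10 sh -c 'while :; do :; done'
# niceness -10 (root): back to user time, since niceness <= 0
sudo nice -n -10 timeout 10 sh -c 'while :; do :; done'
```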
System CPU time
System CPU time is accounted when the kernel is interrupted while executing system calls on behalf of some process. This can happen for as many reasons as there are system calls; one example is heavy memory allocation.
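The author's original test is not preserved; as a substitute that also keeps the kernel busy (copying zeroes through /dev/zero and /dev/null rather than allocating memory), you can run:

```shell
# Most of the work here happens inside read/write system calls, so top
# shows it mostly as system time ("sy") rather than user time
timeout 10 dd if=/dev/zero of=/dev/null bs=1M
```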
To find the process responsible for high system CPU time, you can use pidstat from sysstat, which will show per-process system times. Once you are ready to zoom in on a specific process, you can use perf to see which system calls the process is spending its time in. strace is also a possibility, but it slows down the traced process more than perf does. You can practice using both on the memory allocation test above.
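As a sketch of the invocations meant here (PID stands for the process under investigation, and all three tools must be installed):

```shell
# Per-process user ("%usr") and system ("%system") CPU usage, every 3 s
pidstat 3
# Sample one process for 5 seconds, then browse where the time went
perf record -p "$PID" -- sleep 5 && perf report
# Count and time the system calls the process makes (higher overhead);
# Ctrl-C detaches and prints the summary
strace -c -p "$PID"
```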
Idle CPU time
Idle CPU time is accounted when no runnable process is present in the system. In this case the kernel will actually begin executing a special idle process. This is nicely covered in this article.
Iowait CPU time
Iowait is a form of idle time, accounted when there are no runnable processes, but used instead of idle when there is at least one process that was put to sleep by the kernel due to excessive I/O, as a throttling mechanism. Iowait indicates free CPU time that could be used for something CPU-intensive, but not for something I/O-intensive. High iowait might also indicate excess I/O, but this works only in one direction: high iowait implies high I/O, not vice versa. When you have runnable processes in addition to the processes put to sleep due to I/O, iowait will not be high.
You can see it for yourself by monitoring the two workloads below in top, on an otherwise idle system. Use the number of CPU cores you have in your system instead of 4. The first will create a pure I/O workload, causing high iowait.
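A hedged stand-in for that workload (the original command is not preserved), assuming GNU dd: four parallel synchronous writers.

```shell
# Pick a directory on a real disk; tmpfs (often /tmp) will not generate
# iowait. Replace 4 with your CPU count.
for i in 1 2 3 4; do
  dd if=/dev/zero of="./iowait-test-$i" bs=4k count=25000 oflag=sync &
done
wait
rm -f ./iowait-test-*
```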
The second combines the same I/O workload with additional CPU-intensive threads, causing high user CPU usage but not high iowait.
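Again a sketch under the same assumptions, adding busy loops so that there is always runnable work:

```shell
# Same writers plus CPU-bound busy loops: with runnable work available,
# the time spent waiting on I/O no longer shows up as iowait
for i in 1 2 3 4; do
  dd if=/dev/zero of="./iowait-test-$i" bs=4k count=25000 oflag=sync &
  timeout 30 sh -c 'while :; do :; done' &
done
wait
rm -f ./iowait-test-*
```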
Again, it might be worth looking at the CPU usage of each CPU separately to see what is going on.
The I/O in iowait is of course typically disk I/O. Note however that
as you write to a file using the
write system call, you interact
with the Linux page cache,
rather than directly with the disk. Linux has a configurable threshold of dirty pages: pages in the page cache that were modified in memory but not yet written back to the disk drive. After crossing this threshold, the processes doing the I/O will be put to sleep, and the actual writeout to disk will begin. This is the specific path in the kernel that tends to spike up iowait. This article describes the page cache settings and approaches to tuning them, although often, rather than tuning the cache, you just have to figure out how to reduce the I/O workload of the problematic process. If you have problems identifying which process is responsible, iotop might be of interest.
Irq and softirq CPU time
Hardirq and softirq CPU time is time spent servicing hardware and software interrupts. I cannot think of a workload that would cause either number to be substantial. Software interrupt handlers, and hardware interrupt handlers even more so, are designed to finish as fast as possible, so any substantial amount of time spent here most likely indicates either a hardware problem or a bug in one of the kernel drivers. You can monitor the interrupt counts under /proc/interrupts and /proc/softirqs.
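A quick way to peek at both files (a sketch; the counts are totals since boot, so run it twice to see which lines are increasing):

```shell
# Per-CPU hardware interrupt counts since boot
head -5 /proc/interrupts
# Per-CPU software interrupt counts, same layout
head -5 /proc/softirqs
```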
Guest CPU time
Guest CPU time is only visible on a KVM hypervisor, and counts the time spent running a KVM guest. The kernel distinguishes guest time from “guest nice” time, following the logic of user CPU time and nice CPU time described earlier. In fact, guest CPU time is included in user CPU time, and “guest nice” CPU time in nice CPU time.
Steal CPU time
Steal CPU time is only visible inside a KVM or Xen virtual machine guest, and has to be calculated by the hypervisor. It counts the time when the VM process, in which the steal time is visible, was runnable but waiting in the runqueue of a (real) CPU while the hypervisor was busy executing another process. When this is high for prolonged periods of time, it indicates an overloaded hypervisor. It will not be accounted at all if the VM is idle and not waiting to be executed in the first place, even if the hypervisor is under very heavy load.
How does the kernel track CPU time, in detail?
From this point on we dive directly into the source code of the kernel and of top, to see how all of it really works. These are the details, for the really interested.
In the kernel, the system-wide CPU time counters are stored in a struct called kernel_cpustat, defined in include/linux/kernel_stat.h and accessed throughout the kernel using the kcpustat_cpu macros. The struct just wraps an array called cpustat, with valid indexes given by the cpu_usage_stat enum. The cpustat array tracks CPU time spent on each class of system activity, in nanoseconds. The entries in the cpustat array are subtotals, so that a tick will be accounted to just one of them. CPUTIME_GUEST and CPUTIME_GUEST_NICE are the only exception: those were added later than the rest and are incremented in tandem with CPUTIME_USER and CPUTIME_NICE respectively, to keep the user and nice totals inclusive of guest time. The cpustat values from kernel_cpustat are exposed in the /proc/stat file. Reading the file results in a call to the show_stat function, which basically writes out the contents of the cpustat array for each CPU. The original values of the nanosecond counters also end up being expressed as hundredths of a second. The relevant part of the file looks like this:
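The original listing is not reproduced here; you can inspect your own system's counters instead, assuming a Linux /proc. The aggregate cpu line comes first, followed by one line per CPU:

```shell
# Columns after the cpu label: user nice system idle iowait irq softirq
# steal guest guest_nice, in hundredths of a second (USER_HZ)
grep '^cpu' /proc/stat
```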
The job of something like top is to re-read this file every few seconds, per the configurable update interval, and to keep track of how much each of the counters increased since the previous read. When top presents its results, the difference between the previous and current counter for each CPU time class is divided by the sum of the differences over all the classes, and you get the eight CPU usage values in the summary (us, sy, ni, id, wa, hi, si, st). Other tools do the same thing, but may aggregate some of the numbers together, so that for example CPUTIME_USER and CPUTIME_NICE are added up and displayed as a single value.
On the per-process side, utime counts the nanoseconds spent executing the process directly, in user mode, and stime the time spent running kernel code, like system calls, on behalf of the process. This information is exposed for a given process in /proc/[PID]/stat; the do_task_stat function writes out the contents of this file when it is read. The file itself looks like this:
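The original listing is not preserved; you can look at your own shell's entry instead (field positions per proc(5)):

```shell
# The whole line is one long record; utime is field 14 and stime is
# field 15, both in clock ticks of 1/CLK_TCK seconds
cat "/proc/$$/stat"
awk '{print "utime:", $14, "stime:", $15}' "/proc/$$/stat"
```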
utime and stime, written out here among many other fields, are totals since the process has started.
ps is able to present those numbers, but since it runs instantaneously rather than continuously, it is also only capable of showing the same totals, in a more readable form. For some reason, not many tools will display per-process user and system CPU usage as it evolves over time; e.g. I have not found a way to get this in top, but pidstat from the excellent sysstat package can do so. The getrusage system call can also get you this information.
The system-wide counters in kernel_cpustat and the process-specific ones in task_struct get updated by the same kernel control flow that starts with the timer interrupt handler. Note that the kernel has a host of infrastructure for managing clocks and timers, so it is not trivial to trace through this part of the code. The /proc/timer_list file is helpful here; in it you can see the various clock devices and the timers queued for each CPU.
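For example (the output format varies between kernel versions):

```shell
# Clock event devices and the timers queued on each CPU; the listing is
# long, so just peek at the beginning
head -25 /proc/timer_list
```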
I have to cover here one omission from our discussion of timer ticks: the Linux kernel can be configured at compile-time to stop producing ticks in certain circumstances:
- when the system is idle (CONFIG_NO_HZ_IDLE)
- when the system is idle or there is just one runnable process (CONFIG_NO_HZ_FULL)
This is called a tickless kernel, although it is tickless only in those specific circumstances. Stopping the tick when the system is idle allows the CPU to enter a sleep state and can result in significant power savings, so CONFIG_NO_HZ_IDLE is widely enabled. The NO_HZ options are documented in detail under Documentation/timers/NO_HZ.txt.
A kernel with one of the NO_HZ options enabled will call tick_sched_timer from the timer interrupt handler, while a traditional kernel will call tick_handle_periodic. Both methods end up doing an equivalent of update_process_times(user_mode(get_irq_regs())), which leads to account_process_tick. The value of user_mode(get_irq_regs()) is what decides whether the tick will be counted towards user time or system time. For x86, user_mode is defined in arch/x86/include/asm/ptrace.h and boils down to checking the low bits of the saved CS register.
The check relies on the fact that kernel code executes in CPU
protection ring 0,
and user code in ring 3. The CS register tells you the ring you are
currently in, and
regs is a struct that stores the register values
from right before the timer interrupt. The details of how the rings
work are explained in more detail in this article.
We will cover the relevant parts from top to bottom. The kernel accounts CPUTIME_STEAL in the steal_account_process_time method, which uses the generic paravirtualization interface of pv_time_ops to obtain the steal time. In the case of KVM, the steal_clock entry of pv_time_ops is a pointer to kvm_steal_clock, and the steal time is updated, on the hypervisor side, by record_steal_time. record_steal_time uses the run_delay field of struct sched_info, which is a member of task_struct. run_delay stores the total time the process was waiting for execution in the CPU runqueue, and it is maintained for all processes in the system, not just for this purpose. In the case of Xen, the role of the kernel is reduced to reading, from the guest, data that is maintained by the Xen hypervisor; see e.g. xen_steal_clock.
Moving on through account_process_tick: if the process is running in user mode, account_user_time is called, which based on task_nice(p) > 0 will decide between CPUTIME_USER and CPUTIME_NICE. If the process is running in system mode, we are left with the branch that distinguishes system time from idle time. In it, rq represents the CPU run queue of the current CPU, the one whose timer interrupt we are handling here. The struct is defined in kernel/sched/sched.h, and its idle field points to the task_struct of the idle process.
(irq_count() != HARDIRQ_OFFSET) checks if we are currently handling an interrupt other than the timer tick itself. irq_count() is defined in the preemption counting headers and is based on the value of preempt_count. preempt_count packs three different counters and a flag into the bits of one int. irq_count strips off the bits not related to counting interrupts, but does not shift away the offset that one level of hardware interrupt nesting contributes, so a count of exactly one hardware irq, and nothing else, is equal to HARDIRQ_OFFSET. Since the timer interrupt we are handling is itself that one hardware irq, irq_count() != HARDIRQ_OFFSET ends up testing whether any other interrupt is being processed at the moment.
All in all, if we are in system mode, and either the current process is not the idle process or we are processing an interrupt, we end up in account_system_time. account_system_time mostly branches on the preempt_count-based counters we just discussed, in_serving_softirq being one of them. The PF_VCPU flag, which is used as the condition for accounting guest time, is set by KVM right before it enters guest mode.
Finally, if the current process is the idle process and we are not servicing an interrupt, we end up in account_idle_time, which accounts the tick as iowait if rq->nr_iowait is positive and as idle otherwise. rq->nr_iowait is maintained by the scheduler based on the in_iowait flag in task_struct. This flag is set in other parts of the kernel via io_schedule and io_schedule_timeout, which are called in quite a few places, mostly related to disk I/O. For the previously mentioned writeback of dirty pages, it is used near the throttling code in balance_dirty_pages.
That covers all the counters and concludes the article; I hope you have learnt something useful reading it :)
Other than the resources linked throughout the text, the Linux Kernel Development book by Robert Love was very useful in understanding various kernel subsystems, along with the linux-insides git book. Brendan Gregg has a ton of great resources on performance, and I certainly used one thing or another in this article that I learnt from reading his various articles.