CPU usage is among the most popular performance metrics, but it can be quite misleading if you do not know how it is measured, doubly so in the case of its detailed components: user, system, idle, iowait, etc. Below I try to explain how the Linux kernel measures it and how to interpret it correctly.

Here are the numbers we will be discussing, as presented by top, which will act as our example of how Linux monitoring tools arrive at the different kinds of CPU usage:

[Screenshot of top's output]

What is CPU usage? What is CPU time?

CPU usage is derived from CPU time, which on modern x86 Linux systems is measured using the local APIC timer present on each CPU core. The kernel programs the timer to produce an interrupt at a specific frequency, which is configurable at compile time and accessible in the kernel code as the HZ constant. Each occurrence of the timer interrupt is called a tick. The handler for the timer interrupt examines what the given CPU was doing right before the interrupt and accounts the time period between two ticks, in nanoseconds, to one class of system activity: user, system, idle, and so on. CPU time, as measured by the Linux kernel, is the number of nanoseconds accounted by the timer interrupt, in total or for a given class. It is exposed in /proc/stat as a set of class totals since the system booted, for each CPU.
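
On many distributions the kernel build configuration is shipped in /boot, so you can check the HZ value your kernel was compiled with. Note that this path is an assumption; your distribution may keep the config elsewhere, for example in /proc/config.gz:

# Assumes the kernel config is available under /boot
grep 'CONFIG_HZ=' /boot/config-$(uname -r)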

To calculate CPU usage, top, htop, telegraf, collectd, etc. periodically re-read this file and keep track of how the CPU time changes from one read to the next. CPU usage for a given class is the increase in CPU time for that class, in the period between the re-reads, divided by the total CPU time increase in this period. For example, on my system HZ is set to 250 and top defaults to an update every 3 seconds, so 750 ticks make up the CPU usage I see at any given time. A value of 10 for us in the system-level summary means that 10% of the ticks in the 3 second interval, 75 out of the 750, interrupted a process executing in user mode.

CPU usage, given as a single number without further qualification, is the sum of the CPU time increases in the non-idle system activity classes (user, nice, system, softirq, hardirq) divided by the total increase in the sampling period.
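
To make the arithmetic concrete, here is a minimal sketch of that calculation, assuming top's default 3 second interval. The fields of the aggregate cpu line in /proc/stat are user, nice, system, idle, iowait, irq, softirq and steal, followed by the guest fields:

# Read the aggregate "cpu" line of /proc/stat twice, 3 seconds apart,
# and divide the increase in the non-idle classes by the total increase
snap() { awk '/^cpu /{print $2+$3+$4+$7+$8, $2+$3+$4+$5+$6+$7+$8}' /proc/stat; }
set -- $(snap); busy1=$1; total1=$2
sleep 3
set -- $(snap); busy2=$1; total2=$2
echo "CPU usage: $(( (busy2 - busy1) * 100 / (total2 - total1) ))%"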

Finally, the CPU usage of a specific process is the increase in its CPU time divided by the total increase in CPU time, so that a CPU usage of 50% means that 375 out of the 750 ticks interrupted the process. The per-process CPU time is tracked by the same kernel code that tracks the system CPU time and is exposed in /proc/[PID]/stat as a total since process start.
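
A rough sketch of the per-process version, assuming a USER_HZ of 100, which is the de facto value on Linux (verify with getconf CLK_TCK):

# utime and stime are fields 14 and 15 of /proc/[PID]/stat, in USER_HZ
# units. The sed strips the pid and comm fields first, since comm may
# contain spaces; utime and stime then land in fields 12 and 13.
pid=$1
t1=$(sed 's/^.*) //' "/proc/$pid/stat" | awk '{print $12 + $13}')
sleep 3
t2=$(sed 's/^.*) //' "/proc/$pid/stat" | awk '{print $12 + $13}')
# 100 ticks per second over 3 seconds is 300 ticks, i.e. one CPU's worth
echo "CPU usage of $pid: $(( (t2 - t1) * 100 / 300 ))%"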

When is CPU usage a good metric?

CPU usage, as the name indicates, but as not everyone might have reflected upon, is a utilization metric: it measures whether good use is being made of the CPUs in the system, and interpreted this way it tends to make the most sense. You can use it to tell whether the system could be downgraded without degrading the performance of the services it handles, or alternatively whether it could handle more workload. This requires additional verification: for example, a server handling a low-latency application might need an expensive CPU, even if it is idle for most of the time, just to keep the latency low.

There are all kinds of other things you can sometimes infer from CPU usage, but it is better to use a tool dedicated to the job. For example, if you are looking into the performance of a specific program, a profiler will be less likely to produce misleading results than just looking at the CPU usage.

For a specific example of CPU usage being confusing, consider fully saturating the CPU for a period of 1 second. With the 3 second update interval of top, the observed CPU usage will be at most 33.(3)%, due to the usage averaging out over the whole 3 second period. It will be this high only if all of the workload fell into the same update interval, and the value can be as low as 16.(6)% if it overlapped two intervals equally. You can verify this by looking at top -d 3, where the -d 3 makes sure your top uses our example update interval, and by trying to intentionally run stress (manpage, homepage and sources) at different times within the interval:

# Generate a CPU spike by calculating sqrt() of random numbers for 1
# second
stress -c 1 -t 1

Averaging the CPU usage over multiple CPUs/cores can create another layer of confusion. In top, you can press 1 to get the CPU usage of each individual CPU, which usually makes the results much easier to interpret. If you are configuring a dashboard, consider displaying the CPU usage of each CPU separately. With a lot of CPUs, displaying the minimum, maximum and standard deviation across CPUs, for each of the user, system, … classes, might be better than just showing averages.
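
If you have the sysstat package installed, mpstat will print the same per-CPU breakdown on the command line, which also makes a convenient data source for dashboards:

# CPU usage of each CPU, for each class, updated every 3 seconds
mpstat -P ALL 3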

CPU usage will also completely fail to be useful if the workload is synchronized with the timer tick. The CPU could have switched many times between different processes during the period of one tick, could have switched between user mode and system mode multiple times, processes could have migrated between CPUs, and so on. Nevertheless, on each CPU, the one process and one class of system activity that happened to be interrupted by the timer will be accounted for the whole tick. If the work happens to be done precisely between timer interrupts, Linux will report all the CPU time as idle, despite a potentially nearly fully saturated CPU. This is discussed in the Linux kernel documentation in Documentation/cpu-load.txt, which was created as a result of this LKML conversation.

What do user, system, idle, iowait, etc. mean?

We will now look at each of the 10 CPU time classes the kernel tracks in detail. Monitoring tools universally base their CPU usage reporting on those classes, but they might differ in how they aggregate some of the classes together, so you might want to check how your tool of choice does this.

User and Nice CPU time

User CPU time is accounted when a process is interrupted while executing its own code in user mode and its niceness is less than or equal to 0. For processes with niceness greater than 0, Nice CPU time is used instead. You can verify this by monitoring each of the three workloads below in top:

# stress -c computes sqrt() of random numbers in an infinite loop
nice -n -1 stress -c 1 # High 'us' in top, kill with Ctrl-C
nice -n 0 stress -c 1 # High 'us' in top
nice -n 1 stress -c 1 # High 'ni' in top

System CPU time

System CPU time is accounted when the kernel is interrupted while executing system calls on behalf of some process. This can happen for as many reasons as there are system calls, but one example is heavy memory allocation, as illustrated by this test:

# stress -m will malloc() memory, write over it, then free() it
stress -m 4

To find the process responsible for high system CPU time, you can use pidstat from sysstat, which will show per-process user and system times. Once you are ready to zoom in on a specific process, you can use perf to see which system calls the process is spending its time in. strace is also a possibility, but it slows down the traced process more than perf does. You can practice using both on stress:

# Example 1:
# Run stress for 2 seconds, instrumented by perf from the start.
# It will display a summary of how much time was spent on, and how many
# calls were made to, each syscall:
sudo perf trace -s stress -m 1 -t 2

# Example 1, strace version:
sudo strace -f -c stress -m 1 -t 2

# Example 2:
# Run stress in background, attach perf to running process, display
# summary after Ctrl-C:
stress -m 1 &
sudo perf trace -s -p `pgrep -d, stress`

# Example 2, strace version:
stress -m 1 &
sudo strace -c -p `pgrep -d, stress`

Idle CPU time

Idle CPU time is accounted when no runnable process is present in the system. In this case the kernel will actually begin executing a special idle process. This is nicely covered in this article.

Iowait CPU time

Iowait is a form of idle time: it is accounted when there are no runnable processes, but used instead of idle when at least one process was put to sleep by the kernel due to excessive I/O, as a throttling mechanism. Iowait indicates free CPU time that could be used for something CPU-intensive, but not for something I/O-intensive. High iowait might also indicate excess I/O, but this works only in one direction: high iowait implies high I/O, not vice versa. When you have runnable processes in addition to the processes put to sleep due to I/O, iowait will not be high.

You can see it for yourself by monitoring the two workloads below in top, on an otherwise idle system. Use the number of CPU cores you have in your system instead of 4. This will create a pure I/O workload, causing high iowait:

# stress -d will write() random data to a temporary file
stress -d 4

This will combine the same I/O workload with additional CPU-intensive threads, causing high user CPU usage but not high iowait:

stress -d 4 -c 4

Again, it might be worth looking at the CPU usage of each CPU separately to see what is going on.

The I/O in iowait is of course typically disk I/O. Note however that when you write to a file using the write system call, you interact with the Linux page cache, rather than directly with the disk. Linux has a configurable threshold of dirty pages: pages in the page cache that were modified in memory but not yet written back to the disk drive. After crossing this threshold, the processes doing the I/O will be put to sleep, and the actual write-out to disk will begin. This is the specific path in the kernel that tends to spike up iowait. This article describes the page cache settings and approaches to tuning them, although often, rather than tuning the cache, you just have to figure out how to reduce the I/O workload of the problematic process. If you have problems identifying which process is responsible, iotop might be of interest.
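
You can inspect the thresholds and watch the writeback happen while stress -d runs, for example:

# The dirty page thresholds, as a percentage of available memory
sysctl vm.dirty_background_ratio vm.dirty_ratio

# The amount of dirty memory, and of memory being written back right now
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'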

Irq and softirq CPU time

Hardirq and softirq CPU time count the time spent servicing hardware and software interrupts, respectively. I cannot think of a workload that would cause one or the other number to be substantial. Software interrupt handlers, and hardware interrupt handlers even more so, are designed to finish as fast as possible, so any substantial amount of time spent here most likely indicates either a hardware problem or a bug in one of the kernel drivers. You can monitor the IRQ counts like this:

watch -n 1 cat /proc/interrupts
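
Softirq counts are exposed in a separate file, broken down by softirq type and CPU:

watch -n 1 cat /proc/softirqs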

Guest CPU time

Guest CPU time is only visible on a KVM hypervisor and counts the time spent running a KVM guest. The kernel distinguishes guest time from “guest nice” time, following the logic of user CPU time and nice CPU time described earlier. In fact, guest CPU time is included in user CPU time, and “guest nice” CPU time in nice CPU time.

Steal CPU time

Steal CPU time is only visible inside a KVM or Xen virtual machine guest, but has to be calculated by the hypervisor. It counts the time during which the VM process in which the steal time is visible was runnable, but was waiting in the runqueue of a (real) CPU while the hypervisor was busy executing another process. When this is high for prolonged periods of time, it indicates an overloaded hypervisor. It will not be accounted at all if the VM is idle and not waiting to be executed in the first place, even if the hypervisor is under very heavy load.
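
Inside the guest, steal shows up as the st column of vmstat and in the %Cpu(s) summary line of top, so a quick way to keep an eye on it is:

# The last column, st, is the steal time; run this inside the guest
vmstat 3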

How does the kernel track CPU time, in detail?

From this point on we dive directly into the source code of the kernel and top, to see how all of it really works. These are details, for the really interested.

In the kernel, the system-wide CPU time counters are stored in a struct called kernel_cpustat, defined in include/linux/kernel_stat.h, and accessed throughout the kernel using the kcpustat_this_cpu and kcpustat_cpu macros. The struct just wraps an array called cpustat, with valid indexes given by the cpu_usage_stat enum:

enum cpu_usage_stat {
	CPUTIME_USER,
	CPUTIME_NICE,
	CPUTIME_SYSTEM,
	CPUTIME_SOFTIRQ,
	CPUTIME_IRQ,
	CPUTIME_IDLE,
	CPUTIME_IOWAIT,
	CPUTIME_STEAL,
	CPUTIME_GUEST,
	CPUTIME_GUEST_NICE,
	NR_STATS,
};

struct kernel_cpustat {
	u64 cpustat[NR_STATS];
};

struct kernel_stat {
	unsigned long irqs_sum;
	unsigned int softirqs[NR_SOFTIRQS];
};

DECLARE_PER_CPU(struct kernel_stat, kstat);
DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);

/* Must have preemption disabled for this to be meaningful. */
#define kstat_this_cpu this_cpu_ptr(&kstat)
#define kcpustat_this_cpu this_cpu_ptr(&kernel_cpustat)
#define kstat_cpu(cpu) per_cpu(kstat, cpu)
#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)

The cpustat array tracks CPU time spent on each class of system activity, in nanoseconds. The entries in the cpustat array are subtotals, so that a tick will be accounted to just one class. CPUTIME_GUEST and CPUTIME_GUEST_NICE are the only exception: they were added later than the rest and are incremented in tandem with CPUTIME_USER and CPUTIME_NICE respectively, to keep backwards compatibility.

The cpustat values from kernel_cpustat are exposed in the /proc/stat file. Reading the file results in a call to the show_stat function, which basically writes out the contents of the cpustat array for each CPU. The original nanosecond counters end up being expressed as hundredths of a second in the process. The relevant part of the file looks like this:

cpu  135529 460 28104 590696 891 0 697 0 0 0
cpu0 34607 54 7232 361747 708 0 264 0 0 0
cpu1 32505 206 6374 76775 66 0 77 0 0 0
cpu2 35986 105 7876 75509 56 0 278 0 0 0
cpu3 32429 93 6622 76664 60 0 77 0 0 0

The job of something like top is to re-read this file every few seconds, per the configurable update interval, and to keep track of how much each of the counters increased since the previous read. top does this in the cpus_refresh method. When the results are presented in the summary_hlp method, the difference between the previous and current counter for each CPU time class is divided by the sum of the differences over all the classes, and you get the eight CPU usage percentages. vmstat does the same thing, but aggregates some of the numbers together, so that for example CPUTIME_USER and CPUTIME_NICE are added up and presented as us.

Analogously, per-process CPU time is stored in the task_struct, the kernel's main representation of a process, in the utime and stime fields:

struct task_struct {
    ...
    u64 utime;
    u64 stime; 
    ...
}

utime counts the nanoseconds spent executing the process directly, in user mode; stime counts the time spent running kernel code, like system calls, on behalf of the process. This information is exposed for a given PID in /proc/[PID]/stat. do_task_stat is the method that writes out the contents of this file when it is read. The file itself looks like this:

1 (systemd) S 0 1 1 0 -1 4194560 246566 5973817 82 2458 328 818 134760 23461 20 0 1 0 2 225857536 2207 18446744073709551615 1 1 0 0 0 0 671173123 4096 1260 0 0 0 17 0 0 0 2877980 0 0 0 0 0 0 0 0 0 0

utime and stime, written out here among many other fields, are totals since the process started. ps is able to present these numbers, but since it runs instantaneously rather than continuously, it is also only capable of showing the same totals in a more readable form. For some reason, not many tools will display per-process user and system CPU usage as it evolves in time; for example, I have not found a way to get this in top, but pidstat from the excellent sysstat package can do so. The getrusage system call can also get you this information.
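
For example, to follow the user and system CPU usage of the stress workloads from earlier, with pgrep -d, building the comma-separated PID list that pidstat expects:

# Per-process %usr and %system, updated every 3 seconds
pidstat -u -p $(pgrep -d, stress) 3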

The system-wide counters in kernel_cpustat and the process-specific ones in task_struct get updated by the same kernel control flow, which starts with the timer interrupt handler. Note that the kernel has a host of infrastructure for managing clocks and timers, so it is not trivial to trace through this part of the code. /proc/timer_list is helpful here; in it you can see the various clock devices and the timers queued for each CPU.

There is one omission from our discussion of timer ticks that I have to cover: the Linux kernel can be configured at compile time to stop producing ticks in certain circumstances:

  • when the system is idle (CONFIG_NO_HZ_IDLE)
  • when the system is idle or there is just one runnable process (CONFIG_NO_HZ_FULL)

This is called a tickless kernel, although it is tickless only in those specific circumstances. Stopping the tick when the system is idle allows the CPU to enter a sleep state and can result in significant power savings, so CONFIG_NO_HZ_IDLE is widely enabled. The NO_HZ options are documented in detail under Documentation/timers/NO_HZ.txt.
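
Using the same /boot/config trick as before, again assuming your distribution ships the kernel config there, you can check which variant your kernel was built with:

grep 'CONFIG_NO_HZ' /boot/config-$(uname -r)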

A kernel with one of the NO_HZ options enabled will call tick_sched_handle from the timer interrupt handler, while a traditional kernel will call tick_periodic. Both methods end up doing the equivalent of this:

update_process_times(user_mode(get_irq_regs()));

The value of user_mode(get_irq_regs()) is what decides whether the tick will be counted towards user time or system time. For x86, user_mode is defined in arch/x86/include/asm/ptrace.h, and looks like this:

static inline int user_mode(struct pt_regs *regs)
{
#ifdef CONFIG_X86_32
	return ((regs->cs & SEGMENT_RPL_MASK) | (regs->flags & X86_VM_MASK)) >= USER_RPL;
#else
	return !!(regs->cs & 3);
#endif
}

The check relies on the fact that kernel code executes in CPU protection ring 0 and user code in ring 3. The CS register tells you the ring you are currently in, and regs is a struct that stores the register values from right before the timer interrupt. How the rings work is explained in more detail in this article.

update_process_times calls account_process_tick, with user_tick holding the result of the user_mode() check we discussed:

/*
 * Account a single tick of cpu time.
 * @p: the process that the cpu time gets accounted to
 * @user_tick: indicates if the tick is a user or a system tick
 */
void account_process_tick(struct task_struct *p, int user_tick)
{
	u64 cputime, steal;
	struct rq *rq = this_rq();

	if (vtime_accounting_cpu_enabled())
		return;

	if (sched_clock_irqtime) {
		irqtime_account_process_tick(p, user_tick, rq, 1);
		return;
	}

	cputime = TICK_NSEC;
	steal = steal_account_process_time(ULONG_MAX);

	if (steal >= cputime)
		return;

	cputime -= steal;

	if (user_tick)
		account_user_time(p, cputime);
	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
		account_system_time(p, HARDIRQ_OFFSET, cputime);
	else
		account_idle_time(cputime);
}

We will cover the relevant parts from top to bottom. steal_account_process_time updates CPUTIME_STEAL, in the guest, using the paravirt_steal_clock method, which goes through the generic paravirtualization interface, pv_time_ops, to obtain the result. In the case of KVM, steal_clock is a pointer to kvm_steal_clock, and the steal time is updated, on the hypervisor, by vcpu_enter_guest via record_steal_time. record_steal_time uses the run_delay field of the sched_info struct, which is a member of task_struct. run_delay stores the total time the process spent waiting for execution in a CPU runqueue; it is maintained for all processes in the system, not just for this purpose. In the case of Xen, the role of the kernel is reduced to reading, from the guest, data that is maintained by the Xen hypervisor, see e.g. xen_steal_clock.

Moving on through account_process_tick: if the process is running in user mode, account_user_time is called, which, based on task_nice(p) > 0, decides between accounting the tick to CPUTIME_NICE or CPUTIME_USER:

/*
 * Account user cpu time to a process.
 * @p: the process that the cpu time gets accounted to
 * @cputime: the cpu time spent in user space since the last update
 */
void account_user_time(struct task_struct *p, u64 cputime)
{
	int index;

	/* Add user time to process. */
	p->utime += cputime;
	account_group_user_time(p, cputime);

	index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;

	/* Add user time to cpustat. */
	task_group_account_field(p, index, cputime);

	/* Account for user time used */
	acct_account_cputime(p);
}

If the process is running in system mode, we are left with the condition in account_process_tick:

if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))

rq represents the run queue of the current CPU, the one whose timer interrupt we are handling here. The rq struct is defined in kernel/sched/sched.h. rq->idle is the task_struct of the idle process.

(irq_count() != HARDIRQ_OFFSET) checks whether we are currently handling an interrupt other than the timer interrupt itself. irq_count() is defined in include/linux/preempt.h and is based on the value of preempt_count, defined in include/asm-generic/preempt.h. preempt_count packs three different counters and a flag into the bits of one int, and irq_count strips off the bits not related to counting interrupts. Since account_process_tick runs inside the timer's own hardirq handler, which contributes exactly HARDIRQ_OFFSET to the count, irq_count() != HARDIRQ_OFFSET ends up testing whether the tick arrived while some other interrupt, hard or soft, was being serviced.

All in all, if we are in system mode, and either the current process is not the idle process or we are processing an interrupt, we end up in account_system_time:

/*
 * Account system cpu time to a process.
 * @p: the process that the cpu time gets accounted to
 * @hardirq_offset: the offset to subtract from hardirq_count()
 * @cputime: the cpu time spent in kernel space since the last update
 */
void account_system_time(struct task_struct *p, int hardirq_offset, u64 cputime)
{
	int index;

	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
		account_guest_time(p, cputime);
		return;
	}

	if (hardirq_count() - hardirq_offset)
		index = CPUTIME_IRQ;
	else if (in_serving_softirq())
		index = CPUTIME_SOFTIRQ;
	else
		index = CPUTIME_SYSTEM;

	account_system_index_time(p, cputime, index);
}

account_system_time mostly relies on the preempt_count-based counters we just discussed, with in_serving_softirq also being part of include/linux/preempt.h. The PF_VCPU flag, which is used as the condition for accounting guest time, is set by vcpu_enter_guest via guest_enter_irqoff.

Finally, if the current process is the idle process and we are not servicing an interrupt, we end up in account_idle_time:

/*
 * Account for idle time.
 * @cputime: the cpu time spent in idle wait
 */
void account_idle_time(u64 cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;
	struct rq *rq = this_rq();

	if (atomic_read(&rq->nr_iowait) > 0)
		cpustat[CPUTIME_IOWAIT] += cputime;
	else
		cpustat[CPUTIME_IDLE] += cputime;
}

rq->nr_iowait is maintained by the scheduler based on the in_iowait flag in task_struct. This flag is set in io_schedule_prepare, used via helpers like io_schedule_timeout in quite a few places in the kernel, mostly related to disk I/O. For the previously mentioned writeback of dirty pages, it is used near the end of balance_dirty_pages.

Having covered all the counters, this concludes the article. I hope you have learned something useful reading it :)

References

Other than the resources linked throughout the text, the Linux Kernel Development book by Robert Love was very useful in understanding various kernel subsystems, along with the linux-insides git book. Brendan Gregg has a ton of great resources on performance, and I certainly used one thing or another in this article that I learned from reading his various articles.