Linux Troubleshooting – Why Is the Server So Slow? (Running Out of CPU, RAM, and Disk I/O)

One of the most common problems you will face on a system is that it is slow to the point of being unresponsive. This is often caused by network issues, but in this guide we will discuss some local troubleshooting steps you can use to tell the difference between a problem on the network and a problem on the machine itself.

When a machine is sluggish, it is often because you have consumed all of a particular resource on the system. The main system resources are CPU, RAM, disk I/O, and network. Overuse of any of these resources can bog a system down to the point that often the only recourse is a quick reboot. If you can still log in to the system, there are a number of tools you can use to identify the root cause.

System Load

System load average is probably the most fundamental metric to start from when troubleshooting a sluggish system. One of the first commands I usually run when I troubleshoot a slow system is uptime.

$ uptime
13:35:03 up 105 days, 10 min, 4 users, load average: 2.01, 20.15, 15.07

The three numbers after load average (2.01, 20.15, and 15.07) represent the 1-, 5-, and 15-minute load averages on the machine, respectively. A system's load average is the average number of processes in a runnable or uninterruptible state. Runnable processes are either using the CPU or waiting for it, and uninterruptible processes are waiting for I/O.
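If you want to use these numbers in a script, the following is a minimal sketch that pulls the three averages out of a line of uptime output and compares the 1-minute figure with the 15-minute one. The function names load_averages and load_trend are made up for this illustration, and the pattern assumes the English "load average:" label with a dot as the decimal separator.

```shell
# Extract the three load averages from a line of uptime output.
# Reads the line on stdin and prints "1min 5min 15min".
load_averages() {
    sed -n 's/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}

# Compare the 1-minute and 15-minute figures to see which way load is heading.
load_trend() {
    awk '{ if ($1 > $3) print "rising"; else if ($1 < $3) print "falling"; else print "steady" }'
}

# Sample line taken from the uptime output above.
sample='13:35:03 up 105 days, 10 min, 4 users, load average: 2.01, 20.15, 15.07'
printf '%s\n' "$sample" | load_averages                # prints: 2.01 20.15 15.07
printf '%s\n' "$sample" | load_averages | load_trend   # prints: falling

# On a live system:  uptime | load_averages | load_trend
```

In the sample, the 1-minute average is far below the 5- and 15-minute averages, which suggests the load spiked earlier and is now subsiding.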

A single-CPU system with a load average of 1 means that the single CPU is under constant load. If that single-CPU system has a load average of 3, there is three times as much load on the system as it can handle, so two out of every three processes are waiting for resources. The load average reported on a system is not adjusted for the number of CPUs you have, so on a two-CPU system with a load average of 1, one of your two CPUs is loaded at all times; that is, the system is 50% loaded. A load of 1 on a single-CPU system is therefore the same as a load of 4 on a four-CPU system in terms of the amount of available resources.
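The arithmetic in this paragraph can be sketched as a small helper that divides the load average by the number of CPUs. The name load_per_cpu is hypothetical; a result of 1.00 means the CPUs are, on average, fully occupied.

```shell
# Print the load average divided by the CPU count, to two decimal places.
# $1 = load average, $2 = number of CPUs
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

load_per_cpu 1 1    # prints 1.00  (single CPU under constant load)
load_per_cpu 3 1    # prints 3.00  (three times what one CPU can handle)
load_per_cpu 1 2    # prints 0.50  (two CPUs, 50% loaded)
load_per_cpu 4 4    # prints 1.00  (same headroom as load 1 on one CPU)

# On a live Linux system (assuming /proc/loadavg and nproc are available):
#   load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```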

What is a High Load Average?

A fair question to ask is when a load average should be considered high. The short answer is “It depends on what is causing it.” Because the load describes the average number of active processes that are using resources, a spike in load could mean a few things. It is important to determine whether the load is CPU-bound (processes waiting for CPU resources), RAM-bound (specifically, high RAM usage that has moved into swap), or I/O-bound (processes fighting for disk or network I/O).

Typically, systems seem to be more responsive under CPU-bound load than under I/O-bound load. We have seen systems with CPU-bound loads in the hundreds on which we could still run diagnostic tools with good response times. On the other hand, we have seen systems with relatively low I/O-bound loads on which just logging in took minutes because the disk I/O was completely saturated. A system that runs out of RAM often appears to have I/O-bound load: once the system starts using swap storage on the disk, swapping consumes disk resources and can cause processes to slow to a halt.

Diagnose Load Problems with top Command

One of the first tools I reach for to diagnose high load is top. When you type top on the command line and press Enter, you will see a lot of system information all at once. This data updates continually so that you see live information on the system, including system uptime, the load average, how many total processes are running on the system, how much memory you have (total, used, and free), and finally a list of processes on the system and how many resources each one is using. You probably won’t be able to see every process that is currently running on your system with top because they won’t all fit on the screen. By default, top sorts the processes according to how much CPU they use.

What if you notice a process consuming all of your CPU and you need to kill it? The first column for processes in top output is labeled PID (process ID) and shows a program’s process ID, a unique number assigned to every process on a system. To kill a process, press the k key and then type the PID of the process you wish to kill. Then press Enter when prompted to kill it with signal 15 (SIGTERM).

The top command runs in interactive mode by default, which is fine unless you want to view information that doesn’t fit on the screen. If you want to view the full output of top or redirect it to a file, you can run it in batch mode. The -b option selects batch mode, and the -n option lets you control how many times top will update before it exits.

$ top -b -n 1 > top_output

If you need to view the top output and save it to a file at the same time, you can use the handy command-line tool tee:

$ top -b -n 1 | tee top_output
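Once the batch output is saved, you often only care about the first few process rows. The sketch below prints the first N rows that follow the PID header line; the helper name top_rows is an assumption made for this example, and it relies on the header line starting with PID, as in the output shown in this guide.

```shell
# Print the first N process rows that follow the PID header in top batch output.
# Reads top output on stdin; $1 = number of rows to print (default 5).
top_rows() {
    awk -v n="${1:-5}" '
        seen && count < n && NF { print; count++ }   # rows after the header
        /^ *PID/ { seen = 1 }                        # header found, start printing
    '
}

# Abbreviated sample based on the top output in this guide;
# this prints the mysqld and nagios2db_status rows.
top_rows 2 <<'EOF'
top - 14:08:25 up 40 days, 8:02, 1 user, load average: 1.71, 1.77, 1.68

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 9463 mysql     16   0  687m 111m 3328 S   53  5.5  56:17.64 mysqld
18749 nagios    16   0  141m 134m 1868 S   12  6.6   1345:01 nagios2db_status
24636 nagios    17   0 34661  10m  712 S    8  0.5   1195:15 nagios
EOF

# With a saved file:  top_rows 3 < top_output
```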

Make Sense of top Command Output

When you use the top command to diagnose load, the basic steps are to examine the top output to identify which resource you are running out of (CPU, RAM, or disk I/O). Once you have figured that out, you can try to identify which processes are consuming that resource the most.

top - 14:08:25 up 40 days, 8:02, 1 user, load average: 1.71, 1.77, 1.68
Tasks: 107 total, 4 running, 104 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.4%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1024176k total, 997407k used, 26768k free, 85520k buffers
Swap: 1004052k total, 4361k used, 999692k free, 286040k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 9463 mysql     16   0  687m 111m 3328 S   53  5.5  56:17.64 mysqld
18749 nagios    16   0  141m 134m 1868 S   12  6.6   1345:01 nagios2db_status
24636 nagios    17   0 34661  10m  712 S    8  0.5   1195:15 nagios
22442 nagios    24   0  6047 2024 1452 S    8  0.1   0:00.04 check_time.pl

The first line of output is the same as what you see from the uptime command. As you can see, the machine isn’t too heavily loaded for a four-CPU machine:

top - 14:08:25 up 40 days, 8:02, 1 user, load average: 1.71, 1.77, 1.68

The top command also provides metrics beyond the standard system load. For example, the Cpu(s) line shows what the CPUs are currently doing:

Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.4%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st

In the previous example, you saw that the system is over 50% idle, which roughly matches a load of about 1.7 on a four-CPU system. When you diagnose a slow system, one of the first values you should look at is I/O wait (the wa field) so you can rule out disk I/O.

  • If I/O wait is low, then you can look at the idle percentage.
  • If I/O wait is high, then the next step is to diagnose what is causing the high disk I/O, which we will cover momentarily.
  • If I/O wait and idle time are both low, then you will likely see a high user time percentage, so you must diagnose what is causing it.
  • If I/O wait is low and the idle percentage is high, then you know any sluggishness is not caused by CPU resources and you will have to start troubleshooting elsewhere.
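The checks in this list can be sketched as a small function that reads a Cpu(s) line and reports which branch applies. The function name classify_cpu_line and the thresholds (I/O wait of 10% or more counts as high, idle of 20% or less counts as low) are illustrative assumptions, not values from top itself, and the parsing assumes the "11.4%us, 29.6%sy, ..." layout shown above; newer versions of top format this line differently.

```shell
# Classify a top "Cpu(s):" line by its I/O wait (wa) and idle (id) percentages.
# $1 = the Cpu(s) line, $2 = high I/O wait threshold, $3 = low idle threshold
classify_cpu_line() {
    printf '%s\n' "$1" | awk -v wa_high="${2:-10}" -v idle_low="${3:-20}" '
    {
        sub(/^[^:]*: */, "")              # drop the "Cpu(s):" label
        n = split($0, parts, /, */)
        for (i = 1; i <= n; i++) {
            split(parts[i], kv, "%")      # "11.4%us" -> kv[1]="11.4", kv[2]="us"
            pct[kv[2]] = kv[1] + 0
        }
        if (pct["wa"] >= wa_high)
            print "high I/O wait: investigate disk I/O"
        else if (pct["id"] <= idle_low)
            print "low idle, low I/O wait: investigate CPU-hungry processes"
        else
            print "CPU and I/O wait look fine: look elsewhere"
    }'
}

# Sample line from the top output above.
classify_cpu_line 'Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.4%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st'
# prints: CPU and I/O wait look fine: look elsewhere
```

Tune the two thresholds to what is normal for your own servers; they are passed as the optional second and third arguments.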

Diagnose High User Time

A common task is diagnosing high load caused by a high percentage of user CPU time. This is the most common situation, since the services on your machine account for the bulk of the system load and they run as user processes. If you observe high user CPU time but low I/O wait, you need to identify which processes on the system are consuming the most CPU. By default, top sorts all of the processes by their CPU usage:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 9463 mysql     16   0  687m 111m 3328 S   54  5.5  59:17.64 mysqld
18749 nagios     1   0  141m 134m 1868 S   13  6.6   1345:01 nagios2db_status
24636 nagios    17   0 34661  10m  712 S    8  0.5   1195:15 nagios
22442 nagios    24   0  6047 2024 1452 S    8  0.1   0:00.04 check_time.pl

In this example, the mysqld process is consuming 54% of the CPU and the nagios2db_status process is consuming 13%. Note that this is the percentage of a single CPU, so on a four-CPU machine you could see more than one process consuming 99% CPU.

The most common high-CPU-load situations you will observe are all of the CPU being consumed either by one or two processes or by a large number of processes. In the first case, to resolve the issue you could simply kill the process that is consuming the CPU (press k and then type in the PID of the process).

In the case of multiple processes, you might have one system doing too many things. For instance, you might have a large number of Apache processes running on a web server along with some log parsing scripts that run from cron. All of these processes can be consuming more or less the same amount of CPU. The resolution to this problem can be trickier in the long term. On the web server you need all of those Apache processes to run, yet you might need the log parsing programs as well. In the short term, you can kill (or possibly postpone) some processes until the load comes down, but in the long term, you might need to consider increasing the resources on your machine or splitting some of the functions across more servers.
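When many small processes share the CPU, no single row in top stands out, so it can help to total CPU usage per command name. Here is a minimal sketch; the function name cpu_by_command and the sample figures are hypothetical.

```shell
# Sum %CPU per command name, highest total first.
# Expects lines of "<%CPU> <command>" on stdin.
cpu_by_command() {
    awk '{ pct = $1; $1 = ""; sub(/^ +/, ""); total[$0] += pct }
         END { for (c in total) printf "%.1f %s\n", total[c], c }' |
        sort -rn
}

# Hypothetical sample: several Apache workers plus two cron-driven log parsers.
cpu_by_command <<'EOF'
9.5 apache2
10.2 apache2
8.8 apache2
12.0 logparse.pl
9.9 apache2
11.5 logparse.pl
4.0 mysqld
EOF
# prints:
# 38.4 apache2
# 23.5 logparse.pl
# 4.0 mysqld

# On a live system (procps ps):  ps -eo pcpu=,comm= | cpu_by_command
# Note that ps reports CPU use averaged over each process lifetime,
# so the numbers will not match the live figures in top.
```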

Diagnose Out-of-Memory Issues

The next two lines in the top command output show valuable information about RAM usage. Before diagnosing specific system problems, it’s important to be able to rule out memory issues.

Mem: 1024176k total, 997407k used, 26768k free, 85520k buffers
Swap: 1004052k total, 4361k used, 999692k free, 286040k cached

The first line shows how much physical RAM is available, used, free, and buffered. The second line gives similar information about swap usage, along with how much RAM is used by the Linux file cache. At first it might look as if the machine is almost out of RAM, since it reports that only 26,768k is free. Many troubleshooters are misled by the used and free RAM figures because of the Linux file cache. Once Linux loads a file into RAM, it doesn’t necessarily remove it from RAM when a program is done with it. If there is RAM available, Linux will cache the file in RAM so that if a program accesses the file again, it can do so much more quickly. If the system does need RAM for active processes, it won’t cache as many files.
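To see how much RAM is really available, you can add the buffers and file cache back to the free figure, or subtract them from the used figure. The sketch below does this arithmetic with the numbers from the sample Mem and Swap lines; the function names are made up for this example, and the calculation assumes the older top layout shown here, where "used" includes buffers and cache.

```shell
# RAM that is free or can be reclaimed from buffers and file cache (all in KB).
# $1 = free, $2 = buffers, $3 = cached
reclaimable_kb() {
    echo $(( $1 + $2 + $3 ))
}

# RAM actually held by processes (all in KB).
# $1 = used, $2 = buffers, $3 = cached
process_used_kb() {
    echo $(( $1 - $2 - $3 ))
}

reclaimable_kb 26768 85520 286040      # prints 398328
process_used_kb 997407 85520 286040    # prints 625847
```

So although only 26,768k is reported as free, roughly 398,000k could be made available to processes if they needed it. On newer kernels, the MemAvailable line in /proc/meminfo gives the kernel's own estimate of this figure.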

Diagnose High I/O Wait

When you see high I/O wait, one of the first things to check is whether the machine is using a lot of swap. Since a hard drive is much slower than RAM, when a system runs out of RAM and starts using swap, the performance of the server suffers. Anything that wants to access the disk has to compete with swap for disk I/O. So first diagnose whether you are out of memory and, if so, manage the problem there. If you have plenty of RAM available, you will need to figure out which program is consuming the most I/O. Sometimes it is difficult to figure out exactly which process is responsible, but if you have multiple partitions on your system, you can narrow it down by figuring out which partition is getting most of the I/O. To do this, you will need the iostat program, which is provided by the sysstat package on both Red Hat and Debian-based systems. If it isn’t installed, you can install it with your package manager.

You can run the iostat command without any arguments to get an overall picture of your system:

$ sudo iostat
Linux 2.6.24-19-server (hostname) 01/31/2009
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.73    0.07    2.03    0.53    0.00   91.64

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               9.82       417.96        27.53   30227262    1990625
sda1              6.55       219.10         7.12   15845129     515216
sda2              0.04         0.74         3.31      53506     239328
sda3              3.24       198.12        17.09   14328323    1236081

The first section of output gives CPU information similar to what you would see in top. Below it, iostat lists statistics for each device. Here is what each of the columns represents:

  • tps – This column shows the number of transfers (I/O requests) sent to the device per second.
  • Blk_read/s – This column shows the number of blocks read from the device per second.
  • Blk_wrtn/s – This column shows the number of blocks written to the device per second.
  • Blk_read – This column shows the total number of blocks read from the device.
  • Blk_wrtn – This column shows the total number of blocks written to the device.

When your system is under heavy I/O load, the first step is to look at each of the partitions and identify which one is getting the heaviest I/O load. Say, for instance, that you have a database server and the database itself is stored on /dev/sda3. If you see that the bulk of the I/O is coming from there, you have a good clue that the database is likely consuming the I/O.
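Picking out the busiest partition by eye gets tedious on a system with many devices. The following sketch scans iostat device rows and prints the partition with the highest combined read and write rate. The function name busiest_partition is hypothetical, and the code assumes sdX-style partition names and the column order shown in the sample output (device, tps, reads per second, writes per second).

```shell
# Print the partition with the highest combined read + write rate.
# Reads iostat device rows on stdin; skips whole disks such as "sda".
busiest_partition() {
    awk '$1 ~ /^sd[a-z]+[0-9]+$/ {
             rate = $3 + $4                  # Blk_read/s + Blk_wrtn/s
             if (rate > max) { max = rate; name = $1 }
         }
         END { if (name != "") printf "%s %.2f\n", name, max }'
}

# Device rows from the iostat output above.
busiest_partition <<'EOF'
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               9.82       417.96        27.53   30227262    1990625
sda1              6.55       219.10         7.12   15845129     515216
sda2              0.04         0.74         3.31      53506     239328
sda3              3.24       198.12        17.09   14328323    1236081
EOF
# prints: sda1 226.22

# On a live system, recent sysstat versions list partitions only with -p:
#   iostat -d -p sda | busiest_partition
```

Keep in mind that a single iostat report shows averages since boot, so a partition that is busy right now may not stand out in it.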

In addition to iostat, there is a much simpler tool available in newer distributions called iotop. In effect, it is a blend of top and iostat: it shows all of the running processes on the system sorted by their I/O statistics. Run iotop as root and you will see output like the following:

$ sudo iotop
Total DISK READ: 189.51 K/s | Total DISK WRITE: 0.00 B/s
Actual DISK READ: 189.51 K/s | Actual DISK WRITE: 0.00 B/s
  TID  PRIO  USER  DISK READ   DISK WRITE  SWAPIN  IO>     COMMAND
 8069  be/4  root  189.52 K/s  0.00 B/s    0.00 %  0.00 %  rsync --server --se
 4245  be/4  ravi  0.00 B/s    3.79 K/s    0.00 %  0.00 %  cli /usr/lib/gnome-
 4246  be/4  ravi  0.00 B/s    3.79 K/s    0.00 %  0.00 %  cli /usr/lib/gnome-
   10  be/4  root  0.00 B/s    0.00 B/s    0.00 %  0.00 %  init

In this example, you can clearly see that an rsync process is tying up your read I/O.

Ravindra Kumar

Hi, this is Ravindra. I am the founder of TheCodeCloud. I am an AWS Certified Solutions Architect Associate and am certified in Oracle Cloud as well. I am a DevOps and data science enthusiast.
