Linux: why is load high?
Clearly some process is saturating the CPU, but it doesn't seem to show up in the process list. Is it some sort of hidden Linux filesystem checker? Any ideas on how to find the process in question?
Load average doesn't mean what you think it means. It's not about instantaneous CPU usage, but rather about how many processes are waiting to run. Usually that's because lots of things want CPU time, but not always: a common culprit is a process waiting for I/O, either disk or network. The ps manpage documents the process state codes, so you can find more detail there; processes in the R (running or runnable) and D (uninterruptible sleep, usually I/O) states are probably of particular interest.
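For example, a quick way to surface those states (a minimal sketch; the column names assume a procps-ng ps):

```
# List processes in runnable (R) or uninterruptible sleep (D) states.
# D state usually means the process is blocked on disk or network I/O.
ps -eo state,pid,wchan:32,comm | awk 'NR == 1 || $1 ~ /^[RD]/'
```

A persistently non-empty set of D-state processes combined with low CPU utilization points at I/O, rather than CPU, as the source of the load.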
How to troubleshoot high load when there are no obvious processes [duplicate]

Consider a system where one thread (tar), plus a little more (some time in kernel worker threads), is doing work, and the tar thread spends most of its time blocked on disk reads. Linux reports the load average as just over 1.
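You can reproduce that kind of experiment yourself. A minimal sketch, assuming GNU tar, a tree large enough that reads actually hit the disk, and mpstat from the sysstat package:

```
# Single-threaded, disk-bound work: stream a large tree to nowhere.
# (Writing to stdout rather than naming /dev/null as the archive,
# since GNU tar skips reading file data when the archive is /dev/null.)
tar cf - /usr 2>/dev/null >/dev/null &

# Watch the load average climb toward ~1 ...
watch -n 5 'cat /proc/loadavg'

# ... while CPU utilization stays low: the demand is I/O, not CPU.
mpstat 1
```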
If Linux were measuring "CPU load averages", the experiment would have reported a value near 0, since the CPU was mostly idle. So which load averages are we talking about: CPU load averages, or system load averages? Clarifying it this way lets me make sense of them: what Linux reports are system load averages.
Perhaps one day we'll add additional load averages to Linux, and let the user choose what they want to use: separate "CPU load averages", "disk load averages", "network load averages", etc. Or just use different metrics altogether. So what counts as a good or bad value? Some people have found values that seem to work for their systems and workloads: they know that when load goes over X, application latency is high and customers start complaining. But there aren't really rules for this.
It's somewhat ambiguous, as it's a long-term average (at least one minute) which can hide variation. And one system's acceptable ratio of load to CPU count may be another system's problem: I once administered a two-CPU email server that during the day ran with a CPU load average of between 11 and 16 (a ratio of between 5.5 and 8).
Latency was acceptable and no one complained. As for Linux's system load averages: these are even more ambiguous, as they cover different resource types, so you can't simply divide by the CPU count.
It's more useful for relative comparisons: if you know the system runs fine at a load of 20, and it's now at 40, then it's time to dig in with other metrics to see what's going on. When Linux load averages increase, you know you have higher demand for resources (CPUs, disks, and some locks), but you aren't sure which. You can use other metrics for clarification. For example, for CPUs:

- per-CPU utilization: eg, using mpstat -P ALL 1
- per-process CPU utilization: eg, top, pidstat 1
- per-thread run queue (scheduler) latency: eg, in /proc/PID/schedstat, delay accounting, perf sched
- CPU run queue latency: eg, in /proc/schedstat, perf sched
- CPU run queue length: eg, using vmstat 1 and the 'r' column

The first two are utilization metrics, the last three are saturation metrics.
Utilization metrics are useful for workload characterization, and saturation metrics are useful for identifying performance problems. The latency measurements allow you to calculate the magnitude of a performance problem, eg, the percent of time a thread spent waiting in scheduler latency.
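For example, a quick command-line look at both kinds of metric (a sketch; mpstat comes from the sysstat package):

```
# Utilization: per-CPU busy time at one-second intervals.
mpstat -P ALL 1

# Saturation: vmstat's 'r' column counts runnable tasks
# (running plus waiting); compare it against the CPU count.
vmstat 1
nproc
```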
Measuring the run queue length instead can suggest that there is a problem, but it's more difficult to estimate the magnitude. The schedstats facility was made a kernel tunable (sysctl kernel.sched_schedstats) in Linux 4.6, and is off by default. Delay accounting exposes the same scheduler latency metric, which is in cpustat, and I've suggested adding it to htop too, as that would make it easier for people to use.
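As a sketch of that magnitude calculation, assuming scheduler statistics are enabled (sysctl kernel.sched_schedstats=1 on kernels where it is a tunable) and using a hypothetical PID of 1234:

```
# /proc/PID/schedstat fields: time on CPU (ns), time waiting on the
# run queue (ns), and the number of timeslices run.
awk '{ printf "on-CPU %d ns, run-queue wait %d ns (%.1f%% waiting)\n",
       $1, $2, 100 * $2 / ($1 + $2) }' /proc/1234/schedstat
```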
Apart from CPU metrics, you can also look at utilization and saturation metrics for disk devices. I focus on such metrics in the USE method, and have a Linux checklist of these. While there are more explicit metrics available, that doesn't mean that load averages are useless.
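For disks, iostat (also from sysstat) covers both in one view; note that column names vary slightly between sysstat versions (aqu-sz was avgqu-sz in older releases):

```
# Extended disk statistics at one-second intervals:
#   %util  - portion of time the device was busy (utilization)
#   aqu-sz - average request queue size (a saturation indicator)
#   await  - average I/O latency in ms, including queue time
iostat -xz 1
```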
They are used successfully in scale-up policies for cloud computing microservices, along with other metrics. With these policies it's safer to err on the side of scaling up (costing money) than not scaling up (costing customers), so including more signals is desirable. If we scale up too much, we'll debug why the next day. The one thing I keep using load averages for is their historical information.
If I'm asked to check out a poorly performing instance in the cloud, and I log in and find that the one-minute average is much lower than the fifteen-minute average, it's a big clue that I might be too late to see the performance issue live.
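That comparison is quick to script from /proc; a minimal sketch:

```
# /proc/loadavg fields: 1-, 5-, and 15-minute averages, then
# runnable/total scheduling entities and the most recent PID.
awk '{ print "1min=" $1, "5min=" $2, "15min=" $3
       if ($1 + 0 < $3 + 0)
         print "load is falling: the event may already be over" }' /proc/loadavg
```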
But I only spend a few seconds contemplating load averages before turning to other metrics. In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from "CPU load averages" to what one might call "system load averages."
These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet lets you see whether load is increasing or decreasing, and its greatest value may be for relative comparisons against itself. The use of the uninterruptible state has since grown in the Linux kernel, and nowadays includes uninterruptible lock primitives.
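The equation itself is short. A sketch of the update the kernel applies every five seconds, where $n(t)$ is the instantaneous count of runnable plus uninterruptible tasks and $m$ is the averaging window in minutes:

$$\mathrm{load}_m(t) = \mathrm{load}_m(t-1)\, e^{-5/(60m)} + n(t)\left(1 - e^{-5/(60m)}\right), \qquad m \in \{1, 5, 15\}$$

Because of the damping factor, a step change in demand takes several multiples of the nominal window to be fully reflected, which is what the experiment graphed below illustrates.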
If the load average is a measure of demand in terms of running and waiting threads (and not strictly threads wanting hardware resources), then they are still working the way we want them to. In this post, I dug up the Linux load average patch from 1993 (which was surprisingly difficult to find), containing the original explanation by its author. Such a visualization provides many examples of uninterruptible sleeps, and can be generated whenever needed to explain unusually high load averages.
I also proposed other metrics you can use to understand system load in more detail, instead of load averages.
Some rules of thumb for reading the three numbers:

- If the 1-minute average is higher than the 5- or 15-minute averages, then load is increasing.
- If the 1-minute average is lower than the 5- or 15-minute averages, then load is decreasing.
- If they are higher than your CPU count, then you might have a performance problem (it depends).
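A sketch of that last comparison (the threshold itself is only a rule of thumb):

```
# Compare the 1-minute load average with the number of online CPUs.
read one five fifteen rest < /proc/loadavg
awk -v l="$one" -v c="$(nproc)" 'BEGIN {
  printf "load/CPU ratio: %.2f\n", l / c
  if (l > c) print "more demand than CPUs: worth a closer look"
}'
```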
History

The original load averages show only CPU demand: the number of processes running plus those waiting to run. The load average is an average of the number of runnable processes over a given time period. For example, an hourly load average of 10 would mean that (for a single-CPU system) at any time during that hour one could expect to see 1 process running and 9 others ready to run (i.e., not blocked for I/O) waiting for the CPU.
The Three Numbers

These three numbers are the 1-, 5-, and 15-minute load averages. Here is that experiment, graphed:

[Figure: load average experiment to visualize exponential damping]

The so-called "one minute average" only reaches about 0.62 of the step value by the one-minute mark. Why, exactly, did Linux move beyond CPU load? Because load does not only involve the CPU: CPU is only one of the factors that can cause high load.
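As for the damping, a minimal sketch that replays the arithmetic for a constant demand of 1.0, sampled every five seconds as the kernel does:

```
awk 'BEGIN {
  e = exp(-5/60)              # one-minute damping factor per 5 s sample
  a = 0                       # start from an idle system
  for (t = 5; t <= 60; t += 5) {
    a = a * e + 1.0 * (1 - e) # constant instantaneous load of 1.0
    printf "t=%2ds  1min-avg=%.2f\n", t, a
  }
}'
# After 60 s the "one minute" average has only reached ~0.63 of the
# true load, close to the ~0.62 seen in the graphed experiment.
```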
Load average is the average system load calculated over a given period of time: 1, 5, and 15 minutes. Note that most, if not all, systems powered by Linux or other Unix-like kernels show the load average values somewhere for the user.
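For example, a quick sketch of a few familiar places they appear:

```
# The same three load averages, from three common tools:
uptime
top -bn1 | head -n 1
cat /proc/loadavg
```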