
Sudden high load average on X300

Posted: Mon Oct 06, 2008 1:33 pm
by kalle
Hi,

Running Debian Lenny on our X300, we have experienced a weird problem over the past few weeks that we could not solve with ideas found on the web. Since we have little knowledge about the actual source of the problem (it's our boss's notebook and the error, as usual, first pops up at home when he has nobody around to have a look at it), I hope that one of you might be able to help us.

Now to the problem: After several hours of being idle and/or several hours to days of uptime, the system suddenly runs into a high load average, but interestingly without any process causing high CPU usage. As a result, the system only makes progress as long as interrupts are generated (move the mouse and a process continues executing; stop moving and the kernel shows no tendency to finish that program in the background).

One typical symptom of this state is that mouse and keyboard start to stutter, i.e. it becomes very hard to interact with the system at all.

As far as we can judge, only a reboot solves this problem.

With kernel 2.6.25, the problem occurs after about 1-2 days of uptime and/or about 5 hours of being idle. With 2.6.26 it is even more frequent (about 10 hours of uptime and/or 2-3 hours of being idle).

What we have done so far:
- Checked dmesg/messages -> no hint
- Disabled ACPI -> might have circumvented the problem, at least there were no freezes for several days (but then no battery level, video playback, ...)
- Checked for D state processes -> none
- Updated ACPI stuff, kernel, xorg-related stuff, ... -> no change
- Changed the cpufreq policy to maximum -> no change
- Booted with the noapic flag -> no change
- Changed the BIOS powersave options -> no change
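
For anyone who wants to repeat the load average and D state checks from the list above in one go, here is a minimal Python sketch. It assumes the usual Linux /proc layout; the function names are made up for illustration and it only looks at the main thread of each process. On Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, which is why the two checks belong together.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into (1min, 5min, 15min) floats."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split at the last ')' instead of splitting on whitespace.
    """
    left, right = text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process whose main thread is in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    with open("/proc/loadavg") as fh:
        print("load average:", parse_loadavg(fh.read()))
    for pid, comm in blocked_tasks():
        print("D state:", pid, comm)
```

A high load average together with an empty D state list, as we saw, is the combination that puzzles us.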

At the time of the crashes, only a few programs are being run by the user, including Firefox (Iceweasel), most probably with the Flash plugin, Acroread and a few xterms. The window manager is fvwm.

At this point, we are quite stumped and do not know how to proceed. One particular problem in this case is that we seldom get our hands on the system while it is in the broken state, and if we do, there is usually not much time left for a longer investigation.

Has any of you ever experienced similar behaviour, or can you give us a hint pointing us in the right direction? Any help is greatly appreciated.

Sorry in advance if I do not answer your suggestions immediately; I am frequently away from the office, but I will definitely read and test your ideas :-)

Thanks a lot!!
Kalle

Posted: Mon Oct 06, 2008 1:52 pm
by tarvoke
for me, sometimes intensive disk access will result in weird slowdowns without actually spiking load or cpu usage. actually this happens a lot when my machines run e.g. updatedb (file indexing) or rkhunter (rootkit detection). bumping cpu from dynamic to max doesn't seem to help.

it is very strange, as if the ahci module or perhaps the i/o elevator gets hung up somehow. if I am copying many gb of files from one sata drive to another, sometimes I see this but other times not at all.

alternately, flash plugin and acroread are definite culprits. both of these can be extremely unpredictable and eat up resources, although normally that should still show up in top. also: closing firefox may not kill the flash engine, and closing acrobat may not close that app either; you could try seeing if there are processes hanging around that should not be there.
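
a rough python sketch of that "processes hanging around" check, in case it helps. it assumes the usual linux /proc layout; the name fragments in SUSPECTS are only guesses and the function names are made up, so adjust them to whatever the browser, plugin and reader processes are really called on your box.

```python
import os

# Hypothetical name fragments, not taken from the affected machine.
SUSPECTS = ("firefox", "iceweasel", "npviewer", "acroread")


def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into strings."""
    return [a.decode(errors="replace") for a in raw.split(b"\0") if a]


def is_suspect(argv, suspects=SUSPECTS):
    """True if any suspect fragment occurs in the program path (argv[0])."""
    if not argv:
        return False  # kernel threads have an empty command line
    prog = argv[0].lower()
    return any(s in prog for s in suspects)


def leftovers(proc_root="/proc", suspects=SUSPECTS):
    """List (pid, argv) of running processes that match a suspect name."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                argv = parse_cmdline(fh.read())
        except OSError:
            continue  # the process exited while we were looking
        if is_suspect(argv, suspects):
            found.append((int(entry), argv))
    return found


if __name__ == "__main__":
    for pid, argv in leftovers():
        print(pid, " ".join(argv))
```

run it after closing firefox and acroread; anything it still prints is a candidate for a kill.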

Posted: Mon Oct 06, 2008 4:55 pm
by kalle
Thanks for your reply!

Indeed, having to shut down firefox and kill the various plugins (which sometimes even keep occupying 'their' screen space) happens very often on my private PCs, when I need to send firefox a SIGKILL after flash hangs once again.

Unfortunately, this did not fix the issue on the X300: even if you make sure no such processes are running in the background, high load averages can still be observed. The real question is whether some other process, one you might not think of at first glance, is put into some undefined state by one of the web-related programs and consequently needs to be killed.

Once, we closed basically every user task not immediately related to the X session and started unloading potentially problematic kernel modules (usb, wifi, ...). This went well but did not change anything until we got to the cdrom module, which finally froze the kernel and thereby destroyed our test scenario...

Posted: Mon Oct 06, 2008 10:43 pm
by lightweight
Like tarvoke, I find that high load not explained by high-CPU processes is almost always disk I/O. Is the machine swapping? iostat will give you the I/O load either way, and I think you should look there. You'll need to grab sysstat from your repo if you don't have it already (it's in all official Debian repositories but not in Lenny's base).
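
If installing sysstat is a hassle, the swapping question can also be answered straight from /proc/vmstat. Here is a minimal Python sketch, assuming the kernel exposes the usual pswpin and pswpout counters (pages swapped in and out since boot); the function names are made up for illustration. It samples the counters twice and prints the difference, so nonzero numbers mean the machine was swapping during the interval.

```python
import time


def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two parse_vmstat() samples."""
    return {
        key: after.get(key, 0) - before.get(key, 0)
        for key in ("pswpin", "pswpout")
    }


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters."""
    with open(path) as fh:
        return parse_vmstat(fh.read())


if __name__ == "__main__":
    first = read_vmstat()
    time.sleep(5)  # sampling interval in seconds, chosen arbitrarily
    print(swap_delta(first, read_vmstat()))
```

This only covers swap traffic, not general disk I/O, so iostat is still worth a look.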

Posted: Tue Oct 07, 2008 2:39 am
by kalle
Thanks for the hint!

Hopefully I can soon access the system again while it's hanging and check that.

What I am wondering about, though, is that this almost always happens when there is no active user interaction with the machine, which imho makes it rather unlikely to coincide with heavy swapping. But... I fully agree that it might be somehow i/o related... I'll see whether I can read out something on the next hang :-)
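
Since the hang tends to show up when nobody is around, one idea would be to leave a small logger running that records the load average at regular intervals. The following Python sketch is only an assumption of how that could look: the log path /var/tmp/loadavg.log, the 60 second interval and the function names are all made up. It appends one timestamped line per sample and syncs it to disk, so the data should survive the reboot.

```python
import os
import time


def format_snapshot(timestamp, loadavg_text):
    """One log line: UTC timestamp followed by the raw /proc/loadavg fields."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(timestamp))
    return stamp + " " + " ".join(loadavg_text.split())


def log_snapshots(log_path, interval=60, count=None):
    """Append a snapshot every `interval` seconds; run forever if count is None."""
    taken = 0
    while count is None or taken < count:
        with open("/proc/loadavg") as src:
            line = format_snapshot(time.time(), src.read())
        with open(log_path, "a") as dst:
            dst.write(line + "\n")
            dst.flush()
            os.fsync(dst.fileno())  # force the line to disk before a possible hang
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical log location; pick any path that survives a reboot.
    log_snapshots("/var/tmp/loadavg.log")
```

If the logger itself stalls together with everything else, the gaps between timestamps might at least hint at when the machine stopped making progress.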