LinuxDevCenter.com

Linux System Failure Post-Mortem

by Jennifer Vesperman
11/01/2001

Your Linux machine has just died, and your high up-time is wrecked. How do you tell what happened, and more importantly, how do you prevent a recurrence?

This article doesn't discuss user space programs -- few of them will crash the box without a chance of recovery; the only one I know of which reliably does that is crashme. Most crashes are caused by kernel "oopses," or hardware failures.

What is a kernel oops, anyway?

A kernel oops occurs when the kernel code gets into an unrecoverable state. In most cases, the kernel can write its state to the drive, which allows you to determine what happened if you have the correct tools. In a few cases, such as the "Aiee, killing interrupt handler" crash, the kernel is unable to write to the drive: with no interrupt handler, interrupt-driven I/O is impossible.

Even in the worst cases, some data can be retrieved and the cause can often be determined.

Tools and postmortem info

Diagnostic tools are a necessary part of kernel error recovery. The most obvious tool is the system log. dmesg prints the kernel's message buffer, which makes it a convenient way to extract the relevant kernel messages.

There is also a specialist tool for tracing kernel oopses. ksymoops requires the system to be configured the same way it was when it crashed, and should be used as soon as possible after the crash. It traces the function chain, and displays the function and offset which the kernel was in when it crashed.

With the information from the system log, or (for more precision) from ksymoops, a system administrator can determine which function the kernel was trying to run when it crashed. It is then much easier to determine whether to change a hardware driver, swap in a different loadable module -- or post an error report to the relevant kernel developer.
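As a rough sketch of the first step, the shell function below pulls an oops report out of a saved log so that it can be handed to ksymoops or posted to a developer. The function name extract_oops, the start and end patterns, and the file names in the comments are assumptions made for illustration; oops formats vary between kernel versions, so adjust the patterns to match what your kernel prints.

```shell
# extract_oops: print the oops report found in a log file (or on stdin).
# Assumption: the report starts at a line containing "Unable to handle kernel"
# or "Oops:" and ends at the line containing "Code:".
extract_oops() {
    awk '
        /Unable to handle kernel|Oops:/ { in_oops = 1 }  # start of a report
        in_oops { print }                                # copy lines while inside it
        in_oops && /Code:/ { in_oops = 0 }               # "Code:" is the last line
    ' "$@"
}

# Hypothetical usage (the paths are examples, not fixed locations):
#   extract_oops /var/log/messages > oops.txt
#   ksymoops -m /boot/System.map oops.txt
# Check the ksymoops manual page for the options your version accepts.
```

Saving the report to its own file keeps the unrelated log traffic out of whatever you send on.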

dmesg

If your system is still running, you can run dmesg to grab the kernel diagnostic information that is normally on the console. The messages are also written to /proc/kmsg, but dmesg allows you to copy them to a file for later perusal or for posting to a kernel expert. dmesg can also be run by most users, whereas /proc/kmsg has limited permissions.

dmesg > filename
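If you collect these dumps regularly, a timestamped file name keeps one capture from overwriting another. The following is a minimal sketch, not a standard tool: the function name save_dmesg, the file naming scheme, and the DMESG override variable are all invented for this example.

```shell
# save_dmesg: write the kernel message buffer to a timestamped file.
# $1 is the output directory (defaults to the current directory).
# DMESG is a hypothetical override, useful for trying the function out
# on a machine where running dmesg is restricted.
save_dmesg() {
    out_dir=${1:-.}
    out_file="$out_dir/dmesg-$(date +%Y%m%d-%H%M%S).txt"
    ${DMESG:-dmesg} > "$out_file" || return 1
    # Print the name of the file that was written.
    printf '%s\n' "$out_file"
}
```

Calling save_dmesg /tmp would write something like /tmp/dmesg-20011101-093000.txt and print that path.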

Useful arguments:

-n level
set the level of messages that appear on the console. 1 is kernel panic messages only. 7 is everything, including developer debug messages.
-s bufsize
use a buffer of bufsize bytes when reading the kernel message buffer.
-c
print, then empty the message buffer.

syslogd & klogd

syslogd and klogd are the system loggers. klogd handles kernel logging, but is often bundled with syslogd and configured with it. The loggers themselves are useless for after-the-fact debugging, but they can be configured to log more data for the next crash.

Use /etc/syslog.conf to find out where the system log files are kept and, if /proc/kmsg doesn't exist, where the kernel log files are.
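One rough way to answer that question is to list the rules whose selector covers the kern facility. The sketch below assumes the sysklogd configuration file is at /etc/syslog.conf, its usual location, and that each rule is a selector followed by whitespace and an action. The function name kernel_log_targets is made up, and the matching is deliberately simple: it handles "kern." and "*." selectors and skips rules containing kern.none, nothing more.

```shell
# kernel_log_targets: print "selector action" for rules that cover kernel messages.
# $1 is the configuration file (assumed default: /etc/syslog.conf).
kernel_log_targets() {
    conf=${1:-/etc/syslog.conf}
    awk '
        /^[ \t]*#/ || NF < 2 { next }              # skip comments and blank lines
        $1 ~ /kern\.none/ { next }                 # rule explicitly excludes kern
        $1 ~ /(^|;)(kern|\*)\./ { print $1, $2 }   # kern.* or wildcard facility
    ' "$conf"
}
```

On a system that logs kernel messages separately, the output might include a line such as "kern.* /var/log/kern.log"; the exact rules depend on the distribution.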

If you are running loadable modules, klogd must be signaled when the modules are loaded or unloaded. The sysklogd source includes a patch for the modules-2.0.0 package to ensure that the module loaders and unloaders signal klogd correctly.

From modutils 2.3.1 onward, module logging is built in. To use it, create the directory /var/log/ksymoops, owned by root and set to mode 644 or 600. The script insmod_ksymoops_clean deletes old entries, and should be run as a cron job.
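The setup can be sketched as a small shell function. This assumes /var/log/ksymoops is a directory, as the modutils documentation describes; the function name setup_ksymoops_dir, the cron schedule, and the /sbin path in the comment are assumptions for illustration, so check where your distribution installs insmod_ksymoops_clean.

```shell
# setup_ksymoops_dir: create the module logging directory with the
# ownership and mode described above.
# $1 is the directory (defaults to /var/log/ksymoops).
setup_ksymoops_dir() {
    dir=${1:-/var/log/ksymoops}
    mkdir -p "$dir" || return 1
    # Only root can hand ownership to root; skip the step otherwise.
    if [ "$(id -u)" -eq 0 ]; then
        chown root "$dir" || return 1
    fi
    chmod 644 "$dir"
}

# Hypothetical crontab entry for root, cleaning up once a day at 04:00:
#   0 4 * * * /sbin/insmod_ksymoops_clean
```

Run it as root with no argument to set up the real directory, or pass a scratch path to try it out safely.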
