February 19, 2007

CLI Magic: Linux troubleshooting tools 101

Author: M. Shuaib Khan

When something goes wrong with your Linux-based system, you can try to diagnose it yourself with the many troubleshooting tools bundled with the operating system. Knowing about these tools, and how to effectively use them, can help you overcome many of the common problems on your system. Here's a list of some of the weapons in your arsenal against Linux problems.

Strace

When an application you successfully compiled fails during run time, it usually gives you an error. On a lucky day, the error message might contain details of what went wrong, and give you clues about what to do to fix the problem. But this is not what usually happens. Often, error messages are obscure and of little help in figuring out what went wrong.

Strace can come in handy in such situations. This utility traces the system calls a program uses during its run time. A system call is a Linux kernel function that provides secure access to a system's resources, such as memory, disk, and network.

Strace is easy to use -- just pass the name of the executable you want to run as an argument to the strace application. As an example, check out what output you get when you trace the following simple "Hello, world!" program:

#include <stdio.h>

int main(void)
{
    printf("Hello, world!\n");
    return 0;
}

$gcc -o hello hello.c
$strace ./hello

execve("./hello", ["./hello"], [/* 94 vars */]) = 0
brk(0)                                  = 0x804b000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7eff000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/i686/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/i686/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/i686", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/i686/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/i686/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/i686", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=186839, ...}) = 0
mmap2(NULL, 186839, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7ed1000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 3

.
.
.

write(1, "Hello, world!\n", 14Hello, world!
)         = 14
exit_group(0)                           = ?
Process 6006 detached

In the output above, you can see that running even this simple program takes a good number of system calls: open, read, write, close, and so on. Notice the many unsuccessful attempts to open the libc.so.6 library. That's because the run time linker looks in several places to find the library. The only successful attempt is in /lib, as shown by the line open("/lib/libc.so.6", O_RDONLY) = 3, where open returns the file descriptor 3 rather than -1, which indicates success. If we could make the linker look in /lib first, we could save all those failed lookups. And of course we can, by moving /lib to the beginning of the environment variable LD_LIBRARY_PATH, which the run time linker consults when searching for the libraries a program requires.

$export LD_LIBRARY_PATH=/lib:$LD_LIBRARY_PATH

The output of strace can be quite unwieldy when it's dumped to the console. It is common to write it to a file instead by using the command's -o option. Another common option is -p, followed by a process ID, which attaches strace to a program that is already running and shows the system calls it makes from that point on. This is useful for long-running daemons that you cannot restart easily, or that need to be inspected only occasionally.
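Once a trace is saved to a file, ordinary text tools can narrow it down. The following is a minimal sketch, not a feature of strace itself: the helper names failed_calls and failed_call_summary, the file names, and the PID 1234 are all made up for illustration. It relies only on the "= -1 ERRNO" form that strace uses for failed calls, as seen in the listing earlier.

```shell
#!/bin/sh
# Helper and file names here are illustrative, not part of strace.

# Print only the failed system calls from a log written by "strace -o".
# strace reports a failed call as "= -1 ERRNO (description)".
failed_calls() {
    grep -- '= -1 E' "$1"
}

# Count the failures per errno name (ENOENT, EACCES, ...).
failed_call_summary() {
    failed_calls "$1" | sed 's/.*= -1 \(E[A-Z0-9]*\).*/\1/' | sort | uniq -c
}

# Typical use, assuming strace is installed and ./hello exists:
#   strace -o hello.trace ./hello
#   failed_calls hello.trace
# Attaching to a running process instead (1234 is a placeholder PID):
#   strace -p 1234 -o daemon.trace
```

For the "Hello, world!" trace, failed_calls would print the long run of ENOENT lines from the library search and skip the successful open of /lib/libc.so.6.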

A nice example of how useful strace can get comes from a user who had installed multimedia codecs, including libdvdcss, which allowed him to play encrypted DVDs. But when he tried to use his movie player to play DVDs, he got strange errors. On tracing the movie player with strace, he figured out that the run time linker was looking in the wrong places for the installed codecs. After searching for the required library and putting it in a directory where the linker could find it, he was able to run the movie player to play his DVDs.

ltrace

ltrace is a sister application of strace. It works just like strace, but instead of tracing the system calls executed during the run time of a program, it traces the dynamic library calls. If we ltrace the previous "Hello, world!" program, here is what we get as the output:

$ltrace ./hello
__libc_start_main(0x80483b4, 1, 0xbfacb0d4, 0x80483f0, 0x80483e0
puts("\001"Hello, world!
)                                                                         = 14
+++ exited (status 0) +++

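The output shows that the executable "hello" calls only one library function, namely puts, to write the string "Hello, world!" to the console. The source calls printf, but gcc replaces a printf whose format string contains no conversions and ends in a newline with the simpler puts.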

ltrace isn't as commonly used as strace. It is preferred when a detailed trace of a program is required, especially when we are interested in the dynamic library functions the program uses, such as malloc(), gethostbyname(), and setenv().

lsof

The lsof tool lists all the files open on a Linux system. Remember that in true Unix spirit, almost everything is a file. You access your hardware through files located in /dev, information about CPU, memory, and other devices is located in files under /proc, and network connections, a.k.a. sockets, are also sometimes represented as files.

lsof becomes really handy when you want to know what files a process has currently opened, or which processes are currently acting on a certain file:

$lsof
COMMAND    PID       USER   FD      TYPE     DEVICE     SIZE       NODE NAME
init         1       root  cwd       DIR        8,1     4096          2 /
init         1       root  rtd       DIR        8,1     4096          2 /
init         1       root  txt       REG        8,1   533224    1658100 /sbin/init
init         1       root   10u     FIFO       0,14                2941 /dev/initctl
migration    2       root  cwd       DIR        8,1     4096          2 /
migration    2       root  rtd       DIR        8,1     4096          2 /

lsof lists the running command, its process ID, the user to whom the process belongs, the file descriptor of the opened file, the type of the file, the major and minor device numbers of the file, the size of the file, its inode number, and the name of the file opened or the mount point of the device being acted on.

To list files opened by processes belonging to a particular user, use:

$lsof -u user

To see a list of files opened by a particular process, use:

$lsof -p pid

Sometimes, you are unable to unmount a particular device because the system reports it as busy, even though you think it is not used by any process. To see what process is still using it, use:

$lsof /dev/mount-point

This will give you the list of processes using the device. Kill them, and you are ready to unmount the device.
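On Linux, lsof gets much of its information from /proc, so when lsof isn't installed you can get a rough equivalent of lsof -p by reading the fd directory of a process yourself. This is only a sketch under that assumption: the function name open_files_of is made up for this example, and it prints just the descriptor number and its target, not the other columns lsof shows.

```shell
#!/bin/sh
# List the files a process has open by reading /proc/PID/fd (Linux only).
# $1 is the process ID to inspect; open_files_of is an illustrative name.
open_files_of() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # Each entry is a symlink from the descriptor number to the open
        # file; skip the unexpanded pattern if the directory is unreadable.
        [ -e "$fd" ] || [ -L "$fd" ] || continue
        printf '%s\t%s\n' "${fd##*/}" "$(readlink "$fd")"
    done
}

# Example: show what the current shell has open.
#   open_files_of $$
```

You can only read the fd directory of your own processes unless you are root, which is the same restriction lsof runs into.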

top

Top lists the processes that rank highest on a system at any given moment. The ranking criterion can be CPU consumption, memory consumption, and so on.

$top
top - 18:21:33 up  1:40,  4 users,  load average: 0.30, 0.21, 0.27
Tasks: 155 total,   2 running, 148 sleeping,   0 stopped,   5 zombie
Cpu(s):  6.9%us,  2.7%sy,  0.0%ni, 80.5%id,  9.6%wa,  0.1%hi,  0.1%si,  0.0%st
Mem:    506908k total,   492384k used,    14524k free,    12900k buffers
Swap:  1052248k total,    39836k used,  1012412k free,   144944k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      15   0   744  124   80 S    0  0.0   0:01.37 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
    5 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1

Top can be useful when you want to know what process is consuming how much of a system's resources. In particular, if a certain process is consuming too much memory, you can locate it through top and take appropriate measures to bring it down, if it's not critical.
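Top normally takes over the terminal, which is awkward when you want to save a snapshot or send it to someone. A small sketch of one way around that uses top's batch mode; the function name top_snapshot and the log file name are made up for this example.

```shell
#!/bin/sh
# Print the first N lines (default 15) of a single top snapshot.
# -b runs top in batch mode (plain text, no screen control) and
# -n 1 makes it exit after one iteration.
top_snapshot() {
    top -b -n 1 | sed -n "1,${1:-15}p"
}

# Example: append a snapshot to a log file (top.log is an arbitrary name).
#   top_snapshot 20 >> top.log
```

Running this from cron every few minutes is a cheap way to see what was consuming resources shortly before a problem was reported.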

Traceroute

Traceroute is a network troubleshooting tool. For a network packet to reach a remote computer from your machine, it has to go through different routers on the network. Sometimes, even though both the local and the remote machines are functioning properly and connected to the network, they can't communicate with each other because of a problem somewhere in between the two machines. To trace where the packet is dropped on the network, use traceroute:

$traceroute google.com
Hop	(ms)	(ms)	(ms)		IP Address	Host name
1	0	0	0		66.98.244.1	gphou-66-98-244-1.ev1servers.net
2	0	1	0		66.98.241.16	gphou-66-98-241-16.ev1servers.net
.
.
.
13	29	28	28		72.14.232.57	-
14	34	35	36		64.233.175.42	-
15	28	28	29		64.233.167.99	py-in-f99.google.com

The output shows that the packet took 15 hops to reach google.com. It lists the IP address and name (if available) of every machine along the way; the last hop is the destination itself.

ping

Ping can help you figure out if a remote machine on the network is up and connected. Ping sends ICMP echo request messages to the remote machine and prints the details of each reply it receives. Sometimes system administrators block ICMP messages on their machines, which means that a ping won't get a reply from that particular machine even if it is present on the network, so be sure that the remote machine you're interested in does reply to ICMP messages before assuming that it is down.

$ping google.com
PING google.com (72.14.207.99) 56(84) bytes of data.
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=1 ttl=238 time=265 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=2 ttl=238 time=269 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=3 ttl=238 time=272 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=4 ttl=238 time=263 ms
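By default ping on Linux keeps going until you interrupt it; the -c option stops it after a given number of requests, which makes the output easy to pipe into other tools. As a rough sketch, the helper below (ping_average is a made-up name) averages the round-trip times, relying only on the "time=... ms" field shown in the listing above.

```shell
#!/bin/sh
# Average the round-trip times found in ping output read from stdin.
# Only lines containing a "time=<number> ms" field are counted.
ping_average() {
    awk -F'time=' '/time=/ { split($2, t, " "); sum += t[1]; n++ }
                   END { if (n) printf "%.2f\n", sum / n }'
}

# Typical use (-c 4 stops ping after four requests):
#   ping -c 4 google.com | ping_average
```

Fed the four replies shown above, it prints 267.25.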

hexdump

The hexdump utility is useful for seeing the contents of a binary file in a human-readable format, which can be ASCII, hexadecimal, octal, or decimal. For example, to see what the contents of the executable /bin/ls look like in hex and ASCII, use:

$hexdump -C /bin/ls
00000000  7f 45 4c 46 01 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 03 00 01 00 00 00  80 9c 04 08 34 00 00 00  |............4...|
00000020  0c 5c 01 00 00 00 00 00  34 00 20 00 0a 00 28 00  |.\......4. ...(.|
00000030  1f 00 1e 00 06 00 00 00  34 00 00 00 34 80 04 08  |........4...4...|
00000040  34 80 04 08 40 01 00 00  40 01 00 00 05 00 00 00  |4...@...@.......|
.
.
.

The first column is the byte offset into the file, the middle columns show the contents of the file in hex, and the text between the bars is the ASCII representation of the same bytes.

Hexdump is useful for searching text strings within an executable file for which source code might not be available. It can help you locate specific error messages and where they occur in a file.
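One simple way to do such a search is to pipe the dump through grep. This is a minimal sketch, with hex_grep as a made-up helper name. It has real limits: a string that straddles a 16-byte row boundary is split across two rows and won't match, and a search term made only of hex digits can also match the hex columns.

```shell
#!/bin/sh
# Show the hexdump rows that contain a given piece of text.
# $1 is the text to look for, $2 is the file to dump.
hex_grep() {
    hexdump -C "$2" | grep -F -- "$1"
}

# Example: confirm that /bin/ls starts with the ELF magic bytes.
#   hex_grep 'ELF' /bin/ls
```

The offset in the first column of each matching row tells you roughly where in the file the text lives.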

Conclusion

Troubleshooting Linux is an art, but these tools can help you master it. You can read more usage details about these tools on their respective man pages. Remember that knowing how to use a tool is not the same as knowing when to use it. As you encounter different problems and tackle them, you'll eventually learn the art of diagnosing trouble and fixing problems on your Linux system.