
Tuesday, May 08, 2007

Linux Troubleshooting

http://people.redhat.com/alikins/troubleshooting/

This is a guide to basic, and not so basic, troubleshooting and
debugging on Red Hat Linux systems. Goals include describing the
usage of common tools, how to find information, etc.
Basically, info that may be helpful to someone diagnosing
a problem. Emphasis will be on software issues, but
hardware may come up as well.



Environment settings
- Allowing Core Files

"core" files are dumps of a processes memory. When a program crashes
it can leave behind a core file that can help determine what
was the cause of the crash by loading the core file in a debugger.

By default, most Linux distributions turn off core file support by setting the
maximum allowed core file size to 0.

In order to allow a segfaulting application to leave a core, you need
to raise this limit. This is done via `ulimit`. To allow core
files to be of an unlimited size, issue:

ulimit -c unlimited

See the section on GDB for more information on what to do
with core files.
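
A minimal sketch of the workflow (the application name here is hypothetical):

ulimit -c              # show the current core size limit; 0 means no core files
ulimit -c unlimited    # allow cores of any size in this shell
./some-crashy-app      # if it segfaults, a "core" file is left in its working directory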

- LD_ASSUME_KERNEL

LD_ASSUME_KERNEL is an environment variable used by the dynamic
linker to decide which implementation of libraries is used. For
most cases, the most important lib is the C library, "libc" or
"glibc".

The reason "glibc" is important is because it contains the
thread implentation for a system.

The values you can set LD_ASSUME_KERNEL to correspond to Linux
kernel versions. Since glibc and the kernel are tightly bound,
glibc needs to change its behaviour based on
what kernel version is installed.

For properly written apps, there should be no reason to use
this setting. However, for some legacy apps that depend
on a particular thread implementation in glibc, LD_ASSUME_KERNEL
can be used to force the app to use an older implementation.

The primary targets: LD_ASSUME_KERNEL=2.4.20 selects the
NPTL thread library. LD_ASSUME_KERNEL=2.4.1 uses the
implementation in /lib/i686 (newer LinuxThreads).
LD_ASSUME_KERNEL=2.2.5 or older uses the implementation
in /lib (old LinuxThreads).

An app that requires the old thread implementation
can be launched as:

LD_ASSUME_KERNEL=2.2.5 ./some-old-app

See http://people.redhat.com/drepper/assumekernel.html for
more details.

- glibc environment variables

There's a wide variety of environment variables that
glibc uses to alter its behaviour, many of which are
useful for debugging or troubleshooting purposes.

A good reference on these variables is at:
http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html

Some interesting ones:

LANG and LANGUAGE

LANG sets which message catalog to use, while LANGUAGE
sets LANG and all the LC_* variables. These control
the locale-specific parts of glibc.

Lots of programs are written expecting to run in
one locale, and can break in other locales. Since
locale settings can change things like sort order
(LC_COLLATE) and time formats (LC_TIME), shell scripts
are particularly prone to problems from this.

A script that assumes the sort order of something is
a good example.

A common way to test this is to try running the
troublesome app with the locale set to "C", or the
default locale.

LANGUAGE=C ls -al

If the app starts behaving when run that way, there
is probably something in the code that is assuming the
"C" locale (sorted lists and time formats are strong
candidates).
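
A quick way to see locale-dependent sort order for yourself (a small
illustration; exact results depend on which locales are installed):

printf 'a\nB\n' | LC_ALL=C sort        # ASCII order: "B" sorts before "a"
printf 'a\nB\n' | LC_ALL=en_US sort    # typical locale collation: "a" before "B"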


- glibc malloc stuff
- all the glibc env variable stuff


Tools
Efficiently debugging and troubleshooting is often a matter
of knowing the right tools for the job, and how to use
them.

- strace
- simple usage
- filtering output
- examples
- use as profiling
- see what files are open
- network connections

Strace is one of the most powerful tools available
for troubleshooting. It allows you to see what an application
is doing, to some degree.

strace displays all the system calls that an application
is making, what arguments it passes to them, and what the
return code is. A system call is generally something that
requires the kernel to do something. This generally means
I/O of all sorts, process management, shared memory and
IPC usage, memory allocation, and network usage.

- examples

The simplest example of using strace is as follows:
strace ls -al

This starts the strace process, which then starts `ls -al`
and shows every system call. For `ls -al` this is mostly
I/O related calls. You can see it stat'ing files, opening
config files, opening the libs it is linked against, allocating
memory, and write()'ing out the contents to the screen.
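
A few representative lines of that output look roughly like this (paths
and values are illustrative and will vary from system to system):

open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib/libc.so.6", O_RDONLY)        = 3
stat64(".", {st_mode=S_IFDIR|0755, ...}) = 0
write(1, "total 120\n", 10)             = 10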


- what files is this thing trying to open

A common troubleshooting technique is to see
what files an app is reading. You might want to make
sure it's reading the proper config file, or looking
at the correct cache, etc. strace by default shows
all file I/O operations.

But to make it a bit easier, you can filter
strace output. To see just file open()s:

strace -eopen ls -al

- what's this thing doing to the network

To see all network related system calls (name
resolution, opening sockets, writing/reading to sockets, etc)

strace -e trace=network curl --head http://www.redhat.com

- rudimentary profiling

One thing that strace can be used for that is useful for
debugging performance problems is some simple profiling.

strace -c ls -la

Invoking strace with '-c' will cause a cumulative report of
system call usage to be printed. This includes the approximate
amount of time spent in each call, and how many times each
system call is made.

This can sometimes help pinpoint performance issues, especially
if an app is doing something like repeatedly opening/closing
the same files.

strace -tt ls -al

the -tt option causes strace to prefix each call with the time
of day it completed, with microsecond precision.

strace -r ls -al

the -r option causes strace to print out the time since the
last system call. This can be used to spot where a process
is spending large amounts of time in user space, or especially
slow syscalls.
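
The summary from `strace -c` looks roughly like this (the numbers here
are purely illustrative):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------
 42.00    0.000420          14        30         2 open
 25.00    0.000250           5        50           read
 12.00    0.000120           4        30           write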

- following forks and attaching to running processes

Often it is difficult or impossible to run a command under
strace (an apache httpd for instance). In this case, it's
possible to attach to an already running process.

strace -p 12345

where 12345 is the PID of a process. This is very handy
for trying to determine why a process has stalled. Many
times a process might be blocking while waiting for I/O.
With strace -p, this is easy to detect.

Lots of processes start other processes. It is often desirable
to see a strace of all the processes.

strace -f /etc/init.d/httpd start

will strace not just the bash process that runs the script, but
any helper utilities executed by the script, and httpd itself.

Since strace output is often a handy way to help a developer
solve a problem, it's useful to be able to write it to a file.
The easiest way to do this is with the -o option.

strace -o /tmp/strace.out program
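
Those options combine nicely. For example, a sketch of attaching to an
already running httpd (the PID is hypothetical) and following its
children, writing everything to a file:

strace -f -p 12345 -o /tmp/httpd.strace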

Being somewhat familiar with the common syscalls for Linux
is helpful in understanding strace output. But most of the common
ones are simple enough to figure out from context.

A line in strace output is essentially the system call name,
the arguments to the call in parens (sometimes truncated...), and then
the return status. A return status for error is typically -1, but varies
sometimes. For more information about the return status of a typical
system call, see `man 2 syscallname`. Usually the return status will
be documented in the "RETURN VALUE" section.

Another thing to note is that strace often shows "errno"
status. If you're not familiar with unix systems programming, errno is a global
variable that gets set to specific values when certain calls fail. The
value depends on the kind of error that occurred.
More info on this can be found in `man errno`. Typically, strace will
show the brief description for any errno values it gets. ie:

open("/foo/bar", O_RDONLY) = -1 ENOENT (No such file or directory)


strace -s X

the -s option tells strace to show the first X characters of strings.
The default is 32 characters, which sometimes isn't enough. This will
increase the info available to the user.


- ltrace
- simple usage
- filtering output

ltrace is very similar to strace, except ltrace focuses on
tracing library calls.

For apps that use a lot of libs, this can be a very powerful
debugging tool. However, because most modern apps use libs
very heavily, the output from ltrace can sometimes be
painfully verbose.

There is a distinction between what makes a "system call"
and a call to a library function. Sometimes the line between the
two is blurry, but the basic difference is that system calls are
essentially communicating with the kernel, and library calls are just
running more userland code. System calls are usually required for
things like I/O, process control, memory management,
and other "kernel" things.

Library calls are, in bulk, generally calls to the standard
C library (glibc...), but can of course be calls to any library,
say Gtk, libjpeg, libnss, etc. Luckily most glibc functions
are well documented and have either man or info pages. Documentation
for other libraries varies greatly.

ltrace supports the -r, -tt, -p, and -c options the same
as strace. In addition it supports the -S option which
tells it to print out system calls as well as library
calls.

One of the more useful options is "-n 2", which will
indent 2 spaces for each nested call. This can make the
output much easier to read.

Another useful option is the "-l" option, which
allows you to specify a specific library to trace, potentially
cutting down on the rather verbose output.
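
Putting a few of those together, a hedged example run might look like:

ltrace -S -n 2 -o /tmp/ltrace.out ls -al    # library calls plus syscalls, indented, written to a file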



- gdb

`gdb` is the GNU debugger. A debugger is typically used by developers
to debug applications in development. It allows for a very detailed
examination of exactly what a program is doing.

That said, gdb isn't as useful as strace/ltrace for troubleshooting/sysadmin
types of issues, but occasionally it comes in handy.

For troubleshooting, it's useful for determining what caused a
core file (`file core` will also typically show you this information).
But gdb can also show you "where" the app crashed. Once you determine
the name of the app that caused the failure, you can start gdb with:

gdb filename core

then at the prompt type
`where`

The unfortunate thing is that all the binaries we ship are
stripped of debugging symbols to make them smaller, so this often returns
less than useful information. However, starting in Red Hat Enterprise Linux
3, and included in Fedora, there are "debuginfo" packages. These
packages include all the debugging symbols. You can install them the
same as any other rpm, so `rpm`, `up2date`, and `yum` all work.

The only difficult part about debuginfo rpms is figuring out
which ones you need. Generally, you want the debuginfo package
for the src rpm of the package that's crashing.

rpm -qif /path/to/app

will tell you the info for the binary package the app is part of.
Part of that info includes the src.rpm. Just use the package name
of the src rpm plus "-debuginfo".
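
For example, a sketch of that lookup (the binary path and version
numbers are hypothetical):

rpm -qif /usr/bin/someapp | grep "Source RPM"
# Source RPM: someapp-1.2-3.src.rpm  ->  install someapp-debuginfo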


- python debugging

- perl debugging
- `splain`
- `perl -V`
- perldoc -q
- perldoc -l

- sh debugging

- bugbuddy etc

- top
`top` is a simple text based system monitoring tool. It packs
a lot of information onto the screen, which can be helpful for
troubleshooting problems, particularly performance related
problems.

The top of the `top` output includes a basic summary of the system.
The top line is the current time, uptime since the last reboot, users
logged in, and load average. The load average values here are the
load for the last 1, 5, and 15 minutes. A load of 1.0 is considered
100% cpu utilization, so loads over one typically mean stuff
is having to wait. There is a lot of leeway and approximation in
these load values however.

The memory line shows the total physical ram available
on the system, how much of it is used, how much is free, and how
much is shared, along with the amount of ram in buffers. These
buffers are typically file system caching, but can be other things.
On a system with a significant uptime, expect the buffer value to
take up all free physical ram not in use by a process.
The swap line is similar.

Each process entry contains several
fields by default. The most interesting are RSS, %CPU, and
TIME. RSS shows the amount of physical ram the process is consuming.
%CPU shows the percentage of the available processor time a process
is taking, and TIME shows the total amount of processor time the process
has had. A processor intensive program can easily rack up more TIME
in just a few seconds than a long running low cpu process.

Sorting the output:

M : sorts the output by memory usage. Pretty handy for figuring
out which version of openoffice.org to kill.

P : sorts the processes by the % of cpu time they are using.

T : sorts by cumulative time

A : sorts by age of the process, newest process first
Command line options:
The only really useful command line options are:

b [batch mode] writes the standard top output to
stdout. Useful for a quick "system monitoring hack".

ie, top d 360 b >> foo.output
to get a snapshot of the system appended to foo.output every
six minutes.

- ps
`ps` can be thought of as a one-shot top. But it's a bit
more flexible in its output than top.

As far as `ps` commandline options go, it can get pretty
hairy. The linux version of `ps` inherits ideas from both
the BSD version, and the SYSV version. So be warned.

The `ps` man page does a pretty good job of explaining
this, so look there for more examples.

some examples:

ps aux

shows all the processes on the system in a "user" oriented
format. In this case, meaning the username of the owner
of the process is shown in the first column.

ps auxww

the "w" option, when used twice, allows the output to be
of unlimited width. For apps started with lots of commandline
options, this will allow you to see all the options.

ps auxf

the 'f" option, for "forest" tries to present the list
of processes in a tree format. This is a quick and easy
way to see which process are child processes of what.

ps -eo pid,%cpu,vsz,args,wchan

This is an interesting example of the -e and -o options, which
allow you to customize the output of `ps`. In this
case, the interesting bit is the wchan field, which
attempts to show which kernel function (usually a syscall)
the process is sleeping in at the moment `ps` checks.

For things like apache httpds, this can be useful
to get an idea what all the processes are doing
at one time. See the info in the strace section
on understanding system call output for more info.

- sysstat/sar

The sysstat package works in two parts: a daemon process that
collects information, and a "monitoring" tool.

The service is called "sysstat", and the monitoring
tool is called `sar`.
To start it, start the sysstat service:

/etc/init.d/sysstat start

To see a list of `sar` options, just try `sar --help`

Things to note: there are lots of commandline options.
The trailing number is the interval, i.e. the time in seconds
between updates.
sar 3
will run the default sar report every three seconds.

For a complete summary, try:

sar -A

This generates a big pile of info ;->

To get a good idea of disk i/o activity:

sar -b 3

For something like a heavily used web server, you
may want to get a good idea how many processes
are being created per second:

sar -c 2

Kind of surprising to see how many processes can
be created.

There's also some degree of hardware monitoring built in.
Monitoring how many times an IRQ is triggered can also
provide good hints at what's causing system performance problems.

sar -I SUM 3

Will show the total number of system interrupts

sar -I 14 2

Watch the standard IDE controller IRQ every two seconds.

Network monitoring is in here too:

sar -n DEV 2

Will show # of packets sent/received, # of bytes transferred, etc.

sar -n EDEV 2

Will show stats on network errors.

Memory usage can be monitored with something like:

sar -r 2

This is similar to the output from `free`, except more easily
parsed.

For SMP machines, you can monitor per CPU stats with:

sar -U 0
where 0 is the first processor. The keyword ALL will show
all of them.

A really useful one on web servers and other configurations
that use lots and lots of open files is:
sar -v

This will show the number of used file handles, the % of
available file handles in use, and the same for inodes.

To show the number of context switches (a good indication
of how much time the system is wasting switching between processes):

sar -w 2

- vmstat

This util is part of the procps package, and can provide lots of useful
info when diagnosing performance problems.

Here's a sample vmstat output on a lightly used desktop:

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 5416 2200 1856 34612 0 1 2 1 140 194 2 1 97

And here's some sample output on a heavily used server:

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
16 0 0 2360 264400 96672 9400 0 0 0 1 53 24 3 1 96
24 0 0 2360 257284 96672 9400 0 0 0 6 3063 17713 64 36 0
15 0 0 2360 250024 96672 9400 0 0 0 3 3039 16811 66 34 0

The most interesting number here is the first one: the number of
processes on the run queue. This value shows how many processes are
ready to be executed, but cannot run at the moment because other processes
need to finish. For lightly loaded systems, this is almost never above 1-3,
and numbers consistently higher than 10 indicate the machine is getting
pounded.

Other interesting values include the "system" numbers for in and cs. The
in value is the number of interrupts per second the system is getting. A system
doing a lot of network or disk I/O will have high values here, as interrupts
are generated every time something is read from or written to the disk or network.

The cs value is the number of context switches per second. A context
switch is when the kernel has to swap the executing code for one program
out of memory and switch in another's. It's actually _way_ more complicated
than that, but that's the basic idea. Lots of context switches are bad, since it
takes a fairly large number of cycles to perform a context switch, so if
you are doing lots of them, you are spending all your time changing jobs and
not actually doing any work. I think we can all understand that concept.


- tcpdump/ethereal
- netstat
Netstat is an app for getting general information about the status
of network connections to the machine.
netstat
will just show all the current open sockets on the machine. This will
include unix domain sockets, tcp sockets, udp sockets, etc.
One of the more useful options is:

netstat -pa

The `-p` option tells it to try to determine what program has the
socket open, which is often very useful info. For example, someone nmaps
their system and wants to know what is using port 666. Running
netstat -pa will show them it's `satand` running on that tcp port.

One of the most twisted, but useful invocations is:

netstat -a -n|grep -E "^(tcp)"| cut -c 68-|sort|uniq -c|sort -n

This will show you a sorted list of how many sockets are in each connection
state. For example:

9 LISTEN
21 ESTABLISHED


- what process is doing what and
to whom over the network
- number of sockets open
- socket status

- lsof

/usr/sbin/lsof is a utility that lists all the open
files on the system. There's a ton of options, almost none
of which you ever need.

This is mostly useful for seeing which processes
have which files open. Useful in cases where you need to umount a partition,
or perhaps you have deleted some file, but its space wasn't reclaimed
and you want to know why.
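
Two invocations that cover those cases (a sketch; the mount point is
hypothetical):

/usr/sbin/lsof /home        # every open file on that filesystem, i.e. what is blocking a umount
/usr/sbin/lsof +L1          # open files with a link count of 0: deleted, but still holding space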


The EXAMPLES section of the lsof man page includes many
useful examples.

- fuser
- ldd

ldd prints out shared library dependencies.

For apps that are reporting missing libraries, this is a handy
utility. It shows all the libraries a given app or library is
linked to.

For most cases, what you will be looking for is missing libs.
In the ldd output, they will show up as something like:

libpng.so.3 => (file not found)

In this case, you need to figure out why libpng.so.3 isn't
being found. It might not be in the standard lib paths,
or perhaps not in a path in /etc/ld.so.conf. Or you may
need to run `ldconfig` again to update the ld cache.
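
A quick way to see whether the runtime linker's cache knows about the
lib at all:

/sbin/ldconfig -p | grep libpng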

ldd can also be useful when tracking down cases where
an app is finding a library, but finding the wrong
library. This can happen if there are two libraries
with the same name installed on a system in different
paths.

Since the `ldd` output includes the full path to
the lib, you can see if anything is pointing at
a wrong path. One thing to look for when
scanning for this is one lib that's in a different
lib path than the rest. If an app uses libs from
/usr/lib, except for one from /usr/local/lib, there's
a good chance that's your culprit.

- nm
- file

`file` is a simple utility that tries to figure out
what kind of file a given file is. It does this
by magic(5).

Where this sometimes comes in handy for troubleshooting
is looking for rogue files. A .jpg file that is
actually a .html file. A tar.gz that's not actually
compressed. Cases like those can sometimes cause
apps to behave very strangely.
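
For instance (the file name and output here are illustrative):

file suspect.jpg
suspect.jpg: HTML document text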

- netcat
- to see network stuff
- md5sum
- verifying files
- verifying iso's

- diff

diff compares two files, and shows the difference
between the two.

For troubleshooting, this is most often used on
config files. If one version of a config file works,
but another does not, a `diff` of the two files
can often be enlightening. Since it can be very
easy to miss a small difference in a file, being
able to see just the differences is useful.

For debugging during development, diff (especially
the versions built into revision control systems
like cvs) is invaluable. Seeing exactly what
changed between two versions is a great help.

For example, if foo-2.2 is acting weird, where
foo-2.1 worked fine, it's not uncommon to `diff`
the source code between the two versions to
see if anything related to your problem changed.

- find

For troubleshooting a system that seems to have
suddenly stopped working, find has a few tricks
up its sleeve.

When a system stops working suddenly, the first
question to ask is "what changed?".

find / -mtime -1

That command will recursively list all the
files under / that have changed in the last
day.

find /usr/lib -mmin -30

Will list all the files in /usr/lib that
changed in the last 30 minutes.

Similar options exist for ctime and atime.

find /tmp -amin -30

Will show all the files in /tmp that have
been accessed in the last 30 minutes.

The -atime/-amin options are useful when trying
to determine if an app is actually reading
the files it is supposed to. If you run the app,
then run that command where the files are, and
nothing has been accessed, something is wrong.

If no "+" or "-" is given for the time value,
find will match only exactly that time. This
is handy in several cases. You can determine
what files were modified/created at the
same time.

A good example of this is cleaning up
after a tar archive that was unpacked into
the wrong directory. Since all the files
will have the same access time, you can
use find and -exec to delete them all.
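
A sketch of that cleanup (the path and the time value are hypothetical;
list first, then delete interactively):

find /usr/local -amin 15
find /usr/local -amin 15 -exec rm -i {} \;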

- executables

`find` can also find files with particular
permissions set.

find / -perm -0777

will find all files that are readable, writable, and
executable by everyone, from / down.


find /tmp -user "alikins"

will find all files in /tmp owned
by "alikins"

- used in combo with grep to find
markers (errors, filename, etc)

When troubleshooting, there are plenty of
cases where you want to find all instances of
a filename, or a hostname, etc.

To recursively grep a large number of files,
you can use find and its -exec option:

find . -exec grep foo {} \;

This will grep for "foo" on all files from
the current working directory and down.

Note that in many cases, you can also
use `grep -r` to do this as well.

- ls/stat
- finding [sym|hard] links
- out of space
- df

Running out of disk space causes so
many apps to fail in weird and bizarre
ways that a quick `df -h` is a pretty
good troubleshooting starting point.

Use is easy: look for any volume that's
100% full. Or, in the case of apps that
might be writing lots of data at once,
reasonably close to being filled.

It's pretty common to spend more time
than anyone would like to admit debugging
a problem, only to suddenly hear someone yell
"Damnit! It's out of disk space!".

A quick check avoids that problem.

In addition to running out of space,
it's possible to run out of file system
inodes. A `df -h` will not show this,
but a `df -i` will show the number of
inodes available on each filesystem.

Being out of inodes can cause even
more obscure failures than being
out of space, so something to
keep in mind.

- watch
- used to see if process output changes
- free, df, etc

- ipcs/ipcrm
- anything that uses shm/ipc
- oracle/apache/etc

- google
- googling for error messages can
be very handy

- source code

For Red Hat Linux, you have the source code,
so it can often be useful to search through
the code for error messages, filenames, or
other markers related to the problem. In many
cases, you don't really need to be able to
understand the programming language to
get some useful info.

Kernel drivers are a great example of this, since they
often include very detailed info about which hardware
is supported, what's likely to break, etc.

- strings
`strings` is a utility that will search through a
file and try to find text strings. For troubleshooting,
sometimes it is handy to be able to look for strings
in an executable.

For example, you can run strings on a binary to
see if it has any hard coded paths to helper utilities.
If those utils are in the wrong place, that app may
fail.
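
A sketch of that kind of check (the binary name is hypothetical):

strings /usr/bin/someapp | grep "^/"    # look for hard coded absolute paths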

Searching for error messages can help as well,
especially in cases where you're not sure which
binary is reporting an error message.

In some ways, it's a bit like grepping through
source code for error messages, but a bit
easier. Unfortunately, it also provides far
less info.

- syslog/log levels
- what goes to syslog
- how to get more stuff there

- ksymoops
- get somewhat meaningful info out of kernel
traces
- netdump?

- xev
- debugging keycode/mouseclick weirdness, etc

Logs
- messages, dmesg, lastlog, etc
- log filtering tools?

Using RPM to help troubleshoot
- package verify
- missing deps

Types Of Problems
- Things are missing.

This type of problem occurs in many forms. Shell scripts that
expect an executable to exist that doesn't. Applications linked
against a library that can not be found. Applications expecting
a config file that isn't there.

It can get even more subtle when file permissions are involved.
An app can report a file as "not found" when in reality it
exists, but the permissions are wrong.

- Missing Files

Often an app will fail because of missing files, but will
not be so helpful as to tell you which file is missing. Or it
reports the error in a vague manner like "config file not found".

For most of these cases where something is missing, but you
are not sure _what_, strace is the best tool.

strace -eopen trouble_causing_app

That commandline will list all the files the app is
trying to open, and whether it succeeded or not. The type
of line to look for is something like:

open("/path/to/some/file/", O_RDONLY) = -1 ENOENT (No such file or directory)

That indicates the file wasn't found. In many cases, these errors
are harmless. For example, some apps will try to open config files
in the users home directory, in addition to system config files.
If the user config file doesn't exist, the app might just continue.

- Missing Libs

For missing libraries, the same approach will work. Another
approach is to run `ldd` against the app, and see if any
shared libraries show up as missing. See the `ldd` section
for more details.

- File Permissions

For cases where it's a file permission that's causing
the problem, you are looking for a line like:

open("/path/to/file/you/cant/read", O_RDONLY) = -1 EACCES (Permission denied)

Something about that file is not letting you read it. So the
permissions need to be checked, or perhaps elevated privileges
obtained (i.e., does the app need to be run as root?).


- networking

On modern systems, having networking problems can be
crippling. Troubleshooting what's causing them can be just
as painful.

Some common problems include firewall issues (both on
the client and external), kernel/kernel module issues,
routing problems, name resolution, etc.

Name resolution issues deserve their own category, so
see the name resolution section for more info.

- firewall checks

When seeing odd network behaviour, these days, local
firewalls are a pretty good suspect. Client side
firewalls are getting more and more aggressive.

If you see issues using a network service, especially
a non-standard service, it's possible local firewall
rules are causing it.

Insert info about seeing what firewall rules
are up.

Insert info about increasing log levels to see
firewall rejections in system logs.

Insert info about temporarily dropping firewalls
to diagnose problems.
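
Until those sections get written, a couple of hedged starting points on
Red Hat style systems (run as root):

iptables -L -n -v           # list the firewall rules currently loaded, with counters
service iptables stop       # temporarily drop the firewall while testing; turn it back on afterwards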

- Crappy Connections

A common problem is a connection that is flaky or
degraded. A few easy things to look for to see
if an external connection might be having issues.

- ping to a remote host

`ping` is very simple, and very low level, so
it's a good tool to get an idea if an interface
or route is working correctly.

ping www.yahoo.com

That will start pinging www.yahoo.com and
reporting ping times. Stopping it with ctrl-c
will show a report of any missed packets.

Generally healthy links will have 0 dropped
packets, so anything higher than that is something
to be worried about.

- traceroute

traceroute www.yahoo.com

This attempts to gather info about each node in
the connection. Generally these map to
physical routers, but in these days of
VPNs, it's hard to tell.

If a traceroute stalls at some point,
it usually indicates a problem. Also
look for high ping times, particularly
any node that seems much slower than the
others.

- /sbin/ifconfig

ifconfig does a lot. It can control
and configure network interfaces of all
types. See the man page.

When trying to determine if there are networking
issues, run `ifconfig` and look for the interface
showing issues. If there is a high "error" count,
there could be physical layer issues, or possibly
overloaded routers, etc.

That said, with modern networks, it's pretty
rare to see interface errors, but it's still
something to take a quick look at.

- Bandwidth Useage
When the available network bandwidth runs dry,
it can be difficult to find the culprits. There are
a couple of subtle variations of this. One is a
client machine that has some process using a lot
of bandwidth. Another is a server application
that has one or more clients using a lot of
bandwidth.


- /sbin/ifconfig

ifconfig reports the number of packets
sent/received on a network interface, so
this can be a quick way to get an idea
what interface is out of bandwidth.

- sar

As mentioned in the section on sar, `sar -n DEV`
can be used to see info about the number of
packets each interface is sending at a
given time.

- trafshow

I don't know anything
about trafshow

- ntop/intop
haven't used it in ages


- netstat

`netstat` won't show bandwidth usage, but it
is a quick way to see what applications have
open network connections, which is often a
good start to finding bandwidth hogs.

See the netstat section for more info.

- tcpdump/ethereal

tcpdump and ethereal are both applications
to monitor network traffic. tcpdump is pretty
standard, but ethereal is more featureful.

ethereal also has a nice graphical user
interface which can be very handy when
attempting to digest the large amounts of
data a network trace can deliver.

The basic approach is to fire up ethereal,
start a capture, let whatever weird networking
you're trying to diagnose happen, then stop
the capture.

Ethereal will display all the connections
it traced during the capture. There are
a couple ways to look for bandwidth hogs.

The "Statitics" menu has a couple useful
options. The "Protocol Hierarchy" shows
what % of packets in the trace is from
each type of protocol. In the case of
a bandwidth hog, at least which protocol
is the culprit should be easy to spot
here.

The "Conversations" screen is also helpful
for looking for bandwidth hogs. Since you
can sort the "conversations" by number of
packets, the culprit is likely to hop to
the top. This isn't always the case, as it
could easily be many small connections killing
the bandwidth, not one big heavy connection.

As far as tcpdump goes, the best way to spot
bandwidth hogs is just to start it up. Since
it pretty much dumps all traffic to the screen
in a text format, just keep your eyes peeled for
what seems to be coming up a lot.

- using iptables

iptables keeps per-rule counters of how much traffic has
matched a given rule.


Something like:
iptables -L -n -v -x

will list the rules along with packet and byte counts for each
(-v shows the counters, -x prints exact rather than rounded numbers).



- routing issues
- kernel module flakiness
- dropped connections


- tcpdump/ethereal
- netcat
- netstat

- Programs Crashing

You just finished the last page in your 1200 page
novel about how aliens invaded Siberia in the
19th century and made everyone depressed. *boom*
the word processor disappears off the screen faster
than it really should. It segfaulted. Your work
is lost.

Crashing applications are annoying to say the
least. But sometimes, there are ways to figure
out why they crashed. And if you can figure out
why, you might be able to avoid it next time.

- Crash Catchers

Most GNOME/KDE apps now are linked against libs
that include a crash catching utility. Basically,
whenever the app gets a segfault, the handler for
it invokes a process, attaches to it with a debugger,
gets a stacktrace, and offers to upload it to a
bug tracking system.

Since these include the option to see the stack
trace, it can be a handy way to get a stack trace.
Once you have a stack trace, it should point
you to where the app is crashing. Figuring out
why it crashed varies greatly in complexity.

- strace

`strace` can also be handy for tracking down
crashes. It doesn't provide as much detail
as ltrace or gdb, but it is commonly available.

The idea is to start the app under strace,
wait for it to crash, and see what the last
few things it did were. Some things to look for
include recently opened files (maybe the app
is trying to load a corrupted file), memory
management calls (maybe something is causing
it to use large amounts of ram), and failed network
connections (maybe the app has poor error handling).

- ltrace

`ltrace` is a bit more useful for debugging crashing
apps, as it can give you an idea what function
an app was in when it crashed. Not as useful
as a real stack trace, but it's easier.

- gdb

When it comes to figuring out all the gory details
of why an app crashed, nothing is better than
`gdb`.

For basic usage, see the gdb section in the
tools section of this document.

For more detailed usage, see the gdb documentation.

need some examples here
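
In the meantime, a minimal sketch of running an app under gdb (the
binary name and arguments are hypothetical):

gdb /usr/bin/someapp
(gdb) run --some-args
  ... wait for the crash ...
(gdb) bt                    # backtrace showing where it crashed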

- debuginfo packages

One caveat with using gdb on most apps is that
they are stripped of debugging information. You can
still get a stack trace, but it will not be as meaningful
as one with the debug information available.

In the past, this meant recompiling the application with
debugging turned on, and "stripping" turned off. Which
can at times, be a slow and painful process.

In Red Hat Enterprise Linux 3 and later, you can install
the "debuginfo" packages.

See the gdb section in the tools section for more info
on debug packages.

- core files

If an application has crashed, and left a core
file (see the "Allowing Core Files" section under
the "Enviroment Settings" section for info on how
to do this), you can use gdb to debug the core file.

Invocation is easy:

gdb /path/to/app /path/to/core/file

After loading the core file, you can issue `bt`
to get a backtrace.

See the gdb section above for information about
"debuginfo" packages.

- configs screwed up
An incorrect, missing, or corrupt config file can wreak
havoc. Well coded apps will usually give you some idea
if a config file is bogus, but that's not always the case.

- Finding the config files

The first thing is figuring out if an app
uses a config file and what it is. There are
a couple of ways to do this.

- finding config files with rpm

If a package was installed from an rpm, it
should have the config files flagged as such.
To query a package to see what its config files
are, issue the command:

rpm -q --configfiles packagename

While you are using rpm, you should see if
the config files have been modified from
the defaults.

rpm -V packagename

That command will list all the files in that
package that have been changed in some way.
The rpm man page includes more details on what
the output means, but the basics are:

if there is an "S", the file's size has changed.
if there is a "5", the file's contents have changed (the md5sum differs).
if there is an "M", the file's permissions or type have changed.

- strace

Using `strace -eopen process` is a good way to see what
files a process is opening, including any config files.

- documentation

If all else fails, try reading the docs. Often the
man pages or docs describe where and what the config
files are.

- Verifying the Config Files

Once you know what the config files are, you
need to verify they are correct. This is
highly application dependent.

- diff'ing against known good files

If you have a known good config file,
diffing the old file and the new one
can often be useful.

- look for .rpmnew or .rpmorig files.

In some cases, rpm will install a new config
file alongside the existing one. This happens
on package upgrades where the default config
file has changed between the two packages, and
the version on disk is different from either
version.

The idea being, if the default config file is
different, then it's possible the config file
format changed. Which means the previous on
disk config file may not work with the new
version, so a .rpmnew version is installed
alongside the existing one.

So if an app is suddenly behaving oddly,
especially after a package update, see
if there are any .rpmnew or .rpmorig files.
If so, you may need to update the existing
config file to use the new format.

- stat/ls

If an app is behaving oddly, and you believe
it is because of a config file, you should
check to see when that file was modified.

stat /path/to/config/file

The `stat` command will give you the last
modified and last accessed times. If the
file was modified more recently than you think
it should have been, something or someone has
probably changed it.

See the information on the `find` utility
for ways to look for all files modified
at/since/before a certain time.

- gconf
- The config file has changed but the app is ignoring it
- is it the correct config file?

Often an application will look for config
files in several places. In some cases,
some versions of the config file have
precedence over other versions.

A common example is for an app to have
a default config file, a per-system
config file, and a per-user config file,
with the user and system ones overriding
the default one. For some apps, individual
config items have their own inheritance
rules.

So for example, if you're modifying a system
wide config file, make sure there isn't
a per-user config file that overrides the
change.

- is it a daemon?

daemon and server processes typically
only read their config file when they
start up.

Sometimes a restart of the process is
required. Sometimes it is possible to
send a "HUP" signal to an app to force
it to reload configs. To send a "HUP"
signal:

kill -HUP $pid

Where $pid is the process id of the
running process.

Sometimes init scripts will have
options for reloading config files.

Apache httpd's init script has
a reload option.

service httpd reload

- shell config?

Some processes, user shells in particular,
have fairly complicated rules about when
their config files are read.

See the "INVOCATION" section of the
bash man page for an example of when
the various bash config files get loaded.


- kernel issues
- single user
- init=/bin/bash
- bootloader configs
- log levels

- stuff not writing to disk
- out of space
You run a command to write a file, or save a file from an
app. When you go to look at the file, it's not there, or
it's empty. Or the app complains that it is "unable to write to device".
What's going on?

More than likely, the system does not have any storage space
for the file. The file system that the app is trying to write
to is full, or nearly full.

This case can cause a wide variety of problems. The
easiest way to check to see if this is the case is
the `df` command. See the df section in the tools
section for more info on df.

One thing to keep in mind is to check that the correct
filesystem has space. Just because something in
`df` shows free space doesn't mean the app
can use it.

- out of inodes

`df -i` can catch this one as well. It's fairly
uncommon these days, but it can still happen.

- file permissions

Check the file permissions for the file, and
directory the app is trying to write to.

You can use strace to see where it's writing to
if nothing else tells you.

- ACL's

If the system is using ACL's, you need to
verify the user/app is in the proper ACL's.

- selinux

selinux can control what app can write where
and how. So verify the selinux perms are
correct.

need more info on tracking down
selinux issues

- quotas

If the system has file system quotas enabled,
it's possible the user is over quota.

`quota`

That command will show the current quota
usage, and indicate if quotas are in
effect or not.

- read-only mounts

Network file systems in particular tend
to mount shared partitions read-only. The
mount options override any file permissions
on the file system that is being shared.

- read only media

cd-roms are read-only media. The app isn't
trying to write to one, is it?

- chattr/lsattr

One feature of ext2/3 is the ability to
`chattr` files. These are per-file attributes
beyond the standard unix permissions.

See the chattr/lsattr section of the tools
section for more details.

If a file has had the command `chattr +i` run
on it, then the file is "immutable" and nothing
can modify it or delete it, including the root
user. The only way to change it is to run `chattr -i`
on it. Then it can be treated as a normal file.
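
A short sketch (the path is hypothetical):

lsattr /path/to/file        # look for an 'i' in the attribute flags
chattr -i /path/to/file     # clear the immutable flag (must be done as root)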


- files doing weird stuff
The app is reading the right file. The file _looks_
correct, but it is still behaving weirdly. A few
things to look for.

- hidden chars

Sometimes a file can get hidden characters in
it that give parsers headaches. This is increasingly
common as support for more character encodings
becomes widespread.

- dos style carriage returns
- embedded tabs
- high byte chars

One good approach is to open the file with
vi in binary mode:

vi -b filename

Then put vi into 'list' mode. Do this
by hitting escape and entering ":set list".

This should show any non-printing chars, line
endings, tabs, etc.

The `od` utility can be useful for viewing
files "in the raw" as well.
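
One common invocation (the filename is hypothetical):

od -c suspect.conf | head    # show each byte; \r, \t, and stray high bytes stand out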

- ending new line

Some apps are picky about having any
extra newlines at the end of files. So it's
something to look for.

- trailing spaces

A particularly hard to spot circumstance
that can break some parsers: trailing
spaces after a string. This can be
especially difficult to spot in
cases where it's a space, then a
newline.

This seems to be particularly common
for config options for usernames and
passwords: "foobar" != "foobar ".


- env stuff
The users "enviroment" can often cause problems
for applications. Some typical cases and
how to detect them. X DISPLAY settings,
the PATH, HOME, TERM settings etc can
cause issues.

- things work as user/not root, vice versa

There can be any number of reasons something
works as root, but not as a user. Most
of them are related to permissions (on
files or devices).

Another common cause is PATH. On Red Hat
at least, users do not have /sbin:/usr/sbin
in their PATH by default. So some scripts
or commands can fail because they are not
in the PATH. Even having the PATH order
be different between root and a user can
cause problems.

X forwarding crap

- env

The easiest way to see environment
variables is just to run:

env

- what basic env stuff means
- su/sudo issues
- env -i to launch with clean env

If an app seems to be having issues
that are environment dependent, one
thing that can be useful when
troubleshooting is to launch it with `env -i`.
Something like:

env -i /bin/someapp

`env -i` basically strips all environment
variables, so the app can launch with
nothing set.

- su -, etc

If `su` is being used to gain root,
one thing to keep in mind is the difference
between `su` and `su -`. The '-' tells su
to start up a new login shell. In practice,
this means `su -` essentially gets a shell
with the same environment a normal root
shell would have.

A shell created with `su` still has the
user's SHELL, PATH, USERNAME, HOME, and
other variables from the user's shell.

A shell created with `su -` has the
same variables a root shell has.

This can often cause weird behavior
in apps that depend on the environment.

- sudo -l

- shell scripting

Scripting in bash or sh is often
the quickest and easiest way to solve
a problem. It's also heavily used in
the configuration and startup of Red Hat
Linux systems.

Unfortunately, debugging shell scripts
can be quite painful. Debugging shell
scripts someone else wrote a decade
ago is even worse.

- echo

The bash builtin "echo" is often the
best debugging tool. A common trick
is just to add "echo" to the begining
of a line of code that you believe is
doing something incorrect.

This will just print out the line, but
after variable expansion. Particularly
handy if the line in question is using
lots of shell variables.
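
A tiny sketch of the trick (the variables are hypothetical):

# original line in the script:
rm -rf $PREFIX/$SUBDIR
# temporarily changed to, so you can see what would actually be removed:
echo rm -rf $PREFIX/$SUBDIR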

- sh -x

Bash includes some support for getting
more verbose information out of
scripts as they run. Invoking
a shell script as follows:

sh -x /path/to/someshell.sh

That command will print out every
line as it executes it.

- trap
- bash debugger

There is a bash debugger available
at http://bashdb.sourceforge.net/.

It's essentially a "gdb" style debugger
but for bash, including support for
step debugging, breakpoints, and
watchpoints.

- DNS/name resolution

Once a network is up and going, name resolution
can continue to be a source of problems. Since
so many applications expect reliable name resolution,
things tend to break in confusing ways when it fails.

- usage of dig

`dig` is probably the most useful tool for
tracking down DNS issues.

insert useful dig examples
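
A few common invocations to start from (the nameserver and address
below are hypothetical):

dig www.redhat.com                     # full query; check the ANSWER SECTION
dig +short www.redhat.com              # just the answer
dig @ns1.example.com www.redhat.com    # ask a specific nameserver directly
dig -x 192.0.2.1                       # reverse lookup of an address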

- /etc/hosts

Check /etc/hosts for spurious entries. It's
not uncommon for "temporary" /etc/hosts entries
to become permanent, and when the host ip does
change, things break.

- nscd

nscd is a name service caching daemon. It's
very useful when using name services info
like hesiod and ldap. But it can also
cache DNS as well.

Most of the time, it just works. But
it's been known to break in odd and
mysterious ways before. So trying
DNS with and without nscd running
is a good idea.


- /etc/nsswitch.conf
- splat names/host typos

"*" matching on DNS servers is pretty
common these days. It normally doesn't
cause any problems, so much as it can
make certain types of errors harder
to track down.

A typo in a hostname will get redirected
to another server (typically a web server)
instead of giving a name resolution error.

Since the obvious "host not found" errors
don't happen, tracking down these kinds
of problems can be compounded when
"wildcard" DNS is in use.

- auth info
- getent
- ypwhich/match/cat

- certificate/crypto issues
- ssl CA certs
- gpg keys/signatures
- rpm gpg keys
- ssltool
- curl

- Network File Systems
- NFS causes weird issues
- timestamps
- perms/rootsquash/etc
- weird inode caching
- samba
- it touches windows stuff, icky

- Some app/apps is rapidly forking short-lived processes
- gah, what a PITA to troubleshoot
- psacct?
- sar -X?
- watching pids grow?
- dump-acct + parsing?

App specific
- apache
- scorecard stuff
- module debugging
- log files
- init file "configtest"
- -X debug mode

- php
-

- gtk apps
- event debugging stuff?

- X apps
- nosync stuff
- X log

- ssh
- debug flags
- sshd -d -d

- pam/auth/nss
- logging options?
- getent

- sendmail

Credits

Comments, suggestions, hints, ideas, criticisms, pointers,
and other useful info from various folks including:

Mihai Ibanescu
Chip Turner
Chris MacLeod
Todd Warner
Nicholas Hansen
Sven Riedel
Jacob Frelinger
James Clark
Brian Naylor
Drew Puch
