http://people.redhat.com/alikins/troubleshooting/
This is a guide to basic, and not so basic, troubleshooting and
debugging on Red Hat Linux systems. Goals include describing the
usage of common tools, how to find information, and other details
that may be helpful to someone diagnosing a problem. The emphasis
is on software issues, but hardware may come up as well.
Environment settings
   - Allowing Core Files
                                                                                                  
        "core" files are dumps of a processes memory. When a program crashes
    it can leave behind a core file that can help determine what
    was the cause of the crash by loading the core file in a debugger.
                                                                                                  
        By default, most Linux distributions turn off core file support by
        setting the maximum allowed core file size to 0.
                                                                                                  
        In order to allow a segfaulting application to leave a core, you need
        to raise this limit. This is done via `ulimit`. To allow core
    files to be of an unlimited size, issue:
                                                                                                  
                ulimit -c unlimited
                                                                                                  
        See the section on GDB for more information on what to do
        with core files.
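
        A quick way to check and test this (a sketch; "some-app" is just
        a placeholder for whatever program is crashing):

                ulimit -c              # show the current core size limit
                ulimit -c unlimited    # raise it for this shell
                ./some-app             # when it segfaults, look for a file
                                       # named "core" or "core.<pid>" in its
                                       # working directory

        Note that the ulimit setting only applies to processes started
        from that shell.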
  - LD_ASSUME_KERNEL
                                                                                                  
        LD_ASSUME_KERNEL is an environment variable used by the dynamic
        linker to decide which implementation of a library is used. In
        most cases, the most important lib is the C library, "libc" or
        "glibc".
                                                                                                  
        The reason "glibc" is important is because it contains the
        thread implentation for a system.
                                                                                                  
        The values you can set LD_ASSUME_KERNEL to correspond to Linux
        kernel versions. Since glibc and the kernel are tightly bound,
        it's necessary for glibc to change its behaviour based on
        what kernel version is installed.
                                                                                                  
        For properly written apps, there should be no reason to use
        this setting. However, for some legacy apps that depend
        on a particular thread implementation in glibc, LD_ASSUME_KERNEL
        can be used to force the app to use an older implementation.
                                                                                                  
        The primary targets: LD_ASSUME_KERNEL=2.4.20 or newer selects
        the NPTL thread library. LD_ASSUME_KERNEL=2.4.1 uses the
        implementation in /lib/i686 (newer LinuxThreads).
        LD_ASSUME_KERNEL=2.2.5 or older uses the implementation
        in /lib (old LinuxThreads).
                                                                                                  
        For an app that requires the old thread implementation, it
        can be launched as:
                                                                                                  
                LD_ASSUME_KERNEL=2.2.5 ./some-old-app
                                                                                                  
        See http://people.redhat.com/drepper/assumekernel.html for
        more details.
                                                                                                  
     - glibc environment variables
                                                                                                  
                There's a wide variety of environment variables that
        glibc uses to alter its behaviour, many of which are
        useful for debugging or troubleshooting purposes.
                                                                                                  
        A good reference on these variables is at:
                http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html
                                                                                                  
        Some interesting ones:
                                                                                                  
                LANG and LANGUAGE
                                                                                                  
                        LANG sets what message catalog to use, while LANGUAGE
                        sets LANG and all the LC_* variables. These control
                        the locale specific parts of glibc.
                                                                                                  
                        Lots of programs are written expecting to run in
                        one locale, and can break in other locales. Since
                        locale settings can change things like sort order
                        (LC_COLLATE) and time formats (LC_TIME), shell
                        scripts are particularly prone to problems from this.
                                                                                                  
                        A script that assumes the sort order of something is
                        a good example.
                                                                                                  
                        A common way to test this is to try running the
                        troublesome app with the locale set to "C", or the
                        default locale.
                                                                                                  
                                LANGUAGE=C ls -al
                                                                                                  
                        If the app starts behaving when run that way, there
                        is probably something in the code that is assuming
                        the "C" locale (sorted lists and time formats are
                        strong candidates).
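
                        A quick way to see the sort order difference for
                        yourself (a sketch; this assumes the en_US locale
                        is installed):

                                printf "a\nB\n" | LC_ALL=C sort
                                printf "a\nB\n" | LC_ALL=en_US sort

                        The "C" locale sorts by byte value, so "B" comes
                        before "a"; the en_US locale uses dictionary
                        ordering, so "a" comes first.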
   - glibc malloc stuff
     - all the glibc env variable stuff
Tools
    Efficiently debugging and troubleshooting is often a matter
    of knowing the right tools for the job, and how to use
    them.
                                                                                                 
   - strace
 - simple usage
 - filtering output
 - examples
 - use as profiling
 - see what files are open
 - network connections
                Strace is one of the most powerful tools available
        for troubleshooting. It allows you to see what an application
        is doing, to some degree.
                                                                                                  
                strace displays all the system calls that an application
        is making, what arguments it passes to them, and what the
        return code is. A system call is generally something that
        requires the kernel to do something. This generally means
        I/O of all sorts, process management, shared memory and
        IPC usage, memory allocation, and network usage.
                                                                                                  
                - examples
                                                                                                  
                The simplest example of using strace is as follows:
                        strace ls -al
                                                                                                  
                This starts the strace process, which then starts `ls -al`
        and shows every system call. For `ls -al` this is mostly
        I/O related calls. You can see it stat'ing files, opening
        config files, opening the libs it is linked against, allocating
        memory, and write()'ing out the contents to the screen.
                                                                                                  
                                                                                                  
                - what files is this thing trying to open
                                                                                                  
                  A common troubleshooting technique is to see
         what files an app is reading. You might want to make
        sure it's reading the proper config file, or looking
        at the correct cache, etc. strace by default shows
        all file i/o operations.
                                                                                                  
                But to make it a bit easier, you can filter
        strace output. To see just file open()'s
                                                                                                  
                strace -eopen ls -al
                                                                                                  
                - whats this thing doing to the network
                                                                                                  
                To see all network related system calls (name
        resolution, opening sockets, writing/reading to sockets, etc)
                                                                                                  
                strace -e trace=network curl --head http://www.redhat.com
                                                                                                  
                - rudimentary profiling
                                                                                                  
                One thing that strace can be used for that is useful for
        debugging performance problems is some simple profiling.
                                                                                                  
                strace -c  ls -la
                                                                                                  
        Invoking strace with '-c' will cause a cumulative report of
        system call usage to be printed. This includes the approximate
        amount of time spent in each call, and how many times a
        system call is made.
        This can sometimes help pinpoint performance issues, especially
        if an app is doing something like repeatedly opening/closing
        the same files.
                                                                                                  
                strace -tt ls -al
                                                                                                  
        the -tt option causes strace to print out the time each call
        finished, in microseconds.
                                                                                                  
                strace -r ls -al
                                                                                                  
        the -r option causes strace to print out the time since the
        last system call. This can be used to spot where a process
        is spending large amounts of time in user space, or especially
        slow syscalls.
                                                                                                  
                - following forks and attaching to running processes
                                                                                                  
        Often it is difficult or impossible to run a command under
        strace (an Apache httpd for instance). In this case, it's
        possible to attach to an already running process.
                                                                                                  
                strace -p 12345
                                                                                                  
        where 12345 is the PID of a process. This is very handy
        for trying to determine why a process has stalled. Many
        times a process might be blocking while waiting for I/O;
        with strace -p, this is easy to detect.
                                                                                                  
        Lots of processes start other processes. It is often desirable
        to see a strace of all the processes.
                                                                                                  
                strace -f /etc/init.d/httpd start
                                                                                                  
        will strace not just the bash process that runs the script, but
        any helper utilities executed by the script, and httpd itself.
 Since strace output is often a handy way to help a developer
 solve a problem, it's useful to be able to write it to a file.
 The easiest way to do this is with the -o option.
  strace -o /tmp/strace.out program
  Being somewhat familiar with the common syscalls for Linux
  is helpful in understanding strace output. But most of the common
  ones are simple enough to be able to figure out from context.
  A line in strace output is essentially the system call name,
  the arguments to the call in parens (sometimes truncated...), and then
  the return status. A return status of -1 typically indicates an error,
  though this varies. For more information about the return status of a
  typical system call, see `man 2 syscallname`. Usually the return status
  will be documented in the "RETURN VALUE" section.
  Another thing to note is that strace often shows the "errno"
  status. If you're not familiar with Unix systems programming, errno is a
  global variable that gets set to specific values when some calls fail,
  with the value indicating the kind of error. More info on this can be
  found in `man errno`. Typically, strace will show a brief description
  for any errno values it gets, e.g.:
   open("/foo/bar", O_RDONLY) = -1 ENOENT (No such file or directory)
 strace -s X
  the -s option tells strace to show the first X characters of strings.
  The default is 32 characters, which sometimes isn't enough. Raising it
  increases the info available to the user.
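  For example (a sketch; the file read is just illustrative):
   strace -s 256 -eread cat /etc/hosts
  shows up to 256 characters of each string, such as the data in the
  read() calls, instead of cutting them off at 32.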
 
   - ltrace
 - simple usage
 - filtering output
        ltrace is very similar to strace, except ltrace focuses on
        tracing library calls.
                                                                                                  
        For apps that use a lot of libs, this can be a very powerful
        debugging tool. However, because most modern apps use libs
        very heavily, the output from ltrace can sometimes be
        painfully verbose.
        
  There is a distinction between what makes a "system call"
  and a call to a library function. Sometimes the line between the
  two is blurry, but the basic difference is that system calls are
  essentially communicating with the kernel, and library calls are just
  running more userland code. System calls are usually required for
  things like I/O, process control, memory management issues,
  and other "kernel" things.
  Library calls are, by bulk, generally calls to the standard
  C library (glibc..), but can of course be calls to any library,
  say Gtk, libjpeg, libnss, etc. Luckily most glibc functions
  are well documented and have either man or info pages. Documentation
  for other libraries varies greatly.
                                                                                          
        ltrace supports the -r, -tt, -p, and -c options the same
        as strace. In addition it supports the -S option which
        tells it to print out system calls as well as library
        calls.
                                                                                                  
        One of the more useful options is "-n 2", which will
        indent 2 spaces for each nested call. This can make the
        output much easier to read.
                                                                                                  
        Another useful option is the "-l" option, which
        allows you to specify a specific library to trace, potentially
        cutting down on the rather verbose output.
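
        Putting a few of these together (a sketch; the traced command
        is just an example):

                ltrace -S -n 2 -o /tmp/ltrace.out ls -al

        writes an indented trace of library calls and system calls for
        `ls -al` to /tmp/ltrace.out.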
   - gdb
 `gdb` is the GNU debugger. A debugger is typically used by developers
      to debug applications in development. It allows for a very detailed
 examination of exactly what a program is doing.
  That said, gdb isn't as useful as strace/ltrace for troubleshooting/sysadmin
  types of issues, but occasionally it comes in handy.
  For troubleshooting, it's useful for determining what caused a
  core file (`file core` will also typically show you this information).
  But gdb can also show you "where" the app crashed. Once you determine
  the name of the app that caused the failure, you can start gdb with:
   gdb filename core
   then at the prompt type
   `where`
  The unfortunate thing is that all the binaries we ship are
  stripped of debugging symbols to make them smaller, so this often returns
  less than useful information. However, starting in Red Hat Enterprise Linux
  3, and included in Fedora, there are "debuginfo" packages. These
  packages include all the debugging symbols. You can install them the
  same as any other rpm, so `rpm`, `up2date`, and `yum` all work.
  The only difficult part about debuginfo rpms is figuring out
  which ones you need. Generally, you want the debuginfo package
  for the src rpm of the package that's crashing.
   rpm -qif /path/to/app
  will tell you the info for the binary package the app is part of.
  Part of that info includes the src.rpm. Just use the package name
  of the src rpm plus "-debuginfo".
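  For example (a sketch; "foo" is a placeholder package name):
   rpm -qif /usr/bin/foo
  will show a "Source RPM" line such as foo-1.2-3.src.rpm, so the
  package to install would be foo-debuginfo, via whichever of
  `rpm`, `up2date`, or `yum` you normally use.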
   - python debugging
   - perl debugging
 - `splain`
 - `perl -V`
 - perldoc -q
 - perldoc -l
   - sh debugging
   - bugbuddy etc
   - top
  `top` is a simple text based system monitoring tool. It packs
  a lot of information onto the screen, which can be helpful when
  troubleshooting problems, particularly performance related
  problems.
  The top of the "top" output includes a basic summary of the system.
 The top line is current time, uptime since the last reboot, users
 logged, and load average. The load average values here are the
 load for the last 1, 5,and 15 minutes. A load of 1.0 is considerd
 100% cpu utilization, so loads over one typically means stuff
 is having to wait. There is a lot of leeway and approxiation in
 these load values however.
                                                                                                                                                                                     
         The memory line shows the total physical ram available
  on the system, how much of it is used, how much is free, and how
  much is shared, along with the amount of ram in buffers. These
  buffers are typically file system caching, but can be other things.
  On a system with a significant uptime, expect the buffer value to
  take up all free physical ram not in use by a process.
  The swap line is similar.
                                                                                                                                                                                     
  Each of the process entries contains several
  fields by default. The most interesting are RSS, %CPU, and
  TIME. RSS shows the amount of physical ram the process is consuming.
  %CPU shows the percentage of the available processor time a process
  is taking, and TIME shows the total amount of processor time the process
  has had. A processor intensive program can easily accumulate more "TIME"
  in just a few seconds than a long running, low CPU process.
  Sorting the output:
          M : sorts the output by memory usage. Pretty handy for figuring
              out which version of openoffice.org to kill.
          P : sorts the processes by the % of cpu time they are using.
          T : sorts by cumulative time.
          A : sorts by age of the process, newest process first.
  Command line options:
          The only really useful command line options are:

          b [batch mode] writes the standard top output to
          stdout. Useful for a quick "system monitoring hack".

          e.g., top d 360 b >> foo.output
          to get a snapshot of the system appended to foo.output every
          six minutes.
   - ps 
  `ps` can be thought of as a one shot top. But it's a bit
  more flexible in its output than top.
  As far as `ps` commandline options go, it can get pretty
  hairy. The Linux version of `ps` inherits ideas from both
  the BSD and the SysV versions. So be warned.
 The `ps` man page does a pretty good job of explaining
 this, so look there for more examples.
 some examples:
  ps aux
  shows all the processes on the system in a "user" oriented
  format. In this case, meaning the username of the owner
  of the process is shown in the first column.
  ps auxww
 the "w" option, when used twice, allows the output to be
 of unlimited width. For apps started with lots of commandline
 options, this will allow you to see all the options. 
  ps auxf
 the 'f" option, for "forest" tries to present the list
 of processes in a tree format. This is a quick and easy
 way to see which process are child processes of what.
  ps -eo pid,%cpu,vsz,args,wchan
  This is an interesting example of the -eo option. This
  allows you to customize the output of `ps`. In this
  case, the interesting bit is the wchan option, which
  attempts to show what syscall the process is in when
  `ps` checks.
  For things like Apache httpds, this can be useful
  to get an idea of what all the processes are doing
  at one time. See the info in the strace section
  on understanding system call info for more info.
   
    - sysstat/sar
        Sysstat works in two parts: a service that
        collects information, and a "monitoring" tool.

  The collection service is called "sysstat", and the monitoring
        tool is called `sar`.

        To start it, start the sysstat service:

   /etc/init.d/sysstat start
        
 To see a list of `sar` options, just try `sar --help`
        
  Things to note: there are lots of commandline options.
        The trailing number is the interval, the time in seconds
        between updates.

                sar 3

        will run the default sar report every three seconds.
       
  For a complete summary, try:
                
  sar -A
        
 This generates a big pile of info ;->
        
 To get a good idea of disk i/o activity:
                
  sar -b 3
        
 For something like a heavily used web server, you
        may want to get a good idea how many processes
        are being created per second:
                
  sar -c 2
        
  It's kind of surprising to see how many processes can
        be created.
        
  There's also some degree of hardware monitoring built in.
        Monitoring how many times an IRQ is triggered can also
        provide good hints at what's causing system performance problems.
                
  sar -I SUM 3
        
 Will show the total number of system interrupts
                
  sar -I 14 2
        
 Watch the standard IDE controller IRQ every two seconds.
        
 Network monitoring is in here too:
                
  sar -n DEV 2
        
  will show the # of packets sent/received, # of bytes transferred, etc.
                sar -n EDEV 2
        
 Will show stats on network errors.
        
  Memory usage can be monitored with something like:
                
  sar -r 2
        
 This is similar to the output from `free`, except more easily
        parsed.
        
 For SMP machines, you can monitor per CPU stats with:
                
  sar -U 0
        where 0 is the first processor. The keyword ALL will show
        all of them.
        
  A really useful one on web servers and other configurations
        that use lots and lots of open files is:

                sar -v

        This will show the number of used file handles, the % of
        available file handles, and the same for inodes.
        To show the number of context switches (a good indication
        of how much time is being wasted switching between processes):
       
                sar -w 2
   - vmstat 
      This util is part of the procps package, and can provide lots of useful
       info when diagnosing performance problems.
        Here's a sample vmstat output on a lightly used desktop:
   procs                      memory    swap          io     system  cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy id
 1  0  0   5416   2200   1856  34612   0   1     2     1  140   194   2   1 97
        And here's some sample output on a heavily used server:
   procs                      memory    swap          io     system  cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy id
16  0  0   2360 264400  96672   9400   0   0     0     1   53    24   3   1 96
24  0  0   2360 257284  96672   9400   0   0     0     6 3063 17713  64  36 0
15  0  0   2360 250024  96672   9400   0   0     0     3 3039 16811  66  34 0
        The interesting number here is the first one ("r"), the number of
  processes that are on the run queue. This value shows how many processes are
  ready to be executed, but can not be run at the moment because other processes
  need to finish. For lightly loaded systems, this is almost never above 1-3,
  and numbers consistently higher than 10 indicate the machine is getting
  pounded.
        Other interesting values include the "system" numbers for in and cs. The
  in value is the number of interrupts per second a system is getting. A system
  doing a lot of network or disk I/O will have high values here, as interrupts
  are generated every time something is read or written to the disk or network.
  The cs value is the number of context switches per second. A context
  switch is when the kernel has to swap the executing code for a program
  out of the processor, and switch in another. It's actually _way_ more complicated
  than that, but that's the basic idea. Lots of context switches are bad, since it
  takes a fairly large number of cycles to perform a context switch, so if
  you are doing lots of them, you are spending all your time changing jobs and
  not actually doing any work. I think we can all understand that concept.
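        To watch these numbers change over time, give vmstat an interval
        and a count:

                vmstat 2 10

        prints a new report every two seconds, ten times. The first line
        is an average since boot, so look at the later lines.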
   - tcpdump/ethereal
   - netstat 
       Netstat is an app for getting general information about the status
  of network connections to the machine.

           netstat

        will just show all the current open sockets on the machine. This will
  include unix domain sockets, tcp sockets, udp sockets, etc.
        One of the more useful options is:
                   
         netstat -pa
        
  The `-p` option tells it to try to determine what program has the
  socket open, which is often very useful info. For example, someone nmaps
  their system and wants to know what is using port 666. Running
  netstat -pa will show you which process is listening on that tcp port.
 One of the most twisted, but useful invocations is:
  netstat -a -n|grep -E "^(tcp)"| cut -c 68-|sort|uniq -c|sort -n
 This will show you a sorted list of how many sockets are in each connection
 state. For example:
       9  LISTEN      
      21  ESTABLISHED 
 
 - what process is doing what and
   to whom over the network
 - number of sockets open
 - socket status
   - lsof
  /usr/sbin/lsof is a utility that lists all the open
  files on the system. There's a ton of options, almost none
  of which you ever need.
       
  This is mostly useful for seeing what processes
  have what file open. Useful in cases where you need to umount a partition,
  or perhaps you have deleted some file, but its space wasn't reclaimed
  and you want to know why.
 The EXAMPLES section of the lsof man page includes many
 useful examples.
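  A couple of quick sketches (the mount point is just an example):
   lsof /home
  lists the processes with files open under /home, which is handy
  right before a umount. And
   lsof +L1
  lists open files with a link count of zero, i.e. files that have
  been deleted but are still held open, which is usually why their
  space was never reclaimed.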
   - fuser
   - ldd
  ldd prints out shared library dependencies.
 For apps that are reporting missing libraries, this is a handy
 utility. It shows all the libraries a given app or library is
 linked to. 
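  For example (the path is just illustrative):
   ldd /usr/bin/some-app
  prints each shared library the binary wants, along with the full
  path of the copy the dynamic linker would actually use.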
 
  For most cases, what you will be looking for is missing libs.
  In the ldd output, they will show something like:
   libpng.so.3 => (file not found)
  In this case, you need to figure out why libpng.so.3 isn't
  being found. It might not be in the standard lib paths,
  or perhaps not in a path in /etc/ld.so.conf. Or you may
  need to run `ldconfig` again to update the ld cache.
  ldd can also be useful when tracking down cases where
  an app is finding a library, but finding the wrong
  library. This can happen if there are two libraries
  with the same name installed on a system in different
  paths.
 
  Since the `ldd` output includes the full path to
  the lib, you can see if anything is pointing
  at a wrong path. One thing to look for when
  scanning for this is one lib that's in a different
  lib path than the rest. If an app uses libs from
  /usr/lib, except for one from /usr/local/lib, there's
  a good chance that's your culprit.
   - nm
   - file
 `file` is a simple utility that tries to figure out
 what kind of file a given file is. It does this
 by magic(5). 
 Where this sometimes comes in handy for troubleshooting
 is looking for rogue files. A .jpg file that is
  actually a .html file. A tar.gz that's not actually
 compressed. Cases like those can sometimes cause
 apps to behave very strangely. 
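  For example (the filename is made up):
   file suspicious-image.jpg
  If the output says something like "HTML document text" instead of
  "JPEG image data", you have found your rogue file.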
   - netcat
 - to see network stuff
   - md5sum
 - verifying files
 - verifying iso's
   - diff
 diff compares two files, and shows the difference
 between the two. 
 For troubleshooting, this is most often used on
 config files. If one version of a config file works,
 but another does not, a `diff` of the two files
 can often be enlightening. Since it can be very
 easy to miss a small difference in a file, being
  able to see just the differences is useful.
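  For example (the filenames are illustrative):
   diff -u httpd.conf.works httpd.conf.broken
  The -u option prints unified context around each change, which
  makes it easier to see where in the file the differences are.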
 For debugging during development, diff (especially
 the versions built into revision control systems
 like cvs) is invaluable. Seeing exactly what
 changed between two versions is a great help.
 For example, if foo-2.2 is acting weird, where
 foo-2.1 worked fine, it's not uncommon to `diff`
 the source code between the two versions to
 see if anything related to your problem changed. 
   - find
 For troubleshooting a system that seems to have
 suddenly stopped working, find has a few tricks
 up its sleeve. 
 When a system stops working suddenly, the first
 question to ask is "what changed?". 
  find / -mtime -1 
  That command will recursively list all the
  files under / that have changed in the last
  day.
  find /usr/lib -mmin -30
 Will list all the files in /usr/lib that
 changed in the last 30 minutes. 
 Similar options exist for ctime and atime.
 
  find /tmp -amin -30
 Will show all the files in /tmp that have
 been accessed in the last 30 minutes.
  The -atime/-amin options are useful when trying
  to determine if an app is actually reading
  the files it is supposed to. If you run the app,
  then run that command where the files are, and
  nothing has been accessed, something is wrong.
 If no "+" or "-" is given for the time value,
 find will match only exactly that time. This
 is handy in several cases. You can determine
 what files were modified/created at the
 same time. 
 A good example of this is cleaning up
 from a tar package that was unpacked into
 the wrong directory. Since all the files
 will have the same access time, you can
 use find and -exec to delete them all.  
 - executables
  `find` can also find files with particular
  permissions set.
  find / -perm -0777
 will find all world writeable files from
 / down. 
  find /tmp -user "alikins"
 will find all files in /tmp owned
 by "alikins"
 
 - used in combo with grep to find
   markers (errors, filename, etc)
 When troubleshooting, there are plenty of
 cases where you want to find all instances of
 a filename, or a hostname, etc. 
  To recursively grep a large number of files,
  you can use find and its -exec option
  find . -exec grep foo {} \;
 This will grep for "foo" on all files from
 the current working directory and down.
 Note that in many cases, you can also
 use `grep -r` to do this as well. 
   - ls/stat
        - finding [sym|hard] links
        - out of space
   - df
  Running out of disk space causes so
  many apps to fail in weird and bizarre
  ways that a quick `df -h` is a pretty
  good troubleshooting starting point.
  Use is easy: look for any volume that's
  100% full, or, in the case of apps that
  might be writing lots of data at once,
  reasonably close to being filled.
  It's pretty common to spend more time
  than anyone would like to admit debugging
  a problem, only to suddenly hear someone yell
  "Damnit! It's out of disk space!".
  A quick check avoids that problem.
 In addition to running out of space,
 it's possible to run out of file system
 inodes. A `df -h` will not show this,
 but a `df -i` will show the number of
        inodes available on each filesystem.
  Being out of inodes can cause even
         more obscure failures than being
  out of space, so it's something to
  keep in mind.
   - watch
        - used to see if process output changes
        - free, df, etc
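  As a sketch of the items above (the interval and command are just
  examples):
   watch -n 2 free
  re-runs `free` every two seconds; adding -d highlights whatever
  changed between updates.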
    - ipcs/ipcrm
       - anything that uses shm/ipc
         - oracle/apache/etc
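  A quick sketch of checking shared memory segments:
   ipcs -m
  lists the shared memory segments currently allocated, along with
  the owner and the number of attached processes. A segment left
  behind by a crashed process can be removed with `ipcrm` (depending
  on the version, `ipcrm shm <id>` or `ipcrm -m <id>`), using the id
  from the ipcs output.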
   - google
 - googling for error messages can
 be very handy
   - source code
  For Red Hat Linux, you have the source code,
  so it can often be useful to search through
  the code for error messages, filenames, or
  other markers related to the problem. In many
  cases, you don't really need to be able to
  understand the programming language to
  get some useful info.
   Kernel drivers are a great example of this, since they
  often include very detailed info about which hardware
  is supported, what's likely to break, etc.
    - strings
  `strings` is a utility that will search through a
  file and try to find text strings. For troubleshooting
 sometimes it is handy to be able to look for strings
 in an executable. 
 For an example, you can run strings on a binary to
 see if it has any hard coded paths to helper utilities.
 If those utils are in the wrong place, that app may
 fail.
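  For example (the binary name is a placeholder):
   strings /usr/bin/foo | grep "^/"
  prints anything in the binary that looks like an absolute path,
  which is a quick way to spot hard coded locations for helper
  programs or config files.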
  Searching for error messages can help as well,
  especially in cases where you are not sure what
  binary is reporting an error message.
  In some ways, it's a bit like grepping through
  source code for error messages, but a bit
  easier. Unfortunately, it also provides far
  less info.
    - syslog/log levels
 - what goes to syslog
 - how to get more stuff there
   - ksymoops
  - get somewhat meaningful info out of kernel
   traces
 - netdump?
   - xev
 - debugging keycode/mouseclick weirdness, etc
Logs
   - messages, dmesg, lastlog, etc
   - log filtering tools?
 
Using RPM to help troubleshoot
   - package verify
   - missing deps
Types Of Problems
   - Things are missing.
  This type of problem occurs in many forms. Shell scripts that
  expect an executable to exist that doesn't. Applications linked
  against a library that can not be found. Applications expecting
  a config file to be found that isn't.
  It can get even more subtle when file permissions are involved.
  An app can report a file as "not found" when in reality, it
  exists, but the permissions are wrong.
        - Missing Files
  Often an app will fail because of missing files, but will
  not be so helpful as to tell you which file is missing. Or it
  reports the error in a vague manner like "config file not found".
 For most of these cases where something is missing, but you
 are not sure _what_, strace is the best tool. 
 
  strace -eopen trouble_causing_app
  That commandline will list all the files that app is
  trying to open up, and whether it succeeded or not. The type
  of line to look for is something like:
 open("/path/to/some/file/", O_RDONLY) = -1 ENOENT (No such file or directory)
 That indicates the file wasn't found. In many cases, these errors
 are harmless. For example, some apps will try to open config files
 in the users home directory, in addition to system config files. 
 If the user config file doesn't exist, the app might just continue.
 - Missing Libs
 For missing libraries, the same approach will work. Another
 approach is to run `ldd` against the app, and see if any
 shared libraries show up as missing. See the `ldd` section
 for more details.
 - File Permissions
  For cases where it's the file permission that's causing
  the problem, you are looking for a line like:
 open("/path/to/file/you/cant/read", O_RDONLY) = -1 EACCES (Permission denied)
  Something about that file is not letting you read it. So the
  permissions need to be checked, or perhaps elevated privileges
  obtained (i.e., does the app need to be run as root?)
 
  - networking
 
  On modern systems, having networking problems is crippling
  at times. Troubleshooting what's causing them can be just
  as painful at times.
  Some common problems include firewall issues (both on
  the client and external), kernel/kernel module issues,
  routing problems, name resolution, etc.
  Name resolution issues deserve their own category, so
  see the name resolution section for more info.
    - firewall checks
 
 When seeing odd network behaviour, these days, local
 firewalls are a pretty good suspect. Client side 
 firewalls are getting more and more aggressive. 
 If you see issues using a network service, especially
 a non standard service, it's possible local firewall
 rules are causing it. 
   Insert info about seeing what firewall rules
  are up (a starting sketch follows below).
  Insert info about increasing log levels to see
  firewall rejections in system logs.
  Insert info about temporarily dropping firewalls
  to diagnose problems.
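  As a starting sketch for the first of those (not a full treatment):
   iptables -L -n -v
  lists the current rules without doing name resolution, along with
  per-rule packet and byte counters. Flushing the rules temporarily
  (`service iptables stop` on Red Hat systems, or `iptables -F`)
  is a blunt but quick way to check whether the firewall is the
  culprit; remember to restore the rules afterwards.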
    - Crappy Connections
  A common problem is a connection that is flaky
  or degraded. A few easy things to look for to see
  if an external connection might be having issues.
 - ping to a remote host
   `ping` is very simple, and very low level, so
   it's a good tool to get an idea if an interface
   or route is working correctly. 
  ping www.yahoo.com
   That will start pinging www.yahoo.com and
   reporting ping times. Stopping it with ctrl-c
   will show a report of any missed packets. 
   Generally healthy links will have 0 dropped
   packets, so anything higher than that is something
   to be worried about. 
 - traceroute
  traceroute www.yahoo.com
    Attempts to gather info about each node in
    the connection. Generally these map to
    physical routers, but in these days of
    VPN's, it's hard to tell.
    If a traceroute stalls at some point,
    it usually indicates a problem. Also
    look for high ping times, particularly
    any node that seems much slower than the
           others.
 - /sbin/ifconfig
   ifconfig does a lot. It can control
   and configure network interfaces of all
   types. See the man page.
    When trying to determine if there are networking
    issues, run `ifconfig` and look for the interface
    showing issues. If there is a high "error" count,
    there could be physical layer issues, or possibly
    overloaded routers, etc.
   That said, with modern networks, it's pretty
   rare to see interface errors, but it's still
   something to take a quick look at. 
     - Bandwidth Usage
  When the available network bandwidth runs dry,
  it can be difficult to find the culprits. There are
  a couple of subtle variations of this. One is a
  client machine that has some process using a lot
  of bandwidth. Another is a server application
  that has one or more clients using a lot of
  bandwidth.
 - /sbin/ifconfig
 ifconfig reports the number of packets
 sent/received on a network interface, so
 this can be a quick way to get an idea
 what interface is out of bandwidth.
 - sar
  As mentioned in the section on sar, `sar -n DEV`
  can be used to see info about the amount of
  packets each interface is sending at a
  given time.
 - trafshow
   I don't know anything
 about trafshow
 - ntop/intop
    haven't used it in ages
 - netstat
    `netstat` won't show bandwidth usage, but it
     is a quick way to see what applications have
     open network connections, which is often a
     good start to finding bandwidth hogs.
     See the netstat section for more info.
 - tcpdump/ethereal
   tcpdump and ethereal are both applications
   to monitor network traffic. tcpdump is pretty
   standard, but ethereal is more featureful.
   ethereal also has a nice graphical user
   interface which can be very handy when 
    attempting to digest the large amounts of
   data a network trace can deliver. 
   The basic approach is to fire up ethereal,
   start a capture, let whatever weird networking
    you're trying to diagnose happen, then stop
   capture. 
   Ethereal will display all the connections
   it traced during the capture. There are
   a couple ways to look for bandwidth hogs.
   The "Statitics" menu has a couple useful
   options. The "Protocol Hierarchy" shows
   what % of packets in the trace is from
          each type of protocol. In the case of
   a bandwith hog, at least what protocol
   is the culprit should be easy to spot
   here. 
   The "Conversations" screen is also helpful
   for looking for bandwidth hogs. Since you
   can sort the "conversations" by number of
   packets, the culprit is likely to hop to
   the top. This isn't always the case, as it
   could easily be many small connections killing
   the bandwidth, not one big heavy connection. 
    As far as tcpdump goes, the best way to spot
    bandwidth hogs is just to start it up. Since
    it pretty much dumps all traffic to the screen
    in a text format, just keep your eyes peeled for
           what seems to be coming up a lot.
 - using iptables
    iptables can log how much traffic is flowing through
    a given rule.
   
  something like:
  iptables -nLx 
  
   
    - routing issues
     - kernel module flakiness
    - dropped connections
    - tcpdump/ethereal
    - netcat
    - netstat
  - Programs Crashing
 You just finished the last page in your 1200 page
 novel about how aliens invaded Siberia in the
 19th century and made everyone depressed. *boom*
 the word processor disappears off the screen faster
 than it really should. It segfaulted. Your work
 is lost. 
 Crashing applications are annoying to say the
 least. But sometimes, there are ways to figure
 out why they crashed. And if you can figure out
 why, you might be able to avoid it next time. 
 - Crash Catchers
    Most GNOME/KDE apps now are linked against libs
    that include a crash catching utility. Basically,
    whenever the app gets a segfault, the handler for
    it invokes a helper process, which attaches to the app
    with a debugger, gets a stacktrace, and offers to upload
           it to a bug tracking system.
    Since these include the option to see the stack
    trace, it can be a handy way to get a stack trace.
    Once you have a stack trace, it should point
    you to where the app is crashing. Figuring out
    why it crashed varies greatly in complexity.
 
 - strace
    `strace` can also be handy for tracking down
    crashes. It doesn't provide as much detail
    as ltrace or gdb, but it is commonly available.
    The idea being to start the app under strace,
    wait for it to crash, and see what the last
    few things it did were. Some things to look for
    include recently opened files (maybe the app
    is trying to load a corrupted file), memory
    management calls (maybe something is causing
    it to use large amounts of ram), failed network
    connections (maybe the app has poor error handling).
 - ltrace
    `ltrace` is a bit more useful for debugging crashing
     apps, as it can give you an idea what function
     an app was in when it crashed. Not as useful
            as a real stack trace, but it's easier.
 - gdb
   When it comes to figuring out all the gory details
   of why an app crashed, nothing is better than
   `gdb`. 
   For basic useage, see the gdb section in the
   tools section of this document. 
    For more detailed usage, see the gdb documentation.
   need some examples here
  - debuginfo packages
     One caveat with using gdb on most apps is that
     they are stripped of debugging information. You can
     still get a stack trace, but it will not be as meaningful
     as one with the debug information available.
     In the past, this meant recompiling the application with
     debugging turned on, and "stripping" turned off, which
     can at times be a slow and painful process.
     In Red Hat Enterprise Linux 3 and later, you can install
     the "debuginfo" packages.
    See the gdb section in the tools section for more info
    on debug packages.
    - core files
     If an application has crashed, and left a core
     file (see the "Allowing Core Files" section under
            the "Environment settings" section for info on how
            to do this), you can use gdb to debug the core file.
    Invocation is easy:
   gdb /path/to/core/file
    After loading the core file, you can issue `bt`
    to get a backtrace. 
     See the gdb section above for information about
     "debuginfo" packages.
  - configs screwed up
      An incorrect, missing, or corrupt config file can wreak
      havoc. Well coded apps will usually give you some idea
      if a config file is bogus, but that's not always the case.
     
     - Finding the config files
  The first thing is figuring out if an app
  uses a config file and what it is. There's
  a couple of ways to do this.
 - finding config files with rpm
  
    If a package was installed from an rpm, it
    should have the config files flagged as such.
    To query a package to see what its config files
    are, issue the command:
 
  rpm -q --configfiles packagename
    While you are using rpm, you should also see if
     the config files have been modified from
    the defaults.
 
  rpm -V packagename
    That command will list all the files in that
    package that have been changed in some way.
    The rpm man page includes more details on what
    the output means, but the basics are:
    if there is an "S", the file's size has changed.
    if there is a "5", the file's contents have been modified.
    if there is an "M", the file's permissions have changed.
 
       - strace
 
  Using `strace -eopen process` is a good way to see what
  files a process is opening, including any config files. 
       - documentation
  
  If all else fails, try reading the docs. Often the
  man pages or docs describe where and what the config
  files are. 
      - Verifying the Config Files
  Once you know what the config files are, then
  you need to verify they are correct. This is
  highly application dependent.
 - diff'ing against known good files
   If you have a known good config file,
   diffing the old file and the new one 
   can often be useful. 
 - look for for .rpmnew or .rpmorig files.
    In some cases, rpm will install a new config
    file alongside the existing one. This happens
   on package upgrades where the default config
   file has changed between the two packages, and
   the version on disk is different from either 
   version. 
   The idea being, if the default config file is
   different, then it's possible the config file
   format changed. Which means the previous on
   disk config file may not work with the new
   version, so a .rpmnew version is installed
   alongside the existing one. 
    So if an app is suddenly behaving oddly,
    especially after a package update, see
    if there are any .rpmnew or .rpmorig files.
    If so, you may need to update the existing
    config file to use the new format.
 - stat/ls
 
    If an app is behaving oddly, and you believe
    it is because of a config file, you should
    check to see when that file was modified.
  stat /path/to/config/file
    The `stat` command will give you the last
    modified and last accessed times. If the
    file seems to have changed later than you
    think, it's possible something or someone
    has changed it more recently.
   See the information on the `find` utility
   for ways to look for all files modified
   at/since/before a certain time. 
   - gconf
   - The config file has changed but the app is ignoring it
 - is it the correct config file?
   Often an application will look for config
   files in several places. In some cases, 
   some versions of the config file have
   precedence over other versions. 
    A common example is for an app to have
    a default config file, a per system
    config file, and a per user config file,
    with the user and system ones overriding
    the default one. For some apps, individual
    config items have their own inheritance
    rules.
    So for example, if you're modifying a system
    wide config file, make sure there isn't
    a per user config file that overrides the
    change.
 
 - is it a daemon?
    daemon and server processes typically
    only read their config file when they
    start up.
   Sometimes a restart of the process is
   required. Sometimes it is possible to
   send a "HUP" signal to an app to force
   it to reload configs. To send a "HUP"
   signal:
  kill -HUP $pid
 
   Where $pid is the process id of the 
   running process. 
   Sometimes init scripts will have
          options for reloading config files.
   Apache httpd's init script has
   a reload option.
  service httpd reload
 - shell config?
    Some processes, user shells in particular,
    have fairly complicated rules about when
    their config files are read.
   See the "INVOCATION" section of the
          bash man page for an example of when
   the various bash config files get loaded.
  - kernel issues
   - single user
   - init=/bin/bash
   - bootloader configs
   - log levels
 
  - stuff not writing to disk
     - out of space
  You run a command to write a file, or save a file from an
  app. When you go to look at the file, it's not there, or
  it's empty. Or the app complains that it is "unable to write to device".
  What's going on?
  More than likely, the system does not have any storage space
  for the file. The file system that the app is trying to write
  to is full, or nearly full.
 This case can cause a wide variety of problems. The
 easiest way to check to see if this is the case is
 the `df` command. See the df section in the tools
 section for more info on df. 
  One thing to keep in mind is to check that the correct
  filesystem has space. Just because something in
  `df` shows free space, doesn't mean the app
  can use it.
     - out of inodes
 `df -i` can catch this one as well. It's fairly
 uncommon these days, but it can still happen. 
     - file permissions
  Check the file permissions on the file, and on the
  directory the app is trying to write to.
 You can use strace to see where it's writing to
 if nothing else tells you. 
     - ACL's
 
 If the system is using ACL's, you need to 
 verify the user/app is in the proper ACL's.
     - selinux
  selinux can control what apps can write where
  and how. So verify the selinux perms are
  correct.
  need more info on tracking down
 selinux issues 
     - quotas
 If the system has file system quotas enabled,
 it's possible the user is over quota. 
  `quota`
 
  That command will show the current quota
  usage, and indicate if quotas are in
  effect or not.
     - read-only mounts
  Network file systems in particular tend
  to mount shared partitions read-only. The
  mount option overrides any file permissions
  on the file system that is being shared.
     - read only media
  cd-roms are read-only media. The app isn't
  trying to write to one, is it?
     - chattr/lsattr
 One feature of ext2/3 is the ability to
 `chattr` files. There are per file attributes
 beyond standard unix permissions. 
 See the chattr/lsattr section of the tools
 section for more details. 
 If a file has had the command `chattr +i` run
 on it, then the file is "immutable" and nothing
 can modify it or delete it, including the root
 user. The only way to change it is to run `chattr -i`
 on it. Then it can be treated as a normal file. 
  - files doing weird stuff
 The app is reading the right file. The file _looks_
 correct, but it is still behaving weirdly. A few
 things to look for.
 - hidden chars
 
  Sometimes a file can get hidden characters in
   it that give parsers headaches. This is increasingly
   common as support for more character encodings
   becomes common.
  - dos style carriage returns
  - embedded tabs
  - high byte chars
  One good approach is to open the file with
  vi in bin mode:
  
  vi -b filename
   Then put vi into 'list' mode. Do this
   by hitting escape and entering ":set list".
   This should show any non ascii chars, new
   lines, tabs, etc.
  the `od` utility can be useful for viewing
  files "in the raw" as well.  some
  useful od invocations
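   One useful invocation (the filename is a placeholder):
    od -c suspect.conf | head
   prints each byte as a character or an escape (\r, \t, \0, etc),
   which makes carriage returns, tabs, and other non printing
   characters easy to spot.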
    - ending new line
  Some apps are picky about having any
  extra new lines at the end of files. So it's
  something to look for.
    - trailing spaces
 
  A particularly hard to spot circumstance
  that can break some parsers: trailing
  spaces after a string. This can be
  particularly difficult to spot in
  cases where it's a space, then a
  new line.
 
  This seems to be particularly common
  for config options for usernames and
  passwords: "foobar" != "foobar "
  - env stuff
      The users "enviroment" can often cause problems
      for applications. Some typical cases and
      how to detect them. X DISPLAY settings,
      the PATH, HOME, TERM settings etc can
      cause issues.
    - things work as user/not root, vice versa
  There can be any number of reasons something
  works as root, but not as a user. Most
  of them are related to permissions (on
  files or devices).
  Another common cause is PATH. On Red Hat
  at least, users do not have /sbin:/usr/sbin
  in their PATH by default. So some scripts
  or commands can fail because they are not
  in the PATH. Even having the PATH order
  be different between root/user can
  cause problems.
  X forwarding crap
    - env
  The easiest way to see environment
  variables is just to run:
  env
    - what basic env stuff means
    - su/sudo issues
    - env -i to launch with clean env
  If an app seems to be having issues
  that are environment dependent, one
  thing that can be useful when
  troubleshooting is to launch it with `env -i`.
  Something like:
   env -i /bin/someapp
  `env -i` basically strips all environment
  variables, so the app can launch with
  nothing set.
    - su -, etc 
  If `su` is being used to gain root,
  one thing to keep in mind is the difference
  between `su` and `su -`. The '-' tells su
  to start up a new login shell. In practice,
  this means `su -` essentially gets a shell
  with the same environment a normal root
  shell would have.
  A shell created with `su` still has the
  user's SHELL, PATH, USERNAME, HOME, and
  other variables from the user's shell.
  A shell created with `su -` has the
  same variables a root shell has.
  This can often cause weird behavior
  in apps that depend on the environment.
 
    - sudo -l
  - shell scripting
 Scripting in bash or sh is often
 the quickest and easiest way to solve
 a problem. It's also heavily used in
 the configuration and startup of Red Hat
 Linux systems. 
 Unfortunately, debugging shell scripts
 can be quite painful. Debugging shell
 scripts someone else wrote a decade
 ago is even worse.
 
   - echo
 The bash builtin "echo" is often the
 best debugging tool. A common trick
 is just to add "echo" to the begining
 of a line of code that you believe is
 doing something incorrect. 
 This will just print out the line, but
 after variable expansion. Particularly
 handy if the line in question is using
 lots of shell variables.
   - sh -x 
 
 Bash includes some support for getting
 more verbose information out of
 scripts as they run. Invoking
 a shell script as follows:
  sh -x /path/to/someshell.sh
  That command will print out every
  line as it
  executes it.
   - trap 
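  One sketch of using trap for debugging: bash runs the DEBUG trap
  before each command, which gives a poor man's line by line trace:
   trap 'echo "at line $LINENO"' DEBUG
  Put that near the top of the script and the line number is echoed
  before each command runs; `trap - DEBUG` turns it off again.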
   - bash debugger
 There is a bash debugger available
 at http://bashdb.sourceforge.net/.
 It's essentially a "gdb" style debugger
 but for bash. Including support for
 step debugging, breakpoints, and 
 watchpoints. 
  - DNS/name resolution
 
  Once a network is up and going, name resolution
  can continue to be a source of problems. Since
  so many applications expect reliable name resolution,
  all sorts of things break when it fails.
 
      - usage of dig
 `dig` is probably the most useful tool for
 tracking down DNS issues.
 
 insert useful dig examples
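  A few starting points (the hostnames and address are just
  examples):
   dig www.redhat.com
   dig @ns1.example.com www.redhat.com
   dig -x 192.168.1.1
  The first does a normal lookup using the resolvers in
  /etc/resolv.conf, the second asks a specific nameserver directly,
  and the third does a reverse lookup on an address. The "status:"
  line and the ANSWER section are usually the interesting parts of
  the output.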
    
     - /etc/hosts
 
 Check /etc/hosts for spurious entries. It's
 not uncommon for "temporary" /etc/hosts entries
 to become permanent, and when the host ip does
 change, things break. 
     - nscd
 nscd is a name service caching daemon. It's
 very useful when using name services info
 like hesiod and ldap. But it can also
 cache DNS as well.
 Most of the time, it just works. But
 it's been known to break in odd and
 mysterious ways before. So trying
 DNS with and without nscd running
 is a good idea. 
 
     - /etc/nsswitch.conf
     - splat names/host typos
 "*" matching on DNS servers is pretty
 common these days. It normally doesn't
 cause any problems, as much as it can
 make certain types of errors harder
 to track down.
  A typo in a hostname will get redirected
  to another server (typically a web server)
  instead of giving a name resolution error.
  Since the obvious "host not found" errors
  don't happen, tracking down these kinds
  of problems can be compounded if
  "wildcard" DNS is in use.
  - auth info
   - getent
   - ypwhich/match/cat
   
  - certificate/crypto issues
   - ssl CA certs
   - gpg keys/signatures
   - rpm gpg keys
   - ssltool
   - curl
  - Network File Systems
    - NFS causes weird issues
      - timestamps
      - perms/rootsquash/etc
      - weird inode caching
    - samba
 - it touches windows stuff, icky
   
  - Some app/apps is rapidly forking shortlived process
    - gah, what a PITA to troubleshoot
    - psacct?
    - sar -X?
    - watching pids grow?
    - dump-acct + parsing?
App specific
   - apache
     - scorecard stuff
     - module debugging
     - log files
     - init file "configtest"
     - -X debug mode
  - php
 - 
  - gtk apps
    - event debuging stuff?
  - X apps
    - nosync stuff
    - X log
  - ssh
    - debug flags
    - sshd -d -d 
  - pam/auth/nss
    - logging options?
    - getent
  - sendmail  
Credits
   Comments, suggestions, hints, ideas, criticisms, pointers,
   and other useful info from various folks including:
  Mihai Ibanescu
  Chip Turner
  Chris MacLeod
  Todd Warner
  Nicholas Hansen
  Sven Riedel
  Jacob Frelinger
  James Clark
  Brian Naylor
  Drew Puch
 
 