LoadRunner and the rstat daemon

LoadRunner (and other monitoring tools) needs the rstat daemon to extract data from the kernel. Installing this daemon is a bit tricky, as it is part of a larger package. Here’s how you do it:

# up2date -i rusers-server

To start the daemon: # service rstatd start

NOTE: Ensure the portmap service is running first.
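
To confirm that both portmap and rstatd registered correctly, one option is to filter the output of "rpcinfo -p". The helper below is a minimal sketch: the function name rpc_registered is made up for illustration, and it assumes the usual five-column rpcinfo layout with the service name in the last column.

```shell
# rpc_registered NAME
# Reads `rpcinfo -p` output on stdin and succeeds if NAME appears
# in the service column (the fifth column).
rpc_registered() {
    awk -v svc="$1" '$5 == svc { found = 1 } END { exit !found }'
}

# Typical use on a live host (left commented out so nothing runs here):
#   rpcinfo -p | rpc_registered rstatd && echo "rstatd is registered"
```

If the check fails even though the service started, portmap is the first thing to look at.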

Linux troubleshooting steps

1) Are the CPUs overtaxed?
2) Are one or more processes using most of the CPU power?
3) Is swap space usage increasing?
4) Is the system spending time waiting for I/O?
5) Is the system writing or reading from disk too often?
6) Is the system saturating the network card?
7) Is a process using too much RAM? (Is its Resident Set Size increasing over time?)
8) What type of memory is a process using? (cat /proc/<PID>/status)
9) Is the process stack size increasing? (VmStk)
10) Is shared memory use increasing?
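
The questions about Resident Set Size and stack growth can be answered by sampling the status file of a process over time. This is a rough sketch, not a finished monitor: proc_field is a hypothetical helper name, and it assumes the "VmRSS:" and "VmStk:" field layout found in /proc status files.

```shell
# proc_field FIELD STATUS_FILE
# Prints the value (in kB) of a memory field such as VmRSS or VmStk
# from a /proc/<PID>/status-style file.
proc_field() {
    awk -v key="$1:" '$1 == key { print $2 }' "$2"
}

# Typical use, sampling once a minute (PID is a placeholder you must set):
#   while sleep 60; do
#       echo "$(date) rss=$(proc_field VmRSS /proc/$PID/status)" \
#            "stk=$(proc_field VmStk /proc/$PID/status)"
#   done
```

A value that climbs steadily across samples is the signal to investigate further.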

Find the size of a drive

$ parted /dev/cciss/c0d0 print
Disk geometry for /dev/cciss/c0d0: 0.000-69459.609 megabytes
Disk label type: msdos
Minor      Start        End  Type     Filesystem  Flags
1          0.016    199.218  primary  ext3        boot
2        199.219  69459.609  primary              lvm
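
If you only need the total capacity, the "Disk geometry" line can be converted to gigabytes. The sketch below assumes the older parted output format shown above and treats one gigabyte as 1024 megabytes; disk_size_gb is an illustrative name, not part of parted.

```shell
# disk_size_gb
# Reads `parted ... print` output on stdin, finds the "Disk geometry" line,
# and prints the end of the range converted from megabytes to gigabytes.
disk_size_gb() {
    awk '/^Disk geometry/ {
        split($(NF - 1), range, "-")      # e.g. "0.000-69459.609"
        printf "%.1f\n", range[2] / 1024
    }'
}

# Typical use (as root):
#   parted /dev/cciss/c0d0 print | disk_size_gb
```

For the drive above this works out to roughly 67.8 GB.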

Sar for more than 7 days

By default, the sar (system activity report) utility keeps only 7 days' worth of records, which is usually not enough data for meaningful trend analysis. You can change how long sar keeps its logs by editing the /etc/sysconfig/sysstat file from this:

# How long to keep log files (days), maximum is a month
HISTORY=7

to this:

# How long to keep log files (days), maximum is a month
HISTORY=30

This will give you 30 days of log files to chew on. You could also set up a cron job that runs a script to copy the previous day's data file to another directory, renaming each file to use the full date (YYYYMMDD) instead of just the day (DD). This will give you even more than 30 days' worth of data to review.
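
The cron approach could look something like the following. It is only a sketch under a few assumptions: the data files are named saDD (as sysstat names them under /var/log/sa), GNU date is available for the -d option, and the function name, archive directory, and script path are all made up for illustration.

```shell
# archive_sa SRC_DIR DEST_DIR [WHEN]
# Copies the sar data file for WHEN (default: yesterday) from SRC_DIR to
# DEST_DIR, renaming saDD to saYYYYMMDD. Requires GNU date for -d.
archive_sa() {
    src_dir=$1
    dest_dir=$2
    when=${3:-yesterday}
    day=$(date -d "$when" +%d) || return 1
    stamp=$(date -d "$when" +%Y%m%d) || return 1
    mkdir -p "$dest_dir" &&
        cp -p "$src_dir/sa$day" "$dest_dir/sa$stamp"
}

# Example crontab entry (hypothetical script path):
#   5 0 * * * /usr/local/bin/archive_sa.sh
# where the script defines the function above and ends with:
#   archive_sa /var/log/sa /var/log/sa-archive
```

Running the job shortly after midnight means the previous day's file is complete before it is copied.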

Noatime explained

Linux records information about the last time a file was read (atime), the last time its contents were changed (mtime), and the last time its inode metadata, such as permissions or ownership, was changed (ctime). By default, Linux updates the last-time-read attribute of any file during a read operation. Let's watch this in operation by running the stat command against the /etc/fstab file:

# stat /etc/fstab
File: `/etc/fstab'
Size: 787 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 886347 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2006-11-20 11:29:00.000000000 -0500
Modify: 2006-11-20 11:29:00.000000000 -0500
Change: 2006-11-20 11:29:00.000000000 -0500

Note the date and time of the Access field. If you immediately issue the same command again, you will note that the Access time field is not updated, because stat reads only the file's inode and not its contents:

# stat /etc/fstab
File: `/etc/fstab'
Size: 787 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 886347 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2006-11-20 11:29:00.000000000 -0500
Modify: 2006-11-20 11:29:00.000000000 -0500
Change: 2006-11-20 11:29:00.000000000 -0500

Now “cat” the /etc/fstab file to read its contents:

# cat /etc/fstab
/dev/VolGroup00/LogVol00 / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs defaults 0 0
none /proc proc defaults 0 0
none /sys sysfs defaults 0 0
/dev/VolGroup00/LogVol01 swap swap defaults 0 0

Now that we have read the file’s contents, stat the file once again:

# stat /etc/fstab
File: `/etc/fstab'
Size: 787 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 886347 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2006-11-20 11:29:37.000000000 -0500
Modify: 2006-11-20 11:29:00.000000000 -0500
Change: 2006-11-20 11:29:00.000000000 -0500

See a difference? The timestamp of the Access field has changed! This update happens every time the file is read. In other words, each time a file is read, Linux makes an “expensive” write to the filesystem to update the Access time.

As you might imagine, there is generally no reason to care when a file was last read, either by a user or a system process, yet every one of these updates carries the high I/O cost of a disk write.

Fortunately, Linux lets you mount a partition with the “noatime” attribute. Noatime prevents the kernel from updating the last-time-read (atime) attribute of a file during a read operation. This boosts the performance of the filesystem, as fewer disk writes need to be done.

To set the noatime attribute to ensure no writes are generated from read accesses, edit the /etc/fstab file, and change the “defaults” directive to “rw,noatime” for a given partition. Here is what it looks like for the root partition:

/dev/VolGroup00/LogVol00 / ext3 rw,noatime 1 1

(Note: in a long directory listing, the Modify (mtime) timestamp is displayed)
# ll /etc/fstab
-rw-r--r-- 1 root root 1.1K Nov 20 10:49 /etc/fstab
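
Once a partition is mounted with noatime, the option shows up in /proc/mounts, which gives a quick way to verify the change. The function below is a minimal sketch with an invented name (has_noatime); it assumes the standard /proc/mounts layout, where the second field is the mount point and the fourth is the comma-separated option list.

```shell
# has_noatime MOUNT_POINT [MOUNTS_FILE]
# Succeeds if MOUNT_POINT is listed in MOUNTS_FILE (default /proc/mounts)
# with noatime among its comma-separated options.
has_noatime() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "noatime") found = 1
        }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# To apply the option without a reboot and then verify it (as root):
#   mount -o remount,noatime /
#   has_noatime / && echo "noatime is active on /"
```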

Log the kernel

The /var/log/dmesg file is very handy for finding out what the kernel is doing. The problem? It gets recreated every time you reboot a Linux server. There is no persistence to this data, which makes troubleshooting harder. Furthermore, the dmesg log does not date and time-stamp the events it records, so figuring out “when” an event occurred is difficult. You can mitigate both of these problems by adding a single line to the /etc/syslog.conf file:

kern.* /var/log/kernel.log

Then restart the syslog service to pick up the change (“service syslog restart”). This logs kernel messages of every priority, from debug upward, to the /var/log/kernel.log file. When a problem occurs, you can now see what happened and when, like this:

Nov 13 16:30:07 usatl01lw207 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
Nov 13 18:07:31 usatl01lw207 kernel: warning: many lost ticks.
Nov 13 18:07:31 usatl01lw207 kernel: Your time source seems to be instable or some driver is hogging interupts
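
With timestamps in place, narrowing the log to a particular window becomes a simple filter. The helper below is a sketch that assumes the syslog line layout shown above (month, day, time, host, then "kernel:"); the name kernel_events is hypothetical.

```shell
# kernel_events DATE_PREFIX [LOG_FILE]
# Prints kernel messages whose timestamp starts with DATE_PREFIX
# (for example "Nov 13" or "Nov 13 18:"), with the host name removed.
kernel_events() {
    awk -v prefix="$1" '
        index($0, prefix) == 1 && $5 == "kernel:" {
            line = $1 " " $2 " " $3
            for (i = 6; i <= NF; i++) line = line " " $i
            print line
        }
    ' "${2:-/var/log/kernel.log}"
}

# Typical use:
#   kernel_events "Nov 13 18:"
```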

What libraries is process XYZ using?

Ever wonder what libraries a process is using? The pmap command reports the memory map of any process. Just issue “pmap <PID>” to find out. Here we examine the libraries being used by the Netbackup process:

# pmap 10345
10345: bpbkar -r 2678400 -ru root -dt 86459 -to 0 -sched Daily_Incr -st INCR -bpstart_to 1800 -bpend_to 1800 -read_to 7200 -use_otm -kl 5 -use_ofb
00b7c000 84K r-x-- /lib/ld-2.3.4.so
00b91000 4K r-x-- /lib/ld-2.3.4.so
00b92000 4K rwx-- /lib/ld-2.3.4.so
08048000 364K r-x-- /usr/openv/netbackup/bin/bpbkar
080a3000 28K rwx-- /usr/openv/netbackup/bin/bpbkar
080aa000 1368K rwx-- [ anon ]
b7a44000 2080K rwx-- [ anon ]
b7c4c000 36K r-x-- /lib/libnss_files-2.3.4.so
b7c55000 8K rwx-- /lib/libnss_files-2.3.4.so
b7c59000 4K r-x-- /usr/lib/gconv/ISO8859-1.so
b7c5a000 8K rwx-- /usr/lib/gconv/ISO8859-1.so
b7c5c000 24K r-xs- /usr/lib/gconv/gconv-modules.cache
b7c63000 4K r-x-- /usr/lib/locale/locale-archive
b7c64000 24K r-x-- /usr/lib/locale/locale-archive
b7c6a000 180K r-x-- /usr/lib/locale/locale-archive
b7c97000 2048K r-x-- /usr/lib/locale/locale-archive
b7e97000 4K rwx-- [ anon ]
b7e98000 1172K r-x-- /lib/tls/libc-2.3.4.so
b7fbd000 4K r-x-- /lib/tls/libc-2.3.4.so
b7fbe000 12K rwx-- /lib/tls/libc-2.3.4.so
b7fc1000 8K rwx-- [ anon ]
b7fc3000 8K r-x-- /lib/libdl-2.3.4.so
b7fc5000 8K rwx-- /lib/libdl-2.3.4.so
b7fc7000 60K r-x-- /lib/libresolv-2.3.4.so
b7fd6000 8K rwx-- /lib/libresolv-2.3.4.so
b7fd8000 8K rwx-- [ anon ]
b7fda000 72K r-x-- /lib/libnsl-2.3.4.so
b7fec000 8K rwx-- /lib/libnsl-2.3.4.so
b7fee000 12K rwx-- [ anon ]
b7ffd000 4K r-x-- /lib/libcwait.so
b7ffe000 4K rwx-- /lib/libcwait.so
bffbc000 272K rwx-- [ stack ]
ffffe000 4K ----- [ anon ]
total 7936K
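
Because each library is mapped in several segments, the raw listing repeats the same paths. A small filter can reduce it to one line per shared object. This is a sketch that assumes the mapping name is the last column, as in the output above; mapped_libs is an illustrative name.

```shell
# mapped_libs
# Reads pmap output on stdin and prints each mapped shared object once.
mapped_libs() {
    awk '$NF ~ /\.so($|\.)/ { print $NF }' | sort -u
}

# Typical use (PID is a placeholder you must set):
#   pmap "$PID" | mapped_libs
```

The pattern matches names that end in .so or carry a version suffix after it, so data files such as locale-archive are left out.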

Fix a reboot hang

Some IBM servers (IBM eServer xSeries 346, *cough*) do not reboot properly when issuing the “shutdown -r now” command; the server hangs after processing the last of the rc scripts. To mitigate this problem, append “reboot=b” to the kernel line in /boot/grub/grub.conf. This parameter tells the kernel to use the BIOS reboot function for the reset. This should allow the server to reboot successfully.
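
Editing grub.conf by hand is fine for one server; for several, the change could be scripted along these lines. This is a cautious sketch: with_reboot_param is a made-up name, it assumes kernel lines begin with the word "kernel" (optionally indented), and it only prints the result rather than touching the real file.

```shell
# with_reboot_param GRUB_CONF
# Prints GRUB_CONF to stdout with " reboot=b" appended to every kernel
# line that does not already carry it. The input file is not modified.
with_reboot_param() {
    awk '/^[ \t]*kernel[ \t]/ && !/reboot=b/ { $0 = $0 " reboot=b" }
         { print }' "$1"
}

# Review the result before replacing the real file (as root):
#   with_reboot_param /boot/grub/grub.conf > /tmp/grub.conf.new
#   diff /boot/grub/grub.conf /tmp/grub.conf.new
```

Writing to a temporary file first leaves a chance to inspect the diff, since a broken grub.conf can keep the server from booting.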

NMI (Non-Maskable Interrupt)

A server is constantly servicing interrupts, giving attention to whatever device needs it. Most of these are so-called maskable interrupts: the CPU can mask, or temporarily ignore, any of them if it needs to finish something else first. There are also non-maskable interrupts (NMI), which are used for serious conditions that demand the processor’s immediate attention.

When an NMI signal is received, the processor immediately drops whatever it was doing and attends to it. The NMI cannot be ignored by the server. The NMI signal is normally used for critical problem situations, such as serious hardware errors. The most common use of NMI is to signal a parity error from the memory subsystem.
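
On x86 Linux systems, /proc/interrupts typically includes an NMI row with one counter per CPU, so you can check whether any have been raised. The helper below is a sketch under that assumption; nmi_total is a hypothetical name.

```shell
# nmi_total [INTERRUPTS_FILE]
# Sums the per-CPU counters on the NMI row of /proc/interrupts-style input.
nmi_total() {
    awk '$1 == "NMI:" {
        total = 0
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        print total
    }' "${1:-/proc/interrupts}"
}

# Typical use:
#   nmi_total
```

Comparing the total before and after a suspected hardware event shows whether NMIs were delivered in between.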

HP management log from the command line

If you have installed the HP PSP (ProLiant Support Pack) on your Linux server, you can access most information via the web interface at https://localhost:2381. However, if you are at the command line, you can also access a wealth of server information. Thermal information, data on fan status and power, PCI routing tables, and even the management logs are available from the CLI. You can also enable or disable ASR (Automatic Server Recovery). Here’s how:

To see the integrated management log:
$ hplog -v

ID Severity Initial Time Update Time Count
0000 Repaired 14:23 06/06/2005 12:45 06/15/2005 0002
LOG: Network Adapter Link Down (Slot 3, Port 1)

0001 Caution 10:52 06/22/2005 10:52 06/22/2005 0001
LOG: System Power Supply: General Failure (Power Supply 1)

To check on the status of your power supplies:
$ hplog -p
1 Standard Pwr. Supply Bay Nominal Yes
2 Standard Pwr. Supply Bay Nominal Yes

To mark a particular item as repaired:
$ hplog -s REPAIRED -l 0001

To see all support options, type “hplog -help”
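
On a server with a long management log, it can help to list only the entries that still need attention. The filter below is a sketch that assumes the "hplog -v" layout shown earlier (a four-digit ID followed by the severity); open_log_entries is an invented name, not an hplog feature.

```shell
# open_log_entries
# Reads `hplog -v` output on stdin and prints the ID and severity of every
# entry whose severity is not "Repaired".
open_log_entries() {
    awk '$1 ~ /^[0-9][0-9][0-9][0-9]$/ && $2 != "Repaired" { print $1, $2 }'
}

# Typical use:
#   hplog -v | open_log_entries
```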