At Applied Trust we run Nagios, the excellent open source monitoring system, to ensure the availability of our internal system infrastructure and that of our clients. The system monitors roughly 2,740 services on 468 hosts. It has performed flawlessly for the past six or seven years, but over the last few months we began experiencing performance problems (see the 12 month CPU usage graph below). In particular, we were being alerted regularly about excessive load on the monitoring system and, as a consequence of high load, false positive timeouts when connecting to services.

12 month CPU usage
We started debugging the problem by identifying which resource in particular was causing load problems. Using the iostat command, we noticed that both I/O and wait User CPU time seemed consistently high:
$ iostat 2 30
Linux 2.4.21-4.ELsmp ([snip]) 03/03/2009
avg-cpu: %user %nice %system %iowait %idle
30.77 0.00 5.29 12.24 51.71High I/O wait can be caused by a variety of problems. Several layers of complexity are eliminated in our case since the monitoring system uses local storage (e.g. not attached to a SAN/NAS/some other remote storage system). However, the storage is configured in RAID1 with Linux software RAID, and dmesg revealed a problem with the mirror:
md: Autodetecting RAID arrays. [events: 0000002a] [events: 00000033] md: autorun ... md: considering sdb3 ... md: adding sdb3 ... md: adding sda3 ... md: created md0 md: bind<sda3,1> md: bind<sdb3,2> md: running: <sdb3><sda3> md: sdb3's event counter: 00000033 md: sda3's event counter: 0000002a md: superblock update time inconsistency -- using the most recent one md: freshest: sdb3 md: kicking non-fresh sda3 from array! md: unbind<sda3,1> md: export_rdev(sda3) md: RAID level 1 does not need chunksize! Continuing anyway. md0: max total readahead window set to 124k md0: 1 data-disks, max readahead per data-disk: 124k raid1: device sdb3 operational as mirror 1 raid1: md0, not all disks are operational -- trying to recover array raid1: raid set md0 active with 1 out of 2 mirrors md0: no spare disk to reconstruct array! -- continuing in degraded mode md: ... autorun DONE.
A consistency problem on the sda3 partition degraded the mirror. To determine if the drive was bad, we used smartctl tool:
$ sudo /usr/sbin/smartctl -Hc /dev/sda3 smartctl version 5.1-11 Copyright (C) 2002-3 Bruce Allen Home page is http://smartmontools.sourceforge.net/ [^] SMART Sense: Ok! Current Drive Temperature: 31 C Drive Trip Temperature: 68 C
The drive seems fine, so we added the partition back to the array with mdadm /dev/md0 --add /dev/sda3 and monitored the rebuild process with watch 'cat /proc/mdstat'. Unfortunately, after the array was rebuilt and mdadm -D /dev/md0 showed both drives as active, the load problems persisted. This was unlikely to fix the problem anyway since a RAID mirror has very little overall impact on I/O performance. Since the storage subsystem seemed sane, we took a step back and looked at the application layer. Using lsof, it appeared that the serviceperf.log file, which holds performance metric data gathered along with service monitoring commands, was showing constant usage:
$ sudo /usr/sbin/lsof /usr/local/nagios COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME syslog-ng 6023 root 5w REG 58,3 8654863 81927 /usr/local/nagios/syslog/10.34.34.59.log nagios 6404 nagios txt REG 58,3 521120 32782 /usr/local/nagios/bin/nagios nagios 6404 nagios txt REG 58,3 8654863 81927 /usr/local/nagios/syslog/10.34.34.59.log nagios 6404 nagios txt REG 58,3 513157004 22182 /usr/local/nagios/var/serviceperf.log perl 6663 root cwd DIR 58,3 4096 32769 /usr/local/nagios/bin tail 6664 root cwd DIR 58,3 4096 32769 /usr/local/nagios/bin bash 11802 chris cwd DIR 58,3 4096 2 /usr/local/nagios atreporte 16000 apache cwd DIR 58,3 4096 180289 /usr/local/nagios/reporter atreporte 16000 apache 3r REG 58,3 15152 180230 /usr/local/nagios/reporter/cgi-lib.pl
This make sense since most of our service checks collect performance data for use with rrdtool for graphing and capacity planning purposes. The nagios.cmd file also appeared regularly in the lsof list (though not depicted above), but since it's a named pipe it does not have an impact on I/O wait time. To determine if the serviceperf.log was the true perpetrator, we removed performance processing from the Nagios configuration for a few hours. Load dropped dramatically, so we needed to reduce the time it took to read and write to the serviceperf.log file. Our first idea was to add the noatime and nodiratime mount option to the filesystem that the serviceperf.log file was on (in our case, /usr/local/nagios) to eliminate inode access time updates, which can have a small impact on filesystem performance. This is as simple as executing sudo mount -o remount,noatime,nodiratime. After remounting and monitoring for a few hours, the load did not drop significantly. The next tactic we tried was targeted at the other problem we identified: high user CPU time. Our graphs of CPU over the past year (see above) on the monitoring system indicated that user CPU was growing slowly but steadily. The blue line representing User CPU grew from 15% to 30% over a year. Anything running in user space, such as Nagios service checks, will all be classified as user CPU. At some point in the past year we may have added enough service monitoring that it finally impacted user CPU. Since many of our monitored services rely on a custom Perl script that would certainly execute in User mode, we tried migrating to the Nagios embedded Perl interpreter, which adds a number of advantages. In the process of migrating, we ran in to a goofy problem compiling Nagios with embedded Perl that caused debug messages to be printed. A number of other users reported that adding #undef DEBUG to the file include/epn_nagios.h fixed the problem. We compiled with ./configure --prefix=/usr/local/nagios --enable-embedded-perl --with-perlcache --with-command-group=nagios and waited. Although all of our custom Perl executed succesfully, the embedded interpreter did not have an impact. Back to the drawing board. We decided on two tactics for next steps: reduce the frequency between service checks (fewer over all checks would reduce the number of processes consuming user CPU time) and move the serviceperf.log file to a tmpfs filesystem. For most of our service checks we had previously used intervals between 2 and 5 minutes. We moved to a 10-15 minute interval for most checks. The serviceperf.log data is not useful once it is read and committed to an RRD file, so moving it to an in-memory file system is not risky. We installed memory in the server, created and mounted the tmpfs, and move the serviceperf.log to the new filesystem. These two steps finally solved our load problems. Compare the one week graph below to the 12 month graph above. It's not hard to see when the changes were put in place.

1 week CPU usage
Hopefully these troubleshooting steps are useful for someone else with a busy Nagios server.

Comments
I found something new for me! Thanks!
I think moving the service check timings to an interval of 10-15mins would have had an impact in improved performance. 10-15 mins can be ideal for a few devices but not something that I would recommend in a more dynamic environment. For example, for the WAN infras that we configure nagios 5mins checks themselves are too big an interval as a network/interface flap for a couple of mins can cause serious issues as a lot of HA systems would kickin. So, sooner the engineers know is better. And sometimes you may miss a whole flap completely in these 10-15mins interval
CPU 30-50% isn't an impact again.
@Kumaran - Yes, I absolutely agree. Some service checks need to be more granular (2-5 minutes) and others can be monitored on the 10-15 minute scale. It really depends on the service's purpose and the amount of downtime that can be tolerated. Ultimately, I'd say the interval should be up to the service owner.
Great article Ben! I like the way you draw the picture then present solutions.
We noticed an increase with our load averages over a period of time. My enterprise monitors over 1700 hosts and 3100 services and is growing. We developed the same type of solutions you propose in your article.
Performing a "lsof" we were able to identify directories being hit the hardest, so we placed those directories on SAN disks, which reduced the load dramatically. Directories included /usr/lib/mysql and /var/opt/nagios.
To help reduce the load even further we developed time periods for monitoring certain aspects of the enterprise. We only monitor development and test servers during business hours. We also only monitor services that impact production during the night time hours. We found that receiving a notification at 2 AM about a high CPU was not the answer, so we track these problems then work on solutions the following morning.
You may also want to turn-up monitoing on systems that the end users are reporting problems. You can create custom templates and even call them problem_child.cfg. You would add hosts to this template only when a need occurs to step up the monitoring.
Thanks Ben for the article and I found it quite insightful. Keep up the fine work.