Article mar2007.tar

RAM Nagios vs. HD Nagios: Performance Evaluation

Jonathan Krein

In the January issue of Sys Admin, I discussed the network monitoring potential of Nagios and Fruity and described some of the limits of Nagios and how its performance varies depending on platform. In this follow-up article, I'll take a closer look at the performance of Nagios.

When we first set up Nagios, it ran from the hard drive (HD). For performance reasons (lower latency and the capacity to handle more service checks), it was later moved onto a RAM partition sized at half of the available system RAM. The Nagios startup script was modified to copy the HD version into RAM and start the Nagios process from there, and shutting down Nagios copied the RAM version back to the HD. However, changes made to either version since the last startup or shutdown were sometimes lost when Nagios failed unexpectedly or the server went down. To resolve this problem, a cron job was set to run every 5 minutes and rsync the two versions.
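The copy-on-start, copy-on-stop, and periodic-sync arrangement described above can be sketched roughly as follows. This is only an illustration, not the actual startup script: the two paths are hypothetical, the function names are invented for the example, Python's shutil stands in for rsync, and the periodic sync is assumed to run from RAM to HD (the direction is not stated here).

```python
import shutil
from pathlib import Path

# Hypothetical locations; the real paths are not given in the article.
HD_COPY = Path("/usr/local/nagios")
RAM_COPY = Path("/mnt/ramdisk/nagios")


def mirror(src, dst):
    """Copy everything under src into dst, overwriting files that already exist.

    Unlike rsync with --delete, this does not remove files that exist only in dst.
    Requires Python 3.8+ for dirs_exist_ok.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)


def on_startup(hd=HD_COPY, ram=RAM_COPY):
    # Startup: populate the RAM partition from the hard-drive copy.
    mirror(hd, ram)


def on_shutdown(hd=HD_COPY, ram=RAM_COPY):
    # Shutdown: save the RAM copy back to the hard drive.
    mirror(ram, hd)


def periodic_sync(hd=HD_COPY, ram=RAM_COPY):
    # Intended to be called from cron every 5 minutes.
    # Assumed direction: RAM -> HD, so a crash loses at most 5 minutes of changes.
    mirror(ram, hd)
```

Even in this reduced form, three separate code paths must agree on the two locations, which hints at the maintenance burden the RAM setup carries.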

Latency Evaluation

We recently ran a latency evaluation on two identical versions of Nagios, one running from RAM and the other from the HD. The two instances ran concurrently on different virtual servers, Brighton and Wonder; Wonder is a clone of Brighton. The following tables show the results of the evaluations.

Trial 1 (879 service checks). All timings are rounded to the nearest second.
                  RAM Version     HD Version
Total Run Time      Latency         Latency
--------------------------------------------
1'00"                75 sec         400 sec
1'20"                45 sec          36 sec
1'30"                21 sec          13 sec
1'40"                11 sec           9 sec
1'50"                10 sec          10 sec
2'00"                 5 sec           5 sec
--------------------------------------------

Trial 2 (1,395 service checks)
                  RAM Version     HD Version
Total Run Time      Latency         Latency        Delta
---------------------------------------------------------
15'40"            138.563 sec     108.010 sec     +30.553
15'50"            120.136 sec      93.351 sec     +26.785
16'00"            118.100 sec     102.386 sec     +15.714
16'10"            114.729 sec      81.194 sec     +33.535
16'20"            106.076 sec      85.776 sec     +20.300
---------------------------------------------------------

Analysis

In Trial 1, Nagios was not sufficiently loaded to show a performance discrepancy. The numbers are decreasing because Nagios takes time to settle out after it is initially started when there is no retention data. As expected, the HD version took longer to settle down. When it did, however, it ran comparably to the RAM version -- that is, both versions handled 879 service checks without a problem.

In Trial 2, Nagios was sufficiently loaded to show a discrepancy between the performance of the RAM and HD versions. In this trial, the HD version consistently performed better. It appears that the rsync process that runs every 5 minutes on the RAM version, together with the fact that half of the available system RAM is partitioned off, causes the RAM version to run slower than the HD version.
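The size of the Trial 2 gap can be summarized from the table's two latency columns. The short sketch below is purely illustrative arithmetic over the five published samples; the function and key names are made up for the example.

```python
from statistics import mean

# (total run time, RAM latency in seconds, HD latency in seconds),
# copied from the Trial 2 table.
TRIAL_2 = [
    ("15'40\"", 138.563, 108.010),
    ("15'50\"", 120.136, 93.351),
    ("16'00\"", 118.100, 102.386),
    ("16'10\"", 114.729, 81.194),
    ("16'20\"", 106.076, 85.776),
]


def summarize(rows):
    """Return mean latencies and the mean RAM-minus-HD gap for a trial."""
    mean_ram = mean(ram for _, ram, _ in rows)
    mean_hd = mean(hd for _, _, hd in rows)
    gap = mean_ram - mean_hd
    return {
        "mean_ram": mean_ram,
        "mean_hd": mean_hd,
        "mean_delta": gap,
        "ram_slower_pct": 100.0 * gap / mean_hd,
    }


if __name__ == "__main__":
    for name, value in summarize(TRIAL_2).items():
        print(f"{name}: {value:.3f}")
```

On these five samples, the RAM version's mean latency comes out roughly 25 seconds (about 27 percent) higher than the HD version's. With so few samples, this is a rough indication rather than a measurement.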

Because Linux caches frequently used files (which definitely includes the Nagios scripts), most -- if not all -- of Nagios is run from RAM anyway. Partitioning off half of the available RAM only reduces the available caching space. Furthermore, the rsync process runs often enough (every 5 minutes) to deserve its own space in the RAM cache. Thus, in the RAM version, there are likely at least two copies of Nagios competing with other files for the system's RAM.
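One way to see how much memory the kernel is already devoting to cached files is to read /proc/meminfo. The following is a minimal sketch, assuming a Linux system that exposes the standard MemTotal and Cached fields (both reported in kB); the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of {name: integer value}.

    Most values are in kB; a few counters (e.g., HugePages_Total) are unitless.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def cache_share(fields):
    """Fraction of total RAM currently used by the page cache."""
    return fields["Cached"] / fields["MemTotal"]


if __name__ == "__main__":
    # Linux only: /proc/meminfo does not exist on other platforms.
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    print(f"page cache: {info['Cached']} kB ({cache_share(info):.0%} of RAM)")
```

Comparing this figure on a server with and without a RAM partition would give a rough sense of how much caching space the partition takes away.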

Conclusions

It is important to note that these results and conclusions are limited because we ran only two trials. The results could, for instance, be the consequence of virtual servers competing for resources on their respective blade servers. Nevertheless, in light of these results and due to the simplicity of running Nagios from the HD rather than from RAM[1], Nagios will be moved back to running from the HD.

In my opinion, moving the Nagios process back to the hard drive will, in the worst case, perform comparably to our current setup, and it may well run more efficiently. Additionally, the simplicity of running Nagios from the hard drive makes it a far more desirable option for future maintainability and scaling.

Endnote

1. The RAM version -- among other things -- requires a cron job to keep it in sync with the hard drive copy, necessitates additional Fruity code for exporting, cannot be mounted by NFS for a distributed Nagios system, and requires still more code to prevent the rsync process from copying the mounted drives in a distributed system. It's messy.

Jonathan Krein is a Systems Management Engineer for the Office of Information Technology at Brigham Young University. He is currently working on his Bachelor of Science degree in Computer Science at BYU and can be reached at: jonathankrein@byu.net.