Nagios and Fruity: What Is Their Monitoring Potential for Your Network?

Jonathan Krein

This article provides a brief introduction to Brigham Young University's (BYU) implementation of Nagios and Fruity and presents what I have learned about how these monitoring applications perform both on a dedicated server and in a virtual environment. I will describe the limits of Nagios and how its performance varies depending on platform. With the data presented here, you should be better able to project growth potential for a Nagios-Fruity implementation in your own network.

At BYU, the production versions of Nagios and Fruity are housed together on a single dedicated server. In fact, BYU's network has traditionally been hardware based. However, because of the benefits of virtual servers, our network is beginning to move in the virtual direction. Such a move raises several important questions for Nagios and Fruity:

Should we move Nagios and Fruity from their dedicated hardware to a virtual environment?
What would be the ramifications of such a move, and what losses in performance might we incur?
How much room would Nagios have to grow (i.e., monitoring more hosts and services) on either of these two platforms?
If we did move to a virtual environment, would we need to distribute Nagios's load over multiple servers?
In a distributed environment, how many servers would we need?

To answer these questions, first I will cover some background information about Nagios and Fruity. Next, I will talk about how these applications perform on a dedicated server. Then, I will show what they can do on a virtual server, and I'll present one server configuration option to be aware of, since it has a large impact on Nagios and Fruity's performance. I will conclude this discussion with a Nagios load analysis, which should give you a good indication of what Nagios is capable of, particularly in a virtual environment. Along the way, I hope you will also see what potential Nagios has to monitor a large network, as well as the benefits of a configuration file manager like Fruity.

Background Information

Nagios is a Linux-based, enterprise-class monitoring application distributed under an open source license. Nagios works from text configuration files that store information about hosts and services. A host represents a server or other device out on the network, while each service represents a different check that is being performed on its associated host (e.g., CPU utilization, memory usage, processes running). Service checks make up the bulk of Nagios's load, but hosts are also given a health check when one of their services reports a problem. The real power behind Nagios is that everything Nagios does, from running checks to responding to alerts, is based on commands that you define. When the status of a host or service changes, Nagios uses these commands -- which in turn do the real work -- to run scripts or other programs. This means that Nagios can do just about anything you can program.

However, defining all of your hosts, services, commands, and other configuration options will create bulky configuration files for Nagios, especially if you have a large network. BYU's implementation currently performs over 2,800 service checks on nearly 1,900 network devices and servers. With so many hosts and services, it's easy to make a mistake in the configuration files, and a mistake in these files means downtime for Nagios. So, to store all of this data, find configuration errors, and handle the tedious formatting that Nagios requires, we use a configuration file manager called Fruity.

Fruity works from a MySQL database that stores all of the information Nagios needs to do its job. Fruity presents this information for updates, additions, and deletions through a Web interface. One of Fruity's strengths is that it organizes everything into logical components, thus making it simple to find a host or service, adjust its settings, and then export the changes to Nagios. The best part about Fruity is that it's also open source and therefore easily obtainable and fully customizable.
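To make the idea of exporting stored records into Nagios's text format more concrete, here is a minimal sketch in Python. It is not Fruity's actual code. The helper render_object, the host web01, the address 192.0.2.10, and the command check_example_load are all hypothetical names chosen for illustration. The directive names (host_name, address, service_description, check_command, command_name, command_line) and the $HOSTADDRESS$ macro are standard Nagios ones, but a working configuration needs further required directives, usually inherited from templates.

```python
def render_object(kind, directives):
    """Render one Nagios object definition (host, service, command) as text."""
    # Pad directive names so the values line up in a column.
    width = max(len(name) for name in directives)
    lines = [f"define {kind} {{"]
    for name, value in directives.items():
        lines.append(f"    {name.ljust(width)}  {value}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical records, standing in for rows a configuration manager
# might keep in its database. None of these values come from BYU's setup.
host = {
    "host_name": "web01",
    "alias": "Example web server",
    "address": "192.0.2.10",
}
service = {
    "host_name": "web01",
    "service_description": "CPU Load",
    "check_command": "check_example_load",
}
command = {
    "command_name": "check_example_load",
    "command_line": "/usr/local/bin/check_example_load.sh $HOSTADDRESS$",
}

config = "\n\n".join([
    render_object("host", host),
    render_object("service", service),
    render_object("command", command),
])
print(config)
```

Generating the text from structured records is what removes the hand-formatting step: the braces and directive layout are produced in one place, so a typo in one definition cannot break the syntax of the whole file.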

Running Nagios and Fruity on a Dedicated Server

Figure 1 outlines the dedicated server hardware that has been supporting our production versions of Nagios and Fruity. We have been running these applications on a Linux machine (Fremont) with two processors (hyper-threaded to four) and 4 GB of memory. For most of Fremont's life, Fruity has performed well, and Nagios's latency has been under one minute.

The term latency refers to the delay between the time a check was scheduled to run and the time it actually starts. Because Nagios works sequentially, it will not start the next check until the previous one has finished. Therefore, latency begins to increase when the load gets too high and checks are scheduled more closely together than they can run.
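The following Python sketch illustrates this queueing effect under simplifying assumptions that are mine, not Nagios's: every check takes the same time to run, checks are spread evenly across one scheduling interval, and strictly one check runs at a time. The function simulate_latency and the sample numbers (1-second checks, a 300-second interval) are hypothetical.

```python
def simulate_latency(num_checks, check_duration, interval):
    """Average latency in seconds for one pass of sequentially run checks.

    Checks are scheduled evenly across `interval` seconds, each takes
    `check_duration` seconds, and a check cannot start until the
    previous one has finished.
    """
    spacing = interval / num_checks
    clock = 0.0            # time at which the runner is next free
    total_latency = 0.0
    for i in range(num_checks):
        scheduled = i * spacing
        start = max(clock, scheduled)        # wait for the previous check
        total_latency += start - scheduled   # latency of this check
        clock = start + check_duration
    return total_latency / num_checks


# Hypothetical load levels: 1-second checks over a 300-second interval.
for n in (100, 300, 600, 1200):
    avg = simulate_latency(n, check_duration=1.0, interval=300.0)
    print(f"{n:5d} checks -> average latency {avg:.1f} s")
```

In this model latency stays at zero until the number of checks reaches interval / check_duration (300 here); beyond that point every additional check pushes all later checks back. The sketch covers a single pass only, so it understates a sustained overload, where the backlog would carry over from one interval to the next.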

Our goal at BYU is to keep Nagios's service check latency below one minute. However, we are now exceeding this limit on Fremont, and the latency continues to increase with each service check we add. Because hosts are only checked when one of their services reports a problem, and these checks are ushered to the front of the queue, host check latency has not yet been affected. Nevertheless, with service check latency now averaging over one minute and Fruity running slower every day, we have been forced to consider distributing the load. As part of this transition, we have been testing Nagios on virtual servers.

Nagios and Fruity on a Virtual Server

Figure 2 outlines one of the virtual servers (Brighton) on which we installed Nagios and Fruity. Brighton was first configured to have two virtual processors and a little more than 2 GB of memory. However, even with only 30% of Fremont's load, Nagios's service checks were running behind by more than 6 minutes. Obviously, a virtual server is not going to perform as well as dedicated hardware, but six times the latency on less than a third of the load is far outside of the expected and acceptable performance margins. We further experimented with the optimizations outlined in Nagios's documentation, but we did not see any improvement until we finally changed Brighton's virtual server configuration.

Tweaking the Virtual Server Configuration

Looking at the CPU performance graphs that VMware produces, we noticed that each of Brighton's two virtual processors was running at only 25% capacity; yet Nagios was buckling under its load, and Fruity was running extremely slowly. So we changed the configuration from two virtual processors to one and increased the granted memory to 3.6 GB (see Figure 3). The service check latency promptly dropped to between 0.5 and 15 seconds -- a reduction of almost 96% or more -- and Fruity accordingly performed better.
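As a quick check on the size of that improvement, the arithmetic can be worked through in a few lines of Python. The only assumption is that "more than 6 minutes" is taken as exactly 360 seconds, so the real reduction was at least this large; percent_reduction is a helper name introduced here.

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


before = 6 * 60  # assumed baseline: "more than 6 minutes", taken as 360 s

# The two ends of the reported range after the change: 15 s and 0.5 s.
for after in (15.0, 0.5):
    print(f"{after:>4} s -> {percent_reduction(before, after):.1f}% reduction")
```

The 96% figure corresponds to the slow end of the new range (15 seconds); at the fast end (0.5 seconds) the reduction is closer to 99.9%.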

What Happened?

Take a look at the performance graphs for Brighton in Figure 4. On June 5th, we increased the load on Brighton, and the active memory usage rose to almost 1 GB. However, notice that the CPU performance graph for that same day shows no significant change in processor utilization, which remained at about 50% for both processors (cumulative).

Could Something Be Holding These Processors Back?

Look at June 8th, when we made the configuration changes. Active memory did not increase, but the CPU performance rose to almost 70%. We later increased another virtual server's granted memory without modifying the number of processors and found no changes in CPU performance, active memory, or Nagios's latency. It seems that on virtual servers managed by VMware, Nagios and Fruity perform much better with one virtual processor than with two. The number of processors is definitely something to keep in mind when setting up these applications, especially in a virtual environment.

Load Testing the Virtual Environment

At this point, having configured our virtual server for better performance, we wanted to find out how much load Nagios and Fruity could handle and what the latency curve would look like as we increased the number of service checks. Figure 5 shows a chart of our results -- Nagios's service check latency graphed against the number of service checks being run. Just as we saw with Fremont, Nagios's latency on Brighton remained low up to a point, after which the checks were being scheduled too closely together and Nagios fell behind. One interesting thing to note is that the graph is almost linear -- a trend that we have also seen with Fremont. Additionally, from the blown-up area of the graph we can see that on a VMware virtual server configured as outlined in Figure 3, Nagios is capable of about 870 service checks before the latency exceeds one minute.
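Combining this threshold with the load figure given earlier (over 2,800 service checks) gives a rough answer to the question of how many virtual servers a distributed setup would need. The Python sketch below is a back-of-the-envelope estimate, not a sizing method we validated: it assumes checks can be split evenly, that every server behaves like Brighton, and that distributing Nagios adds no overhead of its own. The function and constant names are introduced here for illustration.

```python
import math


def servers_needed(total_checks, checks_per_server):
    """Smallest number of servers that keeps each one at or under its threshold."""
    return math.ceil(total_checks / checks_per_server)


TOTAL_CHECKS = 2800              # from the article: "over 2,800 service checks"
CHECKS_PER_VIRTUAL_SERVER = 870  # from the article: one-minute latency threshold

count = servers_needed(TOTAL_CHECKS, CHECKS_PER_VIRTUAL_SERVER)
per_server = TOTAL_CHECKS / count
headroom = CHECKS_PER_VIRTUAL_SERVER - per_server

print(f"servers needed:    {count}")
print(f"checks per server: {per_server:.0f}")
print(f"headroom each:     {headroom:.0f} checks before the one-minute mark")
```

With these figures the estimate comes to four virtual servers at about 700 checks each, leaving roughly 170 checks of headroom per server. Because 2,800 is a lower bound on the real load, treat the result as a minimum.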

It is important to note that this graph and the other data presented in this article represent our own experience with the performance of Nagios and Fruity and may differ to some extent from yours. In particular, keep in mind that your server's check threshold for Nagios will depend on how long your service checks take to run and how frequently you run them.

This data is most valuable as a model. It clearly shows that a relationship exists between Nagios's performance and the number of virtual processors on which it is running. It also gives us an idea of what load Nagios and Fruity can handle on various platforms and how Nagios's latency will increase as load is added. This data should help you answer the questions posed at the beginning of this article and better plan a future for a Nagios-Fruity system in your own network.

Jonathan Krein is a Systems Management Engineer for the Office of Information Technology at Brigham Young University. He is currently working on his Bachelor of Science degree in Computer Science at BYU and can be reached at: JonathanKrein@byu.edu.