Gettemp -- Built-In Temperature Monitoring for Sun Enterprise Servers
Alek Komarnitsky
Do you know what the temperature of your server room is right now? Do you have some way of being alerted when it gets too warm? If you have a room temperature probe, how do you know when a fan fails in your Sun Enterprise Server and the internal temperature starts to climb? These questions can be easily addressed with gettemp, a free software solution. Sun Enterprise servers have built-in sensors (depending on the model) that can report on the ambient room or various internal temperatures. You can access this via the following command:
/usr/platform/`uname -i`/sbin/prtdiag -v
This command, however, returns quite a bit of hardware-specific information, and its format varies with the server model and operating system, so it is awkward to use as a quick-and-dirty temperature check. gettemp comprises a couple of Perl/CGI scripts that format the output (depending on the server model and operating system) and generate a Web page showing the current temperatures. Optional alerting is available via a Perl stub routine, and you can also generate a viewable time history (see Figure 1). The figure shows temperature data gathered every 10 minutes (the polling interval is user-definable) from various hosts in an easy-to-understand Web page. There are a number of clickable links:
Installing and Configuring gettemp
Installing and configuring gettemp is quite simple. It is composed of the following programs:
How does gettemp work in real life? It works quite well and has been in operational use for more than six months at a large (1,000+ UNIX nodes) site with more than 50 Sun Enterprise Servers scattered in various buildings and administered by a variety of sys admins. Previously, an air conditioning problem went unnoticed until someone walked into the computer room and discovered that it was like a sauna, or until users complained because a machine had shut down. Internal fan failures, while rare, were not typically caught until the machine actually shut itself down.
With gettemp, you can take a quick look at all Sun Enterprise Servers by simply clicking on the Web page. When a server does get warm (due to either room or internal temperature), an alert is automatically routed to the appropriate sys admin, who can then call facilities, put fans in the server room, or shut the server down gracefully, if needed.
The time history feature has also been quite useful for identifying when problems occurred and removing ambiguity. For instance, a particular server room seemed warm when visited early in the morning. A review of the logs showed that the temperature typically started to climb after midnight and then went back down starting at 6:00 A.M. A call to facilities yielded the response that there was nothing wrong with the air conditioning. A follow-up fax of the gettemp time history logs resulted in an admission that some air conditioning was turned off at night that should not have been.
Some sites already have room temperature monitoring systems. gettemp can not only complement these, but can also monitor the internal temperature of the server. Also, since the alerting capability is simply a Perl stub routine, it is easier to ensure that the right type of notification occurs. Finally, the price is right. I'm aware of one site where the facilities group is spending over $50,000 to install computer room temperature monitoring, but it's not clear how, or if, admins will be notified when there is an issue!
Summary
At my site, gettemp has easily been incorporated into our 24/7 monitoring and alerting system, so there's also an overall warm fuzzy feeling of confidence that gettemp is monitoring things and that the right sys admin will be alerted before things melt down completely. gettemp has been downloaded hundreds of times by the Internet community, and the email feedback has been very positive. That feedback has also helped polish the code, so feel free to send suggestions for further improvement.
Future gettemp Work
Future development is a fairly bounded problem; there should not be much more to do. Obviously, if the output of prtdiag changes with new models and operating systems, some simple parsing code will need to be added. The time history is simply a text dump of the daily data; you could hook MRTG, Cricket, or another graphing tool into it, but I was trying to keep things simple. Software always has a few bugs lurking, but I think most of these have been fixed. gettemp could easily be extended to other platforms and operating systems if an equivalent of the prtdiag command is available to query for temperature data.
A tarball of gettemp with code, documentation, and examples can be found on the Sys Admin Web site or at: http://www.komar.org/komar/alek/ -> Misc. Tech Stuff -> gettemp. The author can be reached at alek@komar.org and welcomes any suggested enhancements, bug reports, fixes, or comments in general.
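For readers who want a feel for the parse-and-alert idea described in this article, the following is a minimal sketch only. It is not gettemp's code: gettemp is written in Perl, while this illustration uses Python. The sample text, the sensor names, the temperature limits, and the function names parse_temperatures, find_alerts, and notify are all hypothetical. Real prtdiag -v output differs between server models and operating system releases, so the pattern would need adjusting for any particular machine.

```python
import re

# Hypothetical sample: real prtdiag -v output differs by server model and OS release.
SAMPLE_OUTPUT = """\
System Temperatures (Celsius):
------------------------------
AMBIENT    24
CPU 0      41
CPU 1      58
"""

# Hypothetical alert limits in degrees Celsius; choose values suited to your hardware.
DEFAULT_LIMIT = 55
LIMITS = {"AMBIENT": 30}

# A sensor label, then whitespace, then an integer reading at the end of the line.
READING = re.compile(r"^(?P<sensor>[A-Za-z][A-Za-z0-9 ]*?)\s+(?P<temp>\d+)\s*$")


def parse_temperatures(text):
    """Return {sensor: degrees C} for every line that looks like a reading."""
    readings = {}
    for line in text.splitlines():
        match = READING.match(line)
        if match:
            readings[match.group("sensor")] = int(match.group("temp"))
    return readings


def find_alerts(readings, limits=LIMITS, default_limit=DEFAULT_LIMIT):
    """Return (sensor, temp, limit) for each reading above its limit."""
    alerts = []
    for sensor, temp in sorted(readings.items()):
        limit = limits.get(sensor, default_limit)
        if temp > limit:
            alerts.append((sensor, temp, limit))
    return alerts


def notify(alerts):
    """Stand-in for a site-specific alert hook (email, pager, and so on)."""
    for sensor, temp, limit in alerts:
        print(f"WARNING: {sensor} is {temp} C (limit {limit} C)")


if __name__ == "__main__":
    notify(find_alerts(parse_temperatures(SAMPLE_OUTPUT)))
```

With the sample data above, only the CPU 1 reading exceeds its limit. In practice the text would come from running the prtdiag command on each host at the chosen polling interval, and notify would be replaced by whatever notification mechanism the site already uses.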
About the Author
Alek Komarnitsky has spent the past 5+ years as Chief Technologist for a large IS consulting/outsourcing firm and helps manage a network of over 1,000 UNIX workstations (and other assorted stuff) scattered from coast-to-coast supporting (literally) rocket scientists. Previously, he was the Network/Systems Manager for two Boulder County software start-ups where he built the computing infrastructures from scratch. His educational background includes an Aero/Astro Engineering undergraduate degree from the University of Washington and an MBA from CU-Boulder. He can be reached at: alek@komar.org.