Fixing broken code, the bane and blessing of Open Source projects

I work with monitoring software day in and day out.  As with everything else, there are advantages to closed systems and advantages to open systems.  In this case the supposedly mature open system has glaring problems that really piss me off.

Recently I began evaluating alternative solutions to the closed system we employ in the office.  I’ve been a Unix guy for literally decades but we live in a Windows world.  The systems we use must have an MS SQL backend, which automatically excludes pretty much every open source project in the world that would even be worth considering.  Add to this their distrust of anything UDP-based and their previous experience with absolutely hideous Unix-based open source programs, and I can almost understand their shyness.   Horrid packages like NMIS, a gigantic monolithic Perl script with as awkward an interface as any I’ve had the displeasure of using, and Big Brother, the admittedly better but still mind-alteringly annoying system monitoring package, have invaded their little Redmond-esque fiefdom.  They probably also made the mistake of trusting averages of averages of averages for billing, courtesy of MRTG, Cacti and other RRDTool-based systems.  Of course you can reconfigure any of these to average across an arbitrary period rather than the default 1 hour boundary, but reconfiguring these tools is not what they desire, regardless of cost.  I can respect that because I also found it incredibly annoying.  So much so that I wrote a program for a previous employer that kept all the data in an unaggregated binary flat file.  Sadly that code is long gone, but I’m confident that I could write a better, faster, and tighter bit of kit now compared to 15 years ago.
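To put some numbers behind the averages-of-averages complaint, here is a minimal Python sketch.  The traffic figures are invented, and both the 95th-percentile billing rule and the roll-up of twelve 5-minute samples into one hourly value are my assumptions for illustration; none of this is code from MRTG, Cacti or RRDTool.

```python
import math


def percentile_95(samples):
    """Nearest-rank 95th percentile of a list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def consolidate(samples, group_size):
    """Average consecutive groups of samples, as a roll-up archive would."""
    return [
        sum(samples[i:i + group_size]) / group_size
        for i in range(0, len(samples), group_size)
    ]


# Hypothetical day of 5-minute samples in Mbit/s: 20 of the 24 hours
# contain a single 5-minute burst at 100, everything else idles at 10.
raw = []
for hour in range(24):
    if hour < 20:
        raw.extend([100.0] + [10.0] * 11)
    else:
        raw.extend([10.0] * 12)

hourly = consolidate(raw, 12)

print("95th percentile from raw samples:    ", percentile_95(raw))
print("95th percentile from hourly averages:", percentile_95(hourly))
```

With this made-up data the raw samples bill at 100 Mbit/s while the hourly averages bill at 17.5 Mbit/s, because every burst has been smeared across an hour of idle time before the percentile is ever taken.  Keeping the data unaggregated avoids the problem entirely.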

In my digging around the internet for alternatives to expensive, closed systems I’ve run across a few that piqued my interest.  One of them is the Nagios-based package Opsview.  Nagios has been around for quite a long time so I suspected that this would be a relative cake-walk.   Something I could throw down on any random VM on my home ESXi server and have running in no time flat, just like I did with the aforementioned Cacti.  Sadly this was not the case.  After going through the exercise of actually getting Opsview running and figuring out the relatively clunky interface I finally settled down to add some hosts.  That’s where the drama really starts.

The installation worked pretty much as described, thankfully.   Adding devices to the system is less than intuitive.  At least with Zenoss you can usually figure out that the big ‘+’ means add a new thing.  Google came to the rescue for Opsview.  Getting SNMP-based metrics should be a fundamental feature.  Apparently the folks who wrote these Nagios modules didn’t get the memo.  A very simple script like check_snmp_perfstats should absolutely work out of the box.  Sadly it didn’t.  There is a call to return_snmpcommunity() that failed to retrieve the necessary information and silently returned the default string of ‘public’.  Anyone who’s taken IT Security 101 or been in the service industry for more than a week should know to change the default community string on any SNMP-enabled device.  I bypassed that function and used the community string passed in by the management service through the script’s -C option.  A similar problem existed with check_snmp_noprocesses, with a similar solution.  The remaining SNMP scripts probably work since they don’t call return_snmpcommunity(), but I decided to avoid the frustration and didn’t use them.
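For what it’s worth, the behaviour I wanted looks roughly like the Python sketch below.  This is not the Opsview code (those scripts are Perl) and every name in it, including resolve_community(), MissingCommunityError and the -H option, is hypothetical.  The only point is the order of preference: use the -C value if it was given, try a lookup second, and fail loudly rather than quietly falling back to ‘public’.

```python
import argparse


class MissingCommunityError(Exception):
    """Raised when no SNMP community string could be determined."""


def resolve_community(cli_value, lookup=None):
    """Return the community string, preferring the explicit -C value.

    `lookup` is an optional zero-argument callable standing in for
    whatever stored-configuration lookup a plugin might perform.
    """
    if cli_value:
        return cli_value
    if lookup is not None:
        found = lookup()
        if found:
            return found
    # Deliberately no silent default here.
    raise MissingCommunityError(
        "no SNMP community string supplied; refusing to assume 'public'"
    )


def parse_args(argv):
    """Parse a hypothetical check script's command line."""
    parser = argparse.ArgumentParser(description="hypothetical SNMP check")
    parser.add_argument("-H", dest="host", required=True)
    parser.add_argument("-C", dest="community", default=None)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["-H", "router1", "-C", "s3cret"])
    print(args.host, resolve_community(args.community))
```

A check that exits with an obvious error is a five-minute fix.  A check that polls with the wrong community string and reports nothing useful can cost an afternoon.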

That finally brings me to check_snmp_uptime, where I will probably start a religious war.  It uses (used) iso::mgmt.sysUpTime, which returns the time since the SNMP agent was restarted.   That value is probably accurate enough for most people who just reboot their servers.  My servers typically run for years at a time unless there is a kernel vulnerability that mandates a reboot.  Daemons may be restarted but I don’t often see a need to reboot my machines without provocation.  This isn’t Windows-land where the three-finger-salute solves 99% of your problems and uptime is usually less than a month.  I replaced that OID with HOST-RESOURCES-MIB::hrSystemUptime, a much more meaningful metric that returns the actual uptime of the server rather than the uptime of snmpd.
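For anyone who wants to compare the two values themselves, here is a small Python sketch.  The numeric OIDs are the standard ones for sysUpTime.0 and hrSystemUptime.0; the tick values are invented, and the helper name format_timeticks() is mine, not something from the Nagios plugins.  Both objects report TimeTicks, which are hundredths of a second.

```python
# Standard numeric OIDs for the two uptime objects.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"           # sysUpTime.0: time since the agent started
HR_SYSTEM_UPTIME_OID = "1.3.6.1.2.1.25.1.1.0"  # hrSystemUptime.0: time since the host booted


def format_timeticks(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to a readable string."""
    total_seconds = ticks // 100
    days, remainder = divmod(total_seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{days}d {hours:02d}h {minutes:02d}m {seconds:02d}s"


# Hypothetical readings from one server: snmpd was restarted three days
# ago, but the box itself has been up for a little over 400 days.
agent_ticks = 25920000
host_ticks = 3456012345

print(SYS_UPTIME_OID, format_timeticks(agent_ticks))
print(HR_SYSTEM_UPTIME_OID, format_timeticks(host_ticks))
```

One caveat: TimeTicks is a 32-bit counter, so either value wraps after roughly 497 days.  For machines that really do stay up for years that is worth keeping in mind before alerting on a suspiciously low number.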

So, while I was optimistic that yet another open source product turned commercial was going to be my savior in the real world, I was left heartbroken.  My hope, sprung eternal, that yet another nail in the coffin of Windows-based closed-source commercial ware would be hammered home was instead smashed into tiny bits of mostly fluff.  Nagios itself is well known and apparently loved by large parts of the community but I find it bloated, slow and a huge resource hog.  My Zenoss machine has all of 768MB of RAM and runs like a top.  The exact same endpoints monitored by Opsview required twice as much CPU and 6 times as much RAM.  A sad, sad day that a Linux-based solution would require the resources of a Windows-based solution to monitor *ONLY* a half dozen endpoints.  I suspect that large portions of the code are terribly inefficient.  How else can you explain it, when I once wrote a Perl script that could measure several thousand interfaces in less than half a minute on a dual-processor 233MHz P-II?  A sad, sad day indeed.