A Pool Checking Plugin For Nagios

Nagios is a popular network monitoring tool that works by running various plugin programs.  I'm not going to cover how to configure Nagios here, so please refer to the Nagios website for its documentation.  The plugin described below is a nice simple example of XenAPI programming in Python, and it may actually be useful to anyone administering a network containing XenServer hosts.

Definition of problem

A Nagios plugin should be a program which can be run from the command line to check on the status of a 'service'. It communicates through its return value, which should be 0 if the service is OK, 1 for a warning, 2 if the service is broken, and 3 for unknown or unexpected errors. It may also print a single line of text to standard output.
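
As a minimal sketch of that convention, the four codes can be given names and the "print one line, return a code" pattern wrapped in a helper. The constant names and the report function are my own illustrative choices, not anything Nagios defines.

```python
# Nagios plugin exit codes, as listed in the paragraph above.
# The names are illustrative; Nagios only cares about the numbers.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def report(status, message):
    """Print the single status line and hand back the exit code.

    'report' is a hypothetical helper name, not part of Nagios.
    """
    print(message)
    return status

# A real plugin would finish with something like:
#     import sys
#     sys.exit(report(OK, "all hosts are live"))
```

Keeping the printing and the exit code in one place makes it harder to exit with a code that doesn't match the message.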

We will say that a resource pool of XenServer hosts is OK if the pool master reports that all slaves are live. If any are dead then we should signal a problem.

First try

Let's ignore the problem of passing in the server address, user name and password for the moment. Fire up the Python interpreter and try the core of the program:

hostname, username, password = "ivory", "root", "password"

# usual boilerplate login

import XenAPI
session=XenAPI.Session('https://'+hostname)
session.login_with_password(username, password)
sx=session.xenapi
 
# partition the hosts according to whether they're alive or not
hosts=sx.host.get_all()
hosts_with_status=[(sx.host.get_name_label(x),sx.host_metrics.get_live( sx.host.get_metrics(x) )) for x in hosts]
live_hosts=[name for (name,status) in hosts_with_status if (status==True)]
dead_hosts=[name for (name,status) in hosts_with_status if not (status==True)]

 
#our one line of output
print("live hosts %s dead hosts %s" % (live_hosts, dead_hosts))
#retcode is the value we should return to the system
retcode = 2 if (len(dead_hosts)!=0) else 0
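
If you don't have a pool to hand, the partitioning step can be tried against a stand-in object. The sketch below assumes only the four calls used above (host.get_all, host.get_name_label, host.get_metrics and host_metrics.get_live). The Fake* classes and the partition_hosts name are hypothetical test doubles of mine, not part of XenAPI.

```python
class FakeHostAPI(object):
    """Hypothetical stand-in for session.xenapi.host, backed by a dict."""
    def __init__(self, pool):
        self.pool = pool  # maps host ref -> (name_label, live)
    def get_all(self):
        return sorted(self.pool)
    def get_name_label(self, ref):
        return self.pool[ref][0]
    def get_metrics(self, ref):
        return "metrics-of-" + ref

class FakeHostMetricsAPI(object):
    """Hypothetical stand-in for session.xenapi.host_metrics."""
    def __init__(self, pool):
        self.pool = pool
    def get_live(self, metrics_ref):
        ref = metrics_ref[len("metrics-of-"):]
        return self.pool[ref][1]

class FakeXenAPI(object):
    """Bundles the two fakes so it can be used wherever sx is used."""
    def __init__(self, pool):
        self.host = FakeHostAPI(pool)
        self.host_metrics = FakeHostMetricsAPI(pool)

def partition_hosts(sx):
    """Return (live_names, dead_names) using the same calls as above."""
    live, dead = [], []
    for ref in sx.host.get_all():
        name = sx.host.get_name_label(ref)
        if sx.host_metrics.get_live(sx.host.get_metrics(ref)):
            live.append(name)
        else:
            dead.append(name)
    return live, dead

# 'ebony' is a made-up second host that is marked as dead.
sx = FakeXenAPI({"ref1": ("ivory", True), "ref2": ("ebony", False)})
live_hosts, dead_hosts = partition_hosts(sx)
print("live hosts %s dead hosts %s" % (live_hosts, dead_hosts))
```

Passing a real session.xenapi to partition_hosts instead of the fake should behave the same way, since it only relies on those four calls.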

A problem:

Occasionally, the pool master can change. (You can force this to happen from the server command line with xe pool-designate-new-master). If you run the above code and ivory is not the master, then you'll get this error:

XenAPI.Failure: ['HOST_IS_SLAVE', '10.80.224.105']

So a better way of logging in is to catch this exception and then log in to the address it returns, which is that of the real master.

hostname, username, password = "ivory", "root", "password"
 
import XenAPI
 
try:
    session=XenAPI.Session('https://'+hostname)
    session.login_with_password(username, password)
except XenAPI.Failure as e:
    if e.details[0]=='HOST_IS_SLAVE':
        session=XenAPI.Session('https://'+e.details[1])
        session.login_with_password(username, password)
    else:
        raise

sx=session.xenapi

This should automatically redirect the login to whichever host has taken over as master. However, it is probably still worth a warning to Nagios, since the administrator will probably want to fix whatever caused the change and put everything back to normal.
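
To keep track of whether a redirect happened, the login can be wrapped in a function that also returns a flag. This is only a sketch: login_with_redirect, its make_session parameter, FakeSession and the local Failure class are names I've made up so that it runs without a pool or the XenAPI module. With the real library you would pass XenAPI.Session as the factory and catch XenAPI.Failure.

```python
class Failure(Exception):
    """Stand-in for XenAPI.Failure so the sketch runs without the library.
    Like the real one, it carries the error as a list in .details."""
    def __init__(self, details):
        Exception.__init__(self, details)
        self.details = details

def login_with_redirect(make_session, hostname, username, password):
    """Return (session, host_is_slave).

    'make_session' is a hypothetical factory argument; with the real
    library it would be XenAPI.Session.
    """
    try:
        session = make_session('https://' + hostname)
        session.login_with_password(username, password)
        return session, False
    except Failure as e:
        if e.details[0] != 'HOST_IS_SLAVE':
            raise
        # details[1] is the address of the real master
        session = make_session('https://' + e.details[1])
        session.login_with_password(username, password)
        return session, True

class FakeSession(object):
    """Test double: every host except the master refuses the login."""
    MASTER = 'https://10.80.224.105'
    def __init__(self, url):
        self.url = url
    def login_with_password(self, username, password):
        if self.url != self.MASTER:
            raise Failure(['HOST_IS_SLAVE', '10.80.224.105'])

session, host_is_slave = login_with_redirect(
    FakeSession, "ivory", "root", "password")
```

Here host_is_slave comes back true because the fake "ivory" is not the master, which is exactly the case that should turn into a Nagios warning.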

Putting it all together:

Now we need to take the first program, add the new login method with the possible redirect, and wrap it all in some code to accept command line arguments and return an exit code to the system (and so to Nagios). This is my version:

#!/usr/bin/python

#This is an example plugin for the popular network monitoring program nagios.

#Check if all the hosts in a pool are live.
#If we log in to a slave by mistake (the master can sometimes change)
#then redirect the request to the real master

#example command line: ./check_pool.py -H ivory -p password -l root

#So: return codes
# 0 : everything is ok
# 1 : named host is slave, but all hosts in pool are up
# 2 : some of the hosts in the pool are down
# 3 : unexpected error

#entire program wrapped in try/except so that we can send exit code 3 to nagios on any error
#sys is imported outside the try so that the handler at the bottom can always call sys.exit
import sys

try:

    import XenAPI

    from optparse import OptionParser

    #Parse command line options
    #Python's standard option parser won't do what I want, so I'm subclassing it.
    #firstly, nagios wants exit code three if the options are bad
    #secondly, we want 'required options', which the option parser thinks is an oxymoron.
    #I on the other hand don't want to give defaults for the host and password, because nagios is difficult to set up correctly,
    #and the effect of that may be to hide a problem.
    class MyOptionParser(OptionParser):
        def error(self,msg):
            print(msg)
            sys.exit(3)
        #stolen from python library reference, add required option check
        def check_required(self, opt):
            option=self.get_option(opt)
            if getattr(self.values, option.dest) is None:
                self.error("%s option not supplied" % option)

    parser = MyOptionParser(description="Nagios plugin to check whether all hosts in a pool are live")

    parser.add_option("-H", "--hostname", dest="hostname", help="name of pool master")
    parser.add_option("-l", "--login-name", default="root", dest="username", help="name to log in as (usually root)")
    parser.add_option("-p", "--password", dest="password", help="password")

    (options, args) = parser.parse_args()

    #abort if host and password weren't specified explicitly on the command line
    parser.check_required("-H")
    parser.check_required("-p")


    #get a session. set host_is_slave true if we need to redirect to a new master
    host_is_slave=False
    try:
        session=XenAPI.Session('https://'+options.hostname)
        session.login_with_password(options.username, options.password)
    except XenAPI.Failure as e:
        if e.details[0]=='HOST_IS_SLAVE':
            session=XenAPI.Session('https://'+e.details[1])
            session.login_with_password(options.username, options.password)
            host_is_slave=True
        else:
            raise
    sx=session.xenapi

    #work out which hosts in the pool are alive, and which dead
    hosts=sx.host.get_all()
    hosts_with_status=[(sx.host.get_name_label(x),sx.host_metrics.get_live( sx.host.get_metrics(x) )) for x in hosts]

    live_hosts=[name for (name,status) in hosts_with_status if (status==True)]
    dead_hosts=[name for (name,status) in hosts_with_status if not (status==True)]


    #log out
    session.logout()

    #nagios wants a single line of output
    message = "live hosts %s dead hosts %s" % (live_hosts, dead_hosts)
    if host_is_slave:
        message += " (%s is not the master)" % options.hostname
    print(message)

    #and an exit code
    if (len(dead_hosts)!=0):
        exitcode=2
    elif host_is_slave:
        exitcode=1
    else:
        exitcode=0

except Exception as e:
    print("Unexpected Exception [ %r ]" % (e,))
    sys.exit(3) #Nagios wants error 3 if anything weird happens

sys.exit(exitcode)

An example Unix command line would be:

$ ./check_pool.py -H ivory -p password ; echo $?

which will print the return code as well.
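
To check the value that echo $? shows against what you expect, the decision at the end of the script can be sketched as a small function. The exit_code name is my own; the logic follows the return codes listed in the script's header comment, with code 3 left to the outer exception handler.

```python
def exit_code(dead_hosts, host_is_slave):
    """Map the plugin's findings to a Nagios return code."""
    if dead_hosts:
        return 2   # some of the hosts in the pool are down
    if host_is_slave:
        return 1   # named host is a slave, but all hosts are up
    return 0       # everything is ok

# Dead hosts take priority over the "not the master" warning.
print(exit_code(["ebony"], True))
```

Checking for dead hosts first means a redirect never hides a more serious problem.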
