We used Orion at my previous job for monitoring but we had an issue with one of the main sites in production.

The site in question was a buggy web app that every so often would display nothing but an ASP error. Because the server was still up, responsive, and serving HTTP, our monitoring software wouldn't send an alert unless IIS or the server itself was actually down. With no alert, the app stayed down until someone told us it wasn't working.

I worked with the applications admin to write a Python script that, at first, notified us when the site was up but displaying the ASP error, and later automated restarting IIS on the server.

Here is the link to the GitHub page.

The variables for the server, site, email, etc. are set manually at the beginning of the script.


# Name of Webserver running the site
server = 'server.local.domain'

# URL of Page to Monitor
site = 'http://site.com/default.asp'

# Address Alerts are sent from
sender = 'alerts@local.domain'

# Mail Exceptions to
exception_recipient = 'admin@site.com'

# Alert recipient
recipient = 'group@site.com'

# SMTP Server to send mail through
smtp_serv = 'mail.local.domain'

# Log Location, use forward slashes instead of backslashes
log_file = 'c:/sitechecker/log.txt'

# Line to search for on site to verify the correct page is being displayed
# if you aren't sure, visit the page in a browser, open source and copy some text
web_string = '<header>Website</header>'

The main loop begins by resetting some variables and waiting a minute. It then checks the site: if there isn't a pre-existing session the script creates one, and check_site() requests the main page and verifies that the content is correct and not the ASP error. Reusing the session mattered because one of the site's problems was that it had to be set to leave sessions open for an hour before closing them, and any request without an existing session would open a new one.

If the check fails, the script waits ten seconds and tries again, up to three attempts in total, in case the server is just slow to respond.

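def check_site():
    # s is the requests session the script keeps open between checks
    try:
        r = s.get(site)
    except requests.exceptions.RequestException:
        # Connection errors, timeouts, etc. count as a failed check
        return False

    # r.ok is False for 4xx/5xx responses; r.text is the decoded page body
    return r.ok and web_string in r.text

while True: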
    pingserver = ''
    stopresult = ''
    startresult = ''
    resolved = ''
    command_except = ''

    time.sleep(60)
    
    for _ in range(3):
        if check_site():
            break
            
        time.sleep(10)
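
The retry loop leans on Python's for/else, which is easy to misread: the else branch runs only when the loop finishes without hitting break. Here is a minimal sketch of the same pattern pulled out into a standalone helper. The name check_with_retries, its parameters, and the injectable sleep argument are hypothetical and not part of the original script.

```python
import time


def check_with_retries(check, attempts=3, delay=10, sleep=time.sleep):
    """Return True as soon as check() succeeds, False after `attempts` failures.

    Hypothetical helper: `check` is any zero-argument callable, such as
    check_site. `sleep` is injectable so the delay can be skipped in tests.
    """
    for _ in range(attempts):
        if check():
            break          # success: skip the else branch below
        sleep(delay)       # failure: pause before the next attempt
    else:
        # Reached only when the loop never hit `break`,
        # meaning every attempt failed.
        return False
    return True


if __name__ == '__main__':
    # Simulate a site that fails twice and then recovers.
    responses = iter([False, False, True])
    print(check_with_retries(lambda: next(responses), sleep=lambda s: None))
```

Passing the check and the sleep function in as arguments keeps the helper free of globals, so the retry behaviour can be exercised without a real web server or real delays.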

If the request failed three times, the script would log the event and then ping the server to determine whether the issue was with the server itself or just the web app. If the server did respond to the pings, the script would attempt to restart IIS.

    else:
        log_event('Site failed to return correct page 3 times')

        if not ping(server):
            pingserver = 'The server is not responding to pings. No further action will be taken.'
            email_alert(pingserver, stopresult, startresult, resolved)
            log_event(pingserver)
            time.sleep(1800)
            continue
            
        pingserver = 'The server is responding to pings.\nAttempting to stop and restart IIS...'
        log_event('The server is responding to pings')

        stop_command, start_command, command_except = restart_iis()
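
The ping() helper isn't shown in this post. One way it could be written, as a rough sketch, is to shell out to the system ping command and treat a zero exit status as success. The function names and the count parameter here are my own assumptions, not the original code.

```python
import platform
import subprocess


def build_ping_command(host, count=1, system=None):
    """Build a ping command line for the current (or given) operating system."""
    system = system or platform.system()
    # Windows ping uses -n for the packet count; Linux and macOS use -c.
    count_flag = '-n' if system == 'Windows' else '-c'
    return ['ping', count_flag, str(count), host]


def ping(host):
    """Return True if the ping command exits with status 0."""
    try:
        subprocess.check_output(build_ping_command(host), stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError):
        # Non-zero exit status, or the ping executable could not be run.
        return False
    return True
```

One caveat with this approach: as far as I know, Windows ping can still exit with status 0 when it receives a "Destination host unreachable" reply, so a stricter version might also inspect the output text.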

To restart IIS I used subprocess to send the iisreset command to the web server, then set variables depending on whether IIS was able to stop and restart successfully.

def restart_iis():
    """Run iisreset against the server; return (stop, start, command_except)."""
    try:
        # universal_newlines=True returns the output as text instead of bytes
        result = subprocess.check_output(['iisreset', server],
                                         universal_newlines=True)
    except (subprocess.CalledProcessError, OSError):
        # iisreset exited with an error or could not be launched
        email_error('Restarting IIS')
        log_event('Exception Restarting IIS')
        return False, False, True

    # iisreset reports each phase on its own line of output
    stop = 'Internet services successfully stopped' in result
    start = 'Internet services successfully restarted' in result

    # Return on every path, not only when the restart failed
    return stop, start, False
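
Since running iisreset for real is disruptive, it can help to keep the output parsing separate from the subprocess call so it can be tested on its own. A small sketch of that idea follows. The function and constant names are hypothetical, and the sample text is hand-written from the two marker strings the script searches for, not a capture of real iisreset output.

```python
STOPPED = 'Internet services successfully stopped'
RESTARTED = 'Internet services successfully restarted'


def parse_iisreset_output(output):
    """Return (stopped, restarted) booleans based on the marker strings."""
    return STOPPED in output, RESTARTED in output


if __name__ == '__main__':
    # Hand-written sample text built only from the two marker strings.
    print(parse_iisreset_output(STOPPED + '\n' + RESTARTED + '\n'))
    print(parse_iisreset_output(STOPPED + '\n'))
```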

The script would then set variables used in the notification to let us know what happened and the result of the automated actions.

        elif stop_command and start_command:
            # Restart successful
            stopresult = 'IIS has successfully stopped'
            startresult = 'IIS has successfully started'
            log_event(stopresult)
            log_event(startresult)

        elif stop_command and not start_command:
            # Stopped and didn't come back up
            stopresult = 'IIS has successfully stopped'
            startresult = 'IIS was unable to start and is down'
            log_event(stopresult)
            log_event(startresult)

        elif not stop_command:
            # Failed to stop
            stopresult = 'IIS was unable to stop.'
            log_event(stopresult)

Finally, the script would check whether the site was back up and functioning properly, record the result in one more variable for the notification email, and send the message.

If the site was still down, the script would wait 30 minutes before checking again to prevent a flood of emails. Otherwise, it went straight back into the main loop.

        # See if the site is back online (still inside the else block,
        # so this only runs after a failed check)
        if not check_site():
            resolved = 'The site is still down. No further action will be taken.'
            log_event(resolved)
            email_alert(pingserver, stopresult, startresult, resolved)
            time.sleep(1800)
        else:
            resolved = 'The site is back up.'
            log_event(resolved)
            email_alert(pingserver, stopresult, startresult, resolved)

Looking back, the email function could be cleaned up, but it worked well enough for taking the variables and putting together a text-only alert email.

The email would say whether the server was responding to pings, whether IIS was successfully restarted, and whether the site was back up or still down.

def email_alert(pingresult, stopresult, startresult, resolved):
    # A blank line must separate the headers from the body of the message
    message = """From: %s
To: %s
Subject: Site Checker Alert

%s IIS has failed 3 times to properly display the correct webpage.
%s
    %s
    %s
%s
""" % (sender, recipient, server, pingresult, stopresult, startresult, resolved)
    send_email(sender, recipient, message, smtp_serv)
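
If I were cleaning up the email function today, one option would be to let the standard library build the headers instead of formatting them by hand. The sketch below assumes Python 3.6 or later and is not the original code: build_alert and its lines parameter are hypothetical names.

```python
from email.message import EmailMessage


def build_alert(sender, recipient, server, lines):
    """Build a plain-text alert; `lines` holds the status messages to report."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = 'Site Checker Alert'
    body = ['%s IIS has failed 3 times to properly display the correct webpage.' % server]
    # Skip empty status strings so the email has no blank filler lines.
    body.extend(line for line in lines if line)
    msg.set_content('\n'.join(body))
    return msg


if __name__ == '__main__':
    alert = build_alert('alerts@local.domain', 'group@site.com', 'server.local.domain',
                        ['The server is responding to pings.', '', '', 'The site is back up.'])
    print(alert)
    # To send it (not run here):
    #   import smtplib
    #   with smtplib.SMTP('mail.local.domain') as smtp:
    #       smtp.send_message(alert)
```

Building the message as an object avoids the header and body separator mistake that is easy to make with a hand-formatted string, and smtplib's send_message() accepts the object directly.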

Once in place, the script cut the site's downtime to a little over a minute, down from a typical 30 to 45 minutes depending on how soon a user was able to notify us.