We used Orion for monitoring at my previous job, but we had an issue with one of the main sites in production.
The site in question was a buggy web app that every so often would end up displaying only an ASP error. Since the server was still up, responsive and serving HTTP, our monitoring software wouldn't send an alert unless IIS or the server itself was actually down. Without an alert, the app stayed down until someone told us it wasn't working.
I worked with the applications admin to write a Python script to, at first, notify us when the site was up but displaying the ASP error, and later to automate restarting IIS on the server.
The variables for the server, site, email, etc. are set manually at the beginning.
# Name of Webserver running the site
server = 'server.local.domain'
# URL of Page to Monitor
site = 'http://site.com/default.asp'
# Address Alerts are sent from
sender = 'firstname.lastname@example.org'
# Mail Exceptions to
exception_recipient = 'email@example.com'
# Alert recipient
recipient = 'firstname.lastname@example.org'
# SMTP Server to send mail through
smtp_serv = 'mail.local.domain'
# Log Location, use forward slashes instead of backslashes
log_file = 'c:/sitechecker/log.txt'
# Line to search for on site to verify the correct page is being displayed
# if you aren't sure, visit the page in a browser, open source and copy some text
web_string = '<header>Website</header>'
The main loop begins by resetting some variables and waiting for a minute. It then calls the check_site() function, which creates a new session if there isn't a pre-existing one, requests the main page and verifies that the content is correct and not the ASP error. Reusing the session mattered because one of the problems with the site was that it had to be set to leave sessions open for an hour before closing them, and any request would open up a new session.
If the check is not successful, the script waits 10 seconds and tries again, for up to 3 attempts in total, just in case the server is slow to respond.
def check_site():
    try:
        r = s.get(site)
    except requests.exceptions.RequestException:
        return False
    return r and web_string in r.content
while True:
    pingserver = ''
    stopresult = ''
    startresult = ''
    resolved = ''
    command_except = ''
    time.sleep(60)
    for _ in range(3):
        if check_site():
            break
        time.sleep(10)
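The retry step in the loop above could also be pulled out into a small helper. This is only a sketch written for Python 3: the name `check_with_retries` and its `check` and `sleep` parameters are hypothetical and not part of the original script. They are there so the logic can be exercised without a real web server. Unlike the loop above, this version skips the pause after the final failed attempt.

```python
import time


def check_with_retries(check, attempts=3, delay=10, sleep=time.sleep):
    """Return True as soon as check() succeeds, False once every attempt fails.

    Hypothetical helper for illustration; the original script inlines this loop.
    """
    for attempt in range(attempts):
        if check():
            return True
        # Pause before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In the script it would be called as `check_with_retries(check_site)`, with the failure handling moved under `if not check_with_retries(check_site):`.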
If the request failed three times, the script would log the event and then ping the server to determine whether the issue was with the server itself or just the web app. If the server did not respond to the pings, the script sent an alert and took no further action. If it did respond, the script would attempt to restart IIS.
    else:
        log_event('Site failed to return correct page 3 times')
        if not ping(server):
            pingserver = 'The server is not responding to pings. No further action will be taken.'
            email_alert(pingserver, stopresult, startresult, resolved)
            log_event(pingserver)
            time.sleep(1800)
            continue
        pingserver = 'The server is responding to pings.\nAttempting to stop and restart IIS...'
        log_event('The server is responding to pings')
        stop_command, start_command, command_except = restart_iis()
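The `ping()` helper used above isn't shown in this post. One way it might look is sketched below for Python 3, assuming the system `ping` command is on the PATH. The `build_ping_command` function and the `count` and `run` parameters are hypothetical names added for illustration. The exit code is only a rough signal of reachability, so treat this as a starting point rather than a definitive check.

```python
import subprocess
import sys


def build_ping_command(host, count=2):
    """Build a ping command line for the current platform (hypothetical helper)."""
    # Windows ping takes -n for the packet count; Unix-like systems take -c.
    count_flag = '-n' if sys.platform.startswith('win') else '-c'
    return ['ping', count_flag, str(count), host]


def ping(host, count=2, run=subprocess.call):
    """Return True if the ping command exits with status 0."""
    # Discard ping's own output; only the exit status is used here.
    status = run(build_ping_command(host, count),
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.DEVNULL)
    return status == 0
```

Passing `run` as a parameter keeps the function testable without sending real packets.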
To restart IIS, I used subprocess to send the iisreset command to the web server. The function then sets variables depending on whether IIS was able to stop and restart successfully.
def restart_iis():
    try:
        result = subprocess.check_output('iisreset ' + server, bufsize = -1)
    except:
        email_error('Restarting IIS')
        log_event('Exception Restarting IIS')
        stop = False
        start = False
        command_except = True
        return stop, start, command_except
    if result.find('Internet services successfully stopped') > -1:
        stop = True
    else:
        stop = False
        command_except = False
    if result.find('Internet services successfully restarted') > -1:
        start = True
        command_except = False
    else:
        start = False
        command_except = False
    return stop, start, command_except
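The output checks in restart_iis() can be separated from the subprocess call, which makes them easier to test. The sketch below is for Python 3, where `check_output` returns bytes rather than a string. The name `parse_iisreset_output` and the two constants are hypothetical; the marker strings themselves are the ones the script already searches for.

```python
STOPPED_MARKER = 'Internet services successfully stopped'
RESTARTED_MARKER = 'Internet services successfully restarted'


def parse_iisreset_output(output):
    """Return (stopped, started) flags based on the text iisreset printed."""
    # check_output returns bytes on Python 3, so decode before searching.
    if isinstance(output, bytes):
        output = output.decode('utf-8', errors='replace')
    stopped = STOPPED_MARKER in output
    started = RESTARTED_MARKER in output
    return stopped, started
```

With this in place, restart_iis() would only need to run the command, handle the exception, and return the parsed flags.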
The script would then set variables used in the notification to let us know what happened and the result of the automated actions.
        elif stop_command and start_command:
            # Restart successful
            stopresult = 'IIS has successfully stopped'
            startresult = 'IIS has successfully started'
            log_event(stopresult)
            log_event(startresult)
        elif stop_command is True and start_command is False:
            # Stopped and didn't come back up
            stopresult = 'IIS has successfully stopped'
            startresult = 'IIS was unable to start and is down'
            log_event(stopresult)
            log_event(startresult)
        elif stop_command is False:
            # Failed to stop
            stopresult = 'IIS was unable to stop.'
            log_event(stopresult)
Finally, the script would check again to see whether the site was back up and functioning properly, update another variable for the notification email with that status, and send the notification message.
If the site was still down the script would wait for 30 mins before checking again to prevent a flood of emails. Otherwise, it went back into the main loop.
        # See if the site is back online
        if not check_site():
            resolved = 'The site is still down. No further action will be taken.'
            log_event(resolved)
            email_alert(pingserver, stopresult, startresult, resolved)
            time.sleep(1800)
        else:
            resolved = 'The site is back up.'
            log_event(resolved)
            email_alert(pingserver, stopresult, startresult, resolved)
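The `log_event()` helper called throughout isn't shown in this post either. A minimal Python 3 sketch is below, assuming it simply appends a timestamped line to the log file. The `path` parameter and the timestamp format are assumptions; the default path is the `log_file` value from the configuration.

```python
import datetime


def log_event(message, path='c:/sitechecker/log.txt'):
    """Append a timestamped line to the log file (format is an assumption)."""
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    # Open in append mode so earlier entries are kept.
    with open(path, 'a') as log:
        log.write('%s %s\n' % (timestamp, message))
```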
Looking back, the email function could be cleaned up, but it worked well enough for taking the variables and putting together a text-only alert email.
The email would include whether the server was responding to pings, whether IIS was successfully restarted, and whether the site was back up or still down.
def email_alert(pingresult, stopresult, startresult, resolved):
    # Use the pingresult argument rather than the global pingserver
    message = """From: %s
To: %s
Subject: Site Checker Alert

%s IIS has failed 3 times to properly display the correct webpage.
%s
%s
%s
%s
""" % (sender, recipient, server, pingresult, stopresult, startresult, resolved)
    send_email(sender, recipient, message, smtp_serv)
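The `send_email()` function that email_alert() calls isn't included above. A minimal Python 3 sketch using the standard library's `smtplib` might look like this, assuming an SMTP server that accepts unauthenticated mail from the monitoring host. The `smtp_factory` parameter is hypothetical and exists only so that the function can be tested without a mail server.

```python
import smtplib


def send_email(sender, recipient, message, smtp_serv, smtp_factory=smtplib.SMTP):
    """Send a pre-formatted message through the given SMTP server."""
    server = smtp_factory(smtp_serv)
    try:
        # sendmail takes the envelope sender, a list of recipients,
        # and the raw message text including its headers.
        server.sendmail(sender, [recipient], message)
    finally:
        # Always close the connection, even if sending fails.
        server.quit()
```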
Once in place, the script reduced the site's downtime to a little over a minute, down from a typical 30-45 mins depending on how soon a user was able to notify us.