Site Downtime Post-Mortem

This morning we had an unexpected outage of Analytics SEO. By the time I’d received the “system down” text message, done a quick ping and ssh test to the main web server, and got on the phone to our hosting provider, they had already got on the case, having been alerted by their own internal monitoring systems.

Within eight minutes they had the server back up, and the site was running nicely again – good work Darren at TSO Hosting!


The next stage was working out what caused the outage so we can prevent it from happening again. This was made massively easier by our recent decision to invest in New Relic monitoring software. The first thing we checked was the overall health of the web server over the last week:

This clearly shows that our CPU, disk and network utilisation are all pretty healthy – the failure was caused by running out of memory. We can also see that in normal operation we have plenty of memory – typical utilisation is around 30-40%. Something changed around 08:00 on the 19th: memory usage began to climb gradually throughout the day, physical memory ran out around 23:00 and the server started swapping, and it finally failed completely around 10:00 today.

The next chart to check showed what Apache was doing:

So we can now see that the server ran out of memory because the number of running Apache instances rose from the usual 20 or so up to over 100, exhausting the available physical server memory. But what caused the number of Apache processes to go up?

There are two likely explanations:

  1. we started receiving substantially more page requests
  2. it started to take us much longer than usual to respond to page requests
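
As a rough way to see why these are the two candidates: if each Apache process handles one request at a time (an assumption about our setup, not something the charts show directly), the average number of busy processes is roughly the request rate multiplied by the time each request takes, a relationship known as Little's Law. The sketch below uses Python purely for illustration, and the numbers are made up rather than measured during the outage.

```python
def expected_busy_workers(requests_per_second: float, mean_response_seconds: float) -> float:
    """Little's Law: average requests in flight = arrival rate x time each one takes."""
    return requests_per_second * mean_response_seconds


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the outage.
    print(expected_busy_workers(10.0, 2.0))   # a baseline
    print(expected_busy_workers(50.0, 2.0))   # explanation 1: five times the traffic
    print(expected_busy_workers(10.0, 10.0))  # explanation 2: responses five times slower
```

Either factor on its own can push the process count up by the same amount, which is why both had to be checked.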

Google Analytics could give us a quick answer on the first explanation, but an increase in traffic following this pattern would look more like a denial of service attack than organic growth, since we haven’t done any significant marketing activity in the last few days which could account for it. If it were an attack, the additional requests probably wouldn’t show up in Google Analytics.

Analysis of the raw Apache log file using AWStats showed that the request pattern hadn’t changed significantly, which led us to conclude that something must have happened to the response times. This led us to the next interesting chart:

This chart shows the same pattern in terms of memory usage – Apache remains consistently low for most of the week, then shows a steady climb up to failure. The memory usage of our other processes on that server remains pretty stable. The CPU consumer chart shows something really interesting though – the node process (node.js, which we use to generate our nice-looking Highcharts-based charts on the server for inclusion in downloadable PDF and PowerPoint reports) showed a massive increase in CPU utilisation, which never dropped back down.

This led us to our hypothesis about the cause of the failure:

  1. something went wrong in node.js which caused it to increase CPU utilisation and lock up
  2. when users requested reports, or we generated a scheduled report, the Apache process would request a chart from node.js
  3. node.js wouldn’t respond
  4. because we request these charts using curl with no timeout set, the Apache process would wait indefinitely for a response which never came
  5. as other users continued to use the site, and more reports were requested, the number of Apache processes continued to grow until the server failed
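
Steps 3 to 5 above can be reproduced in miniature. The sketch below is an analogy rather than our actual stack: Python threads stand in for Apache processes, and an event that is never set stands in for the unresponsive node.js process. The function name and the numbers are hypothetical.

```python
import threading
import time
from typing import Optional


def count_stuck_workers(requests: int, timeout: Optional[float], observe_after: float) -> int:
    """Start one worker per request against a backend that never answers, then
    report how many workers are still blocked after `observe_after` seconds."""
    # Never set while we observe: stands in for the unresponsive node.js process.
    backend_replied = threading.Event()

    def handle_request() -> None:
        # With timeout=None this waits indefinitely, like curl with no timeout set.
        backend_replied.wait(timeout)

    workers = [threading.Thread(target=handle_request, daemon=True) for _ in range(requests)]
    for worker in workers:
        worker.start()

    time.sleep(observe_after)
    stuck = sum(worker.is_alive() for worker in workers)

    backend_replied.set()  # release the waiting workers so the demo exits cleanly
    for worker in workers:
        worker.join()
    return stuck


if __name__ == "__main__":
    print("no timeout:  ", count_stuck_workers(5, None, 0.2), "workers still waiting")
    print("with timeout:", count_stuck_workers(5, 0.05, 0.5), "workers still waiting")
```

Without a timeout every worker is still tied up when we look, and each new request adds another; with a timeout the workers give up and become free again.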

Once we had our hypothesis, the actions to prevent a recurrence were pretty obvious:

  1. put a timeout on the curl requests to node.js
  2. implement a nightly restart of node.js, and a nightly graceful restart of Apache
  3. set up an alert on memory usage so that we know when usage reaches a high level, not when it reaches a critical level!
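
A minimal sketch of the first and third actions follows. It is in Python purely for illustration, so it is not the code we run: the function names, the 10-second timeout and the 70%/90% thresholds are all assumptions chosen for the example. With curl itself, the equivalent of the timeout is the `--max-time` command-line option, or `CURLOPT_TIMEOUT` when using libcurl.

```python
import urllib.request
from typing import Optional

# Hypothetical values: pick ones that suit how long a chart normally takes.
CHART_TIMEOUT_SECONDS = 10.0
MEMORY_WARNING_PERCENT = 70.0
MEMORY_CRITICAL_PERCENT = 90.0


def fetch_chart(url: str, timeout: float = CHART_TIMEOUT_SECONDS) -> Optional[bytes]:
    """Return the chart bytes, or None if the chart service fails or hangs.

    The timeout applies to each blocking socket operation (connect, read),
    so a service that accepts the connection but never answers is abandoned
    instead of tying up the calling process indefinitely.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError, socket timeouts and connection resets are all OSError subclasses.
        return None


def memory_alert_level(used_percent: float) -> str:
    """Classify memory usage so that a warning fires well before the critical level."""
    if used_percent >= MEMORY_CRITICAL_PERCENT:
        return "critical"
    if used_percent >= MEMORY_WARNING_PERCENT:
        return "warning"
    return "ok"
```

A caller would treat a `None` result as "chart unavailable" and fail that one report, rather than holding a process open. With typical utilisation around 30-40%, a warning threshold somewhere above that range would have fired during the slow climb, hours before the server started swapping.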

By: Mark Bennett
