
Detecting anomalous spikes in a line graph via sub-sampling and standard deviation

Whoa, what a title…

What I want to talk about isn’t as complicated as it sounds. Basically, when you collect daily site-specific metrics for SEO purposes, you sometimes get bad data from your third-party sources.

These bad data points can skew your graphs and make it almost impossible to visually derive any useful information from them.

For example:

[Figure: Indexed Pages, before]

The graph above shows the number of indexed pages over a six-month date range. In early January the data spikes dramatically upwards. This is clearly an anomaly, as the day after, the data is back in its normal range. The trouble is that this one anomalous point skews the scale of the entire graph, making it impossible to derive any real insights from it.

Here is the same graph WITHOUT that one bad piece of data:

[Figure: Indexed Pages, after]

It’s amazing how one bad data point can skew your entire graph, isn’t it? Now we have an image that is useful!

Here’s some (hackish & sketchy) code to identify these statistical anomalies and replace them:
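A minimal sketch of the idea in Ruby, and not necessarily the exact approach used on the real data. It assumes the metric is a plain array of integer daily counts. For each point it takes a small sub-sample of neighbouring days (leaving the point itself out), and if the point sits more than a few standard deviations from the sub-sample mean, it swaps in that mean. The names `smooth_spikes`, `window` and `threshold`, and their default values, are made up for illustration.

```ruby
# Hypothetical helpers for illustration only.
def mean(values)
  values.sum.to_f / values.size
end

# Population standard deviation of the sub-sample.
def std_dev(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / values.size)
end

# Returns a new array in which every point lying more than `threshold`
# standard deviations from the mean of its neighbours is replaced by
# that mean (rounded, since the counts are assumed to be integers).
def smooth_spikes(series, window: 3, threshold: 3.0)
  series.each_with_index.map do |value, i|
    lo = [i - window, 0].max
    hi = [i + window, series.size - 1].min

    # Sub-sample: up to `window` days either side, excluding the point itself.
    neighbours = (lo..hi).reject { |j| j == i }.map { |j| series[j] }
    next value if neighbours.size < 2

    m  = mean(neighbours)
    sd = std_dev(neighbours)

    # Note: if the neighbours are perfectly flat (sd == 0), any deviation
    # at all is treated as a spike.
    (value - m).abs > threshold * sd ? m.round : value
  end
end

series = [100, 102, 101, 5000, 103, 99, 100]
p smooth_spikes(series)
# => [100, 102, 101, 101, 103, 99, 100]
```

Leaving the point out of its own sub-sample matters: a big enough spike would otherwise inflate the mean and standard deviation it is being compared against and could slip through. The days right next to the spike are left alone, because the spike in their sub-sample makes the standard deviation large. The window size and threshold will need tuning for your own data.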
