Tifany Yung, Groupon, Inc.
Renato Martins, Groupon, Inc.
Metrics provide insight into system health, performance, and stability and must be monitored throughout the software release process. They are needed to catch issues before they reach production and affect end users, as well as to detect issues that do slip into production.
However, manual analysis of hundreds of these metrics, usually by visual inspection of their graphs, is time-consuming, subjective, and error-prone. Some automation may be achieved by setting alerts that trigger automatically when a metric crosses a critical threshold, but a metric may take time to reach that threshold, or may never reach a level that triggers the alert, thereby allowing moderately severe issues to slip through.
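The failure mode of static thresholds can be illustrated with a minimal sketch. The threshold value, metric name, and numbers below are purely illustrative, not taken from the system described here:

```python
CRITICAL_THRESHOLD = 500  # illustrative: alert when p99 latency exceeds 500 ms


def threshold_alert(values, threshold=CRITICAL_THRESHOLD):
    """Return indices of samples that crossed the static threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


# A moderate regression: latency roughly doubles from ~100 ms to ~200 ms
# after a deployment, yet no sample ever crosses the 500 ms alert line,
# so a purely threshold-based alert stays silent.
latencies = [100, 105, 98, 210, 205, 198, 202]
print(threshold_alert(latencies))  # no alerts fire: []
```

A reviewer eyeballing the graph might notice the step change, but the alert never fires, which is exactly the gap the behavior-change analysis below is meant to close.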
Therefore, it is still necessary to inspect metrics for anomalous behavior that occurs below critical thresholds. We describe an algorithm for automated detection of both individual anomalies and overall changes in metric behavior, which allows us to determine not only whether a metric’s behavior has changed after a deployment, but also which data points in the time series contributed to the change. A model of expected behavior is built and used to predict the test data values, and anomalous points are identified based on how far the predictions are from the actual observed values. The set of anomalies is then used to determine whether the test data exhibits different behavior from the model data.
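The two-stage idea (flag anomalous points by prediction error, then decide whether the overall behavior changed) can be sketched as follows. This is a simplified stand-in, not the paper’s actual model: here the “model of expected behavior” is just the baseline mean and standard deviation, and the parameters `k` and `anomaly_fraction` are illustrative assumptions:

```python
import statistics


def detect_behavior_change(baseline, test, k=3.0, anomaly_fraction=0.2):
    """Flag test points whose deviation from the baseline mean exceeds
    k standard deviations, then declare an overall behavior change if
    the share of anomalous points reaches anomaly_fraction.

    Returns (anomalous_indices, behavior_changed).
    """
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline) or 1e-9  # guard flat baselines
    anomalies = [i for i, v in enumerate(test) if abs(v - mean) / stdev > k]
    changed = len(anomalies) / len(test) >= anomaly_fraction
    return anomalies, changed


# Pre-deployment baseline hovers around 100; post-deployment test data
# jumps to ~150 for three consecutive points.
baseline = [100, 102, 98, 101, 99, 100]
test = [101, 150, 149, 152, 100]
anomalies, changed = detect_behavior_change(baseline, test)
print(anomalies, changed)  # indices 1, 2, 3 flagged; change detected
```

The key property mirrored from the abstract is that the output names both the verdict (did behavior change?) and the specific data points that contributed to it, so a reviewer sees exactly which samples triggered the flag.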
Human involvement is necessary only after alerts have been generated for the flagged metrics, and only to review them and make deployment decisions based on them. We found that our algorithm identified changes in our deployment metrics at least as well as manual monitoring of metric graphs, in some cases catching subtler changes in metric behavior that manual inspection missed, and it reduced the time spent analyzing metrics.