Mark outlier spikes in a measurement time series — mark_outlier_spikes_median

Uses a simple heuristic to detect positive outliers (i.e. unusually high spikes) in a time series. The approach is to compare each measurement with the median of the measurements in a small centered window. If the measurements before and after are considerably lower than the current measurement, the median will also be much lower. A measurement that exceeds this median by a large margin is therefore marked as an outlier.

Usage

mark_outlier_spikes_median(
  df,
  measurement_col,
  date_col = date,
  window = 5,
  threshold_factor = 5,
  mad_window = 14,
  mad_lower_quantile = 0.05
)

Arguments

df: A data.frame containing the time series of measurements.
measurement_col: The name of the column with the measurements. Pass the bare (unquoted) column name, dplyr-style, not a character string.
date_col: The name of the column with the corresponding dates. Pass the bare (unquoted) column name, dplyr-style, not a character string.
window: The size of the centered window for computing the median.
threshold_factor: Beyond how many median absolute deviations from the median should a measurement be marked as an outlier?
mad_window: The size of the right-aligned, one-day-lagged window for the median absolute deviation. This should be longer than the window for the median.
mad_lower_quantile: At what quantile should the lower bound for the expected noise be? This is used to avoid false positives when concentration levels are very low.

Value

The provided data.frame, with an additional logical column is_outlier.
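
A call might look like the following. This is a minimal sketch: the column name concentration and the simulated values are made up for illustration, and the exists() guard only keeps the snippet runnable when the package providing the function is not attached.

```r
# Hypothetical example data: 30 daily measurements with one artificial spike.
set.seed(1)
df <- data.frame(
  date = seq(as.Date("2024-01-01"), by = "day", length.out = 30),
  concentration = c(
    rnorm(20, mean = 100, sd = 5),
    400,                              # artificial spike on day 21
    rnorm(9, mean = 100, sd = 5)
  )
)

# Column names are passed unquoted, as described under Arguments.
if (exists("mark_outlier_spikes_median")) {
  flagged <- mark_outlier_spikes_median(
    df,
    measurement_col = concentration,
    date_col = date
  )
  # Show only the rows marked as outliers.
  print(flagged[flagged$is_outlier, ])
}
```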

Details

To determine how much deviation from the median is significant, a moving median absolute deviation (MAD) of the measurements is used as an estimate of how much noise to expect; the MAD is more robust than the standard deviation. This seems to be more robust than just multiplying the median by a factor to determine the threshold. The moving MAD is lagged by one day so that the current value is not included. Moreover, note that because the window for the moving median is centered, the last window/2 dates have no spike detection.
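
The heuristic described above can be sketched in base R as follows. This is an illustration only, not the function's actual implementation: the helper name sketch_mark_spikes is hypothetical, it works on a plain numeric vector with one value per day, and it assumes that the lower bound on the noise is a low quantile of the moving MAD values and that points without a full window are simply not flagged. Note that stats::mad() applies a scaling constant of 1.4826 by default.

```r
# Illustrative sketch only; names and details are assumptions.
sketch_mark_spikes <- function(x, window = 5, threshold_factor = 5,
                               mad_window = 14, mad_lower_quantile = 0.05) {
  n <- length(x)
  half <- window %/% 2

  # Centered moving median; stays NA where the window does not fit.
  med <- rep(NA_real_, n)
  if (n >= 2 * half + 1) {
    for (i in (half + 1):(n - half)) {
      med[i] <- median(x[(i - half):(i + half)])
    }
  }

  # Right-aligned moving MAD over the mad_window values before i
  # (lagged by one), so the current value is not part of its own
  # noise estimate.
  noise <- rep(NA_real_, n)
  if (n > mad_window) {
    for (i in (mad_window + 1):n) {
      noise[i] <- mad(x[(i - mad_window):(i - 1)])
    }
  }

  # Assumed lower bound: a low quantile of the observed moving MADs.
  noise_floor <- quantile(noise, mad_lower_quantile,
                          na.rm = TRUE, names = FALSE)
  noise <- pmax(noise, noise_floor)

  # Flag values that exceed the local median by more than
  # threshold_factor times the expected noise.
  is_outlier <- x > med + threshold_factor * noise
  is_outlier[is.na(is_outlier)] <- FALSE
  is_outlier
}

# Toy series: a repeating low-level pattern with one spike at position 21.
x <- c(rep(c(10, 11, 9, 10, 12), 4), 100, 10, 11, 9)
which(sketch_mark_spikes(x))
```

In this toy series only the value 100 at position 21 is flagged. The last two positions are never evaluated, because the centered window of size 5 does not fit there.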

The method also allows for multiple measurements per day (replicates), where each replicate is evaluated individually. However, this currently does not give more weight to days with more replicates, i.e. ignores potential differences in measurement uncertainty.