Modeling by Adrian Lison
ETH Zurich, Department of Biosystems
Science and Engineering, Zurich, Switzerland
This document details the generative model used by
EpiSewer to estimate reproduction numbers from wastewater
concentration measurements. The model is structured into 5 modules,
namely infections, shedding, sewage, sampling, and measurements.
Note that below, the full model with all options is specified, while
EpiSewer excludes certain parts of the model by default
(e.g. no multi-day composite samples) - these have to be explicitly
selected by the user to be included.
Infections module
Reproduction number
Variable:
Link function
The model employs a link function to ensure that
regardless of the smoothing approach used. An obvious choice for this is
the log link, which however has the disadvantage that it implies an
asymmetric prior on changes in
(for example, a change from
to
would be much more likely than a change from
to
).
EpiSewer therefore provides alternative link functions
which are configured to behave approximately like the identity function
around
but ensure that
.
Option 1: Inverse softplus
with sharpness parameter .
By default, we choose , which makes the softplus function behave effectively like for practically relevant values of while ensuring that .
Option 2: Scaled logit
with hyperparameters , , and .
Here, defines the maximum possible (default ). This link function is inspired from the epidemia package1, but has two further hyperparameters and , which are chosen such that the logistic function equals 1 at and also has a derivative of 1. Thus, the link function behaves roughly like the identity around , which makes the definition of adequate smoothing priors for easier.
Smoothing
is modeled via a time series or smoothing spline prior which ensure smoothness through temporal autocorrelation of the estimates.
Option 1: Gaussian process (default)
This option models using Gaussian processes (GPs) with a Matérn kernel, i.e.
where is the long-term trend, is the short-term deviation from the trend, modeled each as independent Gaussian processes with Matérn kernel and smoothness parameter . Importantly, we choose priors for the kernel parameters such that the length scale and magnitude of the long-term trend are in expectation much larger than those of the short-term deviation, i.e. and . By decomposing Rt into long- and short-term components, we obtain desirable behavior in real-time estimation: absent a clear signal, the short-term deviation component will revert to zero, such that Rt will follow the long-term trend.
Option 2: Random walk
This option smoothes Rt using a simple random walk model, i.e.
with a normal prior on the intercept and a zero-truncated normal prior on the standard deviation of the random walk.
Option 3: Exponential smoothing
This option uses an innovations state space model with additive errors to implement an exponential smoothing prior (Holt’s linear trend method with damping)2. For this, is modeled recursively as
where is the level, the trend, and the additive error at time , respectively.
Moreover, is a level smoothing parameter, a trend smoothing parameter, and a damping parameter. We use normal priors for the intercepts of the level and trend , and a zero-truncated normal prior for the variance of the innovations, allowing these to be estimated from the data. We use Beta priors for the smoothing parameters and .
In theory, the damping parameter can also be estimated from the data, but usually has to be fixed due to identifiability issues. The default is , i.e. the strength of the trend is assumed to halve approximately every week of predicting into the future.
Option 4: Splines
where is a (by default cubic) B-spline basis evaluated at date index , and is a vector of spline coefficients, with being the degrees of freedom.
We here use penalized B-splines, implemented by regularizing the spline coefficients via a random walk3, i.e.
where the random walk variance is scaled by , the distance between knot and knot . This allows to provide a (zero-truncated normal) prior on which is independent of the distance between knots. By enforcing smoothness of the spline coefficients, the risk of overfitting the splines is greatly reduced. For the intercept , a normal prior reflecting the expectation of at the beginning of the time series is used.
Knots are placed uniformly according to a user-specified knot distance, however the knot distance is heuristically reduced for dates close to the present (which are only partially informed by data) to avoid erroneous extrapolation.
EpiSewer also supports multi-level spline smoothing by decomposing as
where is an intercept, a component for smaller, short-term variation, and a component for larger, long-term variation. The two components are then smoothed using penalized cubic splines as described above, i.e.
Option 5: Smooth derivative
This option enforces a smooth first derivative of by placing cubic splines on the first-order differences of ,
where the spline coefficients are given a sparse Normal-Exponential-Gamma prior for . This prior has a long tail that supports large changes in . The first and last spline coefficient are fixed to 0 to avoid erroneous extrapolation towards the boundaries. This also has the effect that trends in level off towards the present.
Option 6: Piecewise constant
This option models as a piecewise constant function. The changepoints of the constant pieces are estimated from the data using an approximate changepoint model, corresponding roughly to the following model:
where is the intercept, are the estimated changepoint times of different changepoints, and is the change in at changepoint , modeled via a sparse Normal-Exponential-Gamma prior for . This prior has a long tail that supports large changes in . Generally, this option is suitable when follows a step-like trajectory with occasional large jumps, e.g. due to strong interventions. It is less suitable for modeling gradual changes in over time.
Addition: Changepoint model for variability
Options 2-4 also support changes in the variability of over time. For example, during the height of an epidemic wave, countermeasures may lead to much faster changes in than observable at other times (baseline). This potential additional variability is modeled using a linear change point model for (or for when using splines). The change points are placed at regular intervals, and the additional variation at the changepoints is modeled as independently distributed and following an Exponential-Gamma (EG) distribution, which has a strong peak towards zero and a long tail.
Infections
EpiSewer uses a stochastic renewal model as described in
earlier work45 to model both expected and realized
infections.
Expected infections
Variable:
We model the expected number of infections via the renewal equation
where is the number of realized infections that occurred days before and is a discrete generation time distribution with maximum generation time .
Realized infections
Variable:
Option 2: Negative binomial
with expected infections and inverse overdispersion parameter . Following 6, we place a zero-truncated normal prior not on but on .
In the stan implementation, this is approximated as
Seeding
We require a seeding phase at the start of the modeled time period, which describes the time evolution of infections when the renewal model cannot be applied yet.
Option 1: Constant
This option assumes a constant number of expected infections during the seeding phase, which is modeled via a normal prior on the log scale:
Option 2: Random walk
This option models the expected number of infections during the seeding phase via random walk on the log scale:
with a normal prior on intercept and a zero-truncated normal prior on the standard deviation of the random walk.
Option 3: Growth rate model
This option models the expected number of infections during the seeding phase as an exponential growth process:
Here the exponential growth rate follows a backwards-in-time random walk with standard deviation and initial value . The value corresponds to the last growth rate of the seeding phase, which we set to the first estimated reproduction number at the start of the modeling phase. We achieve this using an approach from the EpiAware package7, i.e. by solving the renewal equation within the model. This means that the intercept of the seeding phase growth rate depends on the prior provided for the initial reproduction number.
Shedding module
Expected symptom onsets
Variable:
where is the incubation period distribution with maximum incubation period . Note that this model component is only relevant when the shedding load distribution is referenced by the date of symptom onset. When the shedding load distribution is referenced by the date of infection, this component can be subsumed in the shedding load model.
Shedding load distribution
Variable:
The vector represents the distribution of shedding over time, with shedding up to days after infection or after symptom onset (depending on the model specification). It can be either assumed (fixed) or estimated (requires a prior). When estimating the shedding load distribution, priors are placed on the mean and coefficient of variation of a positive continuous distribution (gamma or log-normal), which is transformed into a discrete distribution within the model. This approach is inspired by the modeling of epidemiological distributions in the EpiNow28 and epinowcast9 packages.
EpiSewer also supports modeling the shedding load distribution from multiple priors in order to account for estimates from different studies or sensitivity analyses. Using a set of prior pairs, we model and as the weighted mean of prior parameters and , with the weights drawn from a Dirichlet distribution. By default, we use a symmetric Dirichlet prior with concentration parameter , corresponding to a uniform distribution over the weight simplex (i.e. we interpolate between the provided priors). By choosing a smaller concentration parameter, a more isolated support of the individual priors provided can be achieved. It is also possible to apply differential weighting of results from different studies through the parameters of the Dirichlet prior.
Total load shed in catchment
Variable:
Option 1: Without individual-level variation
where
is the average total shedding load per person (see
vignette("load_per_case")),
the shedding load distribution (see above), and
is the expected number of individuals with symptom onset on day
.
Option 2: With individual-level variation
where is the average total shedding load per person and the shedding load distribution (see above). This is now multiplied by , which is defined as the sum of i.i.d. Gamma distributed individual relative shedding intensities with mean and coefficient of variation :
where is the realized number of symptom onsets. That is, describes the number of individuals with symptom onset on day , weighted by their shedding intensity, where an intensity indicates below-average shedding, and an intensity indicates above-average shedding. The coefficient of variation describes how much the shedding intensity varies between individuals.
We here approximate with its expectation , i.e.
and place a zero-truncated normal prior on .
In the stan implementation, we improve sampling efficiency by using a normal approximation of the Gamma distribution for when is large ().
Sewage module
Sampling module
Concentration in single-day samples, with sample effects and outliers
Variable:
where is a design matrix of different covariates describing the characteristics of samples on days 1 to , and is a vector of coefficients which estimate the association of the sample covariates with the concentration. We place independent normal priors on .
Moreover, represents positive outlier concentrations that violate the uniform mixing assumption. We place a heavily right-tailed Type II extreme value prior on . By default, parameters are chosen such that 99% of the probability mass is below a concentration that would be observable from one infection in the catchment.
Note that if on some days no samples were taken, the corresponding
row of the sample effects design matrix
can be filled with arbitrary values, since they will not contribute to
the likelihood. To estimate weekday effects, EpiSewer provides a
function vignette("sample_effects_estimate_weekday") that
automatically creates a design matrix for the day of the week.
Measurements module
Analysed concentration before replication
Variable:
We model as Log-Normal distributed with unit-scale mean (sample concentration) and unit-scale coefficient of variation , i.e.
where is the location and is the scale of the Log-Normal distribution. Here, measures the unexplained variation in concentrations before the replication stage (for example, if you provide technical PCR replicates, then measures all variation before the PCR). We place a zero-truncated normal prior on .
Note that pre-replication variation is only modeled when providing
replicate measurements to EpiSewer and specifying
replicates = TRUE in the noise_estimate_
component.
Measured concentration
Non-detects
EpiSewer implements a hurdle model to account for non-detects (i.e. zero measurements). This is closely related to the limit of detection (LOD), below which sample concentrations cannot be reliably detected (typically defined as less than 95% probability of detection).
Option 1: Assumed LOD
This option models the probability of non-detection as an exponentially decreasing function of the underlying concentration, i.e.
where is the replicate concentration measurement from the (composite) sample on day . The parameter is calculated from an LOD value provided by the user, typically established from a lab experiment. For example, if the LOD was defined as the lowest concentration for which the detection probability was still above 95%, then .
Given the probability of non-detection, the probability for a non-zero measurement is
Option 2: Non-detection in digital PCR (dPCR)
This option models the probability of non-detection based on the statistical properties of digital PCR assays, using a binomial likelihood for zero successes. This depends on several parameters of the assay, such as the number and volume of partitions - which can be either assumed or estimated. by EpiSewer. Please see this preprint for details.
Non-zero measurements
Variable:
represents the replicate concentration measurement from the (composite) sample on day . Non-zero concentration measurements are modeled using a positive continuous distribution (Gamma by default, but Log-Normal and Truncated Normal are also available). This distribution is parameterized by its mean and coefficient of variation.
For the mean, we assume unbiased measurements, such that in principle . Nevertheless, the mean of non-zero measurements must be adjusted for the conditioning on detection, i.e.
For the coefficient of variation , EpiSewer currently offers three options (see below). In all cases, describes all remaining unexplained variation in the measurements. This means that if replicates are provided and variation before the replication stage is already accounted for (see above), then describes only the variation between the replicates. Otherwise it will also include variation before the replication stage.
Option 1: Constant CV
This option assumes a constant coefficient of variation for all measurements, i.e.
where is a parameter to be estimated. We place a zero-truncated normal prior on .
Option 2: CV based on digital PCR (dPCR)
This option models the CV as a function of the expected concentration , and the functional relationship is derived from the statistical properties of digital PCR assays, i.e. based on the binomial-distributed positive partition counts in the dPCR reaction. As a result, the CV is approximately constant for high concentrations, but increases exponentially as concentrations become small (it also increases when concentrations become so large that the assay saturates, but this should be avoided in lab practice through appropriate dilution). The exact functional relationship depends on several parameters of the assay, such as the number and volume of partitions - which can be either assumed or estimated. by EpiSewer. Please see this preprint for details.
Estimation
EpiSewer is directly fitted to observed concentration
measurements
(where
is the index of the sample date and
the index of the replicate out of
replicates in total for date index
).
Using Markov Chain Monte Carlo (MCMC), we draw samples from the joint posterior distribution of and the other parameters
where is the likelihood as defined by the 5 modules of our generative model (infection, shedding, sewage, sampling, and measurement), and represents the priors on the non-hierarchical parameters of the model (e.g. intercepts and variances of random walks).
Note that many variables in the generative model (including !) are in fact only transformations of other variables and therefore do not count as model parameters in a strict sense.
Indexing of time
For ease of notation, we have here used a unified time index for all variables. In reality, to model at , we need to model and many other variables also at earlier time points due to the various delays between infections and observations. In the model implementation, each variable therefore has its own time index, and indices are mapped to each other accordingly.