Modeling by Adrian Lison
ETH Zurich, Department of Biosystems
Science and Engineering, Zurich, Switzerland
This document details the generative model used by
EpiSewer
to estimate reproduction numbers from wastewater
concentration measurements. The model is structured into 5 modules,
namely infections, shedding, sewage, sampling, and measurements.
Note that below, the full model with all options is specified, while
EpiSewer
excludes certain parts of the model by default
(e.g. no multi-day composite samples) - these have to be explicitly
selected by the user to be included.
Infections module
Reproduction number
Variable:
Link function
The model employs a link function to ensure that
regardless of the smoothing approach used. An obvious choice for this is
the log link, which however has the disadvantage that it implies an
asymmetric prior on changes in
(for example, a change from
to
would be much more likely than a change from
to
).
EpiSewer
therefore provides alternative link functions
which are configured to behave approximately like the identity function
around
but ensure that
.
Option 1: Inverse softplus
with sharpness parameter .
By default, we choose , which makes the softplus function behave effectively like for practically relevant values of while ensuring that .
Option 2: Scaled logit
with hyperparameters , , and .
Here, defines the maximum possible (default ). This link function is inspired from the epidemia package1, but has two further hyperparameters and , which are chosen such that the logistic function equals 1 at and also has a derivative of 1. Thus, the link function behaves roughly like the identity around , which makes the definition of adequate smoothing priors for easier.
Smoothing
is modeled via a time series or smoothing spline prior which ensure smoothness through temporal autocorrelation of the estimates.
Option 1: Random walk
with a normal prior on the intercept and a zero-truncated normal prior on the standard deviation of the random walk.
Option 2: Exponential smoothing
This option uses an innovations state space model with additive errors to implement an exponential smoothing prior (Holt’s linear trend method with dampening)2. For this, is modeled recursively as
where is the level, the trend, and the additive error at time , respectively.
Moreover, is a level smoothing parameter, a trend smoothing parameter, and a dampening parameter. We use normal priors for the intercepts of the level and trend , and a zero-truncated normal prior for the variance of the innovations, allowing these to be estimated from the data. We use Beta priors for the smoothing parameters and .
In theory, the dampening parameter can also be estimated from the data, but usually has to be fixed due to identifiability issues. The default is , i.e. the strength of the trend is assumed to halve approximately every week of predicting into the future.
Option 3: Penalized B-splines
where is a (by default cubic) B-spline basis evaluated at date index , and is a vector of spline coefficients, with being the degrees of freedom.
We here use penalized B-splines, which we implemented by regularizing the spline coefficients via a random walk3, i.e.
where the random walk variance is scaled by , the distance between knot and knot . This allows to provide a (zero-truncated normal) prior on which is independent of the distance between knots. By enforcing smoothness of the spline coefficients, the risk of overfitting the splines is greatly reduced. For the intercept , a normal prior reflecting the expectation of at the beginning of the time series is used.
Knots are placed uniformly according to a user-specified knot distance, however the knot distance is heuristically reduced for dates close to the present (which are only partially informed by data) to avoid erroneous extrapolation.
Infections
EpiSewer
uses a stochastic renewal model as described in
earlier work45 to model both expected and realized
infections.
Expected infections
Variable:
Seeding
We require a seeding phase at the start of the modeled time period, when the renewal model cannot be applied yet. Here we use a random walk on the log scale to model the expected number of initial infections.
with a normal prior on intercept and a zero-truncated normal prior on the standard deviation of the random walk.
Realized infections
Variable:
Option 2: Negative binomial
with expected infections and inverse overdispersion parameter . Following 6, we place a zero-truncated normal prior not on but on .
In the stan implementation, this is approximated as
Shedding module
Expected symptom onsets
Variable:
where is the incubation period distribution with maximum incubation period .
Total load shed in catchment
Variable:
Option 1: Without individual-level variation
where is the expected number of individuals with symptom onset on day , is the average total shedding load per person, and is the distribution of shedding over time, with shedding up to days after symptom onset.
Option 2: With individual-level variation
where is again the average total shedding load per person and the distribution of shedding over time, with shedding up to days after symptom onset. This is now multiplied by , which is defined as the sum of i.i.d. Gamma distributed individual relative shedding intensities with mean and coefficient of variation :
where is the realized number of symptom onsets. That is, describes the number of individuals with symptom onset on day , weighted by their shedding intensity, where an intensity indicates below-average shedding, and an intensity indicates above-average shedding. The coefficient of variation describes how much the shedding intensity varies between individuals.
We here approximate with its expectation , i.e.
and place a zero-truncated normal prior on .
In the stan implementation, we use a normal approximation of the Gamma distribution for to improve sampling efficiency. This approximation is highly accurate for sufficiently large , which will typically be the case.
Sewage module
Sampling module
Concentration in single-day samples, with sample effects
Variable:
where is a design matrix of different covariates describing the characteristics of samples on days 1 to , and is a vector of coefficients which estimate the association of the sample covariates with the concentration. We place independent normal priors on .
Note that if on some days no samples were taken, the corresponding row of the design matrix can be filled with arbitrary values, since they will not contribute to the likelihood.
Measurements module
Analysed concentration before replication stage
Variable:
We model as Log-Normal distributed with unit-scale mean (sample concentration) and unit-scale coefficient of variation , i.e.
where is the location and is the scale of the Log-Normal distribution. Here, measures the unexplained variation in concentrations before the replication stage (for example, if the replication stage is PCR quantification, then measures all variation before PCR). We place a zero-truncated normal prior on .
Limit of detection (LOD)
We implement a hurdle model for the limit of detection, where the probability for a zero measurement is described by an exponentially decreasing function of the underlying concentration, i.e.
where is the replicate concentration measurement from the (composite) sample on day , and and are parameters of the PCR. The PCR parameters are either estimated by the coefficient of variation model (see below, this may require informative priors and/or replicate measurements), or the product can be back-calculated if the user provides an LOD value established from a lab experiment. The functional form of the LOD model is based on the statistical properties of digital PCR and corresponds to a binomial likelihood for zero successes. For non-zero measurements, we conversely have
Measured concentration
Variable:
Non-zero concentration measurements are modeled as truncated normal distributed, i.e.
where
is the
replicate concentration measurement from the (composite) sample on day
.
Here,
is the mean and
the standard deviation of the truncated Normal distribution. Note that
EpiSewer
also offers a Log-Normal likelihood as an
alternative, however based on the properties of dPCR assays, the
truncated Normal should be the better approximation for the true
distribution of measurements.
Specifically, is the coefficient of variation for the replication stage and modeled as a function of the expected concentration : with , and as parameters to be estimated. The functional form is derived from the statistical properties of the binomial-distributed positive partition counts in the dPCR reaction, where
- represents the coefficient of variation for noise in the concentration before the dPCR reaction,
- represents the average number of partitions in the PCR reaction, and
- represents the dPCR conversion factor, which is the partition volume scaled by the dilution of the wastewater in the PCR reaction. This factor converts the wastewater concentration into the expected number of gene copies per partition. Note that here the dilution includes all extraction and preparation steps. For example, if the partition volume is 4.5e-7 mL and the dilution of the wastewater is 100:3 (i.e. 100 gc/mL in the original wastewater sample correspond to 3 gc/mL in the PCR reaction), then the overall scaling factor is 4.5e-7 * 100 / 3 = 1.5e-5.
We place zero-truncated normal priors on , , and , respectively.
If replicates are modeled and the replication stage is PCR quantification, then measures the noise only from the PCR and all other noise. In this case, should typically be very small. In contrast, if no replicates are modeled, then measures the total variation and is not estimated. In this case, will be larger.
Estimation
EpiSewer
is directly fitted to observed concentration
measurements
(where
is the index of the sample date and
the index of the replicate out of
replicates in total for date index
).
Using Markov Chain Monte Carlo (MCMC), we draw samples from the joint posterior distribution of and the other parameters
where is the likelihood as defined by the 5 modules of our generative model (infection, shedding, sewage, sampling, and measurement), and represents the priors on the non-hierarchical parameters of the model (e.g. intercepts and variances of random walks).
Note that many variables in the generative model (including !) are in fact only transformations of other variables and therefore do not count as model parameters in a strict sense.
Indexing of time
For ease of notation, we have here used a unified time index for all variables. In reality, to model at , we need to model and many other variables also at earlier time points due to the various delays between infections and observations. In the model implementation, each variable therefore has its own time index, and indices are mapped to each other accordingly.