Skip to contents

Modeling by Adrian Lison
ETH Zurich, Department of Biosystems Science and Engineering, Zurich, Switzerland

This document details the generative model used by EpiSewer to estimate reproduction numbers from wastewater concentration measurements. The model is structured into 5 modules, namely infections, shedding, sewage, sampling, and measurements.

Note that below, the full model with all options is specified, while EpiSewer excludes certain parts of the model by default (e.g. no multi-day composite samples) - these have to be explicitly selected by the user to be included.

Infections module

Reproduction number

Variable: RtR_t

The model employs a link function to ensure that Rt>0R_t>0 regardless of the smoothing approach used. An obvious choice for this is the log link, which however has the disadvantage that it implies an asymmetric prior on changes in RtR_t (for example, a change from Rt=2R_t=2 to Rt=1R_t=1 would be much more likely than a change from Rt=1R_t=1 to Rt=2R_t=2). EpiSewer therefore provides alternative link functions which are configured to behave approximately like the identity function around Rt=1R_t=1 but ensure that Rt>0R_t>0.

Option 1: Inverse softplus

link1(x)=softplusk(x)=log(1+ekx)k\begin{equation} \text{link}^{-1}(x) = \text{softplus}_k(x) = \frac{\log(1+e^{kx})}{k} \end{equation}

with sharpness parameter kk.

By default, we choose k=4k=4, which makes the softplus function behave effectively like f(x)=xf(x)=x for practically relevant values of RtR_t while ensuring that Rt>0R_t>0.

Option 2: Scaled logit

link1(x)=logisticc,a,k(x)=c1+aekx\begin{equation} \text{link}^{-1}(x) = \text{logistic}_{c, a, k}(x) = \frac{c}{1 + a \, e^{-k \, x}} \end{equation}

with hyperparameters cc, aa, and kk.

Here, cc defines the maximum possible RtR_t (default Rmax=6R_\max=6). This link function is inspired from the epidemia package1, but has two further hyperparameters aa and kk, which are chosen such that the logistic function equals 1 at x=1x=1 and also has a derivative of 1. Thus, the link function behaves roughly like the identity f(x)=xf(x)=x around Rt=1R_t=1, which makes the definition of adequate smoothing priors for RtR_t easier.

Smoothing

RtR_t is modeled via a time series or smoothing spline prior which ensure smoothness through temporal autocorrelation of the RtR_t estimates.

Option 1: Random walk

link(Rt)|Rt1N(link(Rt1),σlink(R)2)\begin{equation} \text{link}(R_t)\,|\,R_{t-1} \sim N(\text{link}(R_{t-1}),{\sigma_{\text{link}(R)}}^2) \end{equation}

with a normal prior on the intercept R1R_1 and a zero-truncated normal prior on the standard deviation σlink(R)\sigma_{\text{link}(R)} of the random walk.

Option 2: Exponential smoothing

This option uses an innovations state space model with additive errors to implement an exponential smoothing prior (Holt’s linear trend method with dampening)2. For this, RtR_t is modeled recursively as

link(Rt)=lt1+ϕbt1+ϵtlt=lt1+ϕbt1+αϵtbt=ϕbt1+βϵtϵtN(0,σlink(R)2),\begin{align*} \text{link}(R_t) &= l_{t-1} + \phi' \, b_{t-1} + \epsilon_t \\ l_t &= l_{t-1} + \phi' \, b_{t-1} + \alpha' \, \epsilon_t \\ b_t &= \phi' \, b_{t-1} + \beta' \, \epsilon_t \\ \epsilon_t &\sim N(0,{\sigma_{\text{link}(R)}}^2), \end{align*}

where ltl_t is the level, btb_t the trend, and ϵt\epsilon_t the additive error at time tt, respectively.

Moreover, α\alpha' is a level smoothing parameter, β\beta' a trend smoothing parameter, and ϕ\phi' a dampening parameter. We use normal priors for the intercepts of the level l1l_1 and trend b1b_1, and a zero-truncated normal prior for the variance σlink(R)2{\sigma_{\text{link}(R)}}^2 of the innovations, allowing these to be estimated from the data. We use Beta priors for the smoothing parameters α\alpha' and β\beta'.

In theory, the dampening parameter ϕ\phi' can also be estimated from the data, but usually has to be fixed due to identifiability issues. The default is ϕ=0.9\phi'=0.9, i.e. the strength of the trend is assumed to halve approximately every week of predicting into the future.

Option 3: Penalized B-splines

link(Rt)=𝐁[t,]a\begin{equation} \text{link}(R_t) = \boldsymbol{B}_{[t,\,]} \, a \end{equation}

where 𝐁[t,]\boldsymbol{B}_{[t,\,]} is a (by default cubic) B-spline basis evaluated at date index tt, and a=(a1,a2,,adf)a = (a_1, a_2, \dots, a_\text{df}) is a vector of spline coefficients, with df=number of knots+spline degree\text{df} = \text{number of knots} + \text{spline degree} being the degrees of freedom.

We here use penalized B-splines, which we implemented by regularizing the spline coefficients via a random walk3, i.e.

ai|ai1N(ai1,σa2δi)\begin{equation} a_{i}\,|\,a_{i-1} \sim N(a_{i-1}, \sigma_{a}^2 \, \delta_i) \end{equation}

where the random walk variance σa2\sigma_{a}^2 is scaled by δi\delta_i, the distance between knot ii and knot i1i-1. This allows to provide a (zero-truncated normal) prior on σa\sigma_{a} which is independent of the distance between knots. By enforcing smoothness of the spline coefficients, the risk of overfitting the splines is greatly reduced. For the intercept a1a_1, a normal prior reflecting the expectation of RtR_t at the beginning of the time series is used.

Knots are placed uniformly according to a user-specified knot distance, however the knot distance is heuristically reduced for dates close to the present (which are only partially informed by data) to avoid erroneous extrapolation.

Infections

EpiSewer uses a stochastic renewal model as described in earlier work45 to model both expected and realized infections.

Expected infections

Variable: ιt\iota_t

Seeding

We require a seeding phase at the start of the modeled time period, when the renewal model cannot be applied yet. Here we use a random walk on the log scale to model the expected number of initial infections.

log(ιt)|ιt1N(log(ιt1),σlog(ι)2)|tG\begin{equation} \log(\iota_t)|\iota_{t-1} \sim N(\log(\iota_{t-1}),{\sigma_{\log(\iota)}}^2) \,|\, t \leq G \end{equation}

with a normal prior on intercept log(ι1)\log(\iota_1) and a zero-truncated normal prior on the standard deviation σlog(ι)\sigma_{\log(\iota)} of the random walk.

Renewal model

ιt=E[It]=Rts=1GτsgenIts|t>G\begin{equation} \iota_t = E[I_t] = R_t \sum_{s=1}^G \tau^\text{gen}_s \, I_{t-s} \,|\, t > G \end{equation}

where ItsI_{t-s} is the number of realized infections that occurred ss days before tt, and τgen=(τ1gen,τ2gen,,τGgen)\tau^\text{gen} = (\tau^\text{gen}_1, \tau^\text{gen}_2, \dots, \tau^\text{gen}_G) the discrete generation time distribution with maximum generation time GG.

Realized infections

Variable: ItI_t

Option 1: Poisson

It|ιtPoisson(ιt)\begin{equation} I_t|\iota_t \sim \text{Poisson}(\iota_t) \end{equation}

with expected infections ιt\iota_t.

In the stan implementation, this is approximated as It|ιtN(ιt,ιt)I_t|\iota_t \sim N\left(\iota_t, \iota_t\right)

Option 2: Negative binomial

It|ιtNegative binomial(μ=ιt,σ2=ιt+ιt2ϕ)\begin{equation} I_t|\iota_t \sim \text{Negative binomial}(\mu = \iota_t, \sigma^2 = \iota_t + \frac{\iota_t^2}{\phi}) \end{equation}

with expected infections ιt\iota_t and inverse overdispersion parameter ϕ\phi . Following 6, we place a zero-truncated normal prior not on ϕ\phi but on ξ=1ϕ\xi = \frac{1}{\sqrt{\phi}}.

In the stan implementation, this is approximated as It|ιtN(ιt,ιt+ιt2ϕ)I_t|\iota_t \sim N\left(\iota_t, \iota_t + \frac{\iota_t^2}{\phi}\right)

Shedding module

Expected symptom onsets

Variable: λt\lambda_t

λt=s=0LItsτsinc\begin{equation} \lambda_t = \sum_{s=0}^L I_{t-s}\, \tau^\text{inc}_s \end{equation}

where τinc=(τ0inc,τ1inc,,τLinc)\tau^\text{inc} = (\tau^\text{inc}_0, \tau^\text{inc}_1, \dots, \tau^\text{inc}_L) is the incubation period distribution with maximum incubation period LL.

Total load shed in catchment

Variable: ωt\omega_t

Option 1: Without individual-level variation

ωt=s=0Sλtsμloadτsshed\begin{equation} \omega_t = \sum_{s=0}^S \lambda_{t-s}\, \mu^\text{load}\, \tau^\text{shed}_s \end{equation} where λt\lambda_{t} is the expected number of individuals with symptom onset on day tt, μload\mu^\text{load} is the average total shedding load per person, and τshed=(τ0shed,τ1shed,,τSshed)\tau^\text{shed} = (\tau^\text{shed}_0, \tau^\text{shed}_1, \dots, \tau^\text{shed}_S) is the distribution of shedding over time, with shedding up to SS days after symptom onset.

Option 2: With individual-level variation

ωt=s=0Sζtsμloadτsshed\begin{equation} \omega_t = \sum_{s=0}^S \zeta_{t-s}\, \mu^\text{load}\, \tau^\text{shed}_s \end{equation} where (μload)(\mu^\text{load}) is again the average total shedding load per person and (τshed=(τ0shed,τ1shed,,τSshed))(\tau^\text{shed} = (\tau^\text{shed}_0, \tau^\text{shed}_1, \dots, \tau^\text{shed}_S)) the distribution of shedding over time, with shedding up to SS days after symptom onset. This is now multiplied by ζt\zeta_t, which is defined as the sum of i.i.d. Gamma distributed individual relative shedding intensities with mean 11 and coefficient of variation νζ\nu_\zeta:

ζti=1ΛtGamma(mean=1,cv=νζ)=Gamma(α=Λtνζ2,β=1νζ2)\begin{equation} \zeta_t \sim \sum_{i=1}^{\Lambda_t} \, \text{Gamma}(\text{mean} = 1, \text{cv} = \nu_\zeta) = \text{Gamma}(\alpha = \frac{\Lambda_t}{\nu_\zeta^2}, \beta = \frac{1}{\nu_\zeta^2}) \end{equation}

where Λt\Lambda_t is the realized number of symptom onsets. That is, ζt\zeta_t describes the number of individuals with symptom onset on day tt, weighted by their shedding intensity, where an intensity <1<1 indicates below-average shedding, and an intensity >1>1 indicates above-average shedding. The coefficient of variation νζ\nu_\zeta describes how much the shedding intensity varies between individuals.

We here approximate Λt\Lambda_t with its expectation λt=E[Λt]\lambda_t = E[\Lambda_t], i.e.

ζtGamma(α=λtνζ2,β=1μloadνζ2),\begin{equation} \zeta_t \sim \text{Gamma}(\alpha = \frac{\lambda_t}{\nu_\zeta^2}, \beta = \frac{1}{\mu^\text{load} \nu_\zeta^2}), \end{equation}

and place a zero-truncated normal prior on νζ\nu_\zeta.

In the stan implementation, we use a normal approximation of the Gamma distribution for ζt\zeta_t to improve sampling efficiency. This approximation is highly accurate for sufficiently large λt\lambda_t, which will typically be the case.

Sewage module

Load at sampling site

Variable: πt\pi_t

πt=d=0Dωtdτdres\begin{equation} \pi_t = \sum_{d=0}^D \omega_{t-d}\, \tau^\text{res}_d \end{equation}

where (τres=(τ0res,τ1res,,τDres))(\tau^\text{res} = (\tau^\text{res}_0, \tau^\text{res}_1, \dots, \tau^\text{res}_D)) is the sewer residence time distribution of pathogen particles.

Concentration at sampling site

Variable: χt\chi_t

χt=πtflowt\begin{equation} \chi_t = \frac{\pi_t}{\text{flow}_t} \end{equation}

where flowt\text{flow}_t is a measure of the total wastewater volume upstream from the sampling site on day tt.

Sampling module

Concentration in single-day samples, with sample effects

Variable: κt\kappa_t

κt=χt+𝐗η\begin{equation} \kappa_t = \chi_t + \boldsymbol{X} \, \eta \end{equation}

where 𝐗\boldsymbol{X} is a T×KT \times K design matrix of KK different covariates describing the characteristics of samples on days 1 to TT, and η=(η1,η2,,ηK)\eta = (\eta_1, \eta_2, \dots, \eta_K) is a vector of coefficients which estimate the association of the sample covariates with the concentration. We place independent normal priors on η1,η2,,ηK\eta_1, \eta_2, \dots, \eta_K.

Note that if on some days no samples were taken, the corresponding row of the design matrix 𝐗\boldsymbol{X} can be filled with arbitrary values, since they will not contribute to the likelihood.

Concentration in multi-day composite samples

Variable: ρt\rho_t

ρt=1wi=0w1κti\begin{equation} \rho_t = \frac{1}{w} \sum_{i=0}^{w-1} \kappa_{t-i} \end{equation}

where ww is the composite window size, i.e. the number of consecutive single-day samples that are equivolumetrically mixed into a composite sample. Here, tt is the day of the latest sample.

Measurements module

Analysed concentration before replication stage

Variable: Ψt\Psi_t

We model Ψt\Psi_t as Log-Normal distributed with unit-scale mean ρt\rho_t (sample concentration) and unit-scale coefficient of variation νψ\nu_\psi, i.e.

ΨtLog-Normal(μΨ,σΨ2)\begin{equation} \Psi_t \sim \text{Log-Normal}\left(\mu_\Psi,\, \sigma_\Psi^2\right) \end{equation}

where μΨ=log(ρt)12σΨ2\mu_\Psi = \log(\rho_t) - \frac{1}{2}\sigma_\Psi^2 is the location and σΨ2=log(1+νΨ2)\sigma_\Psi^2 = \log(1 + \nu_\Psi^2) is the scale of the Log-Normal distribution. Here, νψ\nu_\psi measures the unexplained variation in concentrations before the replication stage (for example, if the replication stage is PCR quantification, then νψ\nu_\psi measures all variation before PCR). We place a zero-truncated normal prior on νψ\nu_\psi.

Limit of detection (LOD)

We implement a hurdle model for the limit of detection, where the probability for a zero measurement is described by an exponentially decreasing function of the underlying concentration, i.e. 

P(Υt,i=0|Ψt=ψt)=eψtcb\begin{equation} P(\Upsilon_{t,i} = 0 | \Psi_t = \psi_t) = e^{-\psi_t\, c\, b} \end{equation}

where Υt,i\Upsilon_{t,i} is the ithi^{\text{th}} replicate concentration measurement from the (composite) sample on day tt, and bb and cc are parameters of the PCR. The PCR parameters are either estimated by the coefficient of variation model (see below, this may require informative priors and/or replicate measurements), or the product cbc\, b can be back-calculated if the user provides an LOD value established from a lab experiment. The functional form of the LOD model is based on the statistical properties of digital PCR and corresponds to a binomial likelihood for zero successes. For non-zero measurements, we conversely have

P(Υt,i>0|Ψt=ψt)=1P(Υt=0|Ψt=ψt)\begin{equation} P(\Upsilon_{t,i} > 0 | \Psi_t = \psi_t) = 1 - P(\Upsilon_t = 0 | \Psi_t = \psi_t) \end{equation}

Measured concentration

Variable: Υt\Upsilon_t

Non-zero concentration measurements are modeled as truncated normal distributed, i.e.

Υt,i|Υt,i>0Normal+(μΥt,σΥt2)\begin{equation} \Upsilon_{t,i} \,|\, \Upsilon_{t,i} > 0 \, \sim \text{Normal}^+\left(\mu_{\Upsilon_{t}},\, \sigma_{\Upsilon_{t}}^2\right) \end{equation} where Υt,i\Upsilon_{t,i} is the ithi^{\text{th}} replicate concentration measurement from the (composite) sample on day tt. Here, μΥt=ψt\mu_{\Upsilon_{t}} = \psi_t is the mean and σΥt2=μΥtνΥ(μΥt)\sigma_{\Upsilon_{t}}^2 = \mu_{\Upsilon_{t}}\, \nu_\Upsilon(\mu_{\Upsilon_{t}}) the standard deviation of the truncated Normal distribution. Note that EpiSewer also offers a Log-Normal likelihood as an alternative, however based on the properties of dPCR assays, the truncated Normal should be the better approximation for the true distribution of measurements.

Specifically, νΥ(μ)\nu_\Upsilon(\mu) is the coefficient of variation for the replication stage and modeled as a function of the expected concentration μ\mu: νΥ(μ)=1bexp(μc)1μc1+a2bμc\begin{equation} \nu_\Upsilon(\mu) = \frac{1}{\sqrt{b}}\, \frac{\sqrt{exp(\mu\, c) – 1}}{\mu\, c}\, \sqrt{1 + a^2\, b\, \mu\, c} \end{equation} with aa, bb and cc as parameters to be estimated. The functional form is derived from the statistical properties of the binomial-distributed positive partition counts in the dPCR reaction, where

  • a=νprea=\nu_{\text{pre}} represents the coefficient of variation for noise in the concentration before the dPCR reaction,
  • bb represents the average number of partitions in the PCR reaction, and
  • cc represents the dPCR conversion factor, which is the partition volume scaled by the dilution of the wastewater in the PCR reaction. This factor converts the wastewater concentration into the expected number of gene copies per partition. Note that here the dilution includes all extraction and preparation steps. For example, if the partition volume is 4.5e-7 mL and the dilution of the wastewater is 100:3 (i.e. 100 gc/mL in the original wastewater sample correspond to 3 gc/mL in the PCR reaction), then the overall scaling factor is 4.5e-7 * 100 / 3 = 1.5e-5.

We place zero-truncated normal priors on aa, bb, and cc, respectively.

If replicates are modeled and the replication stage is PCR quantification, then νΥ\nu_\Upsilon measures the noise only from the PCR and νψ\nu_\psi all other noise. In this case, aa should typically be very small. In contrast, if no replicates are modeled, then νΥ\nu_\Upsilon measures the total variation and νψ\nu_\psi is not estimated. In this case, aa will be larger.

Estimation

EpiSewer is directly fitted to observed concentration measurements yt,iy_{t,i} (where tTt \leq T is the index of the sample date and ii the index of the replicate out of ntn_t replicates in total for date index tt).

Using Markov Chain Monte Carlo (MCMC), we draw samples from the joint posterior distribution of RtR_t and the other parameters

P(R{t|tT},|y{t|tT},{i|int})P(y{t|tT},{i|int}|R{t|tT},)P(),\begin{equation} P\left(R_{\{t | t \leq T\}}, \dots \;\middle|\; y_{\{t | t \leq T\},\{i | i \leq n_t\}}\right) \propto P\left(y_{\{t | t \leq T\},\{i | i \leq n_t\}} \;\middle|\; R_{\{t | t \leq T\}}, \dots\right) \, P\left(\dots \right), \end{equation}

where P(y{t|tT},{i|int}|R{t|tT},)P\left(y_{\{t | t \leq T\},\{i | i \leq n_t\}} \;\middle|\; R_{\{t | t \leq T\}}, \dots\right) is the likelihood as defined by the 5 modules of our generative model (infection, shedding, sewage, sampling, and measurement), and P()P\left(\dots \right) represents the priors on the non-hierarchical parameters of the model (e.g. intercepts and variances of random walks).

Note that many variables in the generative model (including RtR_t!) are in fact only transformations of other variables and therefore do not count as model parameters in a strict sense.

Indexing of time

For ease of notation, we have here used a unified time index t{1,2,,T}t \in \{1, 2, \dots, T\} for all variables. In reality, to model Υt,i\Upsilon_{t,i} at t=1t=1, we need to model RtR_t and many other variables also at earlier time points t<1t<1 due to the various delays between infections and observations. In the model implementation, each variable therefore has its own time index, and indices are mapped to each other accordingly.