Skip to contents

Modeling by Adrian Lison
ETH Zurich, Department of Biosystems Science and Engineering, Zurich, Switzerland

This document details the generative model used by EpiSewer to estimate reproduction numbers from wastewater concentration measurements. The model is structured into 5 modules, namely infections, shedding, sewage, sampling, and measurements.

Note that below, the full model with all options is specified, while EpiSewer excludes certain parts of the model by default (e.g. no multi-day composite samples) - these have to be explicitly selected by the user to be included.

Infections module

Reproduction number

Variable: RtR_t

The model employs a link function to ensure that Rt>0R_t>0 regardless of the smoothing approach used. An obvious choice for this is the log link, which however has the disadvantage that it implies an asymmetric prior on changes in RtR_t (for example, a change from Rt=2R_t=2 to Rt=1R_t=1 would be much more likely than a change from Rt=1R_t=1 to Rt=2R_t=2). EpiSewer therefore provides alternative link functions which are configured to behave approximately like the identity function around Rt=1R_t=1 but ensure that Rt>0R_t>0.

Option 1: Inverse softplus

link1(x)=softplusk(x)=log(1+ekx)k\begin{equation} \text{link}^{-1}(x) = \text{softplus}_k(x) = \frac{\log(1+e^{kx})}{k} \end{equation}

with sharpness parameter kk.

By default, we choose k=4k=4, which makes the softplus function behave effectively like f(x)=xf(x)=x for practically relevant values of RtR_t while ensuring that Rt>0R_t>0.

Option 2: Scaled logit

link1(x)=logisticc,a,k(x)=c1+aekx\begin{equation} \text{link}^{-1}(x) = \text{logistic}_{c, a, k}(x) = \frac{c}{1 + a \, e^{-k \, x}} \end{equation}

with hyperparameters cc, aa, and kk.

Here, cc defines the maximum possible RtR_t (default Rmax=6R_\max=6). This link function is inspired from the epidemia package1, but has two further hyperparameters aa and kk, which are chosen such that the logistic function equals 1 at x=1x=1 and also has a derivative of 1. Thus, the link function behaves roughly like the identity f(x)=xf(x)=x around Rt=1R_t=1, which makes the definition of adequate smoothing priors for RtR_t easier.

Smoothing

RtR_t is modeled via a time series or smoothing spline prior which ensure smoothness through temporal autocorrelation of the RtR_t estimates.

Option 1: Gaussian process (default)

This option models RtR_t using Gaussian processes (GPs) with a Matérn kernel, i.e.

link(Rt)=αt+δt|α𝒢𝒫(1,kν(τ;α,σα2)),δ𝒢𝒫(0,kν(τ;δ,σδ2)),\begin{equation} \text{link}(R_t) = \alpha_t + \delta_t\quad |\quad \alpha \sim \mathcal{GP}\,\!\big(1,\, k_\nu(\tau;\,\ell_\alpha,\sigma^2_\alpha)\big),\; \delta \sim \mathcal{GP}\,\!\big(0,\, k_\nu(\tau;\,\ell_\delta,\sigma^2_\delta)\big), \end{equation}

where αt\alpha_t is the long-term trend, δt\delta_t is the short-term deviation from the trend, modeled each as independent Gaussian processes with Matérn kernel kνk_\nu and smoothness parameter ν=32\nu=\frac{3}{2}. Importantly, we choose priors for the kernel parameters such that the length scale α\ell_\alpha and magnitude σα\sigma_\alpha of the long-term trend are in expectation much larger than those of the short-term deviation, i.e. δ\ell_\delta and σδ\sigma_\delta. By decomposing Rt into long- and short-term components, we obtain desirable behavior in real-time estimation: absent a clear signal, the short-term deviation component will revert to zero, such that Rt will follow the long-term trend.

Option 2: Random walk

This option smoothes Rt using a simple random walk model, i.e.

link(Rt)|Rt1N(link(Rt1),σlink(R)2)\begin{equation} \text{link}(R_t)\,|\,R_{t-1} \sim N(\text{link}(R_{t-1}),{\sigma_{\text{link}(R)}}^2) \end{equation}

with a normal prior on the intercept R1R_1 and a zero-truncated normal prior on the standard deviation σlink(R)\sigma_{\text{link}(R)} of the random walk.

Option 3: Exponential smoothing

This option uses an innovations state space model with additive errors to implement an exponential smoothing prior (Holt’s linear trend method with damping)2. For this, RtR_t is modeled recursively as

link(Rt)=lt1+ϕbt1+ϵtlt=lt1+ϕbt1+αϵtbt=ϕbt1+βϵtϵtN(0,σlink(R)2),\begin{align*} \text{link}(R_t) &= l_{t-1} + \phi' \, b_{t-1} + \epsilon_t \\ l_t &= l_{t-1} + \phi' \, b_{t-1} + \alpha' \, \epsilon_t \\ b_t &= \phi' \, b_{t-1} + \beta' \, \epsilon_t \\ \epsilon_t &\sim N(0,{\sigma_{\text{link}(R)}}^2), \end{align*}

where ltl_t is the level, btb_t the trend, and ϵt\epsilon_t the additive error at time tt, respectively.

Moreover, α\alpha' is a level smoothing parameter, β\beta' a trend smoothing parameter, and ϕ\phi' a damping parameter. We use normal priors for the intercepts of the level l1l_1 and trend b1b_1, and a zero-truncated normal prior for the variance σlink(R)2{\sigma_{\text{link}(R)}}^2 of the innovations, allowing these to be estimated from the data. We use Beta priors for the smoothing parameters α\alpha' and β\beta'.

In theory, the damping parameter ϕ\phi' can also be estimated from the data, but usually has to be fixed due to identifiability issues. The default is ϕ=0.9\phi'=0.9, i.e. the strength of the trend is assumed to halve approximately every week of predicting into the future.

Option 4: Splines

link(Rt)=𝐁[t,]a\begin{equation} \text{link}(R_t) = \boldsymbol{B}_{[t,\,]} \, a \end{equation}

where 𝐁[t,]\boldsymbol{B}_{[t,\,]} is a (by default cubic) B-spline basis evaluated at date index tt, and a=(a1,a2,,adf)a = (a_1, a_2, \dots, a_\text{df}) is a vector of spline coefficients, with df=number of knots+spline degree\text{df} = \text{number of knots} + \text{spline degree} being the degrees of freedom.

We here use penalized B-splines, implemented by regularizing the spline coefficients via a random walk3, i.e.

ai|ai1N(ai1,σa2δi)\begin{equation} a_{i}\,|\,a_{i-1} \sim N(a_{i-1}, \sigma_{a}^2 \, \delta_i) \end{equation}

where the random walk variance σa2\sigma_{a}^2 is scaled by δi\delta_i, the distance between knot ii and knot i1i-1. This allows to provide a (zero-truncated normal) prior on σa\sigma_{a} which is independent of the distance between knots. By enforcing smoothness of the spline coefficients, the risk of overfitting the splines is greatly reduced. For the intercept a1a_1, a normal prior reflecting the expectation of RtR_t at the beginning of the time series is used.

Knots are placed uniformly according to a user-specified knot distance, however the knot distance is heuristically reduced for dates close to the present (which are only partially informed by data) to avoid erroneous extrapolation.

EpiSewer also supports multi-level spline smoothing by decomposing RtR_t as

Rt=link(R1+δtlocal+δtglobal),\begin{equation} R_t = \text{link}(R_1 + \delta_t^\text{local} + \delta_t^\text{global}), \end{equation}

where R1R_1 is an intercept, δtlocal\delta_t^\text{local} a component for smaller, short-term variation, and δtglobal\delta_t^\text{global} a component for larger, long-term variation. The two components are then smoothed using penalized cubic splines as described above, i.e.

δtϕ=𝐁[t,]ϕaϕ,ϕ{local,global}.\begin{equation} \delta_t^\phi = \boldsymbol{B}^\phi_{[t,\,]} \, a^\phi, \quad \phi \in \{\text{local},\text{global}\}. \end{equation}

Option 5: Smooth derivative

This option enforces a smooth first derivative of RtR_t by placing cubic splines on the first-order differences of RtR_t,

link(Rt)=link(Rt1)+δt,δt=𝐁[t,]a,\begin{equation} \text{link}(R_t) = \text{link}(R_{t-1}) + \delta_t,\quad \delta_t = \boldsymbol{B}_{[t,\,]} \, a, \end{equation}

where the spline coefficients a=(a0,a1,,an+2)a = (a_0, a_1, \dots, a_{n+2}) are given a sparse Normal-Exponential-Gamma prior aiNEG(μ=0,λ=λa,γ=γa)a_i \sim \text{NEG}(\mu = 0, \lambda = \lambda_a, \gamma = \gamma_a) for 1in+11 \leq i \leq n+1. This prior has a long tail that supports large changes in RtR_t. The first and last spline coefficient are fixed to 0 to avoid erroneous extrapolation towards the boundaries. This also has the effect that trends in RtR_t level off towards the present.

Option 6: Piecewise constant

This option models RtR_t as a piecewise constant function. The changepoints of the constant pieces are estimated from the data using an approximate changepoint model, corresponding roughly to the following model:

link(Rt)=link(R1+i=1Kai𝟙(tCi))\begin{equation} \text{link}(R_t) = \text{link}(R_1 + \sum_{i=1}^K a_i \, \mathbb{1}(t \geq C_i)) \end{equation}

where R1R_1 is the intercept, CiC_i are the estimated changepoint times of KK different changepoints, and aia_i is the change in RtR_t at changepoint ii, modeled via a sparse Normal-Exponential-Gamma prior aiNEG(μ=0,λ=λa,γ=γa)a_i \sim \text{NEG}(\mu = 0, \lambda = \lambda_a, \gamma = \gamma_a) for 1iK1 \leq i \leq K. This prior has a long tail that supports large changes in RtR_t. Generally, this option is suitable when RtR_t follows a step-like trajectory with occasional large jumps, e.g. due to strong interventions. It is less suitable for modeling gradual changes in RtR_t over time.

Addition: Changepoint model for variability

Options 2-4 also support changes in the variability of RtR_t over time. For example, during the height of an epidemic wave, countermeasures may lead to much faster changes in RtR_t than observable at other times (baseline). This potential additional variability is modeled using a linear change point model for σlink(R)2\sigma_{\text{link}(R)}^2 (or for σa\sigma_{a} when using splines). The change points are placed at regular intervals, and the additional variation at the changepoints is modeled as independently distributed and following an Exponential-Gamma (EG) distribution, which has a strong peak towards zero and a long tail.

Infections

EpiSewer uses a stochastic renewal model as described in earlier work45 to model both expected and realized infections.

Expected infections

Variable: ιt\iota_t

We model the expected number of infections ιt\iota_t via the renewal equation

ιt=E[It]=Rts=1GτsgenIts|t>G\begin{equation} \iota_t = E[I_t] = R_t \sum_{s=1}^G \tau^\text{gen}_s \, I_{t-s} \,|\, t > G \end{equation}

where ItsI_{t-s} is the number of realized infections that occurred ss days before tt and τgen\tau^\text{gen} is a discrete generation time distribution τgen=(τ1gen,τ2gen,,τGgen)\tau^\text{gen} = (\tau^\text{gen}_1, \tau^\text{gen}_2, \dots, \tau^\text{gen}_G) with maximum generation time GG.

Realized infections

Variable: ItI_t

Option 1: Poisson

It|ιtPoisson(ιt)\begin{equation} I_t|\iota_t \sim \text{Poisson}(\iota_t) \end{equation}

with expected infections ιt\iota_t.

In the stan implementation, this is approximated as It|ιtN(ιt,ιt)I_t|\iota_t \sim N\left(\iota_t, \iota_t\right)

Option 2: Negative binomial

It|ιtNegative binomial(μ=ιt,σ2=ιt+ιt2ϕ)\begin{equation} I_t|\iota_t \sim \text{Negative binomial}(\mu = \iota_t, \sigma^2 = \iota_t + \frac{\iota_t^2}{\phi}) \end{equation}

with expected infections ιt\iota_t and inverse overdispersion parameter ϕ\phi . Following 6, we place a zero-truncated normal prior not on ϕ\phi but on ξ=1ϕ\xi = \frac{1}{\sqrt{\phi}}.

In the stan implementation, this is approximated as It|ιtN(ιt,ιt+ιt2ϕ)I_t|\iota_t \sim N\left(\iota_t, \iota_t + \frac{\iota_t^2}{\phi}\right)

Seeding

We require a seeding phase at the start of the modeled time period, which describes the time evolution of infections when the renewal model cannot be applied yet.

Option 1: Constant

This option assumes a constant number of expected infections ιseed\iota_{seed} during the seeding phase, which is modeled via a normal prior on the log scale:

log(ιt)=log(ιseed)|tG,log(ιseed)N(μlog(ιseed),σlog(ιseed)2)\begin{equation} \log(\iota_t) = \log(\iota_{seed}) \,|\, t \leq G,\quad \log(\iota_\text{seed}) \sim N(\mu_{\log(\iota_\text{seed})},\sigma_{\log(\iota_\text{seed})}^2) \end{equation}

Option 2: Random walk

This option models the expected number of infections during the seeding phase via random walk on the log scale:

log(ιt)|ιt1N(log(ιt1),σlog(ι)2)|tG\begin{equation} \log(\iota_t)|\iota_{t-1} \sim N(\log(\iota_{t-1}),{\sigma_{\log(\iota)}}^2) \,|\, t \leq G \end{equation}

with a normal prior on intercept log(ι1)\log(\iota_1) and a zero-truncated normal prior on the standard deviation σlog(ι)\sigma_{\log(\iota)} of the random walk.

Option 3: Growth rate model

This option models the expected number of infections during the seeding phase as an exponential growth process:

log(ιt)=log(ιt1)+rt,rt|rt+1N(rt+1,σr2)|1<tG,\begin{equation} \log(\iota_t) = \log(\iota_{t-1}) + r_{t},\;\ r_{t} | r_{t+1} \sim N(r_{t+1},{\sigma_{r}}^2) \; \;|\; 1 < t \leq G, \end{equation}

Here the exponential growth rate rtr_t follows a backwards-in-time random walk with standard deviation σr\sigma_{r} and initial value rG+1r_{G+1}. The value rG+1r_{G+1} corresponds to the last growth rate of the seeding phase, which we set to the first estimated reproduction number RtR_t at the start of the modeling phase. We achieve this using an approach from the EpiAware package7, i.e. by solving the renewal equation RG+1(s=1GτsgenesrG+1)=1R_{G+1} (\sum_{s=1}^G \tau^\text{gen}_s e^{-s\,r_{G+1}}) = 1 within the model. This means that the intercept of the seeding phase growth rate depends on the prior provided for the initial reproduction number.

Shedding module

Expected symptom onsets

Variable: λt\lambda_t

λt=s=0LItsτsinc\begin{equation} \lambda_t = \sum_{s=0}^L I_{t-s}\, \tau^\text{inc}_s \end{equation}

where τinc=(τ0inc,τ1inc,,τLinc)\tau^\text{inc} = (\tau^\text{inc}_0, \tau^\text{inc}_1, \dots, \tau^\text{inc}_L) is the incubation period distribution with maximum incubation period LL. Note that this model component is only relevant when the shedding load distribution is referenced by the date of symptom onset. When the shedding load distribution is referenced by the date of infection, this component can be subsumed in the shedding load model.

Shedding load distribution

Variable: τshed\tau^\text{shed}

The vector τshed=(τ0shed,τ1shed,,τSshed)\tau^\text{shed} = (\tau^\text{shed}_0, \tau^\text{shed}_1, \dots, \tau^\text{shed}_S) represents the distribution of shedding over time, with shedding up to SS days after infection or after symptom onset (depending on the model specification). It can be either assumed (fixed) or estimated (requires a prior). When estimating the shedding load distribution, priors are placed on the mean μshed\mu^\text{shed} and coefficient of variation νshed\nu^\text{shed} of a positive continuous distribution (gamma or log-normal), which is transformed into a discrete distribution τshed\tau^\text{shed} within the model. This approach is inspired by the modeling of epidemiological distributions in the EpiNow28 and epinowcast9 packages.

EpiSewer also supports modeling the shedding load distribution from multiple priors in order to account for estimates from different studies or sensitivity analyses. Using a set of nn prior pairs, we model μshed\mu^\text{shed} and νshed\nu^\text{shed} as the weighted mean of prior parameters μshed=i=1n𝛉𝐢μished\mu^\text{shed} = \sum_{i=1}^{n} \boldsymbol{\theta_i}\, \mu^\text{shed}_i and νshed=i=1n𝛉𝐢νished\nu^\text{shed} = \sum_{i=1}^{n} \boldsymbol{\theta_i}\, \nu^\text{shed}_i, with the weights 𝛉\boldsymbol{\theta} drawn from a Dirichlet distribution. By default, we use a symmetric Dirichlet prior with concentration parameter a=1a = 1, corresponding to a uniform distribution over the weight simplex (i.e. we interpolate between the provided priors). By choosing a smaller concentration parameter, a more isolated support of the individual priors provided can be achieved. It is also possible to apply differential weighting of results from different studies through the parameters of the Dirichlet prior.

Total load shed in catchment

Variable: ωt\omega_t

Option 1: Without individual-level variation

ωt=s=0Sμloadτsshedλts\begin{equation} \omega_t = \sum_{s=0}^S \mu^\text{load}\, \tau^\text{shed}_s\, \lambda_{t-s} \end{equation} where μload\mu^\text{load} is the average total shedding load per person (see vignette("load_per_case")), τshed\tau^\text{shed} the shedding load distribution (see above), and λt\lambda_{t} is the expected number of individuals with symptom onset on day tt.

Option 2: With individual-level variation

ωt=s=0Sμloadτsshedζts\begin{equation} \omega_t = \sum_{s=0}^S \mu^\text{load}\, \tau^\text{shed}_s\, \zeta_{t-s} \end{equation} where (μload)(\mu^\text{load}) is the average total shedding load per person and τshed\tau^\text{shed} the shedding load distribution (see above). This is now multiplied by ζt\zeta_t, which is defined as the sum of i.i.d. Gamma distributed individual relative shedding intensities with mean 11 and coefficient of variation νζ\nu_\zeta:

ζti=1ΛtGamma(mean=1,cv=νζ)=Gamma(α=Λtνζ2,β=1νζ2)\begin{equation} \zeta_t \sim \sum_{i=1}^{\Lambda_t} \, \text{Gamma}(\text{mean} = 1, \text{cv} = \nu_\zeta) = \text{Gamma}(\alpha = \frac{\Lambda_t}{\nu_\zeta^2}, \beta = \frac{1}{\nu_\zeta^2}) \end{equation}

where Λt\Lambda_t is the realized number of symptom onsets. That is, ζt\zeta_t describes the number of individuals with symptom onset on day tt, weighted by their shedding intensity, where an intensity <1<1 indicates below-average shedding, and an intensity >1>1 indicates above-average shedding. The coefficient of variation νζ\nu_\zeta describes how much the shedding intensity varies between individuals.

We here approximate Λt\Lambda_t with its expectation λt=E[Λt]\lambda_t = E[\Lambda_t], i.e.

ζtGamma(α=λtνζ2,β=1μloadνζ2),\begin{equation} \zeta_t \sim \text{Gamma}(\alpha = \frac{\lambda_t}{\nu_\zeta^2}, \beta = \frac{1}{\mu^\text{load} \nu_\zeta^2}), \end{equation}

and place a zero-truncated normal prior on νζ\nu_\zeta.

In the stan implementation, we improve sampling efficiency by using a normal approximation of the Gamma distribution for ζt\zeta_t when λt\lambda_t is large (E[Λt]30×νζ2E[\Lambda_t] \approx 30 \times \nu_\zeta^2).

Sewage module

Load at sampling site

Variable: πt\pi_t

πt=d=0Dωtdτdres\begin{equation} \pi_t = \sum_{d=0}^D \omega_{t-d}\, \tau^\text{res}_d \end{equation}

where (τres=(τ0res,τ1res,,τDres))(\tau^\text{res} = (\tau^\text{res}_0, \tau^\text{res}_1, \dots, \tau^\text{res}_D)) is the sewer residence time distribution of pathogen particles.

Concentration at sampling site

Variable: χt\chi_t

χt=πtflowt\begin{equation} \chi_t = \frac{\pi_t}{\text{flow}_t} \end{equation}

where flowt\text{flow}_t is a measure of the total wastewater volume upstream from the sampling site on day tt.

Sampling module

Concentration in single-day samples, with sample effects and outliers

Variable: κt\kappa_t

κt=χt+𝐗η+ϵt\begin{equation} \kappa_t = \chi_t + \boldsymbol{X} \, \eta + \epsilon_t \end{equation}

where 𝐗\boldsymbol{X} is a T×KT \times K design matrix of KK different covariates describing the characteristics of samples on days 1 to TT, and η=(η1,η2,,ηK)\eta = (\eta_1, \eta_2, \dots, \eta_K) is a vector of coefficients which estimate the association of the sample covariates with the concentration. We place independent normal priors on η1,η2,,ηK\eta_1, \eta_2, \dots, \eta_K.

Moreover, ϵt>0\epsilon_t>0 represents positive outlier concentrations that violate the uniform mixing assumption. We place a heavily right-tailed Type II extreme value prior on ϵt\epsilon_t. By default, parameters are chosen such that 99% of the probability mass is below a concentration that would be observable from one infection in the catchment.

Note that if on some days no samples were taken, the corresponding row of the sample effects design matrix 𝐗\boldsymbol{X} can be filled with arbitrary values, since they will not contribute to the likelihood. To estimate weekday effects, EpiSewer provides a function vignette("sample_effects_estimate_weekday") that automatically creates a design matrix for the day of the week.

Concentration in multi-day composite samples

Variable: ρt\rho_t

ρt=1wi=0w1κti\begin{equation} \rho_t = \frac{1}{w} \sum_{i=0}^{w-1} \kappa_{t-i} \end{equation}

where ww is the composite window size, i.e. the number of consecutive single-day samples that are equivolumetrically mixed into a composite sample. Here, tt is the day of the latest sample.

Measurements module

Analysed concentration before replication

Variable: Ψt\Psi_t

We model Ψt\Psi_t as Log-Normal distributed with unit-scale mean ρt\rho_t (sample concentration) and unit-scale coefficient of variation νψ\nu_\psi, i.e.

ΨtLog-Normal(μΨ,σΨ2)\begin{equation} \Psi_t \sim \text{Log-Normal}\left(\mu_\Psi,\, \sigma_\Psi^2\right) \end{equation}

where μΨ=log(ρt)12σΨ2\mu_\Psi = \log(\rho_t) - \frac{1}{2}\sigma_\Psi^2 is the location and σΨ2=log(1+νΨ2)\sigma_\Psi^2 = \log(1 + \nu_\Psi^2) is the scale of the Log-Normal distribution. Here, νψ\nu_\psi measures the unexplained variation in concentrations before the replication stage (for example, if you provide technical PCR replicates, then νψ\nu_\psi measures all variation before the PCR). We place a zero-truncated normal prior on νψ\nu_\psi.

Note that pre-replication variation is only modeled when providing replicate measurements to EpiSewer and specifying replicates = TRUE in the noise_estimate_ component.

Measured concentration

Non-detects

EpiSewer implements a hurdle model to account for non-detects (i.e. zero measurements). This is closely related to the limit of detection (LOD), below which sample concentrations cannot be reliably detected (typically defined as less than 95% probability of detection).

Option 1: Assumed LOD

This option models the probability of non-detection as an exponentially decreasing function of the underlying concentration, i.e. 

P(Υt,i=0|Ψt=ψt)=eψtb,\begin{equation} P(\Upsilon_{t,i} = 0 | \Psi_t = \psi_t) = e^{-\psi_t\, b}, \end{equation}

where Υt,i\Upsilon_{t,i} is the ithi^{\text{th}} replicate concentration measurement from the (composite) sample on day tt. The parameter bb is calculated from an LOD value provided by the user, typically established from a lab experiment. For example, if the LOD was defined as the lowest concentration for which the detection probability was still above 95%, then b=log(10.95)/LODb = -\text{log}(1-0.95)/\text{LOD}.

Given the probability of non-detection, the probability for a non-zero measurement is

P(Υt,i>0|Ψt=ψt)=1P(Υt=0|Ψt=ψt).\begin{equation} P(\Upsilon_{t,i} > 0 | \Psi_t = \psi_t) = 1 - P(\Upsilon_t = 0 | \Psi_t = \psi_t). \end{equation}

Option 2: Non-detection in digital PCR (dPCR)

This option models the probability of non-detection based on the statistical properties of digital PCR assays, using a binomial likelihood for zero successes. This depends on several parameters of the assay, such as the number and volume of partitions - which can be either assumed or estimated. by EpiSewer. Please see this preprint for details.

Non-zero measurements

Variable: Υt,i\Upsilon_{t,i}

Υt,i\Upsilon_{t,i} represents the ithi^{\text{th}} replicate concentration measurement from the (composite) sample on day tt. Non-zero concentration measurements Υt,i>0\Upsilon_{t,i} > 0 are modeled using a positive continuous distribution (Gamma by default, but Log-Normal and Truncated Normal are also available). This distribution is parameterized by its mean and coefficient of variation.

For the mean, we assume unbiased measurements, such that in principle μΥt,i=ψt\mu_{\Upsilon_{t,i}} = \psi_t. Nevertheless, the mean of non-zero measurements must be adjusted for the conditioning on detection, i.e.

μΥt,i|Υt,i>0=ψt1P(Υt=0|Ψt=ψt),\begin{equation} \mu_{\Upsilon_{t,i} \,|\, \Upsilon_{t,i} > 0} = \frac{\psi_t}{1 - P(\Upsilon_t = 0 | \Psi_t = \psi_t)}, \end{equation}

For the coefficient of variation νΥ(ψt)\nu_\Upsilon(\psi_t), EpiSewer currently offers three options (see below). In all cases, νΥ(ψt)\nu_\Upsilon(\psi_t) describes all remaining unexplained variation in the measurements. This means that if replicates are provided and variation before the replication stage is already accounted for (see above), then νΥ(ψt)\nu_\Upsilon(\psi_t) describes only the variation between the replicates. Otherwise it will also include variation before the replication stage.

Option 1: Constant CV

This option assumes a constant coefficient of variation νΥ\nu_\Upsilon for all measurements, i.e. νΥ(ψt)=νΥ\begin{equation} \nu_\Upsilon(\psi_t) = \nu_\Upsilon \end{equation}

where νΥ\nu_\Upsilon is a parameter to be estimated. We place a zero-truncated normal prior on νΥ\nu_\Upsilon.

Option 2: CV based on digital PCR (dPCR)

This option models the CV as a function of the expected concentration ψt\psi_t, and the functional relationship is derived from the statistical properties of digital PCR assays, i.e. based on the binomial-distributed positive partition counts in the dPCR reaction. As a result, the CV is approximately constant for high concentrations, but increases exponentially as concentrations become small (it also increases when concentrations become so large that the assay saturates, but this should be avoided in lab practice through appropriate dilution). The exact functional relationship depends on several parameters of the assay, such as the number and volume of partitions - which can be either assumed or estimated. by EpiSewer. Please see this preprint for details.

Option 3: Constant variance

This option models measurements with a constant variance σΥ2\sigma^2_\Upsilon, such that

νΥ(ψt)=σΥψt.\begin{equation} \nu_\Upsilon(\psi_t) = \frac{\sigma_\Upsilon}{\psi_t}. \end{equation}

This option is not recommended for use in practice, as it is likely misspecified for most types of measurements.

Estimation

EpiSewer is directly fitted to observed concentration measurements yt,iy_{t,i} (where tTt \leq T is the index of the sample date and ii the index of the replicate out of ntn_t replicates in total for date index tt).

Using Markov Chain Monte Carlo (MCMC), we draw samples from the joint posterior distribution of RtR_t and the other parameters

P(R{t|1+GtT},|y{t|1+L+StT},{i|int})P(y{t|1+L+StT},{i|int}|R{t|1+GtT},)P(),\begin{equation} P\left(R_{\{t | 1+G \leq t \leq T\}}, \dots \;\middle|\; y_{\{t | 1+L+S \leq t \leq T\},\{i | i \leq n_t\}}\right) \propto P\left(y_{\{t | 1+L+S \leq t \leq T\},\{i | i \leq n_t\}} \;\middle|\; R_{\{t | 1+G \leq t \leq T\}}, \dots\right) \, P\left(\dots \right), \end{equation}

where P(y{t|1+L+StT},{i|int}|R{t|1+GtT},)P\left(y_{\{t | 1+L+S \leq t \leq T\},\{i | i \leq n_t\}} \;\middle|\; R_{\{t | 1+G \leq t \leq T\}}, \dots\right) is the likelihood as defined by the 5 modules of our generative model (infection, shedding, sewage, sampling, and measurement), and P()P\left(\dots \right) represents the priors on the non-hierarchical parameters of the model (e.g. intercepts and variances of random walks).

Note that many variables in the generative model (including RtR_t!) are in fact only transformations of other variables and therefore do not count as model parameters in a strict sense.

Indexing of time

For ease of notation, we have here used a unified time index t{1,2,,T}t \in \{1, 2, \dots, T\} for all variables. In reality, to model Υt,i\Upsilon_{t,i} at t=1t=1, we need to model RtR_t and many other variables also at earlier time points t<1t<1 due to the various delays between infections and observations. In the model implementation, each variable therefore has its own time index, and indices are mapped to each other accordingly.