# Real-time outbreak analysis: Ebola as a case study - part 2

/ [practicals]   / #simulation #response #ebola #epicurve #reproduction number

## Introduction

This practical is the second (out of three) part of a practical which simulates the early assessment and reconstruction of an Ebola Virus Disease (EVD) outbreak. Please make sure you have gone through part 1 before starting part 2. In part 2 of the practical, we introduce various aspects of analysis of the early stage of an outbreak, including growth rate estimation, contact tracing data, delays, and estimates of transmissibility. Part 3 of the practical will give an introduction to transmission chain reconstruction using outbreaker2.

Note: This practical is derived from earlier practicals called Ebola simulation part 1: early outbreak assessment and Ebola simulation part 2: outbreak reconstruction

## Learning outcomes

By the end of this practical (part 2), you should be able to:

• Estimate & interpret the growth rate & doubling time of the epidemic

• Estimate the serial interval from data on pairs infector / infected individuals

• Estimate & interpret the reproduction number of the epidemic

• Forecast short-term future incidence

## Context: A novel EVD outbreak in a fictional country in West Africa

A new EVD outbreak has been notified in a fictional country in West Africa. The Ministry of Health is in charge of coordinating the outbreak response, and have contracted you as a consultant in epidemic analysis to inform the response in real time. You have already read in an done descriptive analysis of the data (part 1 of the practical). Now let’s do some statistical analyses!

## Required packages

The following packages, available on CRAN or github, are needed for this analysis. You should have installed them in part 1 but if not, install necessary packages as follows:

# install.packages("remotes")
# install.packages("outbreaks")
# install.packages("incidence")
# remotes::install_github("reconhub/epicontacts@ttree")
# install.packages("distcrete")
# install.packages("epitrix")
# remotes::install_github("annecori/EpiEstim")
# remotes::install_github("reconhub/projections")
# install.packages("ggplot2")
# install.packages("magrittr")
# install.packages("binom")
# install.packages("ape")
# install.packages("outbreaker2")
# install.packages("here")

Once the packages are installed, you may need to open a new R session. Then load the libraries as follows:

library(readxl)
library(outbreaks)
library(incidence)
library(epicontacts)
library(distcrete)
library(epitrix)
library(EpiEstim)
library(projections)
library(ggplot2)
library(magrittr)
library(binom)
library(ape)
library(outbreaker2)
library(here)

## Read in the data processed in part 1

i_daily <- readRDS(here("data/clean/i_daily.rds"))
contacts <- readRDS(here("data/clean/contacts.rds"))

## Estimating the growth rate using a log-linear model

The simplest model of incidence is probably the log-linear model, i.e. a linear regression on log-transformed incidences. For this we will work with weekly incidence, to avoid having too many issues with zero incidence (which cannot be logged).

Visualise the log-transformed incidence:

ggplot(as.data.frame(i_weekly)) +
geom_point(aes(x = dates, y = log(counts))) +
scale_x_incidence(i_weekly) +
xlab("date") +
ylab("log weekly incidence") +
theme_minimal()

What does this plot tell you about the epidemic?

In the incidence package, the function fit will estimate the parameters of this model from an incidence object (here, i_weekly). Apply it to the data and store the result in a new object called f. You can print and use f to derive estimates of the growth rate r and the doubling time, and add the corresponding model to the incidence plot:

Fit a log-linear model to the weekly incidence data:

f <- incidence::fit(i_weekly)
## Warning in incidence::fit(i_weekly): 1 dates with incidence of 0 ignored
## for fitting

f
## <incidence_fit object>
##
## $model: regression of log-incidence over time ## ##$info: list containing the following items:
##   $r (daily growth rate): ## [1] 0.04145251 ## ##$r.conf (confidence interval):
##           2.5 %     97.5 %
## [1,] 0.02582225 0.05708276
##
##   $doubling (doubling time in days): ## [1] 16.72148 ## ##$doubling.conf (confidence interval):
##         2.5 %   97.5 %
## [1,] 12.14285 26.84302
##
##   $pred: data.frame of incidence predictions (12 rows, 5 columns)  plot(i_weekly, fit = f) Looking at the plot and fit, do you think this is a reasonable fit? ## Finding a suitable threshold date for the log-linear model, based on the observed delays Using the plot of the log(incidence) that you plotted above, and thinking about why exponential growth may not be observed in the most recent weeks, choose a threshold date and fit the log-linear model to a suitable section of the epicurve where you think we can more reliably estimate the growth rate, r, and the doubling time. You may want to examine how long after onset of symptoms cases are hospitalised; this may inform the threshold date you choose, as follows: summary(as.numeric(linelist_clean$date_of_hospitalisation - linelist_clean$date_of_onset)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 1.00 2.00 3.53 5.00 22.00  # how many weeks should we discard at the end of the epicurve n_weeks_to_discard <-  min_date <- min(i_daily$dates)
max_date <- max(i_daily$dates) - n_weeks_to_discard * 7 # weekly truncated incidence i_weekly_trunc <- subset(i_weekly, from = min_date, to = max_date) # discard last few weeks of data # daily truncated incidence (not used for the linear regression but may be used later) i_daily_trunc <- subset(i_daily, from = min_date, to = max_date) # remove last two weeks of data Refit and plot your log-linear model as before but using the truncated data i_weekly_trunc. ## <incidence_fit object> ## ##$model: regression of log-incidence over time
##
## $info: list containing the following items: ##$r (daily growth rate):
## [1] 0.04773599
##
##   $r.conf (confidence interval): ## 2.5 % 97.5 % ## [1,] 0.03141233 0.06405965 ## ##$doubling (doubling time in days):
## [1] 14.52043
##
##   $doubling.conf (confidence interval): ## 2.5 % 97.5 % ## [1,] 10.82034 22.06609 ## ##$pred: data.frame of incidence predictions (11 rows, 5 columns)


Look at the summary statistics of your fit:

summary(f$model) ## ## Call: ## stats::lm(formula = log(counts) ~ dates.x, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.79781 -0.44508 -0.00138 0.35848 0.69880 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.296579 0.320461 0.925 0.379 ## dates.x 0.047736 0.007216 6.615 9.75e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.5298 on 9 degrees of freedom ## Multiple R-squared: 0.8294, Adjusted R-squared: 0.8105 ## F-statistic: 43.76 on 1 and 9 DF, p-value: 9.754e-05  You can look at the goodness of fit (Rsquared), the estimated slope (growth rate) and the corresponding doubling time as below: # does the model fit the data well? adjRsq_model_fit <- summary(f$model)$adj.r.squared # what is the estimated growth rate of the epidemic? daily_growth_rate <- f$model$coefficients['dates.x'] # confidence interval: daily_growth_rate_CI <- confint(f$model, 'dates.x', level=0.95)
# what is the doubling time of the epidemic?
doubling_time_days <- log(2) / daily_growth_rate
# confidence interval:
doubling_time_days_CI <- log(2) / rev(daily_growth_rate_CI)

Although the log-linear is a simple and quick method for early epidemic assessment, care must be taken to only fit to the point that there is epidemic growth. Note that it may be difficult to define this point.

## Contact Tracing - Looking at contacts

Contact tracing is one of the pillars of an Ebola outbreak response. This involves identifying and following up any at risk individuals who have had contact with a known case (i.e. may have been infected). For Ebola, contacts are followed up for 21 days (the upper bound of the incubation period). This ensures that contacts who become symptomatic can be isolated quickly, reducing the chance of further transmission. We use the full linelist here rather than linelist_clean where we discarded entries with errors in the dates. However, the contact may still be valid.

Using the function make_epicontacts in the epicontacts package, create a new epicontacts object called epi_contacts. Make sure you check the column names of the relevant to and from arguments.

epi_contacts
##
## /// Epidemiological Contacts //
##
##   // class: epicontacts
##   // 169 cases in linelist; 60 contacts;  directed
##
##   // linelist
##
## # A tibble: 169 x 11
##    id    generation date_of_infecti… date_of_onset date_of_hospita…
##    <chr>      <dbl> <date>           <date>        <date>
##  1 d1fa…          0 NA               2014-04-07    2014-04-17
##  2 5337…          1 2014-04-09       2014-04-15    2014-04-20
##  3 f5c3…          1 2014-04-18       2014-04-21    2014-04-25
##  4 6c28…          2 NA               2014-04-27    2014-04-27
##  5 0f58…          2 2014-04-22       2014-04-26    2014-04-29
##  6 4973…          0 2014-03-19       2014-04-25    2014-05-02
##  7 f914…          3 NA               2014-05-03    2014-05-04
##  8 881b…          3 2014-04-26       2014-05-01    2014-05-05
##  9 e66f…          2 NA               2014-04-21    2014-05-06
## 10 20b6…          3 NA               2014-05-05    2014-05-06
## # … with 159 more rows, and 6 more variables: date_of_outcome <date>,
## #   outcome <chr>, gender <chr>, hospital <chr>, lon <dbl>, lat <dbl>
##
##   // contacts
##
## # A tibble: 60 x 3
##    from   to     source
##    <chr>  <chr>  <chr>
##  1 d1fafd 53371b other
##  2 f5c3d8 0f58c4 other
##  3 0f58c4 881bd4 other
##  4 f5c3d8 d58402 other
##  5 20b688 d8a13d funeral
##  6 2ae019 a3c8b8 other
##  7 20b688 974bc1 funeral
##  8 2ae019 72b905 funeral
##  9 40ae5f b8f2fd funeral
## 10 f1f60f 09e386 other
## # … with 50 more rows


You can easily plot these contacts, but with a little bit of tweaking (see ?vis_epicontacts) you can customise for example shapes by gender and arrow colours by source of exposure (or other variables of interest):

# for example, look at the reported source of infection of the contacts.
table(epi_contacts$contacts$source, useNA = "ifany")
##
## funeral   other
##      20      40

p <- plot(epi_contacts, node_shape = "gender", shapes = c(m = "male", f = "female"), node_color = "gender", edge_color = "source", selector = FALSE)
p

Using the function match (see ?match) check that the visualised contacts are indeed cases.

##  [1]   2   5   8  14  15  16  18  20  21  22  24  25  26  27  30  33  37
## [18]  38  40  46  48  51  58  59  62  64  68  69  70  71  73  75  79  84
## [35]  86  88  90  94  95  96  98 103 108 115 116 122 126 131 133 142 145
## [52] 146 147 148 153 155 157 160 162 166


Once you are happy that these are all indeed cases, look at the network:

• does it look like there is super-spreading (heterogeneous transmission)?
• looking at the gender of the cases, can you deduce anything from this? Are there any visible differences by gender?

## Estimating transmissibility ($R$)

### Branching process model

The transmissibility of the disease can be assessed through the estimation of the reproduction number R, defined as the expected number of secondary cases per infected case. In the early stages of an outbreak, and assuming a large population with no immunity, this quantity is also the basic reproduction number $R_0$, i.e. $R$ in a large fully susceptible population.

The package EpiEstim implements a Bayesian estimation of $R$, using dates of onset of symptoms and information on the serial interval distribution, i.e. the distribution of time from symptom onset in a case and symptom onset in their infector (see Cori et al., 2013, AJE 178: 1505–1512).

Briefly, EpiEstim uses a simple model describing incidence on a given day as Poisson distributed, with a mean determined by the total force of infection on that day:

$$I_t ∼ Poisson(\lambda_t)$$

where $I_t$ is the incidence (based on symptom onset) on day $t$ and $\lambda_t$ is the force of infection on that day. Noting R the reproduction number and w() the discrete serial interval distribution, we have:

$$\lambda_t = R \sum_{s=1}^t I_sw(t-s)$$

The likelihood (probability of observing the data given the model and parameters) is defined as a function of R:

$$L(I) = p(I|R) = \prod_{t = 1}^{T} f_P(I_t, \lambda_t)$$

where $f_P(.,\mu)$ is the probability mass function of a Poisson distribution with mean $\mu$.

### Estimating the serial interval

As data was collected on pairs of infector and infected individuals, this should be sufficient to estimate the serial interval distribution. If that was not the case, we could have used data from past outbreaks instead.

Use the function get_pairwise can be used to extract the serial interval (i.e. the difference in date of onset between infectors and infected individuals):

si_obs <- get_pairwise(epi_contacts, "date_of_onset")
summary(si_obs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   5.000   6.500   9.117  12.250  30.000

## Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 1.000   5.000   6.500   9.117  12.250  30.000
hist(si_obs, breaks = 0:30,
xlab = "Days after symptom onset", ylab = "Frequency",
main = "Serial interval (empirical distribution)",
col = "grey", border = "white")

What do you think of this distribution? Make any adjustment you would deem necessary, and then use the function fit_disc_gamma from the package epitrix to fit a discretised Gamma distribution to these data. Your results should approximately look like:

si_fit <- fit_disc_gamma(si_obs, w = 1)
si_fit
## $mu ## [1] 8.612892 ## ##$cv
## [1] 0.7277355
##
## $sd ## [1] 6.267907 ## ##$ll
## [1] -183.4215
##
## $converged ## [1] TRUE ## ##$distribution
## A discrete distribution
##   name: gamma
##   parameters:
##     shape: 1.88822148063956
##     scale: 4.5613778865727


si_fit contains various information about the fitted delays, including the estimated distribution in the $distribution slot. You can compare this distribution to the empirical data in the following plot: si <- si_fit$distribution
si
## A discrete distribution
##   name: gamma
##   parameters:
##     shape: 1.88822148063956
##     scale: 4.5613778865727

## compare fitted distribution
hist(si_obs, xlab = "Days after symptom onset", ylab = "Frequency",
main = "Serial interval: fit to data", col = "salmon", border = "white",
50, ylim = c(0, 0.15), freq = FALSE, breaks = 0:35)
points(0:60, si$d(0:60), col = "#9933ff", pch = 20) points(0:60, si$d(0:60), col = "#9933ff", type = "l", lty = 2)

Would you trust this estimation of the generation time? How would you compare it to actual estimates from the West African EVD outbreak (WHO Ebola Response Team (2014) NEJM 371:1481–1495) with a mean of 15.3 days and a standard deviation 9.3 days?

### Estimation of the Reproduction Number

Now that we have estimates of the serial interval, we can use this information to estimate transmissibility of the disease (as measured by $R_0$). Make sure you use a daily (not weekly) incidence object truncated to the period where you have decided there is exponential growth (i_daily_trunc).

Using the estimates of the mean and standard deviation of the serial interval you just obtained, use the function estimate_R to estimate the reproduction number (see ?estimate_R) and store the result in a new object R.

Before using estimate_R, you need to create a config object using the make_config function, where you will specify the time window where you want to estimate the reproduction number as well as the mean_si and std_si to use. For the time window, use t_start = 2 (you can only estimate R from day 2 onwards as you are conditioning on the past observed incidence) and specify t_end = length(i_daily_trunc$counts) to estimate R up to the last date of your truncated incidence i_daily_trunc. config <- make_config(mean_si = si_fit$mu, # mean of the si distribution estimated earlier
std_si = si_fit$sd, # standard deviation of si dist estimated earlier t_start = 2, # starting day of time window t_end = length(i_daily_trunc$counts)) # final day of time window
R <- # use estimate_R using method = "parametric_si"
plot(R, legend = FALSE)  

## TableGrob (3 x 1) "arrange": 3 grobs
##       z     cells    name           grob
## incid 1 (1-1,1-1) arrange gtable[layout]
## R     2 (2-2,1-1) arrange gtable[layout]
## SI    3 (3-3,1-1) arrange gtable[layout]


Extract the median and 95% credible intervals (95%CrI) for the reproduction number as follows:

R_median <- R$R$Median(R)
R_CrI <- c(R$R$Quantile.0.025(R), R$R$Quantile.0.975(R))
R_median
## [1] 1.278192

R_CrI
## [1] 1.068374 1.513839


Interpret these results: what do you make of the reproduction number? What does it reflect? Based on the last part of the epicurve, some colleagues suggest that incidence is going down and the outbreak may be under control. What is your opinion on this?

Note that you could have estimated R0 directly from the growth rate and the serial interval, using the formula described in Wallinga and Lipsitch, Proc Biol Sci, 2007: $R_0 = \frac{1}{\int_{s=0}^{+\infty}e^{-rs}w(s)ds}$, and implemented in the function r2R0 of the epitrix package. Although this may seem like a complicated formula, the reasoning behind it is simple and illustrated in the figure below: for an observed exponentially growing incidence curve, if you know the serial interval, you can derive the reproduction number.

Compared to the figure above, there is uncertainty in the growth rate r, and the serial interval has a full distribution rather than a single value. This can be accounted for in estimating R as follows:

# generate a sample of R0 estimates from the growth rate and serial interval we estimated above
R_sample_from_growth_rate <- lm2R0_sample(f$model, # log-linear model which contains our estimates of the growth rate r si$d(1:100), # serial interval distribution (truncated after 100 days)
n = 1000) # desired sample size
# plot these:
hist(R_sample_from_growth_rate)

# what is the median?
R_median_from_growth_rate <- median(R_sample_from_growth_rate)
R_median_from_growth_rate # compare with R_median
## [1] 1.414592

# what is the 95%CI?
R_CI_from_growth_rate <- quantile(R_sample_from_growth_rate, c(0.025, 0.975))
R_CI_from_growth_rate # compare with R_CrI
##     2.5%    97.5%
## 1.266280 1.573704


Note the above estimates are slighlty different from those obtained using the branching process model. There are a few reasons for this. First, you used more detailed data (daily vs weekly incidence) for the branching process (EpiEstim) estimate. Furthermore, the log-linear model puts the same weight on all data points, whereas the branching process model puts a different weight on each data point (depending on the incidence observed at each time step). This may lead to slighlty different R estimates.

## Short-term incidence forecasting

The function project from the package projections can be used to simulate plausible epidemic trajectories by simulating daily incidence using the same branching process as the one used to estimate $R$ in EpiEstim. All that is needed is one, or several values of $R$ and a serial interval distribution, stored as a distcrete object.

Here, we first illustrate how we can simulate 5 random trajectories using a fixed value of $R$ = 1.28, the median estimate of R from above.
Use the same daily truncated incidence object as above to simulate future incidence.

#?project
small_proj <- project(i_daily_trunc,# incidence object
R = R_median, # R estimate to use
si = si,      # serial interval distribution
n_sim = 5,    # simulate 5 trajectories
n_days = 10,  # over 10 days
R_fix_within = TRUE) # keep the same value of R every day

# look at each projected trajectory (as columns):
as.matrix(small_proj)
##            [,1] [,2] [,3] [,4] [,5]
## 2014-06-18    2    4    3    2    2
## 2014-06-19    5    4    4    4    1
## 2014-06-20    2    5    4    6    3
## 2014-06-21    6    3    3    7    2
## 2014-06-22    6    6    5    5    1
## 2014-06-23    3    2    5    2    3
## 2014-06-24    8    4    6    5    5
## 2014-06-25    7    8    1    7    5
## 2014-06-26    5    5    5    4    4
## 2014-06-27    9    9    7    7    4


Using the same principle, generate 1,000 trajectories for the next 2 weeks, using a range of plausible values of $R$.
The posterior distribution of R is gamma distributed (see Cori et al. AJE 2013) so you can use the rgamma function to randomly draw values from that distribution. You will also need to use the function gamma_mucv2shapescale from the epitrix package as shown below.

sample_R <- function(R, n_sim = 1000)
{
mu <- R$R$Mean(R)
sigma <- R$R$Std(R)
Rshapescale <- gamma_mucv2shapescale(mu = mu, cv = sigma / mu)
R_sample <- rgamma(n_sim, shape = Rshapescale$shape, scale = Rshapescale$scale)
return(R_sample)
}

R_sample <- sample_R(R, 1000) # sample 1000 values of R from the posterior distribution
hist(R_sample, col = "grey")  # plot histogram of sample
abline(v = R_median, col = "red") # show the median estimated R as red solid vertical line
abline(v = R_CrI, col = "red", lty = 2) # show the 95%CrI of R as red dashed vertical lines

Store the results of your new projections in an object called proj.

You can visualise the projections as follows:

plot(i_daily_trunc) %>% add_projections(proj, c(0.025, 0.5, 0.975))

How would you interpret the following summary of the projections?

apply(proj, 1, summary)
##         2014-06-18 2014-06-19 2014-06-20 2014-06-21 2014-06-22 2014-06-23
## Min.         0.000      0.000      0.000       0.00      0.000      0.000
## 1st Qu.      2.000      2.000      3.000       3.00      3.000      3.000
## Median       4.000      4.000      4.000       4.00      4.000      4.000
## Mean         3.819      3.987      4.038       4.39      4.276      4.437
## 3rd Qu.      5.000      5.000      5.000       6.00      6.000      6.000
## Max.        11.000     13.000     11.000      12.00     13.000     14.000
##         2014-06-24 2014-06-25 2014-06-26 2014-06-27 2014-06-28 2014-06-29
## Min.         0.000      0.000      0.000      0.000       0.00      0.000
## 1st Qu.      3.000      3.000      3.000      3.000       3.00      3.000
## Median       4.000      5.000      5.000      5.000       5.00      5.000
## Mean         4.528      4.866      4.909      5.095       5.28      5.351
## 3rd Qu.      6.000      6.000      7.000      7.000       7.00      7.000
## Max.        13.000     14.000     14.000     14.000      16.00     16.000
##         2014-06-30 2014-07-01
## Min.         0.000      0.000
## 1st Qu.      4.000      4.000
## Median       5.000      5.000
## Mean         5.564      5.716
## 3rd Qu.      7.000      7.000
## Max.        20.000     19.000

apply(proj, 1, function(x) mean(x > 0)) # proportion of trajectories with at least 
## 2014-06-18 2014-06-19 2014-06-20 2014-06-21 2014-06-22 2014-06-23
##      0.970      0.979      0.980      0.990      0.983      0.986
## 2014-06-24 2014-06-25 2014-06-26 2014-06-27 2014-06-28 2014-06-29
##      0.993      0.993      0.989      0.988      0.995      0.987
## 2014-06-30 2014-07-01
##      0.987      0.992

                                        # one case on each given day

apply(proj, 1, mean) # mean daily number of cases 
## 2014-06-18 2014-06-19 2014-06-20 2014-06-21 2014-06-22 2014-06-23
##      3.819      3.987      4.038      4.390      4.276      4.437
## 2014-06-24 2014-06-25 2014-06-26 2014-06-27 2014-06-28 2014-06-29
##      4.528      4.866      4.909      5.095      5.280      5.351
## 2014-06-30 2014-07-01
##      5.564      5.716

apply(apply(proj, 2, cumsum), 1, summary) # projected cumulative number of cases in 
##         2014-06-18 2014-06-19 2014-06-20 2014-06-21 2014-06-22 2014-06-23
## Min.         0.000      0.000      2.000      5.000       7.00      8.000
## 1st Qu.      2.000      6.000      9.000     13.000      17.00     21.000
## Median       4.000      7.000     12.000     16.000      20.00     25.000
## Mean         3.819      7.806     11.844     16.234      20.51     24.947
## 3rd Qu.      5.000     10.000     14.000     19.000      24.00     29.000
## Max.        11.000     18.000     24.000     36.000      43.00     53.000
##         2014-06-24 2014-06-25 2014-06-26 2014-06-27 2014-06-28 2014-06-29
## Min.        11.000     13.000      16.00     16.000     17.000     18.000
## 1st Qu.     24.000     28.000      32.00     37.000     41.000     45.000
## Median      29.000     34.000      39.00     44.000     49.000     54.000
## Mean        29.475     34.341      39.25     44.345     49.625     54.976
## 3rd Qu.     34.000     40.000      45.00     51.000     57.000     64.000
## Max.        60.000     69.000      83.00     94.000    105.000    116.000
##         2014-06-30 2014-07-01
## Min.         20.00     21.000
## 1st Qu.      50.00     54.000
## Median       59.00     65.000
## Mean         60.54     66.256
## 3rd Qu.      70.00     76.000
## Max.        132.00    142.000

                                          # the next two weeks

According to these results, what are the chances that more cases will appear in the near future? Is this outbreak being brought under control? Would you recommend scaling up / down the response? Is this consistent with your estimate of $R$?

## Pause !

Please let a demonstrator know when you have reached this point before proceeding further.

## Accounting for uncertainty in the serial interval estimates when estimating the reproduction number

Note that this section is independent from the following ones, please skip if you don’thave time.

EpiEstim allows to explicitly account for the uncertainty in the serial interval estimates because of limited sample size of pairs of infector/infected individuals. Note that it also allows accounting for uncertainty on the dates of symptom onset for each of these pairs (which is not needed here).

Use the method = "si_from_data" option in estimate_R. To use this option, you need to create a data frame with 4 columns: EL, ER, SL and SR for the left (L) and right ® bounds of the observed time of symptoms in the infector (E) and infected (S for secondary) cases. Here we derive this from si_obs as follows:

si_data <- data.frame(EL = rep(0L, length(si_obs)),
ER = rep(1L, length(si_obs)),
SL = si_obs, SR = si_obs + 1L)

We can then feed this into estimate_R (but this will take some time to run as it estimates the serial interval distribution using an MCMC and fully accounts for the uncertainty in the serial interval to estimate the reproduction number).

config <- make_config(t_start = 2,
t_end = length(i_daily_trunc$counts)) R_variableSI <- estimate_R(i_daily_trunc, method = "si_from_data", si_data = si_data, config = config) ## Running 8000 MCMC iterations ## MCMCmetrop1R iteration 1 of 8000 ## function value = -186.93263 ## theta = ## 0.61048 ## 1.53754 ## Metropolis acceptance rate = 1.00000 ## ## MCMCmetrop1R iteration 1001 of 8000 ## function value = -187.66955 ## theta = ## 0.60839 ## 1.72039 ## Metropolis acceptance rate = 0.55145 ## ## MCMCmetrop1R iteration 2001 of 8000 ## function value = -186.33760 ## theta = ## 0.84673 ## 1.39242 ## Metropolis acceptance rate = 0.56322 ## ## MCMCmetrop1R iteration 3001 of 8000 ## function value = -186.84716 ## theta = ## 0.95161 ## 1.19723 ## Metropolis acceptance rate = 0.56448 ## ## MCMCmetrop1R iteration 4001 of 8000 ## function value = -187.56424 ## theta = ## 0.86940 ## 1.46982 ## Metropolis acceptance rate = 0.55711 ## ## MCMCmetrop1R iteration 5001 of 8000 ## function value = -186.49130 ## theta = ## 0.76777 ## 1.49876 ## Metropolis acceptance rate = 0.55449 ## ## MCMCmetrop1R iteration 6001 of 8000 ## function value = -186.49278 ## theta = ## 0.86552 ## 1.39227 ## Metropolis acceptance rate = 0.55257 ## ## MCMCmetrop1R iteration 7001 of 8000 ## function value = -186.98037 ## theta = ## 1.00900 ## 1.19664 ## Metropolis acceptance rate = 0.55092 ## ## ## ## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ ## The Metropolis acceptance rate was 0.55012 ## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ ## ## Gelman-Rubin MCMC convergence diagnostic was successful. ## ## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ ## Estimating the reproduction number for these serial interval estimates... ## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  # checking that the MCMC converged: R_variableSI$MCMC_converged
## [1] TRUE

# and plotting the output:
plot(R_variableSI)

## TableGrob (3 x 1) "arrange": 3 grobs
##       z     cells    name           grob
## incid 1 (1-1,1-1) arrange gtable[layout]
## R     2 (2-2,1-1) arrange gtable[layout]
## SI    3 (3-3,1-1) arrange gtable[layout]


Look at the new median estimated R and 95%CrI: how different are they from your previous estimates? Do you think the size of the contacts dataset has had an impact on your results?

R_variableSI_median <- R_variableSI$R$Median(R)
R_variableSI_CrI <- c(R_variableSI$R$Quantile.0.025(R), R_variableSI$R$Quantile.0.975(R))
R_variableSI_median
## [1] 1.296516

R_variableSI_CrI
## [1] 1.078988 1.544788


## Estimating time-varying transmissibility

When the assumption that ($R$) is constant over time becomes untenable, an alternative is to estimating time-varying transmissibility using the instantaneous reproduction number ($R_t$). This approach, introduced by Cori et al. (2013), is also implemented in the package EpiEstim. It estimates ($R_t$) for a custom time windows (default is a succession of sliding time windows), using the same Poisson likelihood described above. In the following, we estimate transmissibility for 1-week sliding time windows (the default of estimate_R):

config = make_config(list(mean_si = si_fit$mu, std_si = si_fit$sd))
# t_start and t_end are automatically set to estimate R over 1-week sliding windows by default. 
Rt <-         # use estimate_R using method = "parametric_si"

# look at the most recent Rt estimates:
tail(Rt$R[, c("t_start", "t_end", "Median(R)", "Quantile.0.025(R)", "Quantile.0.975(R)")]) ## Default config will estimate R on weekly sliding windows. ## To change this change the t_start and t_end arguments. ## t_start t_end Median(R) Quantile.0.025(R) Quantile.0.975(R) ## 60 61 67 1.2417304 0.8144152 1.797603 ## 61 62 68 1.0045473 0.6318309 1.501326 ## 62 63 69 0.8404935 0.5074962 1.294848 ## 63 64 70 1.0276438 0.6538993 1.522464 ## 64 65 71 1.0335607 0.6576643 1.531230 ## 65 66 72 1.0337804 0.6578041 1.531556  Plot the estimate of $R$ over time: plot(Rt, legend = FALSE) ## Warning: Removed 1 rows containing missing values (geom_path).  ## TableGrob (3 x 1) "arrange": 3 grobs ## z cells name grob ## incid 1 (1-1,1-1) arrange gtable[layout] ## R 2 (2-2,1-1) arrange gtable[layout] ## SI 3 (3-3,1-1) arrange gtable[layout]  How would you interpret this result? What is the caveat of this representation? What would you have concluded if instead of using i_daily_trunc as above, you ad used the entire epidemics curve io.e. i_daily? Rt_whole_incid <- # use estimate_R using method = "parametric_si", # the same config as above but i_daily instead of i_daily_trunc # look at the most recent Rt estimates: tail(Rt_whole_incid$R[, c("t_start", "t_end",
"Median(R)", "Quantile.0.025(R)", "Quantile.0.975(R)")])  
## Default config will estimate R on weekly sliding windows.
##     To change this change the t_start and t_end arguments.

## Warning in estimate_R_func(incid = incid, method = method, si_sample = si_sample, : You're estimating R too early in the epidemic to get the desired
##             posterior CV.

##    t_start t_end Median(R) Quantile.0.025(R) Quantile.0.975(R)
## 74      75    81 1.2330741         0.8412787         1.7310657
## 75      76    82 1.0008292         0.6564151         1.4488601
## 76      77    83 0.9432201         0.6128291         1.3753773
## 77      78    84 0.8202251         0.5158976         1.2258508
## 78      79    85 0.7452526         0.4566772         1.1356909
## 79      80    86 0.5146158         0.2811874         0.8515131


## Save data and outputs

This is the end of part 2 of the practical. Before going on to part 3, you’ll need to save the following objects:

saveRDS(linelist, here("data/clean/linelist.rds"))
saveRDS(linelist_clean, here("data/clean/linelist_clean.rds"))
saveRDS(epi_contacts, here("data/clean/epi_contacts.rds"))
saveRDS(si, here("data/clean/si.rds"))