Real-time outbreak analysis: Ebola as a case study - part 3

/ [practicals]   / #simulation #response #ebola #epicurve #reproduction number 

Introduction

This practical is the third (and last) part of a practical which simulates the early assessment and reconstruction of an Ebola Virus Disease (EVD) outbreak. Please make sure you have gone through part 1 and part 2 before starting part 3. In part 3 of the practical, we give an introduction to transmission chain reconstruction using outbreaker2.

Note: This practical is derived from earlier practicals called Ebola simulation part 1: early outbreak assessment and Ebola simulation part 2: outbreak reconstruction

Learning outcomes

By the end of this practical (part 3), you should be able to:

  • Use the outbreaker2 package to reconstruct who infected whom using epidemiological and genetic data
  • Understand how different types of data contribute to our understanding of who infected whom
  • Interpret the various outputs from outbreaker2, and understand the uncertainty associated with any conclusions you draw
  • Visualise a consensus tree using epicontacts

Context: A novel EVD outbreak in a fictional country in West Africa

A new EVD outbreak has been notified in a fictional country in West Africa. The Ministry of Health is in charge of coordinating the outbreak response, and have contracted you as a consultant in epidemic analysis to inform the response in real time. You have already done descriptive and statistical analyses of the data to understand the overall epidemic trend (part 1 and part 2 of the practical). Now we will try and draw a more detailed picture of the epidemic by trying to infer who infected whom using the data available to us, namely dates of symptom onset, whole genome sequences (WGS) and limited contact data. This can be achieved using outbreaker2, which provides a modular platform for outbreak reconstruction.

Required packages

The following packages, available on CRAN or github, are needed for this analysis. You should have installed them in part 1 and part 2 but if not, install necessary packages as follows:

# install.packages("remotes")
# install.packages("readxl")
# install.packages("outbreaks")
# install.packages("incidence")
# remotes::install_github("reconhub/epicontacts@ttree")
# install.packages("distcrete")
# install.packages("epitrix")
# remotes::install_github("annecori/EpiEstim")
# remotes::install_github("reconhub/projections")
# install.packages("ggplot2")
# install.packages("magrittr")
# install.packages("binom")
# install.packages("ape")
# install.packages("outbreaker2")
# install.packages("here")

Once the packages are installed, you may need to open a new R session. Then load the libraries as follows:

library(readxl)
library(outbreaks)
library(incidence)
library(epicontacts)
library(distcrete)
library(epitrix)
library(EpiEstim)
library(projections)
library(ggplot2)
library(magrittr)
library(binom)
library(ape)
library(outbreaker2)
library(here)

Read in the data processed in part 1 and part 2

linelist <- readRDS(here("data/clean/linelist.rds"))
linelist_clean <- readRDS(here("data/clean/linelist_clean.rds"))
epi_contacts <- readRDS(here("data/clean/epi_contacts.rds"))
si <- readRDS(here("data/clean/si.rds"))

Importing Whole Genome Sequences (WGS)

WGS have been obtained for all cases in this outbreak. They are stored as a fasta wgs_20140707.fasta. Download this file, save it in your working directory, and then import these data using the function read.FASTA from the ape package. Do the sequence labels match the IDs in your cleaned linelist?

dna <- read.FASTA(here("data/wgs_20140707.fasta"))
dna
## 169 DNA sequences in binary format stored in a list.
## 
## All sequences of same length: 18967 
## 
## Labels:
## d1fafd
## 53371b
## f5c3d8
## 6c286a
## 0f58c4
## 49731d
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.244 0.252 0.249 0.254 
## (Total: 3.21 Mb)
identical(labels(dna), linelist_clean$case_id) # check sequences match linelist data
## [1] FALSE

Subset the genetic data to keep only the sequences that match your cleaned linelist, and check again that the sequence labels match.

dna <- dna[match(linelist_clean$case_id, names(dna))]
identical(labels(dna), linelist_clean$case_id)
## [1] TRUE

Building delay distributions

outbreaker2 can handle different types of dates. When dates of onset are provided, information on the generation time (delay between primary and secondary infections) and on the incubation period (delay between infection and symptom onset) must be provided. For this practical we will assume that the generation time can be approximated using the serial interval, which you have already estimated above and stored in the object si.

How are the generation time and serial interval related? How might they differ?

The incubation period can be estimated from the linelist data by calculating the delay between dates of symptom onset and dates of infection, which are known for some cases. Visualise the distribution of incubation periods using a histogram.

incub_obs <- as.integer(na.omit(with(linelist_clean, date_of_onset - date_of_infection)))
summary(incub_obs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   8.019  10.000  37.000
hist(incub_obs, breaks = 0:40,
     xlab = "Days after infection", ylab = "Frequency",
     main = "Incubation period (empirical distribution)",
     col = "grey", border = "white")

As for the serial interval, use the function fit_disc_gamma to fit a Gamma distribution to these empirical estimates.

incub_fit <- fit_disc_gamma(incub_obs)
incub_fit
## $mu
## [1] 8.518541
## 
## $cv
## [1] 0.6840563
## 
## $sd
## [1] 5.827161
## 
## $ll
## [1] -309.7443
## 
## $converged
## [1] TRUE
## 
## $distribution
## A discrete distribution
##   name: gamma
##   parameters:
##     shape: 2.13705814653622
##     scale: 3.98610612672765
incub <- incub_fit$distribution

Compare this distribution to the empirical data to make sure the results seem reasonable:

hist(incub_obs, xlab = "Days after infection", ylab = "Frequency",
     main = "Incubation period: fit to data", col = "salmon", border = "white",
     50, ylim = c(0, 0.15), freq = FALSE, breaks = 0:40)
points(0:60, incub$d(0:60), col = "#9933ff", pch = 20)
points(0:60, incub$d(0:60), col = "#9933ff", type = "l", lty = 2)

An alternative approach here would be to use estimates of the mean and standard deviation of the incubation period and the generation time published in the literature. From this, one would need to use epitrix to convert these parameters into shape and scale for a Gamma distribution, and then use distcrete to generate discretised distributions.

Using the original outbreaker model

The original outbreaker model combined temporal information (here, dates of onset) with sequence data to infer who infected whom. Here, we use outbreaker2 to apply this model to the data.

All inputs to the new outbreaker function are prepared using dedicated functions, which make a number of checks on provided inputs and define defaults. Make sure you names the dates vector, so the DNA labels can be linked to their respective cases:

dates <- linelist_clean$date_of_onset
names(dates) <- linelist_clean$case_id
data <- outbreaker_data(dates = dates, # dates of onset
                        dna = dna, # WGS
                        w_dens = si$d(1:100), # generation time distribution
                        f_dens = incub$d(1:100) # incubation period distribution
                        )

We also create a configuration, which determines different aspects of the analysis, including which parameters need to be estimated, initial values of parameters, the length of the MCMC, etc. If you want see a progress bar to indicate how long the analysis will take, add an additional option pb = TRUE:

ob_config <- create_config(move_kappa = FALSE, # don't look for missing cases
                           move_pi = FALSE, # don't estimate reporting
                           init_pi = 1, # set reporting to 1
                           find_import = FALSE, # don't look for additional imported cases
                           init_tree = "star" # star-like tree as starting point
                           )

We can now run the analysis. This should take a couple of minutes on modern laptops. Note the use of set.seed(0) to have identical results across different users and computers:

set.seed(0)
res_basic <- outbreaker(data = data, config = ob_config)
res_basic
## 
## 
##  ///// outbreaker results ///
## 
## class:  outbreaker_chains data.frame
## dimensions 201 rows,  506 columns
## ancestries not shown: alpha_1 - alpha_166
## infection dates not shown: t_inf_1 - t_inf_166
## intermediate generations not shown: kappa_1 - kappa_166
## 
## /// head //
##   step      post      like    prior           mu pi eps lambda
## 1    1 -4212.492 -4214.795 2.302485 1.000000e-04  1 0.5   0.05
## 2   50 -1195.248 -1197.551 2.302578 6.670115e-06  1 0.5   0.05
## 3  100 -1172.138 -1174.441 2.302578 6.670115e-06  1 0.5   0.05
## 
## ...
## /// tail //
##      step      post      like   prior           mu pi eps lambda
## 199  9900 -1123.955 -1126.258 2.30258 4.851635e-06  1 0.5   0.05
## 200  9950 -1131.066 -1133.369 2.30258 4.851635e-06  1 0.5   0.05
## 201 10000 -1156.391 -1158.694 2.30258 4.851635e-06  1 0.5   0.05
plot(res_basic)

plot(res_basic, burn = 500)

The first two plots show the trace of the log-posterior densities (with, and without burnin). See ?plot.outbreaker_chains for details on available plots. Graphics worth looking at include:

## ancestry plot
## axis labels are removed because they are too small to make out
plot(res_basic,
     type = "alpha",
     burnin = 500) +
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())

## infection dates
plot(res_basic,
     type = "t_inf",
     burnin = 500) +
  theme_minimal() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

# mutation rate
plot(res_basic, "mu", burn = 500, type = "density")

# transmission trees
p <- plot(res_basic,
          type = "network",
          burnin = 500,
          min_support = .05) 
p

Explain briefly what each of these plots shows, and what (if any) conclusions can be drawn from them.

As a further help for interpretation, you can derive a consensus tree from the posterior samples of trees using summary. Look in particular at the support column.

smry_basic <- summary(res_basic)
head(smry_basic$tree)
##   from to time    support generations
## 1   NA  1   -4 0.04477612          NA
## 2    1  2    4 0.75124378           1
## 3    9  3   10 0.34825871           1
## 4    9  4   15 0.21890547           1
## 5    8  5   14 0.21393035           1
## 6    9  6   12 0.31840796           1
tail(smry_basic$tree)
##     from  to time    support generations
## 161  107 161   70 0.04477612           1
## 162  133 162   70 0.09950249           1
## 163  103 163   67 0.19402985           1
## 164  144 164   72 0.27860697           1
## 165   99 165   59 0.06467662           1
## 166  109 166   71 0.04477612           1
hist(smry_basic$tree$support, col = "grey", border = "white",
     main = "Consensus ancestry: support", xlim = c(0,1))

How would you interpret this result? How well resolved is the transmission tree?

As a point of comparison, repeat the same analysis using temporal data only, and plot a graph of ancestries (type = "alpha"); you should obtain something along the lines of:

set.seed(0)

data <- outbreaker_data(dates = linelist_clean$date_of_onset, # dates of onset
                        w_dens = si$d(1:100), # generation time distribution
                        f_dens = incub$d(1:100) # incubation period distribution
                        )
res_time <- outbreaker(data = data, config = ob_config)

## ancestry plot
## axis labels are removed because they are too small to make out
plot(res_time,
     type = "alpha",
     burnin = 500) +
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())

Create a consensus tree, visualise the support values and compare to the results when using temporal and genetic data.

smry_time <- summary(res_time)
hist(smry_time$tree$support, col = "grey", border = "white",
     main = "Consensus ancestry: support", xlim = c(0,1))

What is the usefulness of temporal and genetic data for outbreak reconstruction? What other data would you ideally include?

Adding contact data to the reconstruction process

outbreaker2 natively accepts epicontacts objects. Therefore all we need to do is make an epi_contacts_clean object that contains contacts between cases in linelist_clean, and pass it to the ctd argument in outbreaker_data. Make sure the epicontacts object is defined as the contacts we have are directional (i.e. specify an infector and infectee).

## only keep linelist and contacts elements in linelist_clean
epi_contacts_clean <- epi_contacts[i = linelist_clean$case_id, ## keep linelist
                                   j = linelist_clean$case_id, ## keep contacts
                                   contacts = 'both'] ## both contacts must be in linelist
epi_contacts_clean$directed ## check directionality
## [1] TRUE
data <- outbreaker_data(dates = linelist_clean$date_of_onset, # dates of onset
                        dna = dna, # dna sequences
                        ctd = epi_contacts_clean, # contact data
                        w_dens = si$d(1:100), # generation time distribution
                        f_dens = incub$d(1:100) # incubation period distribution
                        )

ob_config <- create_config(move_kappa = FALSE, # don't look for missing cases
                           move_pi = FALSE, # don't estimate reporting
                           init_pi = 1, # set reporting to 1
                           find_import = FALSE, # don't look for additional imported cases
                           init_tree = "star" # star-like tree as starting point
                           )

We are now ready to run the analysis. This may take a couple of minutes, depending on your computer:

set.seed(0)
res_full <- outbreaker(data = data, config = ob_config)
res_full
## 
## 
##  ///// outbreaker results ///
## 
## class:  outbreaker_chains data.frame
## dimensions 201 rows,  506 columns
## ancestries not shown: alpha_1 - alpha_166
## infection dates not shown: t_inf_1 - t_inf_166
## intermediate generations not shown: kappa_1 - kappa_166
## 
## /// head //
##   step      post      like    prior           mu pi        eps      lambda
## 1    1 -5228.625 -5230.928 2.302485 1.000000e-04  1 0.50000000 0.050000000
## 2   50 -1542.947 -1545.250 2.302566 1.883762e-05  1 0.07355283 0.018468232
## 3  100 -1401.486 -1403.789 2.302574 1.107595e-05  1 0.15644756 0.006344793
## 
## ...
## /// tail //
##      step      post      like    prior           mu pi       eps
## 199  9900 -1256.170 -1258.472 2.302579 5.999178e-06  1 0.3867233
## 200  9950 -1250.211 -1252.514 2.302583 1.999904e-06  1 0.3918788
## 201 10000 -1240.590 -1242.892 2.302583 1.999904e-06  1 0.3623597
##           lambda
## 199 0.0001219648
## 200 0.0001219648
## 201 0.0001219648

Produce graphics as in the previous model. Assess convergence, choose an appropriate burnin, visualise ancestries and the infection timelines:

Derive a consensus tree using summary, then visualise the results in the support column. This time set the option method = 'decycle' to remove any cycles in the consensus transmission tree.

smry_full <- summary(res_full, method = 'decycle')
head(smry_full$tree)
##   from to time   support generations
## 1   NA  1   -4 1.0000000          NA
## 2    1  2    2 1.0000000           1
## 3    1  3    8 0.3930348           1
## 4    6  4   16 0.2338308           1
## 5    3  5   13 0.9950249           1
## 6    2  6   12 0.3383085           1
smry_full$tree$support <- round(smry_full$tree$support, 2)

hist(smry_full$tree$support, col = "grey", border = "white",
     main = "Consensus ancestry: support", xlim = c(0,1))

How would you interpret the results? Compare the results to those using temporal data only, or temporal and genetic data. Can we always rely on contact data to estimate who infected whom?

Now make a new epicontacts object to visualise the consensus tree with meta-information. First convert the index labels in the consensus tree to the case IDs in linelist. Also include the estimated infection time from the consensus tree (smry_full$tree$time).

## match index labels to case IDs
smry_full$tree$from <- linelist_clean$case_id[smry_full$tree$from]
smry_full$tree$to <- linelist_clean$case_id[smry_full$tree$to]

## convert infection time estimates from integers to dates
linelist_clean$t_inf_est <- as.Date(smry_full$tree$time,
                                    origin = min(linelist_clean$date_of_onset))

cons_tree <- make_epicontacts(linelist_clean,
                              smry_full$tree,
                              directed = TRUE)

In the following, we create color palettes which will be used to display information on the final graph:

support_pal <- colorRampPalette(
    c("#918D98", "#645877", "#423359", "#281449", "#1A0340")
)

outcome_pal <- colorRampPalette(
    c("#3288BD", "#ABDDA4")
)

Looking carefully at the documentation of vis_epicontacts, try to reproduce the final consensus tree below:

epicontacts can now also display timed transmission trees using the method = 'ttree' argument. In the interactive plot below, the colour of the edges represents the posterior support, the color of the nodes the outcome, the size of the nodes individual reproductive number Ri.

Try adjusting the height and width arguments so the tree is cleary visible, then explore clusters of infection by zooming in and out and moving the tree around. Drag nodes around to improve clarity if need be. Hover your mouse over the nodes to get more information, single-click on a node to highlight its subtree and double-click on a node to collapse it. Set ttree_shape = 'rectangle' if you want a tree with right-angled edges. For more info, see ?vis_ttree.

q <- plot(cons_tree,
          method = 'ttree', ## show timed tree
          x_axis = 't_inf_est', ## use 't_inf_est' column for x_axis
          edge_color = "support", ## color edges by posterior support
          edge_col_pal = support_pal, ## specify color palette 
          node_size = 'R_i', ## node size reflects R_i
          ## edge_label = "support", ## not showing for readability
          ## node_shape = "gender", ## not showing for readability
          ## shapes = c(m = "male", f = "female"), ## specify shapes
          width = 1000, ## adjust to improve readability
          height = 1500, ## adjust to improve readability
          node_color = "outcome", ## color nodes by outcome
          hide_labels = TRUE, ## don't show node labels
          date_labels = "%d/%m", ## formatting date labels
          axis_type = 'double', ## show two axes
          col_pal = outcome_pal, ## specify color palette
          node_order = 'subtree_size', ## order nodes vertically by subtree size
          reverse_node_order = TRUE, ## put large subtrees as the bottom
          ttree_shape = 'branching' ## 
          )
q

Are there any conclusions can you draw from these figures?

Now plot the consensus tree, but only including links with more than 30% posterior support:

r <- plot(cons_tree[, cons_tree$contacts$support > 0.3],
          method = 'ttree', ## show timed tree
          x_axis = 't_inf_est', ## use 't_inf_est' column for x_axis
          edge_color = "support", ## color edges by posterior support
          edge_col_pal = support_pal, ## specify color palette 
          node_size = 'R_i', ## node size reflects R_i
          ## edge_label = "support",## not showing for readability
          ## node_shape = "gender", ## not showing for readability
          ## shapes = c(m = "male", f = "female"), ## specify shapes
          width = 1000, ## adjust to improve readability
          height = 1500, ## adjust to improve readability
          node_color = "outcome", ## color nodes by outcome
          hide_labels = TRUE, ## don't show node labels
          date_labels = "%d/%m", ## formatting date labels
          axis_type = 'double', ## show two axes
          col_pal = outcome_pal, ## specify color palette
          unlinked_pos = 'middle', ## show unlinked nodes in the middle
          root_order = 't_inf_est', ## order roots by time of infection
          reverse_root_order = TRUE ## show earliest nodes at the top
          )
r

Do you think we can reliably answer the question of who infected whom in this outbreak? How can we improve our estimates?