During an epidemic, public health authorities need an accurate count of infected individuals in order to stop the spread of disease. In most epidemics, this count is difficult to obtain because almost all of the data come from reported cases, and many cases are not reported. Therefore, to get an accurate count we must estimate the underreporting rate. There are two ways to model underreporting. In the first, every case has the same probability of being reported. In the second, cases have varying probabilities of being reported: the probability may depend either on the time and location of the case or on whether the previous case in the transmission tree was reported.
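The two reporting models can be made concrete with a minimal sketch. It assumes a toy transmission tree in which each new case is infected by a uniformly chosen earlier case; that tree model, the function names, and the parameters (`p`, `p_after_reported`, `p_after_unreported`, `p_root`) are illustrative assumptions, not part of the proposed method.

```python
import random


def simulate_tree(n_cases, rng):
    """Toy transmission tree (assumption): parents[i] is the infector of case i.

    Case 0 is the index case and has no infector (None). Every later case is
    infected by a uniformly chosen earlier case, so parents[i] < i.
    """
    parents = [None]
    for i in range(1, n_cases):
        parents.append(rng.randrange(i))
    return parents


def report_uniform(parents, p, rng):
    """Model 1: every case is reported with the same probability p."""
    return [rng.random() < p for _ in parents]


def report_dependent(parents, p_after_reported, p_after_unreported, p_root, rng):
    """Model 2 (one variant): the reporting probability of a case depends on
    whether its infector in the transmission tree was reported."""
    reported = []
    for parent in parents:
        if parent is None:
            prob = p_root
        elif reported[parent]:  # safe: the infector always precedes the case
            prob = p_after_reported
        else:
            prob = p_after_unreported
        reported.append(rng.random() < prob)
    return reported


def underreporting_rate(reported):
    """Fraction of cases that were not reported."""
    return 1 - sum(reported) / len(reported)


if __name__ == "__main__":
    rng = random.Random(0)
    tree = simulate_tree(1000, rng)
    print(underreporting_rate(report_uniform(tree, 0.4, rng)))
    print(underreporting_rate(report_dependent(tree, 0.7, 0.2, 0.4, rng)))
```

In the dependent variant, reported and unreported cases tend to cluster along the tree, which is the kind of structure a uniform model cannot produce.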
In this project I will use coalescent theory to study underreporting. Tracing backward in time from the tips of a transmission tree, the rate at which branches merge is the coalescent rate. The coalescent rate depends on the number of infected individuals, and so it can be used to estimate the actual number of cases. Most current coalescent methods assume that the transmission tree is known, but this is almost never the case. I will develop a method to estimate the underreporting rate that reconstructs the transmission tree from case and pathogen sequence data.
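To illustrate how a coalescent rate carries information about population size, the sketch below uses the standard constant-size Kingman coalescent, in which k lineages coalesce at total rate k(k-1)/(2N). This is a simplifying assumption made only for illustration: an epidemic has a changing number of infected individuals, and the proposed method is not limited to this model. The function names are hypothetical.

```python
import random
from math import comb


def simulate_intervals(n_samples, pop_size, rng):
    """Simulate waiting times between coalescent events (Kingman, constant size).

    While k lineages remain, the time to the next merger is exponential with
    rate C(k, 2) / pop_size. Returns a list of (k, waiting_time) pairs.
    """
    intervals = []
    for k in range(n_samples, 1, -1):
        rate = comb(k, 2) / pop_size
        intervals.append((k, rng.expovariate(rate)))
    return intervals


def estimate_pop_size(intervals):
    """Maximum-likelihood estimate of the population size.

    Each scaled interval C(k, 2) * t_k is exponential with mean pop_size,
    so the estimate is the average of the scaled intervals.
    """
    scaled = [comb(k, 2) * t for k, t in intervals]
    return sum(scaled) / len(scaled)


if __name__ == "__main__":
    rng = random.Random(0)
    intervals = simulate_intervals(200, 5000, rng)
    print(estimate_pop_size(intervals))
```

Under these assumptions, comparing such an estimate with the number of reported cases would give a rough sense of the fraction of cases that went unreported.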
I will study the population genetics of underreporting in three stages. First, I will construct a stochastic simulation model of a transmission tree with different types of reporting and determine the underreporting rate from the known coalescent rate. Second, I will simulate the evolution of sequences along each tree and test the method with sequences sampled at different time points. Finally, I will test the method on real genomic data from the 2014-2015 Ebola virus disease outbreak in West Africa and the ongoing whooping cough outbreak in the United States.