School of Mathematics and Statistics - Theses

  • Item
    Inference under the coalescent with recombination
    Mahmoudi, Ali (2020)
    Inferring the genealogical history, also known as the Ancestral Recombination Graph (ARG), of a set of DNA sequences has been a central challenge in population genetics for decades. Reconstructing the actual ARG simplifies many inference problems in population genetics. Many different methods have been proposed for inferring the ARG, most of which are limited in size and accuracy. The state-of-the-art probabilistic model, ARGweaver, provides substantial improvements over other methods but uses a discretized version of the Sequentially Markov Coalescent (SMC), which is an approximation of the Coalescent with Recombination (CwR) and ignores a significant amount of information in the ARG. In this thesis, I develop a novel Markov Chain Monte Carlo (MCMC) algorithm, implemented in the software ARGinfer, to perform probabilistic inference under the CwR. This method takes advantage of the superior properties of the Tree Sequence (TS), which is an efficient data structure to store the genealogical trees in an ARG so that the identical subtrees of the neighboring trees are recorded only once. I first devise a data structure to represent the ARG and the mutation information by augmenting the TS. Then, I develop a heuristic algorithm to construct an ARG consistent with the data, which is used as an initial value for the MCMC algorithm. Computing both the prior (CwR model) and the likelihood under an approximation to the infinite sites model is relatively straightforward and fast. The challenging part is to explore the ARG space, for which I introduce a proposal distribution in the form of six transition types to rearrange both the topology and the event times. I demonstrate the utility of ARGinfer by applying it to simulated data sets. ARGinfer can accurately estimate many ARG-derived parameters such as the total branch length, number of recombination events, time to the most recent common ancestor, recombination rate, and allele ages. I also compare ARGinfer against ARGweaver.
Since ARGinfer assumes a more complex evolutionary model than ARGweaver, it can infer a larger class of parameters. ARGinfer outperforms ARGweaver in estimating the recombination rate and is at least as accurate for other parameters that ARGweaver can infer. ARGinfer also accurately estimates parameters that ARGweaver cannot, such as the number of recombinations on trapped non-ancestral materials.
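
Two of the ARG-derived quantities named above, the time to the most recent common ancestor and the total branch length, can be illustrated with a minimal sketch. It assumes the standard coalescent without recombination, which is far simpler than the CwR model the thesis works under, and it is not ARGinfer code. The function names are hypothetical, and time is measured in coalescent units, so that k lineages coalesce at rate k(k-1)/2.

```python
import random


def simulate_coalescent(n: int, rng: random.Random):
    """Simulate one genealogy of n samples; return (tmrca, total_branch_length).

    Assumption: standard coalescent, no recombination, coalescent time units.
    """
    tmrca = 0.0
    total_length = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2          # pairwise coalescence rate with k lineages
        wait = rng.expovariate(rate)    # waiting time until the next coalescence
        tmrca += wait
        total_length += k * wait        # k branches each grow by `wait`
    return tmrca, total_length


def expected_tmrca(n: int) -> float:
    """Closed form: sum over k of 2 / (k (k - 1)) = 2 (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)


def expected_total_length(n: int) -> float:
    """Closed form: 2 * (1 + 1/2 + ... + 1/(n - 1))."""
    return 2.0 * sum(1.0 / k for k in range(1, n))


def monte_carlo_means(n: int, replicates: int, seed: int = 1):
    """Average the two summaries over independent simulated genealogies."""
    rng = random.Random(seed)
    draws = [simulate_coalescent(n, rng) for _ in range(replicates)]
    mean_tmrca = sum(d[0] for d in draws) / replicates
    mean_length = sum(d[1] for d in draws) / replicates
    return mean_tmrca, mean_length
```

Comparing `monte_carlo_means(10, 20000)` with `expected_tmrca(10)` and `expected_total_length(10)` gives a quick sanity check of the simulation against the closed-form expectations.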
  • Item
    Missing data analysis, combinatorial model selection and structure learning
    Kwok, Chun Fung (2019)
    This thesis examines three problems in statistics: the missing data problem in the context of extracting trends from time series data, the combinatorial model selection problem in regression analysis, and the structure learning problem in graphical modelling / system identification. The goal of the first problem is to study how uncertainty in the missing data affects trend extraction. This work derives an analytical bound to characterise the error of the estimated trend in terms of the error of the imputation. It works for any imputation method and various trend-extraction methods, including a large subclass of linear filters and the Seasonal-Trend decomposition based on Loess (STL). The second problem is to tackle the combinatorial complexity which arises from the best-subset selection in regression analysis. Given $p$ variables, a model can be formed by taking a subset of the variables, and the total number of models is $2^p$. This work shows that if a hierarchical structure can be established on the model space, then the proposed algorithm, Gibbs Stochastic Search (GSS), can recover the true model with probability one in the limit and with high probability in finite samples. The core idea is that when a hierarchical structure exists, every evaluation of a wrong model gives information about the correct model. By aggregating this information, one may recover the correct model without exhausting the model space. As an extension, parallelisation of the algorithm is also considered. The third problem is about inferring from data the systemic relationship between a set of variables. This work proposes a flexible class of multivariate distributions in the form of a directed acyclic graphical model, which uses a graph and models each node conditioning on the rest using a Generalised Linear Model (GLM), and it shows that while the number of possible graphs is $\Omega(2^{p \choose 2})$, a hierarchical structure exists and the GSS algorithm applies.
Hence, a systemic relationship may be recovered from the data. Other applications like imputing missing data and simulating data with complex covariance structure are also investigated.
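
The two model-space sizes quoted above can be checked on tiny examples. The sketch below is only a counting illustration and does not implement GSS; all function names are hypothetical. The lower bound comes from fixing one node ordering and switching each of the $\binom{p}{2}$ forward edges on or off, since every such choice is acyclic.

```python
from itertools import combinations, permutations
from math import comb


def count_subset_models(p: int) -> int:
    """Number of candidate regression models: every subset of p variables."""
    return 2 ** p


def dag_lower_bound(p: int) -> int:
    """Lower bound on labelled DAGs from one fixed node ordering."""
    return 2 ** comb(p, 2)


def is_acyclic(p: int, edges) -> bool:
    """Kahn's algorithm: repeatedly remove nodes with no incoming edges."""
    indegree = [0] * p
    for _, child in edges:
        indegree[child] += 1
    ready = [v for v in range(p) if indegree[v] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return removed == p  # every node removed means no cycle remained


def count_dags_brute_force(p: int) -> int:
    """Exhaustively count labelled DAGs on p nodes (feasible only for tiny p)."""
    possible = list(permutations(range(p), 2))  # all ordered (parent, child) pairs
    total = 0
    for k in range(len(possible) + 1):
        for edges in combinations(possible, k):
            if is_acyclic(p, edges):
                total += 1
    return total
```

Even at `p = 4` the exhaustive count already exceeds the lower bound, and the brute-force search becomes infeasible within a few more variables, which is the motivation for not exhausting the model space.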
  • Item
    Influenza viral dynamics models to explore the roles of innate and adaptive immunity
    Yan, Ada W. C. (2017)
    A mathematical model which captures how the immune response controls influenza infection is essential for predicting the effects of pharmaceutical interventions, and alleviating the public health burden of the disease. However, current models do not agree on how immune response components work together to control infection. Hence, the predicted effects of existing treatments differ between models, implying that predictions of the effects of novel treatments may be unreliable. The discrepancies between models arise because many models are fitted only to viral load data from a single infection, from which it is difficult to distinguish between competing models. This study focuses on the construction of a viral dynamics model which reproduces experimental observations of the protection conferred by a primary infection against a subsequent infection. Incorporating observations from multiple experimental conditions enables more accurate extraction of the timing and strength of cross-immunity, and thus quantification of the roles of each component of the immune response. Following a literature review and mathematical preliminaries (Chapters 1-4), Chapter 5 details the analysis of data from experiments where ferrets are sequentially infected with two different influenza strains. The analysis shows that the protection conferred by a primary infection against a subsequent infection depends on the time between exposures as well as the strains used. Chapters 6 and 7 then present the construction of a viral dynamics model to show that the innate immune response can explain the delay of a secondary infection by a primary infection, and that both cross-reactivity and memory in the cellular adaptive immune response are required to explain the shortening of a secondary infection by a primary infection. The model also reproduces qualitative observations from a range of knockout experiments.
To quantify the roles of each immune component, the model must be fitted to experimental data. However, because of the large number of model parameters, a simulation estimation study is first conducted in Chapters 8 and 9. The study shows that rather than examining marginal posterior distributions, using a fitted model to make predictions more clearly elucidates the role of each immune component in controlling infection. Moreover, a model fitted to sequential infection data accurately recovers the timing and extent of cross-protection between strains, whereas insufficient information is available from single infection data to enable inference of these quantities. This represents significant progress in ensuring that results of data fitting are interpreted appropriately, such that the fitted model provides a sound foundation upon which to explore the effects of interventions. Lastly, Chapter 10 addresses the observation that under identical experimental conditions, some ferrets become infected when exposed to a second virus and some do not. Equations are derived for the extinction probability of the second virus for a stochastic version of one of the models in the study. The numerical solutions to these equations show that the dependence of the extinction probability on the inter-exposure interval is consistent with experimental observations. Thus, stochasticity in viral dynamics alone is a viable hypothesis for the difference in observed infection outcomes.
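
The idea of an extinction probability for a stochastic infection model can be illustrated with a much simpler stand-in than the thesis model: a linear birth-death process in which each infectious unit produces a new one at rate `birth_rate` and is cleared at rate `death_rate`. The parameter names, the population cap, and the functions below are assumptions made for this sketch only. For one initial unit, the known closed form is `min(1, death_rate / birth_rate)`.

```python
import random


def extinction_probability_exact(birth_rate: float, death_rate: float) -> float:
    """Closed-form extinction probability starting from a single unit."""
    if birth_rate <= death_rate:
        return 1.0
    return death_rate / birth_rate


def simulate_extinction(birth_rate: float, death_rate: float,
                        rng: random.Random, cap: int = 60) -> bool:
    """Follow the jump chain from one unit until it hits 0 or the cap.

    Assumption: once the population reaches `cap`, the infection is treated
    as established, because extinction from there is vanishingly unlikely
    when birth_rate > death_rate.
    """
    p_birth = birth_rate / (birth_rate + death_rate)
    n = 1
    while 0 < n < cap:
        n += 1 if rng.random() < p_birth else -1
    return n == 0


def estimate_extinction_probability(birth_rate: float, death_rate: float,
                                    trials: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate: fraction of simulated runs that go extinct."""
    rng = random.Random(seed)
    extinct = sum(simulate_extinction(birth_rate, death_rate, rng)
                  for _ in range(trials))
    return extinct / trials
```

In a setting like the one described above, an effective growth rate that changes with the inter-exposure interval would translate into a changing extinction probability; here the rates are simply fixed inputs.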