School of Mathematics and Statistics - Theses

Now showing 1 - 7 of 7
  • Item
    Quantitative Epidemiology: A Bayesian Perspective
    Zarebski, Alexander Eugene ( 2019)
    Influenza inflicts a substantial burden on society, but accurate and timely forecasts of seasonal epidemics can help mitigate this burden by informing interventions to reduce transmission. Recently, both statistical (correlative) and mechanistic (causal) models have been used to forecast epidemics. However, since mechanistic models are based on the causal process underlying the epidemic, they are poised to be more useful in the design of intervention strategies. This study investigates approaches to improving epidemic forecasting using mechanistic models. In particular, it reports on efforts to improve a forecasting system targeting seasonal influenza epidemics in major cities across Australia. To improve the forecasting system we first needed a way to benchmark its performance. We investigate model selection in the context of forecasting, deriving a novel method which extends the notion of Bayes factors to a predictive setting. Applying this methodology, we found that accounting for seasonal variation in absolute humidity improves forecasts of seasonal influenza in Melbourne, Australia. This result holds even when accounting for the uncertainty in predicting seasonal variation in absolute humidity. Our initial attempts to forecast influenza transmission with mechanistic models were hampered by high levels of uncertainty in forecasts produced early in the season. While substantial uncertainty seems inextricable from long-term prediction, it seemed plausible that historical data could assist in reducing this uncertainty. We define a class of prior distributions which simplify the process of incorporating existing knowledge into an analysis, and in doing so offer a refined interpretation of the prior distribution. As an example, we used historical time series of influenza epidemics to reduce initial uncertainty in forecasts for Sydney, Australia. We also explore potential pitfalls that may be encountered when using this class of prior distribution. Deviating from the theme of forecasting, we consider the use of branching processes to model early transmission in an epidemic. An inhomogeneous branching process is derived which allows the study of transmission dynamics in this early phase. A generation-dependent offspring distribution allows the branching process to have sub-exponential growth on average. The multi-scale nature of a branching process allows us to utilise both time series of incidence and infection networks. This methodology is applied to data collected during the 2014–2016 Ebola epidemic in West Africa, leading to the inference that transmission grew sub-exponentially in Guinea, Liberia and Sierra Leone. Throughout this thesis, we demonstrate the utility of mechanistic models in epidemiology and how a Bayesian approach to statistical inference complements them.
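    As a rough illustration of the branching-process idea above, the sketch below (Python, using only numpy) simulates a Galton-Watson process whose Poisson offspring mean decays with the generation index, which yields sub-exponential growth on average. The decay form and all parameter values are illustrative assumptions, not the parametrisation used in the thesis.

    # A Galton-Watson branching process with a generation-dependent offspring
    # distribution. When the offspring mean decays towards 1, cumulative cases
    # grow sub-exponentially on average. The decay form
    # mean_g = 1 + (R0 - 1) * (g + 1) ** -alpha is an illustrative assumption only.
    import numpy as np

    rng = np.random.default_rng(1)

    def simulate(R0=2.0, alpha=0.5, generations=20):
        sizes = [1]  # one index case in generation 0
        for g in range(generations):
            mean_g = 1.0 + (R0 - 1.0) * (g + 1) ** -alpha  # generation-dependent mean
            offspring = int(rng.poisson(mean_g, size=sizes[-1]).sum())
            sizes.append(offspring)
            if offspring == 0:
                break
        return np.array(sizes)

    sizes = simulate()
    print("cases per generation:", sizes)
    print("cumulative cases:    ", np.cumsum(sizes))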
  • Item
    Mechanistic and statistical models of skin disease transmission
    Lydeamore, Michael J. ( 2018)
    At any one time, more than 160 million children worldwide are infected with skin sores. In remote Aboriginal Australian communities, prevalence is as high as 40%. Skin sores infected with Group A Streptococcus (GAS) can lead to a number of acute and chronic health conditions. One of the primary risk factors for GAS infection is scabies, caused by a small mite which breaks the skin layer, potentially allowing skin sore infection to take hold. This biological connection is reaffirmed by the observation that mass treatment for scabies in these remote communities has been associated with a reduction in the prevalence of skin sore infection, despite skin sores not being directly targeted. In the most extreme case, it has been hypothesised that the eradication of scabies in remote communities may lead to an eradication of skin sore related infection. Mass drug administration (MDA) is the standard intervention for tackling the high prevalence of disease in these remote settings, but despite more than 20 years of implementation, sustained reductions in prevalence have not been achieved. My aim in this thesis is to develop and analyse both mechanistic and statistical models of skin sores and scabies, considering the dynamics of each disease in isolation and coupled together. These models build a framework on which control strategies can be tested, with the aim of developing strategies that will lead to sustained reductions in prevalence. Following a biological introduction and technical information (Chapters 1 and 2), a mechanistic model for scabies infection is introduced. This model includes the dynamics of the life-cycle of the scabies mite, incorporating two methods of treatment for the infection. Mass drug administration strategies are also modelled. The optimal interval between successive MDA doses is calculated to be approximately two weeks. The analysis shows that an optimally timed two-dose, 100% effective, 100% coverage MDA is highly unlikely to lead to the eradication of scabies. In fact, four optimally timed successive doses are required for a probability of eradication greater than 1/1000. Next, an annually recurring MDA program is considered, in which some number of optimally timed doses is administered each year. It is shown that increasing the number of administered doses always increases the probability of eradication. Importantly, moving from a two-dose to a three-dose annual strategy significantly increases the probability of eradication of scabies infection. In order to parameterise a dynamic transmission model for skin sores, at least two key quantities must be estimated: the force of infection and the infectious period. The study in Chapter 4 estimates the age at first infection, which is the inverse of the force of infection, using clinic presentation data for children from birth up to five years of age. Three survival models are considered: the Kaplan-Meier estimator, the Cox proportional hazards model, and the parametric exponential mixture model. The mean age at first infection is estimated to be ten months for skin sores, and nine months for scabies. The work in Chapter 5 estimates both the force of infection and the infectious period by utilising a linearised infectious disease model. The data considered in this chapter consist of longitudinal observations of individuals across three studies. The methodology is verified using simulation estimation, and each dataset is tested to ensure it carries sufficient information for use with the estimation method. The estimates for the force of infection vary by an order of magnitude between settings. Estimates of the infectious period are relatively constant at 12–20 days. Chapter 6 presents a dynamic model of skin sore transmission coupled with models of scabies transmission. Three different scabies models are considered. The first assumes that the dynamics of scabies are at equilibrium. In this case, analytical expressions for key epidemiological quantities can be derived, and the scabies prevalence below which skin sores will be eradicated can be calculated. Next, two dynamic models of scabies are considered. The first of these is the scabies model introduced in Chapter 3, which includes the full life-cycle of the scabies mite and treatment mechanisms. The second model consists of just three compartments and is termed the SITS model. The SITS model collapses the complex life-cycle of the scabies mite into two compartments. The differences in dynamics between these two scabies models are analysed, and the impact on the prevalence of skin sores of an MDA which directly targets only scabies is considered. The comparison shows that, relative to the full model, the SITS model overestimates the impact of the MDA on skin sore prevalence in the period immediately following the MDA, but also predicts an earlier return to the pre-MDA endemic prevalence. The SITS model also estimates a higher probability of eradication of skin sores compared to the full model. These two results demonstrate that caution is needed when using an approximation of the scabies mite life-cycle to assess the potential impact of an MDA that targets only scabies. Finally, Chapter 7 summarises the work presented in my thesis, discusses its limitations and explores potential future directions for this research problem.
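    As a pointer to how the first of the three survival models named above works, the sketch below computes the Kaplan-Meier estimate of the probability of remaining uninfected by a given age, using made-up ages at first clinic presentation in months; the data, the censoring pattern and the monthly scale are illustrative only and are not the thesis data.

    # Kaplan-Meier estimator on illustrative right-censored ages at first
    # infection (months); event = 1 means an infection was observed, 0 means
    # the child was censored before any infection was seen.
    import numpy as np

    ages   = np.array([3, 5, 6, 8, 10, 10, 12, 15, 20, 24], dtype=float)
    events = np.array([1, 1, 0, 1,  1,  1,  0,  1,  1,  0])

    order = np.argsort(ages)
    ages, events = ages[order], events[order]

    surv = 1.0
    print("age (months)  S(age)")
    for t in np.unique(ages[events == 1]):
        at_risk = np.sum(ages >= t)              # still uninfected and uncensored just before t
        d = np.sum((ages == t) & (events == 1))  # infections observed at t
        surv *= 1.0 - d / at_risk
        print(f"{t:12.0f}  {surv:.3f}")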
  • Item
    Statistical models for the location of lightning-caused wildfire ignitions
    Read, Nicholas ( 2018)
    Lightning-caused wildfire is a significant concern for fire management agencies worldwide. Unlike other ignition sources, lightning fires often occur in remote and inaccessible locations, making detection and suppression particularly challenging. Furthermore, individual lightning storms result in a large number of fires clustered in space and time, which can overwhelm suppression efforts. Victoria, Australia, is one of the most fire-prone environments in the world, and the increased frequency of large-scale landscape fires over the last decade is of particular concern to local wildfire management authorities. This thesis is concerned with modelling lightning-caused wildfire ignition locations in Victoria. Such models could be used for predicting daily lightning-caused ignition likelihood as well as for simulating realistic point patterns for use in fire spread models for risk analyses. The first half of this thesis looks at regression models. We review methods for model selection, validation, approximation and interpretation of generalised additive models. A review of performance metrics, such as the AUC, shows the difficulties and subtleties involved in evaluating the predictive performance of models. We apply this theory to construct a non-linear logistic regression model for lightning-caused wildfires in Victoria. The model operates on a daily time scale, with a spatial resolution of 20 km, and uses covariate data including fuel moisture indices, vegetation type, a lightning potential index and weather. We develop a simple method to deconstruct model output into contributions from each of the individual covariates, allowing predictions to be explained in terms of the weather conditions driving them. Using these ideas, we discuss ranking the relative 'importance' of covariates in the model, leading to an approximating model with similar performance to the full model. The second half of this thesis looks at point process models for lightning-caused ignitions. We introduce general theory for point processes, focusing on the inhomogeneous Poisson process, cluster processes and replicated point patterns. The K-function is a useful summary function for describing the spatial correlation of point patterns and for fitting models. We present a method for pooling multiple estimates of the K-function, such as those that arise when using replicated point patterns, intended to reduce bias. We fit an inhomogeneous Poisson process model as well as Thomas and Cauchy cluster process models to the Victorian lightning-caused ignition data set. The cluster process models prove to have a significantly better fit than the Poisson process model, but still struggle to reproduce the complex behaviour of the physical process.
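    To make the clustering mechanism concrete, the sketch below simulates a stationary Thomas cluster process on the unit square: Poisson-distributed parents, a Poisson number of offspring per parent, and isotropic Gaussian displacements. The parameter values are arbitrary, and the models fitted in the thesis are inhomogeneous, so this is only a minimal illustration of the process family, not the fitted model.

    # Simulate a stationary Thomas cluster process on the unit square.
    import numpy as np

    rng = np.random.default_rng(7)
    kappa, mu, sigma = 20, 10, 0.02  # parent intensity, mean offspring per parent, cluster spread

    n_parents = rng.poisson(kappa)                    # homogeneous Poisson parents
    parents = rng.uniform(0.0, 1.0, size=(n_parents, 2))

    clusters = [p + rng.normal(0.0, sigma, size=(rng.poisson(mu), 2)) for p in parents]
    points = np.vstack(clusters) if clusters else np.empty((0, 2))
    points = points[(points >= 0).all(axis=1) & (points <= 1).all(axis=1)]  # clip to the window

    print(f"{len(points)} clustered ignition-like points in the unit square")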
  • Item
    Resolving some high dimensional problems in bioinformatics
    Rodrigues, Sabrina ( 2016)
    "It has been claimed and demonstrated that many of the conclusions drawn from biomedical research are probably false" (Button et al, 2013). ``Toth et al. (2010) make a common mistake of seeing structure where none exists" Chowdhury et al. (2015). "...the performance of their particular set of 150 probe-sets does not stand out compared to that of randomly sampled sets of 150 probe-sets from the same array” Jacob et al. (2016). These are statements that have emerged over the last few years in the biomedical and bioinformatics literature. Many of the studies in bioinformatics fall into the “small n, large p” category, where numerous statistical difficulties emerge. Rather than identifying studies that have false results, this thesis provides a unifying, theoretical and computational solution to two high dimensional problems that arise in these studies: the problem of variable selection-dimension reduction; and, that of sample size determination. We first consider these problems separately and from a theoretical viewpoint. Next we show how our methods can be applied to different applications in bioinformatics. For the first problem, we revisit and explain the concepts of central subspaces and conditional independence to develop a novel matrix decomposition - the ST matrix decomposition. This decomposition forms the basis of our two new computational methods that involve covariance matrices and covariance operators, the stMVN and the stKDR methods, respectively. For the second problem, we review the Kullback-Leibler divergence and use independent mixture models to develop a new formula for sample size determination for the high dimensional setting where p>>n. We then combine the solutions to these two problems to demonstrate through simple simulations, that when sample sizes are determined according to our formula, it is possible to detect the true predictors that are associated with a response variable. Finally, we show how our unifying solution can be applied in practice to several bioinformatics studies.
  • Item
    Transforms and truncations of time series
    Beaumont, Adrian N. ( 2015)
    A time series can be defined as a collection of random variables indexed according to the order in which they are obtained in time. Examples of time series are monthly Australian retail sales, or quarterly GDP data. Forecasting of time series is generally considered much more important than fitting. Models that use exponential smoothing methods have been found to perform well on time series. Chapter 2 describes the estimation and forecasting procedures for additive forms of time series models; these include the local level model, the local trend model, the damped trend model, and their seasonal equivalents. This chapter also briefly discusses some other time series methods, and introduces the M3-competition data that are used extensively in this thesis. Models that include multiplicative components for time series are considered in Chapter 3, increasing the total number of possible models from 6 to 30. While multiplicative models are often better than purely additive models, model selection methods using all combinations of multiplicative and additive models are found to be no better statistically than selecting among the purely additive models alone; model selection methods are confused by the large number of possible models. In this thesis, transforms and truncations are used with exponential smoothing, in the quest for better forecasts of time series. Two types of transforms are explored: those applied directly to a time series, and those applied indirectly, to the prediction errors. The various transforms are tested on a large number of time series from the M3-competition data, and analysis of variance (ANOVA) is applied to the results. We find that the non-transformed series gives significantly worse forecasts than some transforms on the monthly data, and on a distribution-based performance measure for both annual and quarterly data. To try to understand why the transforms perform as they do, a simulation study was carried out, using simulations from a paper on outliers. Three types of simulations were used: a Level Shift permanently shifts the series to a new level; an Additive Outlier increases the series for only one time period; and a Transitory Change gradually reverts the series to the old level after the jump point. The non-transformed series were significantly worse than some transforms on some simulation types. Truncations are applied so that there is no possibility of obtaining a value below zero for a strictly positive time series. There are two types of truncations: those applied only to the forecasts, and those applied to both the fits and the forecasts. Using the same methods as for the transforms, we found that the truncations worked better when applied only to the forecasts, but the non-truncated model was never significantly worse than any truncation. Chapter 7 combines transforms with truncations. We find that applying the heteroscedastic state space transform with a truncated normal distribution significantly improves forecasts over the non-transformed results. The final chapter of this thesis investigates how various properties of time series affect forecasting performance. Of particular interest is the finding that a measure commonly used to assess prediction performance is flawed.
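    For reference, the simplest of the additive models mentioned above, the local level model (simple exponential smoothing), can be written in a few lines; its h-step-ahead forecasts are flat at the last smoothed level. The series and smoothing parameter below are illustrative and are not M3-competition data.

    # Local level model / simple exponential smoothing with a fixed smoothing parameter.
    import numpy as np

    y = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0])
    alpha = 0.3  # smoothing parameter (illustrative; normally estimated from the data)

    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1.0 - alpha) * level  # exponentially discounted average

    horizon = 4
    print("flat h-step forecasts:", np.round(np.repeat(level, horizon), 2))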
  • Item
    The statistical analysis of high-throughput assays for studying DNA methylation
    Hickey, Peter ( 2015)
    DNA methylation is an epigenetic modification that plays an important role in X-chromosome inactivation, genomic imprinting and the repression of repetitive elements in the genome. It must be tightly regulated for normal mammalian development, and aberrant DNA methylation is strongly associated with many forms of cancer. This thesis examines the statistical and computational challenges raised by high-throughput assays of DNA methylation, particularly the current gold-standard assay, whole-genome bisulfite-sequencing. Using whole-genome bisulfite-sequencing, we can now measure DNA methylation at individual nucleotides across entire genomes. These experiments produce vast amounts of data that require new methods and software to analyse. The first half of the thesis outlines the biological questions of interest in studying DNA methylation, the bioinformatics analysis of these data, and the statistical questions we seek to address. In discussing these bioinformatics challenges, we develop software to facilitate novel analyses of these data. We pay particular attention to analyses of methylation patterns along individual DNA fragments, a novel feature of sequencing-based assays. The second half of the thesis focuses on co-methylation, the spatial dependence of DNA methylation along the genome. We demonstrate that previous analyses of co-methylation have been limited by inadequate data and deficiencies in the applied statistical methods. This motivates a study of co-methylation from 40 whole-genome bisulfite-sequencing samples. These 40 samples represent a diverse range of tissues, from embryonic and induced pluripotent stem cells, through to somatic cells and tumours. Making use of software developed in the first half of the thesis, we explore different measures of co-methylation and relate these to one another. We identify genomic features that influence co-methylation and how it varies between different tissues. In the final chapter, we develop a framework for simulating whole-genome bisulfite-sequencing data. Simulation software is valuable when developing new analysis methods since it can generate data on which to assess a method's performance and benchmark it against competing methods. Our simulation model is informed by our analyses of the 40 whole-genome bisulfite-sequencing samples and our study of co-methylation.
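    One of the simplest co-methylation summaries is the correlation between methylation levels at neighbouring CpGs, stratified by the genomic distance separating them. The sketch below shows the shape of that calculation on simulated placeholder values; the simulated levels carry no real spatial structure and are not whole-genome bisulfite-sequencing data.

    # Correlation of methylation levels at neighbouring CpGs, split by distance.
    import numpy as np

    rng = np.random.default_rng(0)
    pos = np.sort(rng.choice(1_000_000, size=2_000, replace=False))  # CpG coordinates (bp)
    beta = rng.beta(0.5, 0.5, size=pos.size)                         # methylation level per CpG

    left, right = beta[:-1], beta[1:]  # neighbouring CpG pairs
    dist = np.diff(pos)                # genomic distance between pair members

    for label, mask in [("< 500 bp", dist < 500), (">= 500 bp", dist >= 500)]:
        r = np.corrcoef(left[mask], right[mask])[0, 1]
        print(f"neighbour correlation {label}: {r:.3f} (n = {mask.sum()})")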
  • Item
    The performance of multiple hypothesis testing procedures in the presence of dependence
    Clarke, Sandra Jane ( 2010)
    Hypothesis testing is foundational to the discipline of statistics. Procedures exist which control individual Type I error rates as well as more global, family-wise error rates for a series of hypothesis tests. However, the ability of scientists to produce very large data sets with increasing ease has led to a rapid rise in the number of statistical tests performed, often with small sample sizes. This is seen particularly in the area of biotechnology and the analysis of microarray data. This thesis considers this high-dimensional context with particular focus on the effects of dependence on existing multiple hypothesis testing procedures. While dependence is often ignored, many techniques are currently employed to deal with it, but these are typically highly conservative or require the difficult estimation of large correlation matrices. This thesis demonstrates that, in this high-dimensional context, when the distribution of the test statistics is light-tailed, dependence is not as much of a concern as in classical contexts. This is achieved with the use of a moving average model. One important implication is that, when this condition is satisfied, procedures designed for independent test statistics can be used confidently on dependent test statistics. This is not the case, however, for heavy-tailed distributions, where we expect an asymptotic Poisson cluster process of false discoveries. In these cases, we estimate the parameters of this process, along with the tail weight, from the observed exceedances and attempt to adjust the procedures accordingly. We consider both conservative error rates, such as the family-wise error rate, and more popular measures, such as the false discovery rate. We are able to demonstrate that, in the context of DNA microarrays, it is rare to find heavy-tailed distributions because most test statistics are averages.
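    For reference, the sketch below implements the standard Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The p-values are made up, and the sketch says nothing about the dependence adjustments studied in the thesis; it is only the baseline procedure whose behaviour under dependence is examined.

    # Benjamini-Hochberg step-up procedure at FDR level q.
    import numpy as np

    def benjamini_hochberg(pvals, q=0.05):
        p = np.asarray(pvals, dtype=float)
        m = p.size
        order = np.argsort(p)
        thresholds = q * np.arange(1, m + 1) / m  # i * q / m for the i-th smallest p-value
        below = p[order] <= thresholds
        rejected = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])      # largest i with p_(i) <= i * q / m
            rejected[order[:k + 1]] = True
        return rejected

    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print("rejected:", benjamini_hochberg(pvals))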