School of Mathematics and Statistics - Theses
Identification of molecular phenotypes and their regulation in cancer
Complex diseases manifest through the dysregulation of otherwise finely regulated transcriptional programs, resulting in functional alterations. Insight into altered transcriptional processes and their functional consequences allows molecular characterisation of disease phenotypes and can lead to the identification of potential therapeutic targets. Altered regulation of transcriptional programs can be identified using computational and statistical methods that infer gene regulatory networks which change between biological contexts. Numerous methods have been developed to infer such conditional relationships; however, extensive evaluations remain scarce because of a lack of validation data. I developed an evaluation framework that simulates transcriptomic data from a dynamical systems model of gene regulation. Using 812 simulated datasets with varying model parameters, I evaluated 14 different context-specific inference methods. The evaluation revealed that context-specific causative regulatory relationships were difficult to infer, while inferring context-specific co-expression was comparatively easy. Some variability in performance was attributable to properties of the global regulatory network structure. Applying the best-performing approach, a z-score method, to identify estrogen-receptor-specific regulatory relationships in a breast cancer dataset revealed an immune-related program that is regulated in basal-like breast cancers and dysregulated in all other subtypes. I identified a key gene in this network that was associated with immune infiltration in basal-like breast cancers. The result of any regulatory cascade is a molecular phenotype, such as the immune infiltration phenotype described above. Assessing these phenotypes aids characterisation of disease and can be used to guide therapies. Most methods for assessing molecular phenotypes cannot act on individual samples and therefore cannot be used in personalised medicine.
With colleagues, I developed a novel rank-based method, singscore, that assesses molecular phenotypes using transcriptomic measurements from an individual sample. I evaluated the new method in a variety of applications, ranging from molecular phenotyping to sample stratification, and benchmarked it against other single-sample methods. I then demonstrated three applications of this flexible rank-based approach: inferring and investigating the epithelial-mesenchymal landscape in breast cancer, inferring NPM1c mutation status in acute myeloid leukemia, and prioritising gene sets with stably expressed genes. While the transcriptome clearly contains abundant information of potential clinical use, translating molecular phenotyping into readily adoptable clinical applications would require reducing the number of transcriptomic measurements to tens or hundreds of transcripts. This would reduce both the cost of potential clinical assays and the amount of transcriptomic material required for molecular phenotyping. I developed a method that uses genes with stable expression to drastically reduce the number of transcriptomic measurements required for molecular phenotyping, and showed that molecular phenotype assessments using these reduced measurement sets are comparable to those performed using transcriptome-wide measurements. The stable genes identified in this analysis have broader scope for use than similar previously identified sets, enabling other applications such as batch-effect correction and normalisation across a wide range of transcriptomic and other data types. In summary, my PhD developed methodology to understand the molecular state of biological systems in a context-specific manner.
This was done by identifying context-specific gene regulatory networks and then assessing, in a clinical setting, the molecular phenotypes that result from context-specific regulation. This work highlights the importance of context-specific analysis in disease and demonstrates the importance and utility of comprehensive benchmarks. It also highlights the need to develop clinically applicable analysis methods to achieve the eventual goal of personalised medicine.
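The rank-based idea behind singscore can be sketched in a few lines. This is a deliberately simplified illustration, not the published package's implementation (the actual method also handles up- and down-regulated gene sets and score dispersion); the gene names below are hypothetical:

```python
def rank_genes(expression):
    """Rank genes within a single sample (1 = lowest expression)."""
    ordered = sorted(expression, key=expression.get)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}

def single_sample_score(expression, signature):
    """Mean rank of the signature genes, rescaled to [0, 1] using the
    theoretical minimum and maximum attainable mean ranks."""
    ranks = rank_genes(expression)
    n, m = len(expression), len(signature)
    mean_rank = sum(ranks[g] for g in signature) / m
    lowest = (m + 1) / 2        # signature genes occupy the m lowest ranks
    highest = n - (m - 1) / 2   # signature genes occupy the m highest ranks
    return (mean_rank - lowest) / (highest - lowest)

# Hypothetical 4-gene sample: the two signature genes are the most highly
# expressed, so the score reaches its maximum of 1.
sample = {"GENE_A": 5.0, "GENE_B": 1.2, "GENE_C": 3.3, "GENE_D": 9.8}
score = single_sample_score(sample, {"GENE_A", "GENE_D"})
```

Because the score depends only on within-sample ranks, it is invariant to monotone transformations of a single sample's measurements, which is what makes the approach usable on individual patient profiles.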
A study of optimised network flows for prediction of force transmission and crack propagation in bonded granular media
This thesis focuses on the study of bonded granular materials. We mainly analyse discrete element method simulation data for unconfined concrete specimens subjected to uniaxial tension and compression. In these systems, the contacts can support compressive, tensile and shear forces. Thus, under applied loads, a member grain can transmit tensile and/or compressive forces to its neighbours, resulting in a highly heterogeneous contact force network. The objective of this thesis is two-fold. The first objective is to develop algorithms for the identification and characterisation of two classes of force transmission patterns in these systems: (a) force chains and (b) force (energy) bottlenecks. The former comprise subgroups of grains that transmit the majority of the load through the sample, while the latter comprise subgroups of contacts that are prone to force congestion and damage. These two classes are related and coevolve as the loading history proceeds. Here this coevolution is characterised quantitatively to gain new insights into the interdependence between force transmission and failure in bonded grain assemblies. The second objective is to establish the extent to which the ultimate (dominant) crack location can be predicted early in the prefailure regime for disordered and heterogeneous bonded granular media, based on known microstructural features. To achieve this, a new data-driven model is developed within the framework of Network Flow Theory, taking as input data on the contact network and contact strengths. We tested this model on a range of samples undergoing quasibrittle failure under various loading conditions (i.e., uniaxial tension, uniaxial compression) as well as on field-scale data from an open-pit mine. In all cases, the location of the ultimate (primary) macrocrack/failure zone, as well as those of secondary cracks, is predicted early in the prefailure regime.
We uncovered optimised force transmission and damage propagation in the prefailure regime, especially using data from uniaxial tension tests on concrete samples. Tensile force chains emerged along routes that transmit the global transmission capacity of the contact network through the shortest transmission pathways. Macrocracks developed along force/energy bottlenecks. We brought several commonly used optimisation-based fracture criteria into a single framework and showed how heterogeneity and disorder in the contact network affect the prediction.
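The network-flow idea can be illustrated on a toy contact network: treating contact strengths as edge capacities, the maximum flow gives the global transmission capacity and the minimum cut marks the bottleneck contacts where force congestion, and hence cracking, would be predicted. The graph and capacities below are illustrative, not thesis data; a plain Edmonds-Karp implementation is used:

```python
from collections import defaultdict, deque

def max_flow_min_cut(capacity, source, sink):
    """Edmonds-Karp maximum flow; returns (flow value, min-cut edges)."""
    residual = defaultdict(int)
    adjacent = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        adjacent[u].add(v)
        adjacent[v].add(u)          # reverse edge (capacity 0) for the residual graph
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:   # BFS: shortest augmenting path
            u = queue.popleft()
            for v in adjacent[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[e] for e in path)  # bottleneck along the path
        for u, v in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        flow += push
    reachable, queue = {source}, deque([source])
    while queue:                     # nodes still reachable in the residual graph
        u = queue.popleft()
        for v in adjacent[u]:
            if v not in reachable and residual[(u, v)] > 0:
                reachable.add(v)
                queue.append(v)
    cut = [e for e in capacity if e[0] in reachable and e[1] not in reachable]
    return flow, cut

# Toy bonded assembly: load enters at S and exits at T; the weak 'a-m' and
# 'b-m' contacts form the predicted bottleneck (the minimum cut).
contacts = {('S', 'a'): 8, ('S', 'b'): 8, ('a', 'm'): 3, ('b', 'm'): 4, ('m', 'T'): 20}
flow, cut = max_flow_min_cut(contacts, 'S', 'T')
```

By the max-flow min-cut theorem, the cut capacity (here 3 + 4 = 7) equals the transmission capacity, which is why congestion-prone contacts and the failure zone coincide in this picture.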
Seiberg-Witten Theory and Topological Recursion
Kontsevich-Soibelman (2017) reformulated the Eynard-Orantin topological recursion (2007) in terms of Airy structures, which provides some geometric insight into the relationship between the moduli space of curves and topological recursion. In this work, we investigate an analytical approach to this relationship using the Seiberg-Witten family of curves as the main example. In particular, we show that the formula computing the special Kähler prepotential of Hitchin systems from the genus-zero part of topological recursion, obtained by Baraglia-Huang (2017), can be generalised to a more general family of curves embedded inside a foliated symplectic surface, including the Seiberg-Witten family. Consequently, we obtain a similar formula relating the Seiberg-Witten prepotential to the genus-zero part of topological recursion on a Seiberg-Witten curve. Finally, we investigate the connection between Seiberg-Witten theory and Frobenius manifolds, which may enable the current result to be generalised in future to include the higher-genus parts of topological recursion.
Risk Analysis and Probabilistic Decision Making for Censored Failure Data
Operation and maintenance of a fleet always require a high level of readiness, reduced cost, and improved safety. To achieve these goals, it is essential to develop an appropriate maintenance programme for the components in use. A failure analysis involving failure model selection, robust parameter estimation, probabilistic decision making, and assessment of the cost-effectiveness of the decisions is key to selecting a proper maintenance programme. Two significant challenges in failure analysis studies are minimizing the uncertainty associated with model selection and making strategic decisions based on few observed failures. In this thesis, we address some of these problems and evaluate the cost-effectiveness of the resulting selections. We focus on choosing the best model from a model space and on robust estimation of quantiles, leading to the selection of optimal repair and replacement times for units. We first explore the repair and replacement cost of a unit in a system. We design a simulation study to assess the performance of two parameter estimation methods, maximum likelihood estimation (MLE) and median rank regression (MRR), in estimating quantiles of the Weibull distribution. Then, we compare the Weibull, gamma, log-normal, log-logistic, and inverse-Gaussian models in failure analysis. With an example, we show that the Weibull and gamma distributions provide competing fits to the failure data. Next, we demonstrate the use of Bayesian model averaging to account for this model uncertainty. We derive an average model for the failure observations with the respective posterior model probabilities. Then, we illustrate the cost-effectiveness of the selected model by comparing the distributions of the total replacement and repair cost. In the second part of the thesis, we discuss prior information.
Initially, we assume that the parameters of the Weibull distribution are related through a function of the form rho = sigma/mu and re-parameterize the Weibull distribution accordingly. Then we propose a new Jeffreys’ prior for the parameters mu and rho. Finally, we design a simulation study to assess the performance of the new Jeffreys’ prior relative to the MLE.
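As a concrete instance of the quantile-estimation task above, median rank regression for the Weibull distribution can be sketched as follows. This is a hedged illustration using Bernard's approximation for complete, uncensored data; the thesis's simulation study, censoring handling, and MLE comparison are not reproduced here:

```python
import math
import random

def weibull_mrr(failures):
    """Median rank regression: fit ln(-ln(1 - F)) = shape*ln(t) - shape*ln(scale)
    by least squares, with F estimated by Bernard's median ranks."""
    t = sorted(failures)
    n = len(t)
    x = [math.log(ti) for ti in t]
    y = [math.log(-math.log(1 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    mx, my = sum(x) / n, sum(y) / n
    shape = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    scale = math.exp(mx - my / shape)      # from the fitted intercept
    return shape, scale

def weibull_quantile(p, shape, scale):
    """Time by which a fraction p of units is expected to have failed."""
    return scale * (-math.log(1 - p)) ** (1 / shape)

# Simulated complete failure data from a Weibull(shape=2, scale=100) component.
random.seed(1)
data = [random.weibullvariate(100.0, 2.0) for _ in range(200)]
shape_hat, scale_hat = weibull_mrr(data)
b10 = weibull_quantile(0.10, shape_hat, scale_hat)   # B10 replacement time
```

A B10 life of this kind is the sort of quantile that drives replacement-time decisions; the simulation comparison in the thesis asks which estimator recovers such quantiles more robustly when failures are few.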
Mathematical models of calcium signalling in the context of cardiac hypertrophy
Throughout the average human lifespan, our hearts beat over 2 billion times. With each beat, calcium floods the cytoplasm of every heart cell, causing it to contract until calcium re-uptake allows the heart to relax, ready for the next beat. However, calcium is also critical in other cell functions, including growth. In addition to its contractile function, calcium plays a central role in mediating hypertrophic signalling in ventricular cardiomyocytes. How intracellular calcium can encode several different, specific signals at once is not well understood. In heart cells, calcium release from ryanodine receptors (RyRs) triggers contraction. Under hypertrophic stimulation, calcium release from inositol 1,4,5-trisphosphate receptor (IP3R) channels modifies the calcium contraction signal, triggering dephosphorylation and nuclear import of the transcription factor nuclear factor of activated T cells (NFAT), with the resulting gene expression linked to cell growth. Several hypotheses have been proposed for how the modified cytosolic calcium contraction signal transmits the hypertrophic signal to downstream signalling proteins, including changes in amplitude, duration, duty cycle, and signal localisation. We investigate the form of these signals within the cardiac myocyte using mathematical modelling. Using a compartmental heart cell model, we show that the effect of calcium channel interaction on the global calcium signal supports increased calcium duty cycle as a plausible mechanism for IP3-dependent hypertrophic signalling in cardiomyocytes. A corresponding calcium signal must be present within the nucleus to retain NFAT there and thus allow NFAT to alter gene expression, initiating hypertrophic remodelling. Yet the nuclear membrane is permeable to calcium, and this must all occur on a background of rising and falling calcium with each heartbeat. The mechanisms shaping calcium dynamics within the nucleus remain unclear.
We use a spatial model of calcium diffusion into the nucleus to determine the effects of buffers and of cytosolic transient shape on nuclear calcium dynamics. Using experimental data, we estimate the diffusion coefficient and the effects of buffers on nuclear [Ca2+]. Additionally, we explore the effects of altered cytosolic calcium transients and calcium release on nuclear calcium. We find that reproducing experimental measurements of nuclear calcium requires perinuclear Ca2+ release and nonlinear diffusion. Comparisons of 1D and 3D models of calcium in the nucleus suggest that spatial variation in calcium concentration within the nucleus will not have a large effect on calcium-mediated gene regulation. This work brings us closer to understanding the signalling pathway that leads to pathological hypertrophic cardiac remodelling.
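A minimal caricature of the nuclear filtering effect can be written as a two-compartment model: an idealised cytosolic transient drives calcium across the nuclear envelope, with buffering represented crudely by a constant free fraction. All parameter values below are illustrative assumptions, not the fitted values from the thesis:

```python
import math

def cytosolic_transient(t, base=0.1, amp=1.0, tau_rise=0.02, tau_decay=0.2):
    """Idealised cytosolic Ca2+ transient (uM) following a beat at t = 0 (s)."""
    return base + amp * (math.exp(-t / tau_decay) - math.exp(-t / tau_rise))

def simulate(k_perm=20.0, beta=0.05, period=1.0, beats=5, dt=1e-4):
    """Forward-Euler integration of nuclear Ca2+ driven by flux across the
    nuclear envelope; beta is the free fraction of entering Ca2+ (a
    rapid-buffering shortcut). Returns (peak cytosolic, peak nuclear) Ca2+."""
    c_nuc = 0.1
    peak_cyt = peak_nuc = 0.0
    for step in range(int(beats * period / dt)):
        t = step * dt
        c_cyt = cytosolic_transient(t % period)        # one transient per beat
        c_nuc += dt * beta * k_perm * (c_cyt - c_nuc)  # envelope flux, buffered
        peak_cyt = max(peak_cyt, c_cyt)
        peak_nuc = max(peak_nuc, c_nuc)
    return peak_cyt, peak_nuc

peak_cyt, peak_nuc = simulate()
```

Because the buffered envelope flux is slow relative to the beat-to-beat transient, the nuclear peak is attenuated and delayed, one reason sustained changes such as duty cycle are a more plausible nuclear signal than beat amplitude alone.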
Understanding the regulation of epidermal tissue thickness by cellular and subcellular processes using multiscale modelling
The epidermis is the outermost layer of the skin, providing a protective barrier for our bodies. Two important aspects of the barrier function of the epidermis are maintenance of its barrier layer and constant cell turnover. The main barrier layer in the epidermis is the outermost layer, called the stratum corneum. This layer blocks both the entry of antigens and the loss of internal water and solutes. If antigens do enter the system, cell turnover has been hypothesised to propel them out of the system by providing a constant upwards velocity of cells, which carry the toxins with them. The majority of severe diseases of the epidermis relate to a reduction in the thickness of the stratum corneum. Decreased thickness reduces the barrier function of the layer, causing discomfort and inflammation. Owing to its importance to barrier function, the maintenance of stratum corneum thickness, and consequently overall tissue thickness, is the focus of this thesis. To maintain both stratum corneum thickness and overall tissue thickness, the system must balance cell proliferation and cell loss. Cell loss in the epidermis occurs when dead cells at the top of the tissue are lost to the environment through a process called desquamation. Cell proliferation occurs in the base, or basal, layer. As the basal cells proliferate, cells above them are pushed upwards through the tissue, causing constant upwards movement in the tissue. Not only does this contribute directly to the barrier function through cell turnover as discussed above, but the velocity of the cells is also likely to be key in regulating tissue thickness. Assuming cell loss occurs at a fairly constant rate, the combination of the velocity and the loss rate determines tissue thickness. To investigate these processes, we develop a three-dimensional, discrete, multiscale, multicellular model focussing on the maintenance of cell proliferation and desquamation.
Using this model, we are able to investigate how subcellular and cellular level processes interact to maintain a homeostatic tissue. Our model is able to reproduce a system that self-regulates its thickness. The first aspect of this regulation is maintaining a constant rate of proliferation in the epidermis, and consequently a constant upwards velocity of cells. The second aspect is a maintained rate of desquamation. The model shows that hypothesised biological models for the degradation of cell-cell adhesion from the literature are able to provide a consistent rate of cell loss which balances proliferation. An investigation into a disorder which disrupts this desquamation model shows reduced tissue thickness, consequently diminishing the protective role of the tissue. In developing the multiscale model we have begun to delve deeper into the relationship between subcellular and cellular processes and epidermal tissue structure. The model is developed with scope for the integration of further subcellular processes. This provides it with the potential for further experiments into the causes and effects of behaviours and diseases of the epidermis, with much higher time and cost efficiency than other experimental methods.
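The proliferation-desquamation balance described above can be caricatured in a few lines: cells enter a one-dimensional column at the base at a fixed division rate and are shed from the top once their cell-cell adhesion, decaying with age, falls below a threshold. All parameter values are illustrative assumptions, far simpler than the calibrated multiscale model:

```python
import math

def simulate_thickness(divisions_per_day=0.5, decay=0.1, threshold=0.2,
                       days=400, dt=0.1):
    """1D cell column: a new cell joins the base every 1/divisions_per_day
    days; the top cell desquamates once its adhesion exp(-decay * age) drops
    below threshold. Returns (simulated thickness in cells, predicted
    steady-state thickness = division rate * transit time)."""
    shed_age = math.log(1 / threshold) / decay   # age at which a cell sheds
    ages = []                                    # index 0 = base, last = top
    clock = 0.0
    for _ in range(int(days / dt)):
        ages = [a + dt for a in ages]
        clock += dt
        if clock >= 1 / divisions_per_day:       # proliferation at the base
            ages.insert(0, 0.0)
            clock = 0.0
        while ages and ages[-1] > shed_age:      # desquamation at the top
            ages.pop()
    return len(ages), divisions_per_day * shed_age

n_cells, predicted = simulate_thickness()
```

Thickness settles at proliferation rate times transit time; in this caricature a disorder that accelerates adhesion degradation (larger decay) shortens the transit time and thins the tissue, echoing the disrupted-desquamation result above.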
Biorthogonal Polynomial Sequences and the Asymmetric Simple Exclusion Process
The diffusion algebra equations of the stationary state of the three-parameter Asymmetric Simple Exclusion Process are represented as a linear functional acting on a tensor algebra. From the linear functional, a pair of sequences (P and Q) of monic polynomials is constructed which are bi-orthogonal; that is, they are orthogonal with respect to each other and not necessarily to themselves. The existence and uniqueness of the pair of sequences arise from the determinant of the bi-moment matrix, whose elements satisfy a pair of q-recurrence relations. The determinant is evaluated using an LDU decomposition. If the action of the linear functional is represented as an inner product, then the action of the polynomials Q on a boundary vector V generates a basis whose orthogonal dual vectors are given by the action of P on the dual boundary vector W. This basis gives the representation of the algebra associated with the Al-Salam-Chihara polynomials obtained by Sasamoto. Several theorems associated with the three-parameter asymmetric simple exclusion process are proven combinatorially. These theorems involve the linear functional which, in the three-parameter case, is a substitution morphism on a q-Weyl algebra. The two polynomial sequences, P and Q, are represented in terms of q-binomial lattice paths. A combinatorial representation for the value of the linear functional defining the matrix elements of the bi-moment matrix is established in terms of the value of a q-rook polynomial and utilised to provide combinatorial proofs of results pertaining to the linear functional. Combinatorial proofs are provided for theorems in terms of the p,q-binomial coefficients, which are closely related to the combinatorics of the three-parameter ASEP. The results for the three-parameter diffusion algebra of the Asymmetric Simple Exclusion Process are extended to five parameters. A pair of basis changes is derived from the LDU decomposition of the bi-moment matrix.
In order to derive the LDU decomposition, a recurrence relation satisfied by the lower-triangular matrix elements is conjectured. Associated with this pair of bases are three sequences of orthogonal polynomials. The first two sequences generate the new basis vectors (the boundary basis) through their action on the boundary vectors (written in the standard basis), whilst the third sequence is essentially the Askey-Wilson polynomials. All these results are ultimately related to the LDU decomposition of a matrix.
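As a concrete toy instance of the LDU machinery underlying these results, the sketch below decomposes a matrix as A = LDU with unit-triangular L and U (no pivoting; nonzero leading principal minors assumed). It is applied to the symmetric Pascal matrix of binomial coefficients, a classical q = 1 analogue of a bi-moment matrix of q-binomials, whose D collapses to the identity:

```python
from math import comb

def ldu(A):
    """A = L D U with L unit lower triangular, D diagonal, U unit upper
    triangular. No pivoting: assumes nonzero leading principal minors."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[float(i == j) for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for k in range(n):
        # Schur-complement style elimination, column by column
        D[k] = A[k][k] - sum(L[k][m] * D[m] * U[m][k] for m in range(k))
        for i in range(k + 1, n):
            L[i][k] = (A[i][k] - sum(L[i][m] * D[m] * U[m][k] for m in range(k))) / D[k]
            U[k][i] = (A[k][i] - sum(L[k][m] * D[m] * U[m][i] for m in range(k))) / D[k]
    return L, D, U

# Symmetric Pascal matrix P[i][j] = C(i + j, i): its LDU factors are the
# binomial matrices L[i][j] = C(i, j), U = L^T, with D the identity.
P = [[comb(i + j, i) for j in range(4)] for i in range(4)]
L, D, U = ldu(P)
```

In the thesis, the bi-moment matrices play the role of P here, with q-binomial (and p,q-binomial) weights replacing the plain binomial coefficients.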
Exploring the statistical aspects of expert elicited experiments
In this study, we explore the statistical aspects of some known methods for analysing experts’ elicited data, to identify potential improvements in the accuracy of their outcomes. Correlation structures induced in probability predictions by the characteristics of experimental designs are typically ignored when computing experts’ Brier scores. In the second chapter of this thesis, we show that the accuracy of the standard error estimates of experts’ Brier scores can be improved by incorporating the within-question correlations of probability predictions. Missing probability predictions can impair the assessment of prediction accuracy when experts are compared on different sets of events (Merkle et al., 2016; Hanea et al., 2018). In the third chapter, it is shown that multiple imputation using a mixed-effects model with question effects as random effects can effectively estimate missing predictions, enhancing the comparability of experts’ Brier scores. Experts’ calibration in eliciting credible intervals for unknown quantities is commonly tested using hit rates, the observed proportions of elicited intervals that contain the realised values of the given quantities (McBride, Fidler, and Burgman, 2012); this approach has low power to correctly identify well-calibrated experts and, more importantly, the power tends to decrease as the number of elicited intervals increases. As shown in the fourth chapter, the equivalence test of a single binomial proportion can be used to overcome these problems. The way experts’ calibration is assessed in Cooke’s classical model (Cooke, 1991) to derive experts’ weights can allocate higher weights to some poorly calibrated experts. In the fifth chapter, we show that the multinomial equivalence test can be used to overcome this problem.
Experts’ weights derived from experiments, used to combine experts’ elicited subjective probability distributions into aggregated probability distributions of unknown quantities (O’Hagan, 2019), are random variables subject to uncertainty. In the sixth chapter, we derive shrinkage experts’ weights with reduced mean squared errors to enhance the precision of the resulting aggregated distributions.
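The fourth-chapter idea, an equivalence test for a single binomial proportion, can be sketched with two one-sided tests (TOST) under a normal approximation. This is an illustrative version only; the margin, levels, and variance estimate below are assumptions of the sketch, not the thesis's exact formulation:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_equivalence_p(hits, n, p0, margin):
    """TOST for H0: |p - p0| >= margin vs H1: |p - p0| < margin.
    A small p-value supports equivalence, i.e. a well-calibrated expert.
    Normal approximation; requires 0 < hits < n."""
    phat = hits / n
    se = math.sqrt(phat * (1.0 - phat) / n)
    p_lower = 1.0 - norm_cdf((phat - (p0 - margin)) / se)   # test p > p0 - margin
    p_upper = 1.0 - norm_cdf(((p0 + margin) - phat) / se)   # test p < p0 + margin
    return max(p_lower, p_upper)

# 80% credible intervals: an expert whose intervals captured 79 of 100
# realised values is declared calibrated at a +/- 0.10 margin, while an
# expert with 50 hits is not.
p_good = binomial_equivalence_p(79, 100, 0.80, 0.10)
p_bad = binomial_equivalence_p(50, 100, 0.80, 0.10)
```

Unlike a significance test of H0: p = p0, whose ability to certify calibration degrades as intervals accumulate, here more data makes it easier, not harder, to certify a genuinely calibrated expert.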
Nonparametric estimation for streaming data
Streaming data are a type of high-frequency and nonstationary time series data. The collection of streaming data is sequential and potentially never-ending. Examples of streaming data, including data from sensor networks, mobile devices and the Internet, are prevalent in our daily lives. An estimator for streaming data needs to be computationally efficient so that it is relatively easy to update the estimator using newly arrived data. In addition, the estimator has to be adaptive to the nonstationarity of data. These constraints make streaming data analysis more challenging than analysing the conventional non-streaming data sets. Although streaming data analysis has been discussed in the machine learning community for more than two decades, it has received limited attention from statistical researchers. Estimation methods that are both computationally efficient and theoretically justified are still lacking. In this thesis, we propose nonparametric density and regression estimation methods for streaming data, where the smoothing parameters are chosen in a computationally efficient and fully data-driven way. These methods extend some classical kernel smoothing techniques, such as the kernel density estimator and the Nadaraya-Watson regression estimator, to address the theoretical and computational challenges arising from streaming data analysis. Asymptotic analyses provide these methods with theoretical justification. Numerical studies have shown the superiority of our methods over conventional ones. Through some real-data examples, we show that these methods are potentially useful in modelling real-world problems. Finally, we discuss some directions for future research, including extending these methods to model higher-dimensional streaming data and to streaming data classification.
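The flavour of such estimators can be conveyed with the classical recursive kernel density estimator, which folds each new observation into a grid of density values and then discards it, so memory and per-update cost are constant in the stream length. This sketch fixes the bandwidth by hand; the fully data-driven, adaptive bandwidth selection is precisely what the thesis's methods add and is not shown here:

```python
import math
import random

class StreamingKDE:
    """Recursive KDE on a fixed grid: f_n = (1 - 1/n) f_{n-1} + (1/n) K_h(. - x_n)."""
    def __init__(self, grid, bandwidth):
        self.grid = grid
        self.h = bandwidth
        self.n = 0
        self.f = [0.0] * len(grid)

    def update(self, x):
        """Fold one new observation into the estimate; x is then discarded."""
        self.n += 1
        w = 1.0 / self.n
        norm = self.h * math.sqrt(2.0 * math.pi)
        for i, g in enumerate(self.grid):
            kernel = math.exp(-0.5 * ((g - x) / self.h) ** 2) / norm
            self.f[i] = (1.0 - w) * self.f[i] + w * kernel

# Stream 2000 standard-normal observations through the estimator.
random.seed(0)
grid = [i / 10.0 for i in range(-50, 51)]
kde = StreamingKDE(grid, bandwidth=0.3)
for _ in range(2000):
    kde.update(random.gauss(0.0, 1.0))
```

Each update costs O(grid size) regardless of how many observations have arrived, which is the computational property streaming estimators must preserve; handling nonstationarity would further require down-weighting old data rather than the plain 1/n averaging used here.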
Stress testing mixed integer programming solvers through new test instance generation methods
Optimisation algorithms require careful tuning and analysis to perform well in practice. Their performance is strongly affected by algorithm parameter choices, software, and hardware, and must be analysed empirically. To conduct such analysis, researchers and developers require high-quality libraries of test instances. Improving the diversity of these test sets is essential to driving the development of well-tested algorithms. This thesis is focused on producing synthetic test sets for Mixed Integer Programming (MIP) solvers. Synthetic data should be carefully designed to be unbiased, diverse with respect to measurable instance features, tunable so as to replicate real-world problems, and challenging for the vast array of algorithms available. This thesis outlines a framework, methods and algorithms developed to ensure these requirements can be met with synthetically generated data for a given problem. Over many years of development, MIP solvers have become increasingly complex, and their overall performance depends on the interactions of many different components. To cope with this complexity, we propose several extensions over existing approaches to generating optimisation test cases. First, we develop alternative encodings for problem instances which restrict consideration to relevant instances. This approach provides more control over instance features and reduces the computational effort required when we must resort to search-based generation. Second, we consider more detailed performance metrics for MIP solvers in order to produce test cases which are not only challenging but from which useful insights can be gained. This work makes several key contributions:
1. Performance metrics are identified which are relevant to component algorithms in MIP solvers. This helps to define a more comprehensive performance metric space which looks beyond benchmarking statistics such as the CPU time required to solve a problem. Using these more detailed performance metrics, we aim to produce explainable and insightful predictions of algorithm performance in terms of instance features.
2. A framework is developed for encoding problem instances to support the design of new instance generators. The concepts of completeness and correctness defined in this framework guide the design process and ensure all problem instances of potential interest are captured in the scheme. Instance encodings can be generalised to develop search algorithms in problem space with the same guarantees as the generator.
3. Using this framework, new generators are defined for LP and MIP instances which control the feasibility and boundedness of the LP relaxation and the integer feasibility of the resulting MIP. Key features of the LP relaxation solution, which are directly controlled by the generator, are shown to affect problem difficulty in our analysis of the results. The encodings used to control these properties are extended into problem-space search operators to generate further instances which discriminate between solver configurations.
This work represents the early stages of an iterative methodology required to generate diverse test sets which continue to challenge the state of the art. The framework, algorithms and code developed in this thesis are intended to support continuing development in this area.
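The idea of controlling feasibility and boundedness by construction can be illustrated with a toy LP generator. This is a far simpler device than the thesis's generators: feasibility comes from planting a solution, boundedness from a nonnegative objective, and all sampling distributions below are arbitrary assumptions:

```python
import random

def generate_lp(m, n, seed=0):
    """Random LP  min c.x  s.t.  A x >= b,  x >= 0  that is feasible and
    bounded by construction: b is built from a planted point x*, and
    c >= 0 together with x >= 0 bounds the objective below by zero."""
    rng = random.Random(seed)
    A = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(m)]
    x_star = [rng.randint(0, 10) for _ in range(n)]
    # b_i = (A x*)_i - slack, so x* satisfies every constraint with slack
    b = [sum(A[i][j] * x_star[j] for j in range(n)) - rng.randint(0, 5)
         for i in range(m)]
    c = [rng.randint(0, 9) for _ in range(n)]
    return A, b, c, x_star

A, b, c, x_star = generate_lp(20, 10)
feasible = all(sum(A[i][j] * x_star[j] for j in range(10)) >= b[i]
               for i in range(20))
```

Because the planted point is integer, the MIP restriction of the instance is feasible as well; the thesis's encodings go further, steering features of the LP relaxation solution itself rather than just its existence.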
Intelligent Management of Elective Surgery Patient Flow
Rapidly growing demand and soaring costs for healthcare services in Australia and across the world are jeopardising the sustainability of government-funded healthcare systems. We need to be innovative and more efficient in delivering healthcare services in order to keep the system sustainable. In this thesis, we utilise a number of scientific tools to improve patient flow in the surgical suite of a hospital and subsequently develop a structured approach to intelligent patient flow management. First, we analyse and understand the patient flow process in a surgical suite. Then we obtain data from the partner hospital and extract valuable information from a large database. Next, we use machine learning techniques, such as classification and regression tree analysis, random forests, and k-nearest neighbour regression, to classify patients into lower-variability resource user groups and fit discrete phase-type distributions to the clustered length-of-stay data. We use length-of-stay scenarios sampled from the fitted distributions in our sequential stochastic mixed-integer programming model for tactical master surgery scheduling. Our mixed-integer programming model has the particularity that the scenarios are utilised in a chronologically sequential manner, rather than in parallel. Moreover, we exploit the randomness in the sample path to reduce the number of scenarios over which the process must be optimised, which helps us obtain high-quality schedules while keeping the problem algorithmically tractable. Finally, we model the patient flow process in a healthcare facility as a stochastic process and develop a model to predict the probability of the facility exceeding capacity on the next day, as a function of the number of inpatients, the patients scheduled for the next day, their resource user groups, and their elapsed lengths of stay.
We evaluate the model's performance using the receiver operating characteristic (ROC) curve and illustrate the computation of the optimal threshold probability using a cost-benefit analysis that helps hospital management make decisions.
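The threshold-selection step can be sketched directly: scan candidate thresholds on the predicted over-capacity probabilities and pick the one minimising total misclassification cost, with a false negative (an unanticipated over-capacity day) typically costed more heavily than a false alarm. The scores, labels, and costs below are toy assumptions:

```python
def optimal_threshold(scores, labels, cost_fp, cost_fn):
    """Return (threshold, cost) minimising cost_fp*FP + cost_fn*FN when we
    flag 'over capacity' whenever the predicted probability >= threshold."""
    best_t, best_cost = 0.0, float("inf")
    for t in sorted(set(scores)) + [1.01]:    # 1.01 = never flag
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy predictions for four days (label 1 = the facility exceeded capacity).
scores = [0.10, 0.40, 0.35, 0.80]
labels = [0, 0, 1, 1]
t, total_cost = optimal_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0)
```

Each candidate threshold corresponds to a point on the ROC curve; the ratio of the two costs determines which operating point along the curve the hospital should choose.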
Copula-based spatio-temporal modelling for count data
Modelling of spatio-temporal count data has received considerable attention in recent statistical research. However, the presence of massive correlation between locations, time points and variables imposes a great computational challenge. In the existing literature, latent models under the Bayesian framework predominate. Despite numerous theoretical and practical advantages, likelihood analysis of spatio-temporal models for count data is less widespread, owing to the difficulty of identifying a general class of multivariate distributions for discrete responses. In this thesis, we propose a Gaussian copula regression model (copSTM) for the analysis of multivariate spatio-temporal data on a lattice. Temporal effects are modelled through the conditional marginal expectations of the response variables using an observation-driven time series model, while spatial and cross-variable correlations are captured in a block dependence structure, allowing for both positive and negative correlations. The proposed copSTM model is flexible and generalises to many situations. We provide pairwise composite likelihood inference tools. Numerical examples suggest that the proposed composite likelihood estimator produces satisfactory estimation performance. While variable selection for generalized linear models is a well-developed topic, model subsetting in applications of Gaussian copula models remains a relatively open research area. The main reason is the computational burden, which is already quite heavy for simply fitting the model; it is therefore not computationally affordable to evaluate many candidate sub-models. This makes penalized likelihood approaches extremely inefficient, because they need to search through different levels of penalty strength; moreover, in our numerical experience, optimisation of penalized composite likelihoods with many popular penalty terms (e.g., LASSO and SCAD) often fails to converge in copula models.
Thus, we propose a criterion-based selection approach that borrows strength from the Gibbs sampling technique. The methodology is guaranteed to converge to the model with the lowest criterion value, without searching through all possible models exhaustively. Finally, we present an R package implementing the estimation and selection of the copSTM model in C++. We show examples comparing our package with several available R packages (on special cases of the copSTM), confirming the correctness and efficiency of the package functions. The copSTM package provides a competitive toolkit for the analysis of spatio-temporal count data on a lattice, in terms of both model flexibility and computational efficiency.
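The selection idea can be caricatured as follows: sweep the variable-inclusion indicators one at a time, resampling each from a full conditional proportional to exp(-criterion / temperature), and keep the best model visited. The toy below searches a 6-variable space against a synthetic criterion; it illustrates the mechanism only, not the copSTM implementation:

```python
import math
import random

def gibbs_model_search(criterion, p, iters=500, temp=1.0, seed=0):
    """Criterion-guided stochastic search over inclusion indicators.
    Visits low-criterion models with high probability without enumerating
    all 2^p subsets; returns the best (lowest-criterion) model encountered."""
    rng = random.Random(seed)
    model = [False] * p
    best_cost, best_model = criterion(model), list(model)
    for _ in range(iters):
        j = rng.randrange(p)
        costs = []
        for value in (False, True):           # criterion with variable j out / in
            model[j] = value
            costs.append(criterion(model))
        # full conditional of indicator j given the rest of the model
        prob_in = 1.0 / (1.0 + math.exp((costs[1] - costs[0]) / temp))
        model[j] = rng.random() < prob_in
        current = costs[model[j]]
        if current < best_cost:
            best_cost, best_model = current, list(model)
    return best_cost, best_model

# Synthetic criterion: 30 penalty units per mismatch with the 'true' support {0, 2}.
true_support = {0, 2}
criterion = lambda m: 30.0 * sum(m[i] != (i in true_support) for i in range(6))
cost, model = gibbs_model_search(criterion, p=6)
```

Each move needs only two criterion evaluations, so the search cost scales with the number of sweeps rather than with the 2^p model space, the property that matters when a single composite likelihood fit is already expensive.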