Computing and Information Systems - Theses

Permanent URI for this collection

Search Results

Now showing 1 - 2 of 2
  • Item
    Thumbnail Image
    Computational substructure querying and topology prediction of the beta-sheet
    Ho, Hui Kian ( 2014)
    Studying the three-dimensional structure of proteins is essential to understanding their function, and ultimately, their dysfunction that causes disease. The limitations of experimental protein structure determination presents a need for computational approaches to protein structure prediction and analysis. The beta-sheet is a commonly occurring protein substructure important to many biological processes and are often implicated in neurological disorders. Targeted experimental studies of beta-sheets are especially difficult due to their general insolubility in isolation. This thesis presents a series of contributions to the computational analysis and prediction of beta-sheet structure, which are useful for knowledge discovery and for directing more detailed experimental work. Approaches for predicting the simplest type of beta-sheet, the beta-hairpin, are first described. Improvements over existing methods are obtained by using the most important beta-hairpin features identified through systematic feature selection. An examination of the most important features provides a physiochemical basis of their usefulness in beta-hairpin prediction. New methods for the more general problem of beta-sheet topology prediction are described. Unlike recent methods, ours are independent of multiple sequence alignment (MSAs) and therefore do not rely on the coverage of reference sequence databases or sequence homology. Our evaluations showed that our methods do not exhibit the same reductions in performance as a state-of-the-art method for sequences with low quality MSAs. A new method for the indexing and querying of beta-sheet substructures, called BetaSearch, is described. BetaSearch exploits the inherent planar constraints of beta-sheet structure to achieve significant speedups over existing graph indexing and conventional 3D structure search methods. Case studies are presented that demonstrate the potential of this method for the discovery of biologically interesting beta-sheet substructures. Finally, a purpose-built open source toolkit for generating 2D protein maps is described, which is useful for the coarse-grained analysis and visualisation of 3D protein structures. It can also be used in existing knowledge discovery pipelines for automated structural analysis and prediction tasks, as a standalone application, or imported into existing experimental applications.
  • Item
    Thumbnail Image
    Scalable approaches for analysing high density single nucleotide polymorphism array data
    Wong, Gerard Kum Peng ( 2012)
    Prior to making inferences from the raw data produced by these microarrays, several challenges need to be addressed. First, it is important to limit the impact of noise on microarray measurements while maintaining data integrity. An unexplored aspect of noise is the extent of probeset sequence identity in SNP microarrays. Second, microarray-based datasets often have at least two orders of magnitude more probesets than the number of samples they describe. This poses a challenge for traditional statistical tests when used in this context. Third, the number of features in each dataset is large even when sample sizes are small, thus computationally efficient approaches are required to analyse these datasets. Finally, with improving resolution of SNP arrays, there is a need to exploit this improvement in resolution to identify finer-scaled mutations in the human genome. Most existing approaches deal with these challenges at an individual sample level and do not look for consensus change across the population to identify sites of DNA mutation. Other approaches artificially smooth or segment the data to obtain uniform segments of copy number change, and lose possible fine-scaled copy number changes in the process. Others are based on computationally expensive approaches that do not scale well to array resolution and sample size. Our first contribution is a comprehensive survey of the sequence identity of all probesets for all variants of the Affymetrix Genome-Wide Human SNP array. This survey assesses the target uniqueness of every probeset and provides a basis for the development of a set of gold standard labels of copy number change between genders. The derived sets of gold standard labels are a benchmark for assessing the performance of various algorithms in detecting recurrent copy number change. This benchmark is utilised in the evaluation of our second and third contribution. Our second contribution is a statistical approach called Detecting Recurrent Copy Number Changes Using Rank Order Statistics (DRECS), which is designed to identify regions of consensus copy number change across multiple samples in SNP array datasets. Through the use of rank-based statistics, DRECS efficiently draws on the statistical power of using multiple samples to identify fine-scaled copy number changes down to the width of a single probe in a computationally efficient way. Our third contribution is called Sum of Ranks Exact Test (SoRET), a non-parametric extension of DRECS. SoRET addresses SNP datasets with small sample sizes and makes no assumptions about the distribution from which the data was sampled. Its performance in terms of Type I and Type II errors is superior to competitive parametric and non-parametric statistical tests for small sample sizes. Our fourth contribution is a feature set reduction approach called FSR. FSR enables existing filter-based feature selection approaches to handle large dimensional microarray-type datasets by pruning irrelevant and redundant features. A novel scoring measure is developed to assess the strength of each feature in terms of sample class discrimination. FSR uses measures of entropy to efficiently gauge the contribution of higher order feature patterns to avoid combinatorial explosions in assessing the utility of features. In our tested datasets, classifiers trained on features selected from FSR-reduced feature sets have shown notably better predictive accuracy than classifiers trained on features selected from complete feature sets.