Computing and Information Systems - Theses

    Detecting gene-cancer associations by analysing gene-expression microarrays
    Shi, Fan (2011)
    Modern bioinformatics studies have shown that many physiological and behavioral characteristics of organisms are influenced by the information encoded in genes. Cancer genomics studies the genetic mechanisms underlying the formation and progression of cancer in order to improve its diagnosis and prognosis. An important task in cancer genomics is to detect gene-cancer associations, which is the biological focus of this thesis. DNA microarrays are a high-throughput technique for assaying the expression levels of thousands of genes simultaneously. By comparing gene expression levels between different cancer types or subtypes, we may discover potential gene-cancer associations. However, the large scale of gene expression microarrays requires effective and efficient computational approaches for their analysis. Thus, from a computational perspective, this thesis focuses on developing approaches to detect gene-cancer associations based on gene expression microarrays. We summarise three key problems in the analysis of gene expression microarrays, together with the corresponding traditional methods. First, statistical tests can be used to identify differentially expressed genes, which show significantly different expression patterns between different cancer types or subtypes. Second, unsupervised clustering methods can be used to identify co-expressed genes, which are groups of differentially expressed genes that show similar expression levels in the same cancer types. Third, classification methods can be used to predict the types or subtypes of cancer based on differentially expressed genes. However, several challenges, such as the high dimensionality of microarrays and the small sample sizes in cancer studies, may lead to low accuracy and poor efficiency when the analysis relies on these traditional methods. In this thesis, we have proposed three computational approaches to address the above problems.
First, we have developed a meta-analysis method, called Incomplete Gene Meta-analysis (IGM), to identify differentially expressed genes by integrating multiple studies. The IGM method is able to integrate datasets from different microarray platforms and to incorporate genes that are not measured in all datasets, which we refer to as incomplete genes. In our evaluation, we verify the importance of including incomplete genes, and the experimental results demonstrate that, by imputing the statistical significance of incomplete genes, IGM identifies more significant genes than traditional methods. Second, we have proposed an unsupervised Bi-ordering Analysis (BOA) method to detect local patterns in microarrays, called biclusters, in which a subset of genes is co-expressed under a subset of samples. The BOA method uses an iterative process to identify gene groups that are consistently over- or under-expressed in specific samples. This approach addresses several challenges in detecting biclusters, including the identification of biologically meaningful patterns, the efficiency of biclustering algorithms and the stability of biclusters. Our statistical assessments demonstrate both the statistical and biological significance of the biclusters produced by our method. Third, we have proposed a method for making multiple predictions with an associated confidence level to classify cancers of unknown primary origin (CUP). This classification method is able to identify a set of the most likely cancer types and to assign a confidence level to the predictions for CUP samples. Our method for making multiple predictions takes into account the biological similarity between different cancer types at the gene expression level, and is thus more suitable for classifying multi-class cancer samples than making a single class prediction. Our evaluation verifies the importance of making multiple predictions and validates our method.
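
The idea of returning several likely cancer types together with a confidence level can be illustrated with a small sketch. The abstract does not say how the thesis builds its prediction sets or computes confidence, so the following is not the author's method. It is a generic cumulative-probability rule, and it assumes some classifier has already produced a probability for each cancer type. The function name `prediction_set`, the `confidence` parameter and the example probabilities are all hypothetical.

```python
def prediction_set(class_probs, confidence=0.9):
    """Return the top-ranked classes whose summed probability reaches `confidence`.

    Illustrative rule only (an assumption, not the thesis's procedure):
    rank classes by probability and keep adding them until the running
    total meets the requested confidence level.
    """
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must be in (0, 1]")

    # Rank cancer types from most to least probable.
    ranked = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)

    chosen = []
    cumulative = 0.0
    for label, prob in ranked:
        chosen.append(label)
        cumulative += prob
        if cumulative >= confidence:
            break
    return chosen, cumulative


# Made-up class probabilities for a single sample of unknown primary origin.
probs = {"breast": 0.55, "ovarian": 0.30, "lung": 0.10, "colon": 0.05}
labels, covered = prediction_set(probs, confidence=0.8)
print(labels, round(covered, 2))
```

With these made-up numbers, a confidence level of 0.8 yields two candidate types rather than one. This mirrors the abstract's point that a single class prediction can be too restrictive when cancer types look similar at the gene expression level.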