Sir Peter MacCallum Department of Oncology - Theses

    Automated discovery of interacting genomic events that impact cancer survival by using data mining and machine learning techniques
    Lupat, Richard (2020)
    Rapid advances in genomic technologies have driven down the cost of sequencing significantly. This efficiency has enabled large-scale cancer genomic studies to be conducted, generating a vast amount of data across different levels of omics variables. However, extracting new knowledge and information from this enormous volume of data presents unique challenges. These analyses often require the application of specialised techniques for data mining, integration and interpretation to provide valuable insights. With the rise of machine learning adoption in recent decades, many advanced computational algorithms based on artificial intelligence techniques have also been proposed to analyse these genomic data. Although some of these applications have led to clinically relevant conclusions, many others still rely on incomplete prior knowledge or are limited to a selected number of features. These limitations raise the general question of how broadly machine learning can be applied in the field of cancer genomics. This research addresses this question by assessing the application of machine learning techniques in the context of breast cancer genomics data. This assessment includes a comprehensive evaluation of computational methods for predicting cancer driver genes and the development of a novel deep learning approach for identifying breast cancer subtypes. The evaluation of driver gene prediction algorithms suggests that the best method for a given dataset depends primarily on the objectives of the study and the characteristics of the dataset. All of the evaluated approaches could identify well-studied genes, but not all of them performed as well on smaller datasets, on subtype-specific cohorts, or in discovering novel genes.
To examine the benefit of a more complex machine learning model, this thesis also presents a novel deep learning approach that integrates multi-omics data for predicting various breast cancer biomarkers and molecular subtypes. This method combines a semi-supervised autoencoder for dimensionality reduction with a supervised multitask learning setup for the classifications. Taking gene expression, somatic point mutation and copy number data as input, the algorithm predicts the ER status, HER2 status and molecular subtype of breast cancer samples. Further survival analysis of the outputs from this deep learning approach indicates that the predicted subtypes show a stronger correlation with patient prognosis than the original PAM50 labels. While the outputs from machine learning algorithms still require further validation, the adoption of these complex computational methods in cancer genomics will become increasingly common. Collectively, the results from this thesis suggest that machine learning analysis of 'omics data holds great potential for automating the discovery of clinically relevant molecular features.
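
The general shape of an autoencoder combined with multitask classification heads, as described in the abstract, can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the layer sizes, the random data, the single-layer encoder and decoder, the five-class subtype head (chosen to match the five PAM50 classes) and the unweighted sum of losses are all assumptions made for illustration, and only a forward pass is shown, with no training.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Return randomly initialised (weights, bias) for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

# Hypothetical sizes: 8 samples, three omics blocks, a 4-dimensional latent code.
n_samples, n_expr, n_mut, n_cnv, n_latent = 8, 20, 15, 10, 4

# Synthetic stand-ins for the three input types named in the abstract.
expression = rng.normal(size=(n_samples, n_expr))
mutation = rng.integers(0, 2, size=(n_samples, n_mut)).astype(float)
copy_number = rng.normal(size=(n_samples, n_cnv))
x = np.concatenate([expression, mutation, copy_number], axis=1)

# One classification head per task; class counts are assumptions.
tasks = {"er_status": 2, "her2_status": 2, "subtype": 5}
labels = {name: rng.integers(0, k, size=n_samples) for name, k in tasks.items()}

encoder = dense(x.shape[1], n_latent)
decoder = dense(n_latent, x.shape[1])
heads = {name: dense(n_latent, k) for name, k in tasks.items()}

def forward(inputs):
    """Encode to a latent code, reconstruct the input, and classify each task."""
    z = np.tanh(inputs @ encoder[0] + encoder[1])
    recon = z @ decoder[0] + decoder[1]
    probs = {name: softmax(z @ w + b) for name, (w, b) in heads.items()}
    return z, recon, probs

z, recon, probs = forward(x)

# Assumed objective: reconstruction error plus the sum of per-task losses.
recon_loss = float(np.mean((recon - x) ** 2))
task_loss = sum(cross_entropy(probs[name], labels[name]) for name in tasks)
total_loss = recon_loss + task_loss
print(f"reconstruction={recon_loss:.3f} tasks={task_loss:.3f} total={total_loss:.3f}")
```

Sharing one latent code across all three heads is what makes the setup multitask: each classifier reads the same compressed representation, so in a trained model the reconstruction term and every label would shape the encoder together.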