dc.contributor.author: Herath Mudiyanselage, Damayanthi Kumari Herath
dc.date.accessioned: 2019-12-02T17:15:21Z
dc.date.available: 2019-12-02T17:15:21Z
dc.date.issued: 2019
dc.identifier.uri: http://hdl.handle.net/11343/233379
dc.description: © 2019 Damayanthi Kumari Herath Herath Mudiyanselage
dc.description.abstract: Metagenomics, which utilises high-throughput DNA sequencing, is widely applied to study bacteria and viruses and their effects on their host environments. Metagenomics involves collective sequencing of the genetic material of the species in an environmental sample, and subsequently requires robust methods to elucidate the characteristics of those species from the heterogeneous data. A key step in learning the taxonomic diversity of a metagenomic sample is binning, which refers to grouping the nucleotide sequences belonging to an individual species or to closely related species. Identifying appropriate features and machine learning methods is essential when binning a metagenome of many unknown genomes. A significant challenge is binning a sample of closely related species. The thesis addresses this challenge and proposes a new two-tiered workflow called Coverage and composition based binning of Metagenomes (CoMet) for binning assembled sequences (contigs) of a metagenomic sample. It is demonstrated that a combination of features coupled with appropriate unsupervised learning methods can improve binning precision while enabling characterization of more species in a metagenome of species with similar genetic variants. Species richness is a key species diversity measure that corresponds to the number of species in an environmental sample. Estimating the species richness of a metagenome of viruses (i.e. a virome) from reference data is challenging because of the limited amount of viral sequence data available in reference databases. A limitation of the methods that estimate species richness without relying on reference sequence data is their assumption of equal genome length for all species in the sample. The thesis addresses this limitation by proposing a method to estimate the species richness of a virome that accounts for the variability of genome lengths among the species in the sample. In addition to estimating species richness, the proposed method enables inference of the genome length distribution from the metagenomic sequence data. RNA-Seq refers to a set of techniques enabling the effective study of the transcriptome. An application of RNA-Seq is differential transcript usage (DTU) analysis, which refers to inferring differences in the expression of multiple transcripts (isoforms) of a gene across different conditions from the sequencing data generated in an experiment. A key step in RNA-Seq data analysis is aligning the sequence reads to a reference sequence. SuperTranscripts are an alternative reference sequence proposed mainly for analyzing organisms with no reference sequence or an incomplete one. The thesis explores the use of superTranscripts to test for DTU in organisms with good reference sequences and annotations. Three definitions of counting bins based on superTranscripts, which are then used to infer DTU in genes, are considered. The results with simulated fruit fly and human data demonstrate that superTranscripts enable the analysis of DTU in genes with better control of the False Discovery Rate (FDR) than the standard methods, without requiring prior estimation of isoform abundances. The analysis of real data demonstrates the effectiveness of using superTranscripts to visualize DTU in genes.
dc.rights: Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.
dc.subject: Data Driven
dc.subject: Machine Learning
dc.subject: High-throughput sequencing
dc.subject: Metagenomics
dc.subject: Transcriptomics
dc.subject: Binning
dc.subject: Species Diversity
dc.subject: Metavirome
dc.subject: Differential Transcript Usage
dc.title: Methods for profiling heterogeneous sequencing data
dc.type: PhD thesis
melbourne.affiliation.department: Mechanical Engineering
melbourne.affiliation.faculty: Engineering
melbourne.thesis.supervisorname: Saman Halgamuge
melbourne.contributor.author: Herath Mudiyanselage, Damayanthi Kumari Herath
melbourne.thesis.supervisorothername: Alicia Oshlack
melbourne.thesis.supervisorothername: David Ackland
melbourne.tes.fieldofresearch1: 080109 Pattern Recognition and Data Mining
melbourne.tes.fieldofresearch2: 080607 Information Engineering and Theory
melbourne.tes.confirmed: true
melbourne.accessrights: Open Access

