Methods for profiling heterogeneous sequencing data
Document Type: PhD thesis
Access Status: Open Access
© 2019 Damayanthi Kumari Herath Herath Mudiyanselage
Metagenomics, which utilises high-throughput DNA sequencing, is widely applied to study bacteria and viruses and their effects on their host environments. It involves collective sequencing of the genetic material of the species in an environmental sample, and therefore requires robust methods to elucidate the characteristics of those species from the heterogeneous data. A key step in learning the taxonomic diversity of a metagenomic sample is binning, which refers to grouping the nucleotide sequences that belong to an individual species or to closely related species. Identifying appropriate features and machine learning methods is essential when binning a metagenome of many unknown genomes. A significant challenge is binning a sample of closely related species. The thesis addresses this challenge and proposes a new two-tiered workflow called Coverage and composition based binning of Metagenomes (CoMet) for binning assembled sequences (contigs) of a metagenomic sample. It is demonstrated that a combination of features coupled with appropriate unsupervised learning methods can improve the precision of binning while enabling characterization of more species in a metagenome of species with similar genetic variants.
Species richness is a key species diversity measure which corresponds to the number of species in an environmental sample. Estimating the species richness of a metagenome of viruses (i.e. a virome) based on reference data is challenging because of the limited amount of viral sequence data available in reference databases. A limitation identified in methods that estimate species richness without relying on reference sequence data is the assumption of equal genome length for all the species in the sample. The thesis addresses this limitation by proposing a method to estimate species richness from a virome that accounts for the variability of the genome lengths of the species in the sample. In addition to estimating species richness, the proposed method enables inference of the genome length distribution from the metagenomic sequence data.
RNA-Seq refers to a set of techniques enabling the effective study of the transcriptome. One application of RNA-Seq is differential transcript usage (DTU) analysis, which refers to inferring differences in the expression of multiple transcripts (isoforms) of a gene across different conditions from the sequencing data generated in an experiment. A key step in RNA-Seq data analysis is aligning the sequence reads to a reference sequence. SuperTranscripts are an alternative reference sequence proposed mainly for analyzing organisms with no or incomplete reference sequences. The thesis explores the use of superTranscripts to test for DTU in organisms with good reference sequences and annotations. Three definitions of counting bins based on superTranscripts, which are further used to infer DTU in genes, are considered. The results with simulated fruit fly and human data demonstrate that superTranscripts enable the analysis of DTU in genes with better control of the False Discovery Rate (FDR) than the standard methods, without requiring prior estimation of isoform abundances. The analysis of real data demonstrates the effectiveness of using superTranscripts to visualize DTU in genes.
Keywords: Data Driven; Machine Learning; High-throughput sequencing; Metagenomics; Transcriptomics; Binning; Species Diversity; Metavirome; Differential Transcript Usage