Show simple item record

dc.contributor.author: Oloomi, Seyed Mohammad Hossein
dc.date.accessioned: 2019-01-11T00:25:39Z
dc.date.available: 2019-01-11T00:25:39Z
dc.date.issued: 2018 [en_US]
dc.identifier.uri: http://hdl.handle.net/11343/219894
dc.description: © 2018 Dr. Seyed Mohammad Hossein Oloomi
dc.description.abstract: Determining the DNA sequence of an organism is an essential step in many biological studies. High-throughput DNA sequencing technologies break the DNA molecule randomly into many small fragments and determine the sequences of these fragments in parallel. Given these short sequences, referred to as reads, efficient algorithms are used to assemble the reads back into a full genome sequence. In read mapping, the reads are aligned against a reference genome sequence to construct the sequence of the target genome. Resolving multi-mappings, which arise when a read can be aligned to more than one location in a reference sequence, is a significant problem in the processing of short read data. Read aligners either discard multi-mappings, report one location at random, or report all candidate mapping locations. These strategies may result in important biological variations being missed or false positives being introduced. Furthermore, most existing multi-mapping resolution tools are developed for use with RNA-Seq data and focus on correcting the abundance of genes and transcripts rather than considering the sequence alignment of multi-mappings. In this thesis, we first review the sources of multi-mappings and the current approaches for resolving them. We then investigate the impact of multi-mappings in read mapping in order to find how challenging the structure of the reference genome and multi-reads can be for accurate read mapping or downstream analysis. The findings show the importance of multi-mapping resolution in read mapping, while our approaches to analysis can be used to discover the extent to which multi-mappings can affect mapping accuracy for a given genome and to provide insight into the likely accuracy and limitations of the mapping results.
We then present a new method for Probabilistic Resolution of Multi-mappings (PROM) in high-throughput DNA sequencing that takes into account the sequence alignment of uniquely mapped reads and multi-reads with nucleotide precision to determine the correct location for multi-reads. The results of evaluation with real and simulated data show that it yields a significant improvement in the accuracy of read mapping and variant calling. Finally, we demonstrate a biological case study where resolution of multi-mappings with PROM can help the downstream analysis. We use the outcome of PROM multi-mapping resolution to identify mutations in the 23S rRNA gene, which has four copies in the N. gonorrhoeae genome. Use of PROM enables identification of mutations in this multi-copy gene even with high-precision variant calling. In addition, it provides a general approach for estimating the number of mutant alleles without requiring read alignment against a masked genome. [en_US]
dc.rights: Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.
dc.subject: multi-mapping [en_US]
dc.subject: read mapping [en_US]
dc.subject: DNA sequencing [en_US]
dc.title: The impact of multi-mappings in short read mapping [en_US]
dc.type: PhD thesis [en_US]
melbourne.affiliation.department: Computing and Information Systems
melbourne.affiliation.faculty: Engineering
melbourne.thesis.supervisorname: Zobel, Justin
melbourne.contributor.author: Oloomi, Seyed Mohammad Hossein
melbourne.accessrights: Open Access

