The impact of multi-mappings in short read mapping
Affiliation: Computing and Information Systems
Document Type: PhD thesis
Access Status: Open Access
© 2018 Dr. Seyed Mohammad Hossein Oloomi
Determining the DNA sequence of an organism is an essential step in many biological studies. High-throughput DNA sequencing technologies break the DNA molecule randomly into many small fragments and determine the sequences of these fragments in parallel. Given these short sequences, referred to as reads, efficient algorithms are used to assemble the reads back into a full genome sequence. In read mapping, the reads are aligned against a reference genome sequence to construct the sequence of the target genome. Resolving multi-mappings, which arise when a read can be aligned to more than one location in a reference sequence, is a significant problem in the processing of short read data. Read aligners either discard multi-mappings, report one location at random, or report all candidate mapping locations. These strategies may cause important biological variations to be missed or false positives to be introduced. Furthermore, most existing multi-mapping resolution tools were developed for RNA-Seq data and focus on correcting the abundance of genes and transcripts rather than considering the sequence alignment of multi-mappings. In this thesis, we first review the sources of multi-mappings and the current approaches for resolving them. We then investigate the impact of multi-mappings on read mapping to determine how challenging the structure of the reference genome and multi-reads can be for accurate read mapping or downstream analysis. The findings show the importance of multi-mapping resolution in read mapping, and our analysis approaches can be used to discover the extent to which multi-mappings affect mapping accuracy for a given genome and to provide insight into the likely accuracy and limitations of the mapping results.
We then present a new method for Probabilistic Resolution of Multi-mappings (PROM) in high-throughput DNA sequencing that takes into account the sequence alignment of uniquely mapped reads and multi-reads with nucleotide precision to determine the correct location for multi-reads. Evaluation with real and simulated data shows that it yields a significant improvement in the accuracy of read mapping and variant calling. Finally, we demonstrate a biological case study in which resolution of multi-mappings with PROM helps the downstream analysis. We use the outcome of PROM multi-mapping resolution to identify mutations in the 23S rRNA gene, which has four copies in the N. gonorrhoeae genome. Use of PROM enables identification of mutations in this multi-copy gene even with high-precision variant calling. In addition, it provides a general approach for estimating the number of mutant alleles without requiring read alignment against a masked genome.
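The three conventional aligner strategies mentioned in the abstract (discard multi-mappings, report one location at random, report all candidate locations) can be illustrated with a minimal Python sketch. This is a toy illustration only, not PROM and not the behaviour of any particular aligner; the function name `report_locations`, the strategy labels, and the `(read_id, position)` input format are assumptions made for the example.

```python
import random
from collections import defaultdict


def report_locations(alignments, strategy, rng=None):
    """Apply a conventional multi-mapping strategy to candidate alignments.

    alignments: iterable of (read_id, position) pairs, one per candidate
                mapping location (hypothetical input format).
    strategy:   "discard", "random", or "all" (hypothetical labels).
    Returns a dict mapping each read_id to the list of reported positions.
    """
    rng = rng or random.Random(0)

    # Group candidate positions by read.
    candidates = defaultdict(list)
    for read_id, position in alignments:
        candidates[read_id].append(position)

    reported = {}
    for read_id, positions in candidates.items():
        if len(positions) == 1:
            # Uniquely mapped read: always reported as-is.
            reported[read_id] = list(positions)
        elif strategy == "discard":
            # Multi-read is dropped; variants it covers may be missed.
            reported[read_id] = []
        elif strategy == "random":
            # One candidate chosen at random; may be the wrong location.
            reported[read_id] = [rng.choice(positions)]
        elif strategy == "all":
            # Every candidate kept; wrong locations can add false positives.
            reported[read_id] = list(positions)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return reported


# Toy input: r1 maps uniquely, r2 maps to two locations.
example = [("r1", 100), ("r2", 250), ("r2", 900)]
for name in ("discard", "random", "all"):
    print(name, report_locations(example, name))
```

None of these strategies uses the alignment evidence from neighbouring reads to choose among candidate locations, which is the gap that a resolution method such as the one described in the abstract aims to fill.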
Keywords: multi-mapping; read mapping; DNA sequencing