Computing and Information Systems - Theses

Permanent URI for this collection

Search Results

Now showing 1 - 1 of 1
  • Item
    Thumbnail Image
    Compression of large DNA databases
    Kuruppu, Shanika Sewwandini ( 2012)
    The thesis explores algorithms to efficiently store and access repetitive DNA sequence collections produced by large-scale genome sequencing projects. First, existing general-purpose and DNA compression algorithms are evaluated for their suitability for compressing large collections of DNA sequences. Then two novel algorithms for compressing large collections of DNA sequences are introduced. The first algorithm is COMRAD, which is a disk-based dictionary compression algorithm that iteratively detects repetitions that occur across multiple sequences, and substitutes them with non-terminals. The method showed that repeats can be feasibly detected across multiple sequences for relatively large collections, while preserving sequence boundaries. The second algorithm is RLZ, which compresses highly similar sequence collections using a simple LZ77 parsing of each sequence with respect to a sequence chosen to be the reference. RLZ was also extended to conduct non-greedy LZ77 parsing, and with the combination of a few other optimisations, the algorithm indirectly detects and encodes approximate matches. RLZ is memory efficient, and is one of the fastest DNA sequence compression algorithms existing to date, both in terms of the compression and decompression speed. Both COMRAD and RLZ can compress sequence collections of fully assembled chromosomes and genomes, as well as sets of contigs and reads. Both algorithms also support individual sequence extraction and random access queries on compressed sets. RLZ was also extended to a full self-index by enabling substring searching on compressed sets with some limitations on detecting substrings. Since the effectiveness of RLZ compression depends on the reference sequence chosen for a collection, techniques for constructing reference sequences that better represent the collection were also explored. The results showed that using a reference sequence constructed with repeats detected by dictionary compressors leads to significant improvements in the compressed sizes produced by RLZ.