University Library
A gateway to Melbourne's research publications
Minerva Access is the University's Institutional Repository. It aims to collect, preserve, and showcase the intellectual output of staff and students of the University of Melbourne for a global audience.

    Statistical analyses of high-throughput sequencing data to study chromatin structure and organisation

    Author
    Lun, Aaron Tin Long
    Date
    2015
    Affiliation
    Medical Biology
    Document Type
    PhD thesis
    Access Status
    This item is currently not available from this repository
    URI
    http://hdl.handle.net/11343/57482
    Description

    © 2015 Dr. Aaron Tin Long Lun

    Abstract
    Massively parallel sequencing technology is a powerful experimental tool for molecular biology research. The most obvious application is RNA-seq, where cellular RNA is sequenced to quantify the expression of each gene in the genome. Another application is that of ChIP-seq, which is widely used to identify the genomic binding sites of transcription factors or other proteins. The Hi-C and ChIA-PET methods aim to examine pairwise interactions between genomic loci, such as those between enhancers and genes. Together, these techniques can be used to study the structure and organisation of the genome, and how they relate to genome function, i.e., as mechanisms of gene regulation.

    This thesis describes novel statistical and computational approaches for the analysis of data from these sequencing experiments. In particular, this work focuses on the detection of differential features between conditions, e.g., changes in the binding profile for ChIP-seq or in the interaction intensities for Hi-C. Such changes are easier to detect using established methods like those in the edgeR package. Molecular changes are also more likely to be relevant, as they can be associated with the biological differences between conditions.

    For ChIP-seq data, the aim of the analysis is to identify regions of differential binding between conditions. This is done in a de novo manner that does not depend on pre-specified regions of interest. Instead, empirically defined peaks or sliding windows must be used. Peak calling must be performed independently of the tests for differential binding, in order to maintain type I error control for the latter. Similarly, the false discovery rate may be misinterpreted when multiple overlapping windows are present for each binding site. Some strategies are proposed here to maintain error control in both analyses.

    The impact of normalization on differential binding analyses is also discussed. Composition and efficiency biases may be present between libraries. However, methods for scaling normalization will only be able to remove one of these biases. The assumptions of some of these methods are examined in a mathematical framework, along with the effect of each method on the analysis. For complex trended biases, a method for non-linear normalization is proposed that is more robust than other approaches at low counts.

    For Hi-C data, an analysis pipeline is described that spans from read alignment to detection of differential interactions. Several strategies for the alignment of chimeric reads are assessed. The effect of different parametrizations during counting and filtering is examined. The need for non-linear normalization and modelling of biological variability is also demonstrated on real data. A method is proposed to avoid misinterpretation of the false discovery rate, and to combine results from analyses at different spatial resolutions.

    A similar pipeline is described to detect specific protein-mediated interactions from ChIA-PET data. The inadequacy of an existing method is explored with respect to the non-randomness of ligation. A more robust analysis is described that exploits hetero-linker data to define the null hypothesis. This new approach also accounts for overdispersion between biological replicates, though some care is required in handling low counts.

    Some work is also done pertaining to general analyses of sequencing data. The statistical concept of independent filtering for count data is explored, and the use of the average abundance as an independent filter statistic is proposed. The effect of zero counts on the residual degrees of freedom in generalized linear models is studied, and a refinement to existing methods is proposed to avoid loss of error control during testing. Intersection-union tests are also examined to integrate analyses from multiple datasets, where several new procedures are proposed to increase power relative to conventional methods.

    In summary, this thesis describes a number of novel methods for the analysis of ChIP-seq, Hi-C and ChIA-PET data, as well as that of sequencing data in general.
    Keywords
    sequencing; statistics; differential analysis; Hi-C; RNA-seq; ChIP-seq; ChIA-PET
