Computing and Information Systems - Research Publications

  • Item
    A fast hybrid short read fragment assembly algorithm
    Schmidt, B ; Sinha, R ; Beresford-Smith, B ; Puglisi, SJ (OXFORD UNIV PRESS, 2009-09-01)
    The shorter and vastly more numerous reads produced by second-generation sequencing technologies require new tools that can assemble massive numbers of reads in reasonable time. Existing short-read assembly tools can be classified into two categories: greedy extension-based and graph-based. While the graph-based approaches are generally superior in terms of assembly quality, the computer resources required for building and storing a huge graph are very high. In this article, we present Taipan, an assembly algorithm which can be viewed as a hybrid of these two approaches. Taipan uses greedy extensions for contig construction but at each step realizes enough of the corresponding read graph to make better decisions as to how assembly should continue. We show that this approach can achieve an assembly quality at least as good as the graph-based approaches used in the popular Edena and Velvet assembly tools using a moderate amount of computing resources.
  • Item
    is-rSNP: a novel technique for in silico regulatory SNP detection
    Macintyre, G ; Bailey, J ; Haviv, I ; Kowalczyk, A (OXFORD UNIV PRESS, 2010-09)
    MOTIVATION: Determining the functional impact of non-coding disease-associated single nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) is challenging. Many of these SNPs are likely to be regulatory SNPs (rSNPs): variations which affect the ability of a transcription factor (TF) to bind to DNA. However, experimental procedures for identifying rSNPs are expensive and labour intensive. Therefore, in silico methods are required for rSNP prediction. By scoring two alleles with a TF position weight matrix (PWM), it can be determined which SNPs are likely rSNPs. However, predictions in this manner are noisy and no method exists that determines the statistical significance of a nucleotide variation on a PWM score. RESULTS: We have designed an algorithm for in silico rSNP detection called is-rSNP. We employ novel convolution methods to determine the complete distributions of PWM scores and ratios between allele scores, facilitating assignment of statistical significance to rSNP effects. We have tested our method on 41 experimentally verified rSNPs, correctly predicting the disrupted TF in 28 cases. We also analysed 146 disease-associated SNPs with no known functional impact in an attempt to identify candidate rSNPs. Of the 11 significantly predicted disrupted TFs, 9 had previous evidence of being associated with the disease in the literature. These results demonstrate that is-rSNP is suitable for high-throughput screening of SNPs for potential regulatory function. This is a useful and important tool in the interpretation of GWAS. AVAILABILITY: is-rSNP software is available for use at: www.genomics.csse.unimelb.edu.au/is-rSNP.
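The PWM scoring step described in this abstract can be illustrated with a small sketch. It is not the is-rSNP algorithm itself: it only shows how the exact distribution of PWM scores under a background model can be built by convolving one column at a time, and how each allele's score then gets a p-value. The matrix `pwm`, the integer scores, the uniform background, and the function names are all hypothetical choices made for the example.

```python
from collections import defaultdict

BASES = "ACGT"

def score(pwm, site):
    """Sum the per-position PWM scores for a site of the same length as the PWM."""
    return sum(column[base] for column, base in zip(pwm, site))

def score_distribution(pwm, background):
    """Exact distribution of PWM scores under an i.i.d. background model.

    Built by convolving the score distribution of each column in turn.
    Integer scores keep the support of the distribution small and exact.
    """
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for partial_score, prob in dist.items():
            for base in BASES:
                nxt[partial_score + column[base]] += prob * background[base]
        dist = dict(nxt)
    return dist

def p_value(dist, observed):
    """Probability of a background site scoring at least `observed`."""
    return sum(prob for s, prob in dist.items() if s >= observed)

# Hypothetical 3-column PWM with integer log-odds-like scores.
pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -2},
    {"A": -2, "C": 3, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -2},
]
background = {base: 0.25 for base in BASES}  # assumed uniform background
dist = score_distribution(pwm, background)

# Two alleles of a hypothetical SNP at the middle position of the site.
for allele_site in ("ACG", "ATG"):
    s = score(pwm, allele_site)
    print(allele_site, s, p_value(dist, s))
```

A large gap between the two alleles' p-values would flag the SNP as a candidate for disrupting binding. The abstract also describes distributions of ratios between allele scores, which this sketch does not attempt.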
  • Item
    MIRAGAA - a methodology for finding coordinated effects of microRNA expression changes and genome aberrations in cancer
    Gaire, RK ; Bailey, J ; Bearfoot, J ; Campbell, IG ; Stuckey, PJ ; Haviv, I (OXFORD UNIV PRESS, 2010-01-15)
    MOTIVATION: Cancer evolves through microevolution where random lesions that provide the biggest advantage to cancer stand out in their frequent occurrence in multiple samples. At the same time, a gene function can be changed by aberration of the corresponding gene or modification of microRNA (miRNA) expression, which attenuates the gene. In a large number of cancer samples, these two mechanisms might be distributed in a coordinated and almost mutually exclusive manner. Understanding this coordination may assist in identifying changes which significantly produce the same functional impact on cancer phenotype, and further identify genes that are universally required for cancer. Present methodologies for finding aberrations usually analyze single datasets, which cannot identify such pairs of coordinating genes and miRNAs. RESULTS: We have developed MIRAGAA, a statistical approach, to assess the coordinated changes of genome copy numbers and miRNA expression. We have evaluated MIRAGAA on The Cancer Genome Atlas (TCGA) Glioblastoma Multiforme datasets. In these datasets, a number of genome regions coordinating with different miRNAs are identified. Although well known for their biological significance, these genes and miRNAs would be left undetected for being less significant if the two datasets were analyzed individually. AVAILABILITY AND IMPLEMENTATION: The source code, implemented in R and Java, is available from our project web site at http://www.csse.unimelb.edu.au/~rgaire/MIRAGAA/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
  • Item
    Comparative analysis of long DNA sequences by per element information content using different contexts
    Dix, TI ; Powell, DR ; Allison, L ; Bernal, J ; Jaeger, S ; Stern, L (BMC, 2007)
    BACKGROUND: Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression implies low information content. Good DNA compression models consider repetition, differences between repeats, and base distributions. From a linear DNA sequence, a compression model can produce a linear information sequence. Linear space complexity is important when exploring long DNA sequences of the order of millions of bases. Compressing a sequence in isolation will include information on self-repetition, whereas compressing a sequence Y in the context of another sequence X can find what new information X gives about Y. This paper presents a methodology for performing comparative analysis to find features exposed by such models. RESULTS: We apply such a model to find features across chromosomes of Cyanidioschyzon merolae. We present a tool that provides useful linear transformations to investigate and save new sequences. Various examples illustrate the methodology, finding features for sequences alone and in different contexts. We also show how to highlight all sets of self-repetition features, in this case within Plasmodium falciparum chromosome 2. CONCLUSION: The methodology finds features that are significant and that biologists confirm. The exploration of long information sequences in linear time and space is fast and the saved results are self-documenting.
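The idea of a per-element information sequence can be sketched with a far simpler model than the one the paper describes. The example below assumes an adaptive order-2 Markov model with add-one smoothing, which is a stand-in chosen for brevity and not the authors' compression model. The function name `information_sequence` and its parameters are hypothetical. Passing a `context` sequence X primes the model before Y is scored, which mimics compressing Y in the context of X.

```python
import math
from collections import defaultdict

def information_sequence(seq, order=2, context="", alphabet="ACGT"):
    """Return the information content, in bits, of each base of `seq`.

    Each base costs -log2 P(base | previous `order` bases), where P comes
    from adaptive counts with add-one smoothing. If `context` is given,
    the counts are first primed on that sequence.
    """
    counts = defaultdict(lambda: defaultdict(int))

    # Prime the model on the context sequence X (no cost is recorded for it).
    for i, base in enumerate(context):
        counts[context[max(0, i - order):i]][base] += 1

    info = []
    for i, base in enumerate(seq):
        seen = counts[seq[max(0, i - order):i]]
        total = sum(seen.values())
        prob = (seen[base] + 1) / (total + len(alphabet))
        info.append(-math.log2(prob))
        seen[base] += 1  # update the model only after the base has been scored
    return info

x = "ACGT" * 10
y = "ACGTACGT"
alone = information_sequence(y)
given_x = information_sequence(y, context=x)
print(round(sum(alone), 2), round(sum(given_x), 2))
```

Low values in the returned list mark bases the model predicts well, such as repeats. Comparing `alone` with `given_x` position by position shows where X contributes information about Y.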
  • Item
    Allocation strategies for utilization of space-shared resources in Bag of Tasks grids
    De Rose, CAF ; Ferreto, T ; Calheiros, RN ; Cirne, W ; Costa, LB ; Fireman, D (ELSEVIER, 2008-05)
  • Item
    Shuffle-Sum: Coercion-Resistant Verifiable Tallying for STV Voting
    Benaloh, J ; Moran, T ; Naish, L ; Ramchen, K ; Teague, V (IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2009-12)
  • Item
    An empirical study of the effects of NLP components on Geographic IR performance
    Stokes, N ; Li, Y ; Moffat, A ; Rong, J (TAYLOR & FRANCIS LTD, 2008)
  • Item
    Index compression using 64-bit words
    Anh, VN ; Moffat, A (WILEY, 2010-02)
  • Item
    The V*-Diagram: A query-dependent approach to moving KNN queries
    Nutanong, S ; Zhang, R ; Tanin, E ; Kulik, L (Association for Computing Machinery (ACM), 2008-01-01)
    The moving k nearest neighbor (MkNN) query finds the k nearest neighbors of a moving query point continuously. The high potential of reducing the query processing cost as well as the large spectrum of associated applications have attracted considerable attention to this query type from the database community. This paper presents an incremental safe-region-based technique for answering MkNN queries, called the V*-Diagram. In general, a safe region is a set of points where the query point can move without changing the query answer. Traditional safe-region approaches compute a safe region based on the data objects but independent of the query location. Our approach exploits the current knowledge of the query point and the search space in addition to the data objects. As a result, the V*-Diagram has much smaller IO and computation costs than existing methods. The experimental results show that the V*-Diagram outperforms the best existing technique by two orders of magnitude.
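The notion of a safe region can be made concrete with a deliberately simple construction. The sketch below is not the V*-Diagram: it assumes a conservative circular safe region whose radius is half the gap between the distances to the k-th and (k+1)-th nearest points. Within that circle every current answer point stays at least as close as every other point, so the kNN set cannot change. The function name `knn_with_safe_radius` and the brute-force sort are illustrative assumptions.

```python
import math

def knn_with_safe_radius(points, q, k):
    """Return the k nearest points to q and a conservative safe radius.

    If the query moves by r from q, answer points are at most d_k + r away
    and all other points are at least d_next - r away, so the answer set is
    unchanged while r <= (d_next - d_k) / 2.
    """
    ranked = sorted(points, key=lambda p: math.dist(p, q))
    answer = ranked[:k]
    if len(ranked) <= k:
        return answer, math.inf  # no competing point can enter the answer
    d_k = math.dist(ranked[k - 1], q)
    d_next = math.dist(ranked[k], q)
    return answer, (d_next - d_k) / 2

points = [(0, 0), (1, 0), (5, 0), (10, 0)]
q0 = (0, 0)
answer, radius = knn_with_safe_radius(points, q0, k=2)

# While the query stays within `radius` of q0, no recomputation is needed.
q1 = (1.5, 0)
if math.dist(q0, q1) <= radius:
    print("answer still valid:", answer)
else:
    answer, radius = knn_with_safe_radius(points, q1, k=2)
```

This region depends only on distances measured at the last evaluation point, so it is usually much smaller than necessary. The abstract's point is that using knowledge of the query location and the search space gives better regions at lower cost.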