Computing and Information Systems - Theses

Search Results

Now showing 1 - 10 of 18
  • Item
    Voice in virtual worlds
    Wadley, Gregory Robert ( 2011)
    Virtual worlds are simulated online spaces through which large numbers of people connect in order to work, play and socialize. Examples include massively multiplayer online games like World of Warcraft and open-ended worlds like Second Life. Virtual worlds are differentiated from other systems by their simulation of a persistent three-dimensional landscape, in which users are usually represented as avatars. Virtual worlds host large numbers of users and support a variety of recreational and instrumental uses. A critical aspect of any collaborative system is communication. In virtual worlds this is especially complex. Most users encounter people who are unknown to them offline. Most maintain a presence in the world over many months or even years, yet may prefer to be pseudonymous or engage in identity-play. Users must simultaneously manage both physical and virtual contexts. Synchronous as well as asynchronous communication is required. Virtual worlds initially offered only text as a medium for user-to-user communication. More recently, vendors have introduced facilities for communicating by voice. This has made the experience of virtual worlds more convivial for some, and has enabled forms of collaboration that were previously only possible in small experimental systems. But the introduction of voice provoked controversy, with some protesting that it projects too clearly the personal characteristics of speakers, damaging pseudonymity. Some users are more sensitive about speaking with strangers than about typing to them, and may become less communicative when adopting voice, or more easily dominated by extroverted collaborators. Voice channels are more prone to abuse, and the abuse can be more impactful. Users encounter sound quality problems, and are often uncertain whether they are being heard. Voice is less suited to asynchronous communication, and is more prone to congestion. It appears that voice works well when conditions suit, but can lead to failed implementations when deployed inappropriately. Yet little research has been conducted to help us understand to which situations voice is suited, whom it benefits and whom it does not, and how it can best be configured to support different activities. I conducted four studies designed to fill this gap. Two examined the influence of voice on user experience in the two major types of virtual world. The others examined the interaction between voice and spatiality, the defining feature of virtual worlds, at macro and micro scales. I studied use in naturalistic contexts, collecting data via interviews and diaries and triangulating these with observation, online ethnography, conversation analysis and quantitative measures. I found that voice transforms the user experience of virtual worlds. It makes some forms of collaboration more efficient. However, it interferes with identity-play and the ability of users to manage multiple tasks and conversations. When voice is propagated spatially, it increases immersion, reduces channel clutter, and affords new strategies for team coordination. However, verbal references to places and objects often fail. I discuss these results in the light of post-media-richness theories of communication, arguing that preferences for one modality or another reflect broader issues of managing social presence in virtual and physical contexts.
  • Item
    Analysis and correction of short read data
    Schröder, Jan ( 2011)
    Next generation sequencing (NGS) platforms have revolutionised biomedical research and created a new branch of bioinformatics algorithms and applications. Due to the novelty of the technology and its rapid evolution, little is understood about the properties of the created data. Furthermore, sequencing errors make it difficult to process the data. For example, when assembling the short read fragments of NGS data into the long DNA sequences they were derived from, sequencing errors can cause ambiguities and make it difficult to identify the correct relation between fragments. Thus, sequencing errors in NGS data can lead to shorter and erroneous assemblies of the genetic material. Understanding the properties of NGS data and correcting the errors contained in it is crucial to maximising the scientific gain from sequencing experiments. This thesis investigates properties of sequencing data, presents novel approaches to error correction, and compares a range of error correction algorithms. First, we analyse a range of data sets on common properties to gain insights into characteristics of NGS data. We present a methodology to analyse the quality of a data set without the need for a reference sequence. This includes the analysis of base calls, quality scores and k-mer distributions, and the use of the Kullback-Leibler divergence to compare different positions in NGS reads. Second, this thesis introduces a novel approach to error correction of NGS data, the error correction algorithm Shrec. Shrec utilises a data structure derived from a suffix trie to identify and correct errors in reads by comparing subsequences. Third, we introduce a machine learning guided approach to error correction. On the basis of the Shrec algorithm, we train classifiers to decide which bases are errors and which are not. We demonstrate that this method improves the error correction capabilities of the algorithm on a range of data. Furthermore, we establish sufficient robustness of the approach for practical use. Fourth, this thesis presents Coral, another error correction algorithm. Coral makes use of multiple alignments to compare reads and to identify and correct errors in them. Multiple alignments add more context to the error correction process than the subsequence comparison used in Shrec and other error correction algorithms. Fifth, a comparison of a range of error correction tools demonstrates the competitiveness of the above methods. We assess all algorithms for their correction accuracy, resource consumption, and impact on assembling the corrected NGS data. This comparison reveals the different strengths and weaknesses of the available error correction algorithms and helps to identify directions for future development. In summary, we present observations on NGS data and a methodology to assess properties of NGS data without a reference sequence. We introduce novel approaches to the error correction problem and establish the competitiveness of these algorithms in comparison to other work in the field.
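As a rough illustration of the per-position comparison mentioned above (comparing base distributions at different read positions with the Kullback-Leibler divergence), the following Python sketch estimates the base distribution at two positions of a toy read set and computes the divergence between them. The read set, the positions compared, and the pseudocount are assumptions for illustration, not details taken from the thesis.

```python
from collections import Counter
from math import log

def base_distribution(reads, position, pseudocount=1.0):
    """Empirical distribution of A/C/G/T at one read position, with pseudocounts."""
    counts = Counter(read[position] for read in reads if len(read) > position)
    total = sum(counts[b] + pseudocount for b in "ACGT")
    return {b: (counts[b] + pseudocount) / total for b in "ACGT"}

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) over the four bases."""
    return sum(p[b] * log(p[b] / q[b]) for b in "ACGT")

# Toy reads; real NGS data sets contain millions of reads.
reads = ["ACGTACGT", "ACGTTCGT", "ACCTACGT", "ACGAACGT"]
p = base_distribution(reads, 1)
q = base_distribution(reads, 3)
print(kl_divergence(p, q))
```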
  • Item
    Computing relationships and relatedness between contextually diverse entities
    Grieser, Karl ( 2011)
    When presented with a pair of entities such as a ball and a bat, a person may make the connection that both of these entities are involved in sport (e.g., the sports baseball or cricket, based on the individual's background), that the composition of the two entities is similar (e.g., a wooden ball and a wooden stick), or, if the person is especially creative, a fancy dress ball where someone has come dressed as a bat. All of these connections are equally valid, but depending on the context the person is familiar with (e.g., sport, wooden objects, fancy dress), a particular connection may be more apparent to that person. From a computational perspective, identifying these relationships and calculating the level of relatedness of entity pairs requires consideration of all ways in which the entities are able to interact with one another. Existing approaches to identifying the relatedness of entities and the semantic relationships that exist between them fail to take into account the multiple diverse ways in which these entities may interact, and hence do not explore all potential ways in which entities may be related. In this thesis, I use the collaborative encyclopedia Wikipedia as the basis for the formulation of a measure of semantic relatedness that takes into account the contextual diversity of entities (called the Related Article Conceptual Overlap, or RACO, method), and describe several methods of relationship extraction that utilise the taxonomic structure of Wikipedia to identify pieces of text that describe relations between contextually diverse entities. I also describe the construction of a dataset of museum exhibit relatedness judgements used to evaluate the performance of RACO. I demonstrate that RACO outperforms state-of-the-art measures of semantic relatedness over a collection of contextually diverse entities (museum exhibits), and that the taxonomic structure of Wikipedia provides a basis for identifying valid relationships between contextually diverse entities. As this work is presented with regard to the domain of Cultural Heritage, using Wikipedia as the basis for representation, I additionally describe how the principle of conceptual overlap for calculating semantic relatedness, and the relationship extraction methods based on taxonomic links, can be adapted to other contextually diverse domains and to other representational resources.
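The abstract describes RACO as a relatedness measure built on the conceptual overlap between the Wikipedia articles related to each entity. The exact RACO formulation is not reproduced here; the Python sketch below only illustrates the general idea of overlap-based relatedness with a Jaccard-style ratio, and the related-article sets are hypothetical.

```python
def conceptual_overlap(related_a, related_b):
    """Relatedness as the overlap between two sets of related Wikipedia articles
    (a Jaccard-style illustration of conceptual overlap, not the exact RACO method)."""
    if not related_a or not related_b:
        return 0.0
    return len(related_a & related_b) / len(related_a | related_b)

# Hypothetical sets of articles related to the Wikipedia pages for two exhibits.
ball = {"Sport", "Cricket", "Baseball", "Leather", "Sphere"}
bat = {"Sport", "Cricket", "Baseball", "Wood", "Club"}
print(conceptual_overlap(ball, bat))  # 3 shared of 7 distinct articles, about 0.43
```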
  • Item
    Identity-based encryption and its applications using bilinear maps
    Lee, Peter Hyun-Jeen ( 2011)
    This thesis comprises three major results. Firstly, we present an efficient certificateless encryption scheme in the random oracle model. We observe that previous schemes offer either a compact public key or efficient computation, but not both. Our result shows that it is possible to keep the public key compact while retaining computational efficiency. Next, we present a ciphertext-anonymous identity-based signcryption scheme in the standard model. We confirm that the previous identity-based signcryption schemes proven to be chosen-ciphertext secure in the standard model are broken. Then, we construct an identity-based signcryption scheme by carefully combining existing primitives to obtain an efficient ciphertext size. We further strengthen the security of this basic scheme with ciphertext anonymity. Our result shows that the construction is efficient, with an emphasis on ciphertext size reduction, while providing stronger security than the underlying primitives used. Lastly, we present a revocable attribute-based encryption scheme in the standard model. This work is inspired by a recent approach to constructing a revocable attribute-based encryption scheme by integrating broadcast encryption and attribute-based encryption. The main advantage of our construction is that it has a constant ciphertext size and its security is derived from the underlying schemes with a slight modification. Our result shows that the construction is secure and provides efficient user revocation.
  • Item
    Experimental evaluation of information retrieval systems
    Ravana, Sri Devi ( 2011)
    Comparative evaluations of information retrieval systems using test collections are based on a number of key premises, including that representative topic sets can be created, that suitable relevance judgments can be generated, and that systems can be sensibly compared based on their retrieval correctness over the selected topics. The performance over each topic is measured using a chosen evaluation metric. In recent years, statistical analysis has been applied to further assess the significance of such measurements. This thesis makes several contributions to this experimental methodology by addressing realistic issues faced by researchers using test collections for evaluating information retrieval systems. First, it is common for the performance of a system on a set of topics to be represented by a single overall performance score such as the average. Therefore, we explore score aggregation techniques including the arithmetic mean, the geometric mean, the harmonic mean, and the median. We show that an adjusted geometric mean provides more consistent system rankings than the arithmetic mean when a significant fraction of the individual topic scores are close to zero, and that score standardization achieves the same outcome in a more consistent manner. We suggest that care with the selection of the central tendency measure and evaluation metric used to represent system effectiveness can yield more consistent system rankings than was previously possible. Our second contribution relates to efforts to reduce the experimental cost arising from growing document corpus sizes and hence the increasing number of relevance judgments required for reliable system evaluation. We introduce smoothing of retrieval effectiveness scores using previous knowledge of system performance, to raise the quality of evaluations without incurring additional experimental cost. Smoothing balances results from prior incomplete query sets against limited additional complete information, in order to obtain more refined system orderings than would be possible on the new queries alone. Finally, it is common for only partial relevance judgments to be used when comparing retrieval system effectiveness, in order to control experimental cost. We consider the uncertainty introduced into per-topic effectiveness scores by pooled judgments, and measure the effect that incomplete evidence has both on the system scores that are generated and on the quality of paired system comparisons. We propose score estimation methods to handle the uncertainty introduced into effectiveness scores in the face of missing relevance judgments.
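To make the aggregation question above concrete, the Python sketch below computes the arithmetic, geometric and harmonic means and the median of a set of per-topic scores. The epsilon adjustment that keeps near-zero scores from collapsing the geometric mean is an assumed illustration; the thesis's adjusted geometric mean and its standardization procedure may differ.

```python
import statistics
from math import exp, log

def aggregate_scores(scores, epsilon=1e-5):
    """Summarise per-topic effectiveness scores with several measures of central tendency.

    epsilon is a small assumed adjustment so that near-zero topic scores do not
    force the geometric mean to zero.
    """
    adjusted = [s + epsilon for s in scores]
    return {
        "arithmetic_mean": statistics.mean(scores),
        "geometric_mean": exp(sum(log(s) for s in adjusted) / len(adjusted)) - epsilon,
        "harmonic_mean": len(adjusted) / sum(1.0 / s for s in adjusted),
        "median": statistics.median(scores),
    }

# Illustrative per-topic average precision scores for one system.
ap_scores = [0.62, 0.01, 0.45, 0.002, 0.80, 0.33]
print(aggregate_scores(ap_scores))
```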
  • Item
    Computational methods for understanding the molecular mechanisms governing transcriptional regulation in humans
    Macintyre, Geoffrey John ( 2011)
    Understanding the molecular mechanisms governing transcriptional regulation is crucial in complex disease etiology. The prohibitive cost of experimentally mapping the complete set of genomic features governing gene regulation in humans means that computational techniques are required to define regions for further experimental exploration. This thesis has therefore developed a suite of computational techniques that assist in understanding aberrant modes of transcriptional regulation in human diseases. Historically, financial and technical limitations have restricted most research on gene regulation to a fraction of the genome made up of regions around transcription start sites (TSSs). This is largely motivated by the observation that most regulatory elements are contained in this fraction in the model organisms frequently used to study transcriptional regulation (such as yeast). In humans, however, there are examples of transcription factors (TFs) bound over one megabase from the genes they regulate. Although such cases are considered rare, the increased complexity of gene regulation in humans over model organisms essentially requires that TFs assemble and regulate genes beyond gene promoter regions. As such, we have sought to assess the compromise made by limiting search to promoter regions in humans, and to characterise the distance around TSSs that would adequately encapsulate the regulatory elements controlling any given gene. Our results show that in humans, searching for regulatory elements in promoter regions characterises at best 40-60% of the events controlling a gene. Instead, we estimate that searching 100kb up- and downstream of a TSS is necessary to characterise the majority of regulatory elements controlling any given gene. Using this distance estimate, we assessed a number of disease-associated SNPs with no known function for their potential to exert regulatory effects. We revealed two strong candidates that impact the regulation of disease-associated genes: a SNP affecting binding of the AhR:Arnt complex located 77kb downstream of the MTHFR gene in schizophrenia, and a SNP creating a REST binding site 25kb upstream of MSMB in prostate cancer. Another important aspect of understanding mechanisms of transcriptional regulation in human disease is to determine the pathways or biological processes which are likely to be perturbed in the disease. Commonly, this is done through manipulation of data resulting from gene expression profiling techniques. Gene expression profiling provides insight into the functions of genes at a molecular level, and clustering of these profiles can facilitate the identification of the underlying biological program driving genes’ co-expression. Standard clustering methods, grouping genes based on similar expression values, can fail to capture weak expression correlations, potentially causing genes in the same biological process to be grouped separately. We therefore developed a novel clustering algorithm which incorporates functional gene information from the Gene Ontology into the clustering process, resulting in more biologically meaningful clusters. We validated this method using two multi-cancer microarray datasets, showing that our method provides clusters which are more easily interpreted biologically. In addition, we show the potential of such methods for the exploration of cancer etiology.
As well as identifying groups of genes which are co-regulated and important for the progression of a disease, we have also developed methods to assist in the identification of the regulatory elements controlling these genes. The first such method relies on the output of a genome-wide supervised regulatory element predictor to identify the TFs and sequence features important for successful prediction of these regulatory elements. The supervised predictor uses genome-wide ChIP-SEQ data to train a classifier based on k-mer features (4 bases long) from a single chromosome to predict ChIP-SEQ binding sites genome-wide. The output of the classifier is a set of weighted k-mer features selected via recursive feature elimination. Our algorithm takes these weighted features as input and combines them based on statistical over-representation in the positive training set to build PWMs representing the sequence features important for prediction of ChIP-SEQ binding sites. In a study of four TFs (STAT1, MYC, JUN and GABP), we show that the most important sequence features for prediction of TF binding sites are CpG dinucleotides. We suggest this is likely due to the increased likelihood that CpGs change DNA methylation status, an appealing property for TFBSs. In addition, we show that many of the remaining features represent the canonical binding site for the TF, and PWMs for known partner TFs. These results highlight that binding of TFs in humans is dependent not only on the recognition of the canonical binding site of the TF, but also on surrounding sequence features such as partner TF binding sites. Following on from the importance of partner TF binding, and using our previous estimate of a 100kb search space around TF target gene TSSs, we developed a novel cis-regulatory module prediction algorithm. The goal of the algorithm is to provide the set of TFs, and their binding locations, that control the regulation of a set of putatively co-regulated genes. Specifically, the algorithm takes as input a set of co-regulated genes and a PWM for a single TF expected to control the regulation of those genes. Using this PWM to seed the cis-regulatory module (CRM) search, the algorithm can accurately predict CRM composition and location when partner TFs are known or unknown (partner PWMs are predicted de novo). Comparison against existing CRM predictors shows that our algorithm compares favorably when using, as a gold standard, empirical mapping of two transcription factors, STAT1 and C-JUN, involved in the response of HelaS3 and K562 cells to IFNgamma treatment. We demonstrate that a ‘simulated knockout’ of C-JUN using the predicted CRMs recapitulates target gene expression in an in vivo mouse C-JUN knockout model. In addition, predicted CRMs were used to determine the effects of partner transcription factors on estrogen receptor mediated tamoxifen response. We show that tamoxifen preferentially targets estrogen receptors bound downstream of their target gene and that partner transcription factors binding with the estrogen receptor provide protection against the effects of tamoxifen. Furthermore, we apply our CRM predictor and ‘simulated knockout’ analysis to clusters of co-regulated genes identified using our gene ontology based clustering algorithm. The resulting gene regulatory networks not only recapitulate known biology in metastasis and metabolism in cancer, but also provide novel insight into how particular TFs control this behaviour.
Finally, we present computational approaches for identifying disease-associated SNPs and CNVs that are likely to disrupt the binding of a transcription factor which is critical in the progression of the disease. This is important as determining the functional impact of non-coding disease-associated single nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) is challenging. Many of these SNPs are likely to be regulatory SNPs (rSNPs): variations which affect the ability of a transcription factor (TF) to bind to DNA. However, experimental procedures for identifying rSNPs are expensive and labour intensive. Therefore, in silico methods are required for rSNP prediction. By scoring two alleles with a TF position weight matrix (PWM), it can be determined which SNPs are likely rSNPs. However, predictions made in this manner are noisy, and no method exists that determines the statistical significance of a nucleotide variation’s effect on a PWM score. We have designed an algorithm for in silico rSNP detection called is-rSNP. We employ novel convolution methods to determine the complete distributions of PWM scores and of the ratios between allele scores, facilitating assignment of statistical significance to rSNP effects. We have tested our method on 41 experimentally verified rSNPs, correctly predicting the disrupted TF in 28 cases. We also analysed 146 disease-associated SNPs with no known functional impact in an attempt to identify candidate rSNPs. Of the 11 significantly predicted disrupted TFs, 9 had previous evidence of being associated with the disease in the literature. These results demonstrate that is-rSNP is suitable for high-throughput screening of SNPs for potential regulatory function. This is a useful and important tool in the interpretation of GWAS. As well as SNPs impacting on regulatory elements, copy-number variations (CNVs) can also have an effect. By integrating CNV data and expression profiling data across cancer patient cohorts, together with TSS-TFBS binding pairs from ChIP-SEQ and expression profiling data, we are able to identify regions where a CNV is present in the TFBS but not in the target gene. Using these regions, we employ trend testing to find individuals with a loss, normal or gain in copy number that corresponds to reduced, normal or increased expression of the target gene, respectively. Of the candidate CNVs identified, we show that CNVs affecting two target genes, GEMIN4 and DDX9, are directly linked to overall survival in serous cystadenocarcinoma. In summary, this thesis presents a range of computational methods and analyses which improve our understanding of transcriptional regulation and of how this affects the progression of various human diseases.
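The is-rSNP algorithm above assigns statistical significance to allele score differences via convolution-derived score distributions; that machinery is not reproduced here. The Python sketch below shows only the basic step of scoring the two alleles of a SNP against a position weight matrix, using a hypothetical PWM and a uniform background rather than real motif data.

```python
from math import log2

# Hypothetical PWM (per-position base probabilities) for a short TF motif;
# real PWMs would come from motif databases.
PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def pwm_score(window):
    """Log-odds score of a sequence window against the PWM."""
    return sum(log2(PWM[i][base] / BACKGROUND[base]) for i, base in enumerate(window))

# A SNP at the third motif position: reference allele T, alternate allele C.
ref_window, alt_window = "AGTA", "AGCA"
print(pwm_score(ref_window) - pwm_score(alt_window))  # positive: the alternate allele weakens the site
```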
  • Item
    Meeting or beating cash flow forecasts: market response, future performance and real activities management
    Al Shidi, Saif ( 2011)
    Analysts’ cash flow forecasts are not new to the market. However, academic research interest in them is relatively new. Recent studies investigating cash flow forecasts show that these forecasts can positively affect the quality of accruals (McInnis and Collins, 2011) and the accuracy of earnings forecasts (Call et al., 2009). Other studies argue that these forecasts can also have market implications (DeFond and Hung, 2003). They document that the market responds positively to meeting or beating cash flow forecasts (Melendez et al., 2005; Zhang, 2008; Brown and Pinello, 2008). This thesis builds on prior research by (1) addressing the association between meeting or beating cash flow forecasts and future operating performance, (2) analyzing the market response to firms meeting or beating cash flow forecasts using real activities management, (3) studying future market returns and operating performance for companies meeting or beating cash flow forecasts using real activities management, and (4) investigating the market response to meeting or beating cash flow forecasts vis-à-vis the market response to meeting or beating accruals forecasts in contexts related to high levels of accruals, capital intensity, poor financial health and changes in institutional settings. Using quarterly data, empirical findings from this thesis confirm prior research results that the market responds positively to meeting or beating cash flow forecasts. This provides evidence that the market reward for meeting or beating cash flow forecasts exists for shorter reporting periods, which further supports the argument that meeting or beating cash flow forecasts provides value-relevant information to market participants. With regard to the association between meeting or beating cash flow forecasts and future performance, regression results show that subsequent-period operating performance is higher for firms meeting or beating cash flow forecasts than for firms missing cash flow forecasts. This finding suggests that meeting or beating cash flow forecasts conveys positive information about future performance, which can explain the positive market response to this event. In relation to the use of real activities management by firms meeting or beating cash flow forecasts, empirical results show that the market responds positively to firms meeting or beating cash flow forecasts using real activities management. The results further suggest that using real activities management does not negatively affect the market response to meeting or beating cash flow forecasts. These results are consistent with the market’s inability to detect real activities management, even in contexts where companies have a strong motivation to engage in such decisions, like the cash flow forecasts context. This may explain why companies that meet or beat cash flow forecasts opt to use these methods. Furthermore, empirical findings show that the subsequent quarter’s operating performance is lower for firms using real activities management to meet or beat cash flow forecasts. However, the results from the future returns models do not provide conclusive evidence that these returns reflect the negative future performance implications of real activities management. This implies that market participants are unable to process the future negative implications of, or reverse the contemporaneous response to, the use of real activities management to meet or beat cash flow forecasts, even after these implications have been realized.
In relation to the effect of contextual factors on the market response to meeting or beating cash flow forecasts, empirical findings show that changes in institutional settings in the aftermath of accounting scandals increased the market response to meeting or beating cash flow forecasts, while other contexts such as capital intensity, accruals, and financial health do not have significant effects on that market response. The increase in the market response reflects a higher emphasis by market participants on cash flow information after the enactment of regulatory changes following high-profile bankruptcies in the early 2000s. Finally, in relation to the effect of contextual factors on the difference in market response to meeting or beating cash flow forecasts and accruals forecasts, regression results suggest that the market response to meeting or beating cash flow forecasts exceeds that to meeting or beating accruals forecasts. The difference is higher in settings where cash flow information is perceived to be more useful than accruals information. This result is consistent with prior research findings that the relative advantage of cash flows over accruals is context specific. These findings have implications for market rationality, the role of context in the use of accounting information, and the consequences of corporate decisions, and they contribute to various streams of the accounting literature, including the analysts’ forecasts, earnings management, capital market and information content literatures.
  • Item
    Improving scheduling by learning
    Schutt, Andreas ( 2011)
    Scheduling problems appear in many industrial settings, with different facets and requirements for a solution. A solution is a schedule of a set of activities subject to constraints such as precedence relations and resource restrictions on the maximum number of concurrent activities. This dissertation presents new deductive techniques for precedence and cumulative resource constraints in constraint programming, to improve solution processes for combinatorial scheduling problems in particular and combinatorial problems in general. Due to their intractability, many schedulers either solve a simplified problem or are tailored to a specific problem, but are inflexible with respect to changes in the problem statement. Constraint Programming (CP) offers a powerful framework for combinatorial problems which also meets the demand for flexibility with respect to changes in the problem statement, owing to a strict separation of problem modelling, search algorithms, and highly specialised deduction techniques. Moreover, recent advanced CP solvers such as lazy clause generation (LCG) additionally include sophisticated conflict learning technologies. Their efficiency depends, amongst other things, on reusable explanations formulated by deductive algorithms. Unit two variable per inequality (UTVPI) constraints are one of the largest classes of integer constraints that is solvable in polynomial time. These constraints can model precedence relations between activities. A new theoretical result about reasoning over UTVPI systems is shown; based on it, new incremental deductive algorithms are created for manipulating UTVPI constraints, including satisfiability, implication, and explanation. These algorithms are asymptotically faster on sparse UTVPI systems than current algorithms. Cumulative constraints enforce resource restrictions in scheduling problems. We show how, by adding explanation to the cumulative constraint in an LCG solver, we can solve resource-constrained project scheduling problems far faster than other comparable methods. Furthermore, a complete LCG-based approach including cumulative constraints is developed for an industrial carpet cutting problem. This approach outperforms the current incomplete method on all industrial instances provided.
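As background to the cumulative constraint discussed above, the Python sketch below checks a fixed schedule against a single cumulative resource by summing the demands of the activities running at each time point. This is only a time-decomposition view of what the constraint enforces, not the explaining LCG propagator developed in the dissertation; the activity data are illustrative.

```python
def first_overload(starts, durations, usages, capacity):
    """Return the first time point at which total resource usage exceeds capacity,
    or None if the schedule satisfies the cumulative constraint."""
    horizon = max(s + d for s, d in zip(starts, durations))
    for t in range(horizon):
        load = sum(u for s, d, u in zip(starts, durations, usages) if s <= t < s + d)
        if load > capacity:
            return t
    return None

# Four activities with start times, durations and resource requirements,
# sharing a single resource of capacity 3 (illustrative data).
starts    = [0, 0, 2, 3]
durations = [3, 2, 2, 2]
usages    = [2, 1, 2, 1]
print(first_overload(starts, durations, usages, capacity=3))  # 2: load is 4 at time 2
```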
  • Item
    Improving combinatorial optimization
    Chu, Geoffrey G. ( 2011)
    Combinatorial Optimization is an important area of computer science that has many theoretical and practical applications. In this thesis, we present important contributions to several different areas of combinatorial optimization, including nogood learning, symmetry breaking, dominance, relaxations and parallelization. We develop a new nogood learning technique based on constraint projection that allows us to exploit subproblem dominances that arise when two different search paths lead to subproblems which are identical on the remaining unlabeled variables. On appropriate problems, this nogood learning technique provides orders of magnitude speedup compared to a base solver which does not learn nogoods. We present a new symmetry breaking technique called SBDS-1UIP, which is an extension of Symmetry Breaking During Search (SBDS). SBDS-1UIP uses symmetric versions of the 1UIP nogoods derived by Lazy Clause Generation solvers to prune symmetric parts of the search space. We show that SBDS-1UIP can exploit at least as many symmetries as SBDS, and that it is strictly more powerful on some problems, allowing us to exploit types of symmetries that no previous general symmetry breaking technique is capable of exploiting. We present two new general methods for exploiting almost symmetries (symmetries which are broken by a small number of constraints). The first is to treat almost symmetries as conditional symmetries and exploit them via conditional symmetry breaking constraints. The second is to modify SBDS-1UIP to handle almost symmetries. Both techniques are capable of producing exponential speedups on appropriate problems. We examine three reasonably well known problems: the Minimization of Open Stacks Problem, the Talent Scheduling Problem (CSPLib prob039), and the Maximum Density Still Life Problem (CSPLib prob032). By applying various powerful techniques such as nogood learning, dynamic programming, dominance and relaxations, we are able to solve these problems many orders of magnitude faster than the previous state of the art. We identify cache contention as a serious bottleneck that severely limits the amount of speedup achievable when parallelizing SAT and LCG solvers on multi-core systems. We alter the data structures used in the state-of-the-art SAT solver MiniSat to be more cache aware, leading to a sequential SAT solver that is some 80% faster on average and a parallel SAT solver that is some 140% faster on average. We examine an important issue in parallel search that has not been properly addressed in the literature. Many work stealing schemes used for load balancing completely ignore the branching heuristic, and instead try to maximize the granularity of the work stolen. This can result in many of the processors searching unfruitful parts of the search tree. We analyze this problem and develop a new work stealing scheme called confidence-based work stealing, which partitions the work based on how strong the branching heuristic is at each node. The new parallel algorithm produced near-linear or super-linear speedup on all the problems we tested.
  • Item
    Software debugging using program spectra
    Lee, Hua Jie ( 2011)
    This thesis focuses on debugging using program spectra. Program spectra captures the dynamic behaviour of a program indicating which program statements are executed by respective test cases, which include pass and fail cases. By using this information, we use functions to rank all statements to locate bugs. Statements ranked top of the ranking are more likely to be buggy. We refer to these functions as spectra metrics. In a traditional debugging task, the programmer often has to examine program execution step-by-step within a block or function of program code that is most likely to be buggy. Using program spectra information can help the programmer to narrow down to those program statements that are most likely to be buggy. The thesis contributes to the theoretical understanding of debugging single bug programs using program spectra. We propose several spectra metrics and also review other metrics suggested for bug localization. Some of the metrics that have been previously employed in other domain areas such as biological science have been adapted for the debugging area and their effectiveness have been evaluated. We propose several methods to help improve the precision of bug localization. We show by employing more information such as frequency execution and coverage of test cases can improve bug localization performance significantly. Our extensive evaluation on several benchmarks, namely Siemens Test Suite, subset of Unix Test Suite, Concordance, and Space indicate that our proposed spectra metrics are effective in improving bug localization performance. This thesis work advances the state-of-the-art of bug localization and consequently has great potential to improve the effectiveness of debugging software.