Computing and Information Systems - Theses

Search Results

Now showing 1 - 3 of 3
  • Item
    Boolean and ranked information retrieval for biomedical systematic reviewing
    POHL, STEFAN (2012)
    Evidence-based medicine seeks to base clinical decisions on the best currently available scientific evidence and is becoming accepted practice. A key role is played by systematic reviews, which synthesize the biomedical literature and rely on different information retrieval methods to identify a comprehensive set of relevant studies. With Boolean retrieval, the primary retrieval method in this application domain, relevant documents are often excluded from consideration. Ranked retrieval methods are able to mitigate this problem, but current approaches are either not applicable or do not perform as well as the Boolean method. In this thesis, a ranked retrieval model is identified that is both applicable to systematic review search and effective. The p-norm approach to extended Boolean retrieval, which generalizes the Boolean model but also, to some extent, introduces ranking, is found to be particularly promising: it identifies a greater fraction of relevant studies when typical numbers of documents are reviewed, and it also possesses properties that are important during the query formulation phase and for the overall retrieval process. Moreover, efficient query processing methods available for ranked keyword retrieval models are adapted to extended Boolean models. The query processing methods presented in this thesis yield speed-ups by factors of 2 to 9, making this retrieval model an attractive choice in practice. Finally, in support of the retrieval process during the subsequent update of systematic reviews, a query optimization method is devised that makes use of knowledge about the properties of relevant and irrelevant studies to boost the effectiveness of the search process.
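    As a reading aid for the p-norm approach named in this abstract, the sketch below scores a document against soft AND and OR operators in the classic (unweighted) p-norm formulation of extended Boolean retrieval. It is a minimal illustration under assumed example weights, not the thesis's implementation, and the function names are invented.

        def p_norm_or(weights, p):
            """p-norm OR over per-term document weights in [0, 1]."""
            n = len(weights)
            return (sum(w ** p for w in weights) / n) ** (1.0 / p)

        def p_norm_and(weights, p):
            """p-norm AND; penalises terms the document lacks."""
            n = len(weights)
            return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

        # Hypothetical document matching two of three query terms.
        weights = [1.0, 1.0, 0.0]
        for p in (1, 2, 10):
            print(p, round(p_norm_and(weights, p), 3), round(p_norm_or(weights, p), 3))
        # At p = 1 both operators reduce to the plain average of the weights;
        # as p grows, AND approaches the strict Boolean minimum and OR the
        # strict Boolean maximum, which is how the model generalizes Boolean
        # retrieval while still producing a ranking.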
  • Item
    Experimental evaluation of information retrieval systems
    RAVANA, SRI DEVI (2011)
    Comparative evaluations of information retrieval systems using test collections are based on a number of key premises, including that representative topic sets can be created, that suitable relevance judgments can be generated, and that systems can be sensibly compared based on their retrieval correctness over the selected topics. The performance on each topic is measured using a chosen evaluation metric. In recent years, statistical analysis has been applied to further assess the significance of such measurements. This thesis makes several contributions to this experimental methodology by addressing realistic issues faced by researchers using test collections to evaluate information retrieval systems. First, it is common for the performance of a system on a set of topics to be represented by a single overall score such as the average. We therefore explore score aggregation techniques including the arithmetic mean, the geometric mean, the harmonic mean, and the median. We show that an adjusted geometric mean provides more consistent system rankings than the arithmetic mean when a significant fraction of the individual topic scores are close to zero, and that score standardization achieves the same outcome in a more consistent manner. We suggest that care in selecting the measure of central tendency and the evaluation metric used to represent system effectiveness can yield more consistent system rankings than previously achieved. Our second contribution relates to efforts to reduce the experimental cost arising from growing document corpus sizes and the consequent increase in the number of relevance judgments required for reliable system evaluation. We introduce smoothing of retrieval effectiveness scores using prior knowledge of system performance, to raise the quality of evaluations without incurring additional experimental cost. Smoothing balances results from prior incomplete query sets against limited additional complete information, in order to obtain more refined system orderings than would be possible on the new queries alone. Finally, it is common for only partial relevance judgments to be used when comparing retrieval system effectiveness, in order to control experimental cost. We consider the uncertainty introduced into per-topic effectiveness scores by pooled judgments, and measure the effect that incomplete evidence has both on the system scores that are generated and on the quality of paired system comparisons. We propose score estimation methods to handle the uncertainty introduced into effectiveness scores in the face of missing relevance judgments.
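    To make the aggregation point concrete, the sketch below contrasts the arithmetic mean with an adjusted geometric mean (a log-space average with a small epsilon so that zero-valued topics remain usable) on invented per-topic scores; the epsilon value and the two hypothetical systems are assumptions for illustration, not data from the thesis.

        import math

        EPS = 1e-5  # small adjustment so topics scoring zero can be logged

        def arithmetic_mean(scores):
            return sum(scores) / len(scores)

        def adjusted_geometric_mean(scores):
            """Geometric mean over (score + EPS); emphasises poorly served topics."""
            logs = [math.log(s + EPS) for s in scores]
            return math.exp(sum(logs) / len(logs)) - EPS

        # Hypothetical systems over five topics: A relies on two easy topics,
        # B is moderate but consistent.
        sys_a = [0.90, 0.85, 0.02, 0.01, 0.02]
        sys_b = [0.40, 0.35, 0.30, 0.25, 0.30]
        for name, scores in (("A", sys_a), ("B", sys_b)):
            print(name, round(arithmetic_mean(scores), 3),
                  round(adjusted_geometric_mean(scores), 3))
        # The arithmetic mean ranks A ahead of B (0.36 vs 0.32), while the
        # adjusted geometric mean reverses the ordering (about 0.08 vs 0.32),
        # because A's near-zero topics dominate the product.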
  • Item
    Measurement in information retrieval evaluation
    Webber, William Edward (2010)
    Full-text retrieval systems employ heuristics to match documents to user queries. Retrieval correctness cannot, therefore, be formally proven, but must be evaluated through human assessment. To make evaluation automatable and repeatable, assessments of which documents are relevant to which queries are collected in advance, to form a test collection. Collection-based evaluation has been the standard in retrieval experiments for half a century, but only recently have its statistical foundations been considered. This thesis makes several contributions to the reliable and efficient measurement of the behaviour and effectiveness of information retrieval systems. First, the high variability in query difficulty makes effectiveness scores difficult to interpret, analyze, and compare. We therefore propose the standardization of scores, based on the observed results of a set of reference systems for each query. We demonstrate that standardization controls variability and enhances comparability. Second, while testing evaluation results for statistical significance has been established as standard practice, the importance of ensuring that significance can be reliably achieved for a meaningful improvement (the power of the test) is poorly understood. We introduce the use of statistical power analysis to the field of retrieval evaluation, finding that most test collections cannot reliably detect incremental improvements in performance. We also demonstrate the pitfalls in predicting score standard deviation during design-phase power analysis, and offer some pragmatic methodological suggestions. Third, in constructing a test collection, it is not feasible to assess every document for relevance to every query. The practice instead is to run a set of systems against the collection, and pool their top results for assessment. Pooling is potentially biased against systems that are neither included in nor similar to the pooled set. We propose a robust, empirical method for estimating the degree of pooling bias, through performing a leave-one-out experiment on fully pooled systems and adjusting unpooled scores accordingly. Fourth, there are many circumstances in which one wishes directly to compare the document rankings produced by different retrieval systems, independent of their effectiveness. These rankings are top-weighted, non-conjoint, and of arbitrary length, and no suitable similarity measures have been described for such rankings. We propose and analyze such a rank similarity measure, called rank-biased overlap, and demonstrate its utility on real and simulated data. Finally, we conclude the thesis with an examination of the state and function of retrieval evaluation. A survey of published results shows that there has been no measurable improvement in retrieval effectiveness over the past decade. This lack of progress has been obscured by the general use of uncompetitive baselines in published experiments, producing the appearance of substantial and statistically significant improvements for new systems without actually advancing the state of the art.
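    Rank-biased overlap is a geometrically weighted sum of prefix agreement, RBO = (1 - p) * sum over d of p^(d-1) * A_d, where A_d is the proportion of documents the two rankings share in their top-d prefixes. The sketch below computes the sum only to the evaluated depth, which lower-bounds the full infinite-tail measure; the persistence value p = 0.9, the example rankings, and the function name are illustrative assumptions, and the rankings are assumed to contain no duplicate documents.

        def rbo_prefix(run_s, run_t, p=0.9):
            """Truncated rank-biased overlap over the observed prefixes."""
            depth = min(len(run_s), len(run_t))
            seen_s, seen_t = set(), set()
            overlap, score = 0, 0.0
            for d in range(1, depth + 1):
                doc_s, doc_t = run_s[d - 1], run_t[d - 1]
                if doc_s == doc_t:
                    overlap += 1
                else:
                    # A document counts towards the overlap once it has
                    # appeared in both rankings.
                    overlap += (doc_s in seen_t) + (doc_t in seen_s)
                seen_s.add(doc_s)
                seen_t.add(doc_t)
                score += (1 - p) * (p ** (d - 1)) * (overlap / d)
            return score

        # Two rankings that agree at the top and diverge lower down; because
        # the weights decay geometrically, the early agreement dominates.
        print(round(rbo_prefix(list("abcdefg"), list("abdcxyz")), 3))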