Computing and Information Systems - Theses

Experimental evaluation of information retrieval systems
Ravana, Sri Devi (2011)
Comparative evaluations of information retrieval systems using test collections rest on a number of key premises: that representative topic sets can be created, that suitable relevance judgments can be generated, and that systems can be sensibly compared based on their retrieval effectiveness over the selected topics. The performance over each topic is measured using a chosen evaluation metric, and in recent years statistical analysis has been applied to further assess the significance of such measurements. This thesis makes several contributions to this experimental methodology by addressing realistic issues faced by researchers who use test collections to evaluate information retrieval systems.

First, it is common for the performance of a system on a set of topics to be represented by a single overall score such as the average. We therefore explore score aggregation techniques, including the arithmetic mean, the geometric mean, the harmonic mean, and the median. We show that an adjusted geometric mean provides more consistent system rankings than the arithmetic mean when a significant fraction of the individual topic scores are close to zero, and that score standardization achieves the same outcome in a more consistent manner. We suggest that care in selecting the measure of central tendency and the evaluation metric used to represent system effectiveness can yield more consistent system rankings than were previously obtainable.

Our second contribution concerns efforts to reduce experimental cost, which rises as document corpora grow and the number of relevance judgments required for reliable system evaluation increases. We introduce smoothing of retrieval effectiveness scores using prior knowledge of system performance, to raise the quality of evaluations without incurring additional experimental cost. Smoothing balances results from prior incomplete query sets against limited additional complete information, in order to obtain more refined system orderings than would be possible on the new queries alone.

Finally, it is common for only partial relevance judgments to be used when comparing retrieval system effectiveness, in order to control experimental cost. We consider the uncertainty introduced into per-topic effectiveness scores by pooled judgments, and measure the effect that incomplete evidence has both on the system scores that are generated and on the quality of paired system comparisons. We propose score estimation methods to handle the uncertainty introduced into effectiveness scores in the face of missing relevance judgments.
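
To make the score aggregation alternatives named in the first contribution concrete, the following is a minimal sketch, assuming hypothetical per-topic average-precision scores, a small epsilon floor for the geometric and harmonic means, and a z-score form of standardization across systems. The constant, the sample data, and all system names are illustrative assumptions, not the thesis's own implementation.

```python
# Illustrative sketch only (not the thesis code): aggregating per-topic
# effectiveness scores for several retrieval systems. The EPSILON floor,
# the z-score standardization, and the sample data are assumptions made
# for this example.
import math
import statistics

EPSILON = 1e-5  # assumed floor so zero-valued topic scores do not dominate


def arithmetic_mean(scores):
    return sum(scores) / len(scores)


def adjusted_geometric_mean(scores):
    # Geometric mean over scores floored at EPSILON (a GMAP-style adjustment).
    floored = [max(s, EPSILON) for s in scores]
    return math.exp(sum(math.log(s) for s in floored) / len(floored))


def harmonic_mean(scores):
    floored = [max(s, EPSILON) for s in scores]
    return len(floored) / sum(1.0 / s for s in floored)


def standardize(per_topic_scores):
    # Z-score each topic column across systems so that every topic
    # contributes comparably to a system's overall score.
    systems = list(per_topic_scores)
    n_topics = len(per_topic_scores[systems[0]])
    result = {s: [] for s in systems}
    for t in range(n_topics):
        column = [per_topic_scores[s][t] for s in systems]
        mu = statistics.mean(column)
        sigma = statistics.pstdev(column) or 1.0  # guard against zero spread
        for s in systems:
            result[s].append((per_topic_scores[s][t] - mu) / sigma)
    return result


if __name__ == "__main__":
    # Hypothetical per-topic average-precision scores for three systems.
    runs = {
        "sysA": [0.42, 0.05, 0.61, 0.00, 0.33],
        "sysB": [0.38, 0.12, 0.55, 0.02, 0.40],
        "sysC": [0.50, 0.01, 0.47, 0.00, 0.29],
    }
    for name, scores in runs.items():
        print(name,
              round(arithmetic_mean(scores), 4),          # MAP-style average
              round(adjusted_geometric_mean(scores), 4),  # GMAP-style average
              round(harmonic_mean(scores), 4),
              round(statistics.median(scores), 4))
    for name, scores in standardize(runs).items():
        print(name, "mean standardized score:", round(arithmetic_mean(scores), 4))
```

The disagreement between the arithmetic and adjusted geometric means on topics with near-zero scores is the kind of ranking instability the abstract describes; the standardized scores illustrate the alternative it reports as yielding more consistent system rankings.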