Improving the Reliability and Robustness of Information Retrieval Evaluation
Affiliation: Computing and Information Systems
Document Type: PhD thesis
Access Status: Open Access
© 2019 Ziying Yang
Batch evaluation techniques are often used to measure and compare the performance of Information Retrieval (IR) systems. In these approaches, IR evaluation metrics score the systems' runs against a set of ground-truth knowledge represented as relevance judgments for each of a set of topics. Those system-topic scores are then compared, so that the superior system, if one exists, can be identified. Chapter 2 describes these processes in detail, defining several commonly used IR evaluation metrics and introducing a range of associated techniques for collecting relevance judgments.

Chapter 3 considers what happens when the document-scoring model creates ties; that is, when the similarity scores assigned to documents by the IR system are the same. In particular, the role of tied similarity scores in past TREC experimentation is measured, and possible strategies for handling ties in IR evaluation are introduced. Tied similarity scores may be caused by score rounding within similarity calculations, usually undertaken for efficiency. In further experiments we deliberately group documents as ties, to discover the extent to which similarity score rounding can be tolerated, allowing faster query processing without a substantial loss of effectiveness.

Chapter 4 explores the potential risk to the reproducibility of IR evaluation that might result from the use of incomplete relevance judgments, focusing on estimating the reliability of IR system effectiveness scores when evaluated by recall-based metrics. Effectiveness scores for metrics such as average precision (AP) and normalized discounted cumulative gain (NDCG) can be associated with corresponding parameter values for utility-based metrics such as rank-biased precision (RBP), for which residual scores can be computed and used to bound the uncertainty associated with unjudged documents. We found that while the uncertainty of recall-based metrics can be very high when the number of unjudged documents is large, in practical measurements the effect was smaller. Even so, we suggest that researchers report the uncertainty of system effectiveness scores via the residual of a weighted-precision metric such as RBP, in addition to carrying out statistical tests to establish metric score consistency.
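To illustrate the notion of a residual referred to above, the following short sketch (not code from the thesis) shows one way an RBP score and its residual can be computed for a single ranked list in which some documents are unjudged, using the standard RBP formulation with a persistence parameter p; the function name, parameter value, and example data are illustrative assumptions only.

```python
# Minimal sketch: rank-biased precision (RBP) and its residual for one ranked list.
# Judgments are 1 (relevant), 0 (not relevant), or None (unjudged).

def rbp_with_residual(judgments, p=0.95):
    """Return (score, residual) for a ranked list of judgments.

    The score counts only documents judged relevant; the residual is the
    maximum amount the score could still increase if every unjudged document,
    and every document below the evaluation depth, turned out to be relevant.
    """
    score = 0.0
    residual = 0.0
    for i, judgment in enumerate(judgments):   # i = 0 corresponds to rank 1
        weight = (1 - p) * p ** i              # RBP weight of rank i + 1
        if judgment is None:
            residual += weight                  # unjudged: score could grow by this much
        elif judgment > 0:
            score += weight
    residual += p ** len(judgments)             # weight of all ranks below the list's depth
    return score, residual

# Example: a run of depth 5 with two unjudged documents.
score, residual = rbp_with_residual([1, None, 0, 1, None], p=0.8)
print(f"RBP = {score:.3f}, residual = {residual:.3f}")
```

A small residual indicates that the reported score is tightly bounded despite missing judgments; a large residual signals that the unjudged documents could materially change the system comparison.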
The relevance judgments that form the foundation of all batch evaluations have conventionally been assessed by small numbers of trained experts using ordinal relevance scales with two or more relevance categories. Judgments collected on such scales often contain large numbers of ties: documents in the same category that cannot be separated by relevance. Collecting judgments on a scale with higher fidelity can help in understanding users' perceptions of relevance, and allow the functions used for mapping relevance categories to numeric gain scores to be refined. Chapter 5 considers these issues in detail, and proposes a judgment solicitation approach that uses pairwise forced-choice decisions to collect relevance judgments on three different scales (preference, absolute relevance, and relevance ratio), via a crowd-sourcing platform that provides a large number of non-specialist assessors. We investigate the variation of the normalized relevance judgments generated by the answers associated with these three methods, and compare them with three forms of previous judgments: NIST binary, Sormunen, and Magnitude Estimation. We measure the number of documents assessed, the average assessing speed, the average document length, assessing inconsistency, accuracy, and the method preferences of workers, and consider which of those factors might affect the quality of relevance assessments.

Chapter 6 brings together these three related investigations, summarizes the findings of the thesis, and describes a range of avenues for possible future work.
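As a purely hypothetical illustration of the Chapter 5 setting described above (and not necessarily the aggregation method used in the thesis), pairwise forced-choice preference judgments can be reduced to per-document scores by computing each document's win rate over the comparisons it appears in; the function and example data below are assumptions for illustration only.

```python
# Hypothetical sketch: turn pairwise (winner, loser) preference judgments into
# normalized per-document scores using each document's win rate.

from collections import defaultdict

def win_rate_scores(preferences):
    """preferences: list of (winner, loser) document-id pairs from assessors."""
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for winner, loser in preferences:
        wins[winner] += 1
        comparisons[winner] += 1
        comparisons[loser] += 1
    # Score in [0, 1]: fraction of comparisons involving the document that it won.
    return {doc: wins[doc] / comparisons[doc] for doc in comparisons}

# Example with three documents and a handful of crowd-sourced preferences.
prefs = [("d1", "d2"), ("d1", "d3"), ("d2", "d3"), ("d3", "d2"), ("d1", "d2")]
print(win_rate_scores(prefs))
```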
Keywords: Information Retrieval; IR Evaluation; Evaluation Reliability; Relevance Judgments
- Click on "Export Reference in RIS Format" and choose "open with... Endnote".
- Click on "Export Reference in RIS Format". Login to Refworks, go to References => Import References