Computing and Information Systems - Research Publications

Search Results

Now showing 1 - 10 of 28
  • Item
    The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures
    Rashidi, L ; Zobel, J ; Moffat, A (ASSOC COMPUTING MACHINERY, 2024-01)
    Measurement of the effectiveness of search engines is often based on use of relevance judgments. It is well known that judgments can be inconsistent between judges, leading to discrepancies that potentially affect not only scores but also system relativities and confidence in the experimental outcomes. We take the perspective that the relevance judgments are an amalgam of perfect relevance assessments plus errors; making use of a model of systematic errors in binary relevance judgments that can be tuned to reflect the kind of judge that is being used, we explore the behavior of measures of effectiveness as error is introduced. Using a novel methodology in which we examine the distribution of “true” effectiveness measurements that could be underlying measurements based on sets of judgments that include error, we find that even moderate amounts of error can lead to conclusions such as orderings of systems that statistical tests report as significant but are nonetheless incorrect. Further, in these results the widely used recall-based measures AP and NDCG are notably more fragile in the presence of judgment error than is the utility-based measure RBP, but all the measures failed under even moderate error rates. We conclude that knowledge of likely error rates in judgments is critical to interpretation of experimental outcomes.
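    To make the methodology above concrete, the following minimal Python sketch perturbs binary relevance judgments and compares RBP and AP on the same ranking before and after the perturbation. It is only an illustration: the independent false-negative/false-positive flips, the toy qrels, and the toy ranking are assumptions for the example, not the paper's systematic error model, although RBP and AP themselves are computed in the standard way.

        # Illustrative sketch (not the paper's exact error model): perturb binary
        # relevance judgments with judge-specific error rates, then compare scores.
        import random

        def perturb(qrels, p_false_neg, p_false_pos, rng):
            """Flip each binary judgment independently; the two rates can be tuned
            to mimic different kinds of judges (e.g. strict vs. lenient)."""
            noisy = {}
            for doc, rel in qrels.items():
                if rel == 1:
                    noisy[doc] = 0 if rng.random() < p_false_neg else 1
                else:
                    noisy[doc] = 1 if rng.random() < p_false_pos else 0
            return noisy

        def rbp(ranking, qrels, phi=0.8):
            """Rank-biased precision: utility-based, top-weighted by persistence phi."""
            return (1 - phi) * sum(qrels.get(d, 0) * phi ** i for i, d in enumerate(ranking))

        def average_precision(ranking, qrels):
            """Recall-based AP over the judged relevant set."""
            total_rel = sum(qrels.values())
            if total_rel == 0:
                return 0.0
            hits, score = 0, 0.0
            for i, d in enumerate(ranking, start=1):
                if qrels.get(d, 0) == 1:
                    hits += 1
                    score += hits / i
            return score / total_rel

        rng = random.Random(1)
        qrels = {f"d{i}": (1 if i % 3 == 0 else 0) for i in range(30)}   # toy judgments
        ranking = [f"d{i}" for i in range(30)]                           # toy system run
        noisy = perturb(qrels, p_false_neg=0.2, p_false_pos=0.05, rng=rng)
        print("RBP clean vs noisy:", rbp(ranking, qrels), rbp(ranking, noisy))
        print("AP  clean vs noisy:", average_precision(ranking, qrels),
              average_precision(ranking, noisy))

    Running the comparison over many random seeds gives a feel for how much the measured scores, and hence system orderings, can move under a given error rate.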
  • Item
    Tradeoff Options for Bipartite Graph Partitioning
    Mackenzie, J ; Petri, M ; Moffat, A (IEEE COMPUTER SOC, 2023-08-01)
  • Item
    Anytime Ranking on Document-Ordered Indexes
    Mackenzie, J ; Petri, M ; Moffat, A (ASSOC COMPUTING MACHINERY, 2022-01)
    Inverted indexes continue to be a mainstay of text search engines, allowing efficient querying of large document collections. While there are a number of possible organizations, document-ordered indexes are the most common, since they are amenable to various query types, support index updates, and allow for efficient dynamic pruning operations. One disadvantage with document-ordered indexes is that high-scoring documents can be distributed across the document identifier space, meaning that index traversal algorithms that terminate early might put search effectiveness at risk. The alternative is impact-ordered indexes, which primarily support top-k disjunctions but also allow for anytime query processing, where the search can be terminated at any time, with search quality improving as processing latency increases. Anytime query processing can be used to effectively reduce high-percentile tail latency, which is essential for operational scenarios in which a service level agreement (SLA) imposes response time requirements. In this work, we show how document-ordered indexes can be organized such that they can be queried in an anytime fashion, enabling strict latency control with effective early termination. Our experiments show that processing document-ordered topical segments selected by a simple score estimator outperforms existing anytime algorithms, and allows query runtimes to be accurately limited to comply with SLA requirements.
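    The following simplified Python sketch illustrates the anytime idea described above: visit index segments in decreasing order of an estimated gain, maintain a top-k heap, and stop at a segment boundary once the latency budget is spent. The segment list, the gain estimates, and the score_doc function are hypothetical stand-ins; the paper's actual segment construction and score estimator are not reproduced here.

        # Simplified sketch of anytime processing (not the authors' implementation).
        import heapq, time

        def anytime_topk(segments, score_doc, k, budget_ms):
            """segments: list of (estimated_gain, [doc_ids]); estimator is hypothetical."""
            deadline = time.perf_counter() + budget_ms / 1000.0
            heap = []                                     # min-heap of (score, doc)
            for _, docs in sorted(segments, key=lambda s: -s[0]):
                for doc in docs:                          # document-ordered within a segment
                    s = score_doc(doc)
                    if len(heap) < k:
                        heapq.heappush(heap, (s, doc))
                    elif s > heap[0][0]:
                        heapq.heapreplace(heap, (s, doc))
                if time.perf_counter() >= deadline:       # anytime cutoff at segment boundary
                    break
            return sorted(heap, reverse=True)

        # toy usage: three segments with rough gain estimates and a dummy scorer
        segments = [(0.9, [3, 7, 11]), (0.5, [2, 5, 19]), (0.2, [1, 13, 17])]
        print(anytime_topk(segments, score_doc=lambda d: 1.0 / (1 + d), k=2, budget_ms=5))

    Because the cutoff is checked at segment boundaries and the most promising segments are visited first, the answer available at any interruption point is already biased toward high-scoring documents.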
  • Item
    Efficient immediate-access dynamic indexing
    Moffat, A ; Mackenzie, J (ELSEVIER SCI LTD, 2023-01-09)
    In a dynamic retrieval system, documents must be ingested as they arrive, and be immediately findable by queries. Our purpose in this paper is to describe an index structure and processing regime that accommodates that requirement for immediate access, seeking to make the ingestion process as streamlined as possible, while at the same time seeking to make the growing index as small as possible, and seeking to make term-based querying via the index as efficient as possible. We describe a new compression operation and a novel approach to extensible lists which together facilitate that triple goal. In particular, the structure we describe provides incremental document-level indexing using as little as two bytes per posting and only a small amount more for word-level indexing; provides fast document insertion; supports immediate and continuous queryability; provides support for fast conjunctive queries and similarity score-based ranked queries; and facilitates fast conversion of the dynamic index to a “normal” static compressed inverted index structure. Measurement of our new mechanism confirms that in-memory dynamic document-level indexes for collections into the gigabyte range can be constructed at a rate of two gigabytes/minute using a typical server architecture, that multi-term conjunctive Boolean queries can be resolved in just a few milliseconds each on average even while new documents are being concurrently ingested, and that the net memory space required for all of the required data structures amounts to an average of as little as two bytes per stored posting.
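    As a minimal sketch of what "immediate queryability" means in practice, the Python class below ingests documents one at a time and answers conjunctive Boolean queries at any point, with no rebuild step. It deliberately ignores the paper's contributions, namely the compressed extensible-list layout and the roughly two-bytes-per-posting encoding; postings here are plain Python lists, and the class name and term handling are assumptions for the example.

        # Minimal sketch of an immediately-queryable dynamic index (uncompressed).
        class DynamicIndex:
            def __init__(self):
                self.postings = {}          # term -> ascending list of doc ids
                self.next_doc = 0

            def ingest(self, text):
                """Assign the next doc id and extend postings; queries see it at once."""
                doc = self.next_doc
                self.next_doc += 1
                for term in set(text.lower().split()):
                    self.postings.setdefault(term, []).append(doc)   # ids arrive in order
                return doc

            def conjunction(self, terms):
                """Multi-term Boolean AND, intersecting the shortest lists first."""
                lists = sorted((self.postings.get(t, []) for t in terms), key=len)
                if not lists:
                    return []
                result = set(lists[0])
                for plist in lists[1:]:
                    result &= set(plist)
                return sorted(result)

        idx = DynamicIndex()
        idx.ingest("compressed inverted index structures")
        idx.ingest("dynamic index ingestion and querying")
        print(idx.conjunction(["index", "dynamic"]))   # -> [1]

    The paper's structure achieves the same always-queryable behavior while keeping the growing postings lists compressed and cheap to extend, which is where the engineering difficulty actually lies.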
  • Item
    Efficient query processing techniques for next-page retrieval
    Mackenzie, J ; Petri, M ; Moffat, A (SPRINGER, 2022-03)
    In top-k ranked retrieval the goal is to efficiently compute an ordered list of the highest scoring k documents according to some stipulated similarity function such as the well-known BM25 approach. In most implementation techniques a min-heap of size k is used to track the top scoring candidates. In this work we consider the question of how best to retrieve the second page of search results, given that a first page has already been computed; that is, identification of the documents at ranks k+1 to 2k for some query. Our goal is to understand what information is available as a by-product of the first-page scoring, and how it can be employed to accelerate the second-page computation, assuming that the second page of results is required for only a fraction of the query load. We propose a range of simple, yet efficient, next-page retrieval techniques which are suitable for accelerating Document-at-a-Time mechanisms, and demonstrate their performance on three large text collections.
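    One simple baseline in this space, sketched below in Python, is to keep a heap of 2k candidates while scoring the first page, so that ranks k+1 to 2k fall out as a by-product if the user ever asks for page two. This is only an illustration of the general trade-off (a slightly larger heap on every query versus a second traversal on a few queries), not a reproduction of the paper's proposed techniques; the function name and the toy scorer are assumptions.

        # Illustrative next-page baseline: over-provision the heap on the first pass.
        import heapq

        def first_page_with_spare(scored_docs, k):
            """scored_docs: iterable of (score, doc). Returns (page1, spare_for_page2)."""
            heap = []                                   # min-heap, capped at 2k entries
            for score, doc in scored_docs:
                if len(heap) < 2 * k:
                    heapq.heappush(heap, (score, doc))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc))
            ranked = sorted(heap, reverse=True)
            return ranked[:k], ranked[k:2 * k]

        scores = [(1.0 / (1 + d), f"d{d}") for d in range(50)]   # toy scored postings
        page1, page2 = first_page_with_spare(scores, k=5)
        print([doc for _, doc in page1])
        print([doc for _, doc in page2])    # served without re-traversing the index

    The larger heap slightly slows every first-page query, which is why the paper's assumption that only a fraction of queries request a second page matters when choosing among such techniques.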
  • Item
    Entropic relevance: A mechanism for measuring stochastic process models discovered from event data
    Alkhammash, H ; Polyvyanyy, A ; Moffat, A ; García-Bañuelos, L (Elsevier BV, 2022)
    There are many fields of computing in which having access to large volumes of data allows very precise models to be developed. For example, machine learning employs a range of algorithms that deliver important insights based on analysis of data resources. Similarly, process mining develops algorithms that use event data induced by real-world processes to support the modeling of – and hence understanding and long-term improvement of – those processes. In process mining, the quality of the learned process models is assessed using conformance checking techniques, which measure how well the models represent and generalize the data. This article presents the entropic relevance measure for conformance checking of stochastic process models, which are models that also provide information in regard to the likelihood of observing each sequence of observed events. Accurate stochastic conformance measurement allows identification of models that describe the data better, including the captured sequences of process events and their frequencies, with information about the likelihood of the described processes being an essential step toward simulating and forecasting future processes. Entropic relevance represents a blend between the traditional precision and recall quality criteria in conformance checking, in that it both penalizes observed processes that the model does not describe, and also penalizes processes that are permitted by the model yet were not observed. Entropic relevance can be computed in time linear in the size of the input data; and measures a fundamentally different phenomenon than other existing measures. Our evaluation over industrial datasets confirms the feasibility of using the measure in practice.
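    The following Python fragment is a rough, simplified illustration of the coding idea behind a measure of this kind, not the paper's exact definition of entropic relevance (details such as the background coding of unmodelled traces are simplified away): each observed trace is charged the bits needed under the stochastic model if the model captures it, and a crude fallback cost over the activity alphabet if it does not, with the costs averaged over the event log.

        # Rough illustration only; see the paper for the actual entropic relevance definition.
        import math

        def illustrative_relevance(log_traces, model_prob, alphabet_size):
            """log_traces: list of traces (tuples of activity names).
            model_prob:  dict mapping a trace to the probability the model assigns it."""
            total_bits = 0.0
            for trace in log_traces:
                p = model_prob.get(trace, 0.0)
                if p > 0.0:
                    total_bits += -math.log2(p)                                  # captured trace
                else:
                    total_bits += (len(trace) + 1) * math.log2(alphabet_size + 1)  # fallback cost
            return total_bits / len(log_traces)                                  # avg bits per trace

        log = [("a", "b", "c")] * 8 + [("a", "c")] * 2
        model = {("a", "b", "c"): 0.9, ("a", "b"): 0.1}          # toy stochastic model
        print(illustrative_relevance(log, model, alphabet_size=3))

    Even in this toy version both sides of the blend are visible: traces the model fails to describe pay the fallback cost (the recall side), while probability mass the model spends on traces that never occur in the log leaves less for the observed traces and so raises their bit cost (the precision side).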
  • Item
    Compact inverted index storage using general-purpose compression libraries
    Petri, M ; Moffat, A (WILEY, 2018-04)
    Efficient storage of large inverted indexes is one of the key technologies that support current web search services. Here we re-examine mechanisms for representing document-level inverted indexes and within-document term frequencies, including comparing specialized methods developed for this task against recent fast implementations of general-purpose adaptive compression techniques. Experiments with the Gov2-URL collection and a large collection of crawled news stories show that standard compression libraries can provide compression effectiveness as good as or better than previous methods, with decoding rates only moderately slower than reference implementations of those tailored approaches. This surprising outcome means that high-performance index compression can be achieved without requiring the use of specialized implementations.
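    The Python sketch below shows the general shape of this approach: delta-encode a document-ordered postings list, serialize the gaps with a simple variable-byte code, and hand the byte stream to a general-purpose compressor. The zlib module is used only because it ships with Python; it stands in for whatever general-purpose adaptive compression library is preferred, and the toy postings list is an assumption for the example.

        # Minimal sketch: gaps + variable-byte coding + a general-purpose compressor.
        import zlib

        def vbyte(gaps):
            """Variable-byte encode a list of positive integers."""
            out = bytearray()
            for g in gaps:
                while g >= 128:
                    out.append(g & 0x7F)
                    g >>= 7
                out.append(g | 0x80)          # high bit marks the final byte of each value
            return bytes(out)

        doc_ids = sorted({7 * i + (i % 5) for i in range(1, 2000)})          # toy postings list
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]  # delta encoding
        raw = vbyte(gaps)
        packed = zlib.compress(raw, level=9)
        print(len(doc_ids), "postings:", len(raw), "bytes vbyte ->", len(packed), "bytes compressed")

    The adaptive second stage exploits the skewed gap distribution that a specialized postings codec would otherwise model directly, which is why off-the-shelf libraries can be competitive on this data.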
  • Item
    Efficient distributed selective search
    Kim, Y ; Callan, J ; Culpepper, JS ; Moffat, A (SPRINGER, 2017-06)