Computing and Information Systems - Research Publications

  • Item
    Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study
    Suster, S ; Baldwin, T ; Lau, JH ; Yepes, AJ ; Iraola, DM ; Otmakhova, Y ; Verspoor, K (JMIR PUBLICATIONS, INC, 2023-03-13)
    BACKGROUND: Assessment of the quality of medical evidence available on the web is a critical step in the preparation of systematic reviews. Existing tools that automate parts of this task validate the quality of individual studies but not of entire bodies of evidence and focus on a restricted set of quality criteria.
    OBJECTIVE: We proposed a quality assessment task that provides an overall quality rating for each body of evidence (BoE), as well as finer-grained justification for different quality criteria according to the Grading of Recommendation, Assessment, Development, and Evaluation formalization framework. For this purpose, we constructed a new data set and developed a machine learning baseline system (EvidenceGRADEr).
    METHODS: We algorithmically extracted quality-related data from all summaries of findings found in the Cochrane Database of Systematic Reviews. Each BoE was defined by a set of population, intervention, comparison, and outcome criteria and assigned a quality grade (high, moderate, low, or very low) together with quality criteria (justification) that influenced that decision. Different statistical data, metadata about the review, and parts of the review text were extracted as support for grading each BoE. After pruning the resulting data set with various quality checks, we used it to train several neural-model variants. The predictions were compared against the labels originally assigned by the authors of the systematic reviews.
    RESULTS: Our quality assessment data set, Cochrane Database of Systematic Reviews Quality of Evidence, contains 13,440 instances, or BoEs labeled for quality, originating from 2252 systematic reviews published on the internet from 2002 to 2020. On the basis of a 10-fold cross-validation, the best neural binary classifiers for quality criteria detected risk of bias at 0.78 F1 (P=.68; R=0.92) and imprecision at 0.75 F1 (P=.66; R=0.86), while the performance on inconsistency, indirectness, and publication bias criteria was lower (F1 in the range of 0.3-0.4). The prediction of the overall quality grade into 1 of the 4 levels resulted in 0.5 F1. When casting the task as a binary problem by merging the Grading of Recommendation, Assessment, Development, and Evaluation classes (high+moderate vs low+very low-quality evidence), we attained 0.74 F1. We also found that the results varied depending on the supporting information that is provided as an input to the models.
    CONCLUSIONS: Different factors affect the quality of evidence in the context of systematic reviews of medical evidence. Some of these (risk of bias and imprecision) can be automated with reasonable accuracy. Other quality dimensions such as indirectness, inconsistency, and publication bias prove more challenging for machine learning, largely because they are much rarer. This technology could substantially reduce reviewer workload in the future and expedite quality assessment as part of evidence synthesis.
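    The binary reformulation mentioned in the results (merging high+moderate vs low+very low evidence) amounts to a simple label-collapsing step before scoring. A minimal illustrative sketch in Python, with hypothetical labels and not the EvidenceGRADEr code itself:

        # Illustrative only: collapse the four GRADE levels into the binary setting
        # described in the abstract, then score with F1 (labels below are hypothetical).
        from sklearn.metrics import f1_score

        def to_binary(grade):
            # high/moderate -> 1, low/very low -> 0
            return 1 if grade in {"high", "moderate"} else 0

        gold = ["high", "moderate", "low", "very low", "moderate"]
        pred = ["high", "low", "low", "very low", "high"]
        print(f1_score([to_binary(g) for g in gold], [to_binary(p) for p in pred]))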
  • Item
    Cloze Evaluation for Deeper Understanding of Commonsense Stories in Indonesian
    Koto, F ; Baldwin, T ; Lau, JH (ASSOC COMPUTATIONAL LINGUISTICS-ACL, 2022-01-01)
  • Item
    One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
    Aji, AF ; Winata, GI ; Koto, F ; Cahyawijaya, S ; Romadhony, A ; Mahendra, R ; Kurniawan, K ; Moeljadi, D ; Prasojo, RE ; Baldwin, T ; Lau, JH ; Ruder, S (ASSOC COMPUTATIONAL LINGUISTICS-ACL, 2022)
  • Item
    The patient is more dead than alive: exploring the current state of the multi-document summarization of the biomedical literature
    Otmakhova, Y ; Verspoor, K ; Baldwin, T ; Lau, JH (ASSOC COMPUTATIONAL LINGUISTICS-ACL, 2022)
  • Item
    Can Pretrained Language Models Generate Persuasive, Faithful, and Informative Ad Text for Product Descriptions?
    Koto, F ; Lau, JH ; Baldwin, T (ASSOC COMPUTATIONAL LINGUISTICS-ACL, 2022)
  • Item
    FFCI: A Framework for Interpretable Automatic Evaluation of Summarization
    Koto, F ; Baldwin, T ; Lau, JH (AI ACCESS FOUNDATION, 2022)
    In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences). We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each of the four dimensions of FFCI based on cross-comparison of evaluation metrics and model-based evaluation methods, including question answering (QA) approaches, semantic textual similarity (STS), next-sentence prediction (NSP), and scores derived from 19 pre-trained language models. We then apply the developed metrics in evaluating a broad range of summarization models across two datasets, with some surprising findings.
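    As a rough illustration of how a precision-oriented focus score and a recall-oriented coverage score can be computed against a reference, the sketch below matches sentences with an off-the-shelf sentence-similarity model; the model name and scoring scheme are assumptions for illustration, not the exact FFCI metrics:

        # Hedged sketch: approximate focus/coverage by matching each summary sentence
        # to its most similar reference sentence (and vice versa) with an STS model.
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed STS model

        def focus_and_coverage(summary_sents, reference_sents):
            s_emb = model.encode(summary_sents, convert_to_tensor=True)
            r_emb = model.encode(reference_sents, convert_to_tensor=True)
            sim = util.cos_sim(s_emb, r_emb)                 # summary x reference similarities
            focus = sim.max(dim=1).values.mean().item()      # precision-like: summary vs reference
            coverage = sim.max(dim=0).values.mean().item()   # recall-like: reference vs summary
            return focus, coverage

        print(focus_and_coverage(["The drug reduced mortality."],
                                 ["Mortality was reduced by the drug.", "Side effects were mild."]))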
  • Item
    INDOBERTWEET: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization
    Koto, F ; Lau, JH ; Baldwin, T (Association for Computational Linguistics, 2021-01-01)
    We present INDOBERTWEET, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than previously proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.
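    The vocabulary-initialization idea described above (averaging the subword embeddings of each new word type) can be sketched with the Hugging Face transformers API as follows; the checkpoint name and the new token list are placeholders, not the exact IndoBERTweet setup:

        # Sketch: initialize embeddings for new domain-specific tokens with the average
        # of their subword embeddings under the original vocabulary (placeholder names).
        import torch
        from transformers import AutoModelForMaskedLM, AutoTokenizer

        name = "indolem/indobert-base-uncased"      # placeholder Indonesian BERT checkpoint
        model = AutoModelForMaskedLM.from_pretrained(name)
        tokenizer = AutoTokenizer.from_pretrained(name)

        new_tokens = ["@USER", "HTTPURL"]           # hypothetical Twitter-specific word types
        # Subword ids of each new type under the *original* vocabulary.
        subwords = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}

        tokenizer.add_tokens(new_tokens)
        model.resize_token_embeddings(len(tokenizer))
        emb = model.get_input_embeddings().weight

        with torch.no_grad():
            for tok, ids in subwords.items():
                emb[tokenizer.convert_tokens_to_ids(tok)] = emb[ids].mean(dim=0)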
  • Item
    Top-down discourse parsing via sequence labelling
    Koto, F ; Lau, JH ; Baldwin, T (ACL, 2021-01-01)
    We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020; Zhang et al., 2020). By framing the task as a sequence labelling problem where the goal is to iteratively segment a document into individual discourse units, we are able to eliminate the decoder and reduce the search space for splitting points. We explore both traditional recurrent models and modern pre-trained transformer models for the task, and additionally introduce a novel dynamic oracle for top-down parsing. Based on the Full metric, our proposed LSTM model sets a new state-of-the-art for RST parsing.
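    The sequence-labelling framing described above can be illustrated with a toy recursive splitter: score every candidate boundary, split at the best one, and recurse until single discourse units remain. The scorer below is a stand-in for the paper's learned LSTM/transformer labellers:

        # Toy illustration of top-down splitting; the boundary scorer is a placeholder.
        def boundary_scores(units):
            # Stand-in scorer: in the paper this is a learned sequence labeller.
            return [abs(len(units[i]) - len(units[i + 1])) for i in range(len(units) - 1)]

        def parse(units):
            if len(units) == 1:
                return units[0]
            scores = boundary_scores(units)
            k = max(range(len(scores)), key=scores.__getitem__) + 1  # highest-scoring split
            return (parse(units[:k]), parse(units[k:]))

        print(parse(["EDU one", "a much longer EDU two", "EDU three", "EDU four here"]))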
  • Item
    Semi-automatic Triage of Requests for Free Legal Assistance
    Mistica, M ; Lau, JH ; Merrifield, B ; Fazio, K ; Baldwin, T (Association for Computational Linguistics, 2021)
    Free legal assistance is critically under-resourced, and many of those who seek legal help have their needs unmet. A major bottleneck in the provision of free legal assistance to those most in need is the determination of the precise nature of the legal problem. This paper describes a collaboration with a major provider of free legal assistance, and the deployment of natural language processing models to assign area-of-law categories to real-world requests for legal assistance. In particular, we focus on investigating models that generate efficiencies in the triage process, but also on the risks associated with naive use of model predictions, including fairness across different user demographics.
  • Item
    Evaluating the Efficacy of Summarization Evaluation across Languages
    Koto, F ; Lau, JH ; Baldwin, T ; Xia, F ; Zong, C ; Li, W ; Navigli, R (ACL, 2021-01-01)
    While automatic summarization evaluation methods developed for English are routinely applied to other languages, this is the first attempt to systematically quantify their panlinguistic efficacy. We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall). Based on this, we evaluate 19 summarization evaluation metrics, and find that using multilingual BERT within BERTScore performs well across all languages, at a level above that for English.
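    A minimal sketch of the multilingual BERTScore setting that performed well in this study, using the public bert_score package (assumed usage; the example strings are hypothetical, not drawn from the study corpus):

        # Assumed usage of the bert_score package with multilingual BERT.
        from bert_score import score

        cands = ["Ini adalah ringkasan yang dihasilkan."]   # hypothetical system summary
        refs = ["Ini adalah ringkasan referensi."]           # hypothetical reference summary

        P, R, F1 = score(cands, refs, model_type="bert-base-multilingual-cased")
        print(P.mean().item(), R.mean().item(), F1.mean().item())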