Computing and Information Systems - Research Publications

Search Results

Now showing 1 - 10 of 15
  • Item
    Unsupervised Cross-Lingual Transfer of Structured Predictors without Source Data
    Kurniawan, K ; Frermann, L ; Schulz, P ; Cohn, T (Association for Computational Linguistics, 2022-01-01)
  • Item
    Optimising Equal Opportunity Fairness in Model Training
    Shen, A ; Han, X ; Cohn, T ; Baldwin, T ; Frermann, L (Association for Computational Linguistics, 2022)
  • Item
    ChemTables: a dataset for semantic classification on tables in chemical patents
    Zhai, Z ; Druckenbrodt, C ; Thorne, C ; Akhondi, SA ; Dat, QN ; Cohn, T ; Verspoor, K (BMC, 2021-12-11)
    Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents are often presented in tables, which can be both numerous and very large in patent documents. Tables in patents can also present various types of information, including spectroscopic and physical data, or the pharmacological uses and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in the content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorising tables by the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant to new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called CHEMTABLES, which consists of 788 chemical patent tables labelled with their content type. We introduce this dataset in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT, on CHEMTABLES. The best performing model, Table-BERT, achieves a micro-averaged F1 score of 88.66 on the table classification task. The CHEMTABLES dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. Code and models evaluated in this work are available in a GitHub repository: https://github.com/zenanz/ChemTables
  • Item
    PTST-UoM at SemEval-2021 Task 10: Parsimonious Transfer for Sequence Tagging
    Kurniawan, K ; Frermann, L ; Schulz, P ; Cohn, T (Association for Computational Linguistics, 2021)
  • Item
    Evaluating Debiasing Techniques for Intersectional Biases
    Subramanian, S ; Han, X ; Baldwin, T ; Cohn, T ; Frermann, L (Association for Computational Linguistics, 2021-01-01)
    Bias is pervasive in NLP models, motivating the development of automatic debiasing techniques. Evaluation of NLP debiasing methods has largely been limited to binary attributes in isolation, e.g., debiasing with respect to binary gender or race; however, many corpora involve multiple such attributes, possibly with higher cardinality. In this paper we argue that a truly fair model must consider 'gerrymandering' groups which comprise not only single attributes, but also intersectional groups. We evaluate a form of bias-constrained model which is new to NLP, as well as an extension of the iterative nullspace projection technique which can handle multiple protected attributes.
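    The iterative nullspace projection idea mentioned in this abstract can be illustrated with a toy sketch: repeatedly fit a linear predictor of the protected attribute from the representations, then project the representations onto that predictor's nullspace. This is only a minimal NumPy illustration, not the authors' implementation; the least-squares direction here stands in for the trained linear classifier used in the original technique.

```python
import numpy as np

def inlp(X, z, n_iters=5):
    """Toy iterative nullspace projection: remove linearly recoverable
    information about protected attribute z from representations X."""
    X = X.astype(float).copy()
    d = X.shape[1]
    P = np.eye(d)  # accumulated projection matrix
    for _ in range(n_iters):
        # Best linear direction for predicting z (stand-in for a classifier).
        w, *_ = np.linalg.lstsq(X, z.astype(float), rcond=None)
        norm = np.linalg.norm(w)
        if norm < 1e-10:
            break  # nothing left to remove
        w = w / norm
        # Project onto the nullspace of w, so w can no longer recover z.
        P_w = np.eye(d) - np.outer(w, w)
        X = X @ P_w
        P = P_w @ P
    return X, P
```

    After a few iterations, a freshly fitted linear predictor recovers far less of the attribute from the projected representations; handling multiple protected attributes (as in the paper) would repeat this for each attribute or for their intersections.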
  • Item
    PPT: Parsimonious parser transfer for unsupervised cross-lingual adaptation
    Kurniawan, K ; Frermann, L ; Schulz, P ; Cohn, T (ACL, 2021-01-01)
    Cross-lingual transfer is a leading technique for parsing low-resource languages in the absence of explicit supervision. Simple 'direct transfer' of a learned model based on a multilingual input encoding has provided a strong benchmark. This paper presents a method for unsupervised cross-lingual transfer that improves over direct transfer systems by using their output as implicit supervision as part of self-training on unlabelled text in the target language. The method assumes minimal resources and provides maximal flexibility by (a) accepting any pre-trained arc-factored dependency parser; (b) assuming no access to source language data; (c) supporting both projective and non-projective parsing; and (d) supporting multi-source transfer. With English as the source language, we show significant improvements over state-of-the-art transfer models on both distant and nearby languages, despite our conceptually simpler approach. We provide analyses of the choice of source languages for multi-source transfer, and the advantage of non-projective parsing. Our code is available online.
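    The self-training loop described in this abstract (a source model's own predictions used as implicit supervision on unlabelled target data) can be sketched in miniature. This is a hedged toy illustration with a nearest-centroid classifier standing in for the arc-factored dependency parser; none of these names come from the paper's code.

```python
import numpy as np

def fit_centroids(X, y, n_classes, fallback=None):
    """Per-class mean vectors; keep the previous centroid if a class is empty."""
    cents = np.zeros((n_classes, X.shape[1])) if fallback is None else fallback.copy()
    for c in range(n_classes):
        mask = y == c
        if mask.any():
            cents[c] = X[mask].mean(axis=0)
    return cents

def predict(centroids, X):
    """Nearest-centroid prediction (stand-in for running the parser)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def self_train(centroids, X_unlab, n_rounds=3):
    """Self-training: pseudo-label unlabelled target data with the current
    model, then refit on those pseudo-labels. No source data is needed."""
    n_classes = centroids.shape[0]
    for _ in range(n_rounds):
        pseudo = predict(centroids, X_unlab)
        centroids = fit_centroids(X_unlab, pseudo, n_classes, fallback=centroids)
    return centroids
```

    The key property mirrored here is assumption (b) from the abstract: adaptation uses only the pre-trained model and unlabelled target text, never the source-language data.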
  • Item
    Fairness-aware Class Imbalanced Learning
    Subramanian, S ; Rahimi, A ; Baldwin, T ; Cohn, T ; Frermann, L (Association for Computational Linguistics, 2021-01-01)
    Class imbalance is a common challenge in many NLP tasks, and has clear connections to bias, in that bias in training data often leads to higher accuracy for majority groups at the expense of minority groups. However, there has traditionally been a disconnect between research on class-imbalanced learning and mitigating bias, and only recently have the two been examined through a common lens. In this work we evaluate long-tail learning methods for tweet sentiment and occupation classification, and extend a margin-loss based approach with methods to enforce fairness. We empirically show through controlled experiments that the proposed approaches help mitigate both class imbalance and demographic biases.
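    The margin-loss approach mentioned in this abstract can be sketched with a per-class-margin cross-entropy in the LDAM style, where rarer classes receive larger margins (margin proportional to n_c^(-1/4)). This is an assumed, simplified NumPy version for illustration, not the paper's loss or its fairness extension.

```python
import numpy as np

def margin_softmax_loss(logits, y, class_counts, C=1.0):
    """Cross-entropy with class-size-dependent margins (LDAM-style):
    the true-class logit is reduced by a margin that grows as the
    class gets rarer, pushing the boundary away from minority classes."""
    counts = np.asarray(class_counts, dtype=float)
    margins = C / counts ** 0.25           # larger margin for rarer classes
    z = logits.astype(float).copy()
    z[np.arange(len(y)), y] -= margins[y]  # subtract margin from true logit
    z -= z.max(axis=1, keepdims=True)      # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```

    With symmetric logits, an example from the rare class incurs a higher loss than one from the frequent class, so training pushes harder on minority-class examples; the paper's extension additionally enforces fairness across demographic groups.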
  • Item
    Commonsense Knowledge in Word Associations and ConceptNet
    Liu, C ; Cohn, T ; Frermann, L (Association for Computational Linguistics, 2021)
  • Item
    Learning coupled policies for simultaneous machine translation using imitation learning
    Arthur, P ; Cohn, T ; Haffari, G (ACL, 2021-01-01)
    We present a novel approach to efficiently learning a simultaneous translation model with coupled programmer-interpreter policies. First, we present an algorithmic oracle that produces oracle READ/WRITE actions for training bilingual sentence pairs using word alignments. These oracle actions are designed to capture enough information from the partial input before writing the output. Next, we apply coupled scheduled sampling to effectively mitigate exposure bias when learning both policies jointly with imitation learning. Experiments on six language pairs show our method outperforms strong baselines in terms of translation quality while keeping translation delay low.
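    The alignment-based oracle described in this abstract can be sketched as follows: write each target word only after reading every source word aligned to it (or to an earlier target word). This is a minimal illustrative version, assuming 0-based alignment pairs; the paper's oracle may differ in its exact rules.

```python
def oracle_actions(alignments, src_len, tgt_len):
    """Derive a READ/WRITE action sequence from word alignments.

    alignments: iterable of (src_idx, tgt_idx) pairs, 0-based.
    A target word is written once its required source prefix is read.
    """
    # Minimal source prefix length needed before writing each target word.
    need = [0] * tgt_len
    for i, j in alignments:
        need[j] = max(need[j], i + 1)
    # Enforce monotonicity: the source prefix can only grow.
    for j in range(1, tgt_len):
        need[j] = max(need[j], need[j - 1])
    actions, read = [], 0
    for j in range(tgt_len):
        while read < need[j]:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    # Read any remaining source words at the end.
    actions.extend("READ" for _ in range(src_len - read))
    return actions
```

    For example, with alignments [(0, 0), (2, 1), (1, 2)] over a 3-word pair, the first target word can be written after one READ, while the second must wait for the full source prefix, capturing the "enough information before writing" property.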
  • Item
    It is Not as Good as You Think! Evaluating Simultaneous Machine Translation on Interpretation Data
    Zhao, J ; Arthur, P ; Haffari, G ; Cohn, T ; Shareghi, E (Association for Computational Linguistics, 2021-01-01)
    Most existing simultaneous machine translation (SiMT) systems are trained and evaluated on offline translation corpora. We argue that SiMT systems should be trained and tested on real interpretation data. To illustrate this argument, we propose an interpretation test set and conduct a realistic evaluation of SiMT trained on offline translations. Our results, on our test set along with three existing smaller-scale language pairs, highlight a difference of up to 13.83 BLEU when SiMT models are evaluated on translation vs. interpretation data. In the absence of interpretation training data, we propose a translation-to-interpretation (T2I) style transfer method which converts existing offline translations into interpretation-style data, leading to an improvement of up to 2.8 BLEU. However, the evaluation gap remains notable, calling for the construction of large-scale interpretation corpora better suited for evaluating and developing SiMT systems.