Item: Source-Free Transductive Transfer Learning for Structured Prediction. Kurniawan, Kemal Maulana (2023-07). Current transfer learning approaches require two strong assumptions: that the source domain data is available and that the target domain has labelled data. These assumptions are problematic when the source domain data is private and the target domain has no labelled data. We therefore consider the source-free unsupervised transfer setup, in which both assumptions are violated, across both languages and domains (genres). To transfer structured prediction models in the source-free setting, we propose two methods: Parsimonious Parser Transfer (PPT), designed for single-source transfer of dependency parsers across languages, and PPTX, the multi-source version of PPT. Both methods outperform baselines. We then improve PPTX with logarithmic opinion pooling (PPTX-LOP), and find that it is an effective multi-source transfer method for structured prediction in general. Next, we study whether our proposed source-free transfer methods provide improvements when pretrained language models (PTLMs) are employed. We first propose Parsimonious Transfer for Sequence Tagging (PTST), a variation of PPT designed for sequence tagging. We then evaluate PTST and PPTX-LOP on domain adaptation of semantic tasks using PTLMs, and show that for globally normalised models, PTST improves precision and PPTX-LOP improves recall. Besides unlabelled data, the target domain may have models trained on various tasks (but not the task of interest). To investigate whether these models can be used to improve performance in source-free transfer, we propose two methods, and find that one of them leverages these models to improve recall over direct transfer. Finally, we critically discuss the findings of this thesis, cover relevant subsequent work, and close with a discussion of limitations and future work.
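Logarithmic opinion pooling combines the output distributions of several source models via a weighted geometric mean rather than an arithmetic one. A minimal sketch of this pooling step follows; the head labels, toy distributions, and uniform weights are illustrative assumptions, not details of PPTX-LOP itself:

```python
import math

def log_opinion_pool(distributions, weights=None):
    """Combine per-label probability distributions from several source
    models via a weighted geometric mean, then renormalise."""
    n = len(distributions)
    weights = weights or [1.0 / n] * n
    labels = distributions[0].keys()
    # Weighted sum of log-probabilities == log of the weighted geometric mean.
    pooled = {
        y: math.exp(sum(w * math.log(d[y]) for w, d in zip(weights, distributions)))
        for y in labels
    }
    z = sum(pooled.values())  # normalising constant
    return {y: p / z for y, p in pooled.items()}

# Two hypothetical source parsers disagree on the head of a token:
p1 = {"head=2": 0.7, "head=3": 0.3}
p2 = {"head=2": 0.4, "head=3": 0.6}
pooled = log_opinion_pool([p1, p2])
```

Because the geometric mean is dominated by whichever model assigns low probability, pooling in log space tends to favour labels that all sources agree are plausible.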
Item: Table Semantic Learning for Chemical Patents. Zhai, Zenan (2023-03). New chemical compounds discovered in commercial research are usually first disclosed in patents. Only a small fraction of these new compounds ever appear in the scientific literature, and only after a lengthy delay, on average 1-3 years after disclosure in patents. Chemical patents are therefore crucial and timely resources for novelty checking, validation, and understanding compound prior art, and an important knowledge resource for researchers in industry and academia. Natural Language Processing (NLP) is developing rapidly and has achieved strong performance on a wide range of information extraction tasks. However, the NLP community mainly focuses on unstructured text in the general domain; datasets and information extraction methods for semi-structured text and chemical patents are still lacking. In this thesis, we focus on improving table semantic learning for chemical patents. Most modern NLP methods use pre-trained word embeddings as part of their inputs, and embeddings pre-trained on in-domain data have been shown to improve the performance of models that take them as input. Hence, we start by laying the foundation for the evaluation of table semantic learning models on chemical patents: pre-training word embeddings on in-domain data. Our experiments on a collection of chemical patent datasets show that these embeddings improve performance on named-entity recognition, co-reference resolution, and table semantic classification tasks. Next, to address the lack of training data, we present a new dataset for the semantic classification task in chemical patents. Baseline results from existing table semantic learning methods show that neural models outperform non-neural baselines.
However, these approaches sacrifice either the 2D structure of tables or the sequential information between cells. Finally, we propose a novel approach that addresses this limitation. The proposed method adopts a novel quad-directional recurrent layer that captures sequential information between neighbouring cells in both the vertical and horizontal directions. We combine it with a convolutional neural network-based image processing model that captures regional features in the 2D structure. We show that the proposed method outperforms existing methods on the semantic classification of chemical patent tables. To further demonstrate its efficacy, we adapt the model to the table cell-level syntactic classification task, and show that it achieves strong performance on a novel web table dataset we created for this task.
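A quad-directional recurrence can be pictured as four directional scans over the table grid, one per direction, whose per-cell states are then concatenated. The sketch below substitutes a toy scalar decay update for a learned recurrent cell; the update rule and the example table are illustrative assumptions, not the thesis's actual parameterisation:

```python
def directional_scan(grid, direction):
    """Toy scalar 'RNN' scan over a table: each cell's state is a decayed
    sum of the cells preceding it in the given scan direction."""
    rows, cols = len(grid), len(grid[0])
    state = [[0.0] * cols for _ in range(rows)]
    if direction in ("left", "right"):
        for r in range(rows):
            cs = range(cols) if direction == "right" else range(cols - 1, -1, -1)
            h = 0.0
            for c in cs:
                h = 0.5 * h + grid[r][c]  # toy recurrent update
                state[r][c] = h
    else:  # "down" or "up": scan each column instead of each row
        for c in range(cols):
            rs = range(rows) if direction == "down" else range(rows - 1, -1, -1)
            h = 0.0
            for r in rs:
                h = 0.5 * h + grid[r][c]
                state[r][c] = h
    return state

def quad_directional(grid):
    """Concatenate the four directional states for every cell."""
    scans = [directional_scan(grid, d) for d in ("right", "left", "down", "up")]
    return [[tuple(s[r][c] for s in scans) for c in range(len(grid[0]))]
            for r in range(len(grid))]

table = [[1.0, 2.0],
         [3.0, 4.0]]
features = quad_directional(table)
```

Each cell thus sees context from all four of its row and column neighbourhoods, which is the information a single-direction scan or a flattened table representation would lose.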
Item: From Discourse and Keyphrases, to Language Modeling in Automatic Summarization. Fajri, Fajri (2022). This thesis aims to enhance single-document automatic summarization along four different dimensions: language models, discourse, keyphrases, and evaluation. First, progress on language models and automatic summarization has been predominantly in English, leaving open the question of whether these advances carry over to other languages. To address this, we perform a case study on Indonesian by releasing two pre-trained language models (IndoBERT and IndoBERTweet) and two large-scale summarization corpora (Liputan6 and LipKey). While our findings suggest that current techniques do work effectively in Indonesian, we identify particular challenges in evaluating Indonesian text summarization, arising from morphological variation, synonyms, and abbreviations in system-generated summaries. Second, modern summarization systems are built on pre-trained language models, which serve as their foundation. However, it is still unclear whether these language models truly learn the summarization task or simply memorize patterns in the input documents and human-written summaries. In this thesis, we argue that these language models are still imperfect, and investigate the benefits of discourse information and keyphrases for summarization systems: discourse provides information about text organization, while keyphrases capture succinct and salient words about the text. To test this hypothesis, we first perform discourse probing on pre-trained language models to understand the extent to which they capture discourse relations, and introduce a novel approach to discourse parsing, the task of recovering the discourse structure of a document. We then explicitly incorporate discourse and keyphrases into summarization systems and find that the quality of machine-generated summaries improves.
Lastly, despite significant progress in the development of summarization models, both automatic and manual evaluation of text summarization remain understudied. Reliable and scalable evaluation is critical for measuring research progress in summarization, and ROUGE, the de facto standard for summarization evaluation, is inadequate. ROUGE evaluates summary quality only by comparing word overlap between machine-generated and human-written summaries, while broader aspects such as faithfulness (the extent to which the generated summary contains genuine details found in the document) and linguistic quality (e.g. fluency of the language) are not covered. The last contribution of this thesis is a comprehensive automatic evaluation framework for text summarization that compiles prominent aspects used in the manual evaluations of prior work. We introduce this proposal as the FFCI framework, which consists of four aspects: faithfulness, focus, coverage, and inter-sentential coherence, and we propose methods to automatically assess summarization quality along each of these aspects.
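ROUGE's word-overlap view of summary quality can be made concrete with a minimal ROUGE-1 F1 computation; whitespace tokenisation, lowercasing, and the absence of stemming are simplifying assumptions here:

```python
from collections import Counter

def rouge1_f1(generated, reference):
    """ROUGE-1: F1 over unigram overlap between a generated and a
    reference summary, using clipped counts as in the original metric."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
```

Here the generated summary contradicts the reference ("sat" vs "lay") yet still scores 5/6, illustrating why word overlap alone cannot capture faithfulness, one of the gaps the FFCI framework is designed to address.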
Item: A multi-faceted approach to document quality assessment. Shen, Aili (2020). Document quality assessment, due to its complexity and subjectivity, requires considering information from multiple sources and aspects to capture quality indicators. Grammaticality, readability, stylistics, structure, correctness, and depth of expertise reflect the quality of documents from different aspects, with varying importance across domains. Automatic quality assessment has obvious benefits in terms of time saving and tractability in contexts where the volume of documents is large. For dynamic documents, possibly with multiple authors, as in the case of Wikipedia, it is particularly pertinent, as any edit potentially has implications for the quality label of that document. In this thesis, we focus on improving the performance of document quality assessment systems and on measuring the uncertainty of their predictions. The thesis addresses four research questions: (1) How can we capture visual features not present in the document text, such as images and visual layout, to enhance representations learned from text content? (2) How can we make use of the hand-crafted features widely adopted in traditional machine learning approaches in the context of neural networks, to build a more accurate document quality assessment system? (3) How can we model the inherent subjectivity of quality assessment when evaluating the performance of quality assessment systems? and (4) Can a quality assessment system detect whether there are intruder sentences in a document and identify their span, given that intruder sentences interrupt the coherence of a document and thereby lower its quality?
To address the first research question, we propose to use Inception V3 (Szegedy et al., 2016), a widely used model in computer vision, to capture visual features such as images and layout from visual renderings of documents. Inception V3 compares favourably to text-based models on the Wikipedia and academic paper reviewing datasets. We further propose a joint model that predicts document quality by combining visual and textual features, and observe further improvements on both datasets, indicating complementarity between visual and textual features and the general applicability of our proposed method. Next, we propose two methods to enhance the capacity of neural models in predicting document quality by utilising hand-crafted features. The first method concatenates hand-crafted features with the high-level representations learned by the neural model, on the assumption that the learned features may not capture all the information carried by the hand-crafted ones. The second method instead uses hand-crafted features to guide neural model learning, explicitly attending to feature indicators when learning the relationship between the input and target variables rather than simply concatenating the features. Experimental results demonstrate the superiority of our proposed methods over baselines. To model people's disagreement over the inherently subjective task of document quality assessment, we propose to measure the uncertainty in document quality predictions. We investigate two methods, Gaussian processes (GPs) (Rasmussen and Williams, 2006) and random forests (RFs) (Breiman, 2001), which provide not only a prediction of document quality but also the uncertainty over that prediction.
We also propose an asymmetric cost that incorporates prediction uncertainty, and use it to compare the two methods in scenarios where decisions based on model predictions can incur different costs. Lastly, we propose the new task of detecting whether a document contains an intruder sentence, generated by replacing an original sentence with a similar sentence from a second document. Existing coherence detection datasets are unsuitable for this task: they are either too small to train current data-hungry models on, or do not specify the span of the incoherent text. To benchmark model performance on this task, we construct a large-scale dataset of documents from English Wikipedia and CNN news articles. Experimental results show that pre-trained language models which incorporate larger document contexts in pretraining perform remarkably well in-domain, but suffer a substantial performance drop cross-domain. In follow-up analysis based on human annotations, we observe substantial divergences from human intuitions, pointing to limitations in these models' ability to capture document coherence. Further results on a linguistic probe dataset show that pre-trained models fail to identify some linguistic characteristics that affect document coherence, suggesting room for improvement before they truly capture document coherence, and motivating the construction of a dataset with intruder text at the intra-sentential level.
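An ensemble such as a random forest provides uncertainty essentially for free: the point prediction is the mean of the individual trees' outputs, and their spread serves as the uncertainty estimate, which an asymmetric cost can then weigh when over- and under-prediction have different consequences. The per-tree scores and cost weights below are illustrative assumptions, not values from the thesis:

```python
from statistics import mean, stdev

def forest_predict(tree_predictions):
    """Point prediction and uncertainty from an ensemble: the mean of the
    individual trees' outputs and their sample standard deviation."""
    return mean(tree_predictions), stdev(tree_predictions)

def asymmetric_cost(pred, true, w_over=2.0, w_under=1.0):
    """Penalise over-prediction more heavily than under-prediction
    (hypothetical weights for illustration only)."""
    err = pred - true
    return w_over * err if err > 0 else -w_under * err

# Quality scores predicted by individual trees for two documents:
confident = forest_predict([4.1, 4.0, 4.2, 4.1, 4.0])  # trees agree
uncertain = forest_predict([2.0, 4.5, 3.1, 5.0, 1.9])  # trees disagree
```

When the per-tree predictions disagree, the second component of the output grows, flagging a prediction that a downstream decision process should treat with caution.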