Computing and Information Systems - Theses
Anaphora Resolution in Procedural Text - from Domain to Domain
Fang, Biaoyan (2022)
Anaphora is an important and frequent phenomenon in any form of discourse. It describes the use of expressions that refer back to expressions used earlier in the text, to avoid repetition. Anaphora resolution aims to resolve these reference relations in discourse and forms a core task in natural language understanding. It mainly covers two anaphoric types: coreference and bridging. While much effort has been targeted at anaphora resolution, most research has treated these two anaphoric types separately. Specifically, anaphora research mostly focuses on coreference, modeling it from different perspectives across various resources. Bridging, on the other hand, has not been studied comprehensively. Different work analyzes bridging differently, leading to inconsistencies in bridging definitions. The lack of attention to bridging also makes it difficult to capture the full range of anaphora phenomena in discourse: modeling coreference alone is not sufficient to capture complex anaphoric relations in text. It is becoming increasingly important to have both coreference and bridging annotated. Additionally, most existing anaphora research is based on declarative text. Procedural text, a common type of text, has received limited attention despite the richness and importance of anaphora phenomena in it, leaving much room for further exploration. In this thesis, we focus on anaphora resolution in procedural text, studying both coreference and bridging in two common types of procedural text, chemical patents and recipes, and show that our proposed anaphora frameworks are well suited to procedural text. The four research questions we address in this thesis are: (1) How to model anaphora resolution in chemical patents? (2) How to combine different types of anaphora resolution? (3) How to incorporate external knowledge into anaphora resolution?
(4) How to generalize our anaphora resolution model to domains apart from the biochemical domain? We address the first research question by proposing domain-specific anaphora annotation guidelines for chemical patents, targeting both coreference and bridging and incorporating general and domain-specific knowledge via in-depth investigations. We resolve ambiguities in bridging definitions by limiting the anaphoric relations to four specific subtypes related to the chemical domain while maintaining high coverage of anaphora phenomena. We achieve high inter-annotator agreement (IAA) on the resulting ChEMU-Ref corpus, well above that of existing bridging corpora, demonstrating the reliability of the dataset. To address the second research question, we propose an end-to-end joint training anaphora resolution model for coreference and bridging, adapting an end-to-end coreference resolution framework (Lee et al., 2017, 2018). Through empirical experiments on off-the-shelf anaphora corpora, we show the benefits of joint training for bridging. However, the impact on coreference is not clear. We argue that this could be due to ambiguity in the definition of bridging. To validate this hypothesis, we further experiment on two high-quality anaphora corpora with clear anaphora definitions, the ChEMU-Ref and RecipeRef datasets (the latter is described under the last research question), and show the potential for improving both tasks through joint training, indicating the benefits of joint learning of coreference and bridging on high-quality anaphora corpora. Next, we address the third research question from the perspective of pretrained language models, building on the proposed end-to-end joint training framework and experimenting on the ChEMU-Ref corpus. We show that even with a simple substitution, replacing generic language models (e.g. ELMo (Peters et al., 2018)) with domain-pretrained language models (e.g.
CHELMO (Zhai et al., 2019)), models achieve better performance, suggesting the potential of incorporating external knowledge for domain-specific anaphora resolution. Further explorations of recurrent neural network based and transformer based language models provide deeper insights, and suggest that different approaches might be needed to fully utilize different types of pretrained language models. For the last research question, we generalize the anaphora annotation framework developed for chemical patents to recipes, with domain adjustments informed by a detailed analysis of the similarities and differences between these two types of procedural text. Through in-depth comparison, we propose a more generic anaphora annotation framework for procedural text, designed as a hierarchy based on the state of entities. Based on the proposed annotation framework, we create the RecipeRef corpus, which captures rich anaphora phenomena in recipes and maintains high IAA scores, suggesting the feasibility of generalizing this framework to other procedural text. We observe further improvement from transfer learning, i.e. pretraining on the ChEMU-Ref dataset and fine-tuning on the RecipeRef dataset, indicating that general procedural knowledge transfers between the two domains. In summary, this thesis studies anaphora resolution in procedural text, particularly in chemical patents and recipes, two common types of procedural text, and fills the gap in modeling and resolving anaphora in this area. This establishes a firm base for, and contributes towards, further research in anaphora resolution over procedural text.
Multi-Granular Webpage Information Extraction and Analysis via Deep Joint Learning
Dai, Yimeng (2020)
The number of webpages is growing exponentially, which results in a great volume of unstructured information on the web. It takes time either to fully comprehend a webpage or to retrieve relevant information from a complex webpage. Analyzing unstructured webpages and automatically extracting structured information from them is therefore crucial. In this study, we aim to develop algorithms for multi-granular webpage information extraction and analysis to facilitate webpage information understanding. We investigate the problem at three levels of granularity, i.e., the micro, meso and macro levels. At each level, we focus on one extraction and analysis task, although the algorithms we develop are general and can be applied to many other similar tasks. At the micro level, we aim to extract webpage entities that have diverse forms, and focus on the application of person name recognition. We propose a fine-grained annotation scheme based on anthroponymy and create the first dataset for fine-grained name recognition. We propose a joint model that learns the different name form classes with two neural sub-networks while fusing the learned signals through co-attention and gated fusion mechanisms. Experimental results show that our annotations can be utilised in different ways to improve recognition performance. At the meso level, we study the relationships between webpage entities and blocks, with a focus on the application of joint recognition of names and publications. We address the person name recognition and publication string recognition tasks in academic homepages jointly, based on the insight that the two tasks are inherently correlated. We propose a joint model to capture the interdependencies between entities. We also capture global position patterns of blocks and local position patterns of entities in the model learning process.
Empirical results on real datasets show that our model outperforms the state-of-the-art publication string recognition model and person name recognition model. Experimental results also show that our model outperforms baseline joint models. At the macro level, we aim to provide hierarchical analysis for webpages from diverse domains. We introduce the Webpage Briefing (WB) task, which aims to generate a summary of a webpage in a hierarchical manner: at the top is an abstract, general description of the topic of the webpage, followed by high-level key attributes extracted from the webpage, and then lower-level key attributes, which contain concrete and specific key information. We propose to perform webpage briefing by identifying and summarizing the informative contents, which mimics the human behaviour of understanding a complex webpage. We propose a novel Dual Distillation (Bi-Distill) method that has a teacher-student architecture with dual distillation. We further propose a Triple Distillation (Tri-Distill) method to better exploit the inherent correlation between the specific key attributes and general topics of webpages. We finally propose a novel Triple Joint (Tri-Join) model that has a triple joint learning architecture with signal exchange and enhancement mechanisms. Experimental results show the superiority of the Bi-Distill and Tri-Distill methods over baseline methods. Experimental results also show that Tri-Join outperforms baseline single-task models and baseline jointly trained models.
Pushing the boundaries of deep parsing
MACKINLAY, ANDREW (2012)
I examine the application of deep parsing techniques to a range of Natural Language Processing tasks as well as methods to improve their performance. Focussing specifically on the English Resource Grammar, a hand-crafted grammar of English based on the Head-Driven Phrase Structure Grammar formalism, I examine some techniques for improving parsing accuracy in diverse domains and methods for evaluating these improvements. I also evaluate the utility of the in-depth linguistic analyses available from this grammar for some specific NLP applications such as biomedical information extraction, as well as investigating other applications of the semantic output available from this grammar.