Computing and Information Systems - Theses

    Table Semantic Learning for Chemical Patents
    Zhai, Zenan (2023-03)
    New chemical compounds discovered in commercial research are usually first disclosed in patents. Only a small fraction of these compounds later appear in the scientific literature, and then only after an average delay of 1-3 years following disclosure in patents. Chemical patents are therefore crucial and timely resources for novelty checking, validation, and understanding compound prior art, and an important knowledge resource for researchers in industry and academia. Natural Language Processing (NLP) is developing rapidly and has achieved strong performance on a wide range of information extraction tasks. However, the NLP community mainly focuses on unstructured text in the general domain. There is still a lack of datasets and information extraction methods for semi-structured text and for chemical patents. In this thesis, we focus on improving automatic table semantic learning for chemical patents. Most modern NLP methods use pre-trained word embeddings as part of their inputs, and word embeddings pre-trained on in-domain data have been shown to improve the performance of the models that use them. Hence, we start by laying the foundation for evaluating table semantic learning models on chemical patents: we pre-train word embeddings on in-domain data. Our experiments on a collection of chemical patent datasets show that these embeddings help improve performance on named-entity recognition, co-reference resolution, and table semantic classification tasks. Next, to address the lack of training data, we present a new dataset for the table semantic classification task in chemical patents. Baseline results from existing table semantic learning methods show that neural models outperform non-neural baselines. However, these approaches sacrifice either the 2-D structure of tables or the sequential information between cells.
Finally, we propose a novel approach that addresses this limitation. The proposed method adopts a quad-directional recurrent layer that captures sequential information between neighboring cells in both the vertical and horizontal directions. We then combine it with an image processing model based on a convolutional neural network that captures regional features in the 2-D structure. We show that the proposed method performs better than existing methods on the semantic classification of chemical patent tables. To further show the efficacy of the model, we adapt it to the cell-level syntactic classification task for tables. We show that the proposed model achieves strong performance on a novel web table dataset we created for this task.
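
The idea of scanning a table in four directions can be illustrated with a toy sketch. It is not the thesis's implementation: the abstract does not specify the layer, so the scalar cell features, the tanh recurrence, the weights `w_in` and `w_rec`, and the function names `scan` and `quad_directional` are all assumptions made for illustration. The convolutional component is not sketched here.

```python
import math

def scan(seq, w_in=0.5, w_rec=0.5):
    """Toy scalar recurrence (assumed): h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, out = 0.0, []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        out.append(h)
    return out

def quad_directional(grid):
    """For each cell, collect four hidden states, one per scan direction:
    index 0 left-to-right, 1 right-to-left, 2 top-to-bottom, 3 bottom-to-top.
    `grid` is a rectangular list of rows holding one float per cell."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[[0.0] * 4 for _ in range(n_cols)] for _ in range(n_rows)]

    # Horizontal scans: one forward and one backward pass per row.
    for r in range(n_rows):
        row = grid[r]
        forward = scan(row)
        backward = scan(row[::-1])[::-1]  # re-align to original cell order
        for c in range(n_cols):
            out[r][c][0] = forward[c]
            out[r][c][1] = backward[c]

    # Vertical scans: one downward and one upward pass per column.
    for c in range(n_cols):
        col = [grid[r][c] for r in range(n_rows)]
        down = scan(col)
        up = scan(col[::-1])[::-1]
        for r in range(n_rows):
            out[r][c][2] = down[r]
            out[r][c][3] = up[r]
    return out

# Hypothetical 2x2 table whose cells are already encoded as single numbers.
features = quad_directional([[1.0, 0.0], [0.0, 1.0]])
```

Each cell ends up with one state per direction, so its representation depends on neighbors along its row and its column. In a real model the cell inputs would be embedding vectors and the recurrence a learned unit such as an LSTM or GRU.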