Multi-Granular Webpage Information Extraction and Analysis via Deep Joint Learning
Affiliation: Computing and Information Systems
Document Type: PhD thesis
Access Status: This item is embargoed and will be available on 2023-02-19. This item is currently available to University of Melbourne staff and students only, login required.
© 2020 Yimeng Dai
The number of webpages is growing exponentially, which results in a great volume of unstructured information on the web. Fully comprehending a webpage, or retrieving relevant information from a complex one, takes time. Analyzing unstructured webpages and automatically extracting structured information from them is therefore crucial. In this study, we aim to develop algorithms for multi-granular webpage information extraction and analysis to facilitate webpage information understanding. We investigate the problem at three levels of granularity, i.e., the micro, meso and macro levels. At each level, we focus on one extraction and analysis task, although the algorithms we develop are general and can be applied to many other similar tasks. At the micro level, we aim to extract webpage entities that have diverse forms, and focus on the application of person name recognition. We propose a fine-grained annotation scheme based on anthroponymy and create the first dataset for fine-grained name recognition. We propose a joint model that learns the different name form classes with two sub-neural networks while fusing the learned signals through co-attention and gated fusion mechanisms. Experimental results show that our annotations can be utilised in different ways to improve the recognition performance. At the meso level, we study the relationships between webpage entities and blocks, with a focus on the application of joint recognition of names and publications. We address the person name recognition and publication string recognition tasks in academic homepages jointly, based on the insight that the two tasks are inherently correlated. We propose a joint model to capture the interdependencies between entities. We also capture global position patterns of blocks and local position patterns of entities in the model learning process. 
Empirical results on real datasets show that our model outperforms the state-of-the-art publication string recognition and person name recognition models, and that it also outperforms baseline joint models. At the macro level, we aim to provide hierarchical analysis for webpages from diverse domains. We introduce the Webpage Briefing (WB) task, which aims to generate a hierarchical summary of a webpage: at the top is an abstract, general description of the topic of the webpage, followed by high-level key attributes extracted from the webpage, and then lower-level key attributes, which contain concrete and specific key information. We propose to perform webpage briefing by identifying and summarizing the informative contents, which mimics the human behaviour of understanding a complex webpage. We propose a novel Dual Distillation (Bi-Distill) method that has a teacher-student architecture with dual distillation. We further propose a Triple Distillation (Tri-Distill) method to better exploit the inherent correlation between specific key attributes and general topics of webpages. We finally propose a novel Triple Joint (Tri-Join) model that has a triple joint learning architecture with signal exchange and enhancement mechanisms. Experimental results show the superiority of Bi-Distill and Tri-Distill over baseline methods. Experimental results also show that Tri-Join outperforms baseline single-task models and baseline jointly trained models.
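For readers unfamiliar with teacher-student training, the following is a minimal sketch of a generic soft-target distillation loss, in which a student is penalised for diverging from a teacher's temperature-softened output distribution. It is not the thesis's Dual or Triple Distillation method, whose details are not given in this abstract; the function names, the example logits and the temperature value are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Generic single-teacher formulation (an assumption for illustration),
    scaled by temperature squared as is conventional for soft targets.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2


# Hypothetical logits over three candidate labels.
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
print(distillation_loss(student, teacher))  # positive: student disagrees
print(distillation_loss(teacher, teacher))  # zero: distributions match
```

In practice such a term is usually combined with a supervised loss on ground-truth labels; how the thesis combines its two or three distillation signals is not stated here.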
Keywords: Information Extraction; Entity Extraction; Multi-task Learning; Text Summary; Pattern Recognition and Data Mining; Artificial Intelligence; Natural Language Processing