Towards highly accurate publication information extraction from academic homepages
AffiliationComputing and Information Systems
Document TypeMasters Research thesis
Access StatusOpen Access
© 2018 Yiqing Zhang
More and more researchers list their research profiles in academic homepages online. Publications from a researcher's academic homepage contain rich information, such as the researcher's fields of expertise, research interests, and collaboration network. Extracting publication information from academic homepages is an essential step in automatic profile analysis, which enables many applications such as academic search, bibliometrics and citation analysis. The publications extracted from academic homepages can also be a supplementary source for bibliographic databases. We investigate two publication extraction problems in this thesis: (i) Given an academic homepage, how can we precisely extract all the individual publication strings from the homepage? Here, a publication string is a text string that describes a publication record. We call this problem publication string extraction. (ii) Given a publication string, how can we extract different fields, such as publication authors, publication title, and publication venue, from the publication string? We call this problem publication field extraction. There are two types of traditional approaches to these two problems, rule-based approaches and machine learning based approaches. Rule-based approaches cannot accommodate the large variety of styles in the homepages, and they require significant efforts in rule designing. Machine learning based approaches rely on a large amount of high-quality training data as well as suitable model structures. To tackle these challenges, we first collect two datasets and annotate them manually. We propose a training data enhancement method to generate large sets of semi-real data for training our models. For the publication string extraction problem, we propose a PubSE model that can model the structure of a publication list in both line-level and webpage-level. For the publication field extraction problem, we propose an Adaptive Bi-LSTM-CRF model that can utilize the generated and the manually labeled training data to the full extent. Extensive experiment results show that the proposed methods outperform the state-of-the-art methods in the publication extraction problems studied.
Keywordsbibliographic data mining; information extraction; machine learning; LSTM
- Click on "Export Reference in RIS Format" and choose "open with... Endnote".
- Click on "Export Reference in RIS Format". Login to Refworks, go to References => Import References