Extracting named entities from tables in biomedical literature
Author: WONG, WERN LI
Affiliation: Faculty of Engineering, Computer Science and Software Engineering
Document Type: Honours thesis
CitationWong, W. L. (2008). Extracting named entities from tables in biomedical literature. Honours thesis, Faculty of Engineering, Computer Science and Software Engineering, The University of Melbourne.
Access Status: Open Access
Deposited with permission of the author. © 2008 Wern Li Wong
Information overload is a problem for many researchers, particularly in the biomedical field, where more articles are being written every year than is feasible for a single person to read. Information extraction aims to provide a solution to this problem by extracting information from the text data in articles and presenting the main points in a compact, immediately usable form, but the tabular data in those papers pose a specific challenge that, to the best of our knowledge, has yet to be properly addressed. Tables are attractive because they summarise and highlight relevant information in a semi-structured way, but currently there are few approaches that take advantage of the data they contain. This project aims to investigate automated methods for extracting information from tables scraped from journal articles on hereditary nonpolyposis colorectal cancer. We explored the use of various heuristics and machine learning techniques to extract mutation information from tables. The tables are first scraped from the journal papers in HTML format and segmented into table vectors; the vectors pertaining to mutations are subsequently identified and examined at the cell level to extract mutations. Our results show that efficient identification of mutations from tables is possible through the use of machine learning techniques, using vector header words as features.
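The pipeline outlined in the abstract (HTML table, then table vectors, then mutation vectors, then cell-level extraction) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that a table vector is a column, replaces the machine learning classifier with a hypothetical header keyword list (MUTATION_HEADER_WORDS), and uses a deliberately simplified pattern (MUTATION_PATTERN) that only matches substitutions such as c.350C>T. All function names and the sample table are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical keyword list standing in for a classifier trained on header words.
MUTATION_HEADER_WORDS = {"mutation", "variant", "nucleotide", "change"}

# Simplified, illustrative pattern: cDNA substitutions such as "c.350C>T" only.
MUTATION_PATTERN = re.compile(r"c\.\d+[ACGT]>[ACGT]")


class TableParser(HTMLParser):
    """Collect the cell text of a simple, non-nested HTML table as rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def column_vectors(rows):
    """Segment rows into (header, cells) column vectors; first row is the header."""
    header, *body = rows
    return [(h, [r[i] for r in body if i < len(r)]) for i, h in enumerate(header)]


def is_mutation_vector(header):
    """Keyword heuristic on header words (assumed stand-in for the classifier)."""
    words = set(re.findall(r"[a-z]+", header.lower()))
    return bool(words & MUTATION_HEADER_WORDS)


def extract_mutations(html):
    """Return mutation strings found in the cells of mutation-related columns."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    found = []
    for header, cells in column_vectors(parser.rows):
        if is_mutation_vector(header):
            for cell in cells:
                found.extend(MUTATION_PATTERN.findall(cell))
    return found


# Invented sample table, for illustration only.
SAMPLE = """<table>
<tr><th>Patient</th><th>Gene</th><th>Nucleotide change</th></tr>
<tr><td>1</td><td>MLH1</td><td>c.350C&gt;T</td></tr>
<tr><td>2</td><td>MSH2</td><td>c.942G&gt;A</td></tr>
</table>"""

if __name__ == "__main__":
    print(extract_mutations(SAMPLE))
```

In this sketch only the "Nucleotide change" column passes the header check, so the cell-level pattern is never applied to the patient or gene columns. A learned classifier over header words, as the abstract describes, would take the place of is_mutation_vector.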
Keywords: information extraction from biomedical literature; tabular data; heuristics; identification of mutations from tables; machine learning techniques; vector header words