Computing and Information Systems

In natural language processing (NLP), a crucial subsystem in a wide range of applications is a part-of-speech (POS) tagger, which labels (or classifies) unannotated words of natural language with part-of-speech labels corresponding to categories such as noun, verb or adjective. Mainstream approaches are generally corpus-based: a POS tagger learns from a corpus of pre-annotated data how to correctly tag unlabelled data. Previous work has tended to focus on applying new algorithms to the problem or adding hand-tuned features to assist in classifying difficult instances. Using these methods, a number of distinct approaches have plateaued at similar accuracy figures of 96.9 ± 0.3%. Here we approach the problem of improving accuracy in POS tagging from a unique angle. We use a representative set of tagging algorithms and attempt to optimise performance by modifying the inventory of tags (or tagset) used in the pre-labelled training data. We modify tagsets by systematically mapping the tags of the training data to a new tagset. Our aim is to produce a tagset which is more conducive to automatic POS tagging by more accurately reflecting the underlying linguistic distinctions which should be encoded in a tagset. The mappings are reversible, enabling the original tags to be trivially recovered, which facilitates comparison with previous work and between competing mappings. We explore two broad sources of these mappings. Our primary focus is on using linguistic insight to determine potentially useful distinctions which we can then evaluate empirically. We also evaluate an alternative data-driven approach for extracting patterns of regularity in a tagged corpus. Our experiments indicate the approach is not as successful as we had predicted. Our most successful mappings were data-driven, which gave improvements of approximately 0.01% in token-level accuracy over the development set using specific taggers, with increments of 0.03% over the test set. We show a wide range of linguistically motivated modifications which cause a performance decrement, while the best linguistic approaches approximately maintain performance over the development data and produce up to 0.05% improvement over the development data. Our results lead us to believe that this line of research is unlikely to provide significant gains over conventional approaches to POS tagging.
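As a rough illustration of what a reversible tagset mapping could look like, the sketch below refines one tag in the training data and then recovers the original tags. It is not the mapping used in the thesis: the split of the Penn Treebank-style tag `IN` into a hypothetical `IN-SUB` tag, the word list, and all function names are assumptions made purely for the example.

```python
# Minimal sketch of a reversible tagset mapping (all names are hypothetical).
# Assumption: tags follow Penn Treebank conventions, where "IN" covers both
# prepositions and subordinating conjunctions.

# Illustrative word list only; a real mapping would need a principled source.
SUBORDINATORS = {"although", "because", "unless", "whereas"}

# Every new tag points back to exactly one original tag, so the mapping
# can always be undone.
REVERSE = {"IN-SUB": "IN"}


def map_tag(word, tag):
    """Map an original tag to a finer-grained tag using the word itself."""
    if tag == "IN" and word.lower() in SUBORDINATORS:
        return "IN-SUB"
    return tag


def unmap_tag(tag):
    """Recover the original tag from a mapped tag."""
    return REVERSE.get(tag, tag)


def map_corpus(tagged):
    return [(word, map_tag(word, tag)) for word, tag in tagged]


def unmap_corpus(tagged):
    return [(word, unmap_tag(tag)) for word, tag in tagged]


def token_accuracy(gold, predicted):
    """Token-level accuracy between two tagged sequences of equal length."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(g == p for (_, g), (_, p) in zip(gold, predicted))
    return correct / len(gold)


sentence = [
    ("She", "PRP"), ("stayed", "VBD"), ("although", "IN"),
    ("it", "PRP"), ("rained", "VBD"), ("in", "IN"), ("Paris", "NNP"),
]
mapped = map_corpus(sentence)      # training data in the new tagset
restored = unmap_corpus(mapped)    # original tags recovered for evaluation
```

Because tagger output is mapped back before scoring, accuracy is always measured against the original tagset, which is what makes results comparable across competing mappings and with earlier work.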

I examine the application of deep parsing techniques to a range of Natural Language Processing tasks as well as methods to improve their performance. Focussing specifically on the English Resource Grammar, a hand-crafted grammar of English based on the Head-Driven Phrase Structure Grammar formalism, I examine some techniques for improving parsing accuracy in diverse domains and methods for evaluating these improvements. I also evaluate the utility of the in-depth linguistic analyses available from this grammar for some specific NLP applications such as biomedical information extraction, as well as investigating other applications of the semantic output available from this grammar.

Computing and Information Systems - Theses
