Natural language processing for resource-poor languages
Author
Duong, Long
Date
2017
Affiliation
Computing and Information Systems
Document Type
PhD thesis
Access Status
Open Access
Description
© Dr Long Duong
Abstract
Natural language processing (NLP) aims, broadly speaking, to teach computers to understand human language. This is hard because the computer must comprehend many facets of language, such as semantics, syntax, pragmatics and phonology, which are difficult to characterize formally, let alone encode as computer instructions. It is even harder for so-called low-resource languages, where annotated resources are very limited. There are approximately 7,000 languages in the world, but only a small fraction of these (around 20 languages) are considered high-resource. Low-resource languages are in dire need of tools and resources to overcome the resource barrier, so that advances in NLP can deliver more widespread benefits. Despite the lack of annotated data, some unannotated resources may still benefit low-resource languages, including parallel data, bilingual lexical resources, and clues from related languages. However, how to effectively incorporate these resources to improve the performance of low-resource NLP remains an open research question, and is the target of this thesis. Of the world's 7,000 languages, half have no writing system and many are falling out of use. It is estimated that by the end of this century, half of the world's languages will be extinct. It is therefore necessary to extend current NLP techniques to unwritten languages, to process and document them before they are gone forever.
Transfer learning provides an important opportunity for low-resource NLP, whereby annotation is transferred from a resource-rich source language to a resource-poor target language. In this thesis, we successfully apply transfer learning to many low-resource NLP tasks in both semi-supervised and unsupervised settings. We show that only a small amount of annotated text in the target language is sufficient to achieve a large performance improvement when a resource-rich source language is incorporated into the model. We report successful applications to low-resource part-of-speech tagging and dependency parsing, and observe improvements in both cascade training, where the models are trained in sequential order, and joint training. Where no annotated data is available, we instead propose unsupervised transfer learning techniques that take advantage of crosslingual word embeddings. We propose crosslingual syntactic word embeddings, in which words in both source and target languages are mapped to a shared low-dimensional space based on syntactic context, without using any additional resources such as parallel text. We observe consistent improvements when using crosslingual syntactic embeddings for unsupervised dependency parsing. In the setting where bilingual resources such as bilingual dictionaries are available, we perform better lexical transfer learning via crosslingual word embeddings, achieving competitive results on both unsupervised crosslingual document classification and dependency parsing. In the extremely low-resource scenario of unwritten languages, we experiment with a neural attentional model that learns directly from speech in the unwritten language paired with translations in a higher-resource language. Our preliminary experiments demonstrate the feasibility of the task.
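The dictionary-based crosslingual embedding idea mentioned above can be illustrated with a minimal sketch. This is not the thesis's exact method: it shows the common orthogonal Procrustes formulation, where a small bilingual dictionary supplies paired vectors and a rotation is fit so that source-language embeddings land near their translations in a shared space. All data here are randomly generated toys, and every variable name is illustrative.

```python
import numpy as np

# Toy monolingual embeddings for a handful of dictionary word pairs.
# Rows of X are source-language vectors; rows of Y are the vectors of
# their translations in the target language (noise-free toy data).
rng = np.random.default_rng(0)
d = 4                                   # embedding dimension
X = rng.normal(size=(6, d))             # source-side dictionary vectors
true_map = np.linalg.qr(rng.normal(size=(d, d)))[0]   # hidden rotation
Y = X @ true_map                        # target-side vectors

# Orthogonal Procrustes: W = argmin ||XW - Y||_F  s.t.  W^T W = I,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# A new source word is projected into the shared space, where it can be
# compared to target-language words by cosine similarity.
x_new = rng.normal(size=(d,))
projected = x_new @ W
print(np.allclose(projected, x_new @ true_map))   # True on this toy
```

With real embeddings the dictionary is noisy and the recovery is only approximate, but the same closed-form map is a standard baseline for placing two vocabularies in one space.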
Although we focus on part-of-speech tagging, dependency parsing and unwritten language processing, the proposed methods are general and can be extended to other low-resource NLP tasks. We consider different levels of resource requirements, including semi-supervised learning, unsupervised learning, and a preliminary attempt at unwritten language processing. We conclude that (1) processing low-resource languages is hard but can be made possible using transfer learning, (2) complementary resources exist to compensate for the lack of annotated data, even for low-resource languages, and (3) unwritten languages are very common and should be included in any low-resource natural language processing system, bringing many new challenges for effective modelling.
Keywords
low-resource language processing; deep learning; transfer learning; word embeddings; crosslingual word embeddings; speech translation; speech alignment

