University Library
A gateway to Melbourne's research publications
Minerva Access is the University's Institutional Repository. It aims to collect, preserve, and showcase the intellectual output of staff and students of the University of Melbourne for a global audience.

    Natural language processing for resource-poor languages

    Download
    Natural Language Processing for Resource-Poor Languages (completed thesis, 1.52 MB)

    Author
    Duong, Long
    Date
    2017
    Affiliation
    Computing and Information Systems
    Document Type
    PhD thesis
    Access Status
    Open Access
    URI
    http://hdl.handle.net/11343/192938
    Description

    © Dr Long Duong

    Abstract
    Natural language processing (NLP) aims, broadly speaking, to teach computers to understand human language. This is hard, as the computer must comprehend many facets of language, such as semantics, syntax, pragmatics and phonology, which are difficult to characterize formally, let alone encode as computer instructions. It is even harder for so-called low-resource languages, where annotated resources are very limited. There are approximately 7,000 languages in the world, but only a small fraction of these (20 languages) are considered high-resource. Low-resource languages are in dire need of tools and resources to overcome the resource barrier, so that advances in NLP can deliver more widespread benefits. Despite the lack of annotated data, there are unannotated resources that might benefit low-resource languages, including parallel data, bilingual lexical resources and clues from related languages. However, the means for effectively incorporating these resources to improve low-resource NLP is an open research question, and the target of this thesis. Of the 7,000 languages, half have no writing system and many are falling out of use. It is estimated that by the end of this century, half of the world's languages will be extinct. Current NLP techniques must therefore be extended to unwritten languages, so that these languages can be processed and documented before they are gone forever. Transfer learning provides an important opportunity for low-resource NLP, whereby annotation is transferred from a resource-rich source language to a resource-poor target language. In this thesis, we successfully apply transfer learning to many low-resource NLP tasks in both semi-supervised and unsupervised settings. We show that only a small amount of annotated text in the target language is sufficient to achieve a large performance improvement by incorporating a resource-rich source language into the model.
    We report successful applications to low-resource part-of-speech tagging and dependency parsing. We observe improvement in both cascade training, where the model is trained in sequential order, and joint training. Where no annotated data is available, we instead propose unsupervised transfer learning techniques that take advantage of crosslingual word embeddings. We propose crosslingual syntactic word embeddings, where words in both source and target languages are mapped to a shared low-dimensional space based on syntactic context, without using any additional resources such as parallel text. We observe consistent improvements when using crosslingual syntactic embeddings for unsupervised dependency parsing. In the setting where bilingual resources such as bilingual dictionaries are available, we perform better lexical transfer learning via crosslingual word embeddings, achieving competitive results on both unsupervised crosslingual document classification and dependency parsing. In the extremely low-resource scenario of unwritten languages, we experiment with a neural attentional model that learns directly from speech in the unwritten language and translations in the higher-resource language; our preliminary experiments demonstrate the feasibility of the task. Although we focus on part-of-speech tagging, dependency parsing and unwritten language processing, the proposed methods are general and can be extended to other low-resource NLP tasks. We have considered different levels of resource requirements, including semi-supervised learning, unsupervised learning and our preliminary attempt at unwritten language processing.
    We conclude that (1) processing low-resource languages is hard but can be made possible using transfer learning, (2) there are complementary resources to compensate for the lack of annotated data even for low-resource languages, and (3) unwritten languages are very common and should be included in any low-resource natural language processing system, bringing many new challenges for effective modelling.
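    The dictionary-based lexical transfer described in the abstract relies on mapping source-language embeddings into the target-language embedding space over a seed bilingual dictionary. A minimal sketch of one common way to do this, orthogonal Procrustes alignment on toy synthetic vectors (this is an illustration of the general technique, not the thesis's actual method or data):

    ```python
    import numpy as np

    # Hypothetical toy "embeddings" for seed dictionary pairs:
    # each row of Y is a target-language vector; the matching row of X
    # is the source-language vector, here simulated as a hidden rotation of Y.
    rng = np.random.default_rng(0)
    d = 4
    Y = rng.normal(size=(6, d))                        # target-language vectors
    R_true = np.linalg.qr(rng.normal(size=(d, d)))[0]  # hidden orthogonal map
    X = Y @ R_true.T                                   # source-language vectors

    def procrustes_align(X, Y):
        """Orthogonal W minimising ||X W - Y||_F (closed form via SVD)."""
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    W = procrustes_align(X, Y)
    # After mapping, source vectors live in the target space and can be
    # compared to target vectors directly (e.g. by cosine similarity).
    print(np.allclose(X @ W, Y, atol=1e-8))
    ```

    The orthogonality constraint preserves distances and angles in the source space, which is why this style of mapping works well from only a small seed dictionary.
    
    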
    Keywords
    low-resource language processing; deep learning; transfer learning; word embeddings; crosslingual word embeddings; speech translation; speech alignment



    Collections
    • Computing and Information Systems - Theses [355]