Automatic understanding of unwritten languages
Affiliation: Computing and Information Systems
Document Type: PhD thesis
Access Status: Open Access
© 2017 Dr. Oliver Adams
Many of the world's languages are falling out of use without a written record and with minimal linguistic documentation. Language documentation is a slow process, and there are too few linguists working to ensure the world's languages are documented before they die out. This thesis addresses automatic understanding of unwritten languages in order to perform tasks such as phonemic transcription and bilingual lexicon induction. The automation of such tasks promises to improve the leverage of field linguists and ultimately speed up the language documentation process. Modelling endangered languages is challenging due to the nature of the available data, which is typically not written text but limited quantities of recorded speech. Manually annotated information in the form of lexicons and grammars is typically also limited. Since the languages are spoken, the most efficient way of sourcing data is to collect speech in the language. Most speakers of endangered languages are bilingual or multilingual, so acquiring spoken translations plays to the strengths of the speakers. Key approaches described in this thesis make use of bilingual data, in particular translated speech, which consists of segments of endangered language speech paired with translations in a larger language. Such data is important for relating the source language speech to a larger language. Additionally, the application of monolingual phoneme transcription is explored, since it has direct applicability in more traditional phonemic transcription workflows. The overarching question is this: what can be automatically learnt about the languages with the data we have available, and how can this help automate language documentation?
We first consider translation modelling of phoneme transcriptions that are assumed to be accurate. This assumption allows us to investigate the feasibility of phoneme–word translation and the effectiveness of inferring bilingual lexical items from such data in isolation from confounding acoustic factors. A second investigation explores how bilingual lexicons can be used to improve language models, which are crucial components of speech recognition and machine translation systems. In a third set of experiments we remove the assumption of accurate transcriptions and investigate operating in the face of acoustic uncertainty. Experiments in this space demonstrate that translated speech can improve automatic phoneme transcription even without a prior translation model. Finally, we take a step towards further generalisability, exploring acoustic modelling in resource-scarce environments without a lexicon or language model. In particular, we assess the use of automatic phoneme and tone transcription on Yongning Na, a threatened tonal language spoken in south-west China. Beyond quantitative investigation, we report on the use of this method in linguistic documentation of Na. Its effectiveness has led to its incorporation into the language documentation workflow for Na.