School of Languages and Linguistics - Research Publications

  • Item
    Building Speech Recognition Systems for Language Documentation: The CoEDL Endangered Language Pipeline and Inference System (ELPIS)
    Foley, B ; Arnold, J ; Coto-Solano, R ; Durantin, G ; Ellison, TM ; van Esch, D ; Heath, S ; Kratochvíl, F ; Maxwell-Smith, Z ; Nash, D ; Olsson, O ; Richards, M ; San, N ; Stoakes, H ; Thieberger, N ; Wiles, J (ISCA, 2018)
    Machine learning has revolutionized speech technologies for major world languages, but these technologies have generally not been available for the roughly 4,000 languages with populations of fewer than 10,000 speakers. This paper describes the development of ELPIS, a pipeline which language documentation workers with minimal computational experience can use to build their own speech recognition models, resulting in models being built for 16 languages from the Asia-Pacific region. ELPIS puts machine learning speech technologies within reach of people working with languages with scarce data, in a scalable way. This is impactful since it enables language communities to cross the digital divide, and speeds up language documentation. Complete automation of the process is not feasible for languages with small quantities of data and potentially large vocabularies. Hence our goal is not full automation, but rather to make a practical and effective workflow that integrates machine learning technologies.
  • Item
    The Pacific expansion: Optimizing phonetic transcription of archival corpora
    Billington, R ; Stoakes, H ; Thieberger, N (ISCA, 2021-01-01)
    For most of the world’s languages, detailed phonetic analyses across different aspects of the sound system do not exist, due in part to limitations in available speech data and tools for efficiently processing such data for low-resource languages. Archival language documentation collections offer opportunities to extend the scope and scale of phonetic research on low-resource languages, and developments in methods for automatic recognition and alignment of speech facilitate the preparation of phonetic corpora based on these collections. We present a case study applying speech modelling and forced alignment methods to narrative data for Nafsan, an Oceanic language of central Vanuatu. We examine the accuracy of the forced-aligned phonetic labelling based on limited speech data used in the modelling process, and compare acoustic and durational measures of 17,851 vowel tokens for 11 speakers with previous experimental phonetic data for Nafsan. Results point to the suitability of archival data for large-scale studies of phonetic variation in low-resource languages, and also suggest that this approach can feasibly be used as a starting point in expanding to phonetic comparisons across closely-related Oceanic languages.
  • Item
    Multilingualism in Cyberspace - Longevity for Documentation of Small Languages
    Thieberger, N (Interregional Library Cooperation Centre, 2012)
  • Item
    Be Not Like the Wind: Access to Language and Music Records, Next Steps
    Thieberger, N ; Harris, A (European Language Resources Association (ELRA), 2020)
    Language archives play an important role in keeping records of the world’s languages safe. Accessible audio recordings held in archives can be used by speakers of small and endangered languages, and their communities, and provide a base for further research and documentation. There is an urgent need for historical analog tape recordings to be located and digitised, as they will soon be unplayable. PARADISEC holds records in 1228 languages. We run training for language documentation and are developing technologies to localise access to language records. A concerted effort is needed to support language archives and sustain language diversity.
  • Item
    The Language Documentation Quartet
    Musgrave, S ; Thieberger, N (University of Colorado at Boulder, 2021)
    As we noted in an earlier paper (Musgrave & Thieberger 2012), the written description of a language is an essentially hypertextual exercise, linking various kinds of material in a dense network. An aim based on that insight is to provide a model that can be implemented in tools for language documentation, allowing instantiation of the links always followed in writing a grammar or a dictionary, tracking backwards and forwards to the texts and media as the source of authority for claims made in an analysis. Our earlier paper described our initial efforts to encode Heath’s (1984) grammar, texts (1980), and dictionary (1982) of Nunggubuyu, an Australian language from eastern Arnhemland. We chose this body of work because it was written with many internal links between the three volumes. The links are all encoded with textual indexes which looked to be ready to be instantiated as automated hyperlinks once the technology was available. In this paper, we discuss our progress in identifying how the four component parts of a description (grammar, text, dictionary, media, henceforth the quartet) can be interlinked, what are the logical points at which to join them, and whether there are practical limits to how far this linking should be carried. We suggest that the problems which are exposed in this process can inform the development of an abstract or theoretical data structure for each of the components and these in turn can provide models for language documentation work which can feed into hypertext presentations of the type we are developing.
  • Item
    Building capacity for community-led documentation in Erakor, Vanuatu
    Krajinović, A ; Billington, R ; Emil, L ; Kaltap̃au, G ; Thieberger, N ; Vetulani, Z ; Paroubek, P (Wydawnictwo Nauka i Innowacje, 2019)
    Close collaboration between community members and visiting researchers offers mutual benefits, including opportunities for new research insights and an expanded scope for supporting language maintenance and developing practical materials. We discuss a collaboration in Erakor, Vanuatu aiming to build the capacity of community-based researchers to undertake and sustain language and cultural documentation projects. We focus on the technical and procedural skills required to collect, manage, and work with audio and video data, and give an overview of the outcomes of a community-led project after initial training. We discuss the benefits and challenges of this type of project from the perspective of the community researchers and the external linguists. We show that the community-led project in Erakor, in which data management and archiving are incorporated into the documentation process, has crucial benefits for both the community and the linguists. The two most salient benefits are: a) long-term documentation of linguistic and cultural practices calibrated towards the community’s needs, and b) collections of large quantities of data of good phonetic quality, which, besides being readily available for research, have great potential for training and testing emerging language technologies based on machine learning.
  • Item
    Prosodic marking of focus in Nafsan
    Fletcher, J ; Billington, R ; Thieberger, N ; Calhoun, S ; Escudero, P ; Tabain, M ; Warren, P (Australasian Speech Science and Technology Association, 2019)
    Languages use a variety of means to realise informational structure categories like topicalisation and focus. The interaction between prosody and focus realisation strategies was examined in Nafsan, a Southern Oceanic language of Vanuatu, in a series of tasks that were designed to explore prosodic realisation of informational and contrastive focus on nouns that were subjects or objects in mini-dialogues where word-order was manipulated. All speakers produced utterance-initial or utterance-final focal elements with a major pitch movement associated with the focused noun (subject or object). Focused nouns were also realised with a wider pitch and often realised in their own prosodic phrase compared to the same item in non-focal contexts. There was also significant syllable lengthening at the right edge of in-focus words. In utterance-initial contexts, post-focal material in Nafsan was almost always produced in a relatively compressed pitch range and there was evidence of de-phrasing of non-focal nouns regardless of utterance position, suggesting prosodic phrasing patterns similar to other languages with edge-marking prominence.
  • Item
    Unlocking the archives
    Barwick, L ; Thieberger, N ; Ferreira, V ; Ostler, N (FEL, 2018)
    The popular expression ‘locked in the archive’ suggests that items are impossible to find and access once they are archived. Benefiting from new technologies, digital language and music archives nowadays provide an increasing number of records online in and about the world’s small languages. Just six of these archives list between them over 31,000 items, representing something like 2,300 languages. We can certainly do better at making records more widely available—especially records from small, marginalised and sometimes isolated communities—but how do we build pathways for re-use? We discuss the practice of the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) through the rubric provided by the FAIR principles. Building resources for learning and teaching language, history and culture, revitalising local performance traditions or reinforcing social identity through broadcasting are all possible pathways for future re-use of archival material. Ultimately, it is up to community members to decide on what they will do with archival materials once they have access; and it is up to language archives to listen and do our best to keep the pathways open to enable that.
  • Item
    Building Speech Recognition Systems for Language Documentation: The CoEDL Endangered Language Pipeline and Inference System (ELPIS)
    Foley, B ; Arnold, J ; Coto-Solano, R ; Durantin, G ; Mark, E ; van Esch, D ; Heath, S ; Kratochvíl, F ; Maxwell-Smith, Z ; Nash, D ; Olsson, O ; Richards, M ; San, N ; Stoakes, H ; Thieberger, N ; Wiles, J (International Speech Communication Association, 2018-08-30)
    Machine learning has revolutionised speech technologies for major world languages, but these technologies have generally not been available for the roughly 4,000 languages with populations of fewer than 10,000 speakers. This paper describes the development of Elpis, a pipeline which language documentation workers with minimal computational experience can use to build their own speech recognition models, resulting in models being built for 16 languages from the Asia-Pacific region. Elpis puts machine learning speech technologies within reach of people working with languages with scarce data, in a scalable way. This is impactful since it enables language communities to cross the digital divide, and speeds up language documentation. Complete automation of the process is not feasible for languages with small quantities of data and potentially large vocabularies. Hence our goal is not full automation, but rather to make a practical and effective workflow that integrates machine learning technologies.
  • Item
    Acoustic correlates of prominence in Nafsan
    Billington, R ; Fletcher, J ; Thieberger, N ; Volchok, B ; Epps, J ; Wolfe, J ; Smith, J ; Jones, C (Australasian Speech Science and Technology Association, 2018)
    Though Oceanic languages are often described as preferring primary stress on penultimate syllables, many different patterns have been noted across and within language families, and may interact with segmental and phonotactic factors. This is exemplified across linguistically diverse Vanuatu. However, both impressionistic and instrumentally-based descriptions of prosodic patterns and their correlates are limited for languages of this region. This paper presents preliminary acoustic and durational results for Nafsan, an Oceanic language of Vanuatu, which suggest a preference for prominence at the right edge of words, with fundamental frequency as a primary correlate.