School of Languages and Linguistics - Research Publications

Search Results

Now showing 1 - 10 of 11
  • Item
    Community-Led Documentation of Nafsan (Erakor, Vanuatu)
    Krajinovic, A ; Billington, R ; Emil, L ; Kaltapau, G ; Thieberger, N ; Vetulani, Z ; Paroubek, P ; Kubis, M (SPRINGER INTERNATIONAL PUBLISHING AG, 2022)
    We focus on a collaboration between community members and visiting linguists in Erakor, Vanuatu, aiming to build the capacity of community-based researchers to undertake and sustain documentation of Nafsan, the local indigenous language. We focus on the technical and procedural skills required to collect, manage, and work with audio and video data, and give an overview of the outcomes of community-led documentation after initial training. We discuss the benefits and challenges of this type of project from the perspective of the community researchers and the external linguists. We show that community-led documentation such as this project in Erakor, in which data management and archiving are incorporated into the documentation process, has crucial benefits for both the community and the linguists. The two most salient benefits are: a) long-term documentation of linguistic and cultural practices calibrated towards the community’s needs, and b) collection of larger quantities of data by community members, often of better quality and scope than those collected by visiting linguists, which, besides being readily available for research, have great potential for training and testing emerging language technologies for less-resourced languages, such as Automatic Speech Recognition (ASR).
  • Item
    Reflections on software and technology for language documentation
    Arkhipov, A ; Thieberger, N (2020-01-01)
    Technological developments in the last decades have enabled an unprecedented growth in volumes and quality of collected language data. Emerging challenges include ensuring the longevity of the records, making them accessible and reusable for fellow researchers as well as for the speech communities. These records are robust research data on which verifiable claims can be based and on which future research can be built, and are the basis for revitalization of cultural practices, including language and music performance. Recording, storage and analysis technologies are becoming more lightweight and portable, allowing language speakers to actively participate in documentation activities. This also results in growing needs for training and support, and thus more interaction and collaboration between linguists, developers and speakers. Both cutting-edge speech technologies and crowdsourcing methods can be effectively used to overcome bottlenecks between different stages of analysis. While the endeavour to develop a single all-purpose integrated workbench for documentary linguists may not be achievable, investing in robust open interchange formats that can be accessed and enriched by independent pieces of software seems more promising for the near future.
  • Item
    When Your Data is My Grandparents Singing. Digitisation and Access for Cultural Records, the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)
    Thieberger, N ; Harris, A (Ubiquity Press, Ltd., 2022-04-04)
    In this paper we discuss the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), a research repository that explicitly aims to act as a conduit for research outputs to a range of audiences, both within and outside of academia. PARADISEC has been operating for 19 years, and has grown to hold over 390,000 files currently totaling 150 terabytes and representing 1,312 languages, many of them from Papua New Guinea and the Pacific. Our focus is on recordings and transcripts in the many small languages of the world, the songs and stories that are unique cultural expressions. While this research data is created for a particular project, it has huge value beyond academic research as it is typically oral tradition recorded in places where little else has been recorded. There is an increasing focus in academia on reproducible research and research data management, and repositories are the key to successful data management. We discuss the importance for research practice of having discipline-specific repositories. The data in our work is also cultural material that has value to the people recorded and their descendants; it is their grandparents, and so we, as outsider researchers, have special responsibilities to treat the materials with respect and to ensure they are accessible to the people we have worked with.
  • Item
    Digital curation and access to recordings of traditional cultural performance
    Thieberger, N ; Harris, A (UNESCO, 2021)
    Being home to over a quarter of the world’s languages, the Pacific is a particularly good place to focus on how language records can be made accessible. The creation and description of research records has not always been a priority for humanities academics and any records that are created have typically not been provided with good archival solutions. This is despite these records often being of cultural or historical relevance beyond academia. Many cultural agencies struggle to keep track of recordings they have made, and it is the same for many researchers. Often it is only when researchers prepare recordings for archiving that they realize how many (or few) are described adequately, or have been transcribed or translated.
  • Item
    The Pacific Expansion: Optimizing phonetic transcription of archival corpora
    Billington, R ; Stoakes, H ; Thieberger, N (ISCA-INT SPEECH COMMUNICATION ASSOC, 2021)
    For most of the world’s languages, detailed phonetic analyses across different aspects of the sound system do not exist, due in part to limitations in available speech data and tools for efficiently processing such data for low-resource languages. Archival language documentation collections offer opportunities to extend the scope and scale of phonetic research on low-resource languages, and developments in methods for automatic recognition and alignment of speech facilitate the preparation of phonetic corpora based on these collections. We present a case study applying speech modelling and forced alignment methods to narrative data for Nafsan, an Oceanic language of central Vanuatu. We examine the accuracy of the forced-aligned phonetic labelling based on limited speech data used in the modelling process, and compare acoustic and durational measures of 17,851 vowel tokens for 11 speakers with previous experimental phonetic data for Nafsan. Results point to the suitability of archival data for large-scale studies of phonetic variation in low-resource languages, and also suggest that this approach can feasibly be used as a starting point in expanding to phonetic comparisons across closely-related Oceanic languages.
  • Item
    Breathing digital life into Oceanic language corpora
    Vernaudon, J ; Thieberger, N ; Bambridge, T ; Parent, T (OpenEdition, 2021-01-01)
  • Item
    Be Not Like the Wind: Access to Language and Music Records, Next Steps
    Thieberger, N ; Harris, A (European Language Resources Association (ELRA), 2020)
    Language archives play an important role in keeping records of the world’s languages safe. Accessible audio recordings held in archives can be used by speakers of small and endangered languages, and their communities, and provide a base for further research and documentation. There is an urgent need for historical analog tape recordings to be located and digitised, as they will soon be unplayable. PARADISEC holds records in 1228 languages. We run training for language documentation and are developing technologies to localise access to language records. A concerted effort is needed to support language archives and sustain language diversity.
  • Item
    The Language Documentation Quartet
    Musgrave, S ; Thieberger, N (University of Colorado at Boulder, 2021)
    As we noted in an earlier paper (Musgrave & Thieberger 2012), the written description of a language is an essentially hypertextual exercise, linking various kinds of material in a dense network. An aim based on that insight is to provide a model that can be implemented in tools for language documentation, allowing instantiation of the links always followed in writing a grammar or a dictionary, tracking backwards and forwards to the texts and media as the source of authority for claims made in an analysis. Our earlier paper described our initial efforts to encode Heath’s (1984) grammar, texts (1980), and dictionary (1982) of Nunggubuyu, an Australian language from eastern Arnhemland. We chose this body of work because it was written with many internal links between the three volumes. The links are all encoded with textual indexes which looked to be ready to be instantiated as automated hyperlinks once the technology was available. In this paper, we discuss our progress in identifying how the four component parts of a description (grammar, text, dictionary, media, henceforth the quartet) can be interlinked, what are the logical points at which to join them, and whether there are practical limits to how far this linking should be carried. We suggest that the problems which are exposed in this process can inform the development of an abstract or theoretical data structure for each of the components and these in turn can provide models for language documentation work which can feed into hypertext presentations of the type we are developing.
  • Item
    Technology in Support of Languages of The Pacific: Neo-Colonial or Post-Colonial?
    Thieberger, N (Logos Verlag, 2020)
    The Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) has been digitising recordings of traditional cultural expression, oral tradition, and music (TCE) for 17 years. A major motivation for this work is the return of these recordings to where they were made. On the one hand there is social justice in preserving records of languages that are under-represented on the internet and in cultural institutions, and making them accessible in what can be characterised as a postcolonial restitution of these records. On the other hand, if it is first world academics doing this work, it risks being yet another colonial appropriation of Indigenous knowledge. In this paper I explore some of these issues to help set directions both for our own work, and for future similar projects. “From ancient times to the present, disquieting use has been made of archival records to establish, document, and perpetuate the influence of power elites.” (Jimerson, 2007: 254). A quarter of the world’s languages are found in the Pacific. In communities sustained over many hundreds of years by local economies, the globalised world impinges through urbanisation and encroaching metropolitan languages, particularly in media, accelerating language change and language shift. Technology, in the form of computers, digital files, and ways of working with them, is a first world product; access to it is costly, and the interface to it is never in a local language but always in a major metropolitan language. Training and experience in using technology is not easily obtained, leading to a divide between those who are able to use it and those who are consumers of it, typically via expensive internet connections. How can a new kind of archival enterprise “establish, document, and perpetuate” the languages and their speakers, in order to counter what Jimerson calls the influence of power elites?
  • Item
    Phonetic evidence for phonotactic change in Nafsan (South Efate)
    Billington, R ; Thieberger, N ; Fletcher, J (Pacini Editore SpA, 2020)
    Nafsan, an Oceanic language of central Vanuatu, is notable for the complex phonotactic structures it exhibits compared to languages spoken further to the north, and compared to the general preference for CV syllables among Oceanic languages. Various types of heterorganic consonant clusters are found in syllable onsets, and are thought to have arisen from the loss of selected medial vowels. Medial vowel deletion is suggested to be a process of change which has been underway for some time in the language, but the details of how this process operates have not been fully clear. Unresolved questions relating to the status of length in the vowel system and the location of lexical prominence have posed a challenge to arriving at a detailed description of vowel deletion and its consequences. Drawing together recent phonetic analyses and previous work, this paper provides an overview of phonotactic structures in contemporary Nafsan and outlines the main factors which lead to the deletion of medial vowels and result in the complex syllable onsets observed today.