Computing and Information Systems - Theses

Search Results

Now showing 1 - 2 of 2
  • Item
    Lexical Semantics of the Long Tail
    Wada, Takashi (2023-12)
    Natural language data is characterised by a variety of long-tail instances. For instance, whilst there is an abundance of text data on the web for major languages such as English, there is a dearth of data for a great number of minor languages. Furthermore, the corpus data in each language usually consists of a very small number of high-frequency words and a plethora of long-tail expressions that are not commonly used in text, such as scientific jargon and multiword expressions. Generally, these long-tail instances draw little attention from the research community, largely because research interest is often biased towards a handful of resource-rich languages and towards models' overall performance on a specific task, which is, in many cases, not heavily influenced by the long-tail instances in text. In this thesis, we aim to shed light on the long-tail instances in language and explore NLP models that represent their lexical semantics effectively. In particular, we focus on three types of long-tail instances, namely extremely low-resource languages, rare words, and multiword expressions. Firstly, for extremely low-resource languages, we propose a new cross-lingual word embedding model that works well with very limited data, and show its effectiveness on the task of aligning semantically equivalent words between high- and low-resource languages. For evaluation, we conduct experiments involving three endangered languages, namely Yongning Na, Shipibo-Konibo and Griko, and demonstrate that our model performs well on real-world language data. Secondly, with regard to rare words, we first investigate how well recent embedding models capture lexical semantics in general on lexical substitution, where, given a target word in context, a model is tasked with retrieving its synonymous words.
    To this end, we propose a new lexical substitution method that makes effective use of existing embedding models, and show that it performs very well on English and Italian, especially for retrieving low-frequency substitutes. We also reveal two limitations of current embedding models: (1) they are highly affected by morphophonetic and morphosyntactic biases, such as article–noun agreement in English and Italian; and (2) they often represent rare words poorly when those words are segmented into multiple subwords. To address the second limitation, we propose a new method that performs very well in predicting synonyms of rare words, and demonstrate its effectiveness on lexical substitution and lexical simplification. Lastly, to represent multiword expressions (MWEs) effectively, we propose a new method that paraphrases MWEs with more literal expressions that are easier to understand, e.g. "swan song" with "final performance". Compared to previous approaches that resort to human-crafted resources such as dictionaries, our model is fully unsupervised and relies on monolingual data only, making it applicable to resource-poor languages. For evaluation, we perform experiments in two high-resource languages (English and Portuguese) and one low-resource language (Galician), and demonstrate that our model generates high-quality paraphrases of MWEs in all three languages, and that it helps pre-trained sentence embedding models encode sentences containing MWEs by paraphrasing those MWEs with literal expressions.
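The lexical substitution setting described in this abstract can be illustrated with a toy embedding-based ranker. This is a minimal sketch, not the thesis's proposed method: the word vectors below are invented for the example, and a real system would use contextualised embeddings and a much larger vocabulary.

```python
import math

# Hypothetical word vectors, invented for illustration only.
EMBEDDINGS = {
    "happy":  [0.9, 0.1, 0.0],
    "glad":   [0.85, 0.15, 0.05],
    "joyful": [0.8, 0.2, 0.1],
    "table":  [0.0, 0.9, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def substitutes(target, k=2):
    """Rank candidate substitutes for `target` by embedding similarity
    and return the top k. In lexical substitution proper, the candidates
    would also be conditioned on the sentential context of the target."""
    tv = EMBEDDINGS[target]
    scored = [(w, cosine(tv, EMBEDDINGS[w]))
              for w in EMBEDDINGS if w != target]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]
```

With these toy vectors, `substitutes("happy")` ranks "glad" and "joyful" above the unrelated "table", which is the behaviour a substitution model is evaluated on, typically against human-annotated gold substitutes.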
  • Item
    Discovering syntactic phenomena with and within precision grammars
    Letcher, Ned (2018)
    Precision grammars are hand-crafted computational models of human languages that are capable of parsing text to yield syntactic and semantic analyses. They are valuable for applications requiring the accurate extraction of semantic relationships, and they also enable hypothesis testing of holistic grammatical theories over quantities of text impossible to analyse manually. Their capacity to generate linguistically accurate analyses over corpus data also supports another application: augmenting linguistic descriptions with query facilities for retrieving examples of syntactic phenomena. In order to construct such queries, it is first necessary to identify the signature of the target syntactic phenomena within the analyses produced by the precision grammar in use. This is often a difficult process, however, as analyses in the descriptive grammar can diverge from those in the precision grammar due to differing theoretical assumptions made by the two resources, the use of different sets of data to inform their respective analyses, and the exigencies of implementing large-scale formalised analyses. In this thesis, I present my research into developing methods for improving the discoverability of syntactic phenomena within precision grammars. This includes the construction of a corpus annotated with syntactic phenomena, which supports the development of syntactic phenomenon discovery methodologies. Within this context, I also investigate strategies for measuring inter-annotator agreement over textual annotations in which annotators both segment and label the text, a setting that traditional kappa-like measures do not support. The second facet of my research involves the development of an interactive methodology, and an accompanying implementation, for navigating the alignment between dynamic characterisations of syntactic phenomena and the internal components of HPSG precision grammars associated with those phenomena.
    In addition to supporting the enhancement of descriptive grammars with precision grammars, this methodology has the potential to improve the accessibility of precision grammars themselves, enabling people not involved in their development to explore their internals through the lens of familiar syntactic phenomena, as well as allowing grammar engineers to navigate their grammars through the lens of analyses different from those found in the grammar.
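The limitation of kappa-like agreement measures mentioned in this abstract is easiest to see from the measure itself: Cohen's kappa presumes that both annotators label the same fixed set of items, so the two label sequences align one-to-one. The sketch below is a standard textbook formulation of Cohen's kappa (not the measure developed in the thesis), with invented example labels; when annotators also choose their own segment boundaries, no such one-to-one alignment exists, and the formula cannot be applied directly.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the SAME fixed set of items.

    The key assumption: len(labels_a) == len(labels_b), with position i in
    both sequences referring to the same item. Annotation tasks where each
    annotator segments the text independently violate this assumption."""
    assert len(labels_a) == len(labels_b), "items must align one-to-one"
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, two annotators who label four pre-segmented spans as ["NP", "VP", "NP", "PP"] and ["NP", "VP", "VP", "PP"] agree on 3 of 4 items, but kappa discounts chance agreement and yields roughly 0.64.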