Computing and Information Systems - Theses

    Generalized language identification
    LUI, MARCO (2014)
    Language identification is the task of determining the natural language that a document, or part thereof, is written in. The central theme of this thesis is generalized language identification, which deals with eliminating the assumptions that limit the applicability of language identification techniques to specific settings that may not be representative of real-world use cases. Research to date has treated language identification as a supervised machine learning problem. In this thesis I argue that such a characterization is inadequate: I show that standard document representations do not take into account the variation in a language between different sources of text, and I develop a representation that is robust to such variation. I also develop a method that allows for language identification of multilingual documents, i.e. documents that contain text in more than one language. Finally, I investigate the robustness of existing off-the-shelf language identification methods on a novel and challenging domain.