Open source corpus analysis tools for Malay
Author
Baldwin, Timothy; Awab, Su'ad
Date
2006
Source Title
Proceedings, the 5th International Conference on Language Resources and Evaluation (LREC2006)
University of Melbourne Author/s
Baldwin, Timothy
Affiliation
Engineering: Department of Computer Science and Software Engineering
Document Type
Conference Paper
Citations
Baldwin, T., & Awab, S. (2006). Open source corpus analysis tools for Malay. In Proceedings, the 5th International Conference on Language Resources and Evaluation (LREC2006), Genoa, Italy.
Access Status
Open Access
Abstract
Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital furtherment of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K word sample of Malay text.
Keywords
Malay; tokeniser; lemmatiser; morphological analyser