Abstract
Accelerating data collection, annotation, and analysis is an urgent need in linguistic fieldwork and the documentation of endangered languages (Bird, 2009). We describe experiments aimed at maximizing the quality of a chunking model for syntactic complement structures in Nepal Bhasa. Native-speaker language consultants were trained to annotate a minimally selected raw data set (Suarez et al., 2019), marking embedded clauses, matrix verbs, and embedded verbs. We train models with both statistical learning algorithms (Naive Bayes and MaxEnt) and transfer learning, the latter by fine-tuning the pre-trained mBERT model (Devlin et al., 2018). We show that, even with limited annotated data, the resulting model is sufficient for the task. The modeling resources we used are largely available for many other endangered languages, so the approach is easy to replicate when training a shallow parser for them.
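
As a rough illustration of the statistical side of this setup, the sketch below trains a small Naive Bayes token tagger that labels matrix verbs, embedded-clause tokens, and embedded verbs. It is not the authors' implementation: the training sentences are toy English stand-ins rather than Nepal Bhasa data, the flat tag names (`MV`, `EC`, `EV`, `O`) are hypothetical, and the feature set (current, previous, and next word) is an assumption chosen for brevity.

```python
import math
from collections import Counter, defaultdict

# Toy English stand-in data, NOT taken from the paper's corpus.
# Hypothetical tags: MV = matrix verb, EC = embedded-clause token,
# EV = embedded verb, O = outside any annotated span.
TRAIN = [
    [("she", "O"), ("said", "MV"), ("that", "EC"), ("he", "EC"), ("left", "EV")],
    [("they", "O"), ("know", "MV"), ("that", "EC"), ("she", "EC"), ("sings", "EV")],
    [("he", "O"), ("thinks", "MV"), ("that", "EC"), ("they", "EC"), ("won", "EV")],
]


def features(tokens, i):
    """Context features for token i: the word itself and its two neighbours."""
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    return [f"w={tokens[i]}", f"prev={prev_word}", f"next={next_word}"]


class NaiveBayesTagger:
    """Multinomial Naive Bayes over token features, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, sentences):
        for sent in sentences:
            tokens = [word for word, _ in sent]
            for i, (_, label) in enumerate(sent):
                self.label_counts[label] += 1
                for feat in features(tokens, i):
                    self.feat_counts[label][feat] += 1
                    self.vocab.add(feat)
        return self

    def predict_token(self, tokens, i):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log P(label) + sum of log P(feature | label)
            score = math.log(count / total)
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            for feat in features(tokens, i):
                score += math.log((self.feat_counts[label][feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def tag(self, tokens):
        return [self.predict_token(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    tagger = NaiveBayesTagger().fit(TRAIN)
    sentence = ["she", "said", "that", "he", "left"]
    print(list(zip(sentence, tagger.tag(sentence))))
```

A chunker built this way classifies each token independently. A MaxEnt model or a fine-tuned mBERT token classifier would fill the same role with richer features or contextual embeddings.
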
Original language | English (US) |
---|---|
Title of host publication | COMPUTEL 2022 - 5th Workshop on the Use of Computational Methods in the Study of Endangered Languages, Proceedings of the Workshop |
Editors | Sarah Moeller, Antonios Anastasopoulos, Antti Arppe, Aditi Chaudhary, Atticus Harrigan, Josh Holden, Jordan Lachler, Alexis Palmer, Shruti Rijhwani, Lane Schwartz |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 61-67 |
Number of pages | 7 |
ISBN (Electronic) | 9781955917308 |
State | Published - 2022 |
Event | 5th Workshop on the Use of Computational Methods in the Study of Endangered Languages, COMPUTEL 2022 - Dublin, Ireland; Duration: May 26 2022 → May 27 2022 |
Publication series
Name | COMPUTEL 2022 - 5th Workshop on the Use of Computational Methods in the Study of Endangered Languages, Proceedings of the Workshop |
---|---|
Conference
Conference | 5th Workshop on the Use of Computational Methods in the Study of Endangered Languages, COMPUTEL 2022 |
---|---|
Country/Territory | Ireland |
City | Dublin |
Period | 5/26/22 → 5/27/22 |
Bibliographical note
Funding Information: We thank our Nepal Bhasa native speaker consultants for their time and effort in helping us with the annotation.
Publisher Copyright:
© 2022 Association for Computational Linguistics.