Abstract
This work focuses on the task of finding latent vector representations of the words in a corpus. In particular, we address the issue of what to do when there are multiple languages in the corpus. Prior work has, among other techniques, used canonical correlation analysis to project pre-trained vectors in two languages into a common space. We propose a simple and scalable method that is inspired by the notion that the learned vector representations should be invariant to translation between languages. We show empirically that our method outperforms prior work on multilingual tasks, matches the performance of prior work on monolingual tasks, and scales linearly with the size of the input data (and thus the number of languages being embedded).
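As an illustration of the prior-work baseline mentioned above (not the method proposed in this paper), canonical correlation analysis can be sketched as follows. The sketch assumes two matrices of pre-trained vectors whose rows are aligned translation pairs; the function name `cca_projections`, the `reg` ridge term, and the use of NumPy are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def cca_projections(X, Y, k, reg=1e-8):
    """Hypothetical helper: project two aligned embedding sets into a shared space.

    X: (n, dx) vectors for language 1; Y: (n, dy) vectors for language 2.
    Row i of X and row i of Y are assumed to be translations of each other.
    Returns A (dx, k), B (dy, k) and the top-k canonical correlations.
    """
    # Center both views.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Covariance and cross-covariance estimates (small ridge for stability).
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Whiten each view with a Cholesky factor: Cxx = Lx Lx^T, Cyy = Ly Ly^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)

    # M = Lx^{-1} Cxy Ly^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # Map the singular vectors back to the original coordinates.
    A = np.linalg.solve(Lx.T, U[:, :k])
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return A, B, s[:k]
```

After centering, `X @ A` and `Y @ B` place both vocabularies in the same `k`-dimensional space, where vectors of translation pairs are maximally correlated dimension by dimension.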
Original language | English (US) |
---|---|
Title of host publication | Conference Proceedings - EMNLP 2015 |
Subtitle of host publication | Conference on Empirical Methods in Natural Language Processing |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 1084-1088 |
Number of pages | 5 |
ISBN (Electronic) | 9781941643327 |
State | Published - 2015 |
Event | Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 - Lisbon, Portugal; Duration: Sep 17 2015 → Sep 21 2015 |
Publication series
Name | Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing |
---|---|
Other
Other | Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 |
---|---|
Country/Territory | Portugal |
City | Lisbon |
Period | 9/17/15 → 9/21/15 |
Bibliographical note
Publisher Copyright: © 2015 Association for Computational Linguistics.