Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Abstract

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Keywords: romanization, transliteration, South Asian languages
