Publication | Open Access
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Year: 2020
Citations: 32
References: 24
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Keywords: romanization, transliteration, South Asian languages