Publication | Open Access
A Hybrid Approach for Transliterated Word-Level Language Identification
Citations: 22
References: 27
Year: 2015
Unknown Venue
Keywords: Engineering, Corpus Linguistics, Text Mining, Natural Language Processing, Language Documentation, Information Retrieval, Word-level Language, Computational Linguistics, Language Engineering, Language Studies, Named-entity Recognition, Machine Translation, Hybrid Approach, Linguistics, Cross-language Retrieval, Computer Science, Token Level Bangla, Language Recognition, Roman Script, Text Processing, Speech Translation
In this paper, we describe a hybrid approach to word-level language (WLL) identification of Bangla words written in Roman script and mixed with English words, developed as part of our participation in the shared task on transliterated search at the Forum for Information Retrieval Evaluation (FIRE) in 2014. The WLL identification task is handled by a CRF-based machine learning model followed by post-processing heuristics. In addition to language identification, two transliteration systems were built to convert the detected Bangla words from Roman script into native Bangla script. The system achieved an overall token-level language identification accuracy of 0.905. The token-level language identification F-scores for Bangla and English are 0.899 and 0.920, respectively. The two transliteration systems achieved accuracies of 0.062 and 0.037. The word-level language identification system presented in this paper obtained the best scores on almost all metrics among all participating systems for the Bangla-English language pair.
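The token-level accuracy and per-language F-scores quoted in the abstract can be illustrated with a small sketch. It assumes the standard definitions (accuracy as the fraction of correctly labelled tokens, F-score as the harmonic mean of precision and recall for one language label); the shared task's official evaluation script may differ in detail. The labels `"B"` and `"E"`, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of tokens whose predicted language label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def f_score(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 for a single language label (harmonic mean of precision and recall)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: "B" = Bangla token, "E" = English token.
gold_labels = ["B", "E", "B", "E", "B"]
pred_labels = ["B", "E", "E", "E", "B"]

accuracy = token_accuracy(gold_labels, pred_labels)
bangla_f = f_score(gold_labels, pred_labels, "B")
english_f = f_score(gold_labels, pred_labels, "E")
```

On this toy input one Bangla token is mislabelled as English, so four of five tokens are correct; the reported figures of 0.905, 0.899 and 0.920 would come from applying the same kind of computation to the full test set.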