Publication | Closed Access
"ye word kis lang ka hai bhai?" Testing the Limits of Word level Language Identification
Citations: 34
References: 18
Year: 2014
Engineering, Part-of-speech Tagging, Multilingualism, Psycholinguistics, Language Variation, Multilingual Pretraining, Language Learning, Corpus Linguistics, Hai Bhai, Text Mining, Applied Linguistics, Natural Language Processing, Language Documentation, Social Media, Data Science, Language Testing, Computational Linguistics, Language Acquisition, Language Studies, Machine Translation, Codemixed Text, Language Recognition, Code-mixed Dataset, Linguistics
Language identification is a necessary prerequisite for processing any user-generated text whose language is unknown. It becomes even more challenging when the text is code-mixed, i.e., when two or more languages are used within the same text. Such data is common on social media, where contractions and transliterations pose further challenges. Existing language identification systems are not designed to deal with code-mixed text and, as our experiments show, perform poorly on a synthetically created code-mixed dataset covering 28 languages. We propose extensions to an existing approach for word-level language identification. Our technique not only outperforms the existing methods but also makes no assumption about the language pairs mixed in the text, a common requirement of existing word-level language identification systems. This study shows that word-level language identification is most likely to confuse languages that are linguistically related (e.g., Hindi and Gujarati, Czech and Slovak), for which special disambiguation techniques might be required.
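To make the idea of word-level language identification concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a simple character-trigram model with add-one smoothing that labels each token independently. The class name `WordLevelIdentifier`, the language labels `en` and `hi`, and the tiny training word lists are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"^{word.lower()}$"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class WordLevelIdentifier:
    """Toy per-word classifier (an illustrative assumption, not the paper's system)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram frequencies
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # all n-grams seen in any language

    def train(self, lang, words):
        counter = self.counts.setdefault(lang, Counter())
        for word in words:
            counter.update(char_ngrams(word, self.n))
        self.totals[lang] = sum(counter.values())
        self.vocab.update(counter)

    def score(self, lang, word):
        """Log-probability of the word's n-grams under one language (add-one smoothing)."""
        counter = self.counts[lang]
        denom = self.totals[lang] + len(self.vocab)
        return sum(math.log((counter[g] + 1) / denom)
                   for g in char_ngrams(word, self.n))

    def identify(self, word):
        """Return the highest-scoring language for a single word."""
        return max(self.counts, key=lambda lang: self.score(lang, word))

    def tag(self, sentence):
        """Label every whitespace-separated token independently."""
        return [(tok, self.identify(tok)) for tok in sentence.split()]


# Hypothetical toy training data: English words and romanized Hindi words.
identifier = WordLevelIdentifier()
identifier.train("en", ["the", "word", "language", "which", "this", "what", "brother"])
identifier.train("hi", ["ye", "kis", "ka", "hai", "bhai", "kya", "nahi", "mera"])

if __name__ == "__main__":
    for token, lang in identifier.tag("ye word kis lang ka hai bhai"):
        print(token, lang)
```

Because each token is scored against every trained language, a sketch like this makes no assumption about which language pair is mixed. It also hints at why closely related languages are hard to separate: they share many character n-grams, so their scores for a given word tend to be close.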