Publication | Closed Access
Crosslingual Information Retrieval System Enhanced with Transliteration Generation and Mining
Citations: 20
References: 14
Year: 2010
Engineering, Intelligent Information Retrieval, Multilingual Pretraining, Transliteration Generation, Corpus Linguistics, Text Mining, Natural Language Processing, Language Documentation, Information Retrieval, Computational Linguistics, Language Engineering, Language Studies, Machine Translation, Terminology Extraction, Cross-language Retrieval, Crosslingual Information Retrieval, Retrieval Augmented Generation, MSR India, Microsoft Research India, Linguistics
This report documents the participation of Microsoft Research India (MSR India) in the Crosslingual Information Retrieval (CLIR) evaluation organized by the Forum for Information Retrieval Evaluation 2010 [FIRE 2010]. MSR India participated in two crosslingual evaluation tasks, Hindi-English and Tamil-English, in addition to the English-English monolingual task. Our core CLIR engine employed a language-modeling approach, using query-likelihood document ranking and a probabilistic translation lexicon learned from English-Hindi and English-Tamil parallel corpora. In addition, we employed two techniques to handle out-of-vocabulary (OOV) terms in the crosslingual runs: first, generating transliterations directly or transitively, and second, mining possible transliteration equivalents from the documents retrieved in the first pass. We show experimentally that each of these techniques significantly improved the overall retrieval performance of our crosslingual IR system. Using all of the topic, description, and narrative information, our system achieved a peak mean average precision (MAP) of 0.5133 in the monolingual English-English task; in the crosslingual tasks, our systems achieved peak MAPs of 0.4977 in Hindi-English and 0.4145 in Tamil-English. The post-task analyses indicate that mining appropriate transliterations from the top results of the first-pass retrieval enhanced the overall crosslingual performance of our system and also improved performance on more individual queries. Our Hindi-English crosslingual retrieval performance was nearly equal (~97%) to the English-English monolingual retrieval performance, indicating the effectiveness of our approaches to handling OOV terms in enhancing the baseline performance of our CLIR system.
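The abstract describes query-likelihood ranking combined with a probabilistic translation lexicon, but gives no formulas. The following is a minimal sketch of one common way to combine the two, not the authors' implementation. It assumes Jelinek-Mercer smoothing with a hypothetical weight `lam`, and every name, the toy lexicon, and the toy documents are illustrative only. The source-language query terms are ASCII stand-ins.

```python
import math
from collections import Counter

def score(query_terms, doc_tokens, coll_counts, coll_len, lexicon, lam=0.7):
    """Log query likelihood of a source-language query given a target-language document.

    Assumed model (not taken from the report):
      P(q | D) = sum over target words w of t(q | w) * P(w | D)
      P(w | D) = lam * count(w, D) / |D| + (1 - lam) * count(w, C) / |C|
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_p = 0.0
    for q in query_terms:
        p_q = 0.0
        for w, t in lexicon.get(q, {}).items():
            p_doc = doc_counts[w] / doc_len if doc_len else 0.0
            p_coll = coll_counts[w] / coll_len
            p_q += t * (lam * p_doc + (1 - lam) * p_coll)
        if p_q == 0.0:
            # Query term with no usable translation (an OOV term): skipped here.
            # This is the gap that transliteration generation and mining target.
            continue
        log_p += math.log(p_q)
    return log_p

def rank(query_terms, docs, lexicon, lam=0.7):
    """Return document ids sorted by descending score."""
    coll_counts = Counter(tok for toks in docs.values() for tok in toks)
    coll_len = sum(coll_counts.values())
    scores = {
        doc_id: score(query_terms, toks, coll_counts, coll_len, lexicon, lam)
        for doc_id, toks in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: hypothetical lexicon entries t(source term | target word).
lexicon = {
    "bhaarat": {"india": 0.9, "bharat": 0.1},
    "kriket": {"cricket": 1.0},
}
docs = {
    "d1": ["india", "cricket", "team"],
    "d2": ["weather", "report", "india"],
}
ranked = rank(["bhaarat", "kriket"], docs, lexicon)
```

In this toy setup both documents mention "india", so the ranking is decided by the translation of the second query term. A query term missing from the lexicon contributes nothing to the score, which illustrates why unhandled OOV terms can hurt crosslingual retrieval.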