Publication | Closed Access
Gazetteer Enhanced Named Entity Recognition for Code-Mixed Web Queries
Citations: 61
References: 11
Year: 2021
Venue: Unknown
Topics: Engineering, Semantic Web, Corpus Linguistics, Code-mixed Queries, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Computational Linguistics, Entity Recognition, Language Engineering, Code-mixed Web Queries, Language Studies, Query Expansion, Named-entity Recognition, Machine Translation, Entity Disambiguation, NLP Task, Knowledge Discovery, Web Queries, Semantic Parsing, Linguistics
Named entity recognition (NER) for Web queries is very challenging. Queries often do not consist of well-formed sentences and contain very little context, so the queried entities are highly ambiguous. Code-mixed queries, in which entities appear in a different language than the rest of the query, pose a particular challenge in domains like e-commerce (e.g., queries containing movie or product names). This work tackles NER for code-mixed queries, where entities and non-entity query terms co-exist in different languages. Our contributions are twofold. First, to address the lack of code-mixed NER data, we create EMBER, a large-scale dataset in six languages with four different scripts. Based on Bing query data, it includes numerous language combinations that showcase real-world search scenarios. Second, we propose a novel gated architecture that enhances existing multi-lingual Transformers with a Mixture-of-Experts model to dynamically infuse multi-lingual gazetteers, allowing the model to differentiate and handle entities and non-entity query terms in multiple languages at the same time. Experimental evaluation on code-mixed queries in several languages shows that our approach efficiently uses gazetteers to recognize entities in code-mixed queries, reaching an F1 of 68%, an absolute improvement of +31% over a non-gazetteer baseline.
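The gating idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no equations, so everything below is an assumption made for illustration, including the n-gram gazetteer lookup, the two-expert softmax gate, the function names (`gazetteer_features`, `gated_mixture`), the entity types, and the example query. Random vectors stand in for the contextual token representations that a multi-lingual Transformer would produce.

```python
import numpy as np

def gazetteer_features(tokens, gazetteer, types, max_len=4):
    """Multi-hot entity-type vector per token, set wherever an n-gram
    (up to max_len tokens) matches a gazetteer entry. Hypothetical helper."""
    feats = np.zeros((len(tokens), len(types)))
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break
            span = " ".join(tokens[i:i + n]).lower()
            for t in gazetteer.get(span, ()):
                feats[i:i + n, types.index(t)] = 1.0
    return feats

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_mixture(h_ctx, h_gaz, W_gate):
    """Per-token mixture of two 'experts' (assumed form):
    a contextual representation and a gazetteer representation."""
    gate_in = np.concatenate([h_ctx, h_gaz], axis=-1)  # (T, 2d)
    weights = softmax(gate_in @ W_gate)                # (T, 2), rows sum to 1
    fused = weights[:, :1] * h_ctx + weights[:, 1:] * h_gaz
    return fused, weights

# Hypothetical code-mixed query: English terms around a Spanish title.
tokens = "watch la casa de papel online".split()
types = ["CREATIVE_WORK", "PRODUCT"]                   # assumed label set
gazetteer = {"la casa de papel": ["CREATIVE_WORK"]}    # assumed entry

rng = np.random.default_rng(0)
d = 8
feats = gazetteer_features(tokens, gazetteer, types)
h_ctx = rng.normal(size=(len(tokens), d))   # stand-in for Transformer output
W_proj = rng.normal(size=(len(types), d))   # untrained projection (random)
W_gate = rng.normal(size=(2 * d, 2))        # untrained gate (random)
h_gaz = feats @ W_proj
fused, weights = gated_mixture(h_ctx, h_gaz, W_gate)
```

In a trained model the gate and projection would be learned jointly with the tagger, so the mixture weights could shift toward the gazetteer expert on tokens where the query context is uninformative. With the random weights used here, the values only demonstrate the shapes and data flow.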