Publication | Closed Access
NOVO: Learnable and Interpretable Document Identifiers for Model-Based IR
Citations: 23
References: 12
Year: 2023
Unknown Venue
Engineering, Large Language Model, Corpus Linguistics, Text Mining, Word Embeddings, Natural Language Processing, Information Retrieval, Data Science, Computational Linguistics, Document Classification, Language Studies, Language Models, Named-entity Recognition, Model-based Information Retrieval, Machine Translation, Interpretable Document Identifiers, NOVO Docids, NLP Task, Knowledge Discovery, Computer Science, Information Extraction, Retrieval Augmented Generation, Vector Space Model, Linguistics, Model-based IR
Model-based Information Retrieval (Model-based IR) has gained attention due to advancements in generative language models. Unlike traditional dense retrieval methods, which rely on dense vector representations of documents, model-based IR uses language models to retrieve documents by generating their unique discrete identifiers (docids). This approach reduces the need to store separate document representations in an index. Most existing model-based IR approaches use pre-defined static docids, i.e., docids that are fixed and cannot be learned by training on retrieval tasks. Because these docids are not specifically optimized for retrieval, it is difficult to learn semantics and relationships between documents and to achieve satisfactory retrieval performance. To address these limitations, we propose Neural Optimized VOcabularial (NOVO) docids. NOVO docids are unique n-gram sets that identify each document. They can be generated in any order to retrieve the corresponding document, and they can be optimized through training to better capture semantics and relationships between documents. We propose to optimize NOVO docids through query denoising modeling and retrieval tasks, which optimizes both the semantic and token representations of such docids. Experiments on two datasets under the normal and zero-shot settings show that NOVO achieves strong performance and enables more effective and interpretable model-based IR.
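The idea that a docid is a unique n-gram set that can be produced in any order may be easier to see in a toy example. The sketch below is not the NOVO method: it involves no language model and no training, and the corpus, the n-grams, and the names `DOCIDS`, `build_index`, and `retrieve` are all hypothetical. It only illustrates order-invariant lookup of a document from a set of n-grams, assuming each document has already been assigned a distinct set.

```python
from typing import Dict, FrozenSet, Iterable, Optional

# Hypothetical toy corpus: each document is identified by a unique set of n-grams.
# In the paper these sets are learned; here they are simply hard-coded.
DOCIDS: Dict[str, FrozenSet[str]] = {
    "doc_a": frozenset({"neural retrieval", "document identifier", "n-gram"}),
    "doc_b": frozenset({"machine translation", "attention", "n-gram"}),
}


def build_index(docids: Dict[str, FrozenSet[str]]) -> Dict[FrozenSet[str], str]:
    """Map each n-gram set to its document, rejecting duplicate sets."""
    index: Dict[FrozenSet[str], str] = {}
    for doc, ngrams in docids.items():
        if ngrams in index:
            # Two documents sharing one set would make the docid ambiguous.
            raise ValueError(f"{doc} and {index[ngrams]} share the same n-gram set")
        index[ngrams] = doc
    return index


def retrieve(generated: Iterable[str], index: Dict[FrozenSet[str], str]) -> Optional[str]:
    """Return the document whose docid equals the generated n-grams, ignoring order."""
    return index.get(frozenset(generated))


INDEX = build_index(DOCIDS)

# The same n-grams in two different orders resolve to the same document.
first = retrieve(["n-gram", "neural retrieval", "document identifier"], INDEX)
second = retrieve(["document identifier", "n-gram", "neural retrieval"], INDEX)
```

Treating the docid as a set rather than a sequence is what makes the generation order irrelevant here. Individual n-grams such as "n-gram" may be shared across documents as long as the full sets differ.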