Publication | Closed Access
Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus
Citations: 23
References: 32
Year: 2007
Unknown Venue
Cluster Computing, Large Corpus, Engineering, Big Data Indexing, Feature Index, Semantic Web, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Data Integration, Language Studies, Data Management, Index Partitioning, Content-based Document Routing, Document Routing, Similarity Search, Knowledge Discovery, Text Indexing, Computer Science, Distributed Query Processing, Data Indexing, Search Engine Indexing, Indexing Technique, Linguistics
We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case in which similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this approach suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. Likewise, a similarity-based search queries only a small number of partitions. Our approach is stateless and incremental: the decision as to which partitions a document's features are routed to (for storage at ingestion time and for similarity-based search at query time) is based solely on the features of the document.
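The abstract does not say how partitions are chosen, so the following is only a minimal sketch of one way stateless, content-based routing could work, not the authors' method. It assumes a min-hash-style rule: hash every feature, take the smallest hash values, and map them to partition numbers. The names `feature_hash`, `route`, `PartitionedIndex`, and the parameters `num_partitions` and `fanout` are all hypothetical.

```python
import hashlib


def feature_hash(feature: str) -> int:
    """Deterministic 64-bit hash of a feature (stable across processes)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(features, num_partitions, fanout=2):
    """Pick up to `fanout` partitions using only the document's features.

    Assumed rule (not taken from the paper): walk the feature hashes in
    ascending order and map each to a partition until `fanout` distinct
    partitions are found. No global state is consulted.
    """
    chosen = []
    for h in sorted({feature_hash(f) for f in features}):
        partition = h % num_partitions
        if partition not in chosen:
            chosen.append(partition)
        if len(chosen) == fanout:
            break
    return chosen


class PartitionedIndex:
    """In-memory stand-in for feature indexes hosted on separate servers."""

    def __init__(self, num_partitions, fanout=2):
        self.num_partitions = num_partitions
        self.fanout = fanout
        # One inverted index per partition: feature -> set of document ids.
        self.partitions = [dict() for _ in range(num_partitions)]

    def ingest(self, doc_id, features):
        """Store all features of the document in each chosen partition."""
        for p in route(features, self.num_partitions, self.fanout):
            for f in set(features):
                self.partitions[p].setdefault(f, set()).add(doc_id)

    def search(self, query_features):
        """Return {doc_id: shared feature count}, querying few partitions."""
        best = {}
        for p in route(query_features, self.num_partitions, self.fanout):
            counts = {}
            for f in set(query_features):
                for doc_id in self.partitions[p].get(f, ()):
                    counts[doc_id] = counts.get(doc_id, 0) + 1
            # A document may live in several queried partitions; keep the
            # largest per-partition overlap instead of summing duplicates.
            for doc_id, n in counts.items():
                best[doc_id] = max(best.get(doc_id, 0), n)
        return best


if __name__ == "__main__":
    index = PartitionedIndex(num_partitions=8, fanout=2)
    index.ingest("doc1", ["alpha", "beta", "gamma", "delta"])
    print(index.search(["alpha", "beta", "gamma", "delta"]))
```

Because `route` depends only on the feature set, ingestion and querying reach the same partitions for identical documents without any shared routing table, which is what makes the sketch stateless and lets documents be added incrementally. A query that shares only some features with a stored document may be routed to different partitions and miss it; how well a real routing rule avoids this is outside what the sketch shows.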