Video Google: a text retrieval approach to object matching in videos

TLDR

The study proposes an object and scene retrieval system that finds and localizes all occurrences of a user-outlined object in a video. Objects are represented by viewpoint-invariant region descriptors, and regions are tracked within each shot to reject unstable ones and reduce descriptor noise. Descriptor matches are pre-computed by vector quantization so that, as in text retrieval, an inverted file and document ranking return a ranked list of key frames immediately. The method is illustrated on two full-length feature films.

Abstract

We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user-outlined object in a video. The object is represented by a set of viewpoint-invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation, where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching in two full-length feature films.
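
The text-retrieval analogy in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the cluster centres (`centroids`) have already been learned, uses toy 2-D descriptors, and ranks frames by cosine similarity of tf-idf vectors. That weighting and similarity measure are common text-retrieval choices assumed here, since the abstract does not name the ranking function. All function and variable names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def quantize(descriptor, centroids):
    """Vector quantization: index of the nearest centroid (the 'visual word')."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((d - c) ** 2 for d, c in zip(descriptor, centroids[i])),
    )


def build_index(frames, centroids):
    """Build an inverted file: visual word -> {frame_id: occurrence count}."""
    index = defaultdict(dict)
    lengths = {}  # frame_id -> total number of visual words in that frame
    for frame_id, descriptors in frames.items():
        counts = Counter(quantize(d, centroids) for d in descriptors)
        lengths[frame_id] = sum(counts.values())
        for word, n in counts.items():
            index[word][frame_id] = n
    return index, lengths


def rank(query_descriptors, index, lengths, centroids):
    """Rank frames sharing a word with the query by tf-idf cosine similarity."""
    n_frames = len(lengths)
    idf = {w: math.log(n_frames / len(post)) for w, post in index.items()}

    # Squared norm of each frame's tf-idf vector, accumulated from the index.
    sq_norms = defaultdict(float)
    for w, postings in index.items():
        for frame_id, n in postings.items():
            sq_norms[frame_id] += ((n / lengths[frame_id]) * idf[w]) ** 2

    q_counts = Counter(quantize(d, centroids) for d in query_descriptors)
    q_len = sum(q_counts.values())
    q_vec = {w: (n / q_len) * idf.get(w, 0.0) for w, n in q_counts.items()}
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))

    # Only the posting lists of the query's words are visited.
    scores = defaultdict(float)
    for w, q_weight in q_vec.items():
        for frame_id, n in index.get(w, {}).items():
            scores[frame_id] += q_weight * (n / lengths[frame_id]) * idf[w]

    ranked = []
    for frame_id, dot in scores.items():
        denom = q_norm * math.sqrt(sq_norms[frame_id])
        ranked.append((frame_id, dot / denom if denom > 0 else 0.0))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy data (assumed for illustration): three visual words, three key frames.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = {
    "f1": [(0.1, 0.0), (0.9, 0.1)],
    "f2": [(0.0, 0.9), (0.1, 1.0)],
    "f3": [(1.0, 0.1), (0.9, 0.0), (0.0, 0.1)],
}
index, lengths = build_index(frames, centroids)
results = rank([(1.0, 0.0)], index, lengths, centroids)
```

Because quantization and the inverted file are built offline, a query only touches the posting lists of its own visual words, which is what makes near-immediate retrieval plausible on feature-length video.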
