Publication | Closed Access

Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

Citations: 2K | References: 35 | Year: 2015

TLDR

Books provide fine‑grained visual details and high‑level semantic states that evolve through a narrative. This work seeks to align books with their movie releases to supply rich, semantically detailed visual explanations beyond existing captions. The authors train an unsupervised neural sentence embedding from a large book corpus, pair it with a video‑text neural embedding for clip‑sentence similarity, and fuse the signals with a context‑aware CNN. The resulting system attains strong quantitative alignment performance and yields diverse qualitative examples demonstrating its utility across multiple tasks.
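As a rough illustration of the first ingredient, the sketch below embeds a sentence by averaging per-word vectors. This is only a stand-in: the paper trains skip-thought sentence vectors on its book corpus in an unsupervised way, and the `vocab`, `word_vectors`, and `embed_sentence` names here are hypothetical.

```python
import numpy as np

# Illustrative stand-in only: the paper learns skip-thought sentence
# vectors from a book corpus; here we just average random word vectors.
rng = np.random.default_rng(0)
D = 300  # hypothetical embedding dimension
vocab = {"the": 0, "door": 1, "creaked": 2, "open": 3}
word_vectors = rng.normal(size=(len(vocab), D))  # untrained, random

def embed_sentence(sentence: str) -> np.ndarray:
    """Mean of word vectors; out-of-vocabulary words are skipped."""
    ids = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return word_vectors[ids].mean(axis=0) if ids else np.zeros(D)

print(embed_sentence("The door creaked open").shape)  # (300,)
```

A learned encoder would replace the random table with vectors trained so that a sentence's embedding predicts its neighboring sentences in the book.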

Abstract

Books are a rich source of both fine-grained information, such as what a character, an object, or a scene looks like, and high-level semantics, such as what someone is thinking or feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we propose a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
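To make the alignment step concrete, here is a minimal sketch, assuming clips and sentences have already been embedded into a shared space. Every name in it (`clip_emb`, `sent_emb`, the fixed smoothing kernel) is a hypothetical placeholder; in particular, the fixed kernel only gestures at the paper's context-aware CNN, which is learned rather than hand-set.

```python
import numpy as np

# Hypothetical inputs: M movie clips and N book sentences, each already
# embedded into a shared D-dimensional video-text space.
rng = np.random.default_rng(0)
M, N, D = 4, 6, 300
clip_emb = rng.normal(size=(M, D))  # stand-in for learned clip embeddings
sent_emb = rng.normal(size=(N, D))  # stand-in for learned sentence embeddings

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarities between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# sim[i, j] scores clip i against sentence j.
sim = cosine_similarity_matrix(clip_emb, sent_emb)

# Crude context step: average each score with its neighbors along the
# book axis, so an alignment is supported by surrounding sentences too.
kernel = np.array([0.25, 0.5, 0.25])
context_sim = np.apply_along_axis(
    lambda row: np.convolve(row, kernel, mode="same"), 1, sim
)

# Greedy decoding: each clip picks its best-scoring sentence.
print(context_sim.argmax(axis=1))
```

The actual model learns both embedding spaces, combines information from multiple sources, and replaces the fixed kernel with a trained CNN; the snippet only mirrors the overall data flow from similarities to an alignment.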
