Publication | Open Access
BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection
Citations: 210 · References: 33 · Year: 2019
Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework for finding subtle combinations of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes both the concept of rank and that of mode ranks for tensors, both already used for multimodal fusion. It makes it possible to define new ways of optimizing the tradeoff between the expressiveness and the complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful mono-modal representations. We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures for representing relevant interactions between modalities. Through extensive experiments, we show that BLOCK compares favorably with state-of-the-art multimodal fusion models on both VQA and VRD tasks. Our code is available at https://github.com/Cadene/block.bootstrap.pytorch.
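To make the parameter-count argument concrete, here is a minimal numpy sketch of block-superdiagonal bilinear fusion in the spirit the abstract describes: each modality is projected into a block-structured latent space, and each small core tensor fuses only its own pair of chunks. All dimensions and names below are illustrative assumptions, not values or code from the paper or its repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (chosen for illustration, not from the paper)
dim_x, dim_y, dim_out = 16, 16, 12
n_blocks, chunk_x, chunk_y, chunk_out = 3, 4, 4, 4  # per-block mode ranks

# Input projections onto the block-structured latent space
Wx = rng.standard_normal((dim_x, n_blocks * chunk_x))
Wy = rng.standard_normal((dim_y, n_blocks * chunk_y))

# One small core tensor per block: together they form the
# block-superdiagonal of the full (otherwise quadratic) core tensor
cores = rng.standard_normal((n_blocks, chunk_x, chunk_y, chunk_out))

def block_fusion(x, y):
    """Fuse two mono-modal vectors via block-superdiagonal bilinear cores."""
    hx = (x @ Wx).reshape(n_blocks, chunk_x)
    hy = (y @ Wy).reshape(n_blocks, chunk_y)
    # Each block computes a bilinear interaction between its own chunks
    z = np.einsum('bi,bj,bijo->bo', hx, hy, cores)
    # Concatenate the per-block outputs into one fused vector
    return z.reshape(-1)

x = rng.standard_normal(dim_x)
y = rng.standard_normal(dim_y)
z = block_fusion(x, y)  # fused vector of length n_blocks * chunk_out

# A full bilinear core would need dim_x * dim_y * dim_out parameters;
# the block-superdiagonal one needs only n_blocks * chunk_x * chunk_y * chunk_out.
full_params = dim_x * dim_y * dim_out
block_params = cores.size
```

The superdiagonal structure is what tames the quadratic growth: cross-block interactions are dropped, so expressiveness can be traded against complexity by varying the number and size of the blocks.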