Publication | Closed Access

Deep Multimodal Neural Architecture Search

Citations: 91

References: 28

Year: 2020

TLDR

Designing effective neural networks is crucial in deep multimodal learning, yet most existing approaches are task-specific and manually engineered, which limits generalization. The authors propose a generalized deep multimodal neural architecture search (MMnas) framework that automatically designs architectures for diverse multimodal tasks. They build a deep encoder-decoder backbone in which each block is a primitive operation selected from a predefined pool, attach task-specific heads, and use a gradient-based NAS algorithm to learn an architecture for each task efficiently. Across five datasets, the searched network, MMnasNet, outperforms state-of-the-art methods on visual question answering, image-text matching, and visual grounding.

Abstract

Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradient-based NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and comparative experimental results show that the obtained MMnasNet significantly outperforms existing state-of-the-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding.
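The abstract does not spell out the search procedure, so the following is only a minimal sketch of one common way a gradient-based search over an operation pool can be relaxed: each block mixes all candidate operations with softmax weights over learnable logits, and the final architecture keeps the highest-scoring operation per block. The operation names, `OPERATION_POOL`, `mixed_block`, and `derive_architecture` are hypothetical stand-ins for illustration, not the paper's primitive operations or its actual algorithm.

```python
import math

# Hypothetical operation pool: toy stand-ins, NOT the paper's primitive operations.
OPERATION_POOL = {
    "identity": lambda x: list(x),
    "double": lambda x: [2.0 * v for v in x],
    "negate": lambda x: [-v for v in x],
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_block(x, alphas):
    """Continuous relaxation of one block: a softmax-weighted sum of the
    outputs of every candidate operation. `alphas` holds one architecture
    logit per operation, in the same order as OPERATION_POOL."""
    weights = softmax(alphas)
    out = [0.0] * len(x)
    for weight, op in zip(weights, OPERATION_POOL.values()):
        candidate = op(x)
        out = [acc + weight * v for acc, v in zip(out, candidate)]
    return out


def derive_architecture(block_alphas):
    """After the search, keep the operation with the largest logit in each block."""
    names = list(OPERATION_POOL)
    chosen = []
    for alphas in block_alphas:
        best = max(range(len(alphas)), key=alphas.__getitem__)
        chosen.append(names[best])
    return chosen


if __name__ == "__main__":
    # Equal logits give each operation weight 1/3.
    print(mixed_block([1.0, 2.0], [0.0, 0.0, 0.0]))
    # Two blocks with made-up logits.
    print(derive_architecture([[0.1, 2.0, -1.0], [3.0, 0.0, 0.0]]))
```

In a real search the logits would be updated by gradient descent alongside the network weights; this sketch omits training and shows only the relaxation and the final discrete selection.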
