Publication | Closed Access
Reinforced Structured State-Evolution for Vision-Language Navigation
Citations: 36
References: 31
Year: 2022
Artificial Intelligence, Language Grounding, Navigation State, Structured Prediction, Engineering, Sequential Learning, Intelligent Systems, Embodied Agent, Natural Language Processing, Multimodal LLM, Visual Grounding, Robot Learning, Machine Translation, Cognitive Science, Machine Vision, Vision Robotics, Vision Language Model, Computer Science, World Model, Computer Vision, Maintained Navigation State, Structured State-evolution, Linguistics
The Vision-and-Language Navigation (VLN) task requires an embodied agent to navigate to a remote location by following a natural language instruction. Previous methods usually adopt a sequence model (e.g., Transformer or LSTM) as the navigator. In this paradigm, the sequence model predicts an action at each step from a maintained navigation state, which is generally represented as a one-dimensional vector. However, the crucial navigation clues (i.e., the object-level environment layout) for embodied navigation are discarded, since the maintained vector is essentially unstructured. In this paper, we propose a novel Structured state-Evolution (SEvol) model to effectively maintain the environment layout clues for VLN. Specifically, we utilise a graph-based feature to represent the navigation state instead of a vector-based state. Accordingly, we devise a Reinforced Layout clues Miner (RLM) to mine and detect the most crucial layout graph for long-term navigation via a customised reinforcement learning strategy. Moreover, the Structured Evolving Module (SEM) is proposed to maintain the structured graph-based state during navigation, where the state is gradually evolved to learn the object-level spatial-temporal relationships. Experiments on the R2R and R4R datasets show that the proposed SEvol model improves VLN models' performance by large margins, e.g., +3% absolute SPL for NvEM and +8% for EnvDrop on the R2R test set.
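The gains above are reported in SPL (Success weighted by Path Length). As a minimal sketch of how this metric is commonly computed, assuming the standard definition in which each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path, the calculation might look as follows. The function name `spl` and the toy episode values are illustrative only and are not taken from the paper.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Assumes the standard definition: a failed episode scores 0, a
    successful one scores l / max(p, l), where l is the shortest-path
    length and p is the length of the path the agent actually took.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # Paths longer than the shortest path are penalised proportionally.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Toy example (hypothetical values): one optimal success, one success
# with a path twice as long as necessary, and one failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
print(f"SPL = {score:.2f}")
```

On this 0 to 1 scale, an absolute gain of +3% corresponds to an increase of 0.03 in the averaged score.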