
Open Access

ARTICLE

Secondary Realignment: An Embodied Intelligent Operational Framework Integrating Vision-Language and Action Two-Stage Models

Jinjiang Lin, Yuan Lu, Han Li, Xiaolong Cai, Enyi Chen, Jiansheng Guan*
School of Electrical Engineering and Automation, Xiamen University of Technology, Xiamen, China
* Corresponding Author: Jiansheng Guan. Email: email

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.077916

Received 19 December 2025; Accepted 03 March 2026; Published online 29 April 2026

Abstract

Manipulating objects according to verbal commands in cluttered environments remains a critical challenge in robotic-arm research. Verbal commands are semantically abstract, whereas precise grasping and placement rely on fine-grained geometric perception; the gap between these two domains is the primary cause of operational errors. In cluttered scenes in particular, visual-spatial noise and background redundancy further disrupt attention distribution, significantly degrading the generalization of existing methods in unseen environments. To address these issues, this paper proposes the Secondary Realignment (SR) framework, which decouples vision-language alignment and vision-action alignment into two stages, mitigating semantic-geometric discrepancies hierarchically and thereby substantially reducing cross-modal mapping errors. To handle noise and redundancy in visual-language features, we further design a Deep Sparse Self-Attention (DSSA) module that dynamically fuses sparse and dense attention mechanisms through self-learned parameters, adaptively enhancing relevant features while suppressing irrelevant noise. Extensive simulation results show that, compared with the state-of-the-art method A2, our approach achieves 9.7%, 9.9%, and 17.6% higher task success rates on grasping, placing, and pick-and-place tasks, respectively, validating its effectiveness.
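The abstract describes DSSA as dynamically fusing sparse and dense attention through self-learned parameters. The paper's actual architecture is not given here, so the following is only a minimal NumPy sketch of one plausible reading: a dense softmax-attention branch, a top-k sparse branch, and a learnable scalar gate (here `alpha`, a hypothetical parameter name) blending the two. All function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dssa_sketch(Q, K, V, alpha=0.0, k=2):
    """Illustrative gated fusion of dense and top-k sparse attention.

    `alpha` stands in for a self-learned mixing logit: sigmoid(alpha)
    weights the dense branch, 1 - sigmoid(alpha) the sparse branch.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # (n, n) scaled similarities
    dense = softmax(scores)                 # dense attention weights

    # Sparse branch: keep only the top-k scores per query row,
    # masking the rest to -inf so softmax assigns them zero weight
    thresh = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    sparse = softmax(masked)

    g = 1.0 / (1.0 + np.exp(-alpha))        # sigmoid gate in [0, 1]
    weights = g * dense + (1.0 - g) * sparse
    return weights @ V
```

In this reading, driving the gate toward the sparse branch suppresses attention mass on background-clutter tokens, while the dense branch preserves global context; the gate is trained jointly with the rest of the network.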

Keywords

Robot grasping; vision-language model; language-conditioned grasping; attention mechanism