
Open Access

ARTICLE

Secondary Realignment: An Embodied Intelligent Operational Framework Integrating Vision-Language and Action Two-Stage Models

Jinjiang Lin, Yuan Lu, Han Li, Xiaolong Cai, Enyi Chen, Jiansheng Guan*
School of Electrical Engineering and Automation, Xiamen University of Technology, Xiamen, China
* Corresponding Author: Jiansheng Guan. Email: email

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.077916

Received 19 December 2025; Accepted 03 March 2026; Published online 29 April 2026

Abstract

Manipulating objects according to verbal commands in cluttered environments remains a critical challenge in robotic-arm research. Verbal commands are semantically abstract, whereas precise grasping and placement rely on fine-grained geometric perception; the gap between these two domains is the primary cause of operational errors. In cluttered scenes in particular, visual-spatial noise and background redundancy further disrupt attention distribution, significantly degrading the generalization of existing methods in unseen environments. To address these issues, this paper proposes the Secondary Realignment (SR) framework, which decouples vision-language alignment and vision-action alignment into two stages, mitigating semantic-geometric discrepancies hierarchically and thereby substantially reducing cross-modal mapping errors. To handle noise and redundancy in visual-language features, we further design a Deep Sparse Self-Attention (DSSA) module that dynamically fuses sparse and dense attention mechanisms through self-learned parameters, adaptively enhancing relevant features while suppressing irrelevant noise. Extensive simulation results show that, compared with the state-of-the-art method A2, our approach achieves 9.7%, 9.9%, and 17.6% higher task success rates on grasping, placing, and pick-and-place tasks, respectively, validating its effectiveness.
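The abstract describes DSSA as dynamically fusing sparse and dense attention through self-learned parameters. The paper's actual architecture is not given here, so the following is only a minimal NumPy sketch of one plausible reading: a dense softmax-attention branch, a top-k sparse branch, and a learnable scalar gate (here `alpha`, a hypothetical parameter name) blending the two. All function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dssa_sketch(Q, K, V, alpha=0.0, k=2):
    """Illustrative gated fusion of dense and top-k sparse attention.

    `alpha` stands in for a self-learned mixing logit: sigmoid(alpha)
    weights the dense branch, 1 - sigmoid(alpha) the sparse branch.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # (n, n) scaled similarities
    dense = softmax(scores)                 # dense attention weights

    # Sparse branch: keep only the top-k scores per query row,
    # masking the rest to -inf so softmax assigns them zero weight
    thresh = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    sparse = softmax(masked)

    g = 1.0 / (1.0 + np.exp(-alpha))        # sigmoid gate in [0, 1]
    weights = g * dense + (1.0 - g) * sparse
    return weights @ V
```

In this reading, driving the gate toward the sparse branch suppresses attention mass on background-clutter tokens, while the dense branch preserves global context; the gate is trained jointly with the rest of the network.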

Keywords

Robot grasping; vision-language model; language-conditioned grasping; attention mechanism