
Open Access

ARTICLE

Transformer-Driven Multimodal Framework for Human-Object Detection and Recognition in Intelligent Robotic Surveillance

Aman Ullah1,2,#, Yanfeng Wu1,#, Shaheryar Najam3, Nouf Abdullah Almujally4, Ahmad Jalal5,6,*, Hui Liu1,7,8,*
1 Guodian Nanjing Automation Co., Ltd., Nanjing, 210003, China
2 Department of Biomedical Engineering, Riphah International University, I-14, Islamabad, 44000, Pakistan
3 Department of Electrical Engineering, Bahria University, H-11, Islamabad, 44000, Pakistan
4 Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, 11671, Saudi Arabia
5 Department of Computer Science, Air University, E-9, Islamabad, 44000, Pakistan
6 Department of Computer Science and Engineering, College of Informatics, Korea University, Seoul, 02841, Republic of Korea
7 Jiangsu Key Laboratory of Intelligent Medical Image Computing, School of Artificial Intelligence (School of Future Technology), Nanjing University of Information Science and Technology, Nanjing, 210003, China
8 Cognitive Systems Lab, University of Bremen, Bremen, 28359, Germany
* Corresponding Authors: Ahmad Jalal and Hui Liu
# These authors contributed equally to this work
(This article belongs to the Special Issue: Advances in Object Detection and Recognition)

Computers, Materials & Continua. https://doi.org/10.32604/cmc.2025.072508

Received 28 August 2025; Accepted 29 October 2025; Published online 26 December 2025

Abstract

Human-object detection and recognition is essential for elderly monitoring and assisted living; however, models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings. To address this, we present SCENET-3D, a transformer-driven multimodal framework that unifies human-centric skeleton features with scene-object semantics for intelligent robotic vision through a three-stage pipeline. In the first stage, scene analysis, rich geometric and texture descriptors are extracted from RGB frames, including surface-normal histograms, angles between neighboring normals, Zernike moments, directional standard deviation, and Gabor-filter responses. In the second stage, scene-object analysis, non-human objects are segmented and represented using local feature descriptors and complementary surface-normal information. In the third stage, human-pose estimation, silhouettes are processed through an enhanced MoveNet to obtain 2D anatomical keypoints, which are fused with depth information and converted into RGB-based point clouds to construct pseudo-3D skeletons. Features from all three stages are fused and fed into a transformer encoder with multi-head attention to resolve visually similar activities. Experiments on UCLA (95.8%), ETRI-Activity3D (89.4%), and CAD-120 (91.2%) demonstrate that combining pseudo-3D skeletons with rich scene-object fusion significantly improves generalizable activity recognition, enabling safer elderly care, natural human–robot interaction, and robust context-aware robotic perception in real-world environments.
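The fusion step described above lends itself to a short illustration. The following is a minimal sketch, not the authors' published implementation: it assumes three per-stage feature vectors (scene, object, and skeleton descriptors) whose dimensions, layer counts, and names are all illustrative, projects each into a shared embedding space, and routes the resulting tokens through a standard PyTorch transformer encoder with multi-head attention before activity classification.

```python
# Minimal sketch (not the paper's code) of the abstract's fusion stage:
# per-stage feature vectors become tokens in a shared embedding space and
# are encoded with multi-head attention. All dimensions and names are
# illustrative assumptions.
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, scene_dim=256, obj_dim=128, skel_dim=96,
                 d_model=128, n_heads=4, n_layers=2, n_classes=10):
        super().__init__()
        # Project each modality's descriptor to the shared token width.
        self.proj_scene = nn.Linear(scene_dim, d_model)
        self.proj_obj = nn.Linear(obj_dim, d_model)
        self.proj_skel = nn.Linear(skel_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, scene_feat, obj_feat, skel_feat):
        # Stack the three modality embeddings as a 3-token sequence so
        # multi-head attention can weigh scene context against pose cues.
        tokens = torch.stack([
            self.proj_scene(scene_feat),
            self.proj_obj(obj_feat),
            self.proj_skel(skel_feat),
        ], dim=1)                       # (batch, 3, d_model)
        encoded = self.encoder(tokens)  # (batch, 3, d_model)
        return self.classifier(encoded.mean(dim=1))  # pool tokens, classify

# Example: one batch of fused features from the three pipeline stages.
model = FusionEncoder()
logits = model(torch.randn(4, 256), torch.randn(4, 128), torch.randn(4, 96))
print(logits.shape)  # torch.Size([4, 10])
```

Treating each modality as a token lets attention learn, per activity, how much to rely on pose versus scene-object evidence, which is the intuition behind resolving visually similar activities.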

Keywords

Human-object detection; elderly care; RGB-based pose estimation; scene context analysis; object recognition; Gabor features; point cloud reconstruction