Aman Aman Ullah1,2,#, Yanfeng Wu1,#, Shaheryar Najam3, Nouf Abdullah Almujally4, Ahmad Jalal5,6,*, Hui Liu1,7,8,*
CMC-Computers, Materials & Continua, Vol.87, No.1, 2026, DOI:10.32604/cmc.2025.072508
10 February 2026
Abstract Human object detection and recognition is essential for elderly monitoring and assisted living; however, models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings. To address this, we present SCENET-3D, a transformer-driven multimodal framework that unifies human-centric skeleton features with scene-object semantics for intelligent robotic vision through a three-stage pipeline. In the first stage, scene analysis, rich geometric and texture descriptors are extracted from RGB frames, including surface-normal histograms, angles between neighboring normals, Zernike moments, directional standard deviation, and Gabor-filter responses. In the second stage, scene-object analysis, non-human objects…