
Open Access

ARTICLE

Intelligent Human Interaction Recognition with Multi-Modal Feature Extraction and Bidirectional LSTM

Muhammad Hamdan Azhar1,2,#, Yanfeng Wu1,#, Nouf Abdullah Almujally3, Shuaa S. Alharbi4, Asaad Algarni5, Ahmad Jalal2,6, Hui Liu1,7,8,*
1 Guodian Nanjing Automation Co., Ltd., Nanjing, 600268, China
2 Faculty of Computing and AI, Air University, Islamabad, 44000, Pakistan
3 Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, 11671, Saudi Arabia
4 Department of Information Technology, College of Computer, Qassim University, Buraydah, 52571, Saudi Arabia
5 Department of Computer Sciences, Faculty of Computing and Information Technology, Northern Border University, Rafha, 91911, Saudi Arabia
6 Department of Computer Science and Engineering, College of Informatics, Korea University, Seoul, 02841, Republic of Korea
7 Jiangsu Key Laboratory of Intelligent Medical Image Computing, School of Future Technology, Nanjing University of Information Science and Technology, Nanjing, 210044, China
8 Cognitive Systems Lab, University of Bremen, Bremen, 28359, Germany
* Corresponding Author: Hui Liu. Email: email
# These authors contributed equally to this work
(This article belongs to the Special Issue: Advances in Image Recognition: Innovations, Applications, and Future Directions)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2025.071988

Received 17 August 2025; Accepted 22 October 2025; Published online 29 December 2025

Abstract

Recognizing human interactions in RGB videos is a critical task in computer vision, with applications in video surveillance. Existing deep-learning architectures achieve strong results but are computationally intensive, sensitive to changes in video resolution, and prone to failure in crowded scenes. We propose a novel hybrid system that is computationally efficient, robust to degraded video quality, and able to filter out irrelevant individuals, making it suitable for real-world use. The system leverages multi-modal handcrafted features to represent interactions and a deep learning classifier to capture complex dependencies. Using Mask R-CNN and YOLO11-Pose, we extract grayscale silhouettes and keypoint coordinates of the interacting individuals, filtering out irrelevant individuals with a proposed algorithm. From these, we extract silhouette-based features (local ternary patterns and histograms of optical flow) and keypoint-based features (distances, angles, and velocities) that capture distinct spatial and temporal information. A bidirectional long short-term memory (BiLSTM) network then classifies the interactions. Extensive experiments on the UT-Interaction, SBU Kinect Interaction, and ISR-UOL 3D social activity datasets show that our system achieves competitive accuracy, and they further validate the effectiveness of the chosen features and classifier, as well as the system's computational efficiency and robustness to occlusion.
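To make the described pipeline concrete, the following is a minimal sketch of the keypoint branch and the BiLSTM classifier. It is illustrative only: the paper's exact feature definitions, dimensions, and network hyperparameters (hidden size, layer count, number of classes) are not given in the abstract and are assumed here; `keypoint_features` and `BiLSTMClassifier` are hypothetical names, and the angle and velocity formulations are one plausible reading of "distances, angles and velocities".

```python
import numpy as np
import torch
import torch.nn as nn

def keypoint_features(kp_a, kp_b):
    """Per-frame keypoint features for a pair of interacting people.

    kp_a, kp_b: (T, K, 2) arrays of 2D keypoints over T frames
    (e.g., K = 17 COCO joints from a pose estimator such as YOLO11-Pose).
    Returns a (T, D) matrix of inter-person distances, joint angles,
    and frame-to-frame velocities. Assumed formulation, not the paper's.
    """
    # Inter-person distances: joint-wise Euclidean distance between the two people
    dists = np.linalg.norm(kp_a - kp_b, axis=-1)                      # (T, K)

    # Joint angles: orientation of each joint relative to the person's centroid
    def angles(kp):
        centroid = kp.mean(axis=1, keepdims=True)                     # (T, 1, 2)
        v = kp - centroid
        return np.arctan2(v[..., 1], v[..., 0])                       # (T, K)
    ang = np.concatenate([angles(kp_a), angles(kp_b)], axis=1)        # (T, 2K)

    # Velocities: frame-to-frame displacement magnitude of each joint
    def velocity(kp):
        v = np.zeros(kp.shape[:2])
        v[1:] = np.linalg.norm(np.diff(kp, axis=0), axis=-1)
        return v                                                      # (T, K)
    vel = np.concatenate([velocity(kp_a), velocity(kp_b)], axis=1)    # (T, 2K)

    return np.concatenate([dists, ang, vel], axis=1).astype(np.float32)

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over per-frame feature vectors (hyperparameters assumed)."""
    def __init__(self, feat_dim, hidden=128, n_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (B, T, feat_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # logits from the final time step

# Toy usage: 30 frames, 17 joints per person, 6 interaction classes
kp_a = np.random.rand(30, 17, 2).astype(np.float32)
kp_b = np.random.rand(30, 17, 2).astype(np.float32)
feats = torch.from_numpy(keypoint_features(kp_a, kp_b)).unsqueeze(0)  # (1, 30, 85)
model = BiLSTMClassifier(feat_dim=feats.shape[-1])
logits = model(feats)                                                 # (1, 6)
```

The bidirectional recurrence reads the frame sequence in both temporal directions, so each frame's representation is informed by both earlier and later motion, which is what lets the classifier capture the temporal dependencies the abstract refers to.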

Keywords

Human interaction recognition; keypoint coordinates; grayscale silhouettes; bidirectional long short-term memory network