Open Access

ARTICLE


ARNet: Integrating Spatial and Temporal Deep Learning for Robust Action Recognition in Videos

Hussain Dawood1, Marriam Nawaz2, Tahira Nazir3, Ali Javed2, Abdul Khader Jilani Saudagar4,*, Hatoon S. AlSagri4

1 School of Computing, Skyline University College, Sharjah, 1797, United Arab Emirates
2 Department of Software Engineering, University of Engineering and Technology-Taxila, Punjab, 47050, Pakistan
3 Department of Software Engineering and Computer Science, Riphah International University-Gulberg Green Campus, Islamabad, 46000, Pakistan
4 Information Systems Department, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, 11432, Saudi Arabia

* Corresponding Author: Abdul Khader Jilani Saudagar

Computer Modeling in Engineering & Sciences 2025, 144(1), 429-459. https://doi.org/10.32604/cmes.2025.066415

Abstract

Reliable human action recognition (HAR) in video sequences is critical for a wide range of applications, such as security surveillance, healthcare monitoring, and human-computer interaction. Several automated systems have been designed for this purpose; however, existing methods, such as two-stream networks and 3D convolutional neural networks (CNNs), often struggle to effectively integrate spatial and temporal information from input samples, which limits their accuracy in discriminating among numerous human actions. Therefore, this study introduces a novel deep-learning framework, ARNet, designed for robust HAR. ARNet consists of two main modules: a refined InceptionResNet-V2-based CNN and a bidirectional long short-term memory (Bi-LSTM) network. The refined InceptionResNet-V2 employs the parametric rectified linear unit (PReLU) activation within its convolutional layers to enhance spatial feature extraction from individual video frames. PReLU improves the spatial information-capturing ability of the approach because its learnable parameters adaptively control the slope of the negative part of the activation function, allowing richer gradient flow during backpropagation and thus more robust feature capture and stable model training. These spatial features, which hold essential pixel characteristics, are then processed by the Bi-LSTM module for temporal analysis, which helps ARNet understand the dynamic behavior of actions over time. ARNet adds three dense layers after the Bi-LSTM module to ensure a comprehensive computation of both spatial and temporal patterns and to further strengthen the feature representation. The model is experimentally validated on three benchmark datasets, HMDB51, KTH, and UCF Sports, achieving accuracies of 93.82%, 99%, and 99.16%, respectively. The precision values for the HMDB51, KTH, and UCF Sports datasets are 97.41%, 99.54%, and 99.01%; the recall values are 98.87%, 98.60%, and 99.08%; and the F1-scores are 98.13%, 99.07%, and 99.04%, respectively. These results highlight the robustness of ARNet and its potential as a versatile tool for accurate HAR across various real-world applications.
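For readers who want a concrete picture of the spatial-temporal pipeline described above, the following is a minimal TensorFlow/Keras sketch, not the authors' implementation. The clip length, image size, LSTM units, dense-layer widths, and 51-class output are illustrative assumptions; note also that PReLU activations (PReLU(x) = max(0, x) + a·min(0, x), with learnable slope a) are applied here only in the dense head, whereas the paper refines the convolutional layers of InceptionResNet-V2 themselves.

    # Illustrative ARNet-style pipeline (assumptions noted above; not the authors' code).
    import tensorflow as tf
    from tensorflow.keras import layers, models

    NUM_FRAMES, H, W, NUM_CLASSES = 16, 299, 299, 51  # assumed clip/label configuration

    # Spatial module: per-frame feature extraction with InceptionResNet-V2.
    backbone = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights="imagenet", pooling="avg")

    clip = layers.Input(shape=(NUM_FRAMES, H, W, 3))
    frame_feats = layers.TimeDistributed(backbone)(clip)  # (batch, frames, 1536)

    # Temporal module: Bi-LSTM over the sequence of frame-level features.
    temporal = layers.Bidirectional(layers.LSTM(256))(frame_feats)

    # Three dense layers (with PReLU) refine the joint spatial-temporal representation.
    x = layers.Dense(512)(temporal)
    x = layers.PReLU()(x)
    x = layers.Dense(256)(x)
    x = layers.PReLU()(x)
    x = layers.Dense(128)(x)
    x = layers.PReLU()(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = models.Model(clip, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.summary()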

Keywords

Action recognition; Bi-LSTM; computer vision; deep learning; InceptionResNet-V2; PReLU

Cite This Article

APA Style
Dawood, H., Nawaz, M., Nazir, T., Javed, A., Saudagar, A.K.J. et al. (2025). ARNet: Integrating Spatial and Temporal Deep Learning for Robust Action Recognition in Videos. Computer Modeling in Engineering & Sciences, 144(1), 429–459. https://doi.org/10.32604/cmes.2025.066415
Vancouver Style
Dawood H, Nawaz M, Nazir T, Javed A, Saudagar AKJ, AlSagri HS. ARNet: Integrating Spatial and Temporal Deep Learning for Robust Action Recognition in Videos. Comput Model Eng Sci. 2025;144(1):429–459. https://doi.org/10.32604/cmes.2025.066415
IEEE Style
H. Dawood, M. Nawaz, T. Nazir, A. Javed, A. K. J. Saudagar, and H. S. AlSagri, “ARNet: Integrating Spatial and Temporal Deep Learning for Robust Action Recognition in Videos,” Comput. Model. Eng. Sci., vol. 144, no. 1, pp. 429–459, 2025. https://doi.org/10.32604/cmes.2025.066415



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.