Multimodal Spatiotemporal Feature Map for Dynamic Gesture Recognition

Xiaorui Zhang; Xianglong Zeng; Wei Sun; Yongjun Ren; Tong Xu

doi:10.32604/csse.2023.035119

Open Access

ARTICLE

Xiaorui Zhang^1,2,3,*, Xianglong Zeng^1, Wei Sun^3,4, Yongjun Ren^1,2,3, Tong Xu^5

1 Engineering Research Center of Digital Forensics, Ministry of Education, Jiangsu Engineering Center of Network Monitoring, School of Computer and Software, Nanjing University of Information Science & Technology, Nanjing, 210044, China
2 Wuxi Research Institute, Nanjing University of Information Science & Technology, Wuxi, 214100, China
3 Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science & Technology, Nanjing, 210044, China
4 School of Automation, Nanjing University of Information Science & Technology, Nanjing, 210044, China
5 University of Southern California, Los Angeles, California, USA

* Corresponding Author: Xiaorui Zhang.

Computer Systems Science and Engineering 2023, 46(1), 671-686. https://doi.org/10.32604/csse.2023.035119

Received 08 August 2022; Accepted 26 October 2022; Issue published 20 January 2023

Abstract

Gesture recognition technology enables machines to read human gestures and has significant application prospects in human-computer interaction and sign language translation. Existing research usually uses convolutional neural networks to extract features directly from raw gesture data, but these networks are affected by the large amount of interfering information in the input and therefore fit unimportant features. In this paper, we propose a novel method for encoding spatio-temporal information that enhances the key features required for gesture recognition, such as the shape, structure, contour, position, and motion of the hand, thereby improving recognition accuracy. The method encodes an arbitrary number of gesture frames into a single spatio-temporal feature map, which is then used as the input to the neural network. This guides the model to fit important features and avoids the need for complex recurrent network structures to extract temporal features. In addition, we design two sub-networks and train the model with a sub-network pre-training strategy that trains the sub-networks first and then the entire network, so that the sub-networks neither focus too much on a single category of features nor are overly influenced by each other’s features. Experimental results on two public gesture datasets show that the proposed spatio-temporal information encoding method achieves competitive accuracy.
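
As a rough illustration of what it means to encode an arbitrary number of frames into a single map, the following sketch collapses a clip along the time axis using a time-weighted per-pixel maximum, in the style of a motion history image. The abstract does not specify the authors' encoding, so this weighting scheme and the function name `encode_frames` are assumptions made purely for illustration, not the method proposed in the paper.

```python
import numpy as np


def encode_frames(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W) clip into a single (H, W) map.

    Hypothetical scheme (not the paper's encoding): later frames get
    larger weights, so the map records both where the hand appeared
    and roughly when it was there.
    """
    frames = np.asarray(frames, dtype=np.float32)
    if frames.ndim != 3:
        raise ValueError("expected an array of shape (T, H, W)")
    t = frames.shape[0]
    # Linearly increasing weights 1/T, 2/T, ..., 1 over time.
    weights = np.arange(1, t + 1, dtype=np.float32) / t
    # Per-pixel maximum of the time-weighted frames.
    return (frames * weights[:, None, None]).max(axis=0)


if __name__ == "__main__":
    clip = np.zeros((3, 2, 2), dtype=np.float32)
    clip[0, 0, 0] = 1.0  # hand pixel in the first frame
    clip[2, 1, 1] = 1.0  # hand pixel in the last frame
    print(encode_frames(clip))
```

Because the output shape depends only on the frame size and not on the number of frames, a map of this kind can be fed to an ordinary 2D convolutional network regardless of clip length, which is the property the abstract relies on to avoid recurrent structures.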

Keywords

Dynamic gesture recognition; spatio-temporal information encoding; multimodal input; pre-training; score fusion

Cite This Article

X. Zhang, X. Zeng, W. Sun, Y. Ren and T. Xu, "Multimodal spatiotemporal feature map for dynamic gesture recognition," Computer Systems Science and Engineering, vol. 46, no. 1, pp. 671–686, 2023. https://doi.org/10.32604/csse.2023.035119

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
