Zhengyuan Xu and Junxiao Yu contributed equally to this work
In this article, to reduce the complexity and improve the generalization ability of current gesture recognition systems, we propose a novel SE-CNN attention architecture for sEMG-based hand gesture recognition. The proposed algorithm introduces a temporal squeeze-and-excite block into a simple CNN architecture and uses it to recalibrate the weights of the feature outputs from the convolutional layer. By enhancing important features while suppressing useless ones, the model recognizes gestures efficiently. Finally, a simple attention mechanism enhances the learned representations of sEMG signals for multi-channel sEMG-based gesture recognition. To evaluate the effectiveness and accuracy of the proposed algorithm, we conduct experiments on the multi-gesture datasets Ninapro DB4 and Ninapro DB5 using both intra-session validation and subject-wise cross-validation. Compared with previous models, the proposed algorithm is more robust and achieves better gesture recognition performance and generalization ability.
Hand gesture recognition is one of the most important perceptual channels in human-computer interaction. It is widely used in fields such as virtual reality, intelligent sign language translation for people who are deaf or mute, rehabilitation therapy and assessment, and bionic prostheses, and shows broad potential across applications [
Various research [
Early studies of sEMG gesture recognition focused on combining hand-crafted EMG signal features with machine learning classifiers. After many years of study, features that properly represent sEMG signals in the time, frequency, and time-frequency domains have been well designed and examined [
Deep learning is now well established and offers strong modeling and feature extraction capabilities [
In previous work, gesture recognition that integrates deep learning with sEMG signal features from the time, frequency, and time-frequency domains has been widely proposed. Hu et al. [
Despite the advantages of previous work, several problems remain: 1) simple architectures tend to yield low recognition rates; 2) algorithms with high recognition rates usually require highly complex architectures and have relatively poor real-time inference ability; 3) few groups do the “subject-wise [
To solve these problems and to increase the recognition rate and generalization ability, we propose a novel squeeze-and-excite CNN (SE-CNN) attention architecture for sEMG-based hand gesture recognition. The main contributions of this paper are as follows:
1) A temporal squeeze-and-excite block is introduced into the CNN framework to establish relations between channels and to enhance important features while suppressing useless ones. This improves the robustness and recognition rate of the model.
2) A simple attention mechanism is added to the end of the model to capture spatial relations between features and enhance the learned representations of multi-channel signals.
3) Batch normalization (BN) and the parametric rectified linear unit (PReLU) are added to the CNN model to strengthen its ability to approximate arbitrary functions. They also accelerate convergence, help prevent exploding and vanishing gradients, reduce overfitting, and enhance the expressive power of the CNN, thereby improving the model’s generalization ability.
4) To the best of our knowledge, the proposed hybrid strategy combining SE, CNN, and an attention mechanism is the first to be applied to sEMG-based hand gesture recognition.
The organization of the rest of the paper is as follows. The details of the hybrid model are fully described in
The proposed SE-CNN attention architecture is composed of three parts as illustrated in
The proposed basic CNN architecture consists of three components:
Three Conv1D layers: The main parameters of the first Conv1D layer are: filter
Batch Normalization [
Then, we calculate the variance of the mini-batch with:
and followed by normalization:
Then scaling and shifting are applied to
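The batch normalization steps described above (mini-batch mean, mini-batch variance, normalization, then scaling and shifting) can be sketched in plain Python. This is only a minimal illustration of the standard formulation, not the authors' Keras layer; the names `batch_norm`, `gamma`, `beta`, and `eps` are assumptions introduced here.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it.

    gamma and beta stand in for the learnable scale and shift parameters;
    eps is a small constant (assumed value) that avoids division by zero.
    """
    m = len(batch)
    mean = sum(batch) / m                            # mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / m    # mini-batch (biased) variance
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]


out = batch_norm([1.0, 2.0, 3.0, 4.0])
shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
```

With the default `gamma` and `beta`, the output has approximately zero mean and unit variance; other values rescale and shift that normalized output.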
Parametric Rectified Linear Unit [
where
Squeeze-and-Excite Network [
Finally, the resulting weights are multiplied with the feature
Then, we get
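The squeeze, excite, and recalibrate steps can be sketched in plain Python as follows. This is a hedged illustration that assumes the temporal block follows the standard squeeze-and-excite recipe (average each channel over time, pass the result through a bottleneck with ReLU and sigmoid, then rescale the channels). The function name `squeeze_excite`, the toy weights `w1` and `w2`, and the reduction ratio of 2 are assumptions, and biases are omitted for brevity.

```python
import math


def squeeze_excite(features, w1, w2):
    """features: list of C channels, each a list of T time steps.

    w1 has shape (C/r) x C and w2 has shape C x (C/r), where r is the
    reduction ratio. Returns the recalibrated features and channel weights.
    """
    # Squeeze: global average pooling over the temporal axis.
    z = [sum(channel) / len(channel) for channel in features]
    # Excite: bottleneck layer with ReLU, then a layer with sigmoid,
    # giving one weight in (0, 1) per channel.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    scale = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Recalibrate: multiply every time step of a channel by its weight.
    recalibrated = [[s * x for x in channel] for s, channel in zip(scale, features)]
    return recalibrated, scale


features = [[1.0, 3.0], [2.0, 2.0], [0.0, 0.0], [4.0, 0.0]]  # C = 4, T = 2
w1 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]            # reduction ratio r = 2
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
recalibrated, scale = squeeze_excite(features, w1, w2)
```

Channels that receive a weight close to 1 pass through almost unchanged, while channels with a weight close to 0 are suppressed.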
The schematic diagram of the attention mechanism in this article is shown in
Finally, the output
For a fair comparison with existing models, we use the same datasets as reference [
The Ninapro DB4 dataset contains muscle activity signals recorded with 12 active single-differential wireless Cometa electrodes. Eight electrodes are evenly spaced around the forearm at the level of the radio-humeral joint. Two more electrodes are placed on the main activity spots of the finger flexor and extensor muscles, and the last two on the main activity spots of the biceps and triceps. The dataset covers 10 subjects performing 52 gestures plus a rest condition (53 classes in total). The surface EMG is sampled at 2 kHz, and each movement is repeated 6 times.
The Ninapro DB5 dataset contains muscle activity signals recorded with two Thalmic Myo armbands, which together provide 16 active single-differential wireless electrodes. The first Myo armband is placed close to the elbow, with its first transducer on the radio-humeral joint. The second Myo armband is placed just after the first, closer to the hand, and rotated by 22.5 degrees. The surface EMG is sampled at 200 Hz. The dataset covers 10 subjects performing 52 gestures plus a rest condition (53 classes in total), and each movement is repeated 6 times.
Based on the algorithm by Josephs et al. [
Training Environment: All the experiments are implemented in Python 3.7, TensorFlow-GPU 2.4, and Keras 2.4 on 64-bit Windows with an Intel® Core™ i7-8700H CPU @ 3.2 GHz, 64 GB RAM, an NVIDIA GeForce GTX 1080Ti (11 GB), CUDA 10.1.243, and cuDNN v7.6.3.30. Training Details: The batch size is set to 128 and the learning rate anneals from
The details of experimental specifications for Ninapro DB4 and Ninapro DB5 datasets are presented in
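Dataset | Number | Number | Number | Number | Intact | Number | Number | Trials | Trials | Trials | Sampling |
---|---|---|---|---|---|---|---|---|---|---|---|
Ninapro DB4 | 53 | 12 | 17 | 23 | 10 | 12 | 6 | 1, 2, 4, 6 | 3 | 5 | 2000 Hz |
Ninapro DB5 | 53 | 12 | 17 | 23 | 10 | 16 | 6 | 1, 2, 4, 6 | 3 | 5 | 200 Hz |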
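Database | Method | Accuracy |
---|---|---|
Ninapro DB5, all gestures | Proposed method | 87.42% |
Ninapro DB5, all gestures | Josephs et al. [ | 87.09% |
Ninapro DB5, all gestures | Shen et al. [ | 72.09% |
Ninapro DB5, wrist and functional gestures | Proposed method | 87.80% |
Ninapro DB5, wrist and functional gestures | Josephs et al. [ | 87.13% |
Ninapro DB5, wrist and functional gestures | Wei et al. [ | |
Ninapro DB5, just wrist gestures | Proposed method | 89.54% |
Ninapro DB5, just wrist gestures | Josephs et al. [ | 89.18% |
Ninapro DB5, just wrist gestures | Chen et al. [ | 67.42% |
Ninapro DB5, just wrist gestures | Wu et al. [ | 62% |
Ninapro DB4, all gestures | Proposed method | 77.61% |
Ninapro DB4, all gestures | Josephs et al. [ | 74.88% |
Ninapro DB4, all gestures | Pizzolato et al. [ | 69.01% |
Ninapro DB4, all gestures | Wei et al. [ | 60% |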
From
Accuracy is the most straightforward metric for evaluating a model. However, accuracy can be misleading, especially for imbalanced datasets. As shown in
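Category | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Ninapro DB4 | 87936 | 1916 | 1733 | 1905 | 1712 | 1859 | 1669 | 1855 | 1718 | 1650 | 1398 | 1426 | 1660 | 1502 | 1391 | 1471 | 1416 | 1243 | 1589 | 1089 | 1431 | 1518 | 1495 | 1631 | 1596 | 1711 |
Ninapro DB5 | 129146 | 1763 | 1238 | 1606 | 1190 | 1458 | 1515 | 1591 | 1507 | 1119 | 1289 | 1087 | 1310 | 1640 | 1298 | 1290 | 1254 | 1271 | 1197 | 1461 | 1160 | 1377 | 1322 | 1528 | 1297 | 1374 |

Category | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Ninapro DB4 | 1396 | 1170 | 1685 | 1746 | 1660 | 1535 | 1692 | 1868 | 1745 | 1933 | 1742 | 1853 | 1697 | 1574 | 1700 | 1731 | 1687 | 1777 | 1794 | 1652 | 1789 | 1782 | 1711 | 1700 | 1626 | 1848 | 1699 |
Ninapro DB5 | 1177 | 1120 | 1573 | 1465 | 1296 | 1385 | 1349 | 1340 | 1466 | 1660 | 1240 | 1337 | 1388 | 1382 | 1449 | 1374 | 1357 | 1553 | 1679 | 1466 | 1508 | 1648 | 1507 | 1476 | 1527 | 1741 | 1745 |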
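The class counts above are dominated by category 0, which is why accuracy alone can mislead. The following sketch uses a toy two-class example (not the Ninapro data) to show how per-class precision, recall, and F1-score expose a classifier that only ever predicts the majority class. The helper `per_class_metrics` and the 90/10 split are assumptions made for illustration.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy imbalanced data: 90 majority samples (label 0) and 10 gesture samples (label 1).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # degenerate classifier that always predicts the majority class

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
majority = per_class_metrics(y_true, y_pred, 0)
gesture = per_class_metrics(y_true, y_pred, 1)
macro_f1 = (majority[2] + gesture[2]) / 2
```

Here the accuracy is high even though the gesture class is never recognized, whereas the macro-averaged F1-score drops sharply because every class counts equally.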
Based on the results above, the proposed algorithm achieves a relatively high recognition rate in all experiments. The recognition rate is 87.42% for all gestures in Ninapro DB5, 87.80% for wrist and functional gestures in Ninapro DB5, 89.54% for just wrist gestures in Ninapro DB5, and 77.61% for all gestures in Ninapro DB4. Compared to the algorithm proposed by Josephs et al. [
Poor generalization has always been an obstacle to the development of gesture recognition techniques. To further evaluate the generalization ability of the proposed algorithm, we use data from different subjects for training and testing. The details are shown in
From
Dataset | Method | Accuracy |
---|---|---|
Ninapro DB5, all gestures | Proposed method | 65.74% |
Ninapro DB5, all gestures | Josephs et al. [ | 62.96% |
Ninapro DB5, wrist and functional gestures | Proposed method | 66.67% |
Ninapro DB5, wrist and functional gestures | Josephs et al. [ | 64.43% |
Ninapro DB5, just wrist gestures | Proposed method | 69.47% |
Ninapro DB5, just wrist gestures | Josephs et al. [ | 66.36% |
This article introduces a novel SE-CNN attention architecture for hand gesture recognition based on surface EMG signals. Adding the temporal SE module to the output of the convolutional layers highlights significant features, establishes relationships among different channels, and strengthens the model’s ability to extract features. Additionally, the attention mechanism helps capture spatial relationships between features and enhances the learning of multi-channel representations, which further boosts robustness and recognition rates. The experimental results indicate that the proposed algorithm is valuable for gesture recognition systems. Our algorithm performs well across the evaluated settings: all gestures in Ninapro DB5 (recognition rate: 87.42%), wrist and functional gestures in Ninapro DB5 (recognition rate: 87.80%), just wrist gestures in Ninapro DB5 (recognition rate: 89.54%), and all gestures in Ninapro DB4 (recognition rate: 77.61%). The proposed algorithm achieves the best performance on Ninapro DB4 compared to previous studies. Ninapro DB5 and Ninapro DB4 differ completely in both data acquisition method and sampling frequency, yet our algorithm still achieves relatively good outcomes on both. Moreover, the proposed algorithm performs well on evaluation metrics including precision, recall, and F1-score, which suggests that our model adapts better to different datasets, places more relaxed requirements on data collection procedures, and classifies imbalanced data reasonably well, all of which support the development of gesture recognition systems.
Poor generalization is one of the important factors slowing down the development of recognition systems. Herein, we conduct an inter-subject validation on all gestures, wrist and functional gestures, and just wrist gestures from Ninapro DB5, respectively. The 10 subjects are divided into independent training, validation, and testing sets using five-fold cross-validation, yielding recognition rates of 65.74% (53 gestures), 66.67% (41 gestures), and 69.47% (18 gestures), respectively. Compared with Josephs et al. [
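The key property of subject-wise cross-validation is that no subject contributes data to both the training and the held-out side of a fold. A minimal sketch of such a split for 10 subjects and five folds is given below. The function `subject_wise_folds` and the way subjects are assigned to folds are assumptions, and the additional validation split mentioned above is omitted for brevity.

```python
def subject_wise_folds(subject_ids, n_folds=5):
    """Yield (train_subjects, held_out_subjects) pairs with no subject overlap."""
    # Deal subjects into folds round-robin; with 10 subjects and 5 folds,
    # each fold holds out exactly 2 subjects.
    folds = [subject_ids[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        train = [s for s in subject_ids if s not in held_out]
        yield train, held_out


subjects = list(range(1, 11))  # hypothetical subject IDs 1..10
splits = list(subject_wise_folds(subjects))
```

Each subject is held out exactly once across the five folds, so every reported score comes from subjects the model has not seen during training.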
Meanwhile, we find that gestures involving similar muscle activities are often confused with one another and yield unsatisfactory performance. Among the wrist gestures, examples include wrist pronation (axis: middle finger), wrist supination (axis: little finger), and wrist pronation (axis: little finger). More experiments are needed in the future, including further analysis of sEMG signals from gestures that share similar muscle activities. The neural network also needs to be further optimized to better classify gestures involving the same muscle movement, by investigating the characteristics of the corresponding sEMG signals. In inter-subject validation, although the proposed algorithm achieves a better result than previous studies, the recognition rate is still not very high. Therefore, further investigation is required to achieve better inter-subject validation performance.
In this article, we propose a novel SE-CNN attention architecture for sEMG-based hand gesture recognition. It consists of a basic CNN architecture, a temporal squeeze-and-excite architecture, and an attention mechanism. The main advantage of the proposed algorithm is the introduction of the temporal squeeze-and-excite block into a CNN-based gesture recognition architecture using sEMG signals. The temporal squeeze-and-excite block enhances important features while suppressing meaningless ones, which improves the model’s feature extraction. In addition, an attention mechanism is added at the end of the model to enhance representation learning for multi-channel signals. We conduct two experimental paradigms: 1) intra-session validation, and 2) subject-wise cross-validation. The results from intra-session validation using the Ninapro DB5 and Ninapro DB4 datasets indicate that the proposed algorithm improves system robustness in hand gesture recognition tasks and achieves a better recognition rate. Further analysis using subject-wise cross-validation on all gestures, wrist and functional gestures, and just wrist gestures in the Ninapro DB5 dataset shows that the proposed algorithm has better stability and generalization ability in multi-gesture recognition, which facilitates the adoption of intelligent interactive systems based on surface EMG in practical applications. However, the algorithm proposed in this article still has limitations, especially when classifying gestures with similar muscle activities, which hinders the application of EMG gesture recognition. The proposed algorithm therefore still has considerable room for improvement. Further investigation is required in the future into individual differences in myoelectric signals during gesture recognition.
The authors would like to thank the editor and reviewers for the valuable comments and suggestions.