Open Access
ARTICLE
Efficient Video Emotion Recognition via Multi-Scale Region-Aware Convolution and Temporal Interaction Sampling
1 College of Computer and Information Engineering, Nanjing Tech University, Nanjing, 211816, China
2 College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing, 210044, China
3 College of Computer Science, Nanjing University of Information Science and Technology, Nanjing, 210044, China
4 College of Automation, Nanjing University of Information Science and Technology, Nanjing, 210044, China
5 College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, 211816, China
* Corresponding Author: Xiaorui Zhang. Email:
(This article belongs to the Special Issue: Advances in Deep Learning and Neural Networks: Architectures, Applications, and Challenges)
Computers, Materials & Continua 2026, 86(2), 1-19. https://doi.org/10.32604/cmc.2025.071043
Received 30 July 2025; Accepted 17 October 2025; Issue published 09 December 2025
Abstract
Video emotion recognition is widely used because it matches the temporal nature of human emotional expression, but existing models have significant shortcomings. On the one hand, modeling global temporal dependency with Transformer multi-head self-attention incurs high computational overhead and suffers from feature similarity. On the other hand, fixed-size convolution kernels are often used, which perceive emotion regions of different scales poorly. Therefore, this paper proposes a video emotion recognition model that combines multi-scale region-aware convolution with temporal interactive sampling. Spatially, multi-branch large-kernel stripe convolution perceives emotion region features at different scales, and attention weights are generated for each scale. Temporally, multi-layer odd-even down-sampling is applied to the time series, and the odd and even sub-sequences interact with each other to alleviate feature similarity, while computational cost is reduced because the overhead of sampling and convolution scales linearly with sequence length. The model is evaluated on CMU-MOSI, CMU-MOSEI, and Hume-Reaction, reaching Acc-2 of 83.4%, 85.2%, and 81.2%, respectively. The experimental results show that the model significantly improves the accuracy of emotion recognition.
Keywords
The expression of human emotions is a chronological process [1]. Unlike text and images, videos provide temporal association information in addition to spatial content information. Consequently, video-based emotion recognition has broad research and application value, especially in content recommendation, psychological analysis, and human-computer interaction. In recent years, an increasing number of researchers have devoted their attention to video-based emotion recognition and have made significant progress [2–4].
Recently, researchers have been exploring the application of the Transformer [5] and its improved variants, such as the Vision Transformer (ViT) [6], Swin Transformer [7], and TimeSformer [8], to video-based emotion recognition. Wang et al. [9] used ResNet with graph convolution for spatial feature extraction and incorporated TimeSformer to model global inter-frame dependency, yielding favorable results. Chaudhari et al. [10] combined contrastive learning with TimeSformer and obtained promising results in emotion recognition. The Transformer is used extensively for building global dependency because of its multi-head self-attention, but its high computational cost remains a difficulty. Even though Differential Transformer [11], Slide-Transformer [12], and other models exhibit improved self-attention, they still require extensive pairwise computation. In addition, as discovered by Shi et al. [13], applying the self-attention mechanism to temporal series introduces the issue of feature similarity: the probability matrix produced by self-attention causes the input feature matrix to converge to rank 1 at a doubly exponential rate, so similarity within the series gradually increases and recognition performance eventually degrades.
The spatial distribution of emotions in images, called the emotion region, has been consistently observed in existing studies [14]. Existing video emotion recognition models use small, single-size convolutional kernels, so they can only learn single-scale features from small regions [15,16] and cannot perceive regions at multiple scales simultaneously. As depicted in Fig. 1a–c, as the viewpoint moves from far to near, the emotion regions range widely in size from tiny to large. To accommodate this scale variation, the model must be able to extract spatial features across diverse scales. In addition, Fig. 1c,d demonstrates that a single image can contain multiple emotion regions of varying sizes simultaneously, and these emotion regions appear in discrete forms, including square and stripe arrangements. This means the model needs to perceive emotion regions of different scales and shapes.

Figure 1: Original image and the image with emotion region annotation
Researchers have shifted their focus back to convolution because of the high computational overhead of the Transformer and the issue of feature similarity in the temporal dimension. Convolution reduces computational load and accelerates convergence because its complexity scales linearly with the number of pixels in an image. Temporally, SCINet [17] employs iterative down-sampling at different temporal resolutions to extract information, leveraging convolution's ability to handle data of any resolution. Spatially, a large kernel offers a broader perspective. Large-kernel convolutions are integrated into the Transformer's general architecture to create ConvNeXt [18], which requires less training time to achieve the same results as an equivalent Transformer model. Large-kernel convolution enables a large effective receptive field, which facilitates more thorough spatial information exploration and the identification of entire emotion regions. SegNeXt [19] introduces multi-size branch convolution and enhances the original features through convolutional attention. The experimental findings show that, in the region extraction task, convolutional attention performs better than the Transformer's self-attention. The idea of multi-size branch convolution satisfies this study's requirement of recognizing and utilizing emotion regions at various scales.
Motivated by the research described above, this study proposes a novel video-based emotion recognition model. Spatially, we combine the large-kernel decomposition of VAN with the multi-scale convolution of SegNeXt and decompose the standard n × n large-kernel convolution into a pair of 1 × n and n × 1 stripe convolutions. By breaking down the traditional convolution, stripe convolution greatly reduces the computation required for large-kernel convolution. Meanwhile, attention weights are generated for the emotion regions at each scale and used to integrate the multi-scale emotion regions. Temporally, long sequences are decomposed by splitting them into odd and even frames. Unlike SCINet, each layer uses element-wise multiplication and convolution to achieve interaction between sub-sequence pairs, capturing temporal correlation information between odd and even sub-sequences with minimal overhead. Because the overhead of convolution and sampling is linear in the number of pixels, the proposed model builds global temporal dependency at lower cost.
In summary, the main contributions are as follows:
i) We propose a multi-scale region-aware module. It utilizes multi-branch large-kernel stripe convolution to extract emotion regions of varying scales and shapes, and the stripe convolution also reduces computation. Then, attention weights are produced for the emotion regions at each scale by pooling followed by fully connected layers. This allows multi-scale emotion regions to be integrated using these weights, enhancing perceptual diversity while emphasizing important features.
ii) The temporal series interactive sampling learning module is proposed. A multi-layer odd-even down-sampling is employed on the temporal series data. Each layer incorporates convolution and element-wise multiplication to facilitate interaction between pairs of odd-even sub-series, enabling the model to learn the temporal correlation information effectively. Compared to the Transformer, the proposed technique exhibits linear complexity related to the number of pixels in sampling and convolution operations. As a result, our model maintains good recognition accuracy during model training by establishing global temporal dependency with minimal computational resources.
The rest of the paper is organized as follows. Section 2 provides a concise overview of the related work associated with the model. Section 3 presents the general framework of the proposed model, followed by an in-depth elucidation of each proposed module. Section 4 describes the main experiments and verifies their validity. Section 5 summarizes the study and offers future perspectives.
This section describes in detail the work related to the proposed method in terms of both spatial information extraction and temporal information extraction.
2.1 Spatial Information Extraction
Since ResNet [20] alleviated the difficulty of training deep convolutional neural networks, convolutional neural networks have achieved important advances in feature extraction. Lin et al. [21] proposed the feature pyramid network, which decomposes feature maps into a multi-scale pyramid representation. Li et al. [22] proposed a cross-layer feature distribution module that adaptively generates weights between multi-layer features and fuses them according to the weights. Res2Net [23] splits feature maps and improves multi-scale representation by stacking small-kernel convolutions along multiple paths, thereby achieving good results in multi-scale feature extraction and fusion. However, most of these techniques use a fixed 3 × 3 convolution kernel to extract features from images, which limits the effective receptive field; consequently, performance on downstream tasks degrades. The Vision Transformer has since challenged the dominance of convolutional neural networks in computer vision.
More and more models based on the vision Transformer, such as SegFormer [24] and Swin Transformer [7], have shown performance superior to CNNs in downstream visual tasks thanks to their overall architecture and multi-head self-attention mechanism. However, multi-head self-attention is essentially dense pairwise processing, which leads to higher computational complexity. In the spatial domain, ConvNeXt [18] imitates the Transformer architecture and introduces large-kernel convolution, matching Transformer models of the same scale while reducing training time. The subsequent VAN [15] proposes the Large Kernel Attention (LKA) mechanism to establish channel and spatial attention, decomposing the large-kernel convolution to achieve it at lower computational cost. SegNeXt [19] uses convolutional attention to enhance the original features and introduces multi-size branch convolution. A series of experiments demonstrates that convolutional attention performs better in region feature extraction and enhancement than the Transformer's self-attention. Table 1 summarizes the advantages and disadvantages of several spatial information extraction models.

Inspired by the aforementioned models, we propose a multi-scale region-aware module. This module adopts the four-layer architecture of the Transformer but substitutes convolution-based multi-scale region-aware attention for self-attention to reduce the computational load and improve emotion region extraction. Large-kernel convolutions expand the receptive field and extract spatial information from a larger neighborhood within the image. At the same time, stripe convolution [25] alleviates the model's computational burden while perceiving stripe-shaped regions, so the extracted emotion regions have more diverse shapes and carry more comprehensive information. Additionally, attention weights are generated for the emotion regions at each scale by pooling followed by fully connected layers, which highlights important emotion regions and is inspired by humans' selective attention mechanism. Lastly, the multi-scale emotion regions are fused by a weighted sum.
2.2 Temporal Information Extraction
With the in-depth study of the Transformer and self-attention, VTN [26] applied a sliding window to focus on temporal correlation in the time dimension and introduced the Transformer into temporal series. Wang et al. [9] achieved good results in video emotion recognition by combining TimeSformer for global dependency modeling between frames with ResNet and graph convolution for spatial feature extraction. However, the extensive pairwise computation of self-attention, which lacks the prior knowledge of locality and translation invariance, still results in a large computational cost. Introducing the time dimension into the Transformer also brings the feature similarity problem: the self-attention probability matrix drives the input feature matrix to converge to rank 1 at a doubly exponential rate, increasing sequence similarity and thus degrading the recognition performance of the model. Shi et al. [13] discovered that convolution and residual connections help to alleviate this issue. Earlier, TCN [27] handled temporal series tasks with dilated convolutions and residual connections, extracting features directly across time steps using convolution. PredCNN [28] combines a cascade multiplication unit (CMU) with CNN to capture the inter-frame dependency of the temporal series. More recently, SimVP [29], which is built entirely on CNNs, extracts inter-frame dependency by stacking large-kernel convolutions in a bi-directional manner without any additional tricks or intricate schemes. Table 2 summarizes the advantages and disadvantages of several temporal information extraction models.

Motivated by the aforementioned methods, we present a temporal interaction sampling learning module. The long series is decomposed layer by layer using iterative odd-even down-sampling. To enable interaction between pairs of odd-even sub-series and capture the temporal correlation information within them, each layer applies convolution operations with element-wise multiplication. Moreover, the convolution serves as the interaction weights, enhancing feature diversity and mitigating the feature similarity issue of self-attention. Additionally, each sub-series undergoes another round of convolution, which blends neighboring positions and enhances information fusion. Global temporal dependency emerges as sampling and interaction proceed layer by layer. Because sampling and convolution have linear complexity in the number of pixels, in contrast to the quadratic complexity of dense pairwise computation in the Transformer, the module can model global temporal dependency efficiently.
The proposed multi-scale region-aware module spatially perceives emotion regions at various scales by employing multi-branch large-kernel stripe convolution, and attention weights are produced for the emotion regions at each scale by pooling followed by fully connected layers. This module improves the diversity of perception and highlights the prominence of focused features. To learn temporal correlation information between pairs of odd-even sub-series, a temporal approach based on iterative odd-even down-sampling and multiple convolutional interactions is proposed. The approach preserves recognition accuracy while establishing global temporal dependency with minimal overhead. The model architecture, the multi-scale region-aware module, and the temporal interaction sampling learning module are explained in detail below.
The architecture of the proposed model is shown in Fig. 2; it mainly consists of a multi-scale region-aware module and a temporal interaction sampling learning module. To obtain the combined region features for each frame, fixed-length series are dynamically sampled from the source video, down-sampled by the Stem layer, and fed to the multi-scale region-aware module. The structure of the multi-scale region-aware module is also shown in Fig. 2. The module adopts the four-layer architecture of the Transformer but replaces self-attention with convolution-based multi-scale region-aware attention for region extraction. Multi-scale region-aware attention consists of a branching convolution submodule and a region attention aggregation submodule. Using convolution kernels of various sizes on each branch, the branching convolution submodule extracts emotion regions of various scales in each frame. Subsequently, the region attention aggregation submodule calculates attention weights for the emotion regions at each scale and generates combined region features by aggregating the regions according to the weights. We build the spatial portion of the model by stacking multi-scale region-aware modules with a multi-stage pooling method, which produces a new temporal series. The temporal interaction sampling learning module then receives the new series and uses multi-layer odd-even down-sampling to progressively decompose the long series. Each layer uses convolution and element-wise multiplication to achieve interaction between pairs of odd-even sub-series and to learn the temporal correlation information between them. Additionally, by using convolution as the interaction weight, the feature similarity issue is alleviated and feature diversity is increased. Each sub-series is then subjected to a second convolution, which blends neighboring positions to improve information fusion. Finally, the deep-level series carrying global temporal dependency are concatenated.

Figure 2: Model architecture diagram
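As a rough illustration of this data flow, and not the authors' implementation, the following PyTorch sketch reduces both modules to single convolutions so that only the tensor shapes of the pipeline (frames → per-frame region features → temporal series → prediction) are visible. All class names, channel widths, and layer choices here are illustrative placeholders.

```python
# Minimal end-to-end sketch of the data flow in Fig. 2, written against PyTorch 1.11.
# The spatial and temporal parts are reduced to single convolutions; every name here
# is a stand-in, not the authors' code.
import torch
import torch.nn as nn

class VideoEmotionSketch(nn.Module):
    def __init__(self, in_ch=3, feat_dim=256, num_classes=7):
        super().__init__()
        # Stem: down-sample each frame before the multi-scale region-aware stages
        self.stem = nn.Conv2d(in_ch, 64, kernel_size=4, stride=4)
        # Stand-in for the stacked multi-scale region-aware modules
        self.spatial = nn.Sequential(
            nn.Conv2d(64, feat_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.AdaptiveAvgPool2d(1),   # one feature vector per frame
        )
        # Stand-in for the temporal interaction sampling learning module
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):              # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        x = x.view(b * t, c, h, w)
        x = self.spatial(self.stem(x)).flatten(1)      # (B*T, feat_dim)
        x = x.view(b, t, -1).transpose(1, 2)           # (B, feat_dim, T) temporal series
        x = self.temporal(x).mean(dim=-1)              # global temporal feature
        return self.head(x)

# Shape check: a batch of 2 clips, 16 frames, 224 x 224 RGB
out = VideoEmotionSketch()(torch.randn(2, 16, 3, 224, 224))
print(out.shape)  # torch.Size([2, 7])
```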
The preprocessed video series is fed into the multi-scale region-aware module (MRA). The module's multi-scale region-aware attention extracts emotion regions at various scales using depth-wise convolution kernels of varying sizes. The model's attention to important regions is then strengthened by applying attention weights to the feature maps of these regions. Fig. 3 depicts the branching convolution submodule (BCS) and the region attention aggregation submodule (RAAS) that make up the multi-scale region attention structure. First, the branching convolution submodule extracts emotion regions of different scales in each frame. Subsequently, the region attention aggregation submodule computes the attention weights for the emotion regions at each scale, and aggregating the regions according to these weights yields the combined region features.

Figure 3: Multi-scale region attention structure
3.2.1 Branching Convolution Submodule
The branching convolution submodule consists of a small-kernel standard convolution that aggregates local information and a multi-branch depth-wise separable convolution that perceives multi-scale emotion regions. A more detailed explanation of the submodule follows. A 3 × 3 standard convolution kernel is first used to learn the detailed local regions of each frame, followed by a 1 × 1 convolution for dimension expansion. Subsequently, three pairs of depth-wise separable convolution kernels of different sizes (1 × 5 and 5 × 1, 1 × 7 and 7 × 1, 1 × 11 and 11 × 1) are used to perceive emotion regions at different scales. Depth-wise separable convolution both reduces the computational effort by decomposing the standard convolution and provides the perception of stripe regions, enabling more varied emotion regions to be extracted. The branching convolution process can be expressed as Eq. (1):
where
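To make the branch structure concrete, below is a minimal PyTorch sketch of such a branching convolution submodule. The kernel sizes follow the text; the channel width, padding, and the choice to return the three branch outputs separately (so that the aggregation submodule can weight them) are assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of the branching convolution submodule described above.
import torch
import torch.nn as nn

class StripeBranch(nn.Module):
    """Depth-wise separable stripe convolution: a 1 x k conv followed by a k x 1 conv."""
    def __init__(self, channels, k):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
        )

    def forward(self, x):
        return self.conv(x)

class BranchingConvSubmodule(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # 3 x 3 standard convolution for detailed local regions, 1 x 1 for channel mixing
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 1),
        )
        # Three stripe branches perceiving emotion regions at different scales
        self.branches = nn.ModuleList([StripeBranch(channels, k) for k in (5, 7, 11)])

    def forward(self, x):                       # x: (B, C, H, W) frame features
        local = self.local(x)
        # returns three same-shaped multi-scale region features
        return [branch(local) for branch in self.branches]

feats = BranchingConvSubmodule()(torch.randn(2, 64, 56, 56))
print([f.shape for f in feats])   # three tensors of shape (2, 64, 56, 56)
```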
3.2.2 Region Attention Aggregation Submodule
After the branching convolution submodule, the frame's emotion regions at various scales have been extracted. The region attention aggregation submodule uses a two-layer fully connected network to adaptively adjust the weights of the regions at each scale in order to enhance attention to valuable regions. Specifically, to obtain the spatial information, the three-scale region features are compressed independently using global average pooling (GAP). The three compressed features are then concatenated along the channel dimension to obtain a channel-level multi-scale feature
where
Using the adaptive aggregation weights
where
Then, the enhanced features are connected to generate the combined region features
where
Afterwards,
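As a hedged sketch of the aggregation steps described in this subsection, the following PyTorch snippet applies global average pooling to each scale, concatenates the pooled vectors, and maps them through a two-layer fully connected network to per-scale weights. The softmax over scales, the reduction ratio, and the weighted-sum fusion are assumptions where the text is not explicit.

```python
# Hedged sketch of the region attention aggregation submodule.
import torch
import torch.nn as nn

class RegionAttentionAggregation(nn.Module):
    def __init__(self, channels=64, num_scales=3, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels * num_scales, channels * num_scales // reduction),
            nn.GELU(),
            nn.Linear(channels * num_scales // reduction, num_scales),
        )

    def forward(self, feats):                    # feats: list of (B, C, H, W), one per scale
        # GAP each scale, then concatenate along channels -> (B, C * num_scales)
        pooled = torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)
        weights = torch.softmax(self.fc(pooled), dim=1)      # (B, num_scales)
        stacked = torch.stack(feats, dim=1)                  # (B, S, C, H, W)
        # Weighted sum over scales gives the combined region feature
        return (weights[:, :, None, None, None] * stacked).sum(dim=1)

feats = [torch.randn(2, 64, 56, 56) for _ in range(3)]
combined = RegionAttentionAggregation()(feats)
print(combined.shape)   # torch.Size([2, 64, 56, 56])
```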
We adopt a multi-stage pooling technique to stack multi-scale region-aware modules, both to further broaden the receptive field and to extract high-level semantic information in emotion regions. Specifically, non-overlapping down-sampling is applied after each multi-scale region-aware module using 2 × 2 kernels with a stride of 2. Meanwhile, as in ConvNeXt, an LN layer rather than a BN layer is employed before down-sampling to keep the incoming data in a proper range. The outputs of the final multi-scale region-aware module are combined to create a new temporal series Z.
3.3 Temporal Interaction Sampling Learning Module
In order to avoid the feature similarity issue among series and to build global temporal dependency with less overhead, we propose a temporal interaction sampling learning module. The module uses iterative odd-even down-sampling to decompose the temporal series layer by layer. Each layer applies element-wise multiplication and exponentiation to create interactions between pairs of odd-even sub-series and uses convolution to extract features from the sub-series in order to learn temporal correlation information. Using convolution as the interaction weights prevents the self-attention feature similarity issue while also increasing feature diversity. The interaction result is then passed through another one-dimensional convolution, which blends neighboring positions and improves information fusion. Because the complexity of sampling and convolution is linear in the number of pixels, the module can model global temporal dependency with less overhead than the quadratic complexity of dense pairwise computation in the Transformer. This preserves recognition accuracy while cutting training time.
The basic building block of the temporal interaction sampling learning module is the Convolutional Interaction Block (CIBlock). Its structure is shown in Fig. 4. The CIBlock decomposes the new temporal series Z into an odd sub-series and an even sub-series and applies a one-dimensional convolution to each of them.

Figure 4: Convolutional interaction block structure
For the element-wise multiplication, the convolution results of the two sub-series are transformed into exponential form and cross-multiplied. To blend neighboring positions and enhance information fusion, the multiplication result is convolved again. The original features and the output of the second convolution are combined through a residual connection. The above interaction process can be written as Eqs. (5) and (6):
where
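To illustrate this interaction, the following is a hedged PyTorch sketch of one CIBlock: odd-even down-sampling, a per-sub-series 1D convolution, exponential cross multiplication in the spirit of Eqs. (5) and (6), a second convolution, and a residual connection. The channel width, kernel size, and exact residual placement are assumptions.

```python
# Hedged sketch of one Convolutional Interaction Block (CIBlock).
import torch
import torch.nn as nn

class CIBlock(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        conv = lambda: nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv_odd, self.conv_even = conv(), conv()      # first convolution per sub-series
        self.post_odd, self.post_even = conv(), conv()      # second convolution after interaction

    def forward(self, x):                                   # x: (B, C, T) temporal series
        even, odd = x[..., ::2], x[..., 1::2]                # odd-even down-sampling
        # Cross interaction: each sub-series is scaled element-wise by the exponential
        # of the other sub-series' convolution result
        odd_i = odd * torch.exp(self.conv_even(even))
        even_i = even * torch.exp(self.conv_odd(odd))
        # Second convolution blends neighbouring steps; residual connection keeps originals
        return odd + self.post_odd(odd_i), even + self.post_even(even_i)

odd_out, even_out = CIBlock(256)(torch.randn(2, 256, 16))
print(odd_out.shape, even_out.shape)   # (2, 256, 8) each
```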
The temporal interaction sampling learning module is built by arranging several CIBlocks in a binary tree structure. Concatenating all of the deepest level's sub-series and adding a residual connection with the module's input yields the global emotion temporal feature T.
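The binary-tree arrangement could look like the sketch below, which assumes the CIBlock class from the previous snippet is in scope: each level splits every sub-series into odd/even halves, the leaves are concatenated (without restoring the original interleaving here), and a residual connection with the module input produces the global temporal feature. The number of levels is an assumption.

```python
# Hedged sketch of stacking CIBlocks in a binary tree (reuses CIBlock defined above).
import torch
import torch.nn as nn

class TemporalInteractionSampling(nn.Module):
    def __init__(self, channels, levels=2):
        super().__init__()
        self.levels = levels
        # One CIBlock per tree node: 2**levels - 1 nodes in total
        self.blocks = nn.ModuleList([CIBlock(channels) for _ in range(2 ** levels - 1)])

    def forward(self, x):                       # x: (B, C, T)
        series, idx = [x], 0
        for _ in range(self.levels):            # split every series at this level
            nxt = []
            for s in series:
                odd, even = self.blocks[idx](s)
                idx += 1
                nxt += [odd, even]
            series = nxt
        return x + torch.cat(series, dim=-1)    # concatenate leaves, residual with input

t_feat = TemporalInteractionSampling(256)(torch.randn(2, 256, 16))
print(t_feat.shape)   # torch.Size([2, 256, 16])
```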
Subsequently the global emotion temporal feature
where
To prove the effectiveness of the proposed model, comparative experiments with other advanced models are conducted. Relevant ablation experiments are also performed to show the necessity of each proposed module.
In this study, CMU-MOSI [30], CMU-MOSEI [31] and Hume-Reaction [32] are selected as benchmark datasets for performance evaluation and comparison. CMU-MOSI consists of 2199 short monologue video clips of 89 speakers from different countries on YouTube. CMU-MOSEI has 3228 monologue video clips for a total of 65 h, some examples of which are shown in Fig. 5.

Figure 5: Selected examples of the CMU-MOSEI dataset
Hume-Reaction comprises over 70 h of video footage from 2222 participants. Every sample is labeled with seven emotional reactions: adoration, amusement, anxiety, disgust, empathic pain, fear, and surprise. The original frame rates of CMU-MOSI and CMU-MOSEI are both 30 frames per second, and that of Hume-Reaction is 25 frames per second. For CMU-MOSI and CMU-MOSEI, MTCNN is used to crop only the face, which is uniformly resized to 224 × 224; for Hume-Reaction, MTCNN is combined with OpenPose to crop the face and upper body, also resized to 224 × 224. CMU-MOSI and CMU-MOSEI provide native utterance-level annotations, and this study directly uses the original segmentation and labels. For all datasets, only the spatiotemporal visual features of the RGB video frames are used; audio and text features are excluded. Table 3 displays the specifics of each dataset's training, validation, and test set divisions.

Experiments are run on a computer with an Intel Core i7-10870 CPU and 16 GB RAM and on a remote server with two NVIDIA GeForce RTX 3090 GPUs. PyTorch 1.11 is used for hyperparameter tuning, model training, and validation. The batch size is 64 for training and 16 for testing, and the frame length is 16. After 30 warmup epochs, the learning rate is gradually lowered to zero by cosine annealing, starting from an initial learning rate of 0.005. The Ranger optimizer, which integrates AdamW [33], Lookahead [34], adaptive gradient clipping, and other techniques and therefore has stronger learning ability, is used to update the model parameters. The model is trained for both regression and classification. All activation functions are GELU, which can be viewed as combining the ideas behind dropout and ReLU, making training more robust.
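The learning-rate schedule described above (30 warmup epochs, then cosine annealing to zero from an initial rate of 0.005) can be sketched as follows. AdamW stands in for the Ranger optimizer, which additionally wraps Lookahead and adaptive gradient clipping; `model` and `total_epochs` are placeholders, not values from the paper.

```python
# Hedged sketch of the warmup + cosine-annealing schedule used for training.
import math
import torch

model = torch.nn.Linear(256, 7)          # placeholder for the full model
total_epochs, warmup_epochs, base_lr = 120, 30, 0.005   # total_epochs is an assumption

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

def lr_lambda(epoch):
    if epoch < warmup_epochs:                             # linear warmup over 30 epochs
        return (epoch + 1) / warmup_epochs
    # cosine annealing from base_lr down to zero over the remaining epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... training loop over batches of 64 clips, 16 frames each ...
    optimizer.step()       # illustrative; a real loop computes a loss and calls backward first
    scheduler.step()
```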
In this experiment, Pearson correlation (Corr) is used as the metric for the emotion regression task, while seven-class accuracy (Acc-7), binary accuracy (Acc-2), and the F1-score (F1) are used for the emotion classification task. For CMU-MOSI and CMU-MOSEI, the real-valued emotion scores are discretized to obtain seven category labels, from which Acc-7 is calculated; meanwhile, non-negative emotion scores are taken as the positive category and negative scores as the negative category, from which Acc-2 and F1 are calculated. For Hume-Reaction, the emotion category with the highest score determines Acc-7; meanwhile, Acc-2 and F1 are calculated by treating anxiety, disgust, empathic pain, and fear as negative categories and adoration, amusement, and surprise as positive categories. This section uses FLOPs per frame and memory overhead as efficiency metrics to quantify temporal and spatial efficiency.
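For reference, the CMU-MOSI/MOSEI metrics described above can be computed from continuous sentiment predictions in [-3, 3] as sketched below: Acc-7 by rounding and clipping to seven integer bins, Acc-2 and F1 by thresholding at zero (non-negative as positive), and Pearson correlation. This follows common practice for these datasets and is not the authors' exact evaluation script.

```python
# Hedged sketch of the regression/classification metrics for CMU-MOSI and CMU-MOSEI.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def mosi_metrics(preds, labels):
    preds, labels = np.asarray(preds, dtype=float), np.asarray(labels, dtype=float)
    # Acc-7: discretize scores into the seven integer classes -3 ... 3
    acc7 = accuracy_score(np.clip(np.round(labels), -3, 3),
                          np.clip(np.round(preds), -3, 3))
    # Acc-2 / F1: non-negative scores are the positive class, negative the negative class
    pos_true = (labels >= 0).astype(int)
    pos_pred = (preds >= 0).astype(int)
    acc2 = accuracy_score(pos_true, pos_pred)
    f1 = f1_score(pos_true, pos_pred)
    corr, _ = pearsonr(preds, labels)
    return {"Acc-7": acc7, "Acc-2": acc2, "F1": f1, "Corr": corr}

print(mosi_metrics([2.1, -0.4, 0.0, 1.2], [2.0, -1.0, 0.3, 0.8]))
```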
To evaluate the performance of our method on the video emotion recognition task, we choose several representative methods for comparison experiments.
ResNet+Bi-LSTM uses ResNet50 for spatial feature extraction, followed by Bi-LSTM for temporal feature extraction. PredRNN [35] extracts and memorizes spatial and temporal feature representations simultaneously using ST-LSTM. ResNet3D [36] builds on ResNet by using multiple 3D convolutional layers inside each residual block, learning spatial relationships within video frames together with temporal dynamics. ResNet3D+ConvLSTM adds ConvLSTM on top of ResNet3D, applying a further round of spatiotemporal attention enhancement to the extracted spatiotemporal features. TimeSformer [8] uses the Transformer for spatiotemporal feature extraction. Dynamic Facial Expression Recognition Transformer (Former-DFER) [37] first extracts spatial features with a basic ResNet and then adds a two-stage TimeSformer to model the global dependency of temporal sequences; its global temporal modeling capacity allows it to produce good results. Deep Emotional Arousal Network (DEAN) [38] uses a Transformer to aggregate relationships across various regions and then extracts temporal features with Bi-LSTM.
The input size is uniformly 224 × 224, and the sequence length is adapted to each dataset. ResNet+Bi-LSTM adopts ResNet50 combined with randomly initialized Bi-LSTM layers, fine-tuned to convergence on the datasets of this study. ResNet3D (+ConvLSTM) uses ResNet3D-18 with randomly initialized ConvLSTM layers and is trained from scratch independently on the three datasets. DEAN retains the original model's multi-scale convolution and attention vision module, removes the audio branch, and is trained from scratch. PredRNN follows the basic settings of the original paper (Kinetics-400 pre-training), while Former-DFER removes the audio/text branches of the original multimodal model and retains only the video ViT module; its video-only performance is reported after fine-tuning the video branch on these datasets. The results of the comparative experiments on emotion recognition are shown in Tables 4–6. Table 7 compares the models in terms of memory overhead and FLOPs per frame.


The data in Tables 4–6 show that ResNet+Bi-LSTM and PredRNN already achieve good results on the video task by virtue of their spatio-temporal structure. The ResNet3D+ConvLSTM results further demonstrate that another round of spatio-temporal enhancement of the extracted spatio-temporal features can enrich the affective spatio-temporal features and improve recognition accuracy. ResNet3D relies on residual connections to stack the network deeper, thereby obtaining more temporal memory and achieving better results. Transformer-based models perform better overall on large-sample datasets such as CMU-MOSEI and Hume-Reaction, because they exploit the Transformer's global self-attention to fully extract correlation information within images and between video frames. Nevertheless, the improvement of Transformer-based models on smaller datasets is limited by their lack of prior knowledge, and such models have high memory overhead, high computational requirements, and long training times.

This paper combines the advantages of the above models and fully extracts and utilizes information in both the spatial and temporal dimensions through deep stacking of the proposed modules. On the CMU-MOSEI and Hume-Reaction datasets, which have sufficient samples, the Acc-7, Acc-2, F1, and Corr of our model are all improved. This is because the model introduces multi-scale large-kernel convolution and adaptive region attention to achieve multi-scale emotion region perception and fusion, and subsequently uses multi-layer sampling, convolution, and interaction in temporal feature extraction to establish global temporal dependency. However, on the smaller CMU-MOSI dataset, because multi-scale region-aware convolution and temporal interaction sampling rely on sufficient samples to learn the distribution patterns of spatiotemporal features, the Acc-7 metric is lower than that of Transformer-based models. Analysis of the recognition failure cases also shows that the model performs poorly on videos with extreme motion blur or severe occlusion.
Table 7 reflects the temporal and spatial overheads of different models. The proposed model is built entirely on sampling and convolution; in contrast to the quadratic overhead caused by dense pairwise computation in the Transformer, the overhead of sampling and convolution is linear in the number of pixels. At the same time, convolution carries prior knowledge such as locality and translation invariance, so the cost of the model is significantly reduced in both time and space, with FLOPs lowered by a factor of nearly 2.5 compared to Transformer-based models.

To depict the performance of the proposed model clearly, Fig. 6 shows the experimental results on CMU-MOSI, CMU-MOSEI, and Hume-Reaction as bar charts. Because CMU-MOSEI has the largest sample size, the metrics on it are the highest.

Figure 6: Experimental effects of the model in this paper on different datasets
The proposed video emotion recognition model, which combines spatial multi-scale region awareness with temporally interactive sampling, consists mainly of the multi-scale region-aware module and the temporal interaction sampling learning module. To confirm the necessity of each module, we design ablation experiments that remove the modules one at a time and examine how each affects overall performance. Table 8 displays the ablation results.

First, we remove the branching convolution submodule used for multi-scale emotion region extraction and use only ResNet-50's standard conv1 through conv4 for feature extraction. The model's performance on CMU-MOSEI declines by 0.007, 0.006, and 0.013. These results show that using large-kernel convolutions of different sizes to extract multi-scale emotion regions and obtain varied emotion information improves the accuracy of emotion recognition. Next, the region attention aggregation submodule, which produces the region attention weights, is removed, so the spatial extraction and fusion of emotion features are no longer used and only the temporal interaction sampling learning module remains. Accuracy on CMU-MOSEI further declines by 0.009, 0.007, and 0.012. This additional decline suggests that, once the multi-scale emotion regions have been extracted, important regions should be emphasized and highlighted during the fusion process.
Secondly, we remove some functional parts of the temporal interaction sampling learning module to test their effects. After removing the weight multiplication and inter-sequence interaction fusion, the model's accuracy on CMU-MOSEI decreases to varying degrees, including a 0.01 drop in Acc-7, indicating that the interaction of temporal information benefits multi-class recognition. When the secondary convolutional reinforcement is also removed, Acc-7 on CMU-MOSEI drops by a further 0.12.
The entire ablation study shows that attention must be paid to both the temporal and spatial dimensions: recognition outcomes are influenced by global temporal dependency in the temporal dimension and by multi-scale region extraction and fusion in the spatial dimension. The need for interaction and fusion is particularly clear in multi-class recognition tasks.
In this study, we propose a video-based emotion recognition model that combines spatially multi-scale region perception with temporally interactive sampling. Spatially, the model combines VAN's large-kernel decomposition with SegNeXt's multi-scale convolution, using multi-branch large-kernel stripe convolution to perceive emotion regions of multiple scales and shapes. Furthermore, to enhance perceptual diversity and emphasize the predominance of focused features, attention weights are generated for the emotion region features at each scale by pooling followed by fully connected layers. Temporally, long sequences are decomposed by splitting them into odd and even frames. Unlike SCINet, each layer uses element-wise multiplication and convolution to achieve interaction between sub-sequence pairs, capturing temporal correlation information between odd and even sub-sequences with minimal overhead. In contrast to Transformer techniques, this approach establishes global temporal dependency with less computation and circumvents the feature similarity issue induced by self-attention. The experimental results demonstrate that the proposed model achieves a better trade-off between recognition accuracy and model complexity.
Future work will focus on the following two aspects to improve model performance: (1) for better multi-modal fusion, we will explore temporal alignment strategies for patches from different modalities and, in doing so, improve the cross-modal Transformer's self-attention mechanism to build temporally correlated links across modalities; (2) the multi-label image classification algorithm based on spatial attention and graph convolution proposed by Kang et al. [39] offers new ideas, and we will explore the application of attention and graph convolution in video emotion recognition.
Acknowledgement: We are grateful to Nanjing University of Information Science and Technology and Nanjing Tech University for providing study environment and computing equipment.
Funding Statement: This study was supported, in part, by the National Natural Science Foundation of China under Grants 62272236 and 62376128, and, in part, by the Natural Science Foundation of Jiangsu Province under Grants BK20201136 and BK20191401.
Author Contributions: Study conception and design: Xiaorui Zhang, Chunlin Yuan; data collection: Ting Wang; analysis and interpretation of results: Wei Sun, Chunlin Yuan; draft manuscript preparation: Chunlin Yuan, Wei Sun. All authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials: The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Ethics Approval: Not applicable. This paper does not contain any studies with human participants performed by any of the authors.
Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.
References
1. Zadeh A, Liang PP, Mazumder N, Poria S, Cambria E, Morency LP. Memory fusion network for multi-view sequential learning. Proc AAAI Conf Artif Intell. 2018;32(1):5634–41. doi:10.1609/aaai.v32i1.12021. [Google Scholar] [CrossRef]
2. Wang J, Wang C, Guo L, Zhao S, Wang D, Zhang S, et al. MDKAT: multimodal decoupling with knowledge aggregation and transfer for video emotion recognition. IEEE Trans Circuits Syst Video Technol. 2025;35(10):9809–22. doi:10.1109/TCSVT.2025.3571534. [Google Scholar] [CrossRef]
3. Zhang H, Meng Z, Luo M, Han H, Liao L, Cambria E, et al. Towards multimodal empathetic response generation: a rich text-speech-vision avatar-based benchmark. In: Proceedings of the ACM on Web Conference 2025; 2025 May 5–9; Sydney, NSW, Australia. p. 2872–81. doi:10.1145/3696410.3714739. [Google Scholar] [CrossRef]
4. Fu Y, Wu J, Wang Z. BeMERC: behavior-aware MLLM-based framework for multimodal emotion recognition in conversation. arXiv:2503.23990. 2025. [Google Scholar]
5. Vaswani A, Shazeer N, Parmar N. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:30–44. [Google Scholar]
6. Dosovitskiy A, Beyer L, Kolesnikov A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929. 2020. [Google Scholar]
7. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10–17; Montreal, QC, Canada. p. 9992–10002. doi:10.1109/ICCV48922.2021.00986. [Google Scholar] [CrossRef]
8. Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding? In: Proceedings of the 38th International Conference on Machine Learning; 2021. Vol. 139, p. 813–24. [Google Scholar]
9. Wang K, Lian Z, Sun L, Liu B, Tao J, Fan Y. Emotional reaction analysis based on multi-label graph convolutional networks and dynamic facial expression recognition transformer. In: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge; 2022 Oct 10–14; Lisbon, Portugal. p. 75–80. doi:10.1145/3551876.3554810. [Google Scholar] [CrossRef]
10. Chaudhari A, Bhatt C, Krishna A, Travieso-González CM. Facial emotion recognition with inter-modality-attention-transformer-based self-supervised learning. Electronics. 2023;12(2):288. doi:10.3390/electronics12020288. [Google Scholar] [CrossRef]
11. Ye T, Dong L, Xia Y. Differential transformer. arXiv:2410.05258. 2024. [Google Scholar]
12. Pan X, Ye T, Xia Z, Song S, Huang G. Slide-transformer: hierarchical vision transformer with local self-attention. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17–24; Vancouver, BC, Canada. p. 2082–91. doi:10.1109/CVPR52729.2023.00207. [Google Scholar] [CrossRef]
13. Shi D, Zhong Y, Cao Q, Ma L, Lit J, Tao D. TriDet: temporal action detection with relative boundary modeling. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17–24; Vancouver, BC, Canada. p. 18857–66. doi:10.1109/CVPR52729.2023.01808. [Google Scholar] [CrossRef]
14. You Q, Jin H, Luo J. Visual sentiment analysis by attending on local image regions. Proc AAAI Conf Artif Intell. 2017;31(1):231–7. doi:10.1609/aaai.v31i1.10501. [Google Scholar] [CrossRef]
15. Guo MH, Lu CZ, Liu ZN, Cheng MM, Hu SM. Visual attention network. Comput Vis Media. 2023;9(4):733–52. [Google Scholar]
16. Ding X, Zhang X, Han J, Ding G. Scaling up your kernels to 31 × 31: revisiting large kernel design in CNNs. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18–24; New Orleans, LA, USA. p. 11953–65. doi:10.1109/CVPR52688.2022.01166. [Google Scholar] [CrossRef]
17. Liu M, Zeng A, Lai Q, Xu Q. Time series is a special sequence: forecasting with sample convolution and interaction. arXiv:2106.09305. 2021. [Google Scholar]
18. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18–24; New Orleans, LA, USA. p. 11966–76. doi:10.1109/CVPR52688.2022.01167. [Google Scholar] [CrossRef]
19. Guo MH, Lu CZ, Hou Q. SegNeXt: rethinking convolutional attention design for semantic segmentation. Adv Neural Inf Process Syst. 2022;35:1140–56. [Google Scholar]
20. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27–30; Las Vegas, NV, USA. p. 770–8. doi:10.1109/CVPR.2016.90. [Google Scholar] [CrossRef]
21. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI, USA: IEEE; 2017. p. 936–44. doi:10.1109/CVPR.2017.106. [Google Scholar] [CrossRef]
22. Li Z, Lang C, Liew JH, Li Y, Hou Q, Feng J. Cross-layer feature pyramid network for salient object detection. IEEE Trans Image Process. 2021;30:4587–98. doi:10.1109/TIP.2021.3072811. [Google Scholar] [PubMed] [CrossRef]
23. Gao SH, Cheng MM, Zhao K, Zhang XY, Yang MH, Torr P. Res2Net: a new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell. 2021;43(2):652–62. doi:10.1109/tpami.2019.2938758. [Google Scholar] [PubMed] [CrossRef]
24. Xie E, Wang W, Yu Z. SegFormer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst. 2021;34:12077–90. [Google Scholar]
25. Hou Q, Zhang L, Cheng MM, Feng J. Strip pooling: rethinking spatial pooling for scene parsing. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13–19; Seattle, WA, USA: IEEE; 2020. p. 4002–11. doi:10.1109/cvpr42600.2020.00406. [Google Scholar] [CrossRef]
26. Neimark D, Bar O, Zohar M, Asselmann D. Video transformer network. In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW); 2021 Oct 11–17; Montreal, QC, Canada: IEEE; 2021. p. 3156–65. doi:10.1109/iccvw54120.2021.00355. [Google Scholar] [CrossRef]
27. Bai S, Kolter JZ, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271. 2018. [Google Scholar]
28. Xu Z, Wang Y, Long M, Wang J. PredCNN: predictive learning with cascade convolutions. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence; 2018 Jul 13–19; Stockholm, Sweden. p. 2940–7. doi:10.24963/ijcai.2018/408. [Google Scholar] [CrossRef]
29. Gao Z, Tan C, Wu L, Li SZ. SimVP: simpler yet better video prediction. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18–22; New Orleans, LA, USA: IEEE; 2022. p. 3160–70. doi:10.1109/CVPR52688.2022.00317. [Google Scholar] [CrossRef]
30. Zadeh A, Zellers R, Pincus E, Morency LP. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intell Syst. 2016;31(6):82–8. doi:10.1109/MIS.2016.94. [Google Scholar] [CrossRef]
31. Bagher Zadeh A, Liang PP, Poria S, Cambria E, Morency LP. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia. Stroudsburg, PA, USA: ACL; 2018. p. 2236–46. doi:10.18653/v1/p18-1208. [Google Scholar] [CrossRef]
32. Christ L, Amiriparian S, Baird A, Tzirakis P, Kathan A, Müller N, et al. The MuSe 2022 multimodal sentiment analysis challenge: humor, emotional reactions, and stress. In: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge; 2022 Oct 10–14; Lisbon, Portugal. p. 5–14. doi:10.1145/3551876.3554817. [Google Scholar] [CrossRef]
33. Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv:1711.05101. 2017. [Google Scholar]
34. Zhang M, Lucas J, Hinton G, Ba J. Lookahead optimizer: k steps forward, 1 step back. Adv Neural Inf Process Syst. 2019;32. [Google Scholar]
35. Wang Y, Long M, Wang J, Gao Z, Yu PS. PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs. Adv Neural Inf Process Syst. 2017;30. [Google Scholar]
36. Hara K, Kataoka H, Satoh Y. Learning spatio-temporal features with 3D residual networks for action recognition. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW); 2017 Oct 22–29; Venice, Italy: IEEE; 2017. p. 3154–60. doi:10.1109/ICCVW.2017.373. [Google Scholar] [CrossRef]
37. Zhao Z, Liu Q. Former-DFER: dynamic facial expression recognition transformer. In: Proceedings of the 29th ACM International Conference on Multimedia; 2021 Oct 20–24; Virtual Event, China. p. 1553–61. doi:10.1145/3474085.3475292. [Google Scholar] [CrossRef]
38. Zhang F, Li XC, Lim CP, Hua Q, Dong CR, Zhai JH. Deep emotional arousal network for multimodal sentiment analysis and emotion recognition. Inf Fusion. 2022;88(1):296–304. doi:10.1016/j.inffus.2022.07.006. [Google Scholar] [CrossRef]
39. Kang P, Hou J, Zhou H, Chen Z, Li C. Multi-label image classification algorithm based on spatial attention and graph convolution. Microelectron Comput. 2022;39(5):10–9. [Google Scholar]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

