Open Access

ARTICLE

Efficient Video Emotion Recognition via Multi-Scale Region-Aware Convolution and Temporal Interaction Sampling

Xiaorui Zhang1,2,*, Chunlin Yuan3, Wei Sun4, Ting Wang5
1 College of Computer and Information Engineering, Nanjing Tech University, Nanjing, 211816, China
2 College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing, 210044, China
3 College of Computer Science, Nanjing University of Information Science and Technology, Nanjing, 210044, China
4 College of Automation, Nanjing University of Information Science and Technology, Nanjing, 210044, China
5 College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, 211816, China
* Corresponding Author: Xiaorui Zhang. Email: email
(This article belongs to the Special Issue: Advances in Deep Learning and Neural Networks: Architectures, Applications, and Challenges)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2025.071043

Received 30 July 2025; Accepted 17 October 2025; Published online 12 November 2025

Abstract

Video emotion recognition is widely applied because it aligns with the temporal nature of human emotional expression, but existing models have notable shortcomings. On the one hand, modeling global temporal dependencies with Transformer multi-head self-attention incurs high computational overhead and tends to produce overly similar features. On the other hand, fixed-size convolution kernels are often used, which perceive emotional regions of different scales poorly. This paper therefore proposes a video emotion recognition model that combines multi-scale region-aware convolution with temporal interaction sampling. Spatially, multi-branch large-kernel stripe convolutions perceive emotional region features at different scales, and attention weights are generated for each scale feature. Temporally, the sequence undergoes multi-layer odd-even down-sampling with interaction between the odd and even sub-sequences, which alleviates feature similarity while reducing computational cost, since convolution overhead grows linearly with the sampled sequence length. The model was evaluated on CMU-MOSI, CMU-MOSEI, and Hume-Reaction, reaching Acc-2 of 83.4%, 85.2%, and 81.2%, respectively. The experimental results show that the model significantly improves the accuracy of emotion recognition.
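The temporal branch described above can be illustrated with a minimal sketch: a sequence is split into even- and odd-indexed sub-sequences at each level, the two branches exchange information, and each half is processed recursively. The interaction function and the 0.5 mixing coefficient below are hypothetical placeholders, not the paper's actual formulation; the sketch only shows why each level halves the sequence length the subsequent convolutions must cover.

```python
def odd_even_interact(seq, levels=2):
    """Sketch of multi-layer odd-even down-sampling with sub-sequence
    interaction. `seq` is a 1-D list of temporal features; each level
    splits it into even/odd halves that update each other, then recurses.
    The additive interaction is an illustrative stand-in."""
    if levels == 0 or len(seq) < 2:
        return seq
    even = seq[0::2]          # even-indexed sub-sequence
    odd = seq[1::2]           # odd-indexed sub-sequence
    # Hypothetical interaction: each branch absorbs half of the other,
    # so the two down-sampled streams no longer carry identical features.
    even_upd = [e + 0.5 * o for e, o in zip(even, odd)]
    odd_upd = [o + 0.5 * e for o, e in zip(odd, even)]
    if len(even) > len(odd):  # keep a trailing element when length is odd
        even_upd.append(even[-1])
    # Each recursive call sees half the frames, so per-level convolution
    # cost shrinks linearly with the sampled length.
    return (odd_even_interact(even_upd, levels - 1)
            + odd_even_interact(odd_upd, levels - 1))

# One level over four frames: even=[1,3], odd=[2,4] interact and concatenate.
print(odd_even_interact([1.0, 2.0, 3.0, 4.0], levels=1))
```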

Keywords

Multi-scale; region-aware convolution; temporal interaction sampling; video emotion recognition