A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions
A-Seong Moon, Haesung Kim, Ye-Chan Park, Jaesung Lee*
Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea
* Corresponding Author: Jaesung Lee. Email:
Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.076411
Received 20 November 2025; Accepted 19 January 2026; Published online 13 February 2026
Abstract
Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
Keywords
Multimodal emotion recognition; multimodal learning; cross-modal learning; fusion strategies; representation learning