A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions
A-Seong Moon, Haesung Kim, Ye-Chan Park, Jaesung Lee*
Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea
* Corresponding Author: Jaesung Lee. Email:
Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.076411
Received 20 November 2025; Accepted 19 January 2026; Published online 13 February 2026
Abstract
Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
Keywords
Multimodal emotion recognition; multimodal learning; cross-modal learning; fusion strategies; representation learning