Open Access

ARTICLE

Cross-Modal Simplex Center Learning for Speech-Face Association

Qiming Ma, Fanliang Bu*, Rong Wang, Lingbin Bu, Yifan Wang, Zhiyuan Li

School of Information Network Security, People’s Public Security University of China, Beijing, 100038, China

* Corresponding Author: Fanliang Bu

Computers, Materials & Continua 2025, 82(3), 5169-5184. https://doi.org/10.32604/cmc.2025.061187

Abstract

Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments. This shared pre-training step ensures the extraction of complementary identity information across modalities. Subsequently, we introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed in a hypersphere. This design enforces an equidistant and balanced distribution of identity embeddings, reducing intra-class variations. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model's ability to generalize across challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across various speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves an accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
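To make the abstract's central idea concrete, the following is a minimal PyTorch-style sketch of how identity centers placed at the vertices of a regular simplex inscribed in the unit hypersphere could be built and used to align face and voice embeddings. It is an illustration under stated assumptions, not the authors' implementation: the vertex construction (centered standard basis), the zero-padding to the embedding dimension, the cosine-distance form of the loss, and the names simplex_vertices and simplex_center_loss are all assumptions made here.

import torch
import torch.nn.functional as F

def simplex_vertices(num_classes, dim):
    # Take the standard basis in R^C, subtract the centroid, and re-normalize:
    # the result is C equidistant unit vectors, i.e., a regular simplex
    # inscribed in the unit hypersphere.
    eye = torch.eye(num_classes)
    centered = eye - eye.mean(dim=0, keepdim=True)
    vertices = F.normalize(centered, dim=1)
    if dim > num_classes:
        # Zero-pad up to the embedding dimension (an illustrative choice).
        vertices = F.pad(vertices, (0, dim - num_classes))
    return vertices  # (C, dim) fixed, non-trainable identity centers

def simplex_center_loss(face_emb, voice_emb, labels, centers):
    # Pull the L2-normalized face and voice embeddings of each sample toward
    # its identity's simplex vertex (cosine-distance formulation).
    f = F.normalize(face_emb, dim=1)
    v = F.normalize(voice_emb, dim=1)
    c = centers[labels]  # (batch, dim)
    return (2.0 - (f * c).sum(dim=1) - (v * c).sum(dim=1)).mean()

Because the vertices are mutually equidistant on the hypersphere, pulling both modalities of an identity toward the same vertex yields the balanced, equidistant class layout the abstract describes; the paper's improved triplet center loss (hard sample mining, inter-class separation) would be added on top of this term.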

Keywords

Speech-face association; cross-modal learning; cross-modal matching; cross-modal retrieval

Cite This Article

APA Style
Ma, Q., Bu, F., Wang, R., Bu, L., Wang, Y. et al. (2025). Cross-modal simplex center learning for speech-face association. Computers, Materials & Continua, 82(3), 5169–5184. https://doi.org/10.32604/cmc.2025.061187
Vancouver Style
Ma Q, Bu F, Wang R, Bu L, Wang Y, Li Z. Cross-modal simplex center learning for speech-face association. Comput Mater Contin. 2025;82(3):5169–5184. https://doi.org/10.32604/cmc.2025.061187
IEEE Style
Q. Ma, F. Bu, R. Wang, L. Bu, Y. Wang, and Z. Li, “Cross-Modal Simplex Center Learning for Speech-Face Association,” Comput. Mater. Contin., vol. 82, no. 3, pp. 5169–5184, 2025. https://doi.org/10.32604/cmc.2025.061187



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.