Open Access
ARTICLE
Cross-Modal Simplex Center Learning for Speech-Face Association
School of Information Network Security, People’s Public Security University of China, Beijing, 100038, China
* Corresponding Author: Fanliang Bu.
Computers, Materials & Continua 2025, 82(3), 5169-5184. https://doi.org/10.32604/cmc.2025.061187
Received 19 November 2024; Accepted 25 December 2024; Issue published 06 March 2025
Abstract
Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, both of which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating the identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments; this shared pre-training step ensures the extraction of complementary identity information across modalities. We then introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed in a hypersphere. This design enforces an equidistant, balanced distribution of identity embeddings and reduces intra-class variation. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model's ability to generalize in challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves an accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
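To make the core idea of the cross-modal simplex center loss concrete, the sketch below is a minimal PyTorch illustration rather than the authors' implementation: it fixes identity centers at the vertices of a regular simplex inscribed in the unit hypersphere, so all centers are equidistant, and pulls the L2-normalized face and voice embeddings of each identity toward the shared vertex. The names `simplex_vertices` and `SimplexCenterLoss`, the cosine-based objective, and the example dimensions are assumptions made for illustration only.

```python
# Minimal sketch (not the paper's released code) of a cross-modal simplex center loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def simplex_vertices(num_classes: int, dim: int) -> torch.Tensor:
    """Columns: unit-norm, equidistant class centers (regular simplex) in R^dim.

    For simplicity this sketch assumes dim >= num_classes; the pairwise cosine
    similarity between any two distinct centers is then exactly -1/(num_classes - 1).
    """
    assert dim >= num_classes, "sketch assumes dim >= num_classes"
    # Orthonormal columns via reduced QR of a random Gaussian matrix.
    u, _ = torch.linalg.qr(torch.randn(dim, num_classes))
    # I - (1/C) * ones: centers the canonical basis so the vertices sum to zero.
    centering = torch.eye(num_classes) - 1.0 / num_classes
    e = (num_classes / (num_classes - 1)) ** 0.5 * (u @ centering)
    return e  # shape (dim, num_classes); columns already have unit norm


class SimplexCenterLoss(nn.Module):
    """Pull L2-normalized face and voice embeddings toward their fixed identity vertex."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # Fixed (non-learnable) centers; buffer so they follow .to(device).
        self.register_buffer("centers", simplex_vertices(num_classes, dim))

    def forward(self, face_emb, voice_emb, labels):
        c = self.centers[:, labels].t()          # (B, dim): shared center per identity
        f = F.normalize(face_emb, dim=1)
        v = F.normalize(voice_emb, dim=1)
        # Encourage both modalities to sit at the same simplex vertex (cosine -> 1).
        return (2.0 - (f * c).sum(dim=1) - (v * c).sum(dim=1)).mean()


# Hypothetical usage with illustrative sizes:
# loss_fn = SimplexCenterLoss(num_classes=500, dim=512)
# loss = loss_fn(face_feats, voice_feats, identity_ids)
```

Because every pair of vertices has the same cosine similarity, anchoring both modalities to these fixed centers yields the equidistant, balanced identity layout that the abstract attributes to the proposed loss; the hard-mining triplet center term described above would be added on top of such an objective.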
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.