TY - EJOUR
AU - Chen, Yuanle
AU - Wang, Haobo
AU - Liu, Chunyu
AU - Wang, Linyi
AU - Liu, Jiaxin
AU - Wu, Wei
TI - Generative Multi-Modal Mutual Enhancement Video Semantic Communications
T2 - Computer Modeling in Engineering & Sciences
PY - 2024
VL - 139
IS - 3
SN - 1526-1506
AB - Recently, there have been significant advancements in the study of semantic communication in single-modal scenarios. However, the ability to process information in multi-modal environments remains limited. Inspired by research and applications of natural language processing across different modalities, our goal is to accurately extract frame-level semantic information from videos and ultimately transmit high-quality videos. Specifically, we propose a deep learning-based Multi-Modal Mutual Enhancement Video Semantic Communication system, called M3E-VSC. Built upon a Vector Quantized Generative Adversarial Network (VQGAN), our system leverages mutual enhancement among different modalities by using text as the main carrier of transmission. The system extracts semantic information from the key-frame images and audio of the video and computes differential values so that the extracted text conveys accurate semantic information with fewer bits, thus improving system capacity. Furthermore, a multi-frame semantic detection module is designed to facilitate semantic transitions during video generation. Simulation results demonstrate that the proposed model maintains high robustness in complex noise environments, particularly under low signal-to-noise ratio conditions, improving the accuracy and speed of semantic transmission in video communication by approximately 50 percent.
KW - Generative adversarial networks
KW - multi-modal mutual enhancement
KW - video semantic transmission
KW - deep learning
DO - 10.32604/cmes.2023.046837
ER - 