Open Access

ARTICLE

PMCFusion: A Parallel Multi-Dimensional Complementary Network for Infrared and Visible Image Fusion

Xu Tao1, Qiang Xiao2, Zhaoqi Jin2, Hao Li1,*

1 School of Information Science & Engineering, Yunnan University, Kunming, 650504, China
2 Yunnan Highway Network Toll Management Co., Ltd., Yunnan Key Laboratory of Digital Communications, Kunming, 650100, China

* Corresponding Author: Hao Li. Email: email

Computers, Materials & Continua 2026, 86(2), 1-18. https://doi.org/10.32604/cmc.2025.070790

Abstract

Image fusion technology aims to generate a more informative single image by integrating complementary information from multi-modal images. Despite the significant progress of deep learning-based fusion methods, existing algorithms are often limited to single- or dual-dimensional feature interactions and thus struggle to fully exploit the deep complementarity between multi-modal images. To address this, this paper proposes a parallel multi-dimensional complementary fusion network, termed PMCFusion, for the task of infrared and visible image fusion. The core of this method is its parallel tri-branch fusion module, PTFM, which performs parallel, synergistic perception and efficient integration across three distinct dimensions: spatial dissimilarity, channel-wise disparity, and frequency-domain complementarity. Leveraging carefully designed cross-dimensional attention interactions, PTFM selectively enhances multi-dimensional features to achieve deep complementarity. Furthermore, to enhance the detail clarity and structural integrity of the fused image, we design a dedicated multi-scale high-frequency detail enhancement module, HFDEM, which actively extracts, enhances, and injects high-frequency components in a residual manner to improve the clarity of the fused image. The overall model employs a multi-scale architecture and is constrained by corresponding loss functions to ensure efficient and robust fusion across different resolutions. Extensive experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art fusion algorithms in both subjective visual quality and objective evaluation metrics.

Keywords

Infrared and visible image fusion; deep learning; parallel multi-dimensional; attention mechanism; detail enhancement

1  Introduction

Image fusion aims to integrate complementary information from multi-source images into a single, information-rich enhanced image, thereby improving scene understanding and decision-making capabilities [1–3]. Among various applications, the fusion of infrared and visible images is particularly crucial due to its significant potential in fields such as geological exploration, intelligent security, and medical diagnosis [4]. Visible images provide fine-grained textures and color details but are vulnerable to poor illumination or smoke, leading to information loss [5]. Infrared images, in contrast, capture thermal radiation and effectively highlight salient targets under adverse conditions but lack background details. Designing a fusion algorithm that can combine visible fine textures with infrared salient features to produce a final image containing both clear textures and prominent targets remains a central challenge in the field [3]. Early studies relied on handcrafted priors, such as sparse representation [6,7] and multi-scale decomposition [8–10]. While these traditional methods achieved limited success, their fixed feature extractors could not adapt to diverse real-world scenarios, restricting robustness and generalization. With the rise of deep learning, Convolutional Neural Networks (CNNs) [11,12], Generative Adversarial Networks (GANs) [13–16], and Transformers [17,18] have substantially advanced fusion quality through data-driven representation learning. CNN-based models excel at local feature extraction but may neglect global semantics; GAN-based approaches improve perceptual realism but sometimes introduce artifacts; Transformer-based architectures capture long-range dependencies and semantic alignment but are often computationally expensive, which hinders deployment in real-time applications.

In addition, hybrid paradigms have emerged. Pretrained backbones such as VGG, ResNet, and CLIP have been leveraged to transfer strong representation ability into the fusion task [19]. Optimization-based methods, including particle swarm optimization (PSO) and genetic algorithms, have been explored to adaptively search for optimal fusion rules or weights. Zhang et al. combined PSO with Dense Blocks to optimize wavelet coefficients, but relied on costly optimization [20]. Meanwhile, graph-based approaches explicitly model cross-modal dependencies by converting images into graph structures and employing graph convolutional networks (GCNs). Li et al. proposed a graph representation learning framework that extracts non-local self-similarity features across modalities, achieving superior performance on TNO, RoadScene, and M3FD datasets [21]. Similarly, DGFD employs a dual-graph convolutional network for cross-modal fusion and demonstrates remarkable improvements in low-light object detection tasks [22]. These approaches offer valuable insights but often rely on costly optimization, heavy pretraining, or complex graph modeling.

Nevertheless, most existing networks still suffer from a common limitation: fusion decisions are usually made in isolation, restricted to either spatial-channel domains [18] or spatial-frequency domains [23]. This serial pipeline design risks information loss during cross-domain transitions and fails to fully exploit the intrinsic complementarity across multiple dimensions. To address these limitations, we propose PMCFusion, a parallel multi-dimensional complementary fusion framework. By concurrently processing spatial, channel, and frequency features, PMCFusion achieves synergistic enhancement and strikes a balanced trade-off between detail preservation, structural fidelity, and cross-modal generalization.

To address the aforementioned challenges, this paper proposes an innovative infrared and visible image fusion framework, named the Parallel Multi-dimensional Complementary Fusion Network (PMCFusion). The core of this network lies in its meticulously designed Parallel Tri-branch Fusion Module (PTFM), which coordinately processes and efficiently integrates spatial structures, channel characteristics, and frequency-domain representations from the source images. Specifically, the PTFM consists of three parallel sub-paths: the spatial branch models spatial discrepancies in the feature maps and generates attention weights to highlight edge and structural information; the channel branch introduces a cross-modal channel attention mechanism to effectively capture and weight complementary channel features between the modalities; and the frequency branch innovatively incorporates the Fast Fourier Transform (FFT) and its inverse (IFFT) to extract and fuse complementary information from the amplitude and phase spectra of both modalities. The outputs from these three branches are dynamically integrated and processed with high- and low-pass filtering, achieving selective enhancement and deep complementarity of multi-dimensional features. To further enhance the detail clarity and visual quality of the fused images, a High-Frequency Detail Enhancement Module (HFDEM) is independently integrated into the network, and a targeted high-frequency consistency loss is introduced to provide training supervision. The overall PMCFusion framework adopts a multi-scale pyramid architecture, with the PTFM embedded at each scale level to ensure that multi-level features, from coarse to fine, undergo thorough cross-dimensional information interaction and optimization. We systematically evaluated PMCFusion on multiple public infrared and visible image fusion datasets and conducted a comprehensive comparison with several state-of-the-art (SOTA) fusion algorithms. The experimental results demonstrate that PMCFusion exhibits significant advantages in both subjective visual performance and multiple objective evaluation metrics.

In summary, the main contributions of this paper can be outlined as follows:

•   We propose a novel framework named the Parallel Multi-dimensional Complementary Fusion Network (PMCFusion). Its core innovation lies in the ability to concurrently process and synergistically integrate complementary information from three dimensions—spatial, channel, and frequency—effectively overcoming the limitations of existing methods in multi-dimensional information utilization.

•   We have designed a unique Parallel Tri-branch Fusion Module (PTFM) as the network’s core component. Through meticulously constructed sub-paths for spatial perception, channel interaction, and frequency-domain analysis, this module achieves refined extraction, adaptive weighting, and efficient fusion of multi-modal features. Combined with high- and low-pass filtering, it ensures the optimized integration of cross-dimensional features.

•   We have integrated a standalone High-frequency Detail Enhancement Module (HFDEM) and designed a composite loss function aimed at comprehensively optimizing both structure and details, which significantly enhances the edge clarity, texture representation, and overall visual quality of the fused images.

•   Extensive experimental results on several public benchmark datasets demonstrate that the proposed PMCFusion consistently and significantly outperforms current state-of-the-art (SOTA) fusion algorithms in terms of both subjective visual effects and multiple objective evaluation metrics.

The remainder of this paper is organized as follows: Section 2 reviews the related work in the field of infrared and visible image fusion. Section 3 elaborates on the overall architecture, core module design, and loss function of the proposed PMCFusion network. Section 4 presents and provides an in-depth analysis of the experimental results. Finally, Section 5 concludes the paper and discusses future research directions.

2  Related Work

2.1 Traditional Image Fusion

Before deep learning became dominant, infrared and visible image fusion was mainly explored via traditional signal processing methods. Representative categories include multi-scale transform approaches (e.g., DWT, NSST) [2,9,10], sparse representation-based strategies [6,7], subspace learning such as PCA [24,25], and saliency-driven fusion [26]. Hybrid models combining multiple paradigms were also studied [27,28]. While effective in controlled settings, these methods rely heavily on hand-crafted priors and heuristic rules, which limit adaptability, fail to capture high-level semantics, and often incur high computational costs. These drawbacks motivated the transition to data-driven deep learning approaches [29].

2.2 Deep Learning-Based Image Fusion

With the success of deep learning in vision tasks, a wide range of neural network-based fusion methods have been developed. They can be broadly categorized into CNN-based, GAN-based, and Transformer-based approaches.

CNN-based methods: Multi-layer convolutional networks automatically extract multi-scale features from source images [11,12]. Fusion is then performed via weighted averaging, element-wise operations, or a dedicated fusion sub-network, and decoded into the fused output. Residual connections [30] have improved training stability. Despite their strength in local feature extraction, CNNs are constrained by limited receptive fields, making them less effective for modeling long-range dependencies, and their fusion strategies may cause redundancy or information loss.

GAN-based methods: GANs leverage adversarial learning to generate perceptually natural fused images [13,16]. For instance, dual-discriminator setups encourage alignment with both infrared and visible modalities. While GANs can produce visually realistic details, they suffer from training instability and weaker structural fidelity compared with CNNs.

Transformer-based methods: Transformers introduce self-attention to capture global dependencies and semantic relationships [17,18]. Recent work integrates Transformer blocks into CNNs or designs end-to-end Transformer fusion networks. Although performance is strong, Transformers are computationally intensive and less practical for real-time applications.

2.3 Other Emerging Paradigms

Beyond CNN-based, GAN-based, and Transformer-based methods, several new paradigms further expand the research landscape.

Pretrained backbones: Large pretrained networks (e.g., VGG, ResNet, CLIP) have been adopted as feature extractors to transfer semantic knowledge from large datasets into fusion tasks, enhancing structural preservation and texture fidelity [19].

Optimization-based methods: Meta-heuristic strategies, such as particle swarm optimization (PSO) and genetic algorithms, have been applied to search for fusion rules or weights adaptively [20]. These approaches improve fusion quality but are typically computationally expensive.

Graph-based approaches: Graph representation learning has been used to explicitly model cross-modal dependencies. Li et al. [21] proposed GCN-based graph learning to capture non-local similarities, while DGFD [22] introduces a dual-graph convolutional network to achieve cross-modal fusion and demonstrates significant performance improvements in low-light object detection tasks. Such methods achieve notable gains but rely on complex graph modeling and heavy pretraining.

Diffusion and equivariant models: Recently, diffusion models and equivariant learning have entered the field. For example, EMMA [31] leverages equivariant constraints for multi-modality fusion, DMFuse [32] leverages diffusion model guidance with cross-attention learning for infrared and visible image fusion, and SGDFuse [33] incorporates SAM-guided diffusion for robust fusion. These methods represent the latest progress, though they often involve high computational cost and complex training.

2.4 Limitations of Existing Methods

Despite rapid advances, most existing methods focus on restricted domains—either spatial-channel [18] or spatial-frequency [23]. Serial pipelines risk information loss during cross-domain transitions and fail to leverage intrinsic complementarity across spatial, channel, and frequency dimensions. Addressing this gap motivates our proposed PMCFusion, which performs parallel multi-dimensional complementary fusion for more effective multi-modal integration.

3  Method

3.1 Overall Architecture

The PMCFusion network proposed in this paper adopts an end-to-end multi-scale encoder-decoder architecture, the overall design of which is illustrated in Fig. 1. This architecture is engineered to systematically extract, fuse, and reconstruct multi-level features from the input infrared image $I_{ir} \in \mathbb{R}^{H \times W \times C}$ and the visible image $I_{vis} \in \mathbb{R}^{H \times W \times C}$. In the encoder path, the network follows a feature extraction strategy that proceeds from fine to coarse resolutions. Initially, a convolutional layer performs shallow feature extraction on the source images. Diverging from conventional methods, we innovatively introduce a High-Frequency Detail Enhancement Module (HFDEM) at this stage to actively enhance the details within the initial features, thereby providing a more information-rich input for the subsequent fusion process. Subsequently, the enhanced feature maps undergo a series of downsampling operations to generate three feature levels at different resolutions (Level 1, Level 2, and Level 3). At each level, we employ a Residual Feature Distillation Block (RFDB) [34] to deeply refine the features, yielding more discriminative representations. In the decoder path, the network employs a bottom-up process for feature fusion and reconstruction. This process commences at the deepest feature level (Level 3), where the infrared and visible features are fed into our core fusion unit, the Parallel Tri-branch Fusion Module (PTFM). The output of the PTFM is then upsampled and passed to the preceding level (Level 2), where it is concatenated with the same-scale features from the encoder path. Ultimately, the fused features from the highest level are fed into a reconstruction module to generate the final fused image. By performing deep fusion and multi-level supervision at each scale, PMCFusion ensures the thorough integration of cross-modal information and high-quality reconstruction. The designs of the two core modules, HFDEM and PTFM, will be elaborated in the following subsections.


Figure 1: The overall architecture of the PMCFusion network. The PTFM and HFDEM modules are detailed in the following subsections
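To make the data flow of Fig. 1 concrete, the following PyTorch-style skeleton sketches the multi-scale encoder-decoder pipeline described above. The HFDEM, RFDB, and PTFM internals are replaced by lightweight placeholder blocks (their real designs are given in Sections 3.2 and 3.3), and the channel width, shared shallow extractor, downsampling operator, and reconstruction head are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Placeholder 3x3 conv block standing in for the HFDEM / RFDB internals."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.LeakyReLU(0.1))
    def forward(self, x):
        return self.body(x)

class PTFMStub(nn.Module):
    """Placeholder for the Parallel Tri-branch Fusion Module (see Section 3.2)."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, 1)
    def forward(self, f_ir, f_vis):
        return self.fuse(torch.cat([f_ir, f_vis], dim=1))

class PMCFusionSkeleton(nn.Module):
    """Multi-scale encoder-decoder data flow of Fig. 1 (channel width c is an assumption)."""
    def __init__(self, c=32):
        super().__init__()
        self.shallow = ConvBlock(1, c)          # shallow extraction, shared here for brevity
        self.hfdem = ConvBlock(c, c)            # stands in for the HFDEM of Section 3.3
        self.rfdb = nn.ModuleList([ConvBlock(c, c) for _ in range(3)])  # one RFDB per level
        self.ptfm = nn.ModuleList([PTFMStub(c) for _ in range(3)])      # one PTFM per level
        self.merge = nn.ModuleList([nn.Conv2d(2 * c, c, 1) for _ in range(2)])
        self.recon = nn.Sequential(nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid())

    def encode(self, x):
        feats, f = [], self.hfdem(self.shallow(x))
        for i, blk in enumerate(self.rfdb):
            if i > 0:                            # downsample between levels 1 -> 2 -> 3
                f = F.avg_pool2d(f, 2)
            f = blk(f)
            feats.append(f)
        return feats                             # [level-1, level-2, level-3]

    def forward(self, ir, vis):
        fi, fv = self.encode(ir), self.encode(vis)
        out = self.ptfm[2](fi[2], fv[2])         # fusion starts at the deepest level
        for lvl in (1, 0):                       # bottom-up fusion and reconstruction
            out = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
            fused = self.ptfm[lvl](fi[lvl], fv[lvl])
            out = self.merge[lvl](torch.cat([out, fused], dim=1))
        return self.recon(out)
```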

3.2 Parallel Tri-Branch Fusion Module

As the core of the PMCFusion network, the PTFM module is embedded at each feature scale to perform deep cross-modal fusion. Existing multi-modal fusion strategies are largely confined to feature interaction in single or dual dimensions and often employ serial processing. This approach not only overlooks the unique complementarity offered by other dimensions, such as the frequency domain [3], but is also prone to information loss during cross-domain transfer. To overcome these challenges, our designed PTFM module abandons the conventional serial-stacking paradigm. Its core concept is the construction of a parallel tri-branch network designed to simultaneously mine and integrate infrared and visible features from three orthogonal dimensions within a single, unified unit: spatial dissimilarity, channel disparity, and frequency complementarity. This parallel architecture ensures the maximal preservation of information from different dimensions and facilitates efficient cross-domain synergy. The overall structure of the PTFM and its key components are illustrated in Fig. 2. The specific design of its three internal parallel branches will be elaborated upon in the following subsections.


Figure 2: Detailed structure diagram of the PTFM, where (a) shows the three parallel branches in detail and (b) shows the final fusion via high-/low-pass filtering

Spatial Branch: Information in the spatial dimension, such as edges, contours, and textures, is critical to the task of image fusion. Infrared images excel at capturing salient target contours, while visible images are replete with fine textural details. To adaptively preserve this complementary spatial information, we designed the spatial branch. This branch operates by computing the differences between feature maps at corresponding spatial locations to generate a spatial attention map. This map, in turn, guides the model to focus on and retain the most informative spatial regions from each modality.

Initially, the difference map Ds is derived by computing the element-wise absolute difference between the input features Fq and Fk. Subsequently, both average and max pooling are applied to Ds along the channel dimension. The results are concatenated and then fed into a multi-scale spatial attention network, which employs parallel 3 × 3 and 7 × 7 convolutional kernels, to generate the final spatial weight map, Ws. This weight map Ws highlights regions where significant spatial discrepancies exist between the two modalities. The output feature of the spatial branch Fs is then obtained by the following equation:

$$
\begin{aligned}
D_s &= |F_q - F_k| \\
W_s &= \sigma\big(\mathrm{Conv}_{3\times3}([\mathrm{Avg}(D_s);\ \mathrm{Max}(D_s)]) + \mathrm{Conv}_{7\times7}([\mathrm{Avg}(D_s);\ \mathrm{Max}(D_s)])\big) \\
F_s &= F_q \odot W_s
\end{aligned}
\tag{1}
$$

where $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation, $\mathrm{Avg}(\cdot)$ and $\mathrm{Max}(\cdot)$ represent average pooling and max pooling along the channel dimension, respectively, $\sigma$ is the Sigmoid activation function, and $\odot$ denotes element-wise multiplication.
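A minimal PyTorch sketch of this spatial branch follows; it mirrors Eq. (1), with the only assumption being that each of the two parallel convolutions maps the two-channel [Avg; Max] descriptor directly to a one-channel weight map.

```python
import torch
import torch.nn as nn

class SpatialBranch(nn.Module):
    """Spatial branch of the PTFM (Eq. 1): attention derived from the cross-modal difference map."""
    def __init__(self):
        super().__init__()
        # Each conv maps the 2-channel [avg; max] descriptor to a 1-channel weight map.
        self.conv3 = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        self.conv7 = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f_q, f_k):
        d_s = torch.abs(f_q - f_k)                          # element-wise difference map D_s
        avg = d_s.mean(dim=1, keepdim=True)                 # channel-wise average pooling
        mx, _ = d_s.max(dim=1, keepdim=True)                # channel-wise max pooling
        desc = torch.cat([avg, mx], dim=1)                  # [Avg(D_s); Max(D_s)]
        w_s = torch.sigmoid(self.conv3(desc) + self.conv7(desc))   # multi-scale weight map W_s
        return f_q * w_s                                    # F_s = F_q (.) W_s
```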

Channel Branch: Feature responses along the channel dimension characterize different abstract semantic information within an image. To accentuate the unique features inherent to each modality, we have designed the channel branch. This branch operates by computing the dissimilarity between the features of the two modalities at the channel level to generate a weighting map, thereby enhancing features that are prominent in one modality but less pronounced in the other.

Initially, we pass the input features through an enhanced channel attention module to generate a distinct channel attention map for each modality. We then quantify the correlation between these attention maps by computing their cosine similarity. This metric is integrated with the difference between the maps to derive the final channel weight Wc. This weight Wc serves to amplify the channels that exhibit significant differences between the two modalities. The output feature of the channel branch Fc is ultimately defined as:

$$
\begin{aligned}
A_c^q, A_c^k &= \mathrm{Sigmoid}\big(\mathrm{Conv}_{1\times1}(\mathrm{ReLU}(\mathrm{Conv}_{1\times1}(\mathrm{AvgPool}(F_q, F_k))))\big) \\
\mathrm{Corr}_c &= 1 - \mathrm{sim}_{\cos}(A_c^q, A_c^k) \\
W_c &= \mathrm{Norm}\big(|A_c^q - A_c^k| \odot (1 + \mathrm{Corr}_c)\big) \\
F_c &= F_q \odot W_c
\end{aligned}
\tag{2}
$$

here, $A_c^q$ and $A_c^k$ are the channel attention maps for the query and key features, respectively; $\mathrm{sim}_{\cos}(\cdot)$ denotes the cosine similarity computation, $\odot$ represents element-wise multiplication, and $\mathrm{Norm}(\cdot)$ indicates a normalization operation. Through the above operations, the model is enabled to pay more attention to and enhance the channel features in $F_q$ that exhibit strong complementarity with $F_k$ in terms of channel responses.
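The channel branch can be sketched as below following Eq. (2). The squeeze-and-excitation-style attention, its reduction ratio, the sharing of attention weights between the two modalities, and the max-based normalization used for $\mathrm{Norm}(\cdot)$ are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelBranch(nn.Module):
    """Channel branch of the PTFM (Eq. 2): difference- and correlation-weighted channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Squeeze-and-excitation style attention, shared between modalities here for brevity.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_q, f_k):
        a_q, a_k = self.attn(f_q), self.attn(f_k)           # A_c^q, A_c^k with shape (B, C, 1, 1)
        corr = 1.0 - F.cosine_similarity(a_q.flatten(1), a_k.flatten(1), dim=1)  # Corr_c
        diff = (a_q - a_k).abs() * (1.0 + corr.view(-1, 1, 1, 1))
        w_c = diff / (diff.amax(dim=1, keepdim=True) + 1e-6)  # Norm(): scale into [0, 1]
        return f_q * w_c                                     # F_c = F_q (.) W_c
```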

Frequency Branch: In contrast to seeking dissimilarities within the spatial and channel domains, the frequency domain offers another unique perspective for feature fusion. The low-frequency components of an image typically represent its global contours and background context, whereas the high-frequency components contain information about edges and details. Infrared and visible images exhibit a natural complementarity in their information distribution across different frequency bands.

The frequency branch is designed to fuse features directly within the frequency domain. The input features Fq and Fk are first transformed into the frequency domain using the Fast Fourier Transform (FFT) to obtain their respective amplitude and phase spectra. Subsequently, the amplitude spectra from both modalities are concatenated, as are the phase spectra. These are then processed by two independent convolutional networks to learn and generate a fused amplitude spectrum Af and a fused phase spectrum Pf. Finally, the Inverse Fast Fourier Transform (IFFT) is applied to convert the fused frequency-domain representation back into the spatial domain, yielding the output feature of the frequency branch Ff. The entire process can be summarized as:

$$
\begin{aligned}
A_f, P_f &= G_{fre}\big([\mathrm{FFT}(F_q);\ \mathrm{FFT}(F_k)]\big) \\
F_a, F_p &= \mathrm{LeakyReLU}\big(\mathrm{Conv}_{3\times3}(\mathrm{LeakyReLU}(\mathrm{Conv}_{3\times3}(A_f, P_f)))\big) \\
F_f &= \mathrm{IFFT}\big(\mathrm{Comp}(F_a, F_p)\big)
\end{aligned}
\tag{3}
$$

herein, $G_{fre}$ denotes the convolutional networks for fusing the frequency-domain spectra, $F_a$ and $F_p$ are the refined amplitude and phase features, and $\mathrm{Comp}(\cdot)$ denotes the reconstruction of the complex spectrum from amplitude and phase based on Euler's formula.
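A compact sketch of the frequency branch is given below, using torch.fft for the FFT/IFFT and torch.polar for the complex reconstruction $\mathrm{Comp}(\cdot)$; the depth and width of the spectrum-fusion networks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrequencyBranch(nn.Module):
    """Frequency branch of the PTFM (Eq. 3): fuse amplitude and phase spectra via FFT/IFFT."""
    def __init__(self, channels):
        super().__init__()
        def spectrum_net():
            # Two 3x3 convolutions with LeakyReLU, applied to the concatenated spectra.
            return nn.Sequential(
                nn.Conv2d(2 * channels, channels, 3, padding=1), nn.LeakyReLU(0.1),
                nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.1),
            )
        self.amp_net = spectrum_net()     # G_fre for the amplitude spectra
        self.pha_net = spectrum_net()     # G_fre for the phase spectra

    def forward(self, f_q, f_k):
        sq, sk = torch.fft.fft2(f_q), torch.fft.fft2(f_k)
        amp = self.amp_net(torch.cat([sq.abs(), sk.abs()], dim=1))      # fused amplitude A_f
        pha = self.pha_net(torch.cat([sq.angle(), sk.angle()], dim=1))  # fused phase P_f
        fused = torch.polar(amp, pha)     # Comp(): rebuild the complex spectrum A_f * e^{jP_f}
        return torch.fft.ifft2(fused).real
```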

Dynamic Branch Weighting and Frequency-Selective Recombination: To optimally integrate the complementary information derived from the three parallel branches, we introduce a dynamic branch weighting module. This module takes the original features $F_q$ and $F_k$ as input and employs a lightweight convolutional network to dynamically generate weights $(w_s, w_c, w_f)$ for the spatial, channel, and frequency branches at each spatial location. This mechanism enables the network to adaptively determine the contribution of each branch according to the local feature characteristics. The resulting weighted fused feature is denoted as $F_{fus}$. Subsequently, to prevent information loss that could arise from a direct dimensionality reduction to 32 channels, a separate lightweight convolutional network refines $F_{fus}$ into the feature $F'_{fus}$.

$$
\begin{aligned}
F_{fus} &= w_s \odot F_s + w_c \odot F_c + w_f \odot F_f \\
F'_{fus} &= \mathrm{BatchNorm}\big(\mathrm{Conv}_{3\times3}(\mathrm{GELU}(\mathrm{Conv}_{1\times1}(F_{fus})))\big)
\end{aligned}
\tag{4}
$$

Building upon this, to further refine the features, the PTFM subjects $F'_{fus}$ to a frequency-selective recombination process. We employ independent Gaussian low-pass and Laplacian high-pass filters to decompose the input features $F_q$ and $F_k$ into low-frequency components $F_L^q, F_L^k$ and high-frequency components $F_H^q, F_H^k$, respectively. Concurrently, a separate network, composed of a 3 × 3 convolutional layer with a padding of 1, a ReLU activation function, a 1 × 1 convolutional layer, and a final Softmax activation, learns the fusion weights $w_L$ and $w_H$ for the low- and high-frequency components based on the input features. Ultimately, by performing a weighted fusion of the frequency components from the different modalities and adding this to the tri-branch output via a residual connection, the final output of the PTFM, $F_{out}^{PT}$, is obtained:

$$
\begin{aligned}
F_L^{out} &= w_L \odot F_L^q + (1 - w_L) \odot F_L^k \\
F_H^{out} &= w_H \odot F_H^q + (1 - w_H) \odot F_H^k \\
F_{out}^{PT} &= \mathrm{Softmax}\big(\mathrm{Conv}_{1\times1}(\mathrm{ReLU}(\mathrm{Conv}_{3\times3}(F_L^{out} + F_H^{out})))\big) + F'_{fus}
\end{aligned}
\tag{5}
$$
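The sketch below combines Eqs. (4) and (5): per-pixel branch weights from a lightweight network, the Conv-GELU-Conv-BatchNorm refinement, fixed Gaussian low-pass and Laplacian high-pass filters, and learned low-/high-frequency fusion weights. The specific 3 × 3 filter kernels and the internal layouts of the weight networks are illustrative choices; the final Softmax placement follows Eq. (5) as written.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

GAUSS_3x3 = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0  # low-pass kernel
LAPLACE_3x3 = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])   # high-pass kernel

def fixed_depthwise(x, kernel):
    """Apply a fixed 3x3 kernel to every channel of x (non-learned depthwise convolution)."""
    c = x.shape[1]
    w = kernel.to(x).view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    return F.conv2d(x, w, padding=1, groups=c)

class DynamicFusionAndRecombination(nn.Module):
    """Eqs. (4)-(5): dynamic tri-branch weighting followed by frequency-selective recombination."""
    def __init__(self, channels):
        super().__init__()
        # Lightweight network predicting per-pixel weights (w_s, w_c, w_f) for the three branches.
        self.branch_w = nn.Sequential(nn.Conv2d(2 * channels, 3, 1), nn.Softmax(dim=1))
        # Refinement of the weighted sum: Conv1x1 -> GELU -> Conv3x3 -> BatchNorm (Eq. 4).
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        # Network predicting the low-/high-frequency fusion weights (w_L, w_H).
        self.freq_w = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 1), nn.Softmax(dim=1),
        )
        # Post-processing of the recombined frequency components, as written in Eq. (5).
        self.post = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.Softmax(dim=1),
        )

    def forward(self, f_q, f_k, f_s, f_c, f_f):
        pair = torch.cat([f_q, f_k], dim=1)
        w_s, w_c, w_f = self.branch_w(pair).chunk(3, dim=1)
        f_fus = self.refine(w_s * f_s + w_c * f_c + w_f * f_f)          # Eq. (4)
        w_l, w_h = self.freq_w(pair).chunk(2, dim=1)
        low = w_l * fixed_depthwise(f_q, GAUSS_3x3) + (1 - w_l) * fixed_depthwise(f_k, GAUSS_3x3)
        high = w_h * fixed_depthwise(f_q, LAPLACE_3x3) + (1 - w_h) * fixed_depthwise(f_k, LAPLACE_3x3)
        return self.post(low + high) + f_fus                            # Eq. (5), residual form
```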

In this manner, the PTFM module not only extracts uncorrelated and complementary features in parallel from three distinct dimensions but also achieves a deep integration and optimization of multi-modal information through dynamic weighting and subsequent frequency-selective fusion. This process provides a rich feature representation essential for generating high-quality fused images.

3.3 High-Frequency Detail Enhancement Module

To proactively enhance the detail expressiveness of the source images before feature fusion, the PMCFusion network integrates a High-Frequency Detail Enhancement Module (HFDEM) at the initial stage of the encoder. This module operates independently on the early-stage feature maps of the infrared and visible images. Its core task is to extract and reinforce the high-frequency components within each respective modality, such as fine textures and edge contours, thereby providing feature inputs that are richer in information and clearer in detail for the subsequent fusion process. The overall architecture of the HFDEM is illustrated in Fig. 3, and its specific operational workflow is as follows.


Figure 3: Detailed structure diagram of HFDEM

To capture detail information at varying granularities, the HFDEM first employs a multi-scale strategy to extract high-frequency components. We apply a set of Laplacian high-pass filters (HPF) with different kernel sizes to the input feature map $F_{in}$ to separate, in parallel, fine-grained, medium-grained, and coarse-grained edge and texture information. The high-frequency features at each scale, $F_H^k$, are then fed into a respective independent convolutional network $G_{enh}^k$ for enhancement, in order to learn and reinforce the particular detail patterns at that scale. To enable the model to focus more on significant detail scales, an adaptive weighting module is also designed. Conditioned on the original input feature $F_{in}$, this module generates corresponding weights for the three high-frequency scales, allowing the network to adaptively assign contribution levels to the high-frequency information from each scale. This process can be represented as:

$$
\begin{aligned}
F_H^k &= \mathrm{HPF}_k(F_{in}), \quad k \in \{3, 5, 7\} \\
E_k &= G_{enh}^k(F_H^k), \quad k \in \{3, 5, 7\} \\
(w_3, w_5, w_7) &= \mathrm{Softmax}\big(G_w(F_{in})\big) \\
E_w^k &= w_k \odot E_k
\end{aligned}
\tag{6}
$$

among them, $\mathrm{HPF}_k$ represents a high-pass filter of size $k \times k$, $F_H^k$ is the high-frequency feature map at the corresponding scale, $G_w$ is the weight generation network, and $\odot$ denotes element-wise multiplication.

Prior to fusing the multi-scale high-frequency information, and to ensure the enhancement process does not introduce noise into smooth regions while simultaneously reinforcing details in edge areas, we introduce an edge-preserving attention mechanism. This mechanism generates a spatial attention map $A_{edge}$, whose values are higher in edge regions and lower in smooth areas. Subsequently, the weighted multi-scale enhanced features are concatenated and then passed through a fusion network $G_{fin}$ to obtain the enhanced high-frequency feature $F_{enh}$. Finally, adhering to the principle of residual learning, we add the enhanced high-frequency feature $F_{enh}$ back to the original input feature $F_{in}$ to obtain the module's final output $F_{out}^{HF}$. This additive fusion ensures that the module preserves the original low-frequency information in its entirety while simultaneously enhancing the high-frequency details.

$$
\begin{aligned}
A_{edge} &= \sigma\big(G_{attn}(F_{in})\big) \\
E_{cat} &= \mathrm{Concat}\big[E_w^3,\ E_w^5,\ E_w^7\big] \\
F_{enh} &= G_{fin}(E_{cat} \odot A_{edge}) \\
F_{out}^{HF} &= F_{in} + F_{enh}
\end{aligned}
\tag{7}
$$

Thus, by leveraging the HFDEM, the network acquires feature representations enriched with fine details at an early stage. This provides a high-quality foundation for the subsequent cross-modal deep fusion, resulting in a significant enhancement of the clarity and detail fidelity in the final fused image.
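A condensed sketch of the HFDEM workflow (Eqs. (6) and (7)) is shown below. The exact form of the Laplacian high-pass kernels at sizes 5 and 7, as well as the internal layouts of $G_{enh}^k$, $G_w$, $G_{attn}$, and $G_{fin}$, are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def laplacian_hpf(x, k):
    """k x k Laplacian-style high-pass filter (all -1 with center k*k - 1), applied per channel."""
    c = x.shape[1]
    w = -torch.ones(k, k, device=x.device, dtype=x.dtype)
    w[k // 2, k // 2] = k * k - 1
    return F.conv2d(x, w.view(1, 1, k, k).repeat(c, 1, 1, 1), padding=k // 2, groups=c)

class HFDEM(nn.Module):
    """High-Frequency Detail Enhancement Module (Eqs. 6-7), with illustrative layer choices."""
    def __init__(self, channels):
        super().__init__()
        self.scales = (3, 5, 7)
        # One enhancement network G_enh^k per scale.
        self.enh = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.1))
            for _ in self.scales
        ])
        # G_w: scale weights conditioned on the input; G_attn: edge-preserving spatial attention.
        self.gw = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, 3, 1))
        self.gattn = nn.Conv2d(channels, 1, 3, padding=1)
        # G_fin: fuses the concatenated, attention-gated multi-scale features.
        self.gfin = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, f_in):
        weights = torch.softmax(self.gw(f_in), dim=1)                     # (w3, w5, w7)
        enhanced = [
            weights[:, i:i + 1] * self.enh[i](laplacian_hpf(f_in, k))     # E_w^k = w_k * E_k
            for i, k in enumerate(self.scales)
        ]
        a_edge = torch.sigmoid(self.gattn(f_in))                          # A_edge
        f_enh = self.gfin(torch.cat(enhanced, dim=1) * a_edge)            # F_enh
        return f_in + f_enh                                               # F_out^HF = F_in + F_enh
```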

3.4 Fusion Module

In the final stage of the network, feature fusion and reconstruction are accomplished by an optimized module. The design of this module is predicated upon the principles of the Self-Fusion Convolutional network (SFC) [35]. While recognizing the efficacy of the SFC framework for feature fusion, we have implemented several key adaptive modifications to better suit the task of infrared and visible image fusion. The core objective of these changes is to enhance detail preservation and prevent information loss during the feature compression process. Our primary optimization involves abandoning the potentially aggressive feature compression methods found in traditional reconstruction modules, opting instead for a smoother, progressive compression strategy to gently fuse cross-modal features. Concurrently, to ensure that the fine-grained features originating from the PTFM survive the complex transformations, we have introduced a parallel residual connection path. This path acts as an information shortcut, enabling the direct transmission of important detailed features to the final reconstruction layer. Furthermore, we have integrated a lightweight channel attention mechanism to accentuate key features and have replaced the conventional tanh function with a more stable Sigmoid activation function to better preserve the dynamic range and grayscale fidelity of the output image. These targeted optimizations allow the module to inherit the core ideas of SFC while generating fusion results with richer details and fewer artifacts for our specific task.
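A hedged sketch of such a reconstruction head is given below; the channel widths, compression steps, and attention reduction ratio are illustrative, but it reflects the three design choices described above: progressive channel compression, a parallel residual shortcut, and a Sigmoid output with lightweight channel attention.

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Reconstruction module of Section 3.4 (sketch): progressive compression, a parallel
    residual shortcut, lightweight channel attention, and a Sigmoid output."""
    def __init__(self, c_in=64):
        super().__init__()
        self.shortcut = nn.Conv2d(c_in, 1, 1)                 # parallel residual path
        self.attn = nn.Sequential(                            # lightweight channel attention
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_in, c_in // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in // 4, c_in, 1), nn.Sigmoid(),
        )
        self.body = nn.Sequential(                            # smooth, progressive compression
            nn.Conv2d(c_in, c_in // 2, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(c_in // 2, c_in // 4, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(c_in // 4, 1, 3, padding=1),
        )

    def forward(self, f):
        f = f * self.attn(f)                                  # emphasize informative channels
        return torch.sigmoid(self.body(f) + self.shortcut(f)) # Sigmoid output instead of tanh
```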

3.5 Loss Function

To ensure that the fused image preserves both the salient thermal information from the infrared image and the rich textural details from the visible image, we impose comprehensive constraints on the fusion result across four dimensions: pixel intensity, perceptual features, spatial structure, and overall fusion quality.

To ensure that the fused image preserves the fundamental content information at the pixel level, we employ the Mean Squared Error (MSE) as a basic constraint. To achieve more fine-grained supervision, we also apply this loss at multiple scales. The multi-scale MSE loss Lmse is defined as:

$$
L_{mse} = \sum_{i=1}^{3} \big(\mathrm{MSE}(F^i, I_{ir}^i) + \mathrm{MSE}(F^i, I_{vis}^i)\big)
\tag{8}
$$

here, $F^i$ denotes the fused image at the $i$-th scale, and $I_{ir}^i$ and $I_{vis}^i$ are the infrared and visible images at the corresponding scale. This loss provides the network with direct and effective content consistency supervision.

Relying solely on pixel-level losses can lead to fused results that lack high-level semantic information. To address this limitation, we introduce a perceptual loss Lp. This loss utilizes a VGG-19 network [36], pre-trained on a large-scale image dataset, as a feature extractor. The single-channel fused image and the source images are first mapped to three-channel representations using an intelligent channel expansion strategy. The perceptual loss is then calculated as the L1 distance between their respective feature maps extracted from multiple layers of the VGG-19 network:

$$
L_p = \sum_{j=1}^{4} \lambda_j \Big(\big\lVert \phi_j(F^1) - \phi_j(I_{ir}^1) \big\rVert_1 + \big\lVert \phi_j(F^1) - \phi_j(I_{vis}^1) \big\rVert_1\Big)
\tag{9}
$$

here, ϕj() denotes the output of the j-th feature layer of the VGG network, and λj is the weight corresponding to that layer. Higher layers are assigned greater weights.
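A sketch of this perceptual loss, built on the torchvision VGG-19 feature extractor, is shown below. The chosen feature-layer indices, their weights, the plain channel replication, and the omission of ImageNet normalization are illustrative simplifications rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """VGG-19 perceptual loss of Eq. (9); layer choices and weights are illustrative."""
    def __init__(self, layer_ids=(3, 8, 17, 26), layer_weights=(0.1, 0.2, 0.3, 0.4)):
        super().__init__()
        vgg = torchvision.models.vgg19(weights=torchvision.models.VGG19_Weights.DEFAULT)
        self.features = vgg.features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)                      # fixed, frozen feature extractor
        self.layer_weights = dict(zip(layer_ids, layer_weights))
        self.last_id = max(layer_ids)

    def _extract(self, x):
        x = x.repeat(1, 3, 1, 1)                         # expand grayscale to three channels
        feats = {}
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.layer_weights:
                feats[i] = x
            if i == self.last_id:
                break
        return feats

    def forward(self, fused, ir, vis):
        f_f, f_ir, f_vis = self._extract(fused), self._extract(ir), self._extract(vis)
        loss = 0.0
        for i, w in self.layer_weights.items():          # higher layers carry larger weights
            loss = loss + w * ((f_f[i] - f_ir[i]).abs().mean() + (f_f[i] - f_vis[i]).abs().mean())
        return loss
```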

To precisely preserve the high-frequency details from the source images, such as edges and textures, we have designed a spatial gradient loss Ls. This loss utilizes first-order Sobel and second-order Laplacian operators to extract gradient information and imposes a constraint on the gradient map of the fused image. Distinct from conventional gradient losses, our target gradient map Gtar is obtained by adaptively weighting the gradients of the source images, a strategy designed to retain the more salient edge information at each spatial location.

$$
\begin{aligned}
G_{tar} &= \frac{G_{ir}}{G_{ir} + G_{vis}} \odot G_{ir} + \frac{G_{vis}}{G_{ir} + G_{vis}} \odot G_{vis} \\
L_S &= \big\lVert G_F - G_{tar} \big\rVert_1 + \gamma \big\lVert L_F - L_{tar} \big\rVert_1
\end{aligned}
\tag{10}
$$

where $G_{ir}$ and $G_{vis}$ are the gradient magnitude maps of the infrared and visible images, respectively, computed using the Sobel operator, and $G_F$ is the gradient magnitude map of the fused image. $L_F$ and $L_{tar}$ denote the Laplacian responses of the fused image and the adaptively weighted target, and $\gamma$ is the balancing weight.
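The sketch below implements one reading of Eq. (10) for single-channel inputs: Sobel gradient magnitudes, gradient-proportional target weighting, and a Laplacian consistency term. The value of gamma and the exact construction of the Laplacian target are assumptions.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)
LAPLACE = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def sobel_magnitude(x):
    """First-order gradient magnitude of a single-channel batch via the Sobel operator."""
    gx = F.conv2d(x, SOBEL_X.to(x), padding=1)
    gy = F.conv2d(x, SOBEL_Y.to(x), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def spatial_gradient_loss(fused, ir, vis, gamma=0.5, eps=1e-6):
    """Spatial gradient loss of Eq. (10) for single-channel inputs; gamma is illustrative."""
    g_ir, g_vis, g_f = sobel_magnitude(ir), sobel_magnitude(vis), sobel_magnitude(fused)
    w_ir = g_ir / (g_ir + g_vis + eps)                        # saliency-proportional weights
    g_tar = w_ir * g_ir + (1.0 - w_ir) * g_vis                # adaptively weighted target G_tar
    lap = lambda x: F.conv2d(x, LAPLACE.to(x), padding=1)     # second-order (Laplacian) response
    lap_tar = w_ir * lap(ir) + (1.0 - w_ir) * lap(vis)        # analogous Laplacian target L_tar
    return (g_f - g_tar).abs().mean() + gamma * (lap(fused) - lap_tar).abs().mean()
```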

To further enhance the visual quality of the fused image and achieve superior performance on multiple objective evaluation metrics, we introduce a novel comprehensive fusion quality loss Lcomp. This loss function integrates metrics for four critical fusion attributes: structural consistency, gradient preservation, mutual information, and texture similarity. By applying this comprehensive loss Lcomp at multiple scales for supervision, we guide the network to generate high-quality fusion results in a holistic manner.

The total loss function Ltotal is the weighted sum of these four loss components, defined as follows:

$$
L_{total} = w_{mse} L_{mse} + w_p L_p + w_s L_s + w_{comp} L_{comp}
\tag{11}
$$

here, the weights $w_{mse}$, $w_p$, $w_s$, and $w_{comp}$ are dynamically adjusted according to the training epoch. This strategy prioritizes basic reconstruction in the initial stages, shifts focus to structural fidelity in the intermediate stages, and finally emphasizes the enhancement of comprehensive fusion metrics in the later stages of training.
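A minimal sketch of such an epoch-dependent weighting schedule follows; the stage boundaries and weight values are hypothetical placeholders that only illustrate the early / middle / late emphasis described above.

```python
def loss_weights(epoch: int, total_epochs: int = 20) -> dict:
    """Epoch-dependent weights for Eq. (11); the boundaries and values are placeholders."""
    t = epoch / max(total_epochs - 1, 1)
    if t < 1 / 3:      # early: prioritize basic pixel-level reconstruction
        return {"mse": 1.0, "p": 0.1, "s": 0.5, "comp": 0.1}
    if t < 2 / 3:      # middle: shift focus to structural and gradient fidelity
        return {"mse": 0.5, "p": 0.3, "s": 1.0, "comp": 0.3}
    return {"mse": 0.3, "p": 0.3, "s": 0.8, "comp": 1.0}   # late: comprehensive fusion metrics

def total_loss(losses: dict, epoch: int):
    """Weighted sum of the four loss terms; `losses` maps 'mse', 'p', 's', 'comp' to tensors."""
    w = loss_weights(epoch)
    return sum(w[k] * losses[k] for k in ("mse", "p", "s", "comp"))
```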

Through the synergistic action of the aforementioned loss terms, the PMCFusion network is capable of learning an optimal fusion strategy, thereby generating infrared and visible fused images that are richer in information, clearer in detail, and more natural in visual appearance.

4  Experiments

4.1 Experiments Settings

In this study, the proposed PMCFusion network was systematically trained and comprehensively evaluated on the public KAIST [37] multispectral dataset. To construct an effective training set and enhance the model's generalization capabilities, we randomly sampled approximately 19,000 well-aligned image pairs from the original dataset and designed an online data augmentation strategy. During each training iteration, the image pairs underwent random horizontal and vertical flips, small-angle rotations, and were finally cropped into 256 × 256 sub-images. Furthermore, we introduced slight noise injection as well as random adjustments to brightness and contrast to simulate variable imaging conditions. All visible images were uniformly converted to a single-channel grayscale format, and the pixel values of all images were normalized to the [−1, 1] range. For the optimizer configuration, we employed the Adam [38] method with parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively, and a weight decay coefficient of $1 \times 10^{-8}$. The initial learning rate was set to $1 \times 10^{-4}$ and was managed by a step-wise learning rate scheduler, which multiplied the learning rate by 0.9 every 10 epochs. All parameters of our proposed PMCFusion network were learned from scratch; however, the calculation of the perceptual loss utilized a VGG-19 [36] model pre-trained on ImageNet as a fixed feature extractor. A key strategy during the training process was the dynamic loss weighting mechanism. Instead of using fixed hyperparameters, we dynamically adjusted the weights of the four loss components via a function dependent on the training epoch, thereby emphasizing different optimization objectives at various stages of training. The entire training process was conducted on a server equipped with four NVIDIA 3080 Ti GPUs and was run for a total of 20 epochs.
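The optimizer and schedule described above map directly onto standard PyTorch components, as in the sketch below; the model, data loader, and per-term loss computation are user-supplied placeholders, and total_loss refers to the loss-weighting sketch in Section 3.5.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, compute_losses, epochs: int = 20):
    """Training loop matching the setup above. `train_loader` yields aligned (ir, vis) pairs;
    `compute_losses` returns the dict of loss terms consumed by `total_loss` (Section 3.5 sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999), weight_decay=1e-8)
    # Step-wise schedule: multiply the learning rate by 0.9 every 10 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
    for epoch in range(epochs):
        for ir, vis in train_loader:
            fused = model(ir, vis)
            loss = total_loss(compute_losses(fused, ir, vis), epoch)  # epoch-dependent weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```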

4.2 Experimental Details

To comprehensively evaluate the fusion performance and generalization capability of the proposed PMCFusion model, systematic experiments were conducted on several public benchmark datasets for infrared and visible image fusion. Specifically, the KAIST dataset was employed for model training and initial validation. This dataset contains a large number of precisely aligned infrared and visible image pairs (640 × 512 pixels), whose clear urban scenes provide an ideal platform for model learning. For the testing phase, 200 pairs were randomly selected from images not seen during training for evaluation. Furthermore, to assess the model’s performance under more challenging conditions, three classical test datasets were introduced. The TNO [39] dataset, comprising 20 image pairs, includes a variety of multispectral images from nighttime and military scenarios and is widely used to evaluate model performance in low-illumination environments. The RoadScene [40] dataset, consisting of 221 pairs of real-world road scene images, is specifically designed to assess model robustness in complex and dynamic traffic environments. Finally, the M3FD [41] dataset, with 300 image pairs, covers an extremely diverse range of environmental, lighting, and weather conditions, serving to ultimately validate the model’s scene adaptability and generalization capability.

4.3 Quantitative Assessment

To demonstrate its effectiveness, we compare PMCFusion against 15 state-of-the-art (SOTA) methods, including DenseFuse [11], SEDRFuse [42], CSF [12], GAN-FM [16], DIDFuse [43], U2Fusion [40], YDTR [44], ERNet [13], SwinFusion [17], TarDAL [41], DATFuse [45], as well as recent approaches such as EMMA [31], CMTFusion [18], SFDFusion [23], and DCEvo [46].

As indicated by the results in Table 1, on the KAIST dataset, PMCFusion achieves EI, SF, and AG scores of 40.943, 10.116, and 3.714, yielding relative gains of approximately 4.1%, 3.4%, and 8.5% over the respective runner-up method (EMMA, with 39.301, 9.783, and 3.420). For EN and Qabf, our results are highly comparable to the best competitors, with differences at the per-mille level (e.g., only 0.005 lower than DCEvo in Qabf), showing competitive information retention and structural preservation. On M3FD, PMCFusion reports an EI of 58.343 and AG of 5.479, outperforming the next best method (EMMA, with 56.045, and 5.128) by approximately 4.1% and 6.8%, respectively. On SF, PMCFusion reaches 16.303, surpassing EMMA (15.227) by 7.1% to achieve the best performance. On RoadScene, PMCFusion delivers notable improvements: EI reaches 72.244, which is 2% higher than EMMA (70.815), while SF and AG achieve 16.550 and 6.403, corresponding to 3.1% and 4.1% gains over EMMA (16.054 for SF, 6.152 for AG), respectively. On TNO, PMCFusion again leads with EI of 49.777 (1.7% higher than EMMA’s 48.927), SF of 12.719 (12.4% gain over EMMA’s 11.315), and AG of 4.808 (5.2% gain over EMMA’s 4.570). Overall, PMCFusion provides consistent and stable advantages on clarity- and detail-related metrics (EI, SF, AG) across all datasets, while also achieving competitive or superior performance on EN and Qabf. These results demonstrate not only the superior perceptual quality of the fused outputs but also the strong cross-dataset generalization capability of the proposed framework.
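For reference, SF and AG, two of the clarity-related metrics reported here, can be computed with their standard definitions as sketched below; this is not the authors' evaluation code.

```python
import numpy as np

def spatial_frequency(img: np.ndarray) -> float:
    """SF: combined row/column frequency of a grayscale image (standard definition)."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))   # horizontal differences
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))   # vertical differences
    return float(np.sqrt(rf ** 2 + cf ** 2))

def average_gradient(img: np.ndarray) -> float:
    """AG: mean magnitude of local intensity changes (standard definition)."""
    img = img.astype(np.float64)
    dx = img[:-1, 1:] - img[:-1, :-1]
    dy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))
```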


4.4 Qualitative Assessment

Fig. 4 displays the fusion results of various algorithms on the KAIST, M3FD, RoadScene, and TNO datasets. These visual comparisons reveal two key advantages of our method. First, our algorithm consistently generates fused images with low levels of noise. Second, the proposed method better preserves the textural details from both source images. For example, the details of the tree within the red box in (1) are well preserved. In (3), only our method enables the detector to find all vehicles: most competing results miss the car on the left, and even those that capture it cause errors on other vehicles; detection on the visible image alone, for instance, yields several spurious truck detections on the right. This further demonstrates the superior performance of our model in fusion tasks.

images

Figure 4: Qualitative comparison results on three datasets: (1) and (2) show the infrared and visible image fusion results on the KAIST and TNO datasets, respectively, and (3) shows the object detection results on the fused images from the M3FD dataset

4.5 Ablation Experiment

We conducted a series of ablation studies to validate the importance of the key components in our proposed algorithm. These experiments were designed to assess the impact of the PTFM, the HFDEM, and the individual loss functions on the overall image fusion performance. The M3FD dataset was used as the exclusive test set for all ablation experiments.

As shown in Table 2, the ablation study demonstrates that removing the spatial, channel, or frequency branch from the PTFM leads to a significant decline in all evaluation metrics, highlighting the importance of the complementary information contributed by the three branches during fusion. When the HFDEM is removed, all metrics related to sharpness and detail drop substantially, indicating that enhancing the high-frequency details of the source images prior to fusion is essential. Regarding the loss functions, removing the spatial loss $L_s$, the perceptual loss $L_p$, or the comprehensive loss $L_{comp}$ each results in varying degrees of performance degradation. In particular, excluding $L_s$ or $L_p$ causes a pronounced decrease, underscoring the significant role these losses play in the quality of the fusion results. Removing $L_{comp}$ leads to a smaller but still consistent decline, confirming its contribution to the overall optimization of the model.


4.6 Complexity and Detection Performance Comparison of Different Methods

We compared the PMCFusion model with other models in terms of inference time, parameter count, and computational load, and evaluated the object detection performance of the fused images using YOLOv5s. The left side of Table 3 shows that, despite the larger number of parameters in our model, it still ranks among the top in inference time. The results on the right side of Table 3 indicate that, although our method was not specifically optimized for downstream object detection tasks, it still maintains excellent detection performance, with detection accuracy about 3% higher than that of the second-best model. These comparisons demonstrate that PMCFusion offers high practical value in real-world applications. We plan to make the model open-source after acceptance.


5  Conclusion

In this paper, we propose a Parallel Multi-dimensional Complementary Fusion Network (PMCFusion). The proposed algorithm first enhances the source images using HFDEM, and subsequently achieves complementary information fusion in the spatial, channel, and frequency domains through the PTFM module, thereby effectively improving visual quality and texture details. Experimental results demonstrate that our method offers consistent advantages in terms of visual perception and quantitative metrics. In addition, while the four benchmark datasets already contain challenging cases such as low illumination, noise, and overexposure, our current study does not explicitly highlight robustness-oriented experiments. As an important extension, we plan to conduct systematic robustness evaluations under synthetic noise, modality misalignment, and simulated adverse weather conditions, to further validate the generalizability of PMCFusion. In the future, another important direction will be to extend the PMCFusion framework to a broader range of image fusion tasks, thereby enhancing its applicability in diverse real-world scenarios.

Acknowledgement: Not applicable.

Funding Statement: This work was supported in part by the Funds for Central-Guided Local Science & Technology Development (Grant No. 202407AC110005), Key Technologies for the Construction of a Whole-Process Intelligent Service System for Neuroendocrine Neoplasm; in part by the Xingdian Talent Project of Yunnan Province, The Key Technology Research and Application of Cross-Domain Automatic Business Collaboration in Smart Tourism (XYYC-CYCX-2022-0005); and in part by the Yunnan Province Zhangjun Expert Workstation (No. 202205AF150081).

Author Contributions: Xu Tao and Hao Li are responsible for the thesis manuscript and experiments, while Zhaoqi Jin and Qiang Xiao are mainly in charge of the data. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The datasets used in this study (KAIST, M3FD, ROADSCENE, and TNO) are publicly available benchmark datasets. Interested researchers can obtain the data from the respective official repositories in accordance with the terms specified by the dataset providers.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Li S, Kang X, Fang L, Hu J, Yin H. Pixel-level image fusion: a survey of the state of the art. Inf Fusion. 2017;33:100–12. doi:10.1016/j.inffus.2016.05.004.

2. Zhang X. Deep learning-based multi-focus image fusion: a survey and a comparative study. IEEE Trans Pattern Anal Mach Intell. 2021;44(9):4819–38. doi:10.1109/tpami.2021.3078906.

3. Liu J, Wu G, Liu Z, Wang D, Jiang Z, Ma L, et al. Infrared and visible image fusion: from data compatibility to task adaption. IEEE Trans Pattern Anal Mach Intell. 2025;47(4):2349–69. doi:10.1109/tpami.2024.3521416.

4. Hermessi H, Mourali O, Zagrouba E. Multimodal medical image fusion review: theoretical background and recent advances. Signal Process. 2021;183:108036. doi:10.1016/j.sigpro.2021.108036.

5. Li C, Song D, Tong R, Tang M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019;85:161–71. doi:10.1016/j.patcog.2018.08.005.

6. Li S, Yin H, Fang L. Group-sparse representation with dictionary learning for medical image denoising and fusion. IEEE Trans Biomed Eng. 2012;59(12):3450–9. doi:10.1109/tbme.2012.2217493.

7. Zhu Z, Yin H, Chai Y, Li Y, Qi G. A novel multi-modality image fusion method based on image decomposition and sparse representation. Inf Sci. 2018;432:516–29. doi:10.1016/j.ins.2017.09.010.

8. Cao L, Jin L, Tao H, Li G, Zhuang Z, Zhang Y. Multi-focus image fusion based on spatial frequency in discrete cosine transform domain. IEEE Signal Process Lett. 2014;22(2):220–4. doi:10.1109/lsp.2014.2354534.

9. Zhou Z, Wang B, Li S, Dong M. Perceptual fusion of infrared and visible images through a hybrid multi-scale decomposition with Gaussian and bilateral filters. Inf Fusion. 2016;30(1):15–26. doi:10.1016/j.inffus.2015.11.003.

10. Liu R, Liu Y, Wang H, Du S. WaveFusionNet: infrared and visible image fusion based on multi-scale feature encoder-decoder and discrete wavelet decomposition. Opt Commun. 2024;573:131024. doi:10.1016/j.optcom.2024.131024.

11. Li H, Wu XJ. DenseFuse: a fusion approach to infrared and visible images. IEEE Trans Image Process. 2018;28(5):2614–23. doi:10.1109/tip.2018.2887342.

12. Wang X, Guan Z, Qian W, Cao J, Liang S, Yan J. CS2Fusion: contrastive learning for self-supervised infrared and visible image fusion by estimating feature compensation map. Inf Fusion. 2024;102:102039. doi:10.1016/j.inffus.2023.102039.

13. Su W, Huang Y, Li Q, Zuo F, Liu L. Infrared and visible image fusion based on adversarial feature extraction and stable image reconstruction. IEEE Trans Instrum Meas. 2022;71:1–14. doi:10.1109/tim.2022.3177717.

14. Yang Y, Liu J, Huang S, Wan W, Wen W, Guan J. Infrared and visible image fusion via texture conditional generative adversarial network. IEEE Trans Circuits Syst Video Technol. 2021;31(12):4771–83. doi:10.1109/tcsvt.2021.3054584.

15. Gao Y, Ma S, Liu J. DCDR-GAN: a densely connected disentangled representation generative adversarial network for infrared and visible image fusion. IEEE Trans Circuits Syst Video Technol. 2022;33(2):549–61. doi:10.1109/tcsvt.2022.3206807.

16. Zhang H, Yuan J, Tian X, Ma J. GAN-FM: infrared and visible image fusion using GAN with full-scale skip connection and dual Markovian discriminators. IEEE Trans Comput Imaging. 2021;7:1134–47. doi:10.1109/tci.2021.3119954.

17. Ma J, Tang L, Fan F, Huang J, Mei X, Ma Y. SwinFusion: cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J Autom Sin. 2022;9(7):1200–17. doi:10.1109/jas.2022.105686.

18. Park S, Vien AG, Lee C. Cross-modal transformers for infrared and visible image fusion. IEEE Trans Circuits Syst Video Technol. 2023;34(2):770–85. doi:10.1109/tcsvt.2023.3289170.

19. Infrared and visible image fusion using ResNet101-based pretrained backbone. Biomed Signal Process Control. 2024;90:106976.

20. Zhang X, Tang B, Hu S. Infrared and visible image fusion based on particle swarm optimization and Dense Blocks. Front Energy Res. 2022;10:1001450. doi:10.3389/fenrg.2022.1001450.

21. Li J, Bai L, Yang B, Li C, Ma L. Graph representation learning for infrared and visible image fusion. IEEE Trans Autom Sci Eng. 2025;22:13801–13. doi:10.1109/tase.2025.3557234.

22. Chen X, Xu S, Hu S, Ma X. DGFD: a dual-graph convolutional network for image fusion and low-light object detection. Inf Fusion. 2025;119:103025. doi:10.1016/j.inffus.2025.103025.

23. Hu K, Zhang Q, Yuan M, Zhang Y. SFDFusion: an efficient spatial-frequency domain fusion network for infrared and visible image fusion. In: ECAI 2024. Amsterdam, The Netherlands: IOS Press; 2024. p. 482–9. doi:10.3233/faia240524.

24. Li H, Liu L, Huang W, Yue C. An improved fusion algorithm for infrared and visible images based on multi-scale transform. Infrared Phys Technol. 2016;74:28–37.

25. Cvejic N, Bull D, Canagarajah N. Region-based multimodal image fusion using ICA bases. IEEE Sens J. 2007;7(5):743–51. doi:10.1109/jsen.2007.894926.

26. Liu C, Qi Y, Ding W. Infrared and visible image fusion method based on saliency detection in sparse domain. Infrared Phys Technol. 2017;83:94–102.

27. Luo Y, He K, Xu D, Shi H, Yin W. Infrared and visible image fusion based on hybrid multi-scale decomposition and adaptive contrast enhancement. Signal Process Image Commun. 2025;130:117228. doi:10.1016/j.image.2024.117228.

28. Gan W, Wu X, Wu W, Yang X, Ren C, He X, et al. Infrared and visible image fusion with the use of multi-scale edge-preserving decomposition and guided image filter. Infrared Phys Technol. 2015;72:37–51.

29. Karim S, Tong G, Li J, Qadir A, Farooq U, Yu Y. Current advances and future perspectives of image fusion: a comprehensive review. Inf Fusion. 2023;90:185–217. doi:10.1016/j.inffus.2022.09.019.

30. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 27–30; Las Vegas, NV, USA. p. 770–8.

31. Zhao Z, Bai H, Zhang J, Zhang Y, Zhang K, Xu S, et al. Equivariant multi-modality image fusion. In: Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2024 Jun 16–22; Seattle, WA, USA. p. 25912–21.

32. Qi W, Zhang Z, Wang Z. DMFuse: diffusion model guided cross-attention learning for infrared and visible image fusion. Chin J Inf Fusion. 2024;1(3):226–42. doi:10.62762/cjif.2024.655617.

33. Zhang X, Hua Z, Ju Y, Zhou W, Liu J, Kot AC. SGDFuse: SAM-guided diffusion for high-fidelity infrared and visible image fusion. arXiv:2508.05264. 2025.

34. Liu J, Tang J, Wu G. Residual feature distillation network for lightweight image super-resolution. In: European Conference on Computer Vision. Cham, Switzerland: Springer; 2020. p. 41–55.

35. Gong S, Zhang S, Yang J, Yuen PC. Self-fusion convolutional neural networks. Pattern Recognit Lett. 2021;152:50–5. doi:10.1016/j.patrec.2021.08.022.

36. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. 2014.

37. Hwang S, Park J, Kim N, Choi Y, So Kweon I. Multispectral pedestrian detection: benchmark dataset and baseline. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition; 2015 Jun 7–12; Boston, MA, USA. p. 1037–45.

38. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv:1412.6980. 2014.

39. Toet A. The TNO multiband image data collection. Data Brief. 2017;15:249.

40. Xu H, Ma J, Jiang J, Guo X, Ling H. U2Fusion: a unified unsupervised image fusion network. IEEE Trans Pattern Anal Mach Intell. 2020;44(1):502–18. doi:10.1109/tpami.2020.3012548.

41. Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W, et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022 Jun 18–24; New Orleans, LA, USA. p. 5802–11.

42. Jian L, Yang X, Liu Z, Jeon G, Gao M, Chisholm D. SEDRFuse: a symmetric encoder-decoder with residual block network for infrared and visible image fusion. IEEE Trans Instrum Meas. 2020;70:1–15. doi:10.1109/tim.2020.3022438.

43. Zhao Z, Xu S, Zhang C, Liu J, Li P, Zhang J. DIDFuse: deep image decomposition for infrared and visible image fusion. arXiv:2003.09210. 2020.

44. Tang W, He F, Liu Y. YDTR: infrared and visible image fusion via Y-shape dynamic transformer. IEEE Trans Multimedia. 2022;25:5413–28. doi:10.1109/tmm.2022.3192661.

45. Tang W, He F, Liu Y, Duan Y, Si T. DATFuse: infrared and visible image fusion via dual attention transformer. IEEE Trans Circuits Syst Video Technol. 2023;33(7):3159–72. doi:10.1109/tcsvt.2023.3234340.

46. Liu J, Zhang B, Mei Q, Li X, Zou Y, Jiang Z, et al. DCEvo: discriminative cross-dimensional evolutionary learning for infrared and visible image fusion. In: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2025 Jun 10–17; Nashville, TN, USA. p. 2226–35.




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.