Computers, Materials & Continua
Artifacts Reduction Using Multi-Scale Feature Attention Network in Compressed Medical Images
Department of Convergence IT Engineering, Kyungnam University, Changwon, 51767, Korea
*Corresponding Author: Dongsan Jun. Email: firstname.lastname@example.org
Received: 01 June 2021; Accepted: 03 July 2021
Abstract: Medical image compression is one of the essential technologies to facilitate real-time medical data transmission in remote healthcare applications. In general, image compression can introduce undesired coding artifacts, such as blocking artifacts and ringing effects. In this paper, we propose a Multi-Scale Feature Attention Network (MSFAN) with two essential parts, multi-scale feature extraction layers and feature attention layers, to efficiently remove coding artifacts from compressed medical images. The multi-scale feature extraction layers have four Feature Extraction (FE) blocks. Each FE block consists of five convolution layers and one CA block for a weighted skip connection. In order to optimize the proposed network architecture, a variety of verification tests were conducted using a validation dataset. We used the Computer Vision Center-Clinic Database (CVC-ClinicDB), consisting of 612 colonoscopy medical images, to evaluate the enhancement of image restoration. The proposed MSFAN achieves average PSNR gains of up to 0.25 and 0.24 dB over DnCNN and DCSC, respectively.
Keywords: Medical image processing; convolutional neural network; deep learning; telemedicine; artifact reduction; image restoration
1 Introduction
In the telemedicine field, a large number of medical images are produced by endoscopy, Computed Tomography (CT), and Magnetic Resonance Imaging (MRI). Because these medical images must retain high quality to support accurate medical diagnoses, image compression is one of the essential technologies to facilitate real-time medical data transmission in remote healthcare applications. Although the latest image compression methods can provide powerful coding performance without noticeable quality loss, both diagnostic uncertainty and degradation of subjective quality can be caused by image compression in low-bitrate environments with limited network bandwidth. In general, image compression can introduce undesired coding artifacts such as blocking artifacts and ringing effects, primarily due to block-based coding that removes high-frequency components. Because these artifacts decrease perceptual visual quality, there is a need to reduce them in compressed medical images.
Deep learning methods using Convolutional Neural Networks (CNNs) have shown great potential in low-level computer vision applications such as Super Resolution (SR) [2–8], image denoising [9–16], and image colorization [17–18]. In particular, CNN-based image denoising methods have been developed with deeper and denser network architectures [19–20]. Recently, these methods have tended toward more complicated architectures with enormous numbers of network parameters, excessive convolution operations, and high memory usage. In addition, most networks were originally designed to remove coding artifacts from natural images, so applying them directly to medical images leads to unsatisfactory performance. In this paper, we propose a novel CNN structure to efficiently improve the quality of compressed medical images, as shown in Fig. 1. The main contributions of this paper are summarized as follows:
• In order to reduce coding artifacts of compressed medical images, we propose a Multi-Scale Feature Attention Network (MSFAN) with two essential parts: multi-scale feature extraction layers and feature attention layers.
• Through a variety of ablation studies, the proposed network architecture was verified to guarantee its optimal performance for coding artifact reduction.
• Finally, we evaluated the image restoration performance on natural images as well as medical images to demonstrate the versatility of the proposed MSFAN.
The remainder of this paper is organized as follows. In Section 2, we review previous CNN-based image restoration methods for removing coding artifacts. The proposed method is described in Section 3. Finally, experimental results and conclusions are given in Sections 4 and 5, respectively.
2 Related Works
With the advancement of deep learning algorithms, research on low-level computer vision tasks such as SR and image denoising has been combined with various CNN architectures to achieve better image restoration. In the area of SR, Dong et al. proposed a Super Resolution Convolutional Neural Network (SRCNN) consisting of three convolutional layers. SRCNN can learn an end-to-end pixel mapping from an interpolated low-resolution image to a high-resolution image. Since the advent of SRCNN, CNN-based image restoration methods have been reported with various deep learning models [21–27].
In terms of artifact reduction of compressed images, these methods can be applied to reduce coding artifacts. Because SR networks generally have up-sampling layers, the size of the output image is larger than that of the input image; in image denoising networks, on the other hand, the output image has the same size as the input image. Dong et al. also proposed an Artifacts Reduction CNN (ARCNN) to reduce the coding artifacts introduced by Joint Photographic Experts Group (JPEG) compression. Chen et al. presented a Trainable Nonlinear Reaction Diffusion (TNRD) model for a variety of image restoration tasks, such as Gaussian image denoising, SR, and JPEG deblocking. Zhang et al. proposed a Denoising CNN (DnCNN) utilizing residual learning and batch normalization to enhance network training as well as denoising performance. Fu et al. proposed a Deep Convolutional Sparse Coding (DCSC) network that exploits multi-scale image features using three different dilated convolutions.
In terms of artifact reduction of compressed video sequences, CNN-based video restoration methods show better performance than conventional methods. Lee et al. proposed an algorithm to remove color artifacts using block-level quantization parameter offset control in compressed High Dynamic Range (HDR) videos. Dai et al. proposed a CNN-based video restoration network, namely the Variable-filter-size Residue-learning CNN (VRCNN), which can be applied to images compressed by High Efficiency Video Coding (HEVC). Compared to ARCNN, this method improves PSNR and reduces the number of parameters by using small filter sizes. Meng et al. proposed a Multi-channel Long-Short-term Dependency Residual Network (MLSDRN), which updates each cell to adaptively store and select long-term and short-term dependency information in HEVC. The aforementioned image and video denoising networks can be deployed in the preprocessing stage of various high-level computer vision applications, such as object recognition [30–32] and detection [33–34], to achieve higher accuracy.
As depicted in Fig. 2, Hu et al. presented a Channel Attention (CA) block, namely the Squeeze-and-Excitation Network (SENet), which adaptively recalibrates channel-wise feature responses to represent interdependencies between feature maps, where GAP, X, and X̃ indicate the global average pooling operation, the input feature maps of the CA block, and the output feature maps, respectively. Zhang et al. proposed a very deep Residual Channel Attention Network (RCAN), which deploys a CA block to adaptively rescale channel-wise features for improving SR performance. Ding et al. proposed a Squeeze-and-Excitation Filtering CNN (SEFCNN) to fully explore the relationship between channels in the HEVC in-loop filter.
3 Proposed Methods
3.1 Overall Architecture of MSFAN
Fig. 3 shows the overall architecture of the proposed Multi-Scale Feature Attention Network (MSFAN) to remove coding artifacts in compressed medical images. It consists of an input layer, multi-scale feature extraction layers, feature attention layers, and an output layer. The convolutional operation of MSFAN calculates output feature maps (F_i) from the previous feature maps (F_{i−1}) as expressed in Eq. (1):

F_i = σ(W_i ∗ F_{i−1} + B_i)    (1)

where σ, W_i, B_i, and '∗' represent the Parametric Rectified Linear Unit (PReLU) function as an activation function, filter weights, biases, and the convolutional operation, respectively. For fast and stable network training, the proposed MSFAN uses a residual learning scheme with skip connections. Specifically, the input image is added to the feature map of the output layer using a skip connection to learn the residual image. In addition, the output feature maps of the input layer are added to the output feature maps of the feature attention layers using CA-based weighted skip connections.
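The layer operation of Eq. (1) can be sketched in plain Python. This is a hedged toy version: it uses a 1-D convolution with hypothetical filter values, whereas the network itself applies 2-D convolutions with learned weights.

```python
# Toy sketch of Eq. (1): F_i = sigma(W_i * F_{i-1} + B_i), shown in 1-D.
# The kernel, bias, and input values below are hypothetical.

def prelu(x, a=0.25):
    """Parametric ReLU: identity for positive inputs, slope `a` otherwise."""
    return x if x > 0 else a * x

def conv1d_prelu(signal, kernel, bias=0.0):
    """Valid-mode 1-D convolution (as cross-correlation) followed by PReLU."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        s = sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        out.append(prelu(s))
    return out

features = [1.0, -2.0, 3.0, -4.0, 5.0]
print(conv1d_prelu(features, [0.5, 0.5]))  # -> [-0.125, 0.5, -0.125, 0.5]
```

Negative pre-activations are scaled by the PReLU slope (0.25 here) instead of being zeroed, which keeps gradients flowing for negative responses.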
As shown in Fig. 4a, the CA block consists of GAP and two convolutional layers. Because the CA block can emphasize more important feature maps for better network training, it assigns weights (s) to each channel of the input feature maps to adaptively control the channel-wise feature response, as expressed in Eq. (2):

s = f_sig(W₂ ∗ δ(W₁ ∗ GAP(X)))    (2)

where f_sig indicates the sigmoid function, δ the activation function, and W₁ and W₂ the weights of the two convolutional layers. Then, the output feature maps (X̃) of the CA block are generated by the channel-wise product operation '⊙' as shown in Eq. (3):

X̃_c = s_c ⊙ X_c    (3)
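The data flow of the CA block in Eqs. (2) and (3) can be illustrated with a minimal sketch. As a simplifying assumption, the two convolutional layers are replaced by identity weights, so the example shows only the squeeze-excite-scale pipeline, not learned behavior.

```python
import math

# Hedged sketch of the CA block: global average pooling per channel,
# a sigmoid excitation, then channel-wise rescaling (Eq. (3)).
# The excitation weights are identity here for illustration only.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(channels):
    """channels: list of feature maps, each given as a flat list of activations."""
    # GAP: squeeze each channel into one scalar descriptor
    descriptors = [sum(c) / len(c) for c in channels]
    # Excitation (identity weights assumed) followed by the sigmoid f_sig
    weights = [sigmoid(d) for d in descriptors]
    # Eq. (3): channel-wise product between weights and input channels
    return [[w * v for v in c] for w, c in zip(weights, channels)]

out = channel_attention([[1.0, 1.0], [0.0, 0.0]])
# Channel 0 is scaled by sigmoid(1.0), channel 1 by sigmoid(0.0) = 0.5
```

The key point is that each channel receives a single scalar weight derived from its global statistics, so informative channels can be amplified relative to less useful ones.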
Multi-scale feature extraction layers have four Feature Extraction (FE) blocks. Each FE block consists of five convolution layers and one CA block for a weighted skip connection, as shown in Fig. 4b. In the FE blocks, we used dilated convolutional operations with three different dilation factors (DF) to extract multi-scale features, as depicted in Fig. 5. Because large filters substantially increase the number of parameters, we deployed dilated convolutions to obtain a wide receptive field without additional network parameters. Note that a CA-based weighted skip connection was also implemented in each FE block to train interdependencies between multi-scale channels.
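The parameter-free widening of the receptive field can be checked with a short calculation: a kernel of width k with dilation factor d covers d·(k−1)+1 input samples while still holding only k weights. The dilation factors 1, 2, and 3 below are hypothetical examples, not necessarily the values used in the paper.

```python
# Effective coverage of a dilated filter: d*(k-1) + 1 input samples
# for kernel width k and dilation factor d, with only k weights.

def effective_kernel_size(k, d):
    """Span of input samples covered by a k-tap filter with dilation d."""
    return d * (k - 1) + 1

for d in (1, 2, 3):  # hypothetical dilation factors
    print(d, effective_kernel_size(3, d))
# The receptive field widens (3, 5, 7) with no additional parameters.
```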
In the feature attention layers, the concatenated feature maps from all FE blocks are used as the input of the next CA block. After the CA block generates its output feature maps, they are fed into a bottleneck layer to reduce the number of output feature maps. That is, the bottleneck layer both decreases the number of filter weights and compresses the number of feature maps. The output feature maps of the feature attention layers are computed by an element-wise sum between the output feature maps of the input layer and the output of the bottleneck layer. Finally, the output layer generates a predicted residual image between the input and original images. Note that we used zero padding so that all feature maps have the same spatial resolution across different convolutional layers; the padding size p is determined by Eq. (4):

p = ⌊w / 2⌋    (4)

where w and ⌊·⌋ indicate the width of the filter and the rounding-down operation, respectively.
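The padding rule of Eq. (4) preserves spatial size for odd filter widths at stride 1, which a quick sketch confirms:

```python
# Eq. (4) as code: padding floor(w / 2) on each side keeps the spatial
# size unchanged for an odd filter width w at stride 1.

def same_padding(filter_width):
    return filter_width // 2

def output_width(input_width, filter_width):
    pad = same_padding(filter_width)
    return input_width + 2 * pad - filter_width + 1

print(output_width(64, 3))  # 64
print(output_width(64, 5))  # 64
```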
3.2 MSFAN Training
In order to find the optimal network parameters, various hyper-parameters were set as presented in Tab. 1. We used the loss function expressed in Eq. (5), which applies L2 regularization in addition to the Mean Square Error (MSE) data loss:

L(Θ) = (1/2N) Σ_{i=1}^{N} ‖Y_i − Ŷ_i‖² + λ‖Θ‖²    (5)

In Eq. (5), Θ, N, Y_i, Ŷ_i, and λ denote the set of network parameters (filter weights and biases), the batch size, the original image, the restored image, and the weight decay factor, respectively. Note that the proposed MSFAN used a weight decay scheme during network training to ensure generalization performance on various test datasets. In the training stage, the set of network parameters is updated using the Adam optimizer with a batch size of 128. In addition, filter weights are initialized by orthogonal initialization.
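A minimal sketch of this loss, treating images as flat lists of pixel values and the parameter set as a flat list of weights (all values hypothetical):

```python
# Sketch of the Eq. (5)-style loss: an MSE data term plus L2 weight
# decay. `params` stands in for the flattened network parameters.

def training_loss(originals, restored, params, weight_decay=1e-4):
    n = len(originals)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(originals, restored)) / (2 * n)
    reg = weight_decay * sum(p ** 2 for p in params)
    return mse + reg

print(training_loss([1.0, 2.0], [1.0, 1.5], [0.1, -0.2], weight_decay=0.5))
```

The weight decay term penalizes large filter weights, which is how the generalization behavior mentioned above is encouraged.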
4 Experimental Results
All experiments were performed on an Intel Xeon Gold 5120 (14 cores @ 2.20 GHz) with 177 GB RAM and two NVIDIA Tesla V100 GPUs under the experimental environment described in Tab. 2. For performance comparison, the proposed MSFAN was compared with ARCNN, DnCNN, and DCSC in terms of image restoration and network complexity.
4.1 Performance Comparisons for Medical Images
In order to evaluate the enhancement of image restoration, we used the Computer Vision Center-Clinic Database (CVC-ClinicDB), consisting of 612 colonoscopy medical images. We randomly divided CVC-ClinicDB into a training dataset (315 images), a validation dataset (103 images), and a test dataset (194 images). Note that all images were converted to the YUV color format, only the Y component was retained, and the images were compressed by the JPEG codec under four different quality factors (10, 20, 30, and 40) to produce various coding artifacts. As a pre-processing step on the training dataset, we cropped the edges of each training image to remove unnecessary boundaries and extracted fixed-size training patches without overlap. As a result, we collected 96,768 patches from the training dataset.
To evaluate the enhancement of image restoration, we measured the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) between the original and restored images. As shown in Tabs. 3 and 4, the proposed MSFAN achieves average PSNR gains of up to 0.25 and 0.24 dB over DnCNN and DCSC, respectively. In addition, the proposed MSFAN showed better average SSIM results than the other methods. Fig. 6 shows examples of visual comparisons between the proposed MSFAN and previous methods on the test dataset. For each image in Fig. 6, the second row shows a zoom-in of the area indicated by the red box. These results verify that the proposed network can recover structural information effectively and find more accurate textures than the other methods.
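For reference, PSNR between an original and a restored 8-bit image follows directly from the MSE, with a peak value of 255 (the toy pixel values below are hypothetical):

```python
import math

# PSNR between original and restored pixel values (8-bit, peak = 255).

def psnr(original, restored, peak=255.0):
    mse = sum((o - r) ** 2 for o, r in zip(original, restored)) / len(original)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

print(round(psnr([100, 120, 140], [101, 119, 141]), 2))  # 48.13
```

An MSE of 1 on 8-bit data corresponds to about 48.13 dB, which puts the reported 0.25 dB average gain in perspective: it reflects a consistent reduction in reconstruction error.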
4.2 Performance Comparisons for Natural Images
We further evaluated the proposed MSFAN on natural images to demonstrate the versatility of our network. For training the MSFAN with a natural image dataset, we used 400 images from BSD500. Similar to the medical images, all training images were converted to the YUV color format and only fixed-size Y-component patches were extracted, using data augmentation including rotation and flipping. For the test dataset, we used Classic5, which is commonly used as a testing dataset in various image restoration studies [19–20]. Tabs. 5 and 6 show the average PSNR and SSIM results on Classic5, respectively. While the proposed MSFAN had marginally lower average PSNR values than DnCNN, its SSIM results were superior to those of the comparison networks except when the JPEG quality factor was 10.
4.3 Ablation Studies
In order to optimize the proposed network architecture, we conducted a variety of verification tests using the validation dataset. First, we performed tool-off tests to verify the effectiveness of the essential parts of the proposed network, as shown in Tab. 7. The results of the tool-off tests confirmed that both the FE and CA blocks contribute to the performance of image restoration. Additionally, we conducted two verification tests to determine the optimal number of channels and of 1 × 1 convolutional layers in the CA block. Tabs. 8 and 9 show that the proposed MSFAN has an optimal network architecture.
4.4 Computational Complexity
In order to investigate network complexity, we analyzed the number of parameters, the total memory size, and the inference speed on the test dataset. Note that the total memory size denotes the amount of memory required to store both the network parameters and the feature maps. As shown in Tab. 10, the proposed MSFAN has a smaller total memory size than both DnCNN and DCSC, although it has more network parameters than the other methods. In addition, Fig. 7 shows that the inference speed of our network is almost the same as that of DCSC on the CVC-ClinicDB test dataset.
5 Conclusion
Medical image compression is one of the essential technologies to facilitate real-time medical data transmission in remote healthcare applications. In general, image compression is known to introduce undesired coding artifacts, such as blocking artifacts and ringing effects. In this paper, we proposed a Multi-Scale Feature Attention Network (MSFAN) with two essential parts, multi-scale feature extraction layers and feature attention layers, to efficiently remove the coding artifacts of compressed medical images. The multi-scale feature extraction layers have four Feature Extraction (FE) blocks, and each FE block consists of five convolution layers and one CA block for a weighted skip connection. In order to optimize the proposed network architecture, we conducted a variety of verification tests using the validation dataset. We used the Computer Vision Center-Clinic Database (CVC-ClinicDB), consisting of 612 colonoscopy medical images, to evaluate the enhancement of image restoration. The proposed MSFAN achieves average PSNR gains of up to 0.25 and 0.24 dB over DnCNN and DCSC, respectively.
Acknowledgement: This work was supported by Kyungnam University Foundation Grant, 2020.
Funding Statement: This work was supported by Kyungnam University Foundation Grant, 2020.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.