Computers, Materials & Continua
AF-Net: A Medical Image Segmentation Network Based on Attention Mechanism and Feature Fusion
1College of Computer Science and Information Technology, Central South University of Forestry & Technology, Changsha, 410004, China
2Department of Mathematics and Computer Science, Northeastern State University, Tahlequah, 74464, OK, USA
*Corresponding Author: Jiaohua Qin. Email: firstname.lastname@example.org
Received: 31 January 2021; Accepted: 16 April 2021
Abstract: Medical image segmentation is an important application of computer vision in medical image processing. Because different organs in medical images lie close together and look highly similar, current segmentation algorithms suffer from mis-segmentation and poor edge segmentation. To address these challenges, we propose a medical image segmentation network (AF-Net) based on attention mechanism and feature fusion, which can effectively capture global information while focusing the network on the object area. In this approach, we add dual attention blocks (DA-block), comprising parallel channel and spatial attention branches, to the backbone network to adaptively calibrate and weight features. Secondly, the multi-scale feature fusion block (MFF-block) is proposed to obtain feature maps of different receptive fields and gather multi-scale information with less computational consumption. Finally, to restore the locations and shapes of organs, we adopt global feature fusion blocks (GFF-block) to fuse high-level and low-level information, which yields accurate pixel positioning. We evaluate our method on multiple datasets (the aorta and lung datasets), and the experimental results achieve 94.0% in mIoU and 96.3% in DICE, showing that our approach performs better than U-Net and other state-of-the-art methods.
Keywords: Deep learning; medical image segmentation; feature fusion; attention mechanism
1 Introduction
Nowadays, deep learning has been applied to information hiding [1–3], image classification [4,5], image retrieval [6,7], image restoration and reconstruction [8,9], object recognition and detection [10,11], and many other fields [12,13]. Among them, deep learning is widely used in image segmentation, and medical image segmentation has become one of the hot topics in artificial-intelligence medicine.
The Fully Convolutional Network (FCN), an end-to-end image segmentation method, is a representative work of deep learning applied to image segmentation. Since then, many new architectures based on FCN have appeared, which output dense pixel-wise predictions and achieve fine-grained classification. Existing algorithms generally use feature fusion or attention mechanisms to improve the performance of FCN. On the one hand, feature fusion can integrate information from different layers. For example, the splicing method is used in U-Net and its variant networks [16–18] to fuse high-level and low-level features. But this method cannot make the most of context information and leads to feature suppression. Subsequently, a series of more complex and effective feature fusion methods appeared. These methods [19–21] fuse processed low-level features with high-level features to improve feature utilization. However, medical image segmentation algorithms using feature fusion suffer from high feature redundancy and large computational consumption. Multi-scale feature fusion [22,23] is also a popular feature fusion method applied to many tasks such as panoramic segmentation. Nevertheless, the convolution kernels used in the pyramid structure are large and take up many computing resources.
On the other hand, the attention mechanism can filter and weight features. Many methods based on the attention mechanism selectively aggregate heterogeneous context information by learning channel attention, spatial attention [24,25], or point attention [26,27]. Unfortunately, medical image segmentation based on the attention mechanism faces the problem of mis-segmentation due to the lack of explicit regularization, and usually incurs high computational cost owing to large convolution kernels.
To solve the problems above, we design the AF-Net model based on attention mechanism and feature fusion. It can effectively obtain multi-scale and global information for accurate medical image segmentation. In this approach, we use dual attention blocks (DA-block) as the encoder's primary block to select features and obtain a more effective feature representation. Then, multi-scale feature fusion is adopted in the decoder to promote the understanding of global context information. Moreover, we use small convolution kernels to reduce computational resource consumption. Finally, we fuse low-level features with weighted high-level features. The main contributions of our proposed method include the following three parts:
1. Propose the parallel channel and spatial attention blocks. The DA-blocks proposed in our method select features to fully ensure the effectiveness of information during propagation and focus the network on the target area to obtain a more precise feature representation, thus achieving a better segmentation effect.
2. Design a multi-scale feature fusion module with low computational consumption. The MFF-block promotes the understanding of global context information, and the small convolution kernels used in this module reduce computational resource consumption, which effectively decreases mis-segmentation.
3. Adopt global feature fusion modules. We combine low-level features with high-level features through the GFF-blocks to fully use different levels of information, so our method can obtain accurate pixel positioning and achieve better edge segmentation effects.
The structure of the remaining part is given as follows. Section 2 reviews some related research. Section 3 introduces the proposed method. Section 4 presents the extensive experimental evaluations. Finally, Section 5 concludes this paper.
2 Related Work
2.1 Feature Fusion
Many approaches based on feature extraction and fusion [28,29] have been proposed and applied to medical image segmentation. Fu et al. applied a conditional random field to U-Net to improve segmentation accuracy by obtaining multi-scale feature maps. Since then, M-Net performed target segmentation by adding a multi-scale input and a deep supervision mechanism to U-Net. The feature pyramid network generated multi-scale features using four different sizes of convolution kernels, which can produce feature maps of different scales. However, the convolution kernel sizes selected in these approaches are large and take up many computing resources. Li et al. proposed the fusion of pyramid features and spliced feature projections of different scales into different layers; still, the addition operation used there easily suppresses features. Gu et al. proposed dense dilation connection and residual multi-kernel pooling modules for extracting and merging multi-scale features to obtain good segmentation results. It is worth noting that module reuse increases the model parameters and does not perform well in small-organ segmentation tasks.
2.2 Attention Mechanism
The attention mechanism has recently been a research hotspot in image segmentation. Attention u-net suppressed unrelated background areas using attention gates, which combine the outputs of the encoder and decoder. SE-Net established the interdependence among feature channels to achieve adaptive channel calibration. Danet adopted various matrix operations followed by element-wise addition to achieve a good segmentation effect. Later, Sinha et al. expanded it by adding a semantic reconstruction unit and using a joint loss function to improve segmentation accuracy. However, both of these methods consume large amounts of computing resources. Resnet_cbam added a serial attention branch to each codec module of Res-net, which improves segmentation accuracy to a certain extent. Since medical images are grayscale, repeated filtering may cause considerable information loss and lower the accuracy of medical image segmentation.
The methods above have improved the accuracy of medical image segmentation to a certain extent. However, the information among pixels in the image is reduced or lost due to the extensive use of upsampling. Meanwhile, these methods have difficulty segmenting similar organs owing to position changes and inter-organ similarity.
3 Our Method
This section discusses the proposed AF-Net framework, which is an encoder-decoder network based on attention mechanism and feature fusion. Our framework consists of two parts: the encoder based on DA-blocks and the decoder based on MFF-block and GFF-blocks, as shown in Fig. 1. We use the DA-blocks to filter and weight the preprocessed image to generate multi-scale features for feature fusion in the following steps. Then we add MFF-block for further extraction to get deeper global information. Finally, we integrate the encoder’s multi-scale features with the decoder’s intermediate output to generate accurate pixel location.
3.1 Feature Encoder Based on DA-Block
In this paper, we first use the conv-block to change the number of channels and obtain a feature map half the size of the original image; the details can be seen in Fig. 2. Then we select the first four blocks of Res-net and add an attention module to form the DA-block. As shown in Fig. 3, we add the attention module before the short skip connection to obtain more helpful information without excessive extra computing consumption.
The attention mechanism has been widely used in recent years. Sinha et al. adopted the reuse of the attention module and performed semantic reconstruction, which consumes immoderate computing resources and is hard to train. Resnet_cbam used a serial attention module to repeatedly filter features, which undeniably leads to the loss of detailed information in medical images. Different from these methods, we add an attention module to each unit of the encoder to weight and filter the features while reducing the computational cost, as shown in Fig. 3.
The DA-block designed in this paper adopts a parallel processing method to avoid the loss of detailed information, which is more suitable for medical image segmentation tasks. $F$ is the input feature map, $F_C$ refers to the output of the channel attention branch, and $F_S$ refers to the output of the spatial attention branch. The process can be summarized as:

$$F_{ATT} = F_C \oplus F_S$$

where $\oplus$ represents element-wise addition, and $F_{ATT}$ denotes the overall output feature map, with $F_{ATT} \in \mathbb{R}^{C \times H \times W}$. Fig. 4 depicts the calculation process of each attention map.
As shown in Fig. 4, in the channel attention branch, we first use adaptive average and maximum pooling operations to obtain $F_{Gap}$ and $F_{Gmp}$. Next, we add them element by element and then apply the sigmoid function to obtain a channel feature weight of size $C \times 1 \times 1$. Finally, we multiply the channel weight and $F$ to get the weighted channel feature map. The calculation process is as follows:

$$F_C = \sigma\left(f(Gap(F)) \oplus f(Gmp(F))\right) \otimes F$$

where $\sigma$ denotes the sigmoid function, $Gap$ and $Gmp$ respectively denote global average pooling and global maximum pooling, $f$ denotes the convolution and regularization operations after each pooling operation, and $\otimes$ denotes element-wise multiplication.
Similar to the channel attention branch, we first take the average and maximum value of each position across the entire channel dimension as the spatial feature maps $F_{Mean}$ and $F_{Max}$, followed by a concatenation operation. Then, we use the sigmoid function to activate the weight of spatial features. Finally, we multiply the weight of spatial features and $F$ to obtain the spatially weighted feature map. The calculation can be formulated as:

$$F_S = \sigma\left(f([F_{Mean}; F_{Max}])\right) \otimes F$$

where $Mean$ and $Max$ refer to the operations of finding the average and maximum value in the channel dimension, respectively, and $[\cdot\,;\cdot]$ denotes the concatenation operation.
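As a concrete illustration, the two parallel branches above can be sketched in PyTorch (the framework used in this paper). The exact layer sizes are not given in the text, so the $1 \times 1$ convolutions in the channel branch and the $7 \times 7$ kernel in the spatial branch are our assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Sketch of the parallel channel/spatial attention (DA-block core).
    Kernel sizes and layer arrangement are illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        # f: conv + regularization applied after each pooling (channel branch)
        self.f_avg = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))
        self.f_max = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))
        # f: conv over the 2-channel [mean; max] map (spatial branch)
        self.f_spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.BatchNorm2d(1))

    def forward(self, x):
        # channel attention: sigmoid(f(Gap(F)) + f(Gmp(F))) * F
        gap = torch.mean(x, dim=(2, 3), keepdim=True)   # B x C x 1 x 1
        gmp = torch.amax(x, dim=(2, 3), keepdim=True)
        f_c = torch.sigmoid(self.f_avg(gap) + self.f_max(gmp)) * x
        # spatial attention: sigmoid(f([Mean(F); Max(F)])) * F
        mean = torch.mean(x, dim=1, keepdim=True)       # B x 1 x H x W
        mx, _ = torch.max(x, dim=1, keepdim=True)
        f_s = torch.sigmoid(self.f_spatial(torch.cat([mean, mx], dim=1))) * x
        return f_c + f_s                                # element-wise fusion
```

Both branches weight the same input $F$ and are only summed at the end, so neither branch's filtering can discard details before the other sees them.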
3.2 Multi-Scale Feature Fusion
Inspired by pyramid feature fusion [22,23], we propose the MFF-block, which uses smaller convolution kernels to obtain feature maps of various receptive fields. The dilated convolution aims to make up for the loss caused by the down-sampling process; it uses a padding operation to obtain multi-scale, high-resolution feature maps without changing the feature map size.
Therefore, we use the advantages of dilated convolution to design the MFF-block for feature extraction and fusion, as shown in Fig. 5. Let $F_{in}$ be the input feature of the MFF-block, where $F_{in} \in \mathbb{R}^{C \times H \times W}$. Three cascaded dilated convolutions are adopted to replace a large-kernel convolution and obtain $F_1$. Furthermore, two cascaded dilated convolutions are applied to substitute a medium-kernel convolution and get $F_2$, and a single dilated convolution is used to obtain $F_3$. After each branch, a $1 \times 1$ convolution is employed, followed by the concatenation process. The output feature map, which has the same size as $F_{in}$, is obtained after channel reduction. The formula is as follows:

$$F_{out} = conv1([F_1; F_2; F_3])$$

where $conv1$ refers to the $1 \times 1$ convolution.
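A minimal PyTorch sketch of this block follows. The dilation rates, the $3 \times 3$ kernel size, and the ReLU activations are illustrative assumptions, since only the branch structure is described above:

```python
import torch
import torch.nn as nn

class MFFBlock(nn.Module):
    """Sketch of the multi-scale feature fusion block: three dilated-conv
    branches of depth 3, 2, and 1, each closed by a 1x1 conv, then concat
    and channel reduction. Rates and depths are assumptions."""
    def __init__(self, channels):
        super().__init__()
        def dconv(rate):
            # 3x3 dilated conv; padding=rate keeps H x W unchanged
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
                nn.ReLU(inplace=True))
        self.branch1 = nn.Sequential(dconv(1), dconv(2), dconv(3),
                                     nn.Conv2d(channels, channels, 1))
        self.branch2 = nn.Sequential(dconv(1), dconv(2),
                                     nn.Conv2d(channels, channels, 1))
        self.branch3 = nn.Sequential(dconv(1),
                                     nn.Conv2d(channels, channels, 1))
        self.reduce = nn.Conv2d(3 * channels, channels, 1)  # conv1: channel reduction

    def forward(self, x):
        # F_out = conv1([F1; F2; F3]), same size as the input feature map
        out = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.reduce(out)
```

Because padding always equals the dilation rate, every branch preserves the spatial resolution, so the three receptive fields can be concatenated directly.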
3.3 Global Feature Fusion
The networks based on the U-Net structure [15–18] use long skip connections to splice low-level features with high-level features directly, which inevitably destroys the information after activation. For this reason, we combine low-level and high-level information through the GFF-blocks, which fully integrates various details and locates pixels more accurately. The details can be seen in Fig. 6.
Given two features $F_H$ and $F_L$, where $F_H$ refers to the high-level feature map and $F_L$ refers to the low-level feature map, we use global average pooling to obtain the most significant pixel information in $F_H$. We then perform batch normalization and the sigmoid function on the pooled $F_H$ to get the feature indication $\hat{F}_H$, which is regarded as the guide of low-level features. Simultaneously, we use a $3 \times 3$ convolution to reduce the channel number of $F_L$ and obtain $\hat{F}_L$, which is multiplied with $\hat{F}_H$ to get the block output $F_{out}$. The illustration of the GFF-block can be seen in Fig. 6, and the calculation formula is as follows:

$$F_{out} = \sigma\left(BN(Gap(F_H))\right) \otimes conv3(F_L)$$

where $conv3$ refers to the $3 \times 3$ convolution and $BN$ denotes batch normalization.
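The guided fusion above can be sketched in PyTorch as follows. The channel counts, and the extra $1 \times 1$ projection used to align the guide with the reduced low-level channels, are our assumptions (the paper presumably matches channel numbers by construction):

```python
import torch
import torch.nn as nn

class GFFBlock(nn.Module):
    """Sketch of the global feature fusion block: a channel-wise guide is
    derived from the high-level map F_H and weights the low-level map F_L."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.bn = nn.BatchNorm2d(high_ch)
        self.conv3 = nn.Conv2d(low_ch, out_ch, 3, padding=1)  # reduce F_L channels
        self.proj = nn.Conv2d(high_ch, out_ch, 1)             # align channels (assumption)

    def forward(self, f_high, f_low):
        # guide = sigmoid(BN(Gap(F_H))), a per-channel weight of size C x 1 x 1
        g = torch.mean(f_high, dim=(2, 3), keepdim=True)
        guide = torch.sigmoid(self.proj(self.bn(g)))
        # F_out = guide * conv3(F_L): low-level details weighted by global context
        return self.conv3(f_low) * guide
```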
3.4 Combined Loss Function
The medical image segmentation problem in this paper can be regarded as a pixel classification problem, which determines whether each pixel belongs to the foreground or the background. Binary cross-entropy (BCE) loss is the standard basis for solving binary classification problems. Therefore, we use the BCE loss function to train the network. The formula is as follows:

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $N$ is the number of pixels, $x_i$ refers to the $i$-th pixel of the input image, $y_i$ refers to the true category of $x_i$, and $p_i$ is the predicted probability that $x_i$ belongs to category 1.
However, a model with an excellent segmentation effect requires many training rounds, which easily causes overfitting on a small medical image dataset. To prevent over-fitting, we use the L2 regularization method to reduce over-fitting and improve the recognition ability. The loss function with the regularization term is:

$$L = L_{BCE} + \lambda \sum_{j} w_j^2$$

where $\lambda$ is the weight parameter of the regularization term and $w_j$ are the network weights; we tune $\lambda$ to prevent overfitting in this paper.
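The combined objective can be sketched as follows; the regularization weight `lam` is an assumed placeholder, since the paper's exact value is not recoverable here:

```python
import torch
import torch.nn as nn

def bce_l2_loss(model, logits, targets, lam=1e-4):
    """Sketch of the combined loss: pixel-wise BCE plus an L2 penalty
    on all model weights. lam is an illustrative value."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, targets)
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return bce + lam * l2
```

In practice the same L2 effect is commonly obtained by passing a `weight_decay` value to the optimizer instead of adding the penalty to the loss by hand.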
4 Experimental Results and Analysis
This section evaluates our AF-Net on various medical image segmentation tasks, such as aortic segmentation, lung segmentation, and liver segmentation. The test results are found in Tabs. 1–3. We then perform ablation studies with the test set to examine the performance of various aspects of our AF-Net model.
4.1 Experimental Setup
To increase the contrast of the images and retain detailed information, we utilize an adaptive threshold algorithm to preprocess the original images. Meanwhile, we also adopt test-time augmentation strategies, including horizontal, vertical, and diagonal flips, to improve robustness. All methods use the same design.
All experiments are run on an Intel Core i7-8750H CPU @ 2.20 GHz and a GTX 1060 GPU with 6 GB of memory on the Windows operating system. The PyTorch framework is adopted.
We use mini-batch stochastic gradient descent (SGD) with a batch size of 8, a momentum of 0.9, and weight decay for regularization. Besides, we also use the poly learning rate strategy, in which the initial learning rate is multiplied by $(1 - \frac{iter}{iter_{max}})^{power}$. The maximum number of iterations is 300 in training.
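The poly learning-rate schedule commonly written as $lr = base\_lr \times (1 - iter/iter_{max})^{power}$ can be sketched as a plain function; `power = 0.9` is an assumed value, as the paper's exact exponent is not recoverable here:

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Poly learning-rate policy: decay the base rate by
    (1 - iter/max_iter) ** power. power=0.9 is an assumption."""
    return base_lr * (1 - iteration / max_iter) ** power
```

The rate starts at `base_lr`, decays smoothly, and reaches zero exactly at the final iteration.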
4.2 Experiment on Multiple Datasets
4.2.1 Aortic Segmentation
We evaluate our approach on the aorta dataset, consisting of 297 clinical chest computed tomography (CT) images provided by the Second Xiangya Hospital of Central South University. Under the guidance of experienced cardiologists, we use a professional labeling tool named Labelme to mark the aorta in the CT images at the pixel level. We then randomly select 192 labeled CT pictures, crop them to a fixed size with 3 channels for training, and use the remaining 105 labeled CT pictures as test data. Due to the privacy and confidentiality agreement of the cases, the dataset used in this paper is not publicly accessible.
We select several state-of-the-art methods for comparison, including U-Net, Attention u-net, and Ce-net. As shown in Tab. 1, our approach outperforms all the other methods and achieves the best performance on the aorta dataset. It is worth highlighting that our AF-Net model achieves 2.3% and 2.1% improvements in mIoU and DICE compared with Ce-net.
We further present the visualization results of the above methods on the aorta dataset, as shown in Fig. 7. It can be seen that simple splicing of low-level and high-level features cannot fully restore the original pixel positioning information. The previous methods miss challenging targets, while our AF-Net model can segment the target organ completely and reduce mis-segmentation.
4.2.2 Liver Segmentation
The liver dataset consists of 420 2D images and corresponding category labels, which are divided into 400 training images and 20 testing images. The liver dataset comes from the 2017 CT image segmentation challenge of liver tumor lesion recognition (LiTS) and can be downloaded from the official website of this challenge (https://chaos.grand-challenge.org/Download/).
To verify the effectiveness of the proposed method, we compare our results with various state-of-the-art models. Tab. 2 reports the results of the different models on the liver dataset. Our approach achieves the best performance among all methods in AUC, mIoU, and DICE. Although our network is slightly inferior to Ce-net in PRE, its overall effect is still better.
Fig. 7 shows the visualization of the resulting images. As shown in the figure, tiny tissues are hard to segment, so the existing methods fail to reconstruct enough details and generate abnormal pixels. In contrast, our AF-Net model can recognize several tissues in the liver well and beats all the previous models in edge segmentation.
4.2.3 Lung Segmentation
The lung segmentation dataset comes from the lung structure segmentation task of the lung nodule analysis (LUNA) competition, which provides 190 2D training samples and 77 test samples, with an average resolution of 512 × 512. The lung dataset can be downloaded for free from the official website (https://www.kaggle.com/kmader/finding-lungs-in-ct-data/data/).
For the large organ segmentation, we evaluate our model on the lung dataset, in which the lung tissue accounts for a larger proportion of the total image area. The quantitative results can be viewed in Tab. 3. We compare the performance of our model and the state-of-the-art methods for lung segmentation. Obviously, our method outperforms other conventional methods, demonstrating that our model can capture more useful information and features.
4.3 Ablation Study
4.3.1 Ablation study for DA-block
To prove the effectiveness of the DA-block proposed in this paper, we conduct experiments to analyze the improvement of each component. Tab. 4 shows the metrics of our method with different components added. The DA-block we propose is based on Res-net; we add a parallel-connected attention module to select and filter features effectively, reducing the interference of useless information on subsequent steps. The metrics in Tab. 4 show that the parallel connection method increases SEN by 2% and DICE by 1% compared with the serial attention module.
4.3.2 Ablation study for MFF-block
As discussed above, to improve the representation ability of our model and decrease computing consumption, we uniformly use dilated convolutions with small kernels and choose different dilation rates for the connections. Subsequently, at the end of each dilated convolution branch, we utilize a $1 \times 1$ convolution to reduce the channels so that the output has the same number of channels as the input feature map. As shown in Tab. 4, DICE and mIoU increase by 1.3% and 1.9%, respectively, after adding the MFF-block. The MFF-block we propose can better retain global and high-level information, which is beneficial for more accurate segmentation.
4.3.3 Ablation study for GFF-block
After the DA-blocks focus the network on the region that includes the aorta, the GFF-blocks further restore the pixel positions more accurately by using the global information of high-level features as a guide. Compared with the method using only the MFF-block, the mIoU of the method using both MFF-block and GFF-block increases to 80.1%, an improvement of 1.4%, as seen in the sixth row of Tab. 4. The fourth column of Fig. 8 shows the visualization of the resulting images. As shown in the figure, the images obtained by our method are more precise in detail, and the aortic dissection shows the most prominent segmentation effect.
4.3.4 Comparison of calculation consumption
To show that our model achieves a better segmentation effect without adding too much computational consumption, we compare the computational consumption and FLOPs of the various components. The results are shown in Tab. 5. Our AF-Net model is less computationally expensive than all the other methods and achieves better segmentation effects under the same input size.
5 Conclusion
In this work, we present an AF-Net model to segment medical images based on deep learning. More precisely, the attention module is designed with parallel branches to filter out more useful characteristics for backward propagation. Feature fusion enables our model to obtain deeper, richer, and more comprehensive global information. Experimental results demonstrate that our AF-Net model outperforms existing state-of-the-art medical image segmentation methods on the aortic, lung, and liver datasets.
Acknowledgement: The authors would like to thank the support of Central South University of Forestry & Technology and the support of the National Natural Science Fund of China.
Funding Statement: This work was supported in part by the National Natural Science Foundation of China under Grant 61772561, author J. Q, http://www.nsfc.gov.cn/; in part by the Science Research Projects of Hunan Provincial Education Department under Grant 18A174, author X. X, http://kxjsc.gov.hnedu.cn/; in part by the Science Research Projects of Hunan Provincial Education Department under Grant 19B584, author Y. T, http://kxjsc.gov.hnedu.cn/; in part by the Natural Science Foundation of Hunan Province (No.2020JJ4140), author Y. T, http://kjt.hunan.gov.cn/; and in part by the Natural Science Foundation of Hunan Province (No. 2020JJ4141), author X. X, http://kjt.hunan.gov.cn/; in part by the Key Research and Development Plan of Hunan Province under Grant 2019SK2022, author Y. T, http://kjt.hunan.gov.cn/; in part by the Key Research and Development Plan of Hunan Province under Grant CX20200730, author G. H, http://kjt.hunan.gov.cn/; in part by the Graduate Science and Technology Innovation Fund Project of Central South University of Forestry and Technology under Grant CX20202038, author G.H, http://jwc.csuft.edu.cn/.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|