MRI Image Segmentation of Nasopharyngeal Carcinoma Using Multi-Scale Cascaded Fully Convolutional Network

Nasopharyngeal carcinoma (NPC) is one of the most common malignant tumors of the head and neck, and among head and neck tumors its incidence is the highest worldwide. Intensity-modulated radiotherapy supported by computer-aided diagnosis is the preferred technique for the treatment of NPC. The key step of radiotherapy is the delineation of the target areas and organs at risk, that is, tumor image segmentation. We propose a segmentation method for NPC images based on a multi-scale cascaded fully convolutional network. It uses a cascaded network and multi-scale features for coarse-to-fine segmentation to improve the segmentation result. In the coarse segmentation stage, image blocks and data augmentation compensate for the shortage of training samples. In the fine segmentation stage, Atrous Spatial Pyramid Pooling (ASPP), added to the Dense blocks of DenseNet, enlarges the receptive field and improves feature transfer. During upsampling, features from multiple views are fused to reduce false positives. Additionally, to mitigate the class imbalance problem, the loss function is a Focal Loss weighted by tumor voxel distance, which reduces the weight of background-category samples. The cascaded network alleviates the gradient vanishing problem and yields smoother boundaries. The experimental results were quantitatively analyzed with DSC, ASSD and F1-score, and they show that the proposed method is effective for nasopharyngeal carcinoma segmentation compared with the other methods evaluated in this paper.


Introduction
Nasopharyngeal carcinoma (NPC) is a tumor occurring at the top and lateral walls of the nasopharynx. It is one of the most common malignant tumors of the head and neck, with the highest incidence among them worldwide [1]. Intensity-Modulated Radiation Therapy (IMRT) has been proved to be the most effective technique for treating NPC. Due to the heterogeneity of tumor shape between patients and the ambiguity of the tumor-normal tissue interface, automatically segmenting the radiotherapy target area of nasopharyngeal carcinoma with deep learning remains a challenge. Xue et al. [15] used the Deeplabv3+ convolutional neural network to perform end-to-end automatic segmentation of CT images for 150 patients with nasopharyngeal carcinoma, delineating the GTVp contour of the primary tumor radiotherapy target on CT images. However, the GTVp target area in radiotherapy for nasopharyngeal carcinoma lacks soft-tissue contrast on CT images, and the interface between tumor and normal tissue is very blurred, so target segmentation based on CT images is a challenging task. Lin et al. [16] used a 3D CNN to extract information from four MRI sequences and used the AI prediction results to assist 8 experts in delineation. They used DSC and ASSD to evaluate the delineation of tumors at different stages and in different cross-sections. With a large sample size and comprehensive testing, it was the first study of AI for target delineation in full-stage nasopharyngeal carcinoma radiotherapy. Chen et al. [17] proposed a Multi-modal MRI Fusion Network (MMFNet) for nasopharyngeal carcinoma segmentation. They fused features from the T1, T2 and contrast-enhanced T1 MRI sequences, combining 3D convolutional blocks (3D-CPAM) and residual fusion blocks (RFblocks) into fusion blocks. By enhancing informative features and reweighting features, the segmentation network can effectively segment NPC by fully mining the information in multimodal MRI images. Diao et al. [18] used Inception-V3 with a transfer learning strategy to segment nasopharyngeal carcinoma. Three pathologists with different levels of experience were invited to diagnose the panoramic pathological images in their test set. For the diagnoses of the model and the doctors, AUC was used to evaluate diagnostic performance, and the Jaccard coefficient, Euclidean distance and Kappa coefficient were used to evaluate diagnostic consistency.
It has been shown that feature extraction is closely related to segmentation accuracy, and fully convolutional networks can extract semantic features at the pixel level, which makes pixel localization in medical images more accurate and the segmentation more precise. Both 3D U-Net and DenseNet [19] are fully convolutional networks with strong feature extraction capability and can be used for medical image segmentation.

Cascaded Fully Convolutional Network
Cascaded networks were first used for face detection. The principle is to use cascaded classifiers to remove most of the background, and to perform sample mining and joint training on features at different cascade stages to complete boundary regression and face classification [20]. In this paper, two networks were used to segment NPC images. In the first network, a 3D U-Net performed coarse segmentation to obtain tumor contours. To make full use of the contextual information between the slices of the 3D MRI volumes and to mitigate the shortage of training samples, the MRI data of the training and test sets were partitioned into image blocks. The results of the first network then served as input to the second network, in which a DenseNet with dense connections achieved fine segmentation. Atrous convolution [21] injects holes into the convolution kernel to enlarge the receptive field, so it was added to the Dense blocks of DenseNet for multi-scale feature extraction and better classification accuracy. Finally, the segmentation result was obtained by probability fusion of the coarse and fine segmentations. In addition, to address the class imbalance problem, the loss function was a Focal Loss weighted by tumor voxel distance, which reduces the weight of background-category samples.
In this paper, the cascaded segmentation network is named CSCN; the first-layer network is called BUNet and the second-layer network ACDNet. Fig. 1 illustrates the framework of the proposed segmentation method. BUNet uses 3D U-Net as the backbone and augments the data as image blocks. ACDNet uses DenseNet as the backbone and adds atrous convolution to the dense blocks.
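To make the cascade concrete, the following is a minimal PyTorch sketch of the inference pipeline, assuming `bunet` and `acdnet` stand for trained coarse and fine networks (ACDNet here takes the image plus the coarse probability map as a two-channel input). Fusing the two probability maps by simple averaging is an illustrative choice, as the paper does not spell out the exact fusion rule.

```python
import torch

@torch.no_grad()
def cascade_segment(bunet, acdnet, volume):
    """volume: (1, 1, D, H, W) float tensor of a pre-processed MRI block."""
    coarse_prob = torch.softmax(bunet(volume), dim=1)        # coarse tumor probability map
    # Feed the image together with the coarse tumor probability to the fine network.
    fine_in = torch.cat([volume, coarse_prob[:, 1:2]], dim=1)
    fine_prob = torch.softmax(acdnet(fine_in), dim=1)
    fused = 0.5 * (coarse_prob + fine_prob)                  # probability fusion (assumed average)
    return fused.argmax(dim=1)                               # final label map
```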

Multi-Scale Feature for ACDNet
Multi-scale features have been widely used in image classification and segmentation. Their function is to fuse multi-scale information, enlarge the network's receptive field and reduce computation [22]. Especially in medical images with imbalanced categories, fusing multi-scale features is an important way to improve segmentation performance. DenseNet uses cross-layer connections to make full use of features, and uses dense blocks between layers to enhance feature propagation and feature reuse, which helps alleviate the gradient vanishing problem. Atrous Spatial Pyramid Pooling (ASPP) [23] was proposed in Deeplab v3 [21]. Its principle is to capture multi-scale information with different dilation (hole) rates, each scale forming an independent branch. The network combines the features of the different scales and adds a convolutional layer to output prediction labels.
In this work, the second network added an ASPP module before the bottleneck layer of each block to help extract multi-scale feature information. The bottleneck layer is a 1 × 1 convolutional layer used for feature compression. During training, each block acted as a small network, with a convolutional layer, a BN layer and a ReLU layer. The structure of every block was identical, and dense connections between blocks allowed them to be stacked, which made the network structure easier to adjust [24]. The structure of ACDNet is shown in Fig. 2.
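A minimal PyTorch sketch of one dense layer with an ASPP module placed before the 1 × 1 × 1 bottleneck, as described above; the dilation rates, channel sizes and growth rate are illustrative assumptions, not the values used in ACDNet.

```python
import torch
import torch.nn as nn

class ASPP3D(nn.Module):
    """Parallel 3D atrous convolutions at several dilation rates,
    concatenated along the channel axis (rates are illustrative)."""
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))
            for r in rates])

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)

class DenseLayerWithASPP(nn.Module):
    """One layer of a Dense block: ASPP before the 1x1x1 bottleneck,
    output concatenated with the input (dense connectivity)."""
    def __init__(self, in_ch, growth=16):
        super().__init__()
        self.aspp = ASPP3D(in_ch, in_ch // 2)
        self.bottleneck = nn.Sequential(
            nn.BatchNorm3d(3 * (in_ch // 2)), nn.ReLU(inplace=True),
            nn.Conv3d(3 * (in_ch // 2), growth, kernel_size=1, bias=False))

    def forward(self, x):
        return torch.cat([x, self.bottleneck(self.aspp(x))], dim=1)
```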
The deeper the network, the more likely the gradient vanishing problem is to occur. With dense connectivity, each layer is directly connected to the input and to the loss, which alleviates gradient vanishing and enhances feature reuse, thus improving segmentation accuracy.

Loss Metrics
Medical images suffer from a serious class imbalance problem [25]. In the NPC segmentation task, background voxels greatly outnumber tumor voxels, so the network tends to predict background and the target is incompletely predicted. Therefore, the Cross Entropy (CE) loss must be modified to reduce the weight of background samples. Weighted cross entropy assigns weights to different categories so that the network attends to the categories with fewer samples. A coefficient describes the importance of samples in the loss function: for a small number of samples, their contribution to the loss should be enhanced, while for a large number of samples it should be reduced.
In general, the "hard" samples are distributed along the segmentation boundary, with a probability of about 0.5 in the probability response map. Focal Loss [26] is a weighted cross entropy. It reduces the weight of the many easy negative samples during training, addresses class imbalance, and helps mine "hard" samples.
In the binary classification task, the cross entropy loss is defined as:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & \text{if } y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \tag{1}$$

The predicted probability of the true class $t$ can be expressed as:

$$p_t = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{2}$$

Substituting Eq. (2) into Eq. (1), we get:

$$\mathrm{CE}(p_t) = -\log(p_t) \tag{3}$$

When many easy samples are added together, their small loss values can dominate the rare categories. Therefore, a weight parameter $\alpha \in [0, 1]$ is used to adjust the CE value when the categories are imbalanced:

$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t) \tag{4}$$

In Eq. (4), a weight $\alpha_t$ is applied when the loss is computed for each category; giving higher weight to categories with few samples improves the network's predictions. However, Eq. (4) only addresses the weight distribution across categories. For "hard" samples with a probability of about 0.5, the network still cannot segment these voxels. Focal loss uses the modulation factor $(1 - p_t)^\gamma$ to address hard sample mining:

$$\mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t) \tag{5}$$

where $\gamma > 0$ is a tunable parameter. When a sample is predicted incorrectly, the probability $p_t$ is very small and the modulation factor $(1 - p_t)^\gamma$ is close to 1, so the sample's contribution to the loss is not penalized. Conversely, if the sample is predicted correctly and $p_t$ is close to 1, the weight of this well-classified sample is lowered and its contribution to the loss is small.
The class imbalance and "hard" sample problems are both prominent in medical images; combining Eqs. (4) and (5) yields the final focal loss:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) \tag{6}$$

In this paper, the loss function is the Focal Loss weighted by tumor voxel distance. The distance from each voxel to the tumor boundary was taken as the weight parameter $\alpha_t$, so each voxel had an independent weight in the loss function. In the experiment, a distance map was computed from the ground-truth image, in which each voxel stored its distance to the tumor boundary. The weighting only affected tumor voxels, not non-tumor voxels. The distance weight can be written as:

$$\alpha_t(v) = \begin{cases} \max(d(v), 1), & v \text{ inside the tumor} \\ 1, & v \text{ outside the tumor} \end{cases} \tag{7}$$

where $d(v)$ is the distance from voxel $v$ to the tumor boundary. Inside the tumor, the minimum distance between a voxel and the tumor boundary is not less than 1, while for voxels outside the tumor area the weight is always 1, so the distance weights on the two sides of the tumor boundary are very similar. For voxels at the tumor boundary, the penalty for predicting a positive sample would otherwise be much greater than for predicting a negative one, and the network would avoid this penalty by predicting nothing, which reduces the total loss. Inside the tumor, predicting a tumor voxel as non-tumor produces a greater penalty. The distance weight matrix thus penalizes the model's errors at the boundary, aiming to improve the extraction of the tumor boundary through the loss penalty.
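A minimal sketch of the distance-weighted focal loss in PyTorch, with the distance map computed by SciPy's Euclidean distance transform. The choice of $\gamma = 2$ follows the original Focal Loss paper [26]; the value used in this work is not stated here, so it is an assumption.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy import ndimage

def distance_weights(label):
    """Per-voxel weight map (Eq. (7)): distance to the tumor boundary
    inside the tumor (at least 1), constant 1 outside."""
    d = ndimage.distance_transform_edt(label)        # 0 everywhere outside the tumor
    w = np.ones_like(d)
    w[label > 0] = np.maximum(d[label > 0], 1.0)
    return w

def distance_weighted_focal_loss(logits, target, weights, gamma=2.0):
    """Focal loss (Eq. (6)) with the distance map as the per-voxel alpha_t.

    logits:  (N, C, D, H, W) raw network outputs
    target:  (N, D, H, W) integer labels
    weights: (N, D, H, W) distance weights from distance_weights()
    """
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log p_t per voxel
    pt = log_pt.exp()
    return (-weights * (1.0 - pt) ** gamma * log_pt).mean()
```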

Dataset and Pre-Processing
The experiments were conducted on three-dimensional MRI images of 120 NPC patients from the same hospital, scanned with the T1 High Resolution Isotropic Volume Examination sequence (THRIVE [27]), which yields clearer tumor images than other MRI sequences. The images have a voxel size of 0.6 × 0.6 × 3.0 mm³. The original images required pre-processing. Firstly, we cropped each original image to retain only the head region from the neck upward, because the acquisition range of the original image was large while the position of the nasopharyngeal tumor in the image is relatively fixed. Secondly, the images were resampled to a voxel size of 1 × 1 × 1 mm³. Thirdly, we applied up-down jitter cropping in the Z direction and used a greedy algorithm to remove the black regions in the X and Y directions, giving a final image size of 160 × 198 × 103. Finally, to make full use of the limited dataset, image blocks were sampled around randomly selected centers in each image, and horizontal flipping was used for data augmentation. Because of the random cropping, when enough image blocks are generated, the neighborhood of every pixel, including those at the image boundary, may be selected multiple times, which benefits network training. Fig. 3 shows three views of the NPC images after cropping.
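For illustration, a minimal sketch of the isotropic resampling step with SimpleITK; the toolkit and the linear interpolator are our assumptions, as the paper does not name its implementation.

```python
import SimpleITK as sitk

def resample_isotropic(image, new_spacing=(1.0, 1.0, 1.0)):
    """Resample an MRI volume to 1 x 1 x 1 mm voxels with linear interpolation."""
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    new_size = [int(round(osz * osp / nsp))
                for osz, osp, nsp in zip(old_size, old_spacing, new_spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(),
                         sitk.sitkLinear, image.GetOrigin(), new_spacing,
                         image.GetDirection(), 0.0, image.GetPixelID())

# Usage (hypothetical file name):
# iso = resample_isotropic(sitk.ReadImage("patient_001.nii.gz"))
```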

Training Details
In the experiment, four automatic segmentation networks were compared: CNN, ACDNet, DeepLab and V-Net [28], all commonly used automatic segmentation networks. To obtain sufficient training samples, the images were sampled into image blocks and fed to the networks for training. The image blocks, of size 24 × 24 × 8, were extracted as the training set by a sliding window along the axial, coronal and sagittal directions of the 3D MRI data.
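A minimal NumPy sketch of the sliding-window block extraction for one direction; the stride values are illustrative, as the paper does not state them, and the last axis is assumed to be the slice direction.

```python
import numpy as np

def sliding_window_blocks(volume, patch_size=(24, 24, 8), stride=(12, 12, 4)):
    """Extract overlapping 24 x 24 x 8 image blocks from a 3D volume."""
    ph, pw, pd = patch_size
    sh, sw, sd = stride
    H, W, D = volume.shape
    blocks = []
    for y in range(0, H - ph + 1, sh):
        for x in range(0, W - pw + 1, sw):
            for z in range(0, D - pd + 1, sd):
                blocks.append(volume[y:y + ph, x:x + pw, z:z + pd])
    return np.stack(blocks)
```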
Cross-validation was performed 5 times in the experiment. In each fold, we randomly selected 24 patients' images as the test set, 9 as the validation set and 87 as the training set; predictions for all 120 patients were obtained after the 5 trainings. Fig. 4 shows the training accuracy curve, and Fig. 5 shows the loss function with distance weighting. The networks were trained with the Adam optimizer and an initial learning rate of 0.001. In all experiments, the number of training iterations was 50,000, and the learning rate decayed exponentially by a factor of 0.9 every 500 iterations. All network structures used Softmax as the output activation to produce the probability map of the final segmentation. Fig. 6 shows the learning rate curve.
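The optimizer and learning rate schedule can be reproduced in a few lines of PyTorch; the placeholder model and dummy batch below are for illustration only.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(1, 2, kernel_size=3, padding=1)    # placeholder for BUNet/ACDNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# lr = 0.001 * 0.9 ** (iteration // 500): decay by 0.9 every 500 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.9)

for iteration in range(50000):
    optimizer.zero_grad()
    blocks = torch.randn(4, 1, 8, 24, 24)            # dummy batch of image blocks
    loss = model(blocks).square().mean()             # stand-in for the focal loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```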

Evaluation Metrics
There were three quantitative indicators to evaluate the segmentation performance of the network, including Dice Similarity Coefficient (DSC), Average Symmetric Surface Distance (ASSD) and F1-score.
DSC measures the similarity between the segmentation result and the ground truth [29]. For a manually labeled tumor segmentation $X$ and a network prediction $Y$, DSC is defined as:

$$\mathrm{DSC}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where $|X \cap Y|$ is the size of the intersection between the labeled and predicted results, and $|X|$ and $|Y|$ are the numbers of elements of $X$ and $Y$. DSC ranges over [0, 1]; the larger the DSC, the higher the similarity between the network segmentation result and the real result.
The ASSD index represents the average surface distance between the network prediction and the manual labeling:

$$\mathrm{ASSD}(G, P) = \frac{1}{|G| + |P|} \left( \sum_{g \in G} \min_{p \in P} d(p, g) + \sum_{p \in P} \min_{g \in G} d(p, g) \right)$$

where $G$ and $P$ are the surface voxels of the ground truth and the network prediction respectively, and $d(p, g)$ is the Euclidean distance between $p$ and $g$.
F1-score was used to quantitatively evaluate the accuracy of the segmentation. It can be regarded as a weighted average of precision and recall:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

with

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP is the number of true positives, i.e., samples whose actual and predicted values are both positive; FP is the number of false positives, i.e., samples that are actually negative but predicted positive; and FN is the number of false negatives, i.e., samples that are actually positive but predicted negative. Both FP and FN are incorrect predictions. Precision reflects the model's ability to distinguish negative samples: the higher the precision, the better negative samples are rejected. Recall reflects the model's ability to identify positive samples: the higher the recall, the more positive samples are found. F1-score considers both the precision and the recall of the classification model; the larger the F1-score, the better the segmentation and the more robust the model.
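For reference, a minimal NumPy/SciPy sketch of the three metrics on boolean masks; extracting the surface as the mask minus its morphological erosion is a common convention, not necessarily the one used in the paper, and both masks are assumed non-empty.

```python
import numpy as np
from scipy import ndimage

def dsc(gt, pred):
    """Dice similarity coefficient of two boolean masks."""
    inter = np.logical_and(gt, pred).sum()
    return 2.0 * inter / (gt.sum() + pred.sum())

def surface_voxels(mask):
    """Boundary voxels: the mask minus its binary erosion."""
    return mask & ~ndimage.binary_erosion(mask)

def assd(gt, pred):
    """Average symmetric surface distance in voxel units
    (multiply by the spacing for millimetres)."""
    gs, ps = surface_voxels(gt), surface_voxels(pred)
    # Distance of every voxel to the nearest surface voxel of the other set.
    d_to_p = ndimage.distance_transform_edt(~ps)
    d_to_g = ndimage.distance_transform_edt(~gs)
    return (d_to_p[gs].sum() + d_to_g[ps].sum()) / (gs.sum() + ps.sum())

def f1_score(gt, pred):
    """F1-score from voxel-wise TP, FP and FN counts."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```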

Results
The experimental results were evaluated both qualitatively and quantitatively. Fig. 7 shows the two-dimensional segmentation results of the different network structures on the same dataset; each row is the segmentation result of one patient, with 3 patients shown in total. Fig. 8 shows the 3D segmentation results of the different network structures on the same dataset, with coronal, sagittal and axial views for the same patient. The figures show that V-Net performs poorly: its result differs considerably from the ground truth, making it unsuitable for NPC segmentation. The sagittal results of the various methods are good and close to the ground truth; in the coronal and axial views, the method presented in this paper performs best.
Tab. 1 shows the quantitative evaluation of the various methods in terms of DSC, ASSD and F1-score. CSCN achieves the best values. Among the compared networks, V-Net consistently failed to converge, so no prediction could be obtained from it. To put the algorithm's performance in context, we compared the annotations of two physicians: each annotated the MRI images of the same 28 patients, with one annotation treated as the ground truth and the other as the prediction. The DSC, ASSD and F1-score of this inter-observer comparison were 0.642, 2.692 mm and 0.686 respectively.

Discussion and Conclusion
On the whole, the proposed network is superior to the compared networks in both segmentation results and network performance. However, the DSC values between the manual labels and the predicted segmentations, as well as between the two physicians' annotations, are not very good. Our analysis suggests that the complex anatomical structure of the nasopharynx and the slight surface differences caused by tumors with special shapes have a large impact on the DSC indicator.
In this work, we proposed a coarse-to-fine cascaded fully convolutional segmentation method. The algorithm uses a cascaded network and multi-scale feature skip connections to improve segmentation. In the first network, random sampling along the axial, coronal and sagittal directions alleviated the class imbalance and the shortage of training samples, producing a coarse tumor probability map that served as the input of the second network. In the second network, Atrous Spatial Pyramid Pooling replaced the convolution and pooling layers in the Dense blocks of DenseNet, so that multi-scale features were extracted for voxel-level fine segmentation of the MRI images. The cascaded network alleviates the gradient vanishing problem and yields a smoother boundary.