Skin cancer is one of the most common cancers in the world, and melanoma is the deadliest form of skin cancer. Automated skin lesion classification in dermoscopy images has recently become a challenging research topic of great interest because it offers an essential way to improve diagnostic performance and thus reduce melanoma deaths. Among the variety of supervised classification techniques, convolutional neural networks (CNNs) are at the heart of this promising performance. However, these successes rely heavily on large amounts of class-balanced, cleanly labeled samples, which are expensive to obtain for skin lesion classification in the real world. To address this issue, we propose a mixed re-sampled (MRS) class-imbalanced semi-supervised learning method for skin lesion classification, which consists of two phases: re-sampling and multiple mixing methods. To counter the class-imbalance problem, we propose a re-sampling method for semi-supervised learning and introduce focal loss into the semi-supervised objective to improve classification performance. To make full use of the unlabeled data, Fmix and Mixup are used to mix labeled data with pseudo-labeled unlabeled data. Experiments on class-imbalanced datasets demonstrate the effectiveness of the proposed method compared with other state-of-the-art semi-supervised methods.
Skin cancer is one of the major types of cancer, with an increasing incidence over the past decades and over 5 million newly diagnosed cases every year [
Dermoscopy [
Recently, convolutional neural networks (CNNs), which are trained end-to-end, have been widely used and have achieved remarkable success in a variety of visual recognition tasks [
To alleviate this annotation burden, some semi-supervised learning algorithms have been proposed to improve the performance of models by utilizing the information contained in unlabeled data [
In this paper, we propose a mixed re-sampled (MRS) class-imbalanced semi-supervised learning method for skin lesion classification. Mixed sample data augmentation was originally proposed to improve the performance of classification tasks and has obtained state-of-the-art results in multiple supervised classification tasks. ICT [
Hence, in our work, a new training procedure is introduced to improve the performance of semi-supervised learning on a class-imbalanced dataset. First, in each training batch, the labeled data are re-sampled so that the model initially sees a uniform class distribution and learns general knowledge across all classes. Then, the labeled data are mixed with the pseudo-labeled unlabeled data by Mixup [
The main contributions of this paper are summarized as follows: We define a class-imbalanced semi-supervised skin lesion classification task that reflects a more realistic situation and propose a method to solve it. We introduce a re-sampling strategy into class-imbalanced semi-supervised learning, which improves the classification performance of semi-supervised learning on class-imbalanced data. Based on mixed sample data augmentation, we use Mixup and Fmix to mix the labeled data with pseudo-labeled unlabeled data, further improving the generalization performance of the semi-supervised model. The proposed class-imbalanced semi-supervised learning method adopts an end-to-end learning style and achieves state-of-the-art results on the ISIC-skin 2019 dataset.
The rest of the article is organized as follows: Section 2 details the proposed method. Experimental results on the open dataset, together with a discussion, are given in Section 3. Finally, Section 4 gives the conclusion.
In this section, we introduce our proposed MRS method, which consists of a re-sampling strategy to balance the class-imbalanced data and a mixed sample data augmentation strategy that mixes labeled data with pseudo-labeled unlabeled data to improve the model's performance in skin lesion classification. An overview of MRS is presented in
In general, we have a small class-imbalanced labeled data set
The main steps are as follows. In the first step, the labeled samples are re-sampled to ensure that the labeled samples fed to the model at the beginning of training are class-balanced; details of sampling the labeled data are presented in Section 2.2. At the same time, the same number of unlabeled samples is drawn at random. Two rounds of stochastic data augmentation are then applied to both the labeled and unlabeled samples. Assuming that
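For illustration, the following sketch shows one way to implement the pseudo-label guessing that follows this batch construction: the predictions over the two augmented copies of the unlabeled batch are averaged and then sharpened, in the spirit of MixMatch. The function name and the `temperature` value are illustrative assumptions rather than the exact implementation used here.

```python
import torch

def guess_pseudo_labels(model, u_aug1, u_aug2, temperature=0.5):
    """Average predictions over two augmented copies of an unlabeled batch
    and sharpen the result to obtain soft pseudo-labels (MixMatch-style)."""
    with torch.no_grad():  # pseudo-labels are treated as fixed targets
        p1 = torch.softmax(model(u_aug1), dim=1)
        p2 = torch.softmax(model(u_aug2), dim=1)
        p = (p1 + p2) / 2
        p = p ** (1.0 / temperature)          # temperature sharpening
        return p / p.sum(dim=1, keepdim=True)
```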
The number of samples in different categories of the labeled data set is usually different. Generally, a category with a larger number of samples is referred to as a major class, and a category with a smaller number of samples as a minor class. This class-imbalance phenomenon also exists in the field of skin lesion classification. To address the class-imbalance problem in semi-supervised learning for skin lesion classification, we introduce a novel re-sampled data training (RDT) strategy for model training. Unlike other re-sampling-based methods, in which the majority classes are down-sampled or the minority classes are over-sampled to obtain a uniform distribution, RDT avoids the deficiency of under-sampling methods, which discard many examples of the majority classes, as well as the tendency of over-sampling methods to cause overfitting.
In RDT, the model is initially trained on class-balanced labeled data, which is achieved by strictly controlling the number of samples of each category in each batch fed to the model. In other words, the same number of samples is taken from each category to form a batch for training. Then, as training progresses, the class ratio of each batch is gradually changed: the proportion of the major classes increases and that of the minor classes decreases. In this way there is no need to down-sample the major classes, although the minor classes may still face a risk of overfitting.
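The exact schedule is not reproduced here; the sketch below shows one plausible (assumed) linear-interpolation form of the RDT idea, in which the per-batch class counts move from a uniform distribution at the start of training toward the empirical, imbalanced distribution at the end. The function name `batch_class_counts` is hypothetical.

```python
import numpy as np

def batch_class_counts(class_freqs, batch_size, progress):
    """Per-batch class counts interpolating from a uniform distribution
    (progress = 0, start of training) toward the empirical, imbalanced
    distribution (progress = 1, end of training).

    class_freqs: empirical class frequencies of the labeled set (sum to 1).
    progress:    training progress in [0, 1], e.g. current_epoch / total_epochs.
    """
    n_classes = len(class_freqs)
    uniform = np.full(n_classes, 1.0 / n_classes)
    mixed = (1.0 - progress) * uniform + progress * np.asarray(class_freqs)
    counts = np.maximum(np.round(mixed * batch_size).astype(int), 1)
    return counts  # rounding means counts may not sum exactly to batch_size

# Example with 8 classes and a batch size of 8: early batches take one image per class.
print(batch_class_counts([0.5, 0.2, 0.1, 0.08, 0.05, 0.04, 0.02, 0.01], 8, 0.0))
```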
To reduce the overfitting of the minor classes, we adopt the following three strategies. First, RandAugment, which is based on AutoAugment, is used to augment the training data. AutoAugment uses reinforcement learning to learn an augmentation policy built from Python Imaging Library transformations, which requires a large number of labeled images; we do not have enough data to learn such a policy for the skin lesion classification task. As a result, RandAugment, a variant of AutoAugment that does not require the augmentation policy to be learned ahead of time from labeled data, is adopted to alleviate the overfitting of the minor classes in our task. At the end of each augmentation pass, the Cutout strategy is also applied to further improve the augmentation effect.
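A possible torchvision-based realization of this augmentation pipeline is sketched below (it assumes torchvision >= 0.11 for `RandAugment`). The crop size, RandAugment parameters, and normalization statistics are illustrative assumptions, and torchvision's `RandomErasing` is used as a Cutout-style stand-in rather than the original Cutout implementation.

```python
from torchvision import transforms

# Assumed values throughout; RandomErasing serves as a Cutout-style occlusion.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # no policy search required
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.1), value=0),
])
```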
Second, to further prevent overfitting on the minor classes, we introduce the focal loss [
which is computed as $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, where $p_t$ denotes the model's estimated probability for the ground-truth class, $\gamma \ge 0$ is the focusing parameter that down-weights well-classified examples, and $\alpha_t$ is a class-balancing weight.
In our loss term, we introduce the focal loss into the standard semi-supervised loss function. The focal semi-supervised learning loss takes the usual composite form $\mathcal{L} = \mathcal{L}_{\mathrm{FL}} + \lambda_u \mathcal{L}_u$, where $\mathcal{L}_{\mathrm{FL}}$ is the focal loss computed on the labeled (and mixed) samples, $\mathcal{L}_u$ is the consistency loss between the model predictions and the pseudo-labels of the unlabeled (and mixed) samples, and $\lambda_u$ is a hyperparameter weighting the unlabeled term.
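For clarity, a minimal PyTorch sketch of a multi-class focal loss that could serve as the supervised term is given below. The `gamma` and `alpha` defaults, and the way the focal term is combined with the unlabeled consistency term in the final comment, are assumptions for illustration rather than the exact settings of this work.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t of the true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:                  # optional per-class balancing weights
        loss = alpha[targets] * loss
    return loss.mean()

# Assumed composite objective: focal loss on labeled (mixed) samples plus a
# weighted consistency term on pseudo-labeled unlabeled (mixed) samples, e.g.
#   total_loss = focal_loss(logits_x, y_x) + lambda_u * F.mse_loss(probs_u, pseudo_u)
```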
The last, but equally important, strategy is mixed sample data augmentation, which is described in detail in Section 2.3.
To further improve the performance of class-imbalanced semi-supervised learning for skin lesion classification, a training strategy based on mixed sample data augmentation (MSDA) is integrated into the semi-supervised framework. Recently, a plethora of MSDA approaches have been proposed and have obtained state-of-the-art results in supervised classification tasks. One of the most popular methods is Mixup, proposed by Zhang et al. [
which constructs virtual training samples as $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ and $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$, where $(x_i, y_i)$ and $(x_j, y_j)$ are two samples drawn from the training data and the mixing coefficient $\lambda \in [0, 1]$ is sampled from a Beta$(\alpha, \alpha)$ distribution.
In our method, one augmented copy of the labeled samples, together with half of all augmentations of the unlabeled samples and their pseudo-labels, is collected into a single group, and Mixup is applied within this group. For the other labeled data, Fmix is used: two images are combined through a binary mask $m$ obtained by thresholding a low-frequency grey-scale image sampled from Fourier space, i.e. $\tilde{x} = m \odot x_i + (1 - m) \odot x_j$ with mixing proportion $\lambda$, where $\odot$ denotes element-wise multiplication. These labeled samples and the other half of the augmentations of the unlabeled samples with their pseudo-labels form a second group, which is mixed using Fmix.
We provide some example Mixup and Fmix images for skin lesions in
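A minimal sketch of the two mixing operations is given below. The `alpha` and `decay_power` values and the simplified Fourier-mask construction are assumptions; the reference Fmix implementation differs in detail.

```python
import numpy as np
import torch

def mixup(x1, y1, x2, y2, alpha=0.75):
    """Mixup: convex combination of two batches and their (soft) label distributions."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)            # keep the first (labeled) batch dominant
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

def fmix_mask(shape, decay_power=3.0, lam=0.5):
    """Simplified Fmix-style mask: threshold a low-frequency image obtained by
    filtering Fourier-space noise, so that roughly a fraction `lam` of pixels is 1."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    freq = np.sqrt(fy ** 2 + fx ** 2)
    spectrum = np.random.randn(h, w) + 1j * np.random.randn(h, w)
    spectrum /= np.maximum(freq, 1.0 / max(h, w)) ** decay_power
    low_freq = np.real(np.fft.ifft2(spectrum))
    threshold = np.quantile(low_freq, 1.0 - lam)
    return torch.from_numpy((low_freq > threshold).astype(np.float32))

# Usage sketch: mask = fmix_mask((224, 224), lam=0.6)
#               x = mask * x1 + (1 - mask) * x2;  y = 0.6 * y1 + 0.4 * y2
```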
To demonstrate the effectiveness of MRS for automatic skin lesion classification, we perform our experiments on the International Skin Imaging Collaboration 2019 skin lesion classification (ISIC-skin 2019) dataset, which is the largest publicly available skin dermoscopy image dataset. We first introduce the training details and the ISIC-skin 2019 dataset, then conduct semi-supervised learning experiments with part of the labeled training data. Finally, the proposed method is compared and discussed against several state-of-the-art semi-supervised learning methods.
Unless otherwise stated, in all our experiments, we use the “ResNeXt-101-32x8d” architecture in Xie et al. [
During the training phase, we set the batch size to 8 and the training epoch to
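A minimal model-setup sketch consistent with this configuration is shown below; the ImageNet pre-training and the optimizer hyperparameters are assumptions, not necessarily the settings used in this work.

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNeXt-101 (32x8d) backbone with an 8-way head for the ISIC-2019 categories.
model = models.resnext101_32x8d(pretrained=True)      # ImageNet weights (assumed)
model.fc = nn.Linear(model.fc.in_features, 8)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
```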
We evaluate the proposed method on the ISIC-skin 2019 dataset, which consists of 25331 training images across 8 different categories: melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BKL), dermatofibroma (DF), vascular lesion (VASC), and squamous cell carcinoma (SCC); the distribution of training samples is heavily imbalanced. Since the official test set has no public labels, we take 100 images from each category, 800 in total, as a validation set to verify the effectiveness of the method, and then divide the remaining data into labeled and unlabeled data.
Split | NV | MEL | BCC | BKL | AK | SCC | VASC | DF
---|---|---|---|---|---|---|---|---|
labeled | 2000 | 800 | 400 | 400 | 200 | 200 | 100 | 100 |
unlabeled | 10775 | 3622 | 2823 | 2024 | 597 | 328 | 53 | 39 |
val | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
total | 12875 | 4522 | 3323 | 2524 | 897 | 628 | 253 | 239 |
In particular, in order to fit the model, the images in ISIC-skin 2019 are resized to 256
To quantitatively evaluate the proposed MRS method, we used the sensitivity, specificity, accuracy, area under the receiver operating characteristic curve (AUC), and normalized multi-class accuracy (NMCA) as the performance metrics, which are defined as:
$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$, $\mathrm{Specificity} = \frac{TN}{TN + FP}$, $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where TP, FN, TN, and FP denote the numbers of true positives, false negatives, true negatives, and false positives, respectively, and NMCA is computed as the average of the per-class sensitivities.
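The sketch below shows how these per-class metrics and NMCA can be computed with scikit-learn. It is an illustrative helper under the definitions above, not the authors' evaluation code, and the function name `per_class_metrics` is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, n_classes=8):
    """One-vs-rest sensitivity, specificity and accuracy per class, plus NMCA
    (the mean of the per-class sensitivities)."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    results = {}
    for c in range(n_classes):
        tp = cm[c, c]
        fn = cm[c].sum() - tp
        fp = cm[:, c].sum() - tp
        tn = cm.sum() - tp - fn - fp
        results[c] = {"sensitivity": tp / (tp + fn),
                      "specificity": tn / (tn + fp),
                      "accuracy": (tp + tn) / cm.sum()}
    nmca = float(np.mean([r["sensitivity"] for r in results.values()]))
    return results, nmca

# AUC can be obtained from predicted probabilities with sklearn.metrics.roc_auc_score.
```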
Since MRS is a semi-supervised learning method, we consider three methods, Mean Teacher, ICT, and MixMatch, as baselines for comparison. We also use only the labeled data to perform supervised learning as an additional baseline. To allow these four baseline methods to generalize well on the class-imbalanced distribution, we oversample the labeled data of the minor classes, reimplement each method in the same codebase, and apply them to the same model to ensure a fair comparison. The experimental results are shown in
From
Methods | MEL AUC | MEL ACC | MEL Sensitivity | MEL Specificity | NV AUC | NV ACC | NV Sensitivity | NV Specificity
---|---|---|---|---|---|---|---|---
MeanTeacher | 0.847 | 0.871 | 0.481 | 0.950 | 0.920 | 0.842 | 0.580 | 0.967 |
ICT | 0.861 | 0.875 | 0.425 | 0.966 | 0.925 | 0.870 | 0.718 | 0.942 |
MixMatch | 0.839 | 0.776 | 0.682 | 0.795 | 0.858 | 0.731 | 0.170 | 0.999 |
Supervised | 0.843 | 0.856 | 0.530 | 0.922 | 0.915 | 0.861 | 0.730 | 0.924 |
MRS | 0.889 | 0.878 | 0.642 | 0.926 | 0.929 | 0.878 | 0.793 | 0.918
Methods | BCC AUC | BCC ACC | BCC Sensitivity | BCC Specificity | AK AUC | AK ACC | AK Sensitivity | AK Specificity
---|---|---|---|---|---|---|---|---
MeanTeacher | 0.890 | 0.875 | 0.600 | 0.916 | 0.845 | 0.876 | 0.460 | 0.899 |
ICT | 0.872 | 0.872 | 0.544 | 0.922 | 0.804 | 0.870 | 0.393 | 0.896 |
MixMatch | 0.892 | 0.874 | 0.623 | 0.912 | 0.791 | 0.839 | 0.452 | 0.859 |
Supervised | 0.870 | 0.870 | 0.618 | 0.908 | 0.828 | 0.857 | 0.521 | 0.876 |
MRS | 0.904 | 0.870 | 0.768 | 0.886 | 0.845 | 0.894 | 0.439 | 0.918
Methods | BKL AUC | BKL ACC | BKL Sensitivity | BKL Specificity | DF AUC | DF ACC | DF Sensitivity | DF Specificity
---|---|---|---|---|---|---|---|---
MeanTeacher | 0.748 | 0.914 | 0.250 | 0.977 | 0.954 | 0.974 | 0.433 | 0.981 |
ICT | 0.813 | 0.882 | 0.474 | 0.920 | 0.931 | 0.959 | 0.522 | 0.974 |
MixMatch | 0.782 | 0.841 | 0.430 | 0.880 | 0.946 | 0.977 | 0.589 | 0.982 |
Supervised | 0.769 | 0.877 | 0.380 | 0.924 | 0.917 | 0.973 | 0.478 | 0.979 |
MRS | 0.818 | 0.901 | 0.463 | 0.942 | 0.937 | 0.984 | 0.489 | 0.990
Methods | VASC AUC | VASC ACC | VASC Sensitivity | VASC Specificity | SCC AUC | SCC ACC | SCC Sensitivity | SCC Specificity
---|---|---|---|---|---|---|---|---
MeanTeacher | 0.892 | 0.988 | 0.436 | 0.996 | 0.865 | 0.969 | 0.229 | 0.986 |
ICT | 0.893 | 0.977 | 0.604 | 0.983 | 0.826 | 0.956 | 0.414 | 0.968 |
MixMatch | 0.854 | 0.922 | 0.653 | 0.926 | 0.874 | 0.939 | 0.427 | 0.951 |
Supervised | 0.905 | 0.983 | 0.564 | 0.989 | 0.836 | 0.952 | 0.389 | 0.965 |
MRS | 0.885 | 0.981 | 0.604 | 0.986 | 0.860 | 0.954 | 0.420 | 0.965
Methods | AUC | ACC | Sensitivity | Specificity | NMCA
---|---|---|---|---|---
Mean Teacher | 0.870 | 0.913 | 0.433 | 0.959 | 0.478 |
ICT | 0.862 | 0.901 | 0.511 | 0.935 | 0.499 |
MixMatch | 0.854 | 0.862 | 0.503 | 0.913 | 0.458 |
Supervised | 0.860 | 0.903 | 0.526 | 0.935 | 0.484 |
MRS | 0.883 | 0.917 | 0.577 | 0.941 | 0.528
It is worth noting that in most respects the supervised baseline outperforms the MixMatch and Mean Teacher methods. The reason for this phenomenon is that MixMatch and Mean Teacher are semi-supervised learning methods optimized for uniformly distributed data. When both the labeled and unlabeled data are unevenly distributed, it is difficult for the classifier to extract valid features from the unlabeled data, so its performance cannot be improved by exploiting the distribution of the unlabeled data. However, the performance of ICT is superior to that of the supervised baseline. This is because, compared with MixMatch, ICT uses the unlabeled data only once in a batch, so it has less impact on the distribution of a batch of samples; at the same time, Mixup can fully mix the unlabeled data in ICT. In general, our proposed MRS method mixes labeled and unlabeled data using Mixup and Fmix, which has little effect on the distribution of the re-sampled labeled data. Therefore, the unlabeled data can be fully utilized to improve the performance of the classifier under class imbalance.
In this part, we compare the performance of MRS with the seven top-ranking entries that do not use external data on the ISIC-2019 skin lesion classification challenge leaderboard. The results reported on the ISIC-2019 challenge dataset reflect state-of-the-art performance in the skin lesion classification task.
Since almost all of the seven top-ranking methods on the ISIC-2019 skin lesion classification challenge leaderboard use model ensembles to obtain better generalization performance, in this experiment we selected part of the data as the labeled (supervised) data and trained two independent ResNeXt models.
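One common way to combine the two trained models, assumed here for illustration, is to average their softmax outputs at test time:

```python
import torch

def ensemble_predict(models, images):
    """Average the softmax outputs of independently trained models."""
    probs = [torch.softmax(m(images), dim=1) for m in models]
    return torch.stack(probs, dim=0).mean(dim=0)
```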
Split | NV | MEL | BCC | BKL | AK | SCC | VASC | DF
---|---|---|---|---|---|---|---|---|
labeled | 400 | 300 | 300 | 200 | 200 | 150 | 100 | 100 |
unlabeled | 12375 | 4122 | 2923 | 2224 | 597 | 378 | 153 | 129 |
val | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
total | 12875 | 4522 | 3323 | 2524 | 897 | 628 | 253 | 239 |
From
Since our MRS method combines various optimization and augmentation techniques, we perform an extensive ablation study to better understand why it is able to obtain such strong results. Specifically, we measure the performance of our method when each of re-sampling, RandAugment, Fmix, Mixup, and focal loss is removed in turn.
Methods | MEL AUC | MEL ACC | MEL Sensitivity | MEL Specificity | NV AUC | NV ACC | NV Sensitivity | NV Specificity
---|---|---|---|---|---|---|---|---
#2 | 0.808 | 0.896 | 0.675 | 0.941 | 0.878 | 0.902 | 0.813 | 0.944 |
#3 | 0.933 | 0.910 | 0.684 | 0.956 | 0.954 | 0.899 | 0.866 | 0.914 |
#4 | 0.922 | 0.894 | 0.665 | 0.940 | 0.950 | 0.889 | 0.750 | 0.956 |
#6 | 0.911 | 0.914 | 0.555 | 0.950 | 0.944 | 0.869 | 0.877 | 0.865 |
#7 | 0.911 | 0.877 | 0.746 | 0.903 | 0.952 | 0.886 | 0.876 | 0.890 |
MRS | 0.889 | 0.886 | 0.534 | 0.957 | 0.939 | 0.882 | 0.728 | 0.955
Methods | BCC AUC | BCC ACC | BCC Sensitivity | BCC Specificity | AK AUC | AK ACC | AK Sensitivity | AK Specificity
---|---|---|---|---|---|---|---|---
#2 | 0.868 | 0.888 | 0.841 | 0.895 | 0.765 | 0.916 | 0.596 | 0.933 |
#3 | 0.947 | 0.884 | 0.853 | 0.888 | 0.896 | 0.939 | 0.321 | 0.972 |
#4 | 0.935 | 0.878 | 0.803 | 0.890 | 0.888 | 0.932 | 0.342 | 0.963 |
#6 | 0.937 | 0.872 | 0.816 | 0.881 | 0.895 | 0.931 | 0.527 | 0.953 |
#7 | 0.937 | 0.860 | 0.854 | 0.861 | 0.897 | 0.898 | 0.642 | 0.912 |
MRS | 0.914 | 0.896 | 0.630 | 0.935 | 0.857 | 0.906 | 0.449 | 0.930
Methods | BKL AUC | BKL ACC | BKL Sensitivity | BKL Specificity | DF AUC | DF ACC | DF Sensitivity | DF Specificity
---|---|---|---|---|---|---|---|---
#2 | 0.762 | 0.927 | 0.562 | 0.962 | 0.832 | 0.982 | 0.678 | 0.985 |
#3 | 0.907 | 0.925 | 0.551 | 0.960 | 0.977 | 0.988 | 0.567 | 0.993 |
#4 | 0.872 | 0.920 | 0.465 | 0.964 | 0.976 | 0.986 | 0.589 | 0.991 |
#6 | 0.876 | 0.931 | 0.527 | 0.953 | 0.977 | 0.987 | 0.589 | 0.992 |
#7 | 0.891 | 0.902 | 0.616 | 0.929 | 0.961 | 0.730 | 0.656 | 0.977 |
MRS | 0.826 | 0.915 | 0.356 | 0.968 | 0.962 | 0.988 | 0.411 | 0.995
Methods | VASC AUC | VASC ACC | VASC Sensitivity | VASC Specificity | SCC AUC | SCC ACC | SCC Sensitivity | SCC Specificity
---|---|---|---|---|---|---|---|---
#2 | 0.797 | 0.984 | 0.604 | 0.989 | 0.744 | 0.962 | 0.516 | 0.972 |
#3 | 0.938 | 0.989 | 0.989 | 0.995 | 0.922 | 0.975 | 0.446 | 0.986 |
#4 | 0.929 | 0.989 | 0.515 | 0.996 | 0.918 | 0.978 | 0.420 | 0.990 |
#6 | 0.913 | 0.986 | 0.584 | 0.992 | 0.898 | 0.970 | 0.408 | 0.982 |
#7 | 0.896 | 0.985 | 0.624 | 0.990 | 0.921 | 0.961 | 0.592 | 0.969 |
MRS | 0.904 | 0.988 | 0.525 | 0.994 | 0.902 | 0.965 | 0.344 | 0.979
Methods | UNK AUC | UNK ACC | UNK Sensitivity | UNK Specificity
---|---|---|---|---
#2 | 0.562 | 0.798 | 0.179 | 0.946
#3 | 0.502 | 0.808 | 0.004 | 0.999
#4 | 0.642 | 0.807 | 0.012 | 0.997
#6 | 0.500 | 0.808 | 0 | 1
#7 | 0.705 | 0.729 | 0.390 | 0.81
MRS | 0.572 | 0.807 | 0 | 1
We find that each component contributes to MRS's performance. Among them, the contribution of RandAugment is the largest, the contribution of re-sampling is second, and the contribution of focal loss is the smallest.
Methods | AUC | ACC | Sensitivity | Specificity | NMCA
---|---|---|---|---|---
#2 | 0.780 | 0.917 | 0.607 | 0.952 | 0.607 |
#3 | 0.886 | 0.924 | 0.540 | 0.963 | 0.593 |
#4 | 0.892 | 0.919 | 0.507 | 0.965 | 0.578 |
#6 | 0.872 | 0.914 | 0.555 | 0.950 | 0.563 |
#7 | 0.897 | 0.897 | 0.666 | 0.916 | 0.558 |
MRS | 0.865 | 0.916 | 0.449 | 0.969 | 0.553
Methods | AUC | ACC | Sensitivity | Specificity | NMCA
---|---|---|---|---|---
MRS without resampling | 0.835 | 0.895 | 0.461 | 0.946 | 0.487
MRS without RandAugment | 0.830 | 0.896 | 0.440 | 0.947 | 0.474
MRS without Fmix | 0.833 | 0.898 | 0.460 | 0.949 | 0.491
MRS without Mixup | 0.839 | 0.894 | 0.503 | 0.944 | 0.523
MRS without focal loss | 0.842 | 0.899 | 0.499 | 0.947 | 0.524
MRS | 0.883 | 0.917 | 0.577 | 0.941 | 0.528
In this paper, we presented a mixed re-sampled (MRS) class-imbalanced semi-supervised learning method for skin lesion classification. The proposed approach has been evaluated on the ISIC-skin 2019 dataset with a considerably small set of labeled images. Despite using only 4800 labeled images, our method shows only a small performance gap compared with the seven top-ranking entries on the ISIC-2019 skin lesion classification challenge leaderboard, which use all 25331 labeled images. The results show that our method can significantly improve performance compared with other semi-supervised methods on the same task. Achieving state-of-the-art performance, this research confirms previous findings and contributes to our understanding of semi-supervised learning methods for skin lesion classification. A natural progression of this work is to improve the recognition performance on unknown classes. Further research should concentrate on incorporating additional ideas from the semi-supervised and class-imbalanced learning literature into our method.