Breast cancer is the most prevalent cancer among women, and early diagnosis is vital for successful treatment. The examination of images captured during biopsies plays an important role in determining whether a patient has cancer. However, the stochastic patterns, varying color intensities, and large sizes of these images make it challenging to identify and mark malignant regions in them. Against this backdrop, this study proposes an approach to pixel categorization based on the genetic algorithm (GA) and principal component analysis (PCA). The spatial features of the images were extracted using various filters, and the most relevant ones were selected using the GA and fed into classifiers for pixel-level categorization. Three classifiers—random forest (RF), decision tree (DT), and extra tree (ET)—were used in the proposed model. The parameters of all models were tuned separately, and their performance was tested. The results show that the features extracted by using the GA and PCA in the proposed model are influential and reliable for pixel-level classification in service of image annotation and tumor identification. Further, an image from each of the benign, malignant, and normal classes was randomly selected and used to test the proposed model. The proposed GA-PCA-DT model delivered accuracies between 0.99 and 1.0 on a reduced feature set. The predicted pixel sets were also compared with their respective ground-truth values to assess the overall performance of the method on two metrics—the universal image quality index (UIQI) and the structural similarity index (SSI). Both quality measures delivered excellent results.
Cancer is caused by cell abnormalities and is a leading cause of death worldwide. The American Cancer Society (ACS) has estimated that 1.9 million new cancer cases were identified and 608,570 people died of the disease in 2021 in the United States alone (1670 deaths/day) [
The remainder of this paper is organized into the following sections: Section 2 describes past work in the area of image segmentation, feature extraction, and machine learning. Section 3 details the proposed method, which includes the random forest (RF), decision tree (DT) and extra tree (ET) algorithms as well as feature extraction, genetic algorithm (GA)-based feature selection, and performance-related parameters. Section 4 describes the experimental setup used to test the proposed method, the results, and an assessment of its performance. Section 5 summarizes the conclusions of this study and highlights directions for future work in the area.
Researchers have explored solutions for a better, faster, and more accurate understanding of medical images by using computer-based methods. Medical images contain complex patterns that make image processing tasks challenging. In the last decade, many attempts have been made to detect objects in images. Object-oriented algorithms, firefly models, and hybrid models have been developed for the segmentation and classification of nuclei. A recent study [
Although the above results are promising, there is still room for improvement in the feature extraction and reduction steps of the relevant methods. In this study, we extract feature sets based on different combinations and configurations of spatial filters. The feature sets are then significantly reduced to only twelve components and used to construct an effective supervised machine learning model.
We propose an automatic approach for generating new sets of pixel values using different sets of filters to identify pixels in images indicative of breast cancer. Section 3.5 describes some of the filters used in this study.
The proposed model involves feature extraction using different filters, feature selection using the GA, feature reduction using PCA, hyperparameter optimization for the ML models, and pixel-level classification on two datasets: (a) the BreCaHad dataset and (b) a dataset of ultrasound images of the breast. In addition, random images are used without target labels, and the results of the proposed method are compared with the corresponding ground-truth images to visualize its performance. The assessment of our method is based on two image quality indices, the universal image quality index (UIQI) and the structural similarity index (SSI).
(i) Load dataset D= (xi,yi)
(ii) Pre-process all input images
(iii) Extract important features from the image by using different filters and set a target class by using its ground-truth image:
(a) Split the data into training and testing subsets
(b) Apply genetic programming for feature selection
(c) Apply the RF, DT, and extra tree classifier models
(d) Fine-tune the hyperparameters of the classifiers based on the assessment of the training and testing subsets
(iv) Go to the next step if the highest accuracy has been achieved; otherwise, repeat steps (a) to (d) by changing the combination of the filter sets
(v) Select the model with the highest accuracy (ACCmax)
(vi) Randomly select an image and feed it to the proposed GA-PCA-DT model
(vii) For each sample image:
(a) Apply the procedure given in step iii(a)
(b) Tune and align the classifier using the training and testing subsets
(c) Apply genetic programming for feature selection
(d) Apply PCA repeatedly to identify the optimum feature set (i.e., PCA components)
(e) Apply the reduced set of features to the final tuned model
(f) Calculate the accuracy (ACCmax) of each predicted image with respect to the ground-truth values
(viii) Compare the predicted image with the ground-truth image on the quality measures.
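The end-to-end loop above can be sketched roughly as follows. Everything here is an illustrative placeholder rather than the study's actual setup: the per-pixel features are synthetic, the GA step is stood in by a fixed boolean mask, and scikit-learn substitutes for the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# Stand-in for per-pixel filter responses: 2000 pixels x 36 filter features.
X = rng.normal(size=(2000, 36))
# Stand-in ground-truth mask: label depends on a few informative features.
y = (X[:, 0] + X[:, 5] - X[:, 9] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# The GA feature-selection step is replaced by a fixed boolean mask here.
selected = np.zeros(36, dtype=bool)
selected[[0, 1, 5, 7, 9, 11, 13, 17, 20, 22, 25, 30, 33, 35]] = True

# PCA reduces the selected features to 12 components, as in the paper.
pca = PCA(n_components=12, random_state=42)
Z_tr = pca.fit_transform(X_tr[:, selected])
Z_te = pca.transform(X_te[:, selected])

clf = DecisionTreeClassifier(max_depth=100, random_state=10)
clf.fit(Z_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(Z_te))
```

In the actual pipeline, steps (a) to (d) are repeated with different filter-set combinations until the highest accuracy is reached.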
Entire tissue-slide images, of size 1360 × 1024, were converted into grayscale images of size 300 × 300. The images were resized for two reasons: (a) to reduce the computational and memory-related costs, and (b) to render the images manageable.
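A rough sketch of this pre-processing step (BGR-to-grayscale conversion and nearest-neighbour resizing, both hand-rolled here rather than taken from any particular library) might be:

```python
import numpy as np

def to_grayscale(img_bgr):
    # Luminosity weights for B, G, R channels (OpenCV-style channel order).
    w = np.array([0.114, 0.587, 0.299])
    return (img_bgr * w).sum(axis=2)

def resize_nn(img, out_h, out_w):
    # Nearest-neighbour resize: index the source grid at scaled positions.
    h, w = img.shape
    rows = (np.arange(out_h) * h / out_h).astype(int)
    cols = (np.arange(out_w) * w / out_w).astype(int)
    return img[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
slide = rng.integers(0, 256, size=(1024, 1360, 3)).astype(float)  # fake slide
gray = resize_nn(to_grayscale(slide), 300, 300)
```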
Digital filters are useful for image pre-processing tasks. A variety of filters have been used in the literature to extract the properties of images [
The Prewitt operator works on grayscale images and applies both masks one by one to determine the horizontal and vertical edges. Both edges are then combined to display the complete edges of the given image:
Here Im is an image, and the total magnitude in both directions can be calculated as G = √(Gx² + Gy²). A higher value of G represents better edge detection. The direction of the edge is given by θ = tan⁻¹(Gy/Gx).
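A minimal NumPy sketch of the Prewitt computation, using a toy ramp image, might look as follows; the convolution helper is a deliberately simple "valid"-mode implementation:

```python
import numpy as np

PREWITT_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
PREWITT_Y = PREWITT_X.T

def conv2d_valid(img, k):
    # Direct 3x3 "valid" cross-correlation in plain NumPy.
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

img = np.tile(np.arange(8, dtype=float), (8, 1))  # horizontal intensity ramp
gx = conv2d_valid(img, PREWITT_X)   # responds to vertical edges
gy = conv2d_valid(img, PREWITT_Y)   # responds to horizontal edges
g = np.hypot(gx, gy)                # G = sqrt(Gx^2 + Gy^2)
theta = np.arctan2(gy, gx)          # edge direction
```

On the ramp image the horizontal mask responds uniformly while the vertical mask is silent, as expected.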
Gaussian blurring is a method of image denoising that is commonly used for image pre-processing. The Gaussian operator for a 2D distribution is G(x, y) = (1/(2πσ²)) e^(−(x² + y²)/(2σ²)).
The value of sigma (σ) is important, as it determines the width of the kernel with which the image is blurred. Higher values blur the image over a larger area, while lower values limit the blurring area.
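The effect of sigma can be illustrated by sampling the 2D Gaussian directly; the kernel size and sigma values below are arbitrary:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    # G(x, y) = exp(-(x^2 + y^2) / (2 sigma^2)) / (2 pi sigma^2),
    # sampled on a size x size grid and renormalised to sum to 1.
    ax = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return k / k.sum()

narrow = gaussian_kernel(9, 1.0)
wide = gaussian_kernel(9, 3.0)
# A larger sigma spreads weight away from the centre, blurring a wider area.
```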
The Gabor filter acts as a local bandpass filter to extract spatial and edge-related features. It is constructed by combining a Gaussian function with a sinusoidal input, and can be expressed in a generalized parametric form in terms of the orientation, wavelength, phase offset, and aspect ratio of the kernel.
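A common way to realize this is to multiply a Gaussian envelope by a cosine carrier. The sketch below builds the real part of such a kernel and a small orientation bank, as a typical Gabor feature extractor would; all parameter values are illustrative:

```python
import numpy as np

def gabor_kernel(size, sigma, theta, lam, psi=0.0, gamma=1.0):
    # Real part of a Gabor filter: Gaussian envelope times cosine carrier.
    ax = np.arange(size) - (size - 1) / 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + psi)
    return envelope * carrier

# A small bank over four orientations (0, 45, 90, 135 degrees).
bank = [gabor_kernel(15, 3.0, t, 6.0)
        for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
```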
The median filter is a non-linear digital filter that is commonly used to de-noise images. The median operation uses the median value of the pixels surrounding the pixel of interest. All pixel values are arranged and their median value is then selected as the value of the central pixel.
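A minimal NumPy sketch of this operation, using a 3×3 window and leaving border pixels untouched, might look as follows:

```python
import numpy as np

def median_filter3(img):
    # 3x3 median filter: each interior pixel becomes the median of its window.
    h, w = img.shape
    windows = np.stack([img[i:i + h - 2, j:j + w - 2]
                        for i in range(3) for j in range(3)], axis=-1)
    out = img.copy()
    out[1:-1, 1:-1] = np.median(windows, axis=-1)
    return out

img = np.full((7, 7), 10.0)
img[3, 3] = 255.0                      # a single salt-noise pixel
clean = median_filter3(img)            # the outlier is replaced by the median
```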
This study uses the genetic algorithm (GA) to control the problem of dimensionality. The GA searches for the optimum value based on the survival-of-the-fittest principle. Genetic operators include selection, crossover, and mutation. These operators are applied to chromosomes (candidate solutions) to optimize their fitness values. Chromosomes are collections of genes, as shown in
In feature selection, “0” indicates the absence of a particular feature in the chromosome and “1” represents its presence. The initial population size, number of generations (iterations), crossover points, crossover probability, mutation probability, and technique of chromosome selection are the major design considerations. The process of genetic programming is shown in
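A toy version of this selection loop might look as follows. The fitness function below is a deliberate stand-in (rewarding three hypothetical "informative" features while penalizing mask size) for the classifier-accuracy fitness the study would use, and all population settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
N_FEATURES = 36

def fitness(mask, X, y):
    # Toy fitness: reward masks that keep informative columns (0, 5, 9)
    # and penalise mask size, standing in for a classifier's accuracy.
    if not mask.any():
        return -1.0
    informative = {0, 5, 9}
    hits = len(informative & set(np.flatnonzero(mask)))
    return hits - 0.01 * mask.sum()

def ga_select(X, y, pop_size=50, generations=20, p_mut=0.05):
    # Each chromosome is a boolean mask: 1 keeps a feature, 0 drops it.
    pop = rng.random((pop_size, N_FEATURES)) < 0.5
    for _ in range(generations):
        scores = np.array([fitness(m, X, y) for m in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]          # truncation selection
        cut = rng.integers(1, N_FEATURES)              # one-point crossover
        kids = np.concatenate([parents[:, :cut],
                               parents[::-1, cut:]], axis=1)
        kids ^= rng.random(kids.shape) < p_mut         # bit-flip mutation
        pop = np.concatenate([parents, kids])
    scores = np.array([fitness(m, X, y) for m in pop])
    return pop[int(np.argmax(scores))]

best = ga_select(None, None)
```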
Principal component analysis is a popular approach for dimensionality reduction in which an original set of attributes is orthogonally transformed into a new set of attributes. The relationship between the attribute sets (x, y) can be analyzed through a covariance matrix. Eigenvalues and eigenvectors are the fundamental measures used to determine the relevant components. The eigenvector with the maximum eigenvalue represents the first component, and the remaining components are ordered by decreasing eigenvalue.
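The computation described here can be sketched directly from the covariance matrix; the data below are synthetic, with one attribute deliberately correlated with another so that a few components carry most of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data: 500 samples, 5 attributes.
X = rng.normal(size=(500, 5))
X[:, 1] = X[:, 0] * 3.0 + 0.1 * rng.normal(size=500)

Xc = X - X.mean(axis=0)                    # centre the attributes
cov = np.cov(Xc, rowvar=False)             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

components = Xc @ eigvecs[:, :2]           # project onto the top 2 components
explained = eigvals[:2].sum() / eigvals.sum()
```

`sklearn.decomposition.PCA` performs the same transformation (via an SVD) and exposes the retained variance as `explained_variance_ratio_`.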
Machine learning algorithms are used to identify the important characteristics in a given dataset for predictive analysis or classification. This kind of identification is based on such attributes as the number of dimensions and the location of each data point. Medical images captured by advanced medical equipment provide a greater number and variety of features than images obtained by using traditional systems.
The major steps are as follows: A random image is selected for classification. Features are extracted using a set of filters, and target values are assigned. The extracted dataset is then split into training and testing subsets, and both subsets are submitted to the selected model. The training subset is used to train the model to predict each pixel value, whereas the results on the test subset direct the tuning and evaluation of the model for better accuracy.
The final step consists of optimization using the genetic algorithm (GA) and principal component analysis (PCA). First, the GA selects the most discriminant and important feature set, which is processed by PCA to be limited to 12 components. Next, the tuned and aligned models are provided with the final components without their target values. The predicted values are then compared with the respective ground-truth values.
The RF classifier builds multiple decision trees on different sets of samples and aggregates their predictions. Bootstrapping is carried out and random features are selected to this end, and the final decision is taken according to the majority-voting principle. Such averaging is useful for handling the problem of model overfitting.
The decision tree evaluates data points against multiple pre-specified “if-else” conditions. Each leaf node contains one of the possible classes.
This classifier is similar to the random forest in that it applies multiple decision trees. In the extra tree classifier, however, all data points are supplied to each tree, whereas the RF supplies bootstrapped subsamples.
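To illustrate the three classifiers side by side, the sketch below trains them on synthetic data using the best parameter values reported in this paper's tuning table; the data and the train/test split are placeholders:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(600, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(int)   # simple synthetic boundary

models = {
    # RF: bootstrapped subsamples, majority voting over 70 trees.
    "RF": RandomForestClassifier(n_estimators=70, random_state=50),
    # DT: a single tree of if-else conditions.
    "DT": DecisionTreeClassifier(max_depth=100, random_state=10),
    # ET: whole sample per tree, extra randomness in the split thresholds.
    "ET": ExtraTreesClassifier(n_estimators=500, random_state=50),
}
scores = {name: m.fit(X[:400], y[:400]).score(X[400:], y[400:])
          for name, m in models.items()}
```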
True positive (TP) = number of positive samples correctly predicted.
False-negative (FN) = number of positive samples incorrectly predicted.
False-positive (FP) = number of negative samples incorrectly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
Accuracy (overall): the ratio of all correctly classified cases to the total number of cases. It is defined as Accuracy = (TP + TN) / (TP + TN + FP + FN).
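These definitions can be made concrete with a few lines of plain Python; the sample labels are arbitrary:

```python
def confusion_counts(y_true, y_pred):
    # Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)   # (2 + 2) / 6
```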
The predicted images are assessed using two image quality measures: (a) the universal image quality index (UIQI) and (b) the structural similarity index (SSI) [
We used the breast cancer histopathological annotation and diagnosis dataset (BreCaHAD). It is a collection of open-source, benchmark histopathological images of patients with breast cancer [
Another dataset consisting of ultrasound images was used. These images were classified as benign, malignant, or normal. The ground-truth values of each image are also provided in this open dataset, which can be accessed at the following link:
All experimental work was carried out on the Google Colab cloud platform with a Tesla P100-PCIE GPU. Scikit-learn and other packages were installed as required. The scripting language was Python 3.5, and the seed value was set to 42.
Model | Training accuracy | Test accuracy | Remarks (best parameter values) |
---|---|---|---|
Random forest | 1.0 | 0.651 | {n_estimators: 70, random_state: 50} |
Decision tree | 1.0 | 0.942 | {max_depth: 100, random_state: 10} |
Extra tree | 0.829 | 0.536 | {n_estimators: 500, random_state: 50} |
Feature set | Extracted features | Classifier | Training accuracy | Test accuracy |
---|---|---|---|---|
Feature set-1 | Gabor features | RF | 1.0 | 0.578 |
 | | DT | 1.0 | 0.94 |
 | | ET | 0.713 | 0.49 |
Feature set-2 | Gabor features + Prewitt | RF | 1.0 | 0.581 |
 | | DT | 1.0 | 0.941 |
 | | ET | 0.716 | 0.468 |
Feature set-3* | Gabor features + Prewitt + Gaussian | RF | 1.0 | 0.658 |
 | | DT | | |
 | | ET | 0.785 | 0.575 |
Feature set-4 | Gabor+ | RF | 1.0 | 0.69 |
 | | DT | 1.0 | 0.94 |
 | | ET | 0.82 | 0.60 |
Classifiers | GA-DT model (30 features) | | Base models (36 features) | |
---|---|---|---|---|
 | Training accuracy | Test accuracy | Training accuracy | Test accuracy |
DT | 1.0 | 1.0 | 0.942 | |
When classifying the images, the population size was set to 50 and the number of generations to 20 to ensure robust performance. The seed value was set to 42 so that the results could be reproduced, and the number of PCA components was limited to 12, which retained an explained variance between 0.9999 and 1.0.
To evaluate the images, an unlabeled image was first supplied to the built model (GA-PCA-DT), and the output of the model (the predicted image) was compared with the respective ground-truth (labeled) image. The markers or labels of the predicted and the ground-truth images were then noted. The relevant steps are shown in
To check the robustness of the model, one image from each malignant case was randomly selected and supplied to it for label prediction.
Image ID | Original image (in BGR) | Ground-truth image (in grayscale) | Predicted image (in grayscale) | Ground-truth nucleus annotation (in“jet” colormap) | Predicted nucleus annotation using GA+PCA (12)* features+DT (in “jet” colormap) |
---|---|---|---|---|---|
Image Case_1-04.png | |||||
Image Case_12-09.png | |||||
Image Case_13-05.png | |||||
Image Case_16-03.png | |||||
Image Case_17-06.png | |||||
Image Case_4-10.png |
Note: *PCA(12) = PCA first 12 components.
Image | Training/testing and whole-image accuracies | Accuracy using GA features (24 features) | Whole-image accuracy with GA+PCA (12) features | PCA variance with 12 components |
---|---|---|---|---|
Image Case_1-04.png | Train: 1.0 | Train: 1.0 | Whole image: 1.0 | 0.99996 |
Image Case_12-09.png | Train: 1.0 | Train: 1.0 | Whole image: 1.0 | 0.99997 |
Image Case_13-05.png | Train: 1.0 | Train: 1.0 | Whole image: 1.0 | 1.0 |
Image Case_16-03.png | Train: 1.0 | Train: 1.0 | Whole image: 0.999788 | 0.99999 |
Image Case_17-06.png | Train: 1.0 | Train: 1.0 | Whole image: 1.0 | 0.99999 |
Image Case_4-10.png | Train: 0.999962 | Train: 0.99974 | Whole image: 0.99978 | 0.99995 |
A simplified image quality measure, the UIQI, was proposed by Wang and Bovik. The UIQI evaluates the quality of the test image against the reference image by measuring the loss of correlation, the distortion in luminance, and the distortion in contrast between the two. Suppose x is the reference image and y is the test image; the index is then computed from their means, variances, and covariance:
UIQI has been employed to evaluate how the predicted images corresponded to the ground-truth images, as shown in
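As a sketch of how such a comparison might be computed, the following NumPy function evaluates the UIQI globally over a whole image pair. The original formulation averages the index over sliding windows, so this is an illustrative simplification:

```python
import numpy as np

def uiqi(x, y):
    # Universal image quality index (Wang & Bovik), computed globally:
    # Q = 4 * cov(x, y) * mean(x) * mean(y)
    #     / ((var(x) + var(y)) * (mean(x)^2 + mean(y)^2)).
    x, y = x.astype(float).ravel(), y.astype(float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 4 * cov * mx * my / ((vx + vy) * (mx**2 + my**2))

rng = np.random.default_rng(0)
img = rng.integers(1, 255, size=(64, 64))
q_same = uiqi(img, img)        # identical images score 1
q_dim = uiqi(img, img * 0.5)   # luminance distortion lowers the score
```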
An extension of the UIQI, structural similarity (SSI), was also proposed by Wang et al. [
The SSI examines the similarity between the original and the processed image. It has a value between zero and one, where zero indicates that the images are completely dissimilar and one indicates that they are identical. The SSI measure assumes that the human eye gathers image information through three channels: luminance, contrast, and structure. The SSI therefore uses measurement functions for luminance (L), contrast (C), and structure (S), which are combined to calculate the final SSI value as represented in
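A minimal global version of this combination can be sketched in NumPy. Note that production implementations (e.g., skimage's `structural_similarity`) compute the index over local sliding windows and average it; the constants below follow the common k1 = 0.01, k2 = 0.03 convention:

```python
import numpy as np

def ssi(x, y, L=255, k1=0.01, k2=0.03):
    # Global SSIM (Wang et al.):
    # (2*mx*my + c1)(2*cov + c2) / ((mx^2 + my^2 + c1)(vx + vy + c2)).
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    x, y = x.astype(float).ravel(), y.astype(float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2) /
            ((mx**2 + my**2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(1)
gt = rng.integers(0, 256, size=(64, 64))
noisy = np.clip(gt + rng.normal(0, 10, size=gt.shape), 0, 255)
```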
A score of 0.99 and above is considered representative of satisfactory similarity. The SSI scores of the randomly selected images are shown in
Image class | Original image | Ground-truth image** | Predicted image | Feature importance using GA+PCA (12) | Whole-image accuracy using GA+PCA (12)+DT |
---|---|---|---|---|---|
1.0 | 1.0 | ||||
1.0 | 1.0 | ||||
1.0 | 1.0 |
Note: ** Black and white images.
The proposed method successfully extracted prominent spatial features from the images through various filters, reduced the feature-set size by applying the GA and PCA in combination, and accurately classified all pixels of the chosen test images. The RF, DT, and extra tree models were built, and their hyperparameters were optimized based on the training and testing sets. The training accuracies of the GA-based method (without PCA) were 1.0, and its testing accuracies ranged from 0.92 to 0.99 on the histopathology dataset. On the ultrasound image dataset, the model yielded training accuracies of 1.0 and testing accuracies from 0.9327 to 1.0. The final proposed model, GA-PCA-DT, was tested on images selected from each class and produced whole-image classification accuracies between 0.9997 and 1.0. Furthermore, comparing the predicted images (all pixel values) with the respective ground-truth values yielded UIQI scores in the range of 0.9999 to 1.0 and SSI scores in the range of 0.99824 to 1.0 for test images chosen randomly from every class of both datasets.
Breast cancer detection and annotation is a tedious and time-consuming task for healthcare experts. The proposed model is simple, accurate, fast, and inexpensive, and can thus help classify whole-image pixels into binary classes, i.e., “disease” and “non-disease”. We believe that healthcare experts’ interventions are necessary for every medical examination; such automatic models can only assist experts at the primary level in expediting breast cancer detection, especially in large populations or remote areas where healthcare facilities are insufficient.
The results reported here are excellent but slightly overfitted. The likely reasons for this are (a) a large number of target classes and (b) the consideration of exceptional pixel values as an additional class. The problem of overfitting can be solved by increasing the population size, further fine-tuning the model, and using regularization methods.
In future work, attention can be directed to the use of more ML models and filters, and to the employment of additional feature-reduction techniques. In addition, different image quality measures can be explored for better evaluation. In this study, the images were resized to lower dimensions; future studies could operate on the original images (i.e., 1360 × 1024) and use RGB (red, green, blue) channels instead of grayscale values.
The authors thank the following dataset contributors:
We thank Aksac et al. for providing a histopathological annotated dataset for breast cancer diagnosis for academic and research purposes. We also used ultrasound images for breast cancer diagnosis from the Breast Ultrasound Images Dataset on Kaggle.