This study offers a framework for a breast cancer computer-aided treatment prediction (CATP) system. The rising death rate among women due to breast cancer is a worldwide health concern that can only be addressed by early diagnosis and frequent screening. Mammography has been the most widely used breast imaging technique to date. Radiologists have begun to use computer-aided detection and diagnosis (CAD) systems to improve the accuracy of breast cancer diagnosis by minimizing human error. Despite the progress of artificial intelligence (AI) in the medical field, this study indicates that systems that can anticipate a treatment plan once a patient has been diagnosed with cancer are few and not widely used. Such a system would assist clinicians in determining the optimal treatment plan and help avoid exposing a patient to unnecessary, hazardous, and costly treatment. To develop the prediction model, data from 336,525 patients in the SEER dataset were split into training (80%) and testing (20%) sets. Decision Trees, Random Forest, XGBoost, and CatBoost were used with feature importance to build the treatment prediction model. The best overall Area Under the Curve (AUC) achieved was 0.91, using Random Forest on the SEER dataset.
In 2020, 2.3 million women were diagnosed with breast cancer, and there were 685,000 deaths worldwide. As of the end of 2020, 7.8 million women alive had been diagnosed with breast cancer in the past five years, making it the world’s most prevalent cancer [
Since the late 1960s, computer-aided diagnosis (CAD) for mammography has progressed. Its primary goal is to aid radiologists in detecting malignancies that might otherwise go undetected [
Image identification, clinical translation of tumour phenotype to genotype, and outcome prediction in connection to treatment and prognosis strategies are areas where artificial intelligence (AI) can streamline and integrate radiologist diagnostic skills. For decades, radiologists have relied on AI-assisted CAD systems to translate visual data to quantitative data [
This paper proposes an open-source framework for a CATP system for breast cancer.
The contents of this paper are organized as follows: Section 2 reviews previous work on breast cancer, its diagnosis, and treatment. Section 3 contains the experimental setup details, dataset used, classification models, and the evaluation metrics. In Section 4, we present a series of experimental results to demonstrate the effectiveness of the proposed framework. Finally, concluding remarks are provided in Section 5.
Breast cancer is a condition in which the cells of the breast begin to grow out of control. There are several types of breast cancer, and the type is determined by which cells in the breast become cancerous. A breast has three major components: lobules (glands that make milk), ducts (tubes that transport milk from the breast to the nipple), and connective tissue (fibrous and fatty tissue). Breast cancer usually starts in the ducts or lobules. It can also spread to other parts of the body via blood and lymph vessels.
Breast cancer is divided into three categories based on Estrogen receptor (ER), Progesterone receptor (PR), and human epidermal growth factor 2 (ERBB2) gene amplification, previously known as human epidermal growth factor receptor 2 (HER2) gene amplification: ER or PR positive (also known as HR+), ERBB2 positive, or triple negative. HR+ or ERBB2+ subtypes have a five-year average overall survival, while triple-negative subtypes have a one-year average overall survival [
There are several cancer staging methods in use right now. One method divides tumours into five stages: Stage 0, Stage I, Stage II, Stage III, and Stage IV, with further subcategories, where Stage IV denotes cancer with distant metastasis. TNM (tumour, node, metastasis) is another cancer staging method that assigns stages based on the tumour, node, and metastasis status [
There are two types of treatment for cancer, depending on the kind and stage of the disease: local and systemic. Surgery and radiation are considered local treatments, as they treat the tumour without affecting the rest of the body. Systemic therapy, on the other hand, uses drugs to combat the disease. These drugs can be administered directly into the bloodstream or orally and can reach cancer cells anywhere in the body. Systemic treatments include chemotherapy, hormone therapy, targeted drug therapy, and immunotherapy. Systemic treatment may be preoperative (neoadjuvant), postoperative (adjuvant), or both [
Breast conservation surgery (excision of the tumour with surrounding normal breast tissue) and mastectomy (total removal of breast tissue) are the two options for surgery. Because of their impact on local recurrence following breast-conserving surgery, some clinical and pathological variables may affect the choice between breast conservation and mastectomy. An inadequate initial excision, young age, the existence of a significant
Mastectomy is a surgical procedure that removes the breast tissue and a portion of the overlying skin, which generally includes the nipple. A mastectomy should be combined with some form of axillary lymph node surgery. Lymph node dissection is used for both diagnostic (determining the anatomic extent of breast cancer) and therapeutic (removal of cancerous cells) purposes [
Breast conservation surgery may consist of excision of the tumour with a 1 cm margin of normal tissue (wide local excision) or a more extensive excision of a complete quadrant of the breast (quadrantectomy). The extent of excision is the most critical factor determining local recurrence following breast-conserving surgery. Compared to grade II or III tumours, grade I tumours have a 1.5-fold lower recurrence rate. The larger the excision, the lower the recurrence rate but the poorer the aesthetic result. Although there is no size restriction for breast conservation surgery, adequate excision of lesions larger than 4 cm yields a poor aesthetic outcome; hence, most breast units limit breast-conserving surgery to lesions smaller than 4 cm. Breast conservation surgery may be done at any age [
Breast cancer patients may get radiation treatment to the entire breast or a breast section (after lumpectomy), the chest wall (after mastectomy), and the regional lymph nodes. Whole-breast radiation after a lumpectomy is a standard part of breast-conserving treatment [
In the United States, 5.8% of breast cancer patients are metastatic, with a 5-year survival rate of 29% [
Neoadjuvant chemotherapy is used to treat localized early-stage triple-negative breast cancer (TNBC) in order to preserve the breast, or for patients who are temporarily unable to undergo surgery. Chemotherapy in the neoadjuvant setting allows the response to be assessed directly by clinical examination or imaging [
The Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute (NCI) is a trustworthy source of information on cancer incidence and survival in the United States. SEER collects and publishes cancer incidence and survival data from population-based cancer registries covering about 47.9% of the US population. The SEER Program registries routinely gather data on patient demographics, primary tumour site, tumour morphology and stage at diagnosis, the first course of therapy, and vital status follow-up [
Feature | Description |
---|---|
Age | The age of the patient at diagnosis |
Laterality | Describes the side of the breast on which the reportable tumour originated |
HISTOLOGY ICD-O-2 | Code that describes the microscopic composition of cells and/or tissue for a specific primary. The tumour type or histology is a basis for staging and determination of treatment options. It affects the prognosis and course of the disease |
Breast subtype | Created with combined information from ER Status Recode Breast Cancer, PR Status Recode Breast Cancer, and Derived HER2 Recode |
Tumour size | Information on tumour size |
Lymph nodes | Information on involvement of lymph nodes |
Regional nodes evaluated | Records the total number of regional lymph nodes that were removed and examined by the pathologist |
Regional nodes positive | Records the exact number of regional lymph nodes examined by the pathologist that were found to contain metastases |
Mets at distant lymph nodes | Information on distant metastasis |
Stage | The stage of cancer |
T | American Joint Committee on Cancer (AJCC) “T” component: extent (size) of the tumour |
N | AJCC “N” component: spread to nearby lymph nodes |
M | AJCC “M” component: spread (metastasis) to distant sites |
ER | Indicates whether the cancer has the estrogen receptor protein or not |
PR | Indicates whether the cancer has the progesterone receptor protein or not |
HER2 | Indicates whether the cancer has the HER2 protein or not |
Surgery | Indicates whether a surgery is recommended or not |
Radiotherapy | Indicates whether radiotherapy is recommended or not |
Chemotherapy | Indicates whether chemotherapy is recommended or not |
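The Breast subtype feature in the table combines the three receptor statuses. A minimal sketch of such a combination is shown here; the function name, the `"positive"`/`"negative"` string encoding, the four output labels, and the rule that any unknown status yields `"Unknown"` are illustrative assumptions, not the SEER recode logic itself.

```python
def breast_subtype(er, pr, her2):
    """Combine ER, PR and HER2 statuses into a single subtype label.

    Hypothetical encoding: each status is "positive" or "negative";
    any other value is treated as unknown (a simplifying assumption).
    """
    known = {"positive", "negative"}
    if er not in known or pr not in known or her2 not in known:
        return "Unknown"
    # Hormone-receptor positive (HR+) if either ER or PR is positive.
    hr_label = "HR+" if "positive" in (er, pr) else "HR-"
    her2_label = "HER2+" if her2 == "positive" else "HER2-"
    return hr_label + "/" + her2_label


# HR-/HER2- corresponds to the triple-negative subtype.
print(breast_subtype("negative", "negative", "negative"))
```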
To prepare the data for the model, and since we are only interested in whether a specific treatment is recommended or not, the information in the three treatment features (Surgery, Radiotherapy, and Chemotherapy) was converted into binary (yes/no). After considering the different treatment combinations, we ended up with eight classes representing the various treatment plans.
Surgery | Radiotherapy | Chemotherapy | Treatment plan | % of records |
---|---|---|---|---|
0 | 0 | 0 | A | 2.65% |
0 | 0 | 1 | B | 1.93% |
0 | 1 | 0 | C | 0.6% |
0 | 1 | 1 | D | 0.66% |
1 | 0 | 0 | E | 26.42% |
1 | 0 | 1 | F | 14.34% |
1 | 1 | 0 | G | 29.68% |
1 | 1 | 1 | H | 23.69% |
After the plans are generated, the three treatment features are removed from the dataset, leaving sixteen features for use in the proposed models. The newly derived treatment plan feature is used as the label for our model.
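The label construction described above can be sketched as follows. The column names follow the feature table, while the function names and the dictionary-based record format are assumptions made for illustration.

```python
from itertools import product

# Bits are ordered (surgery, radiotherapy, chemotherapy), so that
# (0, 0, 0) -> "A", (0, 0, 1) -> "B", ..., (1, 1, 1) -> "H",
# matching the treatment plan table.
PLAN_LABELS = dict(zip(product((0, 1), repeat=3), "ABCDEFGH"))

TREATMENT_COLUMNS = ("Surgery", "Radiotherapy", "Chemotherapy")


def treatment_plan(surgery, radiotherapy, chemotherapy):
    """Map three binary treatment flags to one of the plan labels A-H."""
    return PLAN_LABELS[(int(surgery), int(radiotherapy), int(chemotherapy))]


def split_label(record):
    """Return (features without the treatment columns, plan label)."""
    label = treatment_plan(*(record[col] for col in TREATMENT_COLUMNS))
    features = {k: v for k, v in record.items() if k not in TREATMENT_COLUMNS}
    return features, label


# Hypothetical record with a single non-treatment feature.
example = {"Age": 61, "Surgery": 1, "Radiotherapy": 1, "Chemotherapy": 0}
print(split_label(example))
```

Encoding the three flags as one eight-way label turns the task into a single multi-class problem rather than three separate binary ones.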
The goal of feature selection is to pick a subset of the input features that accurately characterizes the data while limiting the influence of noise and irrelevant variables and still delivering high prediction performance. Feature selection has been shown to be an effective and efficient way of preparing data (particularly high-dimensional data) for machine-learning problems [
A feature selection criterion that measures the relevance of each feature to the output class labels is needed to eliminate irrelevant features. If a machine-learning system is trained on irrelevant variables, it will apply what it learned from them to new data, resulting in poor generalization. Removing irrelevant variables should not be confused with other dimension reduction approaches, such as Principal Component Analysis (PCA), because good features might be independent of the rest of the data [
After selecting the features, the input samples are classified into one of the treatment classes using a classifier. In this study we will utilize Decision Trees (DT), Random Forest (RF), XGBoost, and CatBoost (gradient boosting on decision trees) to predict the treatment plan.
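Given an ensemble of tree classifiers $h_1(\mathbf{x}), \dots, h_K(\mathbf{x})$, the margin function of a random forest is defined as

$$mg(X, Y) = \operatorname{av}_k I\left(h_k(X) = Y\right) - \max_{j \neq Y} \operatorname{av}_k I\left(h_k(X) = j\right),$$

where $I(\cdot)$ is the indicator function and $\operatorname{av}_k$ denotes the average over the $K$ trees. The generalization error is given by

$$PE^{*} = P_{X,Y}\left(mg(X, Y) < 0\right),$$

where the subscripts X, Y indicate that the probability is over the X, Y space. As more trees are added, the generalization error of RF converges to a limiting value, and thus no overfitting occurs [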
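In Breiman's analysis of random forests, the generalization error is the probability that the ensemble margin is negative, where the margin is the share of trees voting for the true class minus the largest share for any other class. The following is a minimal sketch of that idea on per-tree votes; the function names and the toy votes are hypothetical.

```python
from collections import Counter


def margin(votes, true_label):
    """Vote share for the true class minus the largest share of any other class."""
    counts = Counter(votes)
    n = len(votes)
    true_share = counts.get(true_label, 0) / n
    other_share = max(
        (c / n for label, c in counts.items() if label != true_label),
        default=0.0,
    )
    return true_share - other_share


def empirical_error(all_votes, true_labels):
    """Fraction of samples with a negative margin (an empirical error estimate)."""
    margins = [margin(v, y) for v, y in zip(all_votes, true_labels)]
    return sum(m < 0 for m in margins) / len(margins)


# Toy example: four trees vote on one sample whose true plan is "G".
print(margin(["G", "G", "E", "H"], "G"))
```

A positive margin means the forest classifies the sample correctly by majority vote; a negative margin means another class received more votes than the true one.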
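eXtreme Gradient Boosting (XGBoost) builds an additive ensemble of regression trees. At iteration $t$, a new tree $f_t$ is added to minimize the regularized objective

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}.$$

This means we greedily add the $f_t$ that most improves the model. A second-order approximation of the objective, with constant terms removed, gives

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ and $h_i = \partial^{2}_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ are the first- and second-order gradient statistics of the loss function, $T$ is the number of leaves, and $w$ is the vector of leaf weights.

Define $I_j = \{ i \mid q(x_i) = j \}$ as the instance set of leaf $j$.

For a fixed structure $q(x)$, we can compute the optimal weight $w_j^{*}$ of leaf $j$ as

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.$$

And calculate the corresponding optimal value:

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T.$$

This equation can be used as a scoring function to measure the quality of a tree structure $q$. The loss reduction obtained by splitting a node is

$$\mathcal{L}_{split} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^{2}}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^{2}}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^{2}}{\sum_{i \in I} h_i + \lambda} \right] - \gamma,$$

where $I_L$ and $I_R$ are the instance sets of the left and right nodes after the split and $I = I_L \cup I_R$.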
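The optimal leaf weight and the tree-structure score referred to above can be illustrated numerically. This sketch assumes the standard XGBoost formulation, in which each leaf is summarized by the sum of first-order gradients `G` and second-order gradients `H` of its instances, with L2 penalty `lam` and per-leaf penalty `gamma`; the function names and toy numbers are hypothetical.

```python
def leaf_weight(g_sum, h_sum, lam=1.0):
    """Optimal weight of a leaf: -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)


def structure_score(leaves, lam=1.0, gamma=0.0):
    """Score of a fixed tree structure; lower is better.

    `leaves` is a list of (G, H) pairs, one per leaf.
    """
    quality = sum(g * g / (h + lam) for g, h in leaves)
    return -0.5 * quality + gamma * len(leaves)


def split_gain(left, right, lam=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf into `left` and `right`."""
    (gl, hl), (gr, hr) = left, right

    def term(g, h):
        return g * g / (h + lam)

    return 0.5 * (term(gl, hl) + term(gr, hr) - term(gl + gr, hl + hr)) - gamma


# Toy gradient sums for two candidate leaves.
left, right = (-4.0, 3.0), (2.0, 1.0)
print(leaf_weight(*left), structure_score([left, right]), split_gain(left, right))
```

The split gain equals the score of the unsplit node minus the score of the two children, which is why a split is only worthwhile when the gain exceeds zero after the `gamma` penalty.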
Learning algorithm evaluation is not a simple task, as it needs a careful selection of assessment metrics, error-estimation methodologies, statistical tests, and a realization that the results will never be entirely conclusive. This is due, in part, to any evaluation tool's inherent bias and the frequent violation of the assumptions on which it is based. When there are class disparities, as there are in the dataset utilized in this study, the problem becomes considerably more complex. When data is skewed, the default, relatively robust techniques employed for un-skewed data may fail catastrophically [
The ROC curve is built from the true positive rate (TPR) and the false positive rate (FPR):

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN},$$

where True Positive (TP) and True Negative (TN) are the numbers of samples in the test set that the classifier correctly identifies as positive or negative, respectively, and False Negative (FN) and False Positive (FP) are the numbers of samples mistakenly classified as negative or positive, respectively. The points given by all the resulting (FPR, TPR) pairs are plotted in what is known as the ROC space, a graph that depicts the true positive rate as a function of the false positive rate. The points are then connected to form a smooth curve that represents the classifier’s ROC curve. The closer a classifier's curve is to the top left corner of the ROC space (small FPR, large TPR), the better the performance of that classifier. For example,
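The rates and the area under the ROC curve described above can be sketched in a few lines. The AUC here is computed through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half), which equals the area under the ROC curve; the function names and toy scores are hypothetical, and a multi-class problem such as the eight treatment plans would apply this one class at a time (one-vs-rest).

```python
def rates(labels, scores, threshold):
    """Return (TPR, FPR) when samples with score >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy one-vs-rest problem: 1 marks the target treatment plan, 0 any other plan.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(rates(labels, scores, 0.5), auc(labels, scores))
```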
In this study, the RFE, SHAP, and Shap-hypetune algorithms were used on the dataset to calculate feature importance. Using RFE with Decision Trees resulted in eliminating five features (N, HER2, Mets at Distant LN, ICD_O_2 Histology, and Laterality), while using the same algorithm with the Random Forest classifier resulted in eliminating only one feature (Laterality). Shapley values, along with Shap-hypetune, were used on the XGBoost model to calculate feature importance, which resulted in excluding only one feature (ICD_O_2 Histology). Finally, the CatBoost built-in feature importance method was used; M and Laterality were the least important features.
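Recursive feature elimination (RFE) repeatedly refits a model and drops the least important remaining feature. A minimal sketch of that loop follows; `importance_fn` is a hypothetical callable standing in for "fit the classifier on these features and return their importances", and the toy importance values are invented for illustration rather than taken from the study.

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Drop the weakest feature one at a time until n_keep features remain.

    importance_fn(remaining) must return a dict mapping each remaining
    feature to its importance, as if a model had been refit on them.
    """
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)
        weakest = min(remaining, key=lambda f: scores[f])
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated


# Hypothetical, fixed importances used only to exercise the loop.
TOY_IMPORTANCE = {"Age": 0.30, "Stage": 0.40, "Laterality": 0.01, "T": 0.20}


def toy_importance_fn(remaining):
    return {f: TOY_IMPORTANCE[f] for f in remaining}


print(recursive_feature_elimination(list(TOY_IMPORTANCE), toy_importance_fn, 3))
```

In practice the importances change after every refit, which is what distinguishes RFE from ranking the features once and cutting at a threshold.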
Understanding model decisions is essential for evaluating prediction consistency and spotting potential causes of model bias. SHAP aims to explain the prediction for an instance x by computing the contribution of each feature to that prediction. The SHAP explanation technique computes Shapley values, a concept from coalitional game theory.
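The coalitional-game idea behind Shapley values can be shown with an exact computation on a tiny example: each feature's value is its marginal contribution averaged over all orders in which features could be added. This is a sketch of the textbook definition, not of the approximations the SHAP library uses for tree models; the value function and feature names below are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value_fn):
    """Exact Shapley values; value_fn maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that do not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value_fn(s | {p}) - value_fn(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical model output: Stage adds 2, T adds 1, both together add 1 more."""
    v = 0.0
    if "Stage" in coalition:
        v += 2.0
    if "T" in coalition:
        v += 1.0
    if "Stage" in coalition and "T" in coalition:
        v += 1.0
    return v


print(shapley_values(["Age", "Stage", "T"], toy_value))
```

The interaction bonus is shared equally between Stage and T, Age receives zero because it never changes the output, and the values sum to the difference between the full-coalition output and the empty-coalition output.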
This study applied machine learning algorithms, including Decision Trees, Random Forest, XGBoost, and CatBoost, to predict a treatment plan using the SEER dataset, which includes sixteen features. After cleaning the data, the dataset was split into a training set and a validation set, and the models were fit on the training set containing all sixteen features. Stratified K-fold cross-validation with five folds was used to measure prediction error while maintaining the overall class distribution in each fold. AUC was used to evaluate classifier performance: the AUC for each class and the overall model AUC were compared across the different models.
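Stratified K-fold cross-validation keeps the class proportions roughly equal in every fold, which matters here because some treatment plans account for under 1% of records. The following is a minimal standard-library sketch of the fold assignment; the function name and toy labels are hypothetical, and in practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used instead.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy labels: ten records of plan "G" and five of plan "E".
labels = ["G"] * 10 + ["E"] * 5
for fold in stratified_folds(labels, k=5):
    print(sorted(labels[i] for i in fold))
```

Each fold then serves once as the held-out set while the remaining four folds are used for fitting.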
In the second phase, feature selection algorithms were applied to the classification models, and the least important features were excluded (as explained in Section 5.1.1). K-fold cross-validation (K = 5) was again used to measure model prediction performance.
Considering both phases, we aimed to build a model that can successfully predict a treatment plan, and we chose the best available features to support prediction. As can be seen, the best AUC achieved was 0.91 for the Random Forest model, which is considered a good result. However, the AUC remains low for treatment classes E, F, G, and H. These classes correspond to treatment plans in which surgery is recommended, alone or with other treatments. Having a close look at the Shapley summary plot for these classes (
In this paper, we have investigated the issue of breast cancer treatment plan prediction using four well-known classifiers, i.e., Decision Trees, Random Forest, XGBoost, and CatBoost. These classifiers were utilized with the SEER dataset, which contains sixteen features. The best overall AUC achieved was 0.91 for the Random Forest classifier.
Feature importance, especially the Shapley summary plots, provided useful information on the contribution of each selected feature to the model prediction. It gives physicians a valuable hint to pay greater attention to these critical aspects when diagnosing clinical breast tumours. With the reduced number of features, Random Forest achieved the best overall AUC (0.91) among the classifiers. Feature importance also revealed some possible reasons for the low performance of the selected models in predicting classes that include surgery as part of the treatment plan. Future work could investigate this further and look for other features that would improve the performance of these models.
The study suggests a Random Forest model that may be further developed into a practical methodology for a clinical decision support (CDS) system that proposes a breast cancer treatment plan, providing physicians with a second opinion. Such a CDS system can also help inexperienced physicians avoid suggesting the wrong treatment plan.
N.I.R.R. and K.I.M. have received a grant from the
The authors declare that they have no conflicts of interest to report regarding the present study.