Weighted gene co-expression network analysis identifies a novel immune-related gene signature and nomogram to predict the survival and immune infiltration status of breast cancer
1Department of Oncology, Yongchuan Hospital of Chongqing Medical University, Chongqing, China
2Department of Geriatrics, Yongchuan Hospital of Chongqing Medical University, Chongqing, China
*Address correspondence to: Fei He, firstname.lastname@example.org
Received: 26 June 2021; Accepted: 16 August 2021
Abstract: Breast cancer is one of the most common cancers in the world and seriously threatens the health of women worldwide. Prognostic models based on immune-related genes help to improve the prognosis prediction and clinical treatment of breast cancer patients. In the study, we used weighted gene co-expression network analysis to construct a co-expression network to screen out highly prognostic immune-related genes. Subsequently, the prognostic immune-related gene signature was successfully constructed from highly immune-related genes through COX regression and LASSO COX analysis. Survival analysis and time receiver operating characteristic curves indicate that the prognostic signature has strong predictive performance. And we developed a nomogram by combing the risk score with multiple clinical characteristics. CIBERSORT and TIMER algorithms confirmed that there are significant differences in tumor-infiltrating immune cells in different risk groups. In addition, gene set enrichment analysis shows 6 pathways that differ between high- and low-risk group. The immune-related gene signature effectively predicts the survival and immune infiltration of breast cancer patients and is expected to provide more effective immunotherapy targets for the prognosis prediction of breast cancer.
Keywords: Immune-related genes; Prognostic model; Breast cancer; WGCNA; Nomogram
Breast cancer is a biologically heterogeneous disease (Desantis et al., 2011). Breast cancer is the most common cancer affecting women worldwide, and its incidence and mortality are expected to increase significantly (Greaney et al., 2015). In developing countries, the incidence of these cancers is expected to increase proportionally and is estimated to grow 55% and the mortality rate of 58% within 20 years (Villarreal-Garza et al., 2013). Breast cancer is mainly classified into 4 major subtypes based on the presence/absence of critical molecular biomarkers estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor 2 (HER2), namely ER+/PR+/HER2− (luminal A), ER+/PR+/HER2+ (luminal B), and ER−/PR−/HER2+, and triple-negative breast cancer (TNBC) (Abubakar et al., 2019). Extensive research on breast cancer has promoted cytotoxic and targeted therapies, improving survival rates (Sledge et al., 2014). Although related treatment has improved, there are still challenges in combating tumor heterogeneity due to prognostic prediction difficulty (Eisenstein, 2015). Therefore, the paper outlines the relevant immune signatures for the prognostic diagnosis of breast cancer.
The immune microenvironment of breast cancer can be considered at the local (intratumor), regional (in the breast) and distant (metastatic) levels (Coleman et al., 2013). Abnormal gene expression in immune microenvironment cells affected the progression of the disease (Finak et al., 2008; Hu et al., 2005). The impact of the immune environment on breast cancer progression depends on the cancer phenotype and inflammatory cell subsets in the breast cancer microenvironment (Desantis et al., 2011). Breast cancer may be an immunogenic tumor. Immune escape is mainly caused by histocompatibility complex (MHC) class I, abnormal antigen signaling mechanism, immunosuppressive components (such as HLA-G), Fas and its ligands to activate apoptosis and overexpression of other immunosuppressive molecules, such as lymphocyte activation gene 3 (LAG-3), T cell immunoglobulin and mucin domain 3 (TIM-3) (Steven and Seliger, 2018). Immune cell infiltration plays a critical role in regulating breast cancer development and serves as an independent indicator for the treatment (Choi et al., 2017).
Sequencing technology is widely used with the updating of bioinformatics methods. Molecular markers have shown excellent prognostic performance with the development of genome sequencing technology. Multiple biomarker signatures are developing rapidly to enhance the prognostic prediction of tumors and demonstrated better performance than a single biomarker (Kaanane et al., 2019). For example, the 2-gene signature strongly predicted distant metastasis-free survival and breast cancer-specific survival (Alsaleem et al., 2020). The signature served as a genetic marker for predicting breast cancer patient’s prognosis and may guide clinical management (Ibrahim et al., 2017). Moreover, the 12-gene prognostic signature identifies a high risk of death of breast cancer effectively (Xie et al., 2020). The 17-gene signature effectively divided breast cancer patients into different survival groups and served as a candidate prognostic indicator clinically (Qian et al., 2020). However, the application of gene signatures is currently controversial and requires more clinical trials to verify. And immune dysfunction may be why the difference in survival observed between the patient groups defined by the gene signature (Ren et al., 2013).
Immune-related gene (IRGs) signature has the potential to improve the prognosis prediction of breast cancer patients. For example, the 10-IRG in breast cancer immune infiltration and tumor-immune interaction has been established based on the TCGA database (Ren et al., 2013). The IRGs contribute to longer overall survival (OS) by changing the abundance and checkpoint expression of tumor-infiltrating lymphocytes (TIL) in breast cancer (Li et al., 2020). The 130-IRGs helps to stratify luminal breast tumors patients further to benefit from checkpoint immunotherapy (Zhu et al., 2019). IRGs-based gene signatures served as prognostic biomarkers and potential therapeutic targets for breast cancer (Gordon and Gadi, 2020). Therefore, it is essential to establish an IRG based gene signature to predict the survival and immune infiltration status.
The paper screened out the common immune genes in TCGA, GEO, and Immport databases. And we use weighted gene co-expression network analysis (WGCNA) technology to construct a co-expression network of these overlapping genes. We applied univariate COX regression and LASSO COX regression to build gene signatures based on risk scores to improve breast cancer survival prediction. Next, the nomogram is constructed to improve the prediction of patient survival. The TCGA and GEO data sets were used to evaluate and verify gene signatures and to predict the survival of patients. The deconvolution method based on CIBERSORT obtained the proportion of 22 immune cells in patient samples. And we used TIMER algorithm to observe the infiltration changes of signature genes and 7 immune cells. Finally, we used the gene set enrichment analysis (GSEA) to analyze IRGs and obtained the different pathways between high- and low-risk patients.
Materials and Methods
We obtained transcriptome sequencing and clinical data of breast cancer patients from TCGA (https://cancergenome.nih.gov) and GEO (https://www.ncbi.nlm.nih.gov/geo/). The micro-array data set GSE42568 is queried from the GEO database for subsequent external study. The cohort contains mRNA expression data of 1097 cases of tumor tissues and 120 cases of adjacent tissues. Moreover, breast cancer patients’ clinical data included survival time, survival status, age, gender, tumor grade, pathological stage, etc. In addition, 2483 IRGs are obtained from the ImmPort database (https://immport.niaid.nih.gov). The intersection IRGs were determined between TCGA, GEO, and ImmPort databases through Venn diagrams.
Weighted gene co-expression network analysis
We used WGCNA implemented in the R platform program package to construct a co-expression network of IRGs. WGCNA is an efficient method for analyzing association patterns between genes. First, we performed a cluster analysis on the expression profile to calculate the Pearson correlation coefficient of the genes and established a correlation matrix:
Next, the adjacency matrix was built:
Finally, we constructed the topological adjacency matrix:
Finally, IRGs in the most significant modules are selected for subsequent analysis.
Development of the prognostic model
The IRGs in the significant WGCNA modules are analyzed further to obtain gene signatures highly correlated with the patient’s prognosis. We excluded samples with zero survival time or missing visit information in the clinical data. First, univariate COX regression was performed to initially identify IRGs related to the patient’s survival (P < 0.05). Then we used the LASSO COX model to construct a penalty function to get a more refined prognosis model. LASSO COX regularization can be expressed as:
We constructed a risk scoring formula by linear fitting with the obtained IRGs, which is expressed as:
where Exp is the expression value of IRGs in the sample in the model, and P is the regression analysis coefficient of each IRG after multivariate LASSO COX regression analysis.
Validation of the prognostic model
We take the median of the risk score (RS) value as the cut-off value and divide the samples into high and low-risk groups. And we use the tROC curve to evaluate the predictive ability of 1-year, 3-year and 5-year survival period, and further use Kaplan–Meier method to draw survival curves of high and low-risk groups (P < 0.05). The R package “pheatmap” was used to draw a heat map, and the R package “timeROC” was used to draw a time-dependent ROC curve.
Establishment of a prognostic nomogram
The patient’s risk score and various clinical information were used as analysis variables, univariate and multivariate COX regression were performed, and the model C index was calculated to calculate the patient’s prognosis (P < 0.05). The R package “rms” was used to construct a nomogram and a calibration chart to score the patients’ 1-year, 3-year, and 5-year survival prognosis and fit and calibrate it with the real survival situation.
Tumor-infiltrating immune cell analysis
The expression of 22 immune cells in patients was calculated through CIBERSORT. The immune infiltration differences between the high- and low-risk groups were analyzed by comparing the expression differences of immune cells between both groups. The correlation between prognostic IRGs and various immune cell infiltration is analyzed and visualized through the TIMER database.
Gene set enrichment analysis
We used “high” and “low” as phenotypic labels based on the expression value of RS predicted by patients. The R package “limma” was used to calculate the logFC value of the high and low-risk group genes. The R package “clusterprofiler” was used to perform GSEA analysis on the whole genome and run 1,000 genome permutations to obtain a standardized enrichment score (NES). P < 0.05 was used as the cut-off value for determining a significantly enriched gene set. The pathways enriched by differentially expressed genes in samples of high- and low-risk patients were obtained and visualized.
All R language packages for statistical analysis are implemented on the “R v4.0.2” platform. P < 0.05 is statistically significant in the study. TCGA data source (https://cancergenome.nih.gov/), GEO data source (https://www.ncbi.nlm.nih.gov/geo), ImmPort data source (https://immport.niaid.nih.gov); CIBERSORT website (https://CIBERSORT.stanford.edu/); TIMER website (https://cistrome.shinyapps.io/timer/).
The identification of prognostic-related IRGs modules
1279 overlapping IRGs expression data were screened for WGCNA analysis (Fig. 1A). By clustering the samples (Fig. 1B), we constructed a scale-free network in WGCNA with β = 3 (Fig. 1C). Six IRGs modules were identified (Fig. 1D). The module correlation analysis showed that blue and brown had a strong correlation (Fig. 2A). Fig. 2B is a network heat map based on hierarchical clustering of IRGs and sample data. Correlation analysis of 6 modules and multiple clinical phenotypic parameters found that the blue module strongly correlates with status (cor = 0.3, P = 0.002) (Fig. 2C). The IRGs significance average of all IRGs is used to characterize the correlation between the module and the cancer phenotype. The relationship between the blue module and each phenotype is the most significant (P = 4.7e−24) (Fig. 2D). The blue module (N = 160) was identified as a collection of IRGs related to the prognosis of breast cancer.
Prognosis model performance in both training and validation cohort
69 IRGs significantly related to patient survival in the GEO training set were identified (P < 0.05) through univariate COX regression. The results were further optimized using LASSO COX analysis to obtain a 5-IRGs signature (IL6ST, PLCG1, PLXNA1, TBK1, and VAV3) that significantly affected the survival of patients (Table 1; Figs. 3A and 3B). Thus, the signature-based prognosis model was established. The model formula is RS = (−0.096) × IL6ST+ 0.3352 × PLCG1+ 0.3200 × PLXNA1+ (−0.1694) × TBK1+ −0.0418 × VAV3.
The 101 patients in the training set were divided into low- and high-risk group by calculating the patients’ risk score (50 vs. 51, Figs. 3C and 3D). Correspondingly, the 1,037 patients fell into low- and high-risk patients in the validation set (519 vs. 518). Compared with low-risk patients, the expression of 5 IRGs signatures in high-risk patients in the training and validation cohort showed a significant up-regulation (Figs. 3E and 4). The results based on the GEO training set clarified that low-risk significantly improved patient survival compared with high-risk one (P < 0.0001; Fig. 5A). Time receiver operating characteristic (tROC) analysis shows that the risk model has outstanding predictive capability, and its 1-, 3-, and 5-year AUCs are 0.776, 0.739, and 0.765 (Fig. 5B). Moreover, the K–M curve analysis based on the TCGA validation set (P = 0.0016; Fig. 5C) and the time-dependent ROC analysis results confirmed the above results. In the TCGA validation set, the risk model predicts the AUCs of the 1-year, 3-year, and 5-year patient survival rates to be 0.657, 0.641, and 0.663, respectively (Fig. 5D).
Incorporation of critical clinical features into the nomogram
Next, we performed univariate and multivariate COX regression on the patients’ risk score and various clinical features. The results show that risk score, grade, and stage in the training and validation set significantly correlate with the patients’ overall survival (Table 2). Subsequently, risk score and grade were incorporated into the nomogram. The training group-based nomogram (C-index = 0.82; Fig. 6A) provided a better survival prediction than that using risk score (C-index = 0.76) and grade (C-index = 0.67). The overall survival of 1-, 3- and 5-year survival rate predicted by the nomogram closely matches the best performance prediction (Fig. 6B). Consistent with the above results, in the validation set, the prognostic factor risk score (C-index = 0.64) and stage (C-index = 0.63) were compared with the risk score (C-index = 0.63), and the nomogram (C-index = 0.65) established by combining the two prognostic factors; Fig. 6C) has a better ability to predict prognosis. Simultaneously, the validation group-based calibration curve verified the prognostic prediction ability of the nomogram (Fig. 6D).
Immune cell infiltration of risk groups
The deconvolution method-based CIBERSORT algorithm revealed the proportion of 22 immune cells of patients in different risk groups. The results of immune cells proportion in the training and the validation set show that macrophages M0, T cells gamma delta, T cells CD8, T cells CD4 memory resting, and macrophages M2 are the five most abundant immune cell types in the patients’ immune system (Figs. 7A and 7B). Difference analysis shows that there are differences in the infiltration of the 5 immune cells in high-risk and low-risk patients in the training set (Fig. 7E). The 8 types of immune cells have different infiltration differences in various risk groups in the validation set (Fig. 7F).
TIMER obtained the immune infiltration changes mediated by prognostic markers of IRGs (Fig. 8). The results showed that in the prognostic signature of 5-IRGs (IL6ST, PLCG1, PLXNA1, TBK1, and VAV3), the high expression of each IRG significantly increased the infiltration level of 6 immune cells. And the low expression of PLCG1 and PLXNA1 significantly promotes the purity of the tumor, and the increased expression of TBK1 and VAV3 can significantly promote the tumor’s purity.
Analysis of GSEA
Pathway Differences GSEA shows 6 pathways that differ between high- and low-risk patients (Fig. 9). The 6 pathways are cytokine signaling in immune system, developmental biology, innate immune system, proteasome, adaptive immune system, cell cycle and GPCR ligand binding.
Breast cancer is a disease that seriously threatens women’s health worldwide. Prognostic prediction is the most effective way to reduce breast cancer mortality (Desantis et al., 2011). However, it is difficult to achieve this goal due to the lack of specific biomarkers. The study used WGCNA to construct a co-expression network to screen out highly immune-related prognostic genes. Subsequently, the 5-IRG genes signatures were successfully constructed from highly IRGs through COX regression and LASSO COX analysis. The survival and tROC curves show that the prognostic signature has a strong predictive capability. Then, a nomogram composed of risk score and grade markers was constructed by combining with the risk score of the prognostic signature and multiple clinical characteristics. CIBERSORT and TIMER algorithm-based analysis confirmed significant differences in tumor-infiltrating immune cells between high- and low-risk groups. Besides, GSEA shows 6 pathways that differ between high- and low-risk patients.
We successfully constructed a prognostic signature based on IRGs through WGCNA and LASSO COX algorithms for the prognostic stratification of breast cancer. The LASSO COX algorithm solved the over-fitting problem that often occurs in the construction of prognostic models in the past (Meng et al., 2020). The risk score based on the prognostic signature showed that the overall survival of patients in the high-risk group was significantly shorter than that of the patients in the low-risk group. Moreover, the nomogram shows good performance in predicting the overall survival rate. The tROC analysis shows that the AUCs of the 1-, 3-, 5-year OS is 0.776, 0.739, and 0.766, which is comparable to the previous models (Piñero-Madrona et al., 2020; Wu et al., 2021), indicating our model has good prognostic performance.
The study uses various bioinformatics analyses to determine potential immune targets related to the prognosis of breast cancer. The immunotherapy and prognosis of breast cancer previously discussed by authors did not screen prognostic genes (Baxevanis et al., 2021; Choi et al., 2017). However, we conducted a WGCNA analysis to identify the modular genes most related to breast cancer prognosis. In addition, previous studies did not report the immune infiltration characteristics of the prognostic biomarkers of breast cancer (Chen et al., 2015; Wang et al., 2020). In the study, the TIMER and CIBERSORT algorithms calculated the change of immune cell infiltration, which promotes prognosis prediction in breast cancer. And we performed univariate COX regression and LASSO COX to determine 5-IRGs signature. These results provided a prospect for exploring breast cancer-related treatment methods and prognostic prediction.
The 5-IRGs signatures demonstrated good predictive capability for patients’ survival. IL6ST (gp130), an essential signaling subunit of interleukin-6 cytokine family receptors, plays a vital role in promoting the growth and metastasis of breast cancer (Ibrahim et al., 2017; Ren et al., 2013). Besides, IL6ST serves as a potential therapeutic target for breast cancer and participates in cancer development (Selander et al., 2004; Tian et al., 2019). PLCG1 is highly correlated with tumorigenesis and the development of breast cancer. PLCG1 regulated EGF receptor-driven invasion (Li et al., 2009), cell proliferation, and apoptosis of BC cells (Markova et al., 2010). FGFR1 promotes the occurrence of small cell carcinoma, and PLCG1 is the effector of FGFR1 signal transduction (Kim et al., 2020). PLXNA1 is a member of the Plexin A family and is involved in regulating malignant cells and nervous tissue in cancer specimens (Müller et al., 2007). PLXNA1 may be a biomarker for identifying breast cancer cases vulnerable to immunotherapy (Naik et al., 2017). The increase of PLXNA1 expression promotes the growth of prostate tumors and impacts the tumor’s biochemical recurrence, metastasis, and poor survival rate (Ren et al., 2018). TBK1 (Tumor Necrosis Factor TNF) is a ubiquitously expressed serine-threonine kinase, belonging to “non-standard” IκB kinases (IKKs) (Fitzgerald et al., 2003). Cytokines regulated by TBK1 affect tumor microenvironment and angiogenesis (Czabanka et al., 2008). TBK1’s autophagy regulation and anti-tumor immunity play an important role in melanoma and non-small cell lung cancer (Cooper et al., 2017; Durand et al., 2018; Eskiocak et al., 2017). VAV3 oncogene is involved in various cell signal transduction processes, including cytoskeletal organization, calcium influx, gene transcription, cell transformation, cell proliferation, and apoptosis (Bustelo, 1996). It is the upstream mediator of Ras-related C3 botulinum toxin substrate 1, enhancing the transcriptional activity of estrogen receptor alpha (ER-α) in breast cancer cells (Rosenblatt et al., 2011). Also, VAV3 is epigenetically regulated in breast cancer development (Loss et al., 2010). And studies have shown VAV3 was an independent prognostic risk factor in gastric cancer (Tan et al., 2017). Chen et al. (2015) found that the up-regulated VAV3 leads to a poor prognosis of BC patients, consistent with our results.
We used the deconvolution algorithm to infer the proportion of 22 immune cell subgroups from the tumor transcriptome. And we found that macrophages M0, macrophages M2, T gamma delta cells, T CD4 memory resting cells, and T CD8 cell infiltration are highly correlated. The interaction between tumor and immune cells affects tumor development, and also serves as a prognostic feature of immunotherapy. The study shows that M0 macrophages, M2 macrophages, Tγδ cells, T CD4 memory resting cells, and T CD8 cell infiltration are highly correlated. They are also the five most abundant immune cell types in the patient’s immune system. Macrophage-related cells inhibit T cell proliferation through the interaction of PD-1/PD-L1 cells to eliminate CD8+ T cells, one of the critical suppressors of tumor cells. Suppressing the immune activity of CD8+ T cells causes the tumor cells in TME to escape (Fang et al., 2021). As the main component of the immune microenvironment, macrophages are a very heterogeneous cell population. Macrophages M0 are mainly derived from the extravasation of monocytes in the blood circulatory system. Noy and Pollard (2014) found that the M2 macrophage with a prominent immunomodulatory function and demonstrated a poor prognosis. A higher proportion of M2 macrophages in breast cancer is associated with insufficient chemotherapy response, suggesting a potential drug resistance (Denardo et al., 2011). In the previous literature, T cells CD8 is mainly involved in adaptive immune defense, the most common is a prognostic association, and kill cancer cells through multiple mechanisms. There is also this discussion about FOXP3+ regulatory T cells CD4 memory resting in some malignant tumors. The results demonstrated the potential of tumor-related T cells and macrophages as breast cancer biomarkers and may provide a guide for future immunotherapy.
Immune-related gene features involve six signaling pathways, namely the immunomodulatory interaction between lymphocytes and non-lymphocytes, interleukin-4 and interleukin-13 signaling pathways, protein phosphorylation after translation; hematopoietic cell lineage, tumor necrosis factor signaling pathway, the interaction of viral proteins with cytokines and cytokine receptors. The tumor necrosis factor signaling pathway (TNF-α) is one of the essential pro-inflammatory cytokines found in the TME of breast cancer patients and promotes tumor progression and metastasis. Thus, TNF-α affects the development of breast cancer tumors, including primary tumor development, metastasis, and disease recurrence (Cruceriu et al., 2020).
Breast cancer is heterogeneous, and the gene signature of the study is immune-related genes. In the study, different immune infiltrating cells were calculated based on the risk scores of 5 immune gene signatures. And the high-risk group was associated with a poor prognosis. Therefore, identifying relevant immune genes and differential immune cells is beneficial to breast cancer immunotherapy. Breast tumor cells with defective DNA damage response activate the STING/TBK1/IF3 pathway, recruiting pro-inflammatory cytokines and T cells. Therefore, a combination of drugs and immunotherapy is more effective in treating breast cancer (Brown et al., 2021). PLCG1 and VAV3 are key molecules of TCR signal transduction. TCR initiates a complex cascade of signal events, leading to cytoskeletal reorganization and transcription upregulation required for T cell activation (Kogure and Kataoka, 2017). VAV3 regulates a variety of cell signaling pathways by mediating the activities of Rho family members, including the signaling pathways of T cell and B cell receptors in vertebrates (Shen et al., 2017). Therefore, the gene signature provided may guide breast cancer immunotherapy.
We use an integrative bioinformatics method to investigate breast cancer IRGs. The prognostic 5-IRGs signatures effectively improve the prognostic prediction and immune infiltration of breast cancer patients and are expected to provide more effective immunotherapy targets for the prognosis assessment of breast cancer.
Author Contribution: The authors confirm contribution to the paper as follows: Study conception and design: Fei He; data collection and analysis: Ke Pang; draft manuscript preparation: Junxia Liu. All authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials: All data appear in the paper were obtained from GEO, TCGA and Immport database.
Funding Statement: Science and Technology Innovation Project of Social People’s Livelihood, Yongchuan District, Chongqing (Ycstc, 2017cb5502).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|