Developing successful software with no defects is one of the main goals of software projects. Predicting software defects plays a vital role in delivering the software quality a project is expected to achieve. Machine learning, and deep learning in particular, have been advocated for predicting software defects; however, both suffer from inadequate accuracy, overfitting, and complicated structure. In this paper, we aim to address these issues. We propose a novel structure of a 1-Dimensional Convolutional Neural Network (1D-CNN), a deep learning architecture that extracts useful knowledge, identifies and models the knowledge in the data sequence, reduces overfitting, and finally predicts whether units of code are defect-prone. We design large-scale empirical studies to assess the proposed model's effectiveness by comparing it with four established traditional machine learning baseline models and four state-of-the-art baselines in software defect prediction on the NASA datasets. The experimental results demonstrate that, in terms of f-measure, an optimal and modest 1D-CNN with a dropout layer outperforms the baseline and state-of-the-art models by 66.79% and 23.88%, respectively, minimizing overfitting and improving prediction performance for software defects. According to the results, 1D-CNN appears to be successful in predicting software defects and may be applied to practical problems in software engineering. This, in turn, could lead to saving software development resources and producing more reliable software.
In recent years, software-run applications have become crucial in day-to-day human life. When COVID-19 spread across the world in 2020, our dependency on software grew even further because of the lockdowns. Any slight disturbance or defect could cause working software to fail [
SDP offers several benefits: (1) It helps discover problems or defects early based on previous projects, which generally share similarities with the new project. Predicting these problems in turn helps to increase the reliability of the new project or software. (2) It helps identify the independent variables used in a model, which allows software developers to manage software defects appropriately. (3) It helps manage the testing plan and prioritize the faulty classes, so that software testers can use the testing plan efficiently. Overall, SDP ensures that resources are used effectively in software development, resulting in lower costs and shorter development times. As a result, it increases software quality [
Realizing the importance of SDP, researchers have proposed a number of solutions to predict defects in the software. One of the solutions is using a statistical model based on the regression or function-approximation problem analysis [
Several machine learning algorithms [
Although CNN algorithms may be useful for SDP, they seem to be very complex and to have an insufficient accuracy level. This issue might stem from their 2D structure, which was originally designed to work only with 2D data such as images and videos. Recently, [
Yadav [
Motivated by the success of the 1D-CNN algorithms applied in the aforementioned studies, we propose a novel structure of 1D-CNN for predicting software defects, with the aim of increasing the performance of SDP on nine NASA datasets. In addition, another five CNN models with different structures were built to investigate the impact of different architectures on the performance of CNN in SDP. This paper makes the following contributions:
We propose a novel structure of 1D-CNN, a deep learning architecture that extracts useful knowledge, identifies and models the knowledge in the data sequence, reduces overfitting, and finally predicts whether the code units are defect-prone.
We investigate the impact of different 1D-CNN architectures on SDP performance by developing five CNN models whose structures differ in the dropout layer, kernel size, filter size, the inclusion of an additional convolutional layer, and the type of convolutional and max-pooling layers. The empirical results show that adding a dropout layer to a 1D-CNN classifier can reduce overfitting and enhance the model's performance in predicting software defects. On the other hand, increasing the kernel size of the proposed 1D-CNN model, reducing the filter size, adding an additional convolutional layer, and using 2D convolutional and max-pooling layers do not have a great impact on the detection of software defects.
We design large-scale empirical studies to assess the effectiveness of the proposed model by comparing it with four established traditional machine learning baseline models and four state-of-the-art baselines in SDP based on the NASA datasets [
Finally, we present the optimal 1D-CNN model by tuning three hyperparameters (the number of epochs, learning rate, and dropout rate) of the proposed 1D-CNN model.
The paper is organized as follows. Section 2 reviews related work. Section 3 describes the materials and methods of the proposed model. Section 4 presents the results and discusses the stated research questions. Section 5 discusses threats to construct, internal, and external validity. Section 6 concludes the paper, summarizes the study, and suggests possible future work.
Deep learning has been utilized in a variety of fields since 2012, including software engineering. Deep learning was first used in software defect prediction in 2015 [
Yang et al. [
Wang et al. [
Currently, Zhu et al. [
However, these methods used deep learning to extract novel characteristics and other machine learning techniques to classify software as defective or not defective. As a result, these methods continue to suffer from a complicated structure and inadequate accuracy in predicting software defects, which may be improved further. To address these problems, we propose a new deep learning model that simplifies the structure, reduces overfitting and improves accuracy. To conduct this study, we utilized a 1D-CNN method for predicting software defects. The next section details the whole procedure.
The dataset used in this study was collected by the NASA Metrics Data Program. It can be retrieved from [
Dataset | Instances | Features |
---|---|---|
cm1 | 498 | 21 |
kc1 | 2109 | 21 |
kc2 | 522 | 21 |
pc1 | 549 | 21 |
pc2 | 5589 | 36 |
pc3 | 1563 | 37 |
pc4 | 1458 | 37 |
mc1 | 9466 | 38 |
mc2 | 161 | 39 |
Type of metrics | Software metrics | cm1 | kc1 | kc2 | pc1 | pc2 | pc3 | pc4 | mc1 | mc2 |
---|---|---|---|---|---|---|---|---|---|---|
LOC metrics | LOC code and comments | | | | | | | | | |
 | LOC comments | | | | | | | | | |
 | LOC executable | | | | | | | | | |
 | LOC blank | | | | | | | | | |
 | LOC total | | | | | | | | | |
 | Percent comments | | | | | | | | | |
Halstead metrics | Content | | | | | | | | | |
 | Difficulty | | | | | | | | | |
 | Effort | | | | | | | | | |
 | Length | | | | | | | | | |
 | Level | | | | | | | | | |
 | Prog time | | | | | | | | | |
 | Volume | | | | | | | | | |
 | Num operands | | | | | | | | | |
 | Num operators | | | | | | | | | |
 | Num unique operands | | | | | | | | | |
 | Num unique operators | | | | | | | | | |
 | Total operators + operands | | | | | | | | | |
McCabe metrics | Cyclomatic complexity | | | | | | | | | |
 | Cyclomatic density | | | | | | | | | |
 | Design complexity | | | | | | | | | |
 | Essential complexity | | | | | | | | | |
 | Decision density | | | | | | | | | |
 | Design density | | | | | | | | | |
 | Essential density | | | | | | | | | |
 | Global data complexity | | | | | | | | | |
 | Global data density | | | | | | | | | |
 | Line count | | | | | | | | | |
Count | Branch count | | | | | | | | | |
 | Condition count | | | | | | | | | |
 | Decision count | | | | | | | | | |
 | Edge count | | | | | | | | | |
 | Parameter count | | | | | | | | | |
 | Modified condition count | | | | | | | | | |
 | Multiple condition count | | | | | | | | | |
 | Node count | | | | | | | | | |
 | Call pairs | | | | | | | | | |
 | Maintenance severity | | | | | | | | | |
 | Normalized cyclomatic complexity | | | | | | | | | |
Total | | 21 | 21 | 21 | 21 | 36 | 37 | 37 | 38 | 39 |
Data preprocessing is a crucial step to ensure that the data are of good quality. Normalization was applied to avoid very large differences in the scale of the feature values. In this analysis, the software metrics were scaled to the interval [0, 1] using the MinMaxScaler function from the scikit-learn library [
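The scaling step can be illustrated with a minimal standard-library sketch of per-feature min-max normalization, which is the computation a min-max scaler performs with a target range of [0, 1]. The function name `min_max_scale` and the metric values below are hypothetical and are not taken from the NASA datasets.

```python
from typing import List


def min_max_scale(rows: List[List[float]]) -> List[List[float]]:
    """Scale each column (software metric) of `rows` to the interval [0, 1]."""
    n_cols = len(rows[0])
    mins = [min(row[j] for row in rows) for j in range(n_cols)]
    maxs = [max(row[j] for row in rows) for j in range(n_cols)]
    scaled = []
    for row in rows:
        scaled_row = []
        for j, value in enumerate(row):
            span = maxs[j] - mins[j]
            # A constant column has zero range; map it to 0.0 instead of dividing by zero.
            scaled_row.append((value - mins[j]) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Hypothetical metric values: [lines of code, cyclomatic complexity, Halstead effort]
modules = [
    [10.0, 1.0, 250.0],
    [200.0, 14.0, 98000.0],
    [55.0, 4.0, 5200.0],
]
scaled_modules = min_max_scale(modules)
for row in scaled_modules:
    print([round(v, 3) for v in row])
```

Without this step, a metric such as Halstead effort, whose raw values can be several orders of magnitude larger than a cyclomatic complexity score, would dominate the inputs seen by the network.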
In this phase, a structure for a 1D-CNN model was proposed as a baseline. Then, to investigate the impact of different structures on the performance of the proposed model, another five CNN models with different structures in terms of dropout layer exclusion, kernel size, filter size, the inclusion of an additional convolutional layer, and type of convolutional and max pooling layers were constructed. After that, four machine learning models were developed to measure the efficiency of our proposed model compared to the established machine learning models.
A deep learning architecture called 1D-CNN was proposed to extract useful knowledge, identify and model the knowledge in the data sequence, and finally predict whether a unit of code is defect-prone. The 1D-CNN consists of two main types of layers: convolutional and pooling layers.
Convolutional and pooling layers [
The convolutional layers are normally followed by a nonlinear activation function and then a pooling layer. A pooling layer is a subsampling method that extracts specific values from the convolved features and produces a matrix with a reduced dimension. Like the convolutional layer, the pooling layer employs a small sliding window that takes the values of each patch of the convolved features as input and outputs one new value, determined by the operation that the pooling layer is defined to perform. Max pooling and average pooling, for example, compute the maximum and the average of each patch's values, respectively. Consequently, the pooling layer generates new matrices that can be thought of as summarized versions of the convolved features generated by the convolutional layer. Because slight changes in the input do not affect the pooled output values, the pooling procedure can help the system to be more robust.
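The convolution, activation, and pooling sequence described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the layers used in the experiments: the helper names `conv1d` and `pool1d`, the input values, and the kernel weights are all hypothetical, and the convolution uses stride 1 with no padding.

```python
from typing import Callable, List


def conv1d(signal: List[float], kernel: List[float], bias: float = 0.0) -> List[float]:
    """Slide `kernel` over `signal` (no padding, stride 1) and apply ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        patch = signal[start:start + k]
        activation = sum(w * x for w, x in zip(kernel, patch)) + bias
        out.append(max(0.0, activation))  # ReLU nonlinearity
    return out


def pool1d(features: List[float], size: int,
           op: Callable[[List[float]], float]) -> List[float]:
    """Summarize non-overlapping windows of `size` values with `op`."""
    return [op(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


signal = [0.1, 0.9, 0.2, 0.8, 0.4, 0.6]  # hypothetical scaled metric values
kernel = [1.0, -1.0]                     # hypothetical weights, not learned ones

convolved = conv1d(signal, kernel)               # 5 values from 6 inputs
max_pooled = pool1d(convolved, size=2, op=max)   # maximum of each window
avg_pooled = pool1d(convolved, size=2, op=mean)  # average of each window
print(convolved, max_pooled, avg_pooled)
```

Note that with a window of size 1, `pool1d` returns its input unchanged, so pooling only summarizes the convolved features when the window covers more than one value.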
The structure of our proposed 1D-CNN is depicted in
The proposed model was named 1D-CNN1. We also built another five CNN models with different structures (
Parameter | 1D-CNN1 | 1D-CNN2 | 1D-CNN3 | 1D-CNN4 | 1D-CNN5 | 2D-CNN |
---|---|---|---|---|---|---|
#Convolutional layer | 2 | 2 | 2 | 2 | 3 | 2 |
#Pooling layer | 1 | 1 | 1 | 1 | 1 | 1 |
Size of max-pooling | 1 | 1 | 1 | 1 | 1 | (1, 1) |
#Dense layers | 2 | 2 | 2 | 2 | 3 | 2 |
Activation function | ReLU + sigmoid (last dense layer) | ReLU + sigmoid (last dense layer) | ReLU + sigmoid (last dense layer) | ReLU + sigmoid (last dense layer) | ReLU + sigmoid (last dense layer) | ReLU + sigmoid (last dense layer) |
Filter | 64, 64 | 64, 64 | 64, 64 | 32, 15 | 64, 64, 64 | 64, 64 |
Kernel | 1, 1 | 1, 1 | 3, 3 | 1, 1 | 1, 1 | 1, 1 |
Training and Optimizer | Adam + binary | Adam + binary | Adam + binary | Adam + binary | Adam + binary | Adam + binary |
#Epoch | 32 | 32 | 32 | 32 | 32 | 32 |
Learning rate | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
Dropout rate | 0.3 | Without dropout | 0.3 | 0.3 | 0.3 | 0.3 |
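To make the structural differences between the 1D variants concrete, the following sketch traces the output length and counts the trainable parameters of the convolutional stack only. It rests on several assumptions that the table does not state: stride 1 and no padding in every convolution, a single input channel, and the max-pooling layer placed after the last convolution. The function name `conv_stack_summary` is hypothetical, and the dense layers are left out because their sizes are not given.

```python
from typing import List, Tuple


def conv_stack_summary(n_features: int, filters: List[int], kernels: List[int],
                       pool_size: int = 1) -> Tuple[int, int, int]:
    """Return (output length, output channels, trainable conv parameters).

    Assumes stride 1 and no padding for every 1D convolution and a single
    max-pooling layer after the last convolution (both are assumptions).
    """
    length, channels, params = n_features, 1, 0
    for n_filters, kernel in zip(filters, kernels):
        params += kernel * channels * n_filters + n_filters  # weights + biases
        length = length - kernel + 1
        channels = n_filters
    length //= pool_size  # pooling with size 1 leaves the length unchanged
    return length, channels, params


# Filter and kernel settings as listed in the table; 21 input features as in cm1.
variants = {
    "1D-CNN1": ([64, 64], [1, 1]),
    "1D-CNN3": ([64, 64], [3, 3]),
    "1D-CNN4": ([32, 15], [1, 1]),
    "1D-CNN5": ([64, 64, 64], [1, 1, 1]),
}
summary = {name: conv_stack_summary(21, f, k) for name, (f, k) in variants.items()}
for name, (length, channels, params) in summary.items():
    print(f"{name}: length={length}, channels={channels}, conv params={params}")
```

Under these assumptions the convolutional parameter count is smallest for 1D-CNN4 and largest for 1D-CNN3. 1D-CNN2 is omitted because it differs from 1D-CNN1 only in the dropout layer, which adds no trainable parameters.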
Four popular machine learning models were also constructed to evaluate how well our proposed model performs compared to these established models. A brief description of each machine learning technique is given below.
In this phase, the proposed model and the five other CNN models were implemented according to the structures and parameters described in subsection 3.2.2, using Keras from the TensorFlow library. The baseline models were implemented with the scikit-learn library, with their parameters left at the default values. To account for randomness, the experiment was repeated 5 times for each dataset. The performance of each model on each dataset was then measured in terms of accuracy, f-measure, training time, and testing time. The average of each performance metric was computed and compared to determine which model performs best in detecting software defects. According to the confusion matrix, which is presented in
Actual | Predicted: Nondefect | Predicted: Defect |
---|---|---|
Nondefect | TN | FP |
Defect | FN | TP |
TP = True positive: If a defect subject is correctly classified as a defect
TN = True negative: If a nondefect subject is correctly classified as a nondefect
FP = False positive: If a nondefect subject is misclassified as a defect
FN = False negative: If a defect subject is misclassified as a nondefect
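Given these four counts, accuracy and f-measure follow from their standard definitions, which the evaluation is assumed to use. The sketch below computes both; the counts are hypothetical and chosen to resemble an imbalanced test set in which defective modules are rare.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all modules that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of predicted defects that are real
    recall = tp / (tp + fn)     # share of real defects that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 10 defective and 190 nondefective modules in the test set.
tp, tn, fp, fn = 6, 180, 10, 4
acc = accuracy(tp, tn, fp, fn)
f1 = f_measure(tp, fp, fn)
print(f"accuracy={acc:.3f}, f-measure={f1:.3f}")
```

Because true negatives do not enter the f-measure, a classifier can score a high accuracy on an imbalanced dataset while its f-measure stays low, which is why the two metrics are reported side by side.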
In this phase, the accuracy of the proposed model was improved by tuning three hyperparameters: the number of epochs, the learning rate, and the dropout rate. The number of epochs was tuned over the range 32–500, the learning rate over 0.001–0.1, and the dropout rate over 0.1–0.5. Each tuning run consisted of 50 trials. The hyperparameter tuning was conducted using the Optuna framework for Python and was run 9 times, once for each of the 9 datasets. The performance of the proposed model using the optimal parameters on each dataset was then measured in terms of accuracy, f-measure, training time, and testing time. The average of each performance metric was computed and then compared with the performance of the proposed model before tuning to assess the impact of the tuning.
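The shape of such a tuning loop can be sketched with the standard library alone. Plain random search is swapped in here for Optuna's sampler so that the example has no external dependencies, and `placeholder_objective` is a hypothetical stand-in for training the 1D-CNN and returning a validation score; neither reflects the actual tuning code. The search ranges and the 50 trials follow the description above.

```python
import math
import random
from typing import Callable, Dict, Tuple


def sample_params(rng: random.Random) -> Dict[str, float]:
    """Draw one candidate from the search space described in the text."""
    return {
        "epochs": rng.randint(32, 500),
        # Log-uniform sampling spreads trials evenly across orders of magnitude.
        "learning_rate": 10 ** rng.uniform(math.log10(0.001), math.log10(0.1)),
        "dropout_rate": rng.uniform(0.1, 0.5),
    }


def random_search(objective: Callable[[Dict[str, float]], float],
                  n_trials: int = 50, seed: int = 0) -> Tuple[Dict[str, float], float]:
    """Return the best parameters and score found over `n_trials` random draws."""
    rng = random.Random(seed)
    best_params, best_score = {}, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def placeholder_objective(params: Dict[str, float]) -> float:
    """Hypothetical score; a real objective would train and validate the model."""
    return (-abs(math.log10(params["learning_rate"]) + 2.5)
            - abs(params["dropout_rate"] - 0.3))


best_params, best_score = random_search(placeholder_objective, n_trials=50)
print(best_params, best_score)
```

In a real run, one such search would be repeated per dataset, and a framework such as Optuna would additionally use the scores of earlier trials to choose later candidates instead of drawing them independently.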
To guide us in evaluating the proposed model, the following research questions were constructed:
The answers to RQ1–RQ4 are discussed separately in the following subsections.
Dataset | 1D-CNN1 | SVM | RF | DT | NB |
---|---|---|---|---|---|
cm1 | 99.43 | 59.71 | 64.00 | 60.00 | 62.29 |
kc1 | 99.89 | 74.74 | 68.57 | 63.98 | 64.92 |
kc2 | 100.00 | 65.89 | 74.95 | 74.42 | 66.58 |
mc1 | 99.96 | 88.30 | 90.30 | 80.35 | 83.47 |
mc2 | 97.54 | 52.08 | 58.95 | 60.36 | 60.54 |
pc1 | 100.00 | 56.87 | 72.86 | 68.96 | 52.90 |
pc2 | 100.00 | 90.22 | 62.49 | 63.68 | 55.93 |
pc3 | 99.78 | 56.99 | 74.77 | 58.85 | 46.93 |
pc4 | 99.76 | 58.54 | 85.16 | 67.00 | 62.02 |
Average | 99.60 | 67.04 | 72.45 | 66.40 | 61.73 |
Dataset | 1D-CNN1 | SVM | RF | DT | NB |
---|---|---|---|---|---|
cm1 | 97.56 | 26.18 | 33.33 | 29.63 | 32.26 |
kc1 | 99.68 | 44.81 | 42.93 | 38.93 | 41.35 |
kc2 | 100.00 | 46.78 | 59.81 | 60.00 | 49.32 |
mc1 | 97.18 | 7.18 | 13.77 | 28.04 | 10.43 |
mc2 | 95.60 | 42.50 | 41.03 | 40.00 | 38.46 |
pc1 | 100.00 | 24.39 | 48.42 | 46.15 | 27.84 |
pc2 | 100.00 | 3.11 | 4.49 | 11.11 | 3.23 |
pc3 | 99.00 | 22.08 | 42.79 | 25.41 | 18.42 |
pc4 | 99.16 | 27.40 | 58.97 | 42.17 | 35.56 |
Average | 98.69 | 27.16 | 38.39 | 35.72 | 28.54 |
Dataset | 1D-CNN1 | SVM | RF | DT | NB |
---|---|---|---|---|---|
cm1 | 3.6777 | 0.0097 | 0.7450 | 0.0065 | 0.0043 |
kc1 | 11.3616 | 0.0585 | 1.0395 | 0.0165 | 0.0031 |
kc2 | 3.8472 | 0.0089 | 0.6492 | 0.0048 | 0.0027 |
mc1 | 59.741 | 1.3434 | 2.2399 | 0.0695 | 0.0086 |
mc2 | 2.8329 | 0.0030 | 0.5630 | 0.0041 | 0.0033 |
pc1 | 4.0972 | 0.0060 | 0.6909 | 0.0054 | 0.0035 |
pc2 | 34.9918 | 0.7153 | 1.8789 | 0.0583 | 0.0048 |
pc3 | 10.9125 | 0.0496 | 0.9957 | 0.0197 | 0.0032 |
pc4 | 10.3116 | 0.0501 | 0.9190 | 0.0167 | 0.0045 |
Average | 15.7526 | 0.2494 | 1.0801 | 0.0224 | 0.0042 |
Dataset | 1D-CNN1 | SVM | RF | DT | NB |
---|---|---|---|---|---|
cm1 | 0.1032 | 0.0170 | 0.8367 | 0.0101 | 0.0075 |
kc1 | 0.1119 | 0.1085 | 1.1712 | 0.0209 | 0.0074 |
kc2 | 0.0878 | 0.0153 | 0.7364 | 0.0091 | 0.0058 |
mc1 | 0.2819 | 2.2170 | 2.5247 | 0.0743 | 0.0167 |
mc2 | 0.0866 | 0.0066 | 0.6492 | 0.0069 | 0.0087 |
pc1 | 0.0883 | 0.0120 | 0.7830 | 0.0089 | 0.0082 |
pc2 | 0.1761 | 1.1731 | 2.0909 | 0.0628 | 0.0109 |
pc3 | 0.1076 | 0.0951 | 1.1161 | 0.0242 | 0.0082 |
pc4 | 0.1077 | 0.0916 | 1.0293 | 0.0227 | 0.0088 |
Average | 0.1279 | 0.4151 | 1.2153 | 0.0267 | 0.0091 |
In terms of accuracy (
To answer the second research question, we compare the performance of the proposed model with four state-of-the-art deep learning models: Defect Prediction with Deep Forest (DPDF) [
Dataset | 1D-CNN | DPDF [ | GA-DNN [ | DBNPM [ | SDAE [ |
---|---|---|---|---|---|
cm1 | 99.43 | ⊠ | 97.59 | ⊠ | ⊠ |
kc1 | 99.89 | ⊠ | 97.82 | ⊠ | ⊠ |
kc2 | 100.00 | ⊠ | ⊠ | ⊠ | ⊠ |
mc1 | 99.96 | 98.30 | ⊠ | 85.17 | 87.00 |
mc2 | 97.54 | 74.60 | ⊠ | ⊠ | ⊠ |
pc1 | 100.00 | 91.30 | ⊠ | ⊠ | ⊠ |
pc2 | 100.00 | 98.20 | ⊠ | ⊠ | ⊠ |
pc3 | 99.78 | 90.00 | 97.96 | ⊠ | ⊠ |
pc4 | 99.76 | 88.90 | 98.00 | ⊠ | ⊠ |
Average | 99.60 | 90.22 | 97.84 | 85.17 | 87.00 |
Note: ⊠ indicates that the study did not test their proposed approach on the specified dataset.
Dataset | 1D-CNN | DPDF [ | GA-DNN [ | DBNPM [ | SDAE [ |
---|---|---|---|---|---|
cm1 | 97.56 | ⊠ | 91.48 | ⊠ | ⊠ |
kc1 | 99.68 | ⊠ | 95.89 | ⊠ | ⊠ |
kc2 | 100.00 | ⊠ | ⊠ | ⊠ | ⊠ |
mc1 | 97.18 | 4.00 | ⊠ | 88.89 | 87.00 |
mc2 | 95.60 | 48.00 | ⊠ | ⊠ | ⊠ |
pc1 | 100.00 | 17.00 | ⊠ | ⊠ | ⊠ |
pc2 | 100.00 | 83.00 | ⊠ | ⊠ | ⊠ |
pc3 | 99.00 | 11.00 | 94.50 | ⊠ | ⊠ |
pc4 | 99.16 | 33.00 | 93.50 | ⊠ | ⊠ |
Average | 98.69 | 32.67 | 93.84 | 88.89 | 87.00 |
Note: ⊠ indicates that the study did not test their proposed approach on the specified dataset.
The effect of applying different structures on the performance of the 1D-CNN algorithm in detecting software defects is discussed in this section.
Dataset | 1D-CNN1 | 1D-CNN2 | 1D-CNN3 | 1D-CNN4 | 1D-CNN5 | 2D-CNN |
---|---|---|---|---|---|---|
cm1 | 99.43 | 98.95 | 99.43 | 99.43 | 99.43 | 99.08 |
kc1 | 99.89 | 100.00 | 100.00 | 99.92 | 100.00 | 100.00 |
kc2 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
mc1 | 99.96 | 98.57 | 99.96 | 99.98 | 99.96 | 99.96 |
mc2 | 97.54 | 98.58 | 96.49 | 96.49 | 96.49 | 96.49 |
pc1 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
pc2 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
pc3 | 99.78 | 99.08 | 99.42 | 99.82 | 99.34 | 99.31 |
pc4 | 99.76 | 99.93 | 99.96 | 99.76 | 99.77 | 99.84 |
Average | 99.60 | 99.46 | 99.47 | 99.49 | 99.44 | 99.41 |
Dataset | 1D-CNN1 | 1D-CNN2 | 1D-CNN3 | 1D-CNN4 | 1D-CNN5 | 2D-CNN |
---|---|---|---|---|---|---|
cm1 | 97.56 | 97.56 | 97.54 | 97.56 | 97.56 | 96.14 |
kc1 | 99.68 | 100.00 | 100.00 | 99.76 | 100.00 | 100.00 |
kc2 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
mc1 | 97.18 | 96.70 | 97.18 | 98.14 | 96.70 | 96.68 |
mc2 | 95.60 | 93.75 | 93.75 | 93.75 | 93.75 | 93.75 |
pc1 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
pc2 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
pc3 | 99.00 | 97.26 | 97.26 | 99.17 | 96.94 | 96.77 |
pc4 | 99.16 | 99.67 | 99.86 | 99.15 | 99.15 | 99.44 |
Average | 98.69 | 98.33 | 98.40 | 98.61 | 98.23 | 98.09 |
Dataset | 1D-CNN1 | 1D-CNN2 | 1D-CNN3 | 1D-CNN4 | 1D-CNN5 | 2D-CNN |
---|---|---|---|---|---|---|
cm1 | 3.6777 | 3.3454 | 4.01334 | 2.9134 | 4.1091 | 3.8062 |
kc1 | 11.3616 | 9.2173 | 12.2861 | 7.5528 | 11.5590 | 11.3047 |
kc2 | 3.8472 | 3.4947 | 4.1768 | 2.9579 | 4.17148 | 3.9124 |
mc1 | 59.741 | 45.3412 | 77.2742 | 34.4239 | 61.3285 | 58.9859 |
mc2 | 2.8329 | 2.4175 | 2.9715 | 2.18492 | 2.8496 | 2.6905 |
pc1 | 4.0972 | 3.5554 | 4.2921 | 3.23928 | 4.3503 | 4.0192 |
pc2 | 34.9918 | 26.7452 | 44.8375 | 21.3692 | 35.9971 | 34.2816 |
pc3 | 10.9125 | 8.7982 | 13.6067 | 7.34014 | 11.5780 | 11.0636 |
pc4 | 10.3116 | 8.2415 | 13.0168 | 6.6651 | 10.9259 | 10.1461 |
Average | 15.7526 | 12.3507 | 19.6083 | 9.8496 | 16.3188 | 15.5789 |
Dataset | 1D-CNN1 | 1D-CNN2 | 1D-CNN3 | 1D-CNN4 | 1D-CNN5 | 2D-CNN |
---|---|---|---|---|---|---|
cm1 | 0.1032 | 0.0920 | 0.1274 | 0.0979 | 0.1120 | 0.0927 |
kc1 | 0.1119 | 0.1118 | 0.1327 | 0.0983 | 0.1533 | 0.1143 |
kc2 | 0.0878 | 0.0884 | 0.0929 | 0.0854 | 0.1075 | 0.0974 |
mc1 | 0.2819 | 0.2411 | 0.3033 | 0.1704 | 0.2259 | 0.2408 |
mc2 | 0.0866 | 0.0782 | 0.0911 | 0.0829 | 0.0944 | 0.0878 |
pc1 | 0.0883 | 0.0935 | 0.0941 | 0.0947 | 0.1036 | 0.0931 |
pc2 | 0.1761 | 0.2059 | 0.2369 | 0.1813 | 0.1642 | 0.2189 |
pc3 | 0.1076 | 0.1109 | 0.1187 | 0.0958 | 0.1191 | 0.1132 |
pc4 | 0.1077 | 0.1499 | 0.1701 | 0.0948 | 0.1076 | 0.1106 |
Average | 0.1279 | 0.1302 | 0.1519 | 0.1113 | 0.1320 | 0.1299 |
Compared to 1D-CNN2, which omits the dropout layer, 1D-CNN1 shows better performance in terms of accuracy and f-measure and is more efficient in terms of testing time. We attribute this result to the benefit of adding a dropout layer, which can reduce overfitting in a classifier [
Compared to 1D-CNN3, which used a larger kernel size (kernel size = 3), 1D-CNN1 still shows better performance and is more efficient. To illustrate the impact of different kernel sizes on the performance of the CNN classifier, we ran the experiment with kernel sizes from 1 to 5 on the 9 datasets. The average of each performance metric was computed and visualized in bar graphs (
Compared to 1D-CNN4, which used a smaller filter size in each convolutional layer (32 and 15, respectively), 1D-CNN1 performs better in terms of accuracy and f-measure. However, 1D-CNN4 is more efficient in terms of training and testing time. One can argue that using a smaller filter size may increase the efficiency of a 1D-CNN classifier but does not improve its performance in predicting software defects.
Compared to 1D-CNN5, which used an additional convolutional layer, 1D-CNN1 shows better performance and is more efficient. This suggests that adding one convolutional layer does not improve the performance or efficiency of a 1D-CNN classifier in predicting software defects.
Compared to 2D-CNN, which used 2D convolutional and max-pooling layers, 1D-CNN1 again shows higher performance and is more efficient in terms of testing time. In terms of average accuracy and f-measure, all 1D-CNN models performed better than the 2D-CNN model built in this study. One might argue that using a 2D structure does not improve on the performance of a 1D structure in predicting software defects.
The effect of applying hyperparameter tuning on the performance of the 1D-CNN algorithm in detecting software defects is discussed in this section.
The optimal parameters and performance of 1D-CNN after conducting the hyperparameter tuning are shown in
Dataset | Best epochs | Best learning rate (0.0001-0.1) | Best dropout rate (0.1-0.5) | Acc | F1 | Training Time | Testing Time |
---|---|---|---|---|---|---|---|
cm1 | 343 | 0.0005 | 0.2 | 100.00 | 100.00 | 38.0487 | 0.1172 |
kc1 | 245 | 0.004 | 0.3 | 100.00 | 100.00 | 95.8636 | 0.1394 |
kc2 | 298 | 0.02 | 0.2 | 100.00 | 100.00 | 33.7125 | 0.0910 |
mc1 | 165 | 0.004 | 0.4 | 99.97 | 97.67 | 383.7442 | 0.2910 |
mc2 | 203 | 0.002 | 0.1 | 100.00 | 100.00 | 14.6395 | 0.0947 |
pc1 | 464 | 0.005 | 0.4 | 100.00 | 100.00 | 56.5624 | 0.1082 |
pc2 | 497 | 0.003 | 0.1 | 100.00 | 100.00 | 639.8141 | 0.2055 |
pc3 | 436 | 0.001 | 0.1 | 99.63 | 98.33 | 171.7405 | 0.1280 |
pc4 | 488 | 0.01 | 0.2 | 99.80 | 99.29 | 175.4491 | 0.1248 |
Average | | | | 99.93 | 99.48 | 178.8416 | 0.1444 |
The performance metrics used in our analysis relate to threats to construct validity. In this study, four evaluation metrics were selected: accuracy, f-measure, training time, and testing time. Other measures, such as the kappa statistic, AUC, and MCC, can also be used to evaluate binary classifiers. However, the four metrics selected in this study are widely used to evaluate the detection of software defects.
Threats to internal validity primarily concern uncontrolled internal variables that may affect the results of the experiment. The key internal threat is the possibility of faults in the implementation of our experiments. To reduce this threat, we built the six CNN classifiers with the Keras library and the four baseline classifiers with the scikit-learn library. We followed the Keras and TensorFlow documentation on how to build 1D and 2D CNN models. The parameter setup for the proposed model is based on previous works that yielded the best results. The four baseline classifiers adopted the default parameter values given in the official scikit-learn documentation.
Threats to external validity relate to the possibility of generalizing our results. The experiments conducted in this study used nine NASA datasets. Several other datasets are available, such as PROMISE, Code4Bench, AEEEM, Relink, and CodeChef. Therefore, the experimental results might not generalize to other datasets, which might produce better or worse results for each software defect prediction model used in this study. However, the datasets we opted for are often used in previous software defect detection [
In this study, a research method was designed to investigate the impact of different structures of the 1D-CNN classifier on the detection of software defects. The main step of the research method is building CNN models with different structures. First, we proposed a structure for a 1D-CNN model as a baseline. Second, we built another five CNN models whose structures differ in dropout layer exclusion, kernel size, filter size, the inclusion of an additional convolutional layer, and the type of convolutional and max-pooling layers. Third, we developed four machine learning models to investigate how well our proposed model performs compared to established machine learning models. We evaluated the models based on accuracy, f-measure, training time, and testing time, and then analysed and compared the results. Finally, we tuned three selected hyperparameters (the number of epochs, learning rate, and dropout rate) of the proposed 1D-CNN model to improve its performance.
The main result of this study reveals that, compared to other CNN and traditional machine learning models, the proposed 1D-CNN software defect classification model achieved superior performance, with 99.60% accuracy and 98.69% f-measure. This study also shows that adding a dropout layer to the proposed 1D-CNN structure improves its performance by reducing overfitting, which has a great impact on the discrimination between defect and nondefect software. In contrast, increasing the kernel size of the proposed 1D-CNN model, reducing the filter size, adding an additional convolutional layer, and using 2D convolutional and max-pooling layers do not have a great impact on the detection of software defects.
In addition, this study provides optimal values for the three selected hyperparameters for each dataset. We can conclude that conducting hyperparameter tuning improved the performance of the proposed 1D-CNN model in software defect prediction. According to these results, 1D-CNN appears to be effective for software defect prediction and can be applied for a practical challenge in the software engineering context. This in turn could lead to saving software development resources and producing more reliable software.
There are several ways to expand on this work. First, thorough experiments can be performed to investigate the impact of the number of convolutional layers on the model's overall performance. Second, empirical studies can be conducted on different datasets or different levels of software defects. Third, other hyperparameters could be tuned to enhance the performance of the 1D-CNN model. Fourth, feature selection and class imbalance issues in SDP should also be considered, which, in theory, might improve the performance of software defect prediction. Finally, experiments can be carried out to understand the success factors of 1D-CNN at different granularity levels of software defects, such as change-level and file-level. This would be particularly useful for practitioners, as it would identify situations in which 1D-CNN should be favored over alternative techniques.
The authors would like to thank the Information Systems Department, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University for providing facilities to conduct the research.