Modified Differential Box Counting in Breast Masses for Bioinformatics Applications

Abstract: Breast cancer is one of the most common invasive cancers and is the second leading cause of cancer death after lung cancer. The present work applies image processing to characterize shape and gray-scale complexity. The proposed Modified Differential Box Counting (MDBC) method extracts fractal features, namely Fractal Dimension (FD), Lacunarity, and Succolarity, for shape characterization. The traditional DBC method gives unreasonable results when FD is computed for tumour regions that have the same roughness of intensity surface but different gray-levels. The proposed MDBC method overcomes this problem by handling box over-counting and under-counting while covering the whole image at the required scale. In MDBC, suitable box size selection and the Under Counting Shifting rule computation handle the over-counting problem. A comparison with recently developed methods shows that the proposed MDBC method performs better in automatic detection and classification. The extracted features are fed to K-Nearest Neighbour (KNN) and Support Vector Machine (SVM) classifiers, which categorize the mammograms as normal, benign, or malignant. Tested on the mini-MIAS dataset, the method achieves an accuracy of 93%, whereas the existing method based on FD, GLCM, texture, and shape features achieves 91%.


Introduction
About 1 in 28 women is expected to develop breast cancer during her lifetime. By 2030, breast cancer is projected to cause more deaths among women in India than any other cancer. The survival rate of breast cancer is low because detection often takes place late. Early detection not only improves the outcome but can also markedly cut the cost of treatment [1], and it is essential for reducing the mortality rate. Therefore, a breast cancer screening method is needed to facilitate early diagnosis of this potentially fatal disease [2]. Medical imaging offers a non-invasive way to look inside the human body and helps doctors diagnose and treat breast cancer. Although a suspicious finding can be detected early with any of the available imaging methods, malignancy cannot be confirmed from the images alone. Confirmation requires microscopic examination of tissue, and until that examination takes place there is a high risk that cancer cells spread through the interstitial tissue, vessels, or fluid. Mammography, combined with clinical and self breast examination, is the practical and effective method of mass screening for breast cancer, which appears in women in the form of tumors [3]. Since mammograms are medical images, fractal geometry is an appropriate method to identify texture features in the mass region [4]. In dense breasts, normal tissue has intensities similar to those of cancer regions, so tumor regions must be identified reliably. Many imaging techniques have been developed for early detection and treatment of breast cancer to reduce the number of deaths, and many computer-aided diagnosis (CAD) methods have been used to increase diagnostic accuracy. Accurate analysis of mammograms with CAD techniques to determine breast cancer is the motivation of this research.
A CAD system plays a crucial role in early detection compared with other methods such as biopsy. Although CAD systems have desirable properties, they pose inherent challenges. The main issue is that they are semi-automatic and need an expert radiologist to identify the mass region [5]. In addition, sudden grey-scale variation at the borders of neighbouring boxes leads to box under-counting. To handle this, a modified DBC approach is proposed and described in Section 3 [6][7][8]. To address the first challenge and reduce the experts' overhead, we propose automatic mass region extraction using the maximum entropy principle and a shape feature (circularity). Once mass regions are identified, discriminative features are essential for efficient classification of a mammogram. In the box-count computation, the intermediate pixel values are considered together with the maximum and minimum grey levels of the box [9][10][11]. In this manner, the neighbouring pixel values are analyzed effectively for suitable box size selection. Mammogram images are generally textural in nature, particularly in the mass region [12,13]. To extract texture features, methods such as Local Binary Pattern (LBP), co-occurrence matrix, shape, contours, and fractals have been considered. Existing methods for breast cancer classification are based on active contours, active region segmentation, shape and texture features, hybrid methods, fractal methods, SURF, colour features, CNN, and differential box counting [14,15]. When the traditional DBC model is applied to surfaces with the same roughness of intensity but different gray-levels, it suffers from over-counting and under-counting problems, which is the research gap addressed here. These works influence the proposed framework for automatic ROI extraction. This paper extracts fractal features and classifies the mammogram with KNN and SVM classifiers. The rest of the paper is structured as follows.
Section 2 describes the existing methods for breast mass detection, and Section 3 outlines the proposed MDBC framework. Section 4 presents the simulation results and discussion. Section 5 presents the conclusion and future work.

Related Work
To identify breast cancer, radiologists mostly depend on CAD screening setups. A few notable works from the literature are presented below.
Kaur et al. [16] applied k-means clustering for Speeded-Up Robust Features (SURF) selection and a multiclass SVM with a deep learning method for classification. The mini-MIAS dataset was used to evaluate the model. The model suffers from overfitting on the training data. Agnes et al. [17] developed a Multiscale All CNN (MA-CNN) model for detecting breast cancer in medical breast images. A multiscale filter is applied to fuse wider contextual information and improve the classification accuracy of the system. The mini-MIAS dataset was used to evaluate the MA-CNN model. The MA-CNN model has an overfitting problem that affects its performance.
Rabidas et al. [18] applied the Local Photometric Attributes (LPA) method to analyze local information in medical images. The mini-MIAS dataset was used to evaluate the method. The analysis shows that the model performs better than existing models, but it has lower efficiency in feature analysis. Ghasemzadeh et al. [19] developed a deep-learning-assisted efficient AdaBoost algorithm for breast cancer detection and early diagnosis. The method had higher accuracy in detecting breast cancer masses owing to effective feature analysis, which can increase the patient survival rate. However, the algorithm was too weak to classify the images and resulted in low margins and overfitting.
Dhahri et al. [20] developed a hand-held, infrared, machine-learning-based method with high classification accuracy for breast cancer detection. The method showed effective performance in terms of sensitivity and specificity. Its computational complexity was high, and it consumed more time in classification. Wang et al. [21] diagnosed breast cancer using an efficient CAD system based on multiple classifiers. First, the mammogram images were enhanced to increase the contrast. Second, the pectoral muscle was eliminated and the breast was suppressed from the mammogram. Next, k-nearest neighbor (k-NN) and decision tree classifiers were used to classify normal and abnormal lesions. The developed CAD system could be considered a powerful tool to detect and classify abnormalities in the breast. Indra et al. [22] developed a dual-mode deep transfer learning system for breast cancer detection using contrast-enhanced digital mammograms. The model used deep transfer learning effectively to classify benign and malignant tumors. However, the optimization problem used to generate the reconstructed graphs and the rigorous criteria for evaluating the graphs were limitations of the visualization approach. Pezeshki et al. [23] developed texture analysis of gradient images for benign-malignant mass classification. In addition to the local texture feature, Local Binary Pattern, approximation coefficients were extracted from the gradient images using the wavelet transform to evaluate their efficiency in a Computer-Aided Diagnosis (CADx) system. However, other texture features along with different classifiers could be incorporated in future, which may enhance the efficiency of the system.
In this section, the literature on ROI segmentation and feature extraction methods in the fractal domain was discussed, together with the advantages and disadvantages of each method. Since most of the work discussed above is semi-automatic, this paper focuses on automatic computation and extraction of fractal features. The fractal dimension is one of the prominent features of fractals. The next section presents the proposed MDBC framework, which performs automatic detection and classification.
The proposed model has the advantage of applying suitable box selection to improve performance.

Proposed Approach
The proposed automated CAD framework comprises four phases: (i) pre-processing; (ii) automatic mass region identification and extraction; (iii) feature extraction using the proposed MDBC, which handles box over-counting and under-counting while covering the whole image at the required scale; and (iv) classification, in which the extracted features are fed to a K-Nearest Neighbour (KNN) classifier and a Support Vector Machine (SVM) that categorize the mammograms as normal, benign, or malignant. The framework is shown in Fig. 1.

Pre-Processing
Mammogram images are noisy and contain artifacts and labels that affect the classification results. Pre-processing is the initial step in the automatic analysis of mammograms. It involves segmenting the breast region and removing the pectoral muscle, which narrows the search for abnormalities to the relevant region of the breast without excessive influence from the mammogram's background. The pectoral muscle appears as a triangular opacity across the upper posterior margin of the image and can bias the result of any mammogram processing system, so it must be identified and segmented automatically. The present research performs two pre-processing stages: breast region segmentation and removal of the pectoral muscle. A median filter and morphological operations are applied to remove speckle noise from the medical image.

Mass Region Identification and Segmentation
After pre-processing, automatic ROI segmentation is performed. In dense breast tissue, the mass region frequently merges with the surrounding tissue, which makes it difficult for a CAD system to locate and segment masses accurately. For this reason, most CAD systems rely on a radiologist to select the mass region, so experienced and trained persons are required for their successful operation.
To address this issue, the proposed framework performs two steps: (i) binarization and (ii) ROI extraction. For effective binarization, the maximum entropy principle is applied to identify the suspected regions. The circularity value is then computed for all suspected regions, which helps differentiate the mass region from other tissue regions.

Binarisation
Detecting the mass region in mammogram images is complex because masses are surrounded by complicated tissue and have ambiguous shape margins. To differentiate masses from complicated normal tissue, the foreground and background objects must be separated. Binarization is the process of separating the mass region from the surrounding tissue based on a threshold value. A threshold can be calculated by global or adaptive methods [24]. Generally, adaptive thresholds are preferable for mammogram images. The proposed framework computes the optimal adaptive threshold with the maximum entropy principle.

Maximum Entropy Principle
In information theory, entropy is used to measure the amount of information [25]. In the proposed framework, the mass region is partitioned from the background tissue using entropy based on the gray-level distribution. Consider a discrete random variable x with possible outcomes {x_1, ..., x_n}, where n is the number of gray levels, and let P(x_k) be the probability of the outcome x_k, where k ranges from 1 to n. The entropy is then defined in Eq. (1).
Consider the task of partitioning the input image into mass (A) and background tissue (B), where the probability distribution of the grey levels in the input image is {p_0, p_1, ..., p_n}.
The probability distributions of mass (p_A) and background (p_B) are given in Eqs. (2) and (3).
To obtain the optimum threshold value, the total entropy has to be maximized as in Eq. (4),
where the entropies H(A) and H(B) are calculated by Eqs. (5) and (6). Using this optimal threshold s, the image is binarized, and the pixels with values higher than s are considered to belong to a mass region. As the entropy is maximized by the method, the gray-level complexity of the image decreases.
These mass regions have holes and are connected to nearby objects, which is corrected by applying the morphological operations opening and closing.
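Because Eqs. (1) to (6) are not reproduced here, the thresholding step can be illustrated with a minimal sketch. It assumes the commonly used Kapur-style formulation, in which the grey levels up to s and the grey levels above s are each normalized into a distribution and s is chosen to maximize the sum of the two entropies. The function names `max_entropy_threshold` and `binarize` are hypothetical, and which of the two classes is labelled A or B is also an assumption.

```python
import numpy as np

def max_entropy_threshold(image, levels=256):
    """Return the grey level s that maximizes the sum of the two class entropies.

    Assumption: Kapur-style maximum entropy thresholding; the paper's own
    Eqs. (1)-(6) are not available, so this is only an illustrative form.
    """
    values = np.asarray(image).astype(np.int64).ravel()
    hist = np.bincount(values, minlength=levels).astype(float)
    p = hist / hist.sum()              # grey-level probabilities p_0 .. p_n
    cdf = np.cumsum(p)
    best_s, best_h = 0, -np.inf
    for s in range(levels - 1):
        P_s = cdf[s]                   # total probability of levels 0..s
        if P_s <= 0.0 or P_s >= 1.0:
            continue                   # one class would be empty
        low = p[: s + 1] / P_s         # distribution of levels <= s
        high = p[s + 1:] / (1.0 - P_s) # distribution of levels > s
        h_low = -np.sum(low[low > 0] * np.log(low[low > 0]))
        h_high = -np.sum(high[high > 0] * np.log(high[high > 0]))
        if h_low + h_high > best_h:
            best_s, best_h = s, h_low + h_high
    return best_s

def binarize(image, s):
    """Pixels brighter than s are treated as candidate mass pixels."""
    return (np.asarray(image) > s).astype(np.uint8)
```

The resulting binary map would then be cleaned with morphological opening and closing, as described above, before the circularity test.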

ROI Extraction Using Circularity
Generally, masses in mammograms are characterized by a circular, lobulated, or spiculated shape [26]. Many regions of different shapes are extracted in the previous step, but not all of them are masses.
To distinguish the mass region from normal regions, the proposed framework uses circularity as a shape feature. This method considers two parameters, the circumference and the area of the mass, to compute the circularity as defined in Eq. (7),
where C_r = circularity, C = circumference, and N = area.
The total number of 1's in the extracted region is taken as the area, and the circumference is calculated by summing the number of 1's in the boundary of the region. Based on the C_r value, the shape of the mass is identified. Methods such as the Hough transform and template matching fail to identify the mass region when its size varies. Eq. (7), in contrast, allows a mass region of any size to be distinguished from normal tissue, because the circularity value of a blood vessel region is higher than that of a mass region.
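A minimal sketch of the circularity measure follows. Eq. (7) is not reproduced here, so the form C_r = C² / (4πN) is an assumption, chosen because it grows for elongated regions, which agrees with the statement that blood vessels score higher than masses. The function name `circularity` and the 4-neighbour boundary definition are also assumptions.

```python
import numpy as np

def circularity(mask):
    """Circularity of a binary region, assuming C_r = C^2 / (4 * pi * N).

    N is the number of foreground pixels (area) and C is the number of
    foreground pixels on the region boundary (circumference).
    """
    mask = np.asarray(mask, dtype=bool)
    area = int(mask.sum())
    if area == 0:
        return 0.0
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior when all four of its 4-neighbours are foreground.
    interior = (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    circumference = int((mask & ~interior).sum())
    return circumference ** 2 / (4.0 * np.pi * area)
```

With this definition a compact, disk-like region gives a value near 1, while a thin vessel-like strip gives a much larger value, so a simple upper bound on C_r can be used to keep mass candidates.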

Feature Extraction
In the previous section the ROI was extracted; in this section its features are extracted. As described in Sections 1 and 2, mass regions are textural in nature. Texture is described by fractal geometry through three aspects: (i) FD, (ii) Lacunarity, and (iii) Succolarity. FD is a feature that captures self-similarity and roughness in medical images. Lacunarity is another fractal feature that measures the gap distribution in mammograms and is useful for representing the inner structure of the tumor. FD and Lacunarity are well studied and widely used in mammogram image analysis [27], whereas Succolarity has not been considered widely. Succolarity is a fractal feature used to discriminate images that have flow information associated with them. Hence this framework extracts the fractal features FD, Lacunarity, and Succolarity, and the modified way of computing FD is described below.

Fractal Dimension (FD)
Traditionally, FD is computed by methods such as (i) Ruler, (ii) Blanket, (iii) Box counting, (iv) Differential Box Counting, (v) Triangular prism surface area, and (vi) Power spectral analysis. Among these, box counting is a simple and frequently used method for estimating FD. Consider a bounded set A in Euclidean space; the FD of A can be estimated by Eq. (8). In the DBC method, each grid of the image is covered by a column of boxes of size (s × s × s′), where the box height s′ is computed by Eq. (9) and G is the total number of grey levels.
The number of boxes n_r covering each grid is counted using Eq. (10),
where l corresponds to the maximum grey-level intensity and k to the minimum grey-level intensity in the grid. The values l and k are calculated as in Eqs. (11) and (12).
The total number of boxes N_r for the (M × M) image is computed with Eq. (13).
The FD (D) of the grey-scale image is then estimated using Eq. (8) by substituting the value obtained from Eq. (13).
In practice, the FD is the slope of the line obtained by a linear least squares fit of the points (log(1/r), log(N_r)).
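The traditional DBC estimate can be sketched as follows. Since Eqs. (8) to (13) are not reproduced here, the sketch assumes the widely used formulation: scale r = s/M, box height s′ = s·G/M, per-grid count n_r = l − k + 1, total N_r as the sum of n_r over all grids, and FD as the least squares slope of log(N_r) against log(1/r). The function name `dbc_fractal_dimension` is hypothetical, and the image is assumed to be square with every box size dividing M.

```python
import numpy as np

def dbc_fractal_dimension(image, box_sizes, grey_levels=256):
    """Estimate FD with traditional differential box counting (assumed form)."""
    image = np.asarray(image, dtype=float)
    M = image.shape[0]
    log_inv_r, log_N = [], []
    for s in box_sizes:
        h = s * grey_levels / M                      # box height s'
        # Split the image into (M/s) x (M/s) grids of s x s pixels each.
        blocks = image.reshape(M // s, s, M // s, s)
        g_max = blocks.max(axis=(1, 3))
        g_min = blocks.min(axis=(1, 3))
        l = np.floor(g_max / h)                      # box holding the maximum
        k = np.floor(g_min / h)                      # box holding the minimum
        n_r = l - k + 1                              # boxes needed per grid
        log_inv_r.append(np.log(M / s))              # log(1/r) with r = s/M
        log_N.append(np.log(n_r.sum()))              # log(N_r)
    slope, _intercept = np.polyfit(log_inv_r, log_N, 1)
    return slope
```

A perfectly flat image gives a slope of 2, the dimension of a plane, and rougher intensity surfaces move the estimate towards 3.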

Proposed Modified DBC (MDBC)
In the MDBC method, the over-counting problem is countered by selecting a suitable box size and by a modified computation of n_r(i, j), for which Under Counting Shifting rules are formulated. The proposed MDBC makes two assumptions. First, the box-count precision increases with an unequal triangle box partition. Second, the weights of the box count follow the size proportions of the triangle box partition. Based on these assumptions, the box in each grid is divided into four asymmetric triangle box patterns. Each pattern computes its box count using the box-counting technique. The larger the number of box counts, the better the estimation. The MDBC follows the Under Counting Shifting rules, which perform better in terms of fitting error, as follows:

i) Selection of Suitable Box Size
Selecting the box size is also an important issue in the DBC method: if M cannot be evenly partitioned by s, the remaining partition is filled with zeros, which may affect the accuracy of the method. Therefore, the box size should be chosen from the divisors of M. For example, for an image of size (256 × 256), the box sizes must be 2, 4, 8, 16, 32, 64, and 128; proper partitioning increases the accuracy.
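The divisor rule can be written as a short helper. The function name `candidate_box_sizes` and the choice of bounds (from 2 up to M/2) are assumptions that match the 256 × 256 example given above.

```python
def candidate_box_sizes(M):
    """Return the box sizes s (2 <= s <= M/2) that divide M exactly."""
    return [s for s in range(2, M // 2 + 1) if M % s == 0]

print(candidate_box_sizes(256))  # [2, 4, 8, 16, 32, 64, 128]
```

For an image width that is not a power of two, the same helper returns every proper divisor, so no partition needs zero padding.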

ii) Modified Way of n_r(i, j) Computation
Because n_r(i, j) in the traditional method, computed using Eqs. (10) and (11), considers only the maximum and minimum grey levels of the box, the contribution of the intermediate pixels is omitted. This may affect the accuracy of the system and cause an over-counting problem. To avoid this, in our work the average value in the (i, j)th block is calculated as I_avg, and the minimum and maximum values are computed as I_nmin and I_nmax. This reduces the box count and, in turn, increases the accuracy.
If there are grey-scale variations at the borders of neighbouring boxes, under-counting of boxes may occur in the z direction. Shifting the boxes along the x and y directions and taking the maximum n_r value into consideration improves the accuracy of the method and avoids under-counting.
When finding n_r, the box of size (s × s) is shifted in the (x, y) plane by α pixels, the new value (new_n_r) is computed and compared with the value obtained without shifting (old_n_r), and the maximum of the two is selected as in Eq. (14). This captures the borders of neighbouring boxes so that the under-counting problem is avoided. Here α is taken as 1, because a large value of α results in inappropriate FD values.
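The shifting rule can be sketched as below. This is only an illustration under stated assumptions: the per-box count uses the traditional l − k + 1 form rather than the modified I_avg-based computation (whose formula is not reproduced here), the shifts tried are α pixels along x, along y, and along both, and Eq. (14) is taken to be n_r = max(old_n_r, new_n_r). The function names are hypothetical.

```python
import numpy as np

def box_count(block, h):
    """Traditional per-grid count l - k + 1 for box height h (assumed form)."""
    return int(np.floor(block.max() / h) - np.floor(block.min() / h) + 1)

def shifted_box_count(image, i, j, s, h, alpha=1):
    """Count for grid (i, j), keeping the maximum over alpha-pixel shifts."""
    rows, cols = image.shape
    r0, c0 = i * s, j * s
    old_n = box_count(image[r0:r0 + s, c0:c0 + s], h)
    new_n = old_n
    # Assumed shift set: along x, along y, and along both directions.
    for dr, dc in ((0, alpha), (alpha, 0), (alpha, alpha)):
        r1, c1 = r0 + dr, c0 + dc
        if r1 + s <= rows and c1 + s <= cols:   # shifted box must fit
            new_n = max(new_n, box_count(image[r1:r1 + s, c1:c1 + s], h))
    return max(old_n, new_n)
```

When a sharp intensity step lies exactly on a grid border, the unshifted box sees a flat surface and counts one box, whereas the shifted box straddles the step and recovers the missing count.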

Shifting Rules
In the traditional DBC method, unreasonable results may be obtained when FD is computed for tumour regions that have the same roughness of intensity surface but different gray-levels. The proposed MDBC method avoids this because it overcomes the problems of box over-counting and under-counting. Although FD is a significant feature, it yields better results when combined with Lacunarity.

Lacunarity
Lacunarity [28] is the counterpart to the FD that describes the texture of a fractal. Higher lacunarity indicates that the area is more heterogeneous. It is defined as the ratio of the variance to the mean value of the function, as shown in Eq. (15),
where M is the size of the FD-processed image,
Q(N_r, s) is the probability of N_r in box size s, L_r is the lacunarity for box size s, and N_r is computed using the MDBC method.
Lacunarity characterizes the spatial organization of the tumor region, and it yields improved accuracy when combined with other fractal features.
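As an illustration, lacunarity at one box size can be computed from the per-grid box counts. Eq. (15) is not reproduced here, so the sketch assumes the common moment form, the second moment of the counts divided by the squared first moment, which equals 1 + variance/mean². The function name `lacunarity` and the convention for an empty region are assumptions.

```python
import numpy as np

def lacunarity(box_counts):
    """Lacunarity of a set of box counts at one box size (assumed moment form)."""
    m = np.asarray(box_counts, dtype=float).ravel()
    mean = m.mean()
    if mean == 0:
        return 0.0          # empty region: defined as 0 here (assumption)
    return float((m ** 2).mean() / mean ** 2)

print(lacunarity([3, 3, 3, 3]))  # 1.0, perfectly homogeneous
print(lacunarity([0, 0, 0, 4]))  # 4.0, mass concentrated in one box
```

Homogeneous counts give the minimum value of 1, and the value rises as the counts become more unevenly distributed, which matches the statement that higher lacunarity indicates a more heterogeneous area.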

Succolarity
This fractal feature has received less attention in mammogram image analysis, although it has a wide range of applications in texture analysis [29]. Succolarity is defined as an estimate of the degree to which filaments allow percolation. Before the algorithm is applied, the input grey-scale image is converted into a binary image as described in Section 3.2. The feature evaluates the percolation capacity of fluid in the tumour region in all four directions. A region with a benign tumour has a smooth contour but occupies a large area, so its percolation capacity is low compared with a malignant region, which has irregular, rough contours. While the fractal features FD and lacunarity effectively measure the inner complexity and roughness of the tumour, succolarity characterizes the nature of the tumour in terms of the roughness of its contours. The combination of these three features gives improved results when used with an effective classifier.
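The succolarity algorithm itself is not reproduced here, so the following is a minimal sketch of the flood-based formulation commonly used in the literature: fluid enters from one side of the binary image, spreads through 4-connected empty pixels, and each box contributes its flooded fraction weighted by a pressure that grows with depth. The function names, the convention that 1 marks impervious tissue and 0 marks empty space, and the use of rotations to obtain the four directions are all assumptions.

```python
import numpy as np
from collections import deque

def flood_from_top(binary):
    """Flood 4-connected empty pixels (0) starting from the top row."""
    obstacle = np.asarray(binary, dtype=bool)
    rows, cols = obstacle.shape
    flooded = np.zeros_like(obstacle)
    queue = deque()
    for c in range(cols):
        if not obstacle[0, c]:
            flooded[0, c] = True
            queue.append((0, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and not obstacle[nr, nc] and not flooded[nr, nc]):
                flooded[nr, nc] = True
                queue.append((nr, nc))
    return flooded

def succolarity_top(binary, box_size):
    """Pressure-weighted flooded fraction for a top-to-bottom flood."""
    flooded = flood_from_top(binary).astype(float)
    rows, cols = flooded.shape
    num = den = 0.0
    for r0 in range(0, rows, box_size):
        pressure = r0 + box_size / 2.0      # depth of the box centroid
        for c0 in range(0, cols, box_size):
            occupation = flooded[r0:r0 + box_size, c0:c0 + box_size].mean()
            num += occupation * pressure
            den += pressure
    return num / den

def succolarity_all_directions(binary, box_size):
    """Succolarity for floods entering from each of the four sides."""
    b = np.asarray(binary)
    return [succolarity_top(np.rot90(b, k), box_size) for k in range(4)]
```

A completely empty image gives 1 in every direction and a completely filled image gives 0, so a large, smooth region that blocks most of the window yields a low value.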

Classification of Breast Cancer
One major advantage of the SVM is its use of convex quadratic programming, which has only a global minimum and hence avoids being trapped in local minima. The binary classification is performed using Eq. (18).
In the equations, x_i are the data points and y_i are the corresponding labels. The hyperplane separates the labelled data according to the hyperplane equation, where w is the d-dimensional coefficient vector normal to the hyperplane and b is the offset from the origin. Based on the optimal separating margin, the optimization problem is solved using Eq. (18).
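For reference, the textbook hard-margin formulation that matches the symbols defined above can be written as follows. This is an assumed standard form, not necessarily the exact expression used as Eq. (18); the label set {−1, +1} is also an assumption.

```latex
% Separating hyperplane with normal vector w and offset b
w^{\top} x + b = 0, \qquad y_i \in \{-1, +1\}

% Maximum-margin (convex quadratic) optimization problem
\min_{w,\,b} \; \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

Because the objective is convex and the constraints are linear, any minimum found is the global one, which is the property referred to above.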
The present work also uses the K-Nearest Neighbours (KNN) algorithm, which does not require a learning phase. At classification time, a distance function serves as the class choice function. KNN assigns the element to be classified to the class that appears most often among its neighbours. The neighbours are weighted according to their distance from the new element. The parameter K determines how many neighbours are used to assign the class of each new element, which is calculated using Eq. (18).
In this step, the k-NN [30] and SVM [31] classifiers classify the extracted features.
In the KNN algorithm, each test sample is assigned the class label of its nearest points in the training set. The Euclidean distance metric is chosen to measure the distance between data points in KNN and is given by Eq. (19).
In the equation, X and Y are two points in Euclidean space, (X_1, ..., X_n) and (Y_1, ..., Y_n) are the Euclidean vectors from the origin of the space to these points, and n is the dimension of the space.
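A small sketch of KNN with the Euclidean distance is given below. The feature vectors are hypothetical (FD, lacunarity, succolarity) triples invented for illustration only, and a plain majority vote is used as a simplification of the distance-weighted voting described above. The function names are assumptions.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two n-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    dists = sorted((euclidean(x, query), label)
                   for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors, not values from the paper.
train_X = [(2.1, 1.0, 0.20), (2.2, 1.1, 0.25), (2.7, 1.8, 0.60), (2.8, 1.9, 0.65)]
train_y = ["benign", "benign", "malignant", "malignant"]
print(knn_predict(train_X, train_y, (2.75, 1.85, 0.60), k=3))  # malignant
```

No model is fitted in advance; all of the work happens at query time, which is why KNN is described as having no learning phase.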
SVM is broadly used in mammogram image classification. Although it was designed to solve binary classification problems, it has been extended to multiclass classification problems. In this work, the Radial Basis Function (RBF) kernel is used to map data from the input space to the feature space. It produces good results, with higher accuracy and a lower error rate than the polynomial kernel. KNN is implemented in Weka, and the LIBSVM library is used for the SVM classifier, which classifies the breast mass image as benign, malignant, or normal.

Pseudo Code for The Proposed Modified DBC
Input: H with x samples, y lines, and m bands.
Given: window size M, grid size s, and total gray level G.

Result and Discussion
The proposed methodology is simulated on two benchmark datasets: (i) mini-MIAS [32] and (ii) INbreast [33]. System requirement: the simulations are conducted on an Intel(R) Core(TM) i7 CPU 965 @ 3.20 GHz system with 4.00 GB RAM.

Dataset
The mini-MIAS dataset consists of grey-scale images of size (1024 × 1024); in total it contains 322 images of the left and right breasts of 161 patients. It is a reduced version of the original MIAS database, reduced to a 200-micron pixel edge. Of these, 266 images are considered for simulation: 207 are normal and 59 contain masses of benign or malignant type. Three types of background tissue are present in the images: fatty, fatty-glandular, and dense-glandular. Two classes of severity are present in the dataset: benign and malignant.
The publicly available INbreast dataset is also used, with 82 images in total, of which 45 are benign cases, 8 are normal, and 29 are malignant. Several types of lesions, such as distortions, asymmetries, calcifications, and masses, are present in the images. The ground truth of the lesions is provided in XML format. For the experiment, an image of size (1024 × 1024) is taken from the mini-MIAS database and pre-processed as described in Section 3.1; it is shown in Fig. 2a. The pre-processed image is shown in Fig. 2b, the mass identification image in Fig. 2c, and the classified image in Fig. 2d.
In this step, circularity values for the suspected regions are computed using Eq. (7). The results show that circular mass regions have values ranging from 0.9 to 1.5 and spiculated mass regions from 1.5 to 2.0. The selected region is taken as a mask and overlaid on the original image for mass segmentation as in Fig. 2b. The same process is repeated for all images in the databases, and the ROIs are extracted successfully. The performance of automatic mass segmentation is then computed with Eq. (20).
where X_A is Accuracy, TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative.
In the simulation, performance on mini-MIAS is computed for 266 images, which yield 58, 200, 5, and 3 images corresponding to TP, TN, FP, and FN, respectively. The accuracy of the proposed technique is compared with existing methods, the images from INbreast are also tested, and the results are shown in Tab. 1. Although [5,6] produced good results compared with our method, they are not proven to be reliable for varying mass size. The major drawback of template methods in automatic mass segmentation is their dependence on the template size. The method proposed in [7] is robust, but its complexity is high and its result is lower. In contrast, the proposed MDBC method handles box over-counting and under-counting while covering the whole image at the required scale. MDBC selects a suitably sized box, and the Under Counting Shifting rule computation handles the over-counting problem. Thus, the proposed MDBC obtains a 1% to 8% improvement in accuracy over existing models. Dataset insights, which describe the pros and cons of the data, are visualized graphically in Fig. 3. Our method yields satisfactory results irrespective of mass size and gives good results for both the mini-MIAS and INbreast datasets.
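As a worked check of the segmentation accuracy, the counts reported above can be substituted into the usual accuracy formula. Eq. (20) is not reproduced here, so the form (TP + TN) / (TP + TN + FP + FN) is an assumption based on the symbols it defines; the function name `accuracy` is hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly labelled images (assumed form of Eq. (20))."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts reported for the 266 mini-MIAS images.
acc = accuracy(tp=58, tn=200, fp=5, fn=3)
print(f"{acc:.4f}")  # 0.9699
```

Under this assumption, 258 of the 266 images are handled correctly, which corresponds to a segmentation accuracy of about 97%.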

Quantitative Analysis for the Proposed MDBC
The FD values are extracted with low computational time. The simulation values of FD, Lacunarity, and Succolarity obtained from the MDBC method are shown in Tab. 2. The succolarity computation evaluates the degree of percolation in the region of interest. For a region with smooth contours, the flow is very low because the mass region inside the rectangle occupies more space. Consequently, in our work the benign region has a lower succolarity value than the malignant one. Fig. 4 shows the region with the smooth contour of benign type and the spiculated mass of malignant type flooded in all directions using Eq. (17). After pre-processing, the PSNR of the image is 46 dB, which indicates that the noise is removed effectively. Slight changes in the PSNR of the images do not much affect the performance of the model.
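For completeness, PSNR between an original and a pre-processed image can be computed with the standard definition, 10·log10(MAX² / MSE). The function name `psnr` and the 8-bit peak value of 255 are assumptions; how the 46 dB figure above was measured is not stated.

```python
import numpy as np

def psnr(reference, processed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = np.asarray(reference, dtype=float)
    proc = np.asarray(processed, dtype=float)
    mse = np.mean((ref - proc) ** 2)
    if mse == 0:
        return float("inf")     # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

As a point of reference, two 8-bit images that differ by exactly one grey level at every pixel have a PSNR of about 48.13 dB.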

Comparative Analysis
The results of the proposed MDBC are compared with existing work in Tab. 3. Although [16,17] achieved excellent results, they segmented the mass region manually. In [18], good results were attained with the Local Photometric Attributes method. In [34], 34 features were extracted in various categories, viz. intensity, GLCM (texture), shape, texture, and margin, and good results were obtained with an SVM classifier. The proposed model applies suitable box sizes and the Under Counting Shifting rule to handle the over-counting problem effectively. Our automatic method achieves good results on both databases. However, the proposed MDBC model failed to detect masses with blurred edges and ill-defined shapes, which also affected the feature extraction step. Combining it with other contour detection techniques in future may improve the results. The results in Tab. 1 show the efficiency of the proposed MDBC framework in automatic mass region detection, and the feature extraction and classification accuracy is reported in Tab. 3. Comparing the proposed MDBC with recently developed methods shows that our method performs better in automatic detection and classification. The proposed model has the advantage of applying suitable box selection to improve performance. The ROC curves of the proposed model for the normal, benign, and malignant classes are shown in Fig. 5. The area under the ROC curve is 1 for normal, 0.89 for benign, and 0.89 for malignant.

Conclusion
In the present research, an automated CAD framework that works without human intervention was proposed. Most existing systems depend on the radiologist to select the mass region, which requires considerable time. The proposed framework performs automatic mass region identification and segmentation, which proved effective for all sizes of masses. The second part of the framework proposes MDBC for fractal feature computation, which solves the problems of box over-counting and under-counting in a simple way. The extracted features are fed to K-Nearest Neighbour (KNN) and Support Vector Machine (SVM) classifiers, which categorize the mammograms as normal, benign, or malignant. The KNN model is more efficient when more training data are available, and SVM has the capacity to analyze the features effectively. Tested on the mini-MIAS dataset, the method achieves an accuracy of 93%, whereas the existing method based on FD, GLCM, texture, and shape features achieves 91%. Future work involves detecting micro-calcifications at an early stage of breast cancer so that the model can act as an expert system.