CARM: Context Based Association Rule Mining for Conventional Data

Abstract: This paper develops an algorithm for extracting association rules, called the Context-Based Association Rule Mining algorithm (CARM), which can be regarded as an extension of the Context-Based Positive and Negative Association Rule Mining algorithm (CBPNARM). CBPNARM was developed to extract positive and negative association rules from spatiotemporal (space-time) data only, while the proposed algorithm can be applied to both spatial and non-spatial data. The proposed algorithm is applied to an energy dataset to classify a country’s energy development by uncovering the interdependencies between the variables in the form of positive and negative associations. The proposed algorithm extracts many association rules related to sustainable energy development, which need to be pruned. The context, in this paper, serves as a pruning measure to extract pertinent association rules from non-spatial data. The Conditional Probability Increment Ratio (CPIR), which was not used in CBPNARM, is also added to the proposed algorithm. The inclusion of the context variable and CPIR resulted in fewer rules and improved robustness and ease of use. Also, the extraction of common negative frequent itemsets in CARM is different from that of CBPNARM. The rules created by the proposed algorithm are more meaningful, significant, relevant and insightful. The accuracy of the proposed algorithm is compared with that of the Apriori, PNARM and CBPNARM algorithms. The results demonstrate enhanced accuracy, relevance and timeliness.


Introduction
We live in an information age in which almost everything is transferred to computers and the use of information systems has become a necessity of life. Knowledge extraction from data takes place through the data mining process. Data mining is a step-by-step process that begins with data analysis and proceeds to classification/prediction and the finding of trends and patterns [1,2]. A variety of data mining techniques are used for classification, association rule extraction, clustering and regression analysis. The accumulation of data in databases using different devices produces a pool of data that serves as a foundation for knowledge generation. The size of the data and the reliability of knowledge extraction are directly proportional to one another. With the advent of internet technologies and community applications, millions of users are generating data every minute, and the growth of the repositories storing this data is exponential. As a result, human dependence on data has also increased. Numerous challenges in text mining, web analytics and knowledge discovery have emerged [3]. The discovery of knowledge from databases is a non-trivial process of identifying logical, understandable and innovative patterns in the data [4]. Knowledge extracted through data mining can take different forms, such as rules, clusters, decision trees, classes, rough sets and many others [5][6][7].
Data mining prepares the data for processing by repairing erroneous and blank data fields, stores the cleaned data in the warehouse and finally applies algorithms to it [1]. Data mining leads to classes, clusters, rules and predictions [8]. It can be applied to different datasets, including educational data [9,10], spatial data [7], satellite data [2,11], scientific experiments [12] and biological data [13]. Association rule mining is used to discover the interdependencies between the variables of a dataset and reveals hidden patterns among data and variables that occur together with high frequency. A comprehensive review of the association rule extraction algorithms is provided by AI et al. [14]. Wu et al. [5] emphasized the importance of negative association rules, which had not previously been taken into consideration in the mining of association rules. A typical association rule of the form (X → Y) is positive if it indicates that the presence of X is associated with the presence of Y. (X → Y) is a negative association rule if the presence of X assures the absence of Y in the database. Many studies have been carried out to mine positive and negative association rules from different datasets [15][16][17][18].
Shaheen et al. [7] introduced a variable called context that can be used to mine valid positive and negative association rules. Ignoring the context can produce rules that appear valid but are false; such rules satisfy the support criteria and are included in the final rule set. For example, the higher selling rate of sanitisers in summer produces a rule (summer → High_sanitiser_sale), whereas the actual reason for the increase in the sale of sanitisers in the last two summers was the spread of coronavirus. Thus the spread of the coronavirus in this example is a context variable. Such a variable may not be permanently stored as an attribute in the database, or it may be ignored. Context variables can therefore affect the validity of the retrieved association rules. CBPNARM [7] uses context variables and has given very good results in terms of the number of rules, confidence and interestingness. The use of a context variable for mining association rules can also be found in other studies, but there the definition of context seems confined to time and location [19].
The CBPNARM algorithm was developed for the extraction of spatiotemporal association rules and was applied only to spatial data. Spatial data differs from conventional data in that it relates directly or indirectly to a location on earth. Spatial data attributes combine to represent an image that is drawn on a geographic information system (GIS) or other similar information systems [20]. Attributes that are not spatial are called non-spatial attributes and are known as characteristic data. A context-based algorithm is therefore also required for non-spatial data, one that can be applied to non-spatial GIS data and to any other dataset established through conventional data procedures. The Apriori algorithm is the one most commonly used for positive association rule mining on these datasets.
The Apriori algorithm proposed by Agrawal et al. [21] is used to derive the relationship between frequent items of transactional databases. An Apriori association rule is written as (Antecedent → Consequent) and can be read as "if the antecedent happens, it is more likely that the consequent happens." The selection criteria for a rule in the final rule set differ depending on the algorithm. The most common are support, confidence, lift, interestingness, dependency, etc. Apriori uses support, confidence and lift to select rules [22]. The Apriori algorithm only looks for positive association rules. An exceptionally large number of rules is mined when the database is considered for extracting both positive and negative association rules. Different pruning measures are proposed by Wu et al. [5] to reduce the number of positive and negative association rules, thus increasing the usefulness of the outcomes. The context variable in CBPNARM also served as a pruning technique to reduce the final set of association rules. The value of the context can sometimes cause an association rule to violate the validation criteria, in which case the rule is pruned and not included in the final rule set. The influencing factor, that is to say, the context variable, may alter the value of another variable, which may cause the final rule to change [7,12]. When the context variable is taken into account, the patterns and rules generated may be more accurate and meaningful.
The proposed algorithm is implemented for sustainable energy development indicators. Sustainability in the energy sector is a primary need of almost every country in the world. The Commission on Sustainable Development has provided a list of indicators [23] that were refined by the International Atomic Energy Agency (IAEA) for its use in evaluating sustainable energy development [24]. These sustainability indicators are used in many studies to assess energy development [8], energy security [25], environmental impacts [26], energy poverty [27], and energy consumption and relationships with one another [28]. A classification mechanism for a country's energy development is developed by Shaheen et al. [8] using only quantifiable indicators. The algorithm proposed in this study is implemented for the same dataset. Applied to the sustainable energy indicators, the algorithm returned association rules that define the co-varying sustainability metrics. Depending on the value and extent of the covariance between these indicators, a decision-maker can develop an optimum plan to ensure the sustainability of the energy sector. This paper is intended to develop an algorithm for exploring positive and negative context-based association rules for conventional/characteristic data as an extension of the CBPNARM algorithm. The accuracy of the proposed methodology is compared with Apriori and CBPNARM at the methodological level and with the categorization of sustainable energy development at the application level. The contributions made in this study are given below:
1) The CBPNARM algorithm was designed for spatial data only. CARM, the algorithm proposed in this paper, can be applied to non-spatial or conventional numeric and ordinal data.
2) The algorithm is applied to energy datasets to mine rules for energy sustainability.
3) CPIR is not used in the CBPNARM algorithm, because it increased the complexity of CBPNARM while the results were not remarkable. CPIR is added to the proposed algorithm.
4) The extraction of negative frequent items in CARM differs from that of CBPNARM.
5) The four cases of the CARM algorithm given in the pseudo-code differ from those of CBPNARM.

Indicators for Sustainable Energy Development
Energy plays a vital role in eliminating scarcity and raising the standard of human life [29]. The world has acknowledged that sustainable energy development is important. In 2005, the Commission for Sustainable Development (CSD) recognized the role of the energy sector in the sustainable development of a country [23]. A list of 30 energy sustainability indicators was finalized. These indicators are classified into three categories that are essential ingredients for sustainability: (1) the social domain, (2) the economic domain and (3) the ecological domain. The social domain of sustainability indicators is divided into equity and health, as shown in Tab. 1. Equity is about equitable access to, and the availability of, all energy resources at an affordable price. Health covers safe access to energy by addressing accidents in the fuel cycle and eradicating problems related to air pollutants, etc. The social domain indicators selected for this study are placed in the first section of Tab. 1.

Tab. 1: Energy sustainability indicators and their components
No. | Indicator | Components
- | Energy use per capita | Energy use, Total population
4 | Energy use per unit of GDP | Energy use, GDP
5 | Efficiency of energy conversion and distribution | Losses in electricity generation, transmission and distribution
6 | Reserves-to-production ratio | Proven recoverable reserves, Total energy production
7 | Resources-to-production ratio | Total estimated resources, Total energy production
8 | Value added by energy in industrial sector | Use of energy in industry, Value added
9 | Value added by energy in agriculture | Use of energy in agriculture, Value added
10 | Value added by energy in service sector | Use of energy, Value added
11 | Value added by energy in household | Use of energy in household, Value added
- | Value added by energy in transport | Use of energy in transport, Value added
- | Fuel shares in energy and electricity | Primary energy supply and final consumption by fuel type, Total primary energy supply and final consumption
12 | Non-carbon energy share in energy and electricity | Non-carbon energy supply and final consumption, Total primary energy supply and final consumption
13 | Renewable energy share in energy and electricity | Renewable energy supply and final consumption, Total primary energy supply and final consumption
14 | End-use energy prices by fuel and by sector | Energy prices with and without tax
15 | Net energy import dependency | Energy imports, Total primary energy supply
16 | Stocks of critical fuels per corresponding fuel | Stocks of critical fuel, Critical fuel consumption
- | Ratio of waste generated in energy production to energy obtained | Amount of generated waste from the source, Total energy production from the source
21 | Ratio of waste properly disposed of total generated solid waste | Disposed solid waste, Total solid waste
22 | Ratio of solid radioactive waste to units of energy produced | Units of solid radioactive waste, Energy produced
23 | Ratio of solid radioactive waste awaiting disposal to total generated solid radioactive waste | Solid radioactive waste awaiting disposal, Total solid waste

The economic domain of sustainability indicators can be divided into consumption, production patterns and security of supply. The indicators related to the consumption and production of energy include energy use per GDP per capita, energy supply efficiency, energy production, etc. The ecological domain covers the impacts of energy on atmosphere, water and land [30]. IAEA [23] did not consider the institutional dimension of sustainability because the data associated with this aspect was unquantifiable. The report also suggested some auxiliary statistics that measure demographics, wealth, economic development, transportation, urbanization, etc. These measures include GDP per capita, population, shares of sectors in GDP, distance travelled per capita, freight transport, income inequality, floor area per capita and manufacturing value. The commission also recommended the analysis of time-series data, the preparation of data for analysis and the interpretation of the data collected for that purpose. This study specifically followed the recommendations of the report and proposed an algorithm for such an assessment. CBPNARM, being designed specifically for spatiotemporal data mining, does not exactly fit the needs of this problem.
The basis for the selection of energy sustainability indicators for this study is identical to that proposed by Shaheen et al. [8], where quantifiable and available indicators were selected. In this study, only indicators for which data are available on online energy portals are selected. The list of selected sustainability indicators is given in Tab. 1. Data for the attributes marked in the grey-shaded boxes in Tab. 1 were not available, so these attributes were excluded from the database. Data for 16 of the remaining 23 attributes were readily available, while the data for the rest were derived from the available datasets.

Support
Support measures the frequency of an itemset in the database [31,32]. The support of an association rule X → Y is 0.6 if X and Y appear together in 60% of the transactions of a transactional database T. The equation to compute support is given below:
$$\mathrm{Supp}(X \cup Y) = \frac{|X \cup Y|}{|T|} \qquad (1)$$
where |X ∪ Y| represents the size of the set of transactions containing both X and Y, and |T| is the total number of transactions in T.

Confidence
Confidence is an indication of how often a rule is true [32,33]. The confidence of an association rule X → Y is 1 if, for example, X appeared in the database 10 times and Y appeared with X in all of those transactions. The equation to compute confidence is given below:
$$\mathrm{Conf}(X \rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)} \qquad (2)$$

Lift
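Lift is used to measure the correlation between the antecedent and the consequent of an association rule [31,32]. The lift of an association rule X → Y is 1 if X is not correlated with Y. Lift is computed by the equation given below:
$$\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)\,\mathrm{Supp}(Y)} \qquad (3)$$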
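The three measures described so far can be illustrated with a minimal Python sketch. It assumes the standard definitions (support as the fraction of transactions containing an itemset, confidence as Supp(X ∪ Y)/Supp(X), and lift as Supp(X ∪ Y)/(Supp(X) Supp(Y))); the function names and the toy transactions are illustrative only and are not taken from the CARM implementation.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Supp(X u Y) / Supp(X); returns 0.0 when X never occurs."""
    x, y = frozenset(x), frozenset(y)
    supp_x = support(x, transactions)
    if supp_x == 0:
        return 0.0
    return support(x | y, transactions) / supp_x


def lift(x: Iterable[str], y: Iterable[str],
         transactions: List[FrozenSet[str]]) -> float:
    """Supp(X u Y) / (Supp(X) * Supp(Y)); returns 0.0 if either side never occurs."""
    x, y = frozenset(x), frozenset(y)
    supp_x = support(x, transactions)
    supp_y = support(y, transactions)
    if supp_x == 0 or supp_y == 0:
        return 0.0
    return support(x | y, transactions) / (supp_x * supp_y)


# Hypothetical toy data: A and B occur together in 3 of 5 transactions.
toy = [frozenset(t) for t in ({"A", "B"}, {"A", "B"}, {"A", "B"}, {"A"}, {"B"})]

print("support   :", support({"A", "B"}, toy))
print("confidence:", confidence({"A"}, {"B"}, toy))
print("lift      :", lift({"A"}, {"B"}, toy))
```

In this toy data the lift comes out slightly below 1, so A and B are mildly negatively correlated even though they co-occur in most transactions.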

Interestingness
Interestingness is a measure used to find potentially positive and potentially negative itemsets in a dataset. A rule X → Y is not interesting if its support is less than the product of the individual supports of X and Y [5].
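As a hedged illustration, the sketch below assumes that the measure follows the interest function of Wu et al. [5], commonly written as |Supp(X ∪ Y) − Supp(X) Supp(Y)| and compared against a user-chosen minimum. The names `interest`, `is_interesting` and `min_interest` are assumptions made for this example and may not match the paper's own formulation.

```python
def interest(supp_xy: float, supp_x: float, supp_y: float) -> float:
    """Absolute gap between the joint support and the support expected
    if X and Y were independent (assumed form of the interest measure)."""
    return abs(supp_xy - supp_x * supp_y)


def is_interesting(supp_xy: float, supp_x: float, supp_y: float,
                   min_interest: float) -> bool:
    """Keep a candidate only if its interest reaches the chosen threshold."""
    return interest(supp_xy, supp_x, supp_y) >= min_interest


# Hypothetical supports: joint support 0.6, individual supports 0.8 each.
print(interest(0.6, 0.8, 0.8))
print(is_interesting(0.6, 0.8, 0.8, min_interest=0.1))
```

The sign of Supp(X ∪ Y) − Supp(X) Supp(Y), which the absolute value discards, indicates whether the itemset is a candidate for a positive or a negative rule.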

CPIR
The conditional-probability increment ratio (CPIR) of a rule is computed based on the dependence between the antecedent and the consequent. In an association rule X → Y, Y is positively dependent on X if the lift of X → Y is greater than 1 and negatively dependent if it is less than 1. The dependence, when evaluated as per Eq. (5), returns the value of CPIR [5].
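A minimal sketch of CPIR is given below. It assumes the piecewise definition from Wu et al. [5]: (p(Y|X) − p(Y)) / (1 − p(Y)) when p(Y|X) ≥ p(Y), and (p(Y|X) − p(Y)) / p(Y) otherwise. Whether Eq. (5) of this paper has exactly this form is an assumption, and the function name `cpir` is illustrative.

```python
def cpir(supp_xy: float, supp_x: float, supp_y: float) -> float:
    """Conditional-probability increment ratio of Y given X (assumed form).

    Positive values indicate positive dependence, negative values negative
    dependence, and 0 indicates independence.
    """
    if supp_x <= 0:
        raise ValueError("supp_x must be positive to condition on X")
    p_y_given_x = supp_xy / supp_x
    if p_y_given_x >= supp_y:
        # If Y occurs in every transaction, X cannot raise its probability.
        if supp_y >= 1.0:
            return 0.0
        return (p_y_given_x - supp_y) / (1.0 - supp_y)
    # Here p(Y|X) < p(Y), so supp_y is strictly positive.
    return (p_y_given_x - supp_y) / supp_y


# Hypothetical supports for three candidate rules.
print(cpir(0.40, 0.5, 0.5))  # Y more likely given X
print(cpir(0.25, 0.5, 0.5))  # independent
print(cpir(0.10, 0.5, 0.5))  # Y less likely given X
```

Under this definition CPIR lies between −1 and 1, which makes a single threshold on its absolute value usable for both positive and negative rules.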

Context
Context is the state of the entity, environment or action that can affect the results of association rule mining. The value of the context variable must be within its normal range for a matching rule to be valid. For example, a change in vegetation color in an area may indicate an emergency below the earth's surface, such as the presence of a volcano. If the value of the "waterflood" context variable is outside its normal range, however, the color may have changed because of the waterflood rather than a volcano. In this example the waterflood is the context variable, and its value for this rule was above the normal range [7]. A change in the value of the context variable can give rise to four cases, which are addressed in [7].
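The range check described above can be sketched as follows. This is only an illustration of the idea that a rule is trusted when the context value lies in its normal range; it does not reproduce the four cases of [7]. The class `ContextVariable`, the function `rule_is_trusted` and the numeric range for "waterflood" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextVariable:
    """A context variable with a user-supplied normal range (assumed inclusive)."""
    name: str
    normal_low: float
    normal_high: float

    def is_normal(self, observed: float) -> bool:
        return self.normal_low <= observed <= self.normal_high


def rule_is_trusted(rule_passes_thresholds: bool,
                    context: ContextVariable,
                    observed: float) -> bool:
    """Trust a rule only if it met the mining thresholds AND the context
    variable was within its normal range when the data were recorded."""
    return rule_passes_thresholds and context.is_normal(observed)


# Hypothetical normal range for the waterflood example.
waterflood = ContextVariable("waterflood", normal_low=0.0, normal_high=10.0)

print(rule_is_trusted(True, waterflood, observed=4.0))   # context normal
print(rule_is_trusted(True, waterflood, observed=25.0))  # context out of range
```

In practice the normal range would be supplied by the user or domain expert who selects the context variable.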

Proposed Method
The method proposed for extracting positive and negative association rules from conventional datasets is named CARM and depends on support, confidence, interestingness, CPIR and the value of the context variable. This method extracts rules from non-spatial datasets. CBPNARM [7] was developed as an extension of [5,34] and has been used in some successful studies [12,35]. The proposed algorithm is an extension of CBPNARM. For a positive association rule, BaseSupVal is the user-defined threshold value of support. According to Eqs. (1) and (2), support is defined by Supp(X ∪ Y) and confidence by Supp(X ∪ Y)/Supp(X). For a negative association rule, BaseSupValNeg is the user-defined threshold value of support. The aforementioned mathematical procedures generate a large number of positive and negative association rules. The interestingness measure proposed by [5] is used to apply first-level pruning. The interestingness of the rules can be calculated using Eq. (4). After the first-level pruning through the interestingness measure, second-level pruning is applied to further reduce the number of rules. The second-level pruning measure is the CPIR, which is defined in Eq. (5). All rules that are positively or negatively dependent are eligible to be included in the final rule set. At this level of pruning, only the rules in which the antecedent and consequent are independent of one another are omitted. The values of the context variable are then taken into account to evaluate the validity of the rules included in the final rule set. The four possible cases for the context variable, as given in Tab. 2, are then applied. Rules that were wrongly added to the final list due to an out-of-range value of the context variable are omitted. Rules that were erroneously omitted on these grounds are added to the final list.
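The two pruning levels described above can be sketched as a small filter over candidate rules. This is a simplified illustration rather than the CARM pseudo-code: it assumes the interest and CPIR forms of Wu et al. [5], omits the context step, and labels a rule by the sign of its dependence. The names `Candidate`, `prune`, `min_interest` and `min_abs_cpir`, and all support values, are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    antecedent: str
    consequent: str
    supp_x: float   # support of the antecedent
    supp_y: float   # support of the consequent
    supp_xy: float  # joint support


def cpir(c: Candidate) -> float:
    """Assumed CPIR form: increment of p(Y|X) over p(Y), normalised."""
    p_y_given_x = c.supp_xy / c.supp_x
    if p_y_given_x >= c.supp_y:
        if c.supp_y >= 1.0:
            return 0.0
        return (p_y_given_x - c.supp_y) / (1.0 - c.supp_y)
    return (p_y_given_x - c.supp_y) / c.supp_y


def prune(candidates: List[Candidate], min_interest: float,
          min_abs_cpir: float) -> List[Tuple[Candidate, str]]:
    kept = []
    for c in candidates:
        if c.supp_x <= 0:
            continue  # cannot condition on an antecedent that never occurs
        # Level 1: drop candidates whose interest is below the threshold.
        if abs(c.supp_xy - c.supp_x * c.supp_y) < min_interest:
            continue
        # Level 2: drop candidates that are (nearly) independent.
        score = cpir(c)
        if abs(score) < min_abs_cpir:
            continue
        kept.append((c, "positive" if score > 0 else "negative"))
    return kept


candidates = [
    Candidate("X1", "Y1", 0.5, 0.5, 0.40),  # co-occur more than expected
    Candidate("X2", "Y2", 0.5, 0.5, 0.25),  # independent
    Candidate("X3", "Y3", 0.5, 0.5, 0.10),  # co-occur less than expected
]
for rule, kind in prune(candidates, min_interest=0.05, min_abs_cpir=0.1):
    print(rule.antecedent, "->", rule.consequent, kind)
```

The context check would follow as a third step, removing or restoring rules according to whether the context variable was within its normal range.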
The proposed algorithm for context-based association rule mining is given in the section below: The time complexity of the proposed algorithm is O(N^2) when both the number of years and the number of countries are considered. However, if the number of countries is fixed at its maximum, the time complexity is O(N), where N represents the number of years. The working of the proposed method is given in Fig. 1 below. In Fig. 1, the values of energy indicators are stored in a database that is then discretized to convert the data from conventional numeric format to ordinal format. Frequent itemsets are obtained from the dataset based on support, confidence, interestingness and CPIR thresholds. The positive and negative association rules are then mined and evaluated using the values of the context variables. The context variable in each dataset is selected by the user/domain expert. The possible cases for the context are also given in Fig. 1, the details of which appear in the algorithm above.
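The discretization step of this workflow can be sketched as follows, assuming that yearly values are first averaged per country and then mapped to fixed-width intervals. The interval width of 5, the helper names `average_over_years` and `to_interval`, and the sample values are assumptions made for illustration; the actual range boundaries used in the study are not stated in the text.

```python
from statistics import mean
from typing import Dict


def average_over_years(series: Dict[str, Dict[int, float]]) -> Dict[str, float]:
    """Collapse {country: {year: value}} to {country: mean value}."""
    return {country: mean(by_year.values()) for country, by_year in series.items()}


def to_interval(value: float, width: float = 5.0) -> str:
    """Map a numeric value to the label of its fixed-width interval."""
    lower = int(value // width) * width
    return f"{lower:g}-{lower + width:g}"


# Hypothetical yearly values of one indicator for two countries.
indicator = {
    "C1": {1990: 2.0, 1991: 4.0},
    "C2": {1990: 11.0, 1991: 14.0},
}

averaged = average_over_years(indicator)
discretized = {country: to_interval(v) for country, v in averaged.items()}
print(discretized)
```

Each country then carries one ordinal label per indicator, which is the form in which itemsets are counted for support.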

Experiments and Results
The proposed algorithm is implemented in Python in a Jupyter notebook; Python is an open-source programming language. The experiment is performed on a machine with a 2.11 GHz i7 processor, 16 GB RAM and a 500 GB hard disk, running the Windows 10 operating system with the necessary network connectivity. Data for 23 sustainable energy development indicators are collected for 28 countries over 25 years from 1990 to 2015. All data are collected from online energy data portals. Energy sustainability indicators contain quantifiable and unquantifiable attributes, of which the quantifiable attributes are used in this study. Data were not available in the online sources for all 30 attributes, so 23 of the 30 attributes are included in the final database. For some attributes, data were not available through online sources but could be derived from the available attributes. The context variables taken into consideration for the study of sustainable energy development are presented in Tab. 3.
In the first phase of the experiment, the data are averaged and discretized to produce significant associations. The data have three dimensions (the value of the sustainability indicator, the country and the year), so for discretization it was necessary to convert the data into two dimensions. The values of each indicator were averaged over the 25 years to obtain one value. The process of discretization was straightforward: range values are determined for all data attributes, and the data are converted from values to ranges. An example for three indicators can be found in Fig. 2, which illustrates the discretization of sustainability indicators for countries C1 to C6. The table on the left of Fig. 2 shows the non-discretized values, which are converted into the table on the right. For example, the SI2 value of C1 is converted to the interval 0-5 after discretization.
The results shown in Fig. 2 show the relationship between the various energy SIs. The covariance of SI17 with SI19 shows that the greenhouse gas emissions caused by energy products have a strong association with the rate of deforestation caused by energy products. Based on this pattern, an energy decision-maker can build an optimal plan for sustainable energy development in the future. Another question that the decision-maker can raise relates to the extent of the covariance between SI17 and SI19. This can be calculated by using the CPIR, interestingness, support and confidence measures.
A significant number of positive and negative association rules were extracted from the dataset using the CARM algorithm. It was nearly impossible to learn from so many rules. The different levels of pruning described in the proposed method are therefore used. Some of the final rules extracted after pruning are given in Fig. 3, and the detailed reduction in the number of rules after each pruning level is given in Tab. 4. Fig. 3 gives a snapshot of the extracted rules. SI in the figure represents a sustainability indicator and C represents a country. SI3 ⇒ SI4 indicates that SI4 varies with SI3, and C1 ⇒ C17 indicates an association between the two countries represented by C1 and C17. Examples of negative rules from the dataset are also shown in Fig. 3. Tab. 4 summarizes the total number of positive and negative association rules in different scenarios.
The results of our algorithm are also compared to some of the existing association rule mining algorithms, including Apriori, PNARM, CBPNARM with normal context and CBPNARM with out-of-range context. The algorithms are compared in terms of the number of rules, the average confidence of the rules, the average dependence and the execution time. Two plots, Figs. 4 and 5, show the number of rules extracted by the different algorithms. The number of rules retrieved without applying a pruning measure is shown in Fig. 4. In Fig. 5, the number of rules extracted after the pruning measures are applied is reported. The number of rules extracted in these two plots is largest at a support of 0.2. CARM extracts the fewest rules, although it extracts both positive and negative rules. The reason there are fewer rules is the inclusion of different pruning measures. Including the context variable in this dataset further reduced the number of rules in this case. Some rules were pruned by other pruning measures, but the context variable included them in the final list. Interestingly, the number of rules extracted by CARM after pruning exceeds that extracted before pruning. The reason for this increase in the number of rules after pruning is the inclusion of the context variable. The context variable added a few rules to the final set that were not included when the value of the context variable was in the normal range. This is the only case where the number of rules increased after applying the context variable. As mentioned earlier, the context variable can also increase the number of rules by adding to the final rule set those rules that were previously ignored because of the out-of-range value of the context variable. The number of rules after applying the pruning measures is shown in Fig. 5. CARM is at the lowest level, which is expected because it extracts more meaningful rules, which helps in reaching a decision.
The average confidence graphs for the unpruned and pruned rules extracted by all algorithms are given in Figs. 6 and 7, respectively. The average confidence of the proposed algorithm is greater in most cases. For support values 0.2 and 0.4, it is practically equal to that of CBPNARM. This is because the context variable value for the sustainability indicators dataset was normal in most cases, so the net impact on the final association rule set was low. Rule confidence indicates that the CARM algorithm produces rules with higher certainty.
The higher average confidence value for CARM indicates that the rules extracted are not unfamiliar. There is a higher co-occurrence of antecedent and consequent, and the consequent is less often accompanied by any other antecedent. This supports the certainty of the rules extracted from the database. The average dependence plots are almost the same with and without pruning, so a single plot is given in Fig. 8. PNARM performed best over the given dataset because the PNARM algorithm is intended to maintain rule dependence through interestingness, dependence and CPIR. The proposed algorithm applies interestingness and CPIR, so its curve lies below that of PNARM but above those of the rest.
Fig. 9 illustrates the execution time of all algorithms. After integrating all pruning techniques, the execution time of the proposed algorithm is lower than that of PNARM and CBPNARM and higher than that of the Apriori algorithm. The reason is that the Apriori algorithm mines frequent items only, and only level-1 pruning is done in Apriori. PNARM, CBPNARM and CARM all consider various pruning measures, so their execution time is expected to be at the top end. The execution time of CARM is less than that of CBPNARM because CBPNARM was developed for spatial data, which is a pseudo form of image data. CARM, even with an additional pruning measure, took less execution time than CBPNARM. The execution time of CARM is second to last if no pruning technique is used for association rule mining.
The algorithm was designed to improve the quality of the association rules extracted from the datasets, so comparing the algorithms based on precision, recall and F-measure gives a clearer picture. The comparison of the algorithms based on average values over multiple energy datasets is shown in Tab. 5. The rules extracted from the dataset are divided into true positives, false positives, true negatives and false negatives according to the measures above. Higher precision, recall and F-measure for the CARM algorithm indicate that the algorithm has extracted more useful rules. The values given in Tab. 5 are calculated by comparing the rules extracted by the algorithms with the real rules that are used in the energy sector and validated by a domain expert.
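The comparison against expert-validated rules can be sketched with the usual definitions of precision, recall and F-measure over sets of rules. The sketch below is illustrative only: the function name `precision_recall_f1`, the representation of a rule as an (antecedent, consequent) pair and the sample rules are assumptions, and the figures in Tab. 5 were not produced by this code.

```python
from typing import Set, Tuple

Rule = Tuple[str, str]  # (antecedent, consequent), an assumed representation


def precision_recall_f1(extracted: Set[Rule],
                        reference: Set[Rule]) -> Tuple[float, float, float]:
    """Score extracted rules against a reference set of validated rules."""
    tp = len(extracted & reference)   # extracted and validated
    fp = len(extracted - reference)   # extracted but not validated
    fn = len(reference - extracted)   # validated but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical rule sets.
extracted = {("A", "B"), ("C", "D"), ("E", "F")}
reference = {("A", "B"), ("C", "D"), ("G", "H")}

p, r, f = precision_recall_f1(extracted, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

True negatives do not enter these three measures, which is why they suit rule mining, where the set of rules correctly left out is very large.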

Conclusion and Future Work
The CARM algorithm for mining context-based association rules is proposed in this paper as an extension of the CBPNARM algorithm. A few association rule pruning techniques, including confidence, interestingness and CPIR, are incorporated into the CARM algorithm to improve insights by decreasing the number of rules extracted. The context is used in the algorithm to eliminate certain rules from the final rule set and/or to add rules that were excluded from it because of an out-of-range value of the context variable. The algorithm is applied to sustainable energy indicators to find co-varying sustainability indicators and countries for sustainable energy development. The rules produced by CARM are more robust, relevant and insightful in terms of average confidence, dependence and relevance.
The proposed method outperformed the previous methods in terms of the number of rules generated, confidence and dependency. The inclusion of the context variable and CPIR reduced the number of rules and increased the robustness and usability of the rules. The confidence and dependency values show that having fewer rules does not imply a loss of useful patterns. The execution time of the algorithm is higher than that of a few other algorithms, which is expected due to the additional functions added for the context variable and CPIR. The complexity of the algorithm can be improved in future by using object-oriented approaches for the context variable and CPIR.
The results obtained in the application domain of sustainable energy development are also insightful: they revealed interesting covariances among the indicators and underlined the criticality of some countries for their energy development. The energy sector in a country can use the associations derived from the proposed method to construct an optimal plan to ensure sustainable energy development. The associations among sustainability indicators can lead the energy sector to devise a plan according to the individual deficiencies of energy development and its relation with other developmental factors. Thus, the study can help an energy sector achieve optimal energy development without compromising the economy, ecology and social justice that are essential ingredients for sustainability. The work can be extended to automate the selection of the context variable, because manually selecting context variables can add some bias to the results. An automated mechanism for interpreting negative association rules can also be added to the algorithm in future work. Different classification algorithms and learning approaches can be added to the system to reduce the complexity arising from the data structure.