A new secured database management system (DBMS) architecture using intrusion detection systems (IDS) is proposed in this paper for organizations with no previous role mapping for users. A simple representation of Structured Query Language (SQL) queries is proposed to permit easy use of the proposed clustering algorithm. A new clustering algorithm that uses a tabu search with adaptive memory is applied to database log files to create users’ profiles. Then, queries issued by each user are checked against the related user profile using a classifier to determine whether or not each query is malicious. The IDS will stop query execution or report the threat to the responsible person if the query is malicious. A simple classifier based on the Euclidean distance is used: the issued query is transformed into the proposed simple representation, and the Euclidean distance between the issued query and each profile’s center is calculated. A synthetic data set is used for our experimental evaluations. Normal user access behavior in relation to the database is modelled using the data set. The false negative (FN) and false positive (FP) rates are used to compare our proposed algorithm with other methods. The experimental results indicate that our proposed method results in very small FN and FP rates.
Data can be considered a significant asset for all modern organizations, which may store huge amounts of data in their databases and clouds. Database management systems (DBMSs) are used by organizations to control access to these data by both internal and external users. Access must be protected, especially from competitors, and thus security and privacy are considered key issues for organizations seeking to protect their data [
The literature describes many threats to databases from both internal and external users. External users can access databases using web applications or computer networks [
Because internal users are allowed to bypass firewall systems and network security, internal threats can be more difficult to detect than external threats. Internal users can log in to the system and gain access to non-authorized resources [
Traditional DBMSs assign resources to users to control access to database resources. However, they cannot handle user queries or SQL injections, which can be defined as attacks in which malicious code is added to strings passed to the SQL server for execution [
SQL injections can be considered a particularly dangerous threat to database systems, as even legitimate users are able to apply them. Further, DBMSs cannot guarantee data privacy and security against these threats; thus, the traditional DBMS architecture must be adapted to new threats to ensure high data privacy and security for organizations [
IDS can be considered one of the most important parts of any well-secured system as they can detect malicious actions. They can be used to protect network resources and distinguish between malicious and legitimate transactions [
The literature suggests that building profiles for representing users’ normal access behavior is the main way to detect malicious transactions [
Only one profile needs to be built for each role, which minimizes the number of profiles; a role represents all users who have similar database access behavior. This helps the IDS easily control and maintain profiles for organizations with large numbers of users.
We can classify organizations into two types when creating an IDS for them: (1) organizations with role-based access control, where one or more roles are assigned to each user; and (2) organizations without a detailed role architecture assigned to each user [
In the first type, a role can be considered a package of authorizations for database resources, and there is a clear role hierarchy assigned to each user. In this case, a machine learning algorithm compares each issued query against the user’s roles and classifies it as a malicious transaction or otherwise, thereby permitting or denying the use of non-authorized resources.
In the second type, the application’s authorizations can be used by organizations as the role for each user. In this case, clustering algorithms can be applied to the database log file to group users who have similar interaction behavior into one profile. Intrusion-free log files, in which users’ normal behavior is guaranteed to contain zero intrusions, are used, and each resulting profile then serves as the role for detecting anomalous behavior. Every user has to be mapped to one profile. Finally, similar to the first type, machine learning algorithms can be used to detect malicious transactions.
The literature shows that most existing algorithms have several drawbacks. Therefore, a new secured DBMS architecture using IDS where there is no previous role mapping for users in an organization is proposed in this paper. A simple representation for SQL queries is proposed to enable the clustering algorithm to be easily used. A new clustering algorithm known as tabu search with adaptive search memory (TSASM) is applied to database log files to create user profiles that represent the roles of the relevant users. Then, each query issued by a user is checked against the related user profile using a classifier to determine whether or not the query is malicious. The IDS will stop query execution or report the threat to the responsible person if the query is malicious. A simple classifier based on the Euclidean distance is used: the issued query is transformed into the proposed simple representation, and the Euclidean distance between the issued query and the centers of all profiles is calculated. The IDS considers the issued query to be authorized if the user’s representative profile has the minimum distance to it. Otherwise, the issued query is considered malicious.
A synthetic data set is used for our experimental evaluations. Normal user access behavior for the database is modelled using this data set. The false negative (FN) and false positive (FP) rates are used to compare the proposed method with other methods. The experimental results indicate that the proposed method attains very small FN and FP rates.
The remainder of the paper is organized as follows. Section 2 reviews related work on intrusion detection (ID). Section 3 presents the proposed database ID based on a hybrid meta-heuristics algorithm, including the basics of the proposed IDS architecture, the proposed simple representation for SQL queries, and the proposed method to build the IDS. Section 4 explains the numerical experiments and algorithm evaluations. Section 5 presents conclusions.
The literature shows that most IDS research has been focused on the networks area. For example, a survey on network-based and host-based IDS was proposed in [
A payload-based network IDS was proposed in [
The literature indicates that little research has been undertaken on IDS for databases. Therefore, a new DBMS architecture is required that adapts the traditional DBMS architecture by adding an ID layer to handle internal threats.
A new secured DBMS architecture was proposed in [
Moreover, the detection of misuse in database systems (DEMIDS) approach was proposed in [
Another IDS approach involving the back-end database was proposed in [
An anomaly-based system was proposed for web databases in [
Another approach that focused on web database attacks from users who make a large number of requests was proposed in [
Another IDS approach that depended on information from database log files was proposed in [
Another database IDS using data-centric IDS was proposed in [
In summary, the literature shows that almost all of the existing approaches have several drawbacks. Thus, a new simple query representation in log files is proposed in this paper. This representation is based on using real values, instead of binary arrays, for the database tables and attributes; binary arrays require large memory resources when the numbers of tables and attributes are large. Further, a new clustering algorithm named TSASM is used in this paper. Role granularity is supported in building profile representations, where profiles represent roles instead of individual users.
The system architecture, data representation, and methodology for the proposed database ID based on the hybrid meta-heuristics algorithm are explained in the following subsections.
In this paper, the system architecture for the traditional DBMS has been adapted by adding the ID mechanism before the execution process for queries. Thus, the DBMS uses the IDS to analyze each query before execution.
In this proposed IDS architecture, the query transformer first works on the log files of the database to transform SQL queries into quiplets. Quiplets can be considered a simple representation of SQL queries. Second, the profile creator, which is a clustering technique, processes these quiplets, clustering the data from log files into profiles. All users with similar access behavior are represented using these profiles. Third, the detection engine, a machine learning algorithm known as the classifier, determines whether the users’ issued queries are malicious or not. Fourth, the response policy base identifies the action triggered when a malicious query is detected. Authorized queries can mistakenly raise an alarm in the proposed IDS: several occasional database user activities, such as fixing occasional problems, can be classified as anomalous activity, and as the detection engine may not recognize these occasional activities, it may treat them as intrusions. Further, each organization has policies in its response policy base for treating various situations. Sending an alarm to those responsible for the database, such as SSOs or DBAs, is an example of such policies. Last, the IDS uses authorized queries to update the profiles.
Extracting the information from log files is the most difficult challenge in the proposed approach. A query transformer is used to transform SQL queries into a representation applicable to classification and clustering algorithms. A simple representation for SQL queries, termed the quiplet, is used to build the profile blocks. Clustering algorithms are applied to quiplets to build user profiles that describe normal interactions.
Quiplets are built as short structures for SQL queries. Each query is transformed into quiplet form. The user may issue several commands, such as insert, delete, update, and select. The select query can be represented as follows:
SELECT (Target-List),
FROM (Relation-List),
WHERE (Qualification).
where SELECT is the command; Target-List includes all selected fields or attributes; Relation-List lists the tables from which data are retrieved, including any tables related through join conditions; and Qualification represents the conditions that must be fulfilled by the returned results.
Each query is divided into five partitions and thus is known as a quiplet, which is the main unit of profiles. The quiplet is expressed as (Q(C,SR,SA,PR,PA)), where the five parts are command, selection relations, selection attributes, propagation relations, and propagation attributes [
The quiplet representation form is applicable to all SQL commands, including insert, delete, and update. Like the select command, the delete and update commands have a qualification list and a target list. For the insert command, the relation and columns into which data are inserted are encoded as the projection relation and projection attributes, respectively.
The tables and attributes are represented as real numbers in the proposed query representation. The SR and SA, which represent the tables and attributes in the target list respectively, can be considered real-valued vectors, as can the PR and PA, which represent the tables and attributes in the qualification respectively. The latter also store the tables and attributes in the relation list when there is a join condition.
Table number | Table name | Column number | Column name |
---|---|---|---|
1 | R1 | 1 | A |
1 | R1 | 2 | B |
1 | R1 | 3 | C |
1 | R1 | 4 | D |
2 | R2 | 5 | E |
2 | R2 | 6 | F |
2 | R2 | 7 | G |
2 | R2 | 8 | H |
SQL command | Quiplet |
---|---|
SELECT R1.A, R1.C, R2.F, R2.H FROM R1, R2 | SELECT |
Each of the SR, SA, PR, and PA real-valued vectors is summarized by two computed values: (1) the average of the numbers in the vector; and (2) the standard deviation of the numbers in the vector. Further, the command text is represented as a real number. A new quiplet (n-quiplet) representation, written as Q(C, SRAvg, SRStd, SAAvg, SAStd, PRAvg, PRStd, PAAvg, PAStd), is used.
Quiplet | New Quiplet (n-quiplet) |
---|---|
SELECT | 1 |
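As a concrete illustration, the n-quiplet construction described above can be sketched in Python. The command encoding (SELECT → 1), the table/column numbering taken from the schema table above (R1, R2 → 1, 2; columns A–H → 1–8), and the use of the population standard deviation are assumptions for illustration only; the paper does not fix these details.

```python
from statistics import mean, pstdev

def n_quiplet(command_code, sr, sa, pr, pa):
    """Build an n-quiplet Q(C, SRAvg, SRStd, SAAvg, SAStd, PRAvg, PRStd,
    PAAvg, PAStd): each real-valued vector is summarized by its mean and
    standard deviation (population std is an assumption here)."""
    parts = [command_code]
    for vec in (sr, sa, pr, pa):
        parts.append(mean(vec))
        parts.append(pstdev(vec))
    return tuple(parts)

# Hypothetical encoding for
#   SELECT R1.A, R1.C, R2.F, R2.H FROM R1, R2 WHERE R1.B = R2.F
# target-list relations [1, 1, 2, 2], target-list attributes [1, 3, 6, 8],
# qualification relations [1, 2], qualification attributes [2, 6].
q = n_quiplet(1, [1, 1, 2, 2], [1, 3, 6, 8], [1, 2], [2, 6])
```

Note that the nine resulting values are all real numbers, which is what allows the Euclidean-distance machinery used later.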
There is no predefined role information in the database under consideration, and no role hierarchy is assigned to each user; thus, generating profiles can be considered an unsupervised operation. A clustering algorithm is employed to generate profiles from a training data set, and these generated profiles are used to represent the roles. Clusters are mapped to users: each user is mapped to the cluster that contains the maximum number of that user’s log-file records. One cluster may be allocated to one or more users. Further, a machine learning algorithm is used to classify each issued user query against the user-mapped cluster for the ID process.
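The user-to-cluster mapping described above can be sketched as follows; the user names and the (user, cluster) record format are hypothetical.

```python
from collections import Counter

def map_users_to_clusters(records):
    """records: iterable of (user_id, cluster_id) pairs, one per log entry.
    Each user is mapped to the cluster holding the most of their records."""
    per_user = {}
    for user, cluster in records:
        per_user.setdefault(user, Counter())[cluster] += 1
    return {user: counts.most_common(1)[0][0]
            for user, counts in per_user.items()}

log = [("alice", 0), ("alice", 0), ("alice", 1),
       ("bob", 1), ("bob", 1), ("carol", 0)]
profile_of = map_users_to_clusters(log)
```

Note that, consistent with the text, one cluster may serve several users (here both "alice" and "carol" map to cluster 0).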
A distance function is specified by the clustering algorithm to define the similarities and differences among quiplets. The distance between any two quiplets is computed using the Euclidean function. Quiplets are represented using real numbers, where each quiplet has nine values. Suppose that we have two quiplets named Q1(C1, SRAvg1, SRStd1, SAAvg1, SAStd1, PRAvg1, PRStd1, PAAvg1, PAStd1) and Q2(C2, SRAvg2, SRStd2, SAAvg2, SAStd2, PRAvg2, PRStd2, PAAvg2, PAStd2). The distance function (θ) between the Q1 and Q2 quiplets is the Euclidean distance taken over the nine corresponding components:

θ(Q1, Q2) = sqrt((C1 − C2)² + (SRAvg1 − SRAvg2)² + (SRStd1 − SRStd2)² + (SAAvg1 − SAAvg2)² + (SAStd1 − SAStd2)² + (PRAvg1 − PRAvg2)² + (PRStd1 − PRStd2)² + (PAAvg1 − PAAvg2)² + (PAStd1 − PAStd2)²)
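This distance translates directly into code; a minimal sketch over the nine-valued n-quiplet layout:

```python
from math import sqrt

def theta(q1, q2):
    """Euclidean distance between two nine-valued n-quiplets."""
    assert len(q1) == len(q2) == 9
    return sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)))

# Identical quiplets are at distance zero.
d = theta((1, 1.5, 0.5, 4.5, 2.69, 1.5, 0.5, 4.0, 2.0),
          (1, 1.5, 0.5, 4.5, 2.69, 1.5, 0.5, 4.0, 2.0))
```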
The TSASM clustering algorithm is used in this paper. As the K-means (K-M) algorithm [
The Euclidean distance method is used by TSASM to define the similarities and differences among data. The center of each cluster is calculated as the mean value for points belonging to the cluster. For instance, let
To measure the goodness of different sets of centers, the objective function is taken as the total within-cluster scatter, i.e., the sum over all clusters of the squared distances between each point and its cluster center:

f = Σⱼ Σ_{x ∈ Cⱼ} θ(x, cⱼ)²

where Cⱼ denotes the set of points assigned to cluster j and cⱼ denotes the center of that cluster.
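Under a standard K-means-style reading of this objective (an assumption, since the paper's formula is not reproduced here), the center computation and the objective can be sketched as:

```python
from math import sqrt

def euclid(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def center(points):
    """Cluster center = component-wise mean of the member points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def objective(clusters):
    """Sum of squared point-to-center distances over all clusters."""
    total = 0.0
    for pts in clusters:
        c = center(pts)
        total += sum(euclid(p, c) ** 2 for p in pts)
    return total

# Two toy clusters in 2-D (real quiplets would be 9-D).
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
f = objective(clusters)
```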
Each user is mapped to their representative profile once the profiles have been created using the clustering algorithm. A representative profile is the profile that includes the maximum number of that user’s submitted queries. The proposed ID mechanism can be considered a simple classifier, as it uses θ between the issued query and the center of the mapped profile. The classifier detects both authorized and non-authorized queries. If the query is authorized, it is used to update the profile and its center is re-computed. Otherwise, an action predefined in the response policy base is triggered by the IDS.
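The decision rule described in the text (a query is authorized only when the user's mapped profile is the nearest profile center) can be sketched as follows; the profile names and the two-dimensional centers are hypothetical.

```python
from math import sqrt

def dist(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def is_authorized(query_vec, user_profile, centers):
    """Authorized only if the user's mapped profile is the nearest
    profile center to the issued query; otherwise flagged as malicious."""
    nearest = min(centers, key=lambda pid: dist(query_vec, centers[pid]))
    return nearest == user_profile

centers = {"clerk": (1.0, 1.0), "admin": (8.0, 8.0)}
ok = is_authorized((1.2, 0.9), "clerk", centers)   # near own profile
bad = is_authorized((7.5, 8.1), "clerk", centers)  # nearer to "admin"
```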
A number of numerical experiments have been performed on a synthetic data set to verify the effectiveness of our proposed IDS algorithm. The data set is synthetically created to simulate users’ access patterns in the real world. A Zipf probability distribution function (PDF) is used to model the non-uniform user access and is defined for the random variable
where
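A Zipf PMF over ranks 1..N assigns probability proportional to 1/kˢ. A minimal sketch of sampling table accesses from such a distribution (the function and parameter names are illustrative, not from the paper):

```python
import random

def zipf_weights(n, s):
    """Unnormalized Zipf weights 1/k^s for ranks k = 1..n."""
    return [1.0 / (k ** s) for k in range(1, n + 1)]

def sample_zipf(n, s, size, rng=None):
    """Draw `size` ranks in 1..n with probability proportional to 1/k^s."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    return rng.choices(range(1, n + 1), weights=zipf_weights(n, s), k=size)

w = zipf_weights(4, 1.0)          # [1, 1/2, 1/3, 1/4]
draws = sample_zipf(4, 1.0, 1000)
```

With s = 1, rank 1 is drawn roughly four times as often as rank 4, which is the non-uniform access skew the data set is meant to model.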
In addition, uniform and multinomial PDFs are used in this paper. The multinomial distribution function can be considered a generalization of the binomial distribution, which is the probability distribution of the number of successes in a fixed number of independent trials
Suppose
A synthetic data set is generated to verify our proposed method by modelling real-world database log files. Each role
Malicious queries are created to simulate both internal and external threats. They are created from the same probability distribution via a different role number. For instance, when the role information related to a normal query is equal to 1, the role is simply changed to any role other than 1 to make the query anomalous.
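The role-flipping procedure described above can be sketched as follows; the dict-based query format is purely illustrative.

```python
import random

def make_anomalous(query, n_roles, rng=None):
    """Create a malicious variant of a normal query by reassigning its
    role label to any role other than the original one."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    other_roles = [r for r in range(1, n_roles + 1) if r != query["role"]]
    bad = dict(query)  # copy so the normal query is left untouched
    bad["role"] = rng.choice(other_roles)
    return bad

normal = {"role": 1, "tables": [1, 3], "columns": [2, 5]}
anomalous = make_anomalous(normal, n_roles=9)
```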
For our proposed method used to create the synthetic data set in this paper, the database scheme includes 20 tables and 10 columns in every table. It has nine roles, as shown in
Role No. | Table access | Col access |
---|---|---|
1 | Zipf(1-5,s) | Zipf(1,10) |
2 | Zipf(6-10,s) | Zipf(1,10) |
3 | Zipf(11-15,s) | Zipf(1,10) |
4 | Zipf(16-20,s) | Zipf(1,10) |
5 | Zipf(1-10,s) | Zipf(1,10) |
6 | Zipf(11-20,s) | R_Zipf(1,10) |
7 | Zipf(1-20,s) | Zipf(1,10) |
8 | Zipf(1-20,s) | R_Zipf(1,10) |
9 | Uniform(20) | Uniform(10) |
The experimental results for our proposed algorithm are compared with the IDS results, which use three query representations as follows: (1) m-quiplet; (2) c-quiplet; and (3) f-quiplet.
The single query of the database log file is represented by a c-quiplet that includes five fields: SQL-CMD, PROJ-REL-COUNTER, PROJ-ATTR-COUNTER, SEL-REL-COUNTER, and SEL-ATTR-COUNTER. The SQL-CMD field is symbolic and links to the issued SQL command. The PROJ-REL-COUNTER and PROJ-ATTR-COUNTER fields are numerical and link to the number of relations and attributes involved in the projection part of the SQL query. The SEL-REL-COUNTER and SEL-ATTR-COUNTER fields are numerical and link to the number of relations and attributes involved in the selection part of the SQL query.
The single query of the database log file is represented by the m-quiplet, which includes five fields: SQL-CMD, PROJ-REL-BIN[], PROJ-ATTR-COUNTER[], SEL-REL-BIN[], and SEL-ATTR-COUNTER[]. The SQL-CMD field is symbolic and links to the issued SQL command. The PROJ-REL-BIN[] field is a binary vector whose size, in bits, equals the number of relations in the database. The bit is set to 1 at position
A detailed representation of the log query is given by the f-quiplet, which includes five fields: SQL-CMD, PROJ-REL-BIN[], PROJ-ATTR-BIN[][], SEL-REL-BIN[], and SEL-ATTR-BIN[][]. The SQL-CMD field is symbolic and links to the issued SQL command. The PROJ-REL-BIN[] field is a binary vector whose size, in bits, equals the number of relations in the database. The bit is set to 1 at position
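The binary PROJ-REL-BIN[] construction used by the m- and f-quiplets can be sketched as follows; the 1-based relation numbering and the 5-relation database are assumptions for illustration.

```python
def proj_rel_bin(projection_relations, n_relations):
    """Binary vector with one bit per database relation; bit i-1 is set
    to 1 when relation i appears in the query's projection part."""
    bits = [0] * n_relations
    for rel in projection_relations:
        bits[rel - 1] = 1
    return bits

# SELECT R1.A, R1.C, R2.F, R2.H ... projects from relations 1 and 2
# in a hypothetical 5-relation database.
v = proj_rel_bin([1, 1, 2, 2], n_relations=5)
```

This also illustrates the memory argument made earlier: the vector length grows with the number of relations, whereas the real-valued n-quiplet stays at nine values.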
The SQL command corresponding to the select statement is shown in
SQL command | c-quiplet | m-quiplet | f-quiplet |
---|---|---|---|
SELECT R1.A, R1.C, R2.F, R2.H FROM R1, R2 WHERE R1.B = R2.F | select | select | select |
Two clustering techniques, K-centers and K-M, were used in [
Our results are reported for each n-quiplet representation and are compared against the two clustering methods; however, the TSASM algorithm is used in this paper for clustering the data. In addition, a simple classifier based on the Euclidean distance defined above is used for the ID task.
The main idea of the naive Bayes classifier (NBC) is that each instance
The outlier detection technique is dependent on the median of absolute deviations (MAD) test [
Moreover, for every point
Checking if |
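The MAD test referenced above is commonly applied through the modified z-score; the 0.6745 scaling constant and the 3.5 threshold are standard conventions assumed here, not values taken from the paper.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold.
    Modified z-score: 0.6745 * |x - median| / MAD, where
    MAD = median of the absolute deviations from the median."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        return []  # degenerate case: no spread to test against
    return [x for x in values if 0.6745 * abs(x - med) / mad > threshold]

data = [10, 11, 10, 12, 11, 10, 50]
outliers = mad_outliers(data)
```

The MAD is preferred over the standard deviation here because a single extreme point inflates the standard deviation but barely moves the median-based statistics.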
The FN and FP rates are used for comparisons. FN represents the fraudulent transaction percentage identified as genuine and FP represents the genuine transaction percentage identified as malicious [
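Following the definitions above, the two rates can be computed as follows; the 0/1 label encoding (1 = malicious, 0 = genuine) is an assumption for illustration.

```python
def fn_fp_rates(y_true, y_pred):
    """y_true / y_pred: 1 = malicious, 0 = genuine.
    FN rate = fraction of malicious transactions classified as genuine;
    FP rate = fraction of genuine transactions classified as malicious."""
    pairs = list(zip(y_true, y_pred))
    malicious = [p for t, p in pairs if t == 1]
    genuine = [p for t, p in pairs if t == 0]
    fn = sum(1 for p in malicious if p == 0) / len(malicious)
    fp = sum(1 for p in genuine if p == 1) / len(genuine)
    return fn, fp

truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred  = [1, 1, 1, 0, 0, 0, 1, 0]
fn, fp = fn_fp_rates(truth, pred)  # one miss and one false alarm out of 4 each
```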
The proposed algorithm dominated the results of all quiplet representations in terms of FN rates, as shown in
Comparisons of the FN and FP rates between the proposed algorithm and the c-quiplet representation are shown in
Comparisons of the FN and FP rates between the proposed algorithm and the m-quiplet representation are shown in
Comparisons of the FN and FP rates between the proposed algorithm and the f-quiplet representation are shown in
New challenges facing DBMSs are presented in this paper, which explains why traditional DBMSs cannot guarantee the level of security appropriate for high-value data. A new DBMS architecture is therefore proposed. The ID technique is used to protect the database against internal and external threats, and the IDS is intended for organizations that do not have a clear role architecture assigned to every user. The TSASM clustering algorithm is used to generate profiles from intrusion-free log files, and a machine learning algorithm is used for the ID task on newly issued queries. A synthetic database is built to represent real data log files. The experimental results indicate that the proposed method performs effectively against malicious transactions and can lead to extremely low FN and FP rates, thus increasing database security. However, this paper has the limitation that the proposed algorithm was tested on only one data set; in future work, it should be tested on more than one large real data set.