Android devices are widely available in the commercial market at different price levels for different categories of customers. The Android stack is more vulnerable than other platforms because of its open-source nature. Many Android malware detection techniques examine the source code and the components invoked at execution time. To obtain better results, we create a hybrid technique merging static and dynamic processes. In the first part of this paper, we propose a technique that checks for correlation between features and classifies applications using a supervised learning approach, avoiding the multicollinearity problem, one of the drawbacks of existing systems. In the proposed work, a novel PCA (Principal Component Analysis) based feature reduction technique is implemented over conditionally dependent features gathered from the functionality of the application, which adds novelty to the approach. Sensitive Android permissions are a major point to consider while detecting malware. We select vulnerable columns based on features such as sensitive permissions, application program interface calls, services requested through the kernel, and the relationships between variables, then build the model using machine learning classifiers to identify whether a given application is malicious or benign. The final goal of this paper is to evaluate benchmark datasets collected from repositories such as VirusShare, GitHub, and the Canadian Institute for Cybersecurity, and to compare models, ensuring that zero-day exploits can be monitored and detected with a better accuracy rate.
The smartphone era began in the last generation and has peaked in recent years. We focus on the Android stack, used by end users all over the world at different levels and for different purposes: some spend more time on social applications, some on business applications, others on gaming and media applications, and to a certain extent research activities and projects are carried out on the Android platform. The framework used in the Android API (Application Programming Interface) is more vulnerable because of its open-source nature: the stack can be modified, resulting in security breaches when distributed across third parties, and it is a major channel for advertisement. The attacker is keen on extracting useful information from the victim by sending fake Short Messaging Service (SMS) messages, recording and accessing call logs, and dropping malicious payloads with the help of Trojan and backdoor mechanisms [
This paper is organized into sections covering related work, existing methodologies, the proposed work, implementation and results, and conclusions with further research directions. In the upcoming sections, we discuss static, dynamic, and hybrid analysis of Android applications, applying reverse engineering using disassembly tools. The results obtained from the sandboxing environment are combined with dataset repositories collected from various sources to create the model. One of the major issues inside the dataset is the dependency between variables, which can lead to the multicollinearity problem; we handle that issue with principal component analysis. In the proposed system, we build a classifier based on sensitive Android permissions for handling and decision-making on zero-day exploits and recent threats that most antivirus scanning tools are not yet updated for. In the upcoming sections, we discuss static and dynamic analysis under related work, study the dataset with various machine learning classifiers and their terminology, and implement an enhanced PCA in the proposed work to overcome the multicollinearity issue, which shows the novelty of the research, followed by conclusions and future directions.
Labeled applications, along with unknown applications, undergo dynamic analysis based on behaviour extraction; feature extraction then generates a vector template that is fed into the base classifier. The probability prediction is based on learning and stacking and decides whether the given application is benign or malicious, as shown in the
Static analysis is the process of analyzing a binary without executing it. It is the easiest analysis to perform and allows you to extract the metadata associated with the suspect binary. Static analysis might not reveal all the required information, but it can provide clues that help determine where to focus subsequent analysis efforts to extract useful information from the malware binary. According to V. Joseph Raymond et al. [
Dynamic analysis is the process of executing the suspect binary in an isolated environment and monitoring its behavior. This technique is easy to perform and gives valuable insights into the activity of the binary during execution, but it does not reveal all the functionality of the hostile program. According to V. Joseph Raymond et al. [
Cuckoo Sandboxing [
In the second phase, the hybrid approach processes and cleans the data and feeds it to machine learning classifiers such as logistic regression; the target is marked malicious only when both approaches agree. This reduces the inaccuracy found in dynamic analysis alone [
The experimental setup uses a dataset of payloads collected from VirusShare, GitHub, and the Canadian Institute for Cybersecurity [
Dataset | Malware Family | Number of Samples |
---|---|---|
CCCS-CIC-AndMal2020 | Adware | 102 |
CICMalDroid 2020 | Backdoor | 105 |
Darknet 2020 | Dropper/Trojan | 204 |
Investigation of the Android Malware (CIC-InvesAndMal2019) | File Infector | 167 |
Android Malware Dataset (CIC-AndMal2017) | Ransomware | 107 |
Android Adware and General Malware Dataset (CIC-AAGM2017) | Scareware | 106 |
ISCX Android Botnet dataset 2015 | SMS Attack | 102 |
ISCX Android Validation dataset 2014 | Spyware | 106 |
Virus Share 2021 | Zero-Day | 100 |
CCCS-CIC-AndMal2020 | Benign | 304 |
In the existing approach discussed in this paper, supervised and unsupervised learning are applied without feature reduction.
In the traditional approach, the primary classifier is linear or logistic regression, chosen based on the user's need for training the model. The dataset is assembled from repository datasets together with our own results obtained from the hybrid analysis. In the first part, logistic regression is used as a supervised classifier whose target variable is a discrete value stating whether the application is malicious or not, i.e., binary classification. This model uses the sigmoid function given below in the figure.
Logistic regression is tuned for either high recall or high precision: in the first case we reduce false negatives for sensitive data, and in the latter we reduce false positives. The presence of Android malware is modelled as binomial logistic regression, where presence is treated as ‘1’ and absence as ‘0’.
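As a brief illustration of this decision rule (the scores and the 0.5 threshold below are illustrative assumptions, not the paper's actual pipeline), the sigmoid maps a real-valued score to a probability, and thresholding yields the binary malicious/benign label:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative scores; a score of 0 maps to a probability of exactly 0.5.
p = sigmoid(np.array([-4.0, 0.0, 4.0]))

# An application is flagged malicious ('1') when the predicted
# probability reaches the 0.5 decision threshold.
labels = (p >= 0.5).astype(int)
```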
The data has ‘m’ feature variables and ‘n’ observations, and the matrix is represented as shown in the figure below.
We can define conditional probabilities for two labels (0 and 1) for the
Here, y and h(x) represent the response vector and the predicted response, respectively, and xj represents the observed values of the jth feature.
The learning is defined from
ID | System Call | ID | System Call | ID | System Call |
---|---|---|---|---|---|
1 | writev | 11 | pread | 21 | close |
2 | unlink | 12 | umask | 22 | lseek |
3 | socket | 13 | bind | 23 | connect |
4 | recvfrom | 14 | write | 24 | ioctl |
5 | readv | 15 | chdir | 25 | execve |
6 | read | 16 | sendto | 26 | dup |
7 | open | 17 | rename | 27 | fchown |
8 | mkdir | 18 | access | 28 | chmod |
9 | fcntl | 19 | recvmsg | 29 | sendmsg |
10 | epoll | 20 | dup2 | 30 | fchown |
ID | PERMISSION | ID | PERMISSION |
---|---|---|---|
1 | _WRITE_SMS | 18 | _PROCESS_OUTGOING_CALLS |
2 | _WRITE_SETTINGS | 19 | _MODIFY_PHONE_STATE |
3 | _WRITE_HISTORY_BM | 20 | _INTERNET |
4 | _WRITE_EXTERNAL_STORAGE | 21 | _INSTALL_PACKAGE |
5 | _WRITE_CONTACTS | 22 | _HARDWARE_TEST |
6 | _WRITE_APN_SETTINGS | 23 | _HARDWARE_TEST |
7 | _VIBRATE | 24 | _GET_ACCOUNTS |
8 | _USE_CREDENTIALS | 25 | _FACTORY_TEST |
9 | _SEND_SMS | 26 | _EXPAND_STATUS_BAR |
10 | _RESTART_PACKAGE | 27 | _DISABLE_KEYGUARD |
11 | _RECEIVE_SMS | 28 | _DEVICE_POWER |
12 | _RECEIVE_BOOT_CMD | 29 | _CHANGE_WIFI_STATE |
13 | _READ_SMS | 30 | _CHANGE_NETWORK_STATE |
14 | _READ_PHONE_STATE | 31 | _CALL_PHONE |
15 | _READ_LOGS | 32 | _ACCESS_NETWORK_STATE |
16 | _READ_EXTERNAL_STORAGE | 33 | _ACCESS_LOCATION |
17 | _READ_CONTACTS | 34 | _ACCESS_GPS |
ID | API CALLS | ID | API CALLS |
---|---|---|---|
1 | setSerialNumber | 18 | getMethod |
2 | sendTextMessage | 19 | getMessage |
3 | requestFocus | 20 | getLongitude |
4 | loadClass | 21 | getLocation |
5 | killProcess | 22 | getLineNumber |
6 | isProviderEnabled | 23 | getLatitude |
7 | getWifiState | 24 | getInputStream |
8 | getSubscriberID | 25 | getDisplayAddress |
9 | getSIMSerialNumber | 26 | getDeviceID |
10 | getSimOperatorName | 27 | getCredential |
11 | getSession | 28 | getCookies |
12 | getPackageName | 29 | getClassLoader |
13 | getPackageInfo | 30 | getCertStatus |
14 | getOutputStream | 31 | getAppPackageName |
15 | getNetworkType | 32 | exec |
16 | getNetworkOperator | 33 | createFromPdu |
17 | getMsgBody | 34 | abortBroadcast |
This approach is a collection of algorithms in which every classification is performed independently under the assumption of pairwise feature independence. We divide the data into a feature matrix and a response vector: the former contains the dependent features and the latter contains the prediction, in simple terms the output. The equations given below explain the working principle of the classifier.
Here X and Y are events, with P(Y) ≠ 0; we consider the probability of occurrence of X assuming Y is true, termed the evidence. The prior and posterior probabilities have to be monitored. In our paper, we use a Gaussian classifier. The demerit of this approach is that the model will assume ‘0’ as output if the dataset has errors or missing values.
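A minimal sketch of the Gaussian Naive Bayes classifier, using a synthetic toy dataset (the feature values and labels below are illustrative placeholders, not the paper's data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy feature matrix: rows are apps, columns are illustrative numeric
# features (e.g. counts of sensitive permissions and API calls);
# labels: 1 = malicious, 0 = benign (synthetic placeholders).
X = np.array([[9, 8], [8, 9], [7, 8], [1, 0], [0, 1], [1, 2]])
y = np.array([1, 1, 1, 0, 0, 0])

# Gaussian NB fits a per-class mean and variance for each feature and
# applies Bayes' theorem assuming conditional feature independence.
clf = GaussianNB().fit(X, y)
pred = clf.predict([[8, 8], [0, 0]])
```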
This approach ensembles a voting meta-estimator to make predictions; it works like black-box testing, where the system compares resampled datasets with the actual entities based only on inputs and outputs. The merit of this approach is reduced overfitting, since variance is lowered by resampling: each bootstrap sample draws instances with replacement, so an instance may appear multiple times in one sample and not at all in another.
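A short sketch of bagging as described above, using scikit-learn's `BaggingClassifier` (whose default base estimator is a decision tree) on synthetic two-cluster data; the data and parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Synthetic two-cluster data standing in for permission-count features.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Each base tree is fit on a bootstrap resample drawn with replacement,
# so an instance may appear several times in one sample and be absent
# from another; the ensemble then votes, which reduces variance.
bag = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
pred = bag.predict([[0.0, 0.0], [5.0, 5.0]])
```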
This approach is non-linear: the input dataset is mapped into a high-dimensional feature space, and the outcome is a non-probabilistic binary classification. The optimization finds a linear discriminant that maximizes the perpendicular distance (margin) to the nearest training points. This classifier also works like black-box testing, where the input training data is compared with the output labels to achieve the result.
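A minimal sketch of the non-linear mapping described above: XOR-style data is not linearly separable in the input space, but an RBF-kernel SVM (the kernel parameters here are illustrative assumptions) separates it in the implicit high-dimensional feature space:

```python
import numpy as np
from sklearn.svm import SVC

# XOR pattern: no straight line separates the two classes in 2-D.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# The RBF kernel implicitly maps inputs into a high-dimensional space
# where a separating hyperplane exists; the output is a hard 0/1 label.
svm = SVC(kernel="rbf", gamma=5.0, C=10.0).fit(X, y)
pred = svm.predict(X)
```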
In the second part of the experiment, we implement the K-Nearest Neighbours (k-NN) supervised machine learning algorithm, one of the most essential classification algorithms, widely used for intrusion detection in cyber security. This algorithm can be used on real-time data and is therefore well suited for hybrid analysis. Here we classify the datasets identified with the help of attributes.
The tables shown above list the selected permissions, selected API (Application Programming Interface) calls, and selected system calls considered for Android malware detection in our dataset, created through hybrid analysis. We considered different categories of APK files such as social engineering, banking and financial, games and sports, media, and educational apps, and took around 800 samples combining malware and benign apps from repositories such as the Canadian Institute for Cybersecurity, VirusShare, VirusTotal, Drebin, MalDrozer, and Android Tracker, gathering recent 2020 applications [
In the formula above, True Positives (TP) are cases where the actual prediction matches the expected outcome; False Positives (FP) are predicted positive but actually negative; True Negatives (TN) are predicted negative and actually negative; and False Negatives (FN) are predicted negative but actually positive. We use the same dataset as in the earlier approach and examine how this approach can give better accuracy than the previous model; it is purely based on the number of points from the dataset that create the training model. We implemented it on the Google Colab platform, importing all the necessary packages. The first step is to fix the ‘X’ and ‘y’ variables for the training set and output label. Here also we can use a label encoder for the label predicting whether an application is malicious or not. The features are scaled and transformed for the input training dataset. We split the data into train and test sets using an 80–20 approach before fitting the model. We apply the KNeighborsClassifier and compute the accuracy score; the method applies to both classification and regression problems. We have to focus on the scale of the variables and the distance between observations. First, we find the K value by comparing the error rate, with n_neighbors as ‘i’ ranging from 1 to 40. This shows the error rate
From the above graph, we can see that the error rate stabilizes after k > 15, so we fix n_neighbors as ‘15’ and get an accuracy rate of 0.76 as shown in
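The k-selection procedure above can be sketched as follows; the synthetic data, cluster parameters, and random seeds are assumptions standing in for the real permission/API-call feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the hybrid-analysis feature matrix.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

# Scale the features, then split 80-20 as described in the text.
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Sweep n_neighbors from 1 to 40 and record the test error rate;
# the elbow of this curve guides the final choice of k.
error_rate = []
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    error_rate.append(float(np.mean(knn.predict(X_te) != y_te)))
```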
Classifier | Accuracy (%) | F-Score | Recall | Precision |
---|---|---|---|---|
Naive Bayes | 76% | 0.70 | 0.76 | 0.74 |
Bagging Decision Tree | 68% | 0.69 | 0.68 | 0.66 |
Support vector machines | 76% | 0.69 | 0.76 | 0.74 |
Logistic Regression | 69% | 0.65 | 0.69 | 0.64 |
The proposed approach is implemented with Principal Component Analysis (PCA), one of the best feature extraction techniques and well suited to a hybrid approach for detecting whether an Android application is malicious. It lowers the dimensionality, focusing on the most important and critical attributes of the training dataset, and finds linear combinations of features that counter the multicollinearity problem. The flowchart given below shows how the enhanced PCA can reduce search time in
The input variable ‘X’ holds the features of the dataset. The original N × d matrix X is transformed into an N × m matrix Y. We calculate the covariance or correlation matrix using the equation given below
We calculate the eigenvectors and eigenvalues of the covariance matrix from CV = λV, then calculate the dissimilarity matrix and a local similarity measure. We find the local-feature minimum distance and the global-feature minimum distance. The output of the hidden layer is computed by summing the inputs multiplied by their weights, as shown in the equation below.
The error is then used to adjust the weights of the input vector according to the delta learning rule. The outcome is based on weighted features. The values of the ‘m’ feature variables over ‘n’ observations are combined using the coefficients obtained from the maximum-variance calculation.
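The covariance and eigendecomposition steps above can be sketched from first principles; the matrix sizes (d = 4 features reduced to m = 2 components) and the random data are assumptions for illustration:

```python
import numpy as np

# PCA by eigendecomposition: reduce an N x d matrix X to N x m.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                # centre each feature

C = np.cov(X, rowvar=False)           # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh suits symmetric matrices

# Keep the m eigenvectors with the largest eigenvalues; those with the
# smallest eigenvalues carry the least variance and are dropped.
order = np.argsort(eigvals)[::-1]
m = 2
W = eigvecs[:, order[:m]]             # d x m projection matrix
Y = X @ W                             # N x m reduced representation
```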
The feature selection approach, also called the variable selection or attribute selection approach, chooses the features most relevant to the predictive modelling problem. Feature sets may contain irrelevant and redundant features. Irrelevant features should be removed because they have low correlation with the class; redundant features should be screened out because they are highly correlated with one or more of the remaining features. Since feature selection removes irrelevant and redundant features, it usually gives equal or better accuracy while requiring less data. As an example, logistic regression is a linear classifier whose parameters are the weight vector and the regularization parameter; after training, the value of each weight indicates how important that feature is for classification. The logistic regression model uses the Akaike Information Criterion for feature selection. The feature selection algorithm reduced the number of features to the eight most relevant ones, and the experiments finally achieve better accuracy as shown in
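As a hedged sketch of weight-based feature selection: this uses coefficient magnitude via scikit-learn's `SelectFromModel` as an illustrative proxy, not the AIC criterion the text mentions, and the synthetic data is an assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only a handful informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Train logistic regression, rank features by |weight|, and keep the
# eight largest, mirroring the reduction to eight features in the text.
selector = SelectFromModel(LogisticRegression(max_iter=1000),
                           max_features=8, threshold=-np.inf).fit(X, y)
X_reduced = selector.transform(X)
```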
The experiment is carried out with the objective of feature reduction using PCA, implemented with logistic regression and k-NN machine learning classifiers, seeking better accuracy. The first step is data transformation using encoding techniques and finding the correlation between variables. This computes the pairwise correlation of columns, shown in
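The pairwise column correlation can be computed with pandas `DataFrame.corr()`; the column names and values below are illustrative stand-ins for the real permission features:

```python
import pandas as pd

# Illustrative binary permission indicators for six sample apps.
df = pd.DataFrame({
    "send_sms": [1, 0, 1, 1, 0, 1],
    "read_sms": [1, 0, 1, 1, 0, 1],   # identical to send_sms
    "internet": [1, 1, 1, 0, 1, 0],
})

# Pearson correlation of every pair of columns; perfectly duplicated
# columns correlate at 1.0, signalling redundancy/multicollinearity.
corr = df.corr()
```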
The bigger the value, the more strongly two variables are correlated, and vice versa. Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance), which is useful for modelling. Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data. Before computing eigenvectors and eigenvalues, we need to calculate the covariance matrix. To decide which eigenvectors can be dropped without losing too much information in constructing the lower-dimensional subspace, we inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, and those are the ones that can be dropped, as shown in
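A short sketch of the standardization and eigenvalue-inspection steps using scikit-learn (the data and its scale factors are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features with very different scales.
rng = np.random.RandomState(1)
X = rng.normal(size=(100, 6)) * np.array([1, 2, 3, 4, 5, 6])

# Standardize each feature to zero mean and unit variance.
Xs = StandardScaler().fit_transform(X)

# PCA orders components by decreasing eigenvalue; components with the
# smallest explained variance carry the least information and are the
# candidates to drop when forming the lower-dimensional subspace.
evr = PCA().fit(Xs).explained_variance_ratio_
```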
A comparative analysis of the Naïve Bayes, Bagging Decision Tree, Logistic Regression, Support Vector Machine, and k-NN classifiers is made with and without principal component analysis. From the table given below, we can see that the model achieves 80% accuracy with PCA for the k-NN approach using k = 15, better than the 76% accuracy without PCA and better than the other classifiers. We use 100 epochs for validating and testing the model. The outputs of the static features are given as input to the machine learning classifier, which decides whether the APK application is malicious or benign, as discussed in the proposed framework. Compared with the existing approach, the proposed one has better accuracy, F-Score, recall, and precision, as shown in
Classifier | Accuracy (%) | F-Score | Recall | Precision |
---|---|---|---|---|
Naive Bayes | 77% | 0.71 | 0.77 | 0.76 |
Bagging Decision Tree | 76% | 0.72 | 0.76 | 0.73 |
Support vector machines | 78% | 0.72 | 0.78 | 0.76 |
Logistic Regression | 75% | 0.68 | 0.75 | 0.82 |
The detection of Android malware using a hybrid approach, applying supervised learning to a dataset of zero-day exploits and archives collected from recent payloads, achieved a good accuracy rate, and we visualized the model. We explored and used effective forensics tools such as the Android Forensics (AF) logical tool for extracting features, removing redundant data when creating the model. We filter out redundant features and optimize feature selection for better accuracy, which can be considered an optimization approach. We also saved time and space by keeping hardware and software requirements minimal, so the model can be used in small-scale settings. The multicollinearity problem is handled using principal component analysis by extracting and reducing features based on correlation; PCA (Principal Component Analysis) ensures that the dependency between variables is reduced, resulting in better accuracy. A comparison with existing classifiers without PCA found that their error rate is somewhat higher. A potential limitation of the proposed work is the residual error rate of the k-NN approach. In future work, we can implement a Multilayer Perceptron with clustering, and further apply a CNN (Convolutional Neural Network) with an XGBoost model to increase accuracy, then rank Android malware payloads by level of impact. By this approach, we can further improve the accuracy rate of the model. As the functions of each application become increasingly powerful, it has become mandatory to protect users from vulnerable threats, since most applications on the Android platform are not encrypted.
As the next generation moves towards smart cities, where most users will rely on Android applications for purposes such as banking, finance, fitness and health, and social networking, the possibility of threats will increase for unencrypted as well as weakly protected applications. This is a major challenge when establishing IoT platforms, where most devices are connected to the internet, risking breaches of integrity and confidentiality. Our proposed work, leading to a threat model, can suggest or support decision-making for users when installing an application from untrusted sources.