The continuing damage and fraud caused by phishing URLs make their detection an indispensable area of research. Various techniques have been adopted for detection, including neural networks, machine learning, and hybrid techniques. This study proposes a novel detection model that combines data mining with Particle Swarm Optimization (PSO) to strengthen the detection of phishing URLs. Feature selection is conducted with several techniques to identify phishing candidates from the URL. In this approach, features are mined from the URL using data mining rules and are selected on the basis of URL structure. The features identified by the data mining rules are then classified using PSO techniques. Selecting features and optimizing with PSO makes it possible to identify phishing URLs. Using a large number of rule identifiers, the approach maximizes the true positive rate for the identification of phishing URLs. The experiments show that feature selection using data mining and particle swarm optimization substantially helps to identify phishing URLs based on the structure of the URL itself. Moreover, it reduces the processing time needed to identify a phishing website. The approach can therefore be more beneficial for identifying such URLs than previously proposed detection models.
A new trend among internet scammers, called phishing, has emerged recently. In this process, a fraudster contacts the victim by e-mail. The message and the sender profile appear to come from a financial institution. The victim follows the links provided in the e-mail. The website that opens looks similar to the original website of the financial organization, and similar CSS/HTML/JS elements can be expected on the fake page. Once the user enters his information into this website, the phishing process starts. Depending on the fraudster, the process can take place in three different ways.
Impersonation: The fraudster creates a fake website and sends the link to the user by e-mail. When the user enters his credentials on this site, they are revealed to the fraudster. The original website is then opened, so the user does not even suspect that he has been trapped. The fraudster then misuses the credentials to cause financial or reputational loss to the genuine user.
Forwarding: The phishing e-mail itself asks for the login details. When these are entered, the user is led to the original website, and the fraudster gets hold of the user's credentials. In this case, the fraudster does not even have to make the effort of creating a mirror website.
Pop-up: The phishing e-mail contains a URL, which is the phishing link. When clicked, it opens the original website with a fake pop-up created by the fraudster. The pop-up asks for the credentials, which are saved in the fraudster's database. The users are then redirected to the genuine website and do not even realize that something has gone amiss or that their credentials have been compromised. This attack is not prevalent nowadays, as pop-up blockers are available at the browser level.
The proposed research concentrates on phishing caused by impersonation attacks, as these are the most prevalent and frequent. Here, fake or mirror websites are created that have the complete look and feel of the original. The main task is to distinguish phishing/malicious sites from genuine sites. The spread of technology and the internet has caused a sharp rise in this kind of attack. By looking at these websites, ordinary users would not be able to make out that the address belongs to a phishing website. The phishing websites ask users for their account user name and password, which are captured by the fraudster. He then uses the credentials to perform malicious operations on the original website. The selected features are analyzed using particle swarm optimization and classification techniques. The results obtained are further tested with various algorithms to confirm their validity and accuracy. Finally, the study draws a conclusion and suggests directions for future work.
Recent research has shown that data mining has been used extensively to analyze the different URL features and detect phishing/fake URLs. [
Data Mining is used to perform Classification (Supervised learning) and Clustering (Unsupervised Learning) [
A combination of two techniques and a fusion of results is used in this study. Such research shows more precision in detecting phishing websites, as the flaws of one technique need not shadow the results of the other. Even if one method is not able to detect a discrepancy, the other might detect and report it. [
In the proposed system, phishing URLs are recognized by analyzing the URL structure. There is no need to click on or visit the phishing site. The time required to process the information and analyze it for vulnerabilities is thereby reduced. The content of the web page behind the URL need not be analyzed in this case.
The PSO algorithm provides a more robust means to classify data. Unlike genetic algorithms, it does not use mutation or crossover operators. Instead, the particles collaborate to identify the best candidate value in a large volume of data. Every candidate in this algorithm is called a particle. A fitness function is applied to every particle and yields a fitness value. Two values are maintained in the algorithm, viz. “pbest” and “gbest.” Gbest is the best fitness value found so far amongst all the particles. Pbest is the best fitness value that an individual particle has found so far. The main target of this method is to identify the best values of pbest and gbest. The main reason for using this technique in the study is its ability to find the most suitable candidate amongst the various particles available. The fitness function for particle swarm optimization relates as:
This function measures the quality of a particular solution existing with the associated value of the particle [
For any given particle, each iteration produces a new position that is expected to be closer to the desired solution. After the iteration is completed, the particle's new position is compared against pbest and gbest, which are updated accordingly. A similar calculation holds for the particle’s velocity, which carries it from its initial position towards a final position using the updated pbest and gbest values.
The velocity function, which is responsible for the movement of a particle towards its most optimum solution value, is given as:
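For reference, the canonical PSO update is reproduced here as an assumption, since the exact variant and parameter values used in this study are not stated. The inertia weight $w$, the acceleration coefficients $c_1$ and $c_2$, and the uniform random numbers $r_1, r_2 \in [0, 1]$ are the usual textbook symbols; $x_i^{t}$ and $v_i^{t}$ denote the position and velocity of particle $i$ at iteration $t$.

$$
v_i^{t+1} = w\,v_i^{t} + c_1 r_1 \left(pbest_i - x_i^{t}\right) + c_2 r_2 \left(gbest - x_i^{t}\right),
\qquad
x_i^{t+1} = x_i^{t} + v_i^{t+1}
$$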
When all the fitness function values corresponding to the particles are calculated, the groups of particles are identified as swarms. These groups are expected to travel towards the optimum solution according to the velocity vector. The particle that is most likely to reach the destination is selected as the most suitable candidate, that is, the optimum value. The proposed methodology uses the PSO technique to adjust the weights of the underlying artificial neural network. Using the Global Optimization Toolbox in MATLAB, significant results can be achieved. The proposed algorithm for this technique is depicted in the pseudocode:
Pseudo Code for Selecting Particle Values in PSO.
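A minimal Python sketch of a global-best PSO loop is given below as an illustration; it is not the MATLAB implementation used in the study. The parameter values (`w`, `c1`, `c2`, swarm size, iteration count, search bounds) and the `sphere` test function are assumptions made for this example. In the setting described here, the fitness function would instead return the error rate of the neural network whose weights are encoded by the particle's position.

```python
import random


def pso(fitness, dim, n_particles=20, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, lo=-1.0, hi=1.0, seed=0):
    """Minimise `fitness` over the box [lo, hi]^dim with a basic global-best PSO."""
    rng = random.Random(seed)

    # Random initial positions and zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # pbest: best position seen by each particle; gbest: best seen by the swarm.
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + pull towards pbest + pull towards gbest.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and keep it inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))

            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val

    return gbest, gbest_val


def sphere(x):
    """Stand-in fitness function (assumption): sum of squares, minimum at 0."""
    return sum(v * v for v in x)


best_pos, best_val = pso(sphere, dim=3)
```

Because the fitness is minimised, "best" here means the lowest value, which matches a fitness defined as an error rate.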
The vital parameter for this problem is the fitness function used for evaluation. It is determined by the error rate of the artificial neural network system.
In the URL,
For classification in this study, 10,000 URLs are collected. The set comprises 6000 genuine links and 4000 phishing URLs. These links are available under the public license of the DMOZ repository, and the data are available for ethical use. Such repositories are considered among the largest directories of digital data on the web [
For the first training phase, an initial set of 250 URLs was taken (comprising 125 fake and 125 authentic URLs). For the second training phase, another set of 500 URLs was taken, divided into 250 fake and 250 genuine. Subsequently, the third set comprises 1000 URLs and the fourth set comprises 2000 URLs for training purposes. The final set makes use of 6000 genuine and 4000 non-genuine URLs.
The standard approach for producing phishing URLs is with the help of bot programs. These programs generate many phishing links that refer to a target website URL. If one of the URLs is identified as phishing, a variant copy generated by the bot program is activated as its successor. The variant differs only negligibly in URL structure, which makes it difficult to identify. One characteristic of bot-generated URLs is therefore the similarity of their URL structure [
Attribute | Data type |
---|---|
IP address presence | Nominal {0, 1} |
Unknown noun presence | Nominal {0, 1} |
Suspicious URLs | Nominal {0, 1} |
Out of position top level domain | Nominal {0, 1} |
No of dots in the URL | Numeric |
Security sensitive word presence | Nominal {0, 1} |
No of links to this site | Numeric |
Real traffic rank of the site | Numeric |
Age of the domain | Numeric |
Genuine | Nominal {Y, N} |
Use of IP Address: Usually, when a URL is created, the domain name system is used to give the website a name. In fake URLs, however, the use of raw IP addresses is widespread. For genuine URLs, the domain name masks the IP address; fake URLs often lack this masking. The presence of an IP address [
Count the number of dots present in URL: Certain studies [
Three features from the URL are extracted in this work.
Presence of Security Sensitive Word: If the URL has any of the following words, confirm, account, banking, secure, web-src, login, and sign-in, then the URL can be classified as phishing as per earlier works [
Suspicious Symbol Presence: The “@” symbol is normally used in text and e-mail addresses. When it appears in a URL, however, the text before the symbol is ignored, so the real destination is whatever follows it. e.g.,
Misplaced Top Domain: e.g.,
A close look at the URL given above shows that it appears to belong to the well-known PayPal site. However, the top-level domain is misplaced, and the URL actually refers to a hypothetical fake domain, giving rise to phishing [
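The lexical checks described above (IP address, number of dots, security sensitive words, the “@” symbol, and a misplaced top-level domain) can be sketched in Python as follows. This is an illustrative sketch only: `KNOWN_TLDS` is a deliberately short, hypothetical list, the example URLs are invented for the demonstration, and the “unknown noun” and externally sourced features (links, traffic rank, domain age) are not covered.

```python
import re
from urllib.parse import urlsplit

# Word list taken from the text above.
SENSITIVE_WORDS = ("confirm", "account", "banking", "secure",
                   "web-src", "login", "sign-in")
# Hypothetical short list of top-level domains, for illustration only.
KNOWN_TLDS = ("com", "net", "org", "edu", "gov")
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def lexical_features(url):
    """Return 0/1 flags and counts derived from the URL string alone."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    labels = host.split(".")
    rest = (parts.path + "?" + parts.query).lower()

    # A TLD token is "out of position" when it appears anywhere other
    # than the last label of the host name (including path and query).
    misplaced = (any(label in KNOWN_TLDS for label in labels[:-1])
                 or any(tok in KNOWN_TLDS
                        for tok in re.split(r"[^a-z0-9]+", rest)))

    lowered = url.lower()
    return {
        "ip_address": int(bool(IPV4.match(host))),
        "suspicious_symbol": int("@" in url),
        "misplaced_tld": int(misplaced),
        "num_dots": url.count("."),
        "sensitive_word": int(any(w in lowered for w in SENSITIVE_WORDS)),
    }


# Invented example URLs (assumptions, not taken from the dataset).
fake = lexical_features("http://www.paypal.com.example.net/login")
raw_ip = lexical_features("http://user@192.168.0.1/secure")
plain = lexical_features("http://www.example.com/index.html")
```

In the first example the token `com` appears before the final host label, so the misplaced top-level domain flag is set even though the URL superficially resembles a PayPal address.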
URL Site connections: It is most likely that if a URL is connected to a large number of pages, then it is also genuine [
Traffic Received: Certain services measure the incoming and outgoing traffic of a specific URL, for example, Alexa (a subsidiary of Amazon.com). The data collected by such services can help identify phishing sites [
Domain Age: Many phishing websites are reported and blocked within a short span of time. The domain creation date can be easily checked in the WHOIS properties. It can be inferred that the older a site is, the less likely it is to be a phishing site [
To improve the results of classification, feature selection has been employed. This study selects the most relevant features. Feature selection removes redundant data, improves accuracy, and reduces the risk of over-fitting. The WEKA tool has been used to perform feature selection and classification. The feature selection techniques [
The prediction ability and degree of redundancy for all the considered features are used to calculate the weight of each subset of features. There is a high correlation between all the subsets [
Features selected: Security sensitive word presence, Unknown noun presence, Out of position top level domain, Age of the domain, Suspicious URLs, IP address presence, Number of links to this site
This algorithm evaluates the importance of an attribute by measuring its correlation with the class. The weighted average is calculated to determine the overall correlation. The merit value for a feature subset S having n features is given by:
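The commonly used correlation-based merit heuristic is shown here as an assumption about the intended formula. In it, $\overline{r_{cf}}$ denotes the mean feature-class correlation and $\overline{r_{ff}}$ the mean feature-feature inter-correlation over the $n$ features in $S$; these symbol names are introduced for this illustration.

$$
Merit_S = \frac{n\,\overline{r_{cf}}}{\sqrt{n + n(n-1)\,\overline{r_{ff}}}}
$$

The numerator rewards features that predict the class well, while the denominator penalises subsets whose features are redundant with one another.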
Features selected: Security sensitive word presence, Unknown noun presence, Number of dots in the URL, Out of position top level domain, Age of the domain
This parameter evaluates the importance of an attribute by comparing the gain ratio [
Features selected: Security sensitive word presence, Out of position top level domain, Unknown noun presence, Number of dots in the URL, Number of links to this site
This parameter evaluates the worth of an attribute by comparing information gain [
The top five attributes selected using Information gain feature selection:
Number of links to this site, Security sensitive word presence, Number of dots in the URL, Unknown noun presence, Out of position top level domain
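To make the two entropy-based rankings concrete, the sketch below computes information gain and gain ratio for a single nominal attribute against the class label. It follows the standard textbook definitions rather than WEKA's internal code, and the toy attribute and label vectors are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())


def info_gain(values, labels):
    """Reduction in class entropy obtained by splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalised by the entropy of the attribute itself."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0


# Toy data (assumption): the first attribute separates the classes perfectly,
# the second carries no information about the class.
labels = ["N", "N", "Y", "Y"]
perfect = [1, 1, 0, 0]
useless = [1, 0, 1, 0]

ig_perfect = info_gain(perfect, labels)
ig_useless = info_gain(useless, labels)
gr_perfect = gain_ratio(perfect, labels)
```

Ranking the attributes then amounts to sorting them by either score in descending order.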
Hypothesis: The “Unknown Noun” feature proposed in this research work is consistently among the top five features during feature selection. This hypothesis has been tested using the chi-square and t-test attribute selection methods. The t-test is used to test whether the sample mean differs significantly from the hypothesized value. Both feature selection mechanisms were implemented in the Tanagra tool. The features that were selected from the t-test are:
Security sensitive word presence, Unknown noun presence, Out of position top level domain, Age of the domain, Suspicious URLs
Chi-square is a standard feature selection algorithm that has been used to rank the features in order of relevance by comparing the observed and expected proportions of a value. The features that are delivered as output for this test are:
Security sensitive word presence, Unknown noun presence, Out of position top level domain, Suspicious URLs, IP address presence
After performing Correlation, Gain Ratio and Information Gain feature selection, the attributes are ranked as shown in
It is observed that the new feature proposed, Unknown Noun Presence, is ranked among the top 3 features in all the feature selection techniques.
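As an illustration of how a chi-square ranking of this kind scores a nominal feature, the sketch below computes the Pearson chi-square statistic between one attribute and the class label. It is a generic textbook computation, not the Tanagra implementation, and the toy vectors are invented for the example.

```python
from collections import Counter


def chi_square(values, labels):
    """Pearson chi-square statistic between a nominal feature and the class."""
    n = len(labels)
    value_counts = Counter(values)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(values, labels))

    stat = 0.0
    for v, v_count in value_counts.items():
        for lab, lab_count in label_counts.items():
            # Count expected under independence of feature and class.
            expected = v_count * lab_count / n
            observed = joint_counts[(v, lab)]
            stat += (observed - expected) ** 2 / expected
    return stat


# Toy data (assumption): a perfectly informative and an uninformative feature.
labels = ["N", "N", "Y", "Y"]
chi_perfect = chi_square([1, 1, 0, 0], labels)
chi_useless = chi_square([1, 0, 1, 0], labels)
```

A larger statistic indicates a stronger dependence between the feature and the class, so features are ranked by this value in descending order.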
The data have been thoroughly scrutinized and refined, and classification is now performed on them. First, the feature selection process was executed using the WEKA tool. Then, the classification process in conjunction with PSO was implemented in MATLAB using De Jong’s fifth function from the Global Optimization Toolbox [
The accuracy of both the classifiers after applying the different feature selection techniques is shown in
Accuracy (with cross validation 10 folds) in % | Without feature selection | Subset evaluation | Correlation | Gain ratio | Information gain |
---|---|---|---|---|---|
MLP | 92.83 | 92.63 | 92.1 | 91.84 | 92.53 |
Random tree | 93.63 | 93.63 | 93.23 | 93.7 | 93.77 |
Time taken (in seconds) | Without feature selection | Subset evaluation | Correlation | Gain ratio | Information gain |
---|---|---|---|---|---|
MLP | 2.97 | 2.26 | 1.83 | 2.68 | 1.27 |
Random tree | 1.02 | 0.69 | 0.63 | 0.53 | 0.50 |
Accuracy (%) with random forest classifier (with cross validation 10 folds) | Subset evaluation | Correlation | Gain ratio | Information gain |
---|---|---|---|---|
Without unknown noun feature | 92.067 | 92 | 92 | 92.93 |
With unknown noun feature | 93.63 | 93.63 | 93.23 | 93.7 |
The WEKA tool is used to classify the data after feature selection using the Naïve Bayes, Multi-layer Perceptron, J 48 Tree, LMT, Random Forest, Random Tree, C 4.5, ID 3, C-RT and K-Nearest Neighbor algorithms [
The classification accuracy, precision, and recall values are higher for the Tree-based classification algorithms than the other frequently used algorithms from
Classification algorithm | Training accuracy (%) | Cross validation (10-fold) (%) | Cross validation (3-fold) (%) | Leave one out (%) |
---|---|---|---|---|
Naïve Bayes | 89.73 | 89.63 | 88.16 | 89.73 |
J 48 tree | 93.3 | 93.46 | 92.83 | 92.26 |
LMT | 94.16 | 94.86 | 93.13 | 93.53 |
Random forest | 95.5 | 95.07 | 95.17 | 95.93 |
MLP | 92.53 | 92.83 | 91.8 | 91.63 |
Random tree | 95.6 | 95.63 | 96.67 | 96.4 |
C 4.5 | 92.97 | 91.07 | 91.3 | 91.93 |
ID 3 | 91.17 | 90.33 | 93.87 | 92.13 |
C-RT | 92.7 | 91.53 | 92.47 | 91.97 |
K-nearest neighbor | 92.6 | 92.5 | 92.47 | 92.5 |
Classification accuracy (
Classification with PSO
In this part of the study, the impact on the classification by introducing the PSO algorithm is computed.
In the observation set in
Domains classified (accuracy in %) | J48 tree | LMT | Random forest | Random tree |
---|---|---|---|---|
Gaming section | 92.98 | 92.97 | 95.02 | 94.89 |
Banking section | 93.28 | 93.63 | 95.39 | 97.99 |
News and advertising | 93.9 | 93.7 | 94.7 | 94.82 |
Online shopping section | 93.70 | 94.69 | 95.68 | 94.99 |
Algorithm | Accuracy (%) | False positive rate |
---|---|---|
Naïve Bayes | 88.16 | 0.6 |
K-nearest neighbor | 92.47 | 0.5 |
ID 3 | 93.87 | 0.45 |
SVM | 91.5 | 0.58 |
NN with PSO | 98.7 | 0.21 |
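For clarity on how the reported measures relate to one another, the sketch below derives accuracy, precision, recall, and false positive rate from the four confusion-matrix counts, treating phishing as the positive class. The counts in the example are hypothetical and are not taken from the experiments.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard measures computed from confusion-matrix counts.

    tp: phishing URLs flagged as phishing    fp: genuine URLs flagged as phishing
    tn: genuine URLs passed as genuine       fn: phishing URLs passed as genuine
    """
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),  # also called the true positive rate
        "false_positive_rate": safe_div(fp, fp + tn),
    }


# Hypothetical counts, for illustration only.
m = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

A lower false positive rate means fewer genuine URLs are wrongly flagged, which is why it is reported alongside accuracy.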
There is a considerable increase in accuracy with the help of PSO as per
Phishing is a problem that constantly troubles internet security analysts. New attacks keep emerging despite the research being carried out in this field, and extensive further research is needed to bridge the gap. In the proposed methodology, certain unique features have been selected, and accuracy has improved through the use of feature selection techniques. The time taken to build the model and then perform the classification is also reduced considerably. Hybrid methods, such as the combination of PSO with neural networks, have given better results than the traditional classification techniques. The data mining technique applied in this study performs well in identifying phishing URLs. The dataset is classified with the help of machine learning algorithms after the best possible features have been selected, and the models are trained on these features. Training was completed using various algorithms and the results are explained. A collective comparison is made and the results are recorded to assess the performance of the proposed model. The precision values obtained with the model were satisfactory and acceptable. The model yields a substantial decrease in the false-positive rate for phishing URLs, based on the features selected from the URL structure. Almost all the classifiers achieved more than 91% accuracy in identifying phishing URLs under this model, and the neural network with PSO achieved more than 98% accuracy. The model gives good results, but more advanced data mining algorithms can be applied to it as future work. The study considers only a limited set of features for feature selection, and the available features can be improved further.
The model has not yet been tested with more classification algorithms, which can be the next level of study. The processing time for identifying phishing URLs is also one of the future aspects of this study.
The authors would like to thank the College of Computing and IT, Shaqra University, for its support during this study. We would also like to extend our thanks to the entire department and faculty of Computer Science for their motivation.