Cyber-Attacks are critical and destructive to all industry sectors. Through social engineering, they enable unauthorized access to a Personal Computer (PC), which compromises the system and threatens its users. Defending against them requires understanding the nature of Cyber-Attacks: prevention becomes easier and more accurate once sufficient knowledge about their various features has been acquired. Cyber-Security proposes appropriate actions that can handle and block attacks. A phishing attack is a cybercrime in which users follow a link to an illegitimate website that persuades them to divulge their private information. One of the online security challenges is the enormous number of daily transactions carried out via phishing sites. Because Cyber-Security is a priority for all organizations, Cyber-Security risks are considered part of an organization’s risk management process. This paper presents a survey of modern machine-learning approaches that handle phishing problems and detect different phishing attacks with high accuracy. A dataset of more than 11000 websites from Kaggle was used to study the effect of 30 website features on the class label, which indicates whether or not a site is a phishing website (1 or −1). Furthermore, we determined the confusion matrices of three Machine Learning models, Neural Networks (NN), Naïve Bayes, and Adaboost; the accuracies achieved were 90.23%, 92.97%, and 95.43%, respectively.
The internet is a wealthy source of social media applications [
Cyber-Security is the standard used to block any attacks on systems [
Recently, after the rise of the Internet of Things, the Cybercrime problem has increased greatly, which is considered a big challenge in the information technology field [
In recent years, Internet users have become insecure because of Web-Threats from social networks, such as social botnets. A social botnet is a collection of social users that convinces other users to release their personal information via malicious activities [
With the progress of network technology and the development of networking applications, security has increasingly been put at risk. Phishing websites are capable of avoiding detection by looking legitimate, which attracts users to these sites [
Cyber-Security researchers and domain experts use Machine Learning (ML) algorithms to build Anti-Phishing detection models that can be applied in a Real-Time environment and whose results can be interpreted to defend against multimedia application attacks [
The rest of the paper is structured as follows. Section 2 introduces social engineering (SE) and the life cycle of a social engineering attack. Section 3 describes social network attacks. The Anti-Phishing solutions are explained in Section 4. The Cyber-Security techniques to subdue attacks are introduced in Section 5. The limitations and threats are investigated in Section 6. Finally, the conclusion is given in Section 7.
SE is a growing threat that spreads via different web applications. In the cyber domain, the human factor is more critical than the technical aspect. 95% of attacks are caused by human errors like providing personal information [
Phase | Activities of the attacker |
---|---|
Attack formulation | Identify goals and targets |
Information gathering | Gather credibility information like preferences, affiliation, backgrounds, and social information to establish a trusting relationship |
Planning | Analysis of collected information to develop an attack |
Develop a relationship | The collected data is used to establish communication and build trust with the target |
Exploit the relationship | Cheat the victim by a different wicked task, like logging in, spam email, password reset, and cloud access |
Debrief | Received sensitive information used to access the cloud or system |
Because of the rapid development of SE incidents, SE researchers confirm that there is no helpful defense method against these attacks [
Traditional SE attacks, like phishing, do not require lots of knowledge to occur, so they are the reason for hundreds of millions in economic losses. Phishing attacks have globally increased by 1,220,523 in 2016 when compared with the preceding year. An appropriate SE framework contributes to the defense against SE attacks by illustrating the relationships between attack components [
Cyber-Security is a critical issue in different industries, as security breaches cause enormous economic and reputational loss in the majority of organizations [
Cyber-Attack | Techniques |
---|---|
Identity theft/Personal information | De-Anonymization; Neighborhood; Profile cloning; Existing profile cloning; Cross-Site profile cloning; Social phishing attack |
Spam | Simple spam attack; Email-Based spam; Broadcast spam; Context-Aware; HTTP session hijacking |
Malware | Fake profile; Social network API; Drive-by download; Shortened and hidden links; Cross-Site scripting attack; Click-Jacking |
According to the Kaspersky Lab Global IT Risk Review, half of the business threats are cyber threats. Remote attacks increase Cyber-Attacks as they allow attackers to attack any PC anytime anywhere around the globe. RAKKSSA framework provides safety guidelines to reduce the risk of Cyber-Attacks and protect the organization's information. Cyber threat intelligence can provide a timely reply to attacks [
In social sites, a phishing attack is the most serious Cyber-Attack [
Phishing is a malicious technique for stealing others’ data ethically and technically. Attackers contact people via different channels in social media [
For the past two years, the Anti-Phishing working group detected about 97.36% of phishing websites. Security companies provide solutions for users to manage malicious activities. PhishMe develops software for organization security workers to deal with phishing attacks just by clicking on a button provided in the E-Mail client Add-in [
The ecosystem of the phishing attack process assumes that the victim receives, for instance, a phishing email with a fake link from the attacker, and that the attacker manages a queue of phishing websites. These websites receive fake hosting and send the sensitive data collected from the phishing dataset to the attackers. Mihai [
Spear Phishing (SP) is an attack directed at stealing users’ information from a specific company website. It serves as a first step towards intrusion into the system. Clone Phishing (CP), in contrast, depends on cheating victims by making an identical copy of a legitimate website, in which the attackers set the trap. Whaling Phishing (WP) is a Cyber-Attack similar to Spear Phishing, but it targets High-Profiles [
The detective technique is the most significant, as it can reduce human errors by filtering and blocking access to phishing URLs that have installed kits. It has been observed that using a combination of hiding techniques may delay the detection of a site for up to ten hours. The preventive technique is implemented through Well-Built authentication. The corrective technique is implemented through measures like site removal [
This strategy distinguishes phishing and authentic sites by depending on programming devices that secure and differentiate attacks [
Phishing programming detection methodologies
Approaches | List-Based | Heuristic-Based | Machine Learning-Based |
---|---|---|---|
Description | It is a list of harmful IP addresses, or Anti-Phishing toolbars (e.g. Google Safe Browsing API). | It depends on some URL standards determined by cyber experts, like lexical features, host and webpage information, etc. | Discovers phishing websites within given URLs via online learning, which depends on several trained classifiers. |
Techniques | Whitelist-Based schemes; Blacklist-Based schemes | Phish-Guard; Phish-Wish; Visual similarity; Cantina; Cantina+; Data mining; Bogus biter | Bag-of-Word Model-Based Methods; Support Vector System; K-Nearest neighbor; BART (Bayesian additive regression trees); Neural networks; AdaBoost; Decision tree; Random forest algorithm; Naïve Bayes classifiers; Boosting; Logistic regression |
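The list-based and heuristic-based columns of the table above can be combined into a simple first-pass filter. The following is a minimal sketch under stated assumptions, not the implementation of any tool named in the table: the blacklist entries, the chosen lexical features, and the thresholds are all hypothetical and serve only to illustrate the idea.

```python
from urllib.parse import urlparse
import ipaddress

# Hypothetical blacklist; a real deployment would query a maintained feed.
BLACKLIST = {"phish.example.net", "login-update.example.org"}


def host_is_ip(host: str) -> bool:
    """Return True when the host part of the URL is a raw IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Extract a few illustrative lexical features from a URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "length": len(url),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip(host),
        "host_dots": host.count("."),
        "uses_https": parsed.scheme == "https",
    }


def first_pass_verdict(url: str) -> str:
    """List-based check first, then a heuristic score over lexical features."""
    host = urlparse(url).hostname or ""
    if host in BLACKLIST:
        return "blocked"
    f = lexical_features(url)
    # Assumed thresholds (75 characters, 3 dots, 2 flags), chosen for illustration.
    flags = [
        f["length"] > 75,
        f["has_at_symbol"],
        f["host_is_ip"],
        f["host_dots"] > 3,
        not f["uses_https"],
    ]
    return "suspicious" if sum(flags) >= 2 else "unknown"
```

A URL that passes this filter is only "unknown", not safe; in the machine learning-based column, features like these would instead be fed to a trained classifier.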
These approaches depend on users’ awareness and their ability to differentiate between phishing and authentic sites by improving their understanding of malicious assault [
To protect user accounts, researchers provide guidelines for securing accounts [
Phishing is a deceptive attempt to get sensitive information in which attackers are always finding new ways to trick clients using social networking tactics by persuading them to follow instructions in a flow [
The Collection of URLs websites contains numerous validated phishing URLs, such as the phishtank-dot-com website, as an alternative. The drawback is that it necessitates an additional feature extraction process based on rules, and it is reliant on third-party services. This approach is independent of third-party services and unnecessary specialist knowledge; however, the learning process will take longer. It’s simple to start using published datasets like the UCI machine learning dataset for the training process in academic articles, especially for complicated structured models like multi-layer neural networks [
Authors | Technique | Advantage | Disadvantage | Results obtained |
---|---|---|---|---|
(Gupta et al., 2021 [ |
Random forest | Without the usage of third-party services or the restricted attributes acquired from a URL, high accuracy and low response time were achieved. | There were no multiple datasets utilized to train the model, compare outcomes, or evaluate the model's resilience. | For 11964 instances of authentic and phishing URLs, the RF accuracy is 99.57%. |
(Sabahno et al., 2022 [ |
ISHO + SVM | The ISHO (improved spotted hyena optimization) technique has been improved to identify more efficient features. | A feature extraction process was not included in the proposed strategy. | They used SVM+ISHO with 98.64% accuracy using the UCI repository. |
(Odeh et al., 2021 [ |
Adaboost | Weka 3.6, Python, and MATLAB 2 were employed in the suggested model. | There were no numerous datasets used to train the model, compare the findings, or assess the model's robustness. | A collection of different sites such as PhishTank, MillerSmiles, and Google searching archives, achieved 99.00% accuracy. |
(Alsariera et al., 2020 [ |
Meta-learning algorithms and extra trees: LBET (logistic regression) | The accuracy is high, and the false-positive rate is minimal. | Additional methods to extract features and optimization strategies are required to boost the results obtained. | A collection of UCI repository websites is used to detect phishing attacks with 97.00% accuracy. |
(Adeyemo et al., 2020 [ |
Bootstrap aggregating + logistic model tree | To reduce bias and variance, the classifiers were trained and evaluated using 10-fold cross-validation. | There is a lack of information about the way used to extract features. | UCI repository dataset is used and achieved 97.18% accuracy. |
(Zamir et al., 2020 [ |
Random forest + neural network + bagging | Focuses on detecting phishing websites using a feedforward NN and ensemble learners. | To verify its applicability in a real-time setting, the suggested approach may be integrated with alternative feature extraction models. | They used a dataset from the Kaggle website and achieved 97.4% accuracy. |
(Wang et al., 2019 [ |
Recurrent neural network (RNN) + convolutional neural network | The first to use a deep learning model to identify phishing in the context of cybersecurity concerns, as well as train and test with hundreds of thousands of phishing and non-phishing website URLs. | The training session was far too long. When the URL of the phishing website lacks crucial semantics, PDRCNN will be unable to classify effectively, regardless of whether the website matching the URL is active or has a problem. | A dataset including nearly 500,000 URLs gathered from Alexa and PhishTank obtained 97.00% accuracy. |
(Aljofey et al., 2020 [ |
CNN | To compare the results of various sets of tests, four different groups of features are extracted. | The training time is extensive. The model is unconcerned with whether the URL of the website is active or includes an error. The algorithm will misclassify short links, sensitive terms, and phishing URLs that do not duplicate other websites. | They obtained 95.02% accuracy by collecting URLs from several sources (Alexa, openphish, spamhaus.org, techhelplist.com, isc.sans.edu, and PhishTank). |
(Anupam et al., 2021 [ |
Grey wolf optimizer + SVM | Nature-inspired optimization methodologies, in addition to the grid search-optimized RF classifier, may be utilized to tune the parameters of the Support Vector Machine (SVM) model to achieve high accuracy. | Because the dataset is so small, the model's findings cannot be compared across different datasets. | The UCI-ML repository dataset was used, and the average accuracy reached 90%. |
(Ali et al., 2019 [ |
Genetic algorithm (GA) + DNN | It’s a novel concept to use GAs to pick effective characteristics and weights. | There isn’t a way to extract features. Using GAs for feature selection and weighting may take longer. The detection accuracy may be reduced as compared to prior methodologies. | Using DNN, they obtained 93.34% accuracy. Out of 1353 websites in the UCI phishing websites dataset, there are 702 phishing websites, 548 legal websites, and 103 questionable websites. |
(Deepa, 2021 [ |
Convolutional autoencoder + DNN | A convolutional autoencoder was used to extract features. | When compared to previous approaches, the detection accuracy may be lower. For deep learning models, the dataset is small. | They collected 16000 phishing and legitimate URLs. The phishing sites are made up of 12000 phishing URLs taken from PhishTank. They achieved 89.00% accuracy. |
(James et al., 2013 [ |
J48, JBK, SVM, NB | For analyzing numerous elements of benign and phishing URLs and detecting phishing websites, use lexical features, host attributes, and page priority properties and use fine-tuned parameters to separate the phishing sites from benign sites. | To constantly build new methods to fight defense measures, algorithms that react to new examples and features of phishing URLs are required. | A collection of URL websites from different resources are utilized with an average accuracy of 93.00%. |
(Mao et al., 2018 [ |
SVM, RF, DT, AB | When it comes to detecting phishing pages, this tool is both accurate and robust. Create rules to determine the layout similarity of web pages and then detect phishing pages automatically. Phishtank.com provided over 2,900 phishing websites. | They should employ a prototyped strategy and test it against a huge number of phishing websites. Their technique has the potential to significantly improve the performance of existing antiphishing systems. | They compiled a list of phishing websites from phishtank.com. They verified and filtered such invalid pages first. They achieved an average 93.00% accuracy. |
(Buber et al., 2017 [ |
Decision Tree, Adaboost, K-star, kNN (n = 3), Random Forest, SMO and Naive Bayes, and different number/types of features as NLP based features, word vectors, and hybrid features. | The utilization of a significant number of phishing and genuine data, real-time execution, new website detection, independence from third-party services, and the use of feature-rich classifiers are all benefits. The Random Forest approach with solely NLP-based features has a 97.98 percent accuracy rate for phishing URL recognition. | Deep learning can be used to build the knowledge base to improve the system's efficiency. | Many tests were run on the proposed system, and the results indicated that the Random Forest algorithm attained 97.2 percent accuracy. |
(Xiang et al., 2011 [ |
Feature-rich machine learning approach | Expands the number of features from their prior work to catch continually evolving novel phishing attempts. | Small dataset (8118 phishing pages and 4883 authentic web pages). Relies on third-party services. Uses data about a certain location (top 100 English sites). 6 URL-based features, 4 HTML-based features, and 5 web-page features. | They achieved 92% accuracy based on the applied URL websites. |
(Le et al., 2011 [ |
Detects phishing websites by categorizing them using URL characteristics. | Based on an online classification, this product is suited for client-side deployment. Tolerant of noisy training data. | Relies on third-party services. Limited dataset (6083 malicious URLs and 8155 benign URLs). | They achieved 92.00% accuracy. |
(Jeeva et al., 2016 [ |
Algorithms for generating apriori and predictive apriori rules. | Rapid rule detection (particularly with apriori rules). | Relies on classification rules and depends on the quality of those rules. Restricted dataset (1200 phishing URLs and 200 authentic URLs). 14 heuristic characteristics. Nine apriori rules and nine predictive apriori rules. | They obtained 93.00% accuracy. |
(Babagoli et al., 2019 [ |
A nonlinear regression approach based on meta-heuristics and two feature selections. | The original UCI dataset has been reduced from 30 to 20, and decision trees will perform better with this feature set. | 20 features are used in a restricted dataset (11055 phishing and authentic web pages). | The Harmony Search-based nonlinear regression yielded accuracy rates of 94.13% and 92.80% for the train and test procedures, respectively. |
(Mohammad et al., 2014 [ |
Self-structuring neural networks with a type of artificial neural network. | In order to create network language independence, it employs an adaptive technique. | Third-party services (such as domain age) are utilized. a small sample size (1400 data). There are 17 features. | The major results indicated that the accuracy is 94.07% for 1000 Epochs. |
(Feng et al., 2018 [ |
Neural network with Monte Carlo algorithm | Not reliant on third parties. Real-time detection improves detection accuracy and consistency, as well as the ability to detect new phishing websites (zero-day attacks). | The use of third-party services necessary to obtain the whole page dataset is restricted (11055 data, 55.69 percent of which are phishing). 30 characteristics (address bar based, abnormal based, HTML and javascript based, domain based). | |
(Smadi et al., 2018 [ |
Neural network approach with reinforcement learning | Phishing emails were detected before the end-user saw them. Do not rely on real-time detection from third parties. | A limited sample size (9118 data, 50.0 percent of them are phishing). The 50 characteristics of PhishTank, 12 of which are URL-based, may be utilised to establish a blacklist. | The accuracy = 98.63%. |
(Peng et al., 2018 [ |
NLP and machine learning (using the Naïve Bayes classifier). | Natural language processing is used to determine whether each sentence is suitable. | Based on the analysis of email text. | The accuracy = 95%. |
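The 10-fold cross-validation mentioned in the table above (in the bootstrap aggregating + logistic model tree row) can be illustrated with a short index-splitting routine. This is a generic sketch, not the procedure of any cited study; the function name `k_fold_indices`, the shuffling seed, and the way the remainder is spread across folds are assumptions.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

Each sample appears in exactly one test fold, so every record is used for evaluation once and for training in the remaining k − 1 rounds; the reported score is then typically the average over the folds.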
Machine learning models based on Neural Network (NN), Adaboost, and Naïve Bayes (NB) are utilized in this work to investigate the detection of phishing attacks using a dataset found on the Kaggle website “
Neural Network (NN) | Naïve Bayes (NB) | Adaboost | |
---|---|---|---|
Accuracy | 90.23% | 92.97% | 95.43% |
Precision | 86.83% | 92.54% | 95.70% |
Sensitivity | 97.21% | 95.05% | 96.12% |
Specificity | 81.46% | 92.97% | 95.43% |
F1-score | 91.72% | 93.77% | 95.91% |
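The five metrics in the table above all derive from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; it is not the authors' code. The function names are hypothetical, and treating the label 1 as the positive class is an assumption, since the text only states that the labels are 1 and −1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics reported in the table from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # also called recall
    specificity = safe_div(tn, tn + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }
```

For example, `classification_metrics(*confusion_counts(y_true, y_pred))` returns all five values for a pair of label lists.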
The major limitations of current efforts to detect phishing attacks can be summarized in the following points. The preprocessing of data collected from the applied URL websites, including imputation and normalization, should be performed before feature selection and extraction, especially for large-scale datasets. URL information is variable and changes over time, including the updated version, IP setting, or any other criteria; these changes therefore need to be maintained and tracked continuously in order to tackle new attacks and detect phishing attacks. In addition, the training period is lengthy. The model is indifferent to whether the website’s URL is active or contains an error. Short links, sensitive phrases, and phishing URLs that do not replicate other websites will be misclassified by the system.
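The imputation and normalization steps named above can be sketched as follows. This is one common choice (median imputation followed by min-max scaling), offered as an assumption rather than the preprocessing used in this work; the function names and the convention that missing values are represented by `None` are hypothetical.

```python
from statistics import median


def impute_median(rows):
    """Replace None entries with the median of the observed values in that column."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 for a column with no observed values at all.
        medians.append(median(observed) if observed else 0.0)
    return [
        [medians[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


def min_max_normalize(rows):
    """Scale every column to the range [0, 1]; constant columns become 0.0."""
    n_cols = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_cols)]
    highs = [max(row[j] for row in rows) for j in range(n_cols)]
    return [
        [
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ]
        for row in rows
    ]
```

In practice the medians and the column ranges would be computed on the training split only and then reused for the test split, so that no information leaks from the evaluation data.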
Phishing is a serious security concern. It has a significant impact on the economic and online shopping sectors. Because online applications are a crucial interface for accessing and configuring user data, improper use of the web opens the door to targeted assaults by phishers who use websites that are aesthetically and semantically identical to legitimate websites. Securing the online interface necessitates solutions that address dangers posed by both technological and social vulnerabilities. In the field of secure computing, preventing phishing attacks is a top goal and a serious difficulty. In this paper, we have presented a comparative study of multiple classifiers that improve webpage security by detecting phishing websites through URL inspection. Machine learning techniques are a formidable defense and have a high learning capacity for making online message recipients aware of attacks and fraudulent websites. They can determine whether a website is safe or a phishing one. Detection approaches can be assessed by properties such as datasets, feature extraction and detection algorithms, and performance evaluation metrics, and they can serve as prevention tools. Attackers frequently overcome existing phishing defense methods based on URLs or page contents. The results showed that the accuracies achieved were 90.23%, 92.97%, and 95.43% using the NN, NB, and Adaboost ML models, respectively, which indicates the reliability and robustness of the proposed method compared with the state-of-the-art methods.
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
A data availability found at
The authors declare that they have no conflicts of interest to report regarding the present study.