The Tor dark web network has been reported to provide a breeding ground for criminals and fraudsters who are exploiting the vulnerabilities in the network to carry out illicit and unethical activities. The network has unfortunately become a means to perpetuate crimes like illegal drugs and firearm trafficking, violence and terrorist activities among others. The government and law enforcement agencies are working relentlessly to control the misuse of Tor network. This is a study in the similar league, with an attempt to suggest a link-based ranking technique to rank and identify the influential hidden services in the Tor dark web. The proposed method considers the extent of connectivity to the surface web services and values of the centrality metrics of a hidden service in the web graph for ranking. The modified PageRank algorithm is used to obtain the overall rankings of the hidden services in the dataset. Several graph metrics were used to evaluate the effectiveness of the proposed technique with other commonly known ranking procedures in literature. The proposed ranking technique is shown to produce good results in identifying the influential domains in the tor network.
The rapid growth of Internet technology and its outreach to the general public has made it easier to conduct a range of services. Along with this, the ever-growing technology has also been used to perform illegal and criminal activities. The dark web is a prime example of the abuse of Internet technology where unethical and illegal acts are predominantly present. The sale of forged documents, counterfeits, illegal drugs and firearm trafficking are some of the grey activities that are carried out on the dark web [
The profound level of privacy and anonymity provided by the Tor dark web has increased the illegal activities on these anonymous platforms. The law enforcement agencies have been continuously monitoring the hidden services to track the criminals. Among all the hidden services in the Tor network, some of the influential or key hidden services that exclusively offer illegal products and services usually have a significant user base. Such hidden services need immediate and dedicated efforts by the law enforcement agencies to track them down. The identification of such hidden services could prove beneficial and to the initiatives being undertaken by the law enforcement agencies in monitoring the illegal activities over the Tor dark web network.
Most of the studies on the Tor dark web network explore the attributes and working of the hidden services by scraping the data using a web crawler followed by its analysis to get insight into the hidden services and possible conclusions. However, the data collected could also be used to identify the prominent hidden services among others by using the graph theory and link analysis algorithms. The influential domain in a Tor network is the one on which the users end after surfing the network moving from one hidden service to the other. In this paper, we present a hyperlink based algorithm for ranking the Tor hidden services to identify the influential services in the Tor network.
The government and law enforcement agencies could focus their efforts on the top-ranked hidden services for further surveillance and monitoring. Though the top-ranked websites identified by the proposed ranking algorithm represent the key hidden services, the other hidden services should not be left off, instead, the concerned authorities should devise strategies to appropriately divide their efforts in targeting the illicit content on the Tor dark web. The proposed ranking algorithm is shown to be performing better than the other link analysis algorithm for the dark web. The performance of the proposed ranking algorithm is evaluated on the metrics like the graph density, clustering coefficient, average shortest path length, giant component and diameter of the equivalent graph representing the Tor network. The dataset for the experiment was collected with the help of the
The rest of the paper is divided as follows: Section 2 reviews the existing work on the link analysis algorithms. Section 3 describes the proposed ranking methodology. Section 4 provides the experimental setup followed by results and discussion in Section 5. Section 6 provides the conclusion of the work.
The Tor dark web has experienced a considerable increase in the number of hidden services after the rise of the Silk Road marketplace that has reported the business of more than USD 150 million [
Al-Nabki et al. [
In another study, a content-based approach was proposed for ranking the Tor hidden services [
While only limited work has been performed to rank the Tor content for identifying key services, plenty of work has been done on the same ranking task for the surface web content. Both the link-based and content-based approaches were adopted including the hybrid approach for ranking the surface web services [
In yet another study, several graph metrics like
The ToRank algorithm only considers the hyperlinks between the hidden services to rank them and to detect the influential domains. However, it has been found that most of the Tor hidden services also have connectivity to the surface web [
The proposed ranking approach requires the generation of a web graph corresponding to the Tor hidden services in the dataset. The ranking algorithm is then applied on the graph to obtain a list of the hidden services in a ranking order.
The web graph is created by representing each of the hidden services in the dataset as the nodes or vertices of the dataset. The hyperlinks between the two hidden services are represented by a directed edge between their corresponding nodes in the graph. All the self-loops and parallel edges were removed from the graph. The out-going hyperlinks to the surface websites were recorded for each of the nodes. In case of several hyperlinks from a node to the different web pages of a surface website, only one hyperlink was considered for that node. A hyperlink is discarded if it points to a hidden service that does not exist in the dataset. The web graph thus obtained is used by the algorithm for ranking.
The proposed ranking approach consists of three different components to measure the influence of nodes in the graph. The definition of the various symbols is given in
Symbol | Definition |
---|---|
Set of nodes in the Tor web graph | |
Set of edges in the Tor web graph | |
Number of surface web hyperlinks of a node |
|
Degree centrality of a node |
|
Betweenness centrality of a node |
|
Closeness centrality of a node |
|
The overall influence of the node |
In this paper, we have considered the significance of the out-going surface web links to the importance of a hidden service in the network. The hyperlinks to surface websites may indicate that the hidden service is willing to advertise its product and services over the open web where a relatively large number of users are present. The influence of the surface web hyperlinks of a node v_i is
The influence of the node connectivity is defined by the significance of the location of the node in the Tor web graph. A user arriving at a node with a good location in the network would have a better probability of moving to other regions of the network through the neighboring nodes, a neighbor of the neighboring nodes and so on until the user reaches the desired node. Therefore, such nodes hold an influential position in the Tor ecosystem and identification of the same may be fruitful in terms of law enforcement perspective.
The centrality metrics used in the graph theory indicates the relative significance of a node in the graph. The different centrality measures reflect the importance of the node’s location in the graph. The centrality value can indicate the ability of the node to bypass a large number of users because it has several paths passing through them [
In the Tor web graph, a node with a high degree centrality would facilitate the movement of a higher number of users through it than the node with a low degree centrality. If the node has a greater closeness centrality value, it would better control the movement of the user and allows the rapid dispersal of the users. The greater value of the betweenness centrality of the node would enable a quicker movement of the users to the entire web graph through a few intermediate nodes. The influence of the connectivity of the node v_i in the network is represented by
The ranking metric presented here is used to assign an influence score to a node based on the influence value of the node’s surface web links and its connectivity in the graph. The influence metric that measures the overall influence of the node differs from the centrality metrics in the sense that it considers the ability of the node in facilitating the movement of the users through the entire Tor network. Each of the nodes in the graph has a different ability to propagate the users and hence have a different influence score. The influence score of the node v_(i) is given by
In the Tor network, each of the hidden services has a different influence and significance as compared to the other services. Thus the procedure of identifying the key hidden services is equivalent to the ranking problem where the top-ranked service would be the most influential one among others. The influential nature of a hidden service is governed by other hidden services to which the former has out-going hyperlinks along with its characteristics. If a hidden service has outgoing connections with the highly influential hidden services, then it would positively contribute to its influence. The PageRank algorithm is appropriate in such scenarios for incorporating the influence of the neighboring nodes [
The proposed ranking algorithm is based on a modified PageRank algorithm to detect the influential hidden services from the Tor web graph. Each of the nodes in the Tor web graph is assigned an initial value reflecting the influence score and is updated iteratively according to the
In
The final rank of the node v_i is governed by two factors: A.) the influence score
The dataset for the study was created using a customized Python web crawler for scraping the Tor hidden services. The web crawler was designed to connect to the hidden services using the SOCKS proxy [
For our ranking procedure, the content of each of the hidden services needs to be parsed to extract only the hyperlinks and discard other textual content. A parser based on regular expressions was utilized to extract the hyperlinks to the surface web and Tor dark web. The hyperlinks to other dark web networks, internet relay chat addresses and sub-domains were eliminated.
The performance of the proposed ranking algorithm is compared with that of the PageRank [
The graph density curve could be used to measure the robustness of the graph [
The clustering coefficient [
The Python NetworkX library was used to construct the graph from the corresponding hidden services in the dataset. The proposed ranking algorithm iteratively computes the score of each of the nodes until convergence is achieved. The algorithm is supposed to achieve convergence when the error between the score of a node in the current iteration and the preceding iteration is less than 0.0001. The parameter value of the PageRank and ToRank were set according to as in previous work [
The Tor web graph generated from the dataset consists of 4041 nodes representing the corresponding hidden services and 14059 edges. The influential score returned by the ranking algorithm is used to rank the nodes in descending order. The top-ranked nodes were repeatedly removed and the graph density was calculated at every iteration. The graph density curve of the PageRank, ToRank and the proposed ranking algorithm are shown in
The performance of the proposed ranking algorithm on the other graphs’ robustness is shown in
Algorithms | Nodes removal (%) | Clustering coefficient | Average shortest path | Giant component | Diameter |
---|---|---|---|---|---|
Full graph | 0.199 | 4.37 | 4761 | 12 | |
PageRank | Top 1 | 0.288 | 4.42 | 4689 | 12 |
Top 5 | 0.248 | 4.49 | 4521 | 12 | |
Top 10 | 0.221 | 4.53 | 3873 | 13 | |
Top 20 | 0.07 | 4.55 | 3325 | 13 | |
ToRank | Top 1 | 0.041 | 4.85 | 2997 | 17 |
Top 5 | 0.02 | 5.43 | 2030 | 20 | |
Top 10 | 0.011 | 6.07 | 752 | 23 | |
Top 20 | 0.006 | 7.12 | 131 | 25 | |
Proposed | Top 1 | 0.051 | 4.83 | 3160 | 19 |
Top 5 | 0.018 | 5.78 | 1606 | 23 | |
Top 10 | 0.007 | 6.24 | 537 | 25 | |
Top 20 | 0.0 | 7.69 | 10 | 26 |
The proposed ranking algorithm is based on an iterative calculation of the influence score depending on the connectivity of the node with its immediate neighbors. Initially, a small influence score is assigned to each of the nodes, once the algorithm achieves convergence, the node that has connections to the other influential nodes in the network will have a much better score than the other. However, being a link analysis algorithm, it cannot evaluate the influence of isolated nodes with zero outgoing and incoming hyperlinks.
An iterative algorithm is proposed in this work to calculate the influence of a hidden service in the Tor network. The influence of a hidden service depends upon its location and connectivity in the Tor network as well as its connectivity to the surface web. Moreover, the influential nature of the hidden service is also determined by its connection to the other services in the network. All these factors are incorporated into an algorithm to determine the overall influence of a hidden service. The proposed algorithm is implemented on a dataset of Tor hidden services and is compared with other link analysis algorithm from the existing literature. The proposed algorithm achieves better performance than the other algorithms measured in terms of various graphs’ robustness metrics.
This research was supported by Taif University Researchers Supporting Project Number (TURSP-2020/231), Taif University, Taif, Saudi Arabia.