Analysis of Flight Punctuality Rate Based on Complex Network

Abstract: Aviation services play an indispensable role in social and economic development, yet poor flight punctuality is an increasingly serious problem. Flight delays cause a variety of implicit and explicit losses to passengers and airlines, so it is necessary to analyze the factors that influence the flight punctuality rate. Complex networks can be used to study objects with complex relations, to obtain the relations among these objects, and to calculate the influence of different indicators on them. This article covers three aspects. First, flight data are obtained from the Internet and built into a directed network using complex network theory. Second, the directed network is analyzed: its centrality indices are computed and the influence of each centrality index on flight punctuality is studied. Third, statistics of airport indicators, including network indicators, are collected and the correlation between each indicator and punctuality is calculated. Through these three steps, we study how strongly each factor influences the flight on-time rate and whether the influence is positive or negative. The study of these factors provides a reference for practical efforts to improve flight punctuality.


Introduction
With the development of the social economy, air service is playing an increasingly important role in daily life. However, owing to weather, policy and other factors, flight delays occur often. If flights frequently fail to arrive on time, the development of airports, airlines and even the entire aviation industry will be seriously affected. Machine learning and aviation big data can be used for air traffic flow analysis [1]. In addition, the fusion of meteorological data can be used for flight delay prediction [2], and some algorithmic models have been proposed for air traffic flow management [3]. Meanwhile, much research is devoted to robotic path planning using machine learning [4], multi-robot path planning using graph neural networks [5], and heuristic approaches to robot path planning [6]. Deep reinforcement learning [7] can also be used for motion planning and control, and there is also research on real-time motion planning [8].
Many complex systems in real life can be studied by constructing complex networks, such as power networks, aviation networks, transportation networks, computer networks and social networks. The optimization of cascading failures in complex networks can be studied using NNIA (Non-Dominated Neighbor Immune Algorithm) [9], cascading failures of close networks [10] can be studied using different coupling preferences under targeted attack, and there is work on stopping failure cascades [11]. Percolation of networks with dependent groups [12] can also be studied. Besides, complex networks can be used to study the robustness of interdependent weighted networks [13] and of multimodal networks in the face of random node failure [14]. There is also research on directed gene logic networks [15] and on dynamic interdependence and competition in multilayer networks [16]. The cascading vulnerability of networks [17] is another research direction, as is rumor spreading and rumor control [18]. Central attacks in complex networks [19] and link removal strategies for complex networks [20] can also be explored. Thus a complex network is not only a representation of data, but also a method for processing and analyzing data. Therefore, complex networks can be used to visualize aviation network data and to analyze the properties of the network, so as to explore the factors influencing the flight on-time rate and to put forward reasonable suggestions for practical application.
The rest of the paper is organized as follows. Section 2 introduces the requests method for data acquisition and the complex network for data processing and analysis. Section 3 covers the details from data acquisition to network construction, and from network analysis to correlation calculation. Section 4 presents the experimental results and analyzes them in light of practical application. Section 5 gives a brief summary of this article.

Method
In this paper, we focus on the potential connections between the on-time rate of flights and the state of the current airline network. To identify patterns in the on-time rate of airports, we explore the structure of the airline network in the hope of finding effective indicators.

The Building of Airline Network
We construct the airline network as G = (V, E) using complex network theory, where V is the node set of the network and E is the set of edges between nodes. In this paper, the nodes of the airline network denote airports, and the routes between airports are taken as edges. The weight of an edge is the number of flights between the airports at its two ends.

The Basic Features of Airline Network
Some basic features of the airline network are listed in Tab. 1 for convenience, including indices of degree, degree distribution and path length.

Main Centrality Index
Degree centrality: The centrality of a node is measured by its ability to make direct connections with other nodes. It can be written as (1).
Closeness centrality: It is defined as the reciprocal of the sum of the shortest distances between a node and all the other reachable nodes in the network. The closer a node is to other nodes, the less it depends on other nodes in transmitting information, and therefore the higher its centrality. The specific definition can be seen in (2).
Betweenness centrality: The centrality of a node is determined by its ability to be the "middle man", which reflects the control role of the node in the overall network. It can be written as (3).
Eigenvector centrality: The importance of a node depends both on the number of its neighbors, that is, the degree of the node, and on the importance of those neighbors. It can be abstracted as (4), where c is a proportionality coefficient and A is the adjacency matrix of the network; the centrality values reach a steady state after multiple iterations.
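For reference, the four indices described above can be written in their standard textbook forms as follows. This is a sketch consistent with the verbal definitions, and the notation is assumed here: N is the number of nodes, k_i the degree of node i (in-degree plus out-degree in a directed network), R(i) the set of nodes reachable from i, d_ij the shortest-path distance from i to j, sigma_st the number of shortest paths from s to t, sigma_st(i) the number of those paths passing through i, x_i the eigenvector centrality of node i, and a_ij the entries of A. The 1/(N-1) normalization of the degree centrality is also an assumption.

```latex
\begin{align}
C_D(i) &= \frac{k_i}{N-1} \tag{1}\\
C_C(i) &= \frac{1}{\sum_{j \in R(i)} d_{ij}} \tag{2}\\
C_B(i) &= \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}} \tag{3}\\
x_i &= c \sum_{j=1}^{N} a_{ij}\, x_j \tag{4}
\end{align}
```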

Problem Analysis
When the on-time rate of flights is too low, it not only causes losses to customers but also harms the credibility of airports and airlines, causes economic losses to the aviation industry, and affects its future development. Therefore, we build the flight data into a complex network, use the network to analyze and calculate relevant properties, explore the factors influencing the flight punctuality rate, and put forward suggestions based on the actual situation.

Data Acquisition and Network Construction
To analyze the factors influencing the flight on-time rate, it is necessary to obtain relevant flight data, such as the departure airport, destination airport, on-time rate and other information. We crawl daily flight information from the Internet with Python's requests library and store it locally as a JSON file for processing and analysis. Some of the crawled data is in the form of pictures, and the text in the pictures needs to be extracted, so Python's pytesseract library is used to recognize it.
After getting the data, we build it into a complex network for subsequent processing and analysis using Python's NetworkX library. We first read the locally saved JSON file and read the flight information item by item in the form of a Python dictionary. In the flight information, the take-off airport is the source node in the network, the landing airport is the destination node, and the route between them is the edge. The weight of an edge is the number of flights on that route. In the data obtained, each flight has a punctuality rate. We average the punctuality rates of all flights on each route, and then aggregate these route averages over all edges corresponding to each node to obtain the punctuality rate of that node. In the constructed network, each airport is a node with a corresponding punctuality rate, each route is an edge, and the weight of an edge is the number of flights on that route. The resulting network is shown in Fig. 1:

Figure 1: Aviation network
The network is constructed as a directed graph, with each edge directed from the take-off airport to the landing airport. The network we built includes more than 40 airports and more than 1200 routes. The ID of each airport node is the three-character code of the airport.
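The construction described above can be sketched in a few lines. This is a minimal illustration using only the standard library rather than NetworkX; the record fields (`origin`, `dest`, `on_time`), the airport codes and the rates are hypothetical, and the node rate is assumed to be the mean of the route averages over all routes that start or end at the airport, which is one possible reading of the aggregation step.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flight records; codes and rates are invented for illustration.
flights = [
    {"origin": "PEK", "dest": "SHA", "on_time": 0.80},
    {"origin": "PEK", "dest": "SHA", "on_time": 0.90},
    {"origin": "SHA", "dest": "PEK", "on_time": 0.70},
    {"origin": "PEK", "dest": "CAN", "on_time": 0.60},
]


def build_network(records):
    """Group flights by directed route (origin, dest).

    Returns two dicts keyed by route: the edge weight (number of flights)
    and the mean on-time rate of the flights on that route.
    """
    rates_by_route = defaultdict(list)
    for rec in records:
        rates_by_route[(rec["origin"], rec["dest"])].append(rec["on_time"])
    edge_weight = {route: len(r) for route, r in rates_by_route.items()}
    edge_rate = {route: mean(r) for route, r in rates_by_route.items()}
    return edge_weight, edge_rate


def node_on_time_rate(edge_rate):
    """Assumed aggregation: mean of the route averages over all routes
    that start or end at the airport."""
    rates_by_node = defaultdict(list)
    for (origin, dest), rate in edge_rate.items():
        rates_by_node[origin].append(rate)
        rates_by_node[dest].append(rate)
    return {node: mean(r) for node, r in rates_by_node.items()}


edge_weight, edge_rate = build_network(flights)
node_rate = node_on_time_rate(edge_rate)
print(edge_weight[("PEK", "SHA")])  # number of PEK -> SHA flights
print(node_rate)
```

With NetworkX, the same edges could be loaded into an `nx.DiGraph` by calling `add_edge(origin, dest, weight=...)` for each route.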

Network Analysis
After the network is constructed, its properties need to be analyzed. We first record the on-time rate distribution over all nodes to observe the overall on-time performance of the entire network, then divide the airports into airport groups according to their geographical locations and compare the on-time rate distributions of the different groups. After that, degree centrality, closeness centrality, betweenness centrality and eigenvector centrality are calculated. Each centrality is taken as the x-coordinate and the punctuality rate of the nodes as the y-coordinate, and the resulting plots are used to analyze the influence of each centrality on the punctuality rate of nodes.

Correlation Calculation
In addition to the properties of the network, the GDP, population, trade and other indicators of the city in which an airport is located may also affect the airport's on-time performance. Therefore, we need to study not only the impact of the properties of the constructed network on the on-time rate, but also the impact of economic, demographic and other indicators. Different factors influence the punctuality rate to different degrees, and some may be positively correlated with it while others are negatively correlated. We therefore selected some representative airports, listed their on-time rates and indicators, and calculated three correlation coefficients between each indicator and the on-time rate: Pearson, Spearman and Kendall.
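The three coefficients can be sketched with the standard library as follows. This is an illustrative implementation, not necessarily the one used for the results: the data are hypothetical, Spearman is computed as Pearson on average ranks, and Kendall is the simple tau-a variant, which assumes there are no tied values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Linear correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs (no ties assumed)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)


# Hypothetical indicator values and on-time rates for five airports.
indicator = [1.0, 2.0, 3.0, 4.0, 5.0]
rate = [0.70, 0.74, 0.73, 0.80, 0.85]
print(pearson(indicator, rate), spearman(indicator, rate), kendall(indicator, rate))
```

In practice, `scipy.stats` offers `pearsonr`, `spearmanr` and `kendalltau`, which also report p-values and handle ties.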

On-Time Rate Distribution
Fig. 2 shows the frequency distribution histogram of the on-time rate of all nodes in the whole network, reflecting the overall distribution of punctuality. The airports are divided into three groups according to geographical location: area1 is the southwest and south-central airport group; area2 consists of the airport groups of central China, north China, northeast China, northwest China and Xinjiang; and area3 is the east China airport group. As can be seen from the figure, most on-time rates fall in the interval from 75% to 85%, and the proportions of airports with extremely high or extremely low on-time rates are comparatively small. Area2 has airports with on-time rates below 70%, area1 has airports with on-time rates above 85%, and area3 has airports across the whole range of on-time rates. Airports in area2 and area3 generally have higher on-time rates than those in area1, possibly because of geographic location. From the figure, we can see that area1 has the highest proportion of airports in the low on-time interval of 65% to 75%, and area3 has the highest proportion in the high on-time interval of 85% to 90%.
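The per-group frequency distribution behind such a histogram can be computed as in the following sketch. The airport records are hypothetical and the bin width of 5 percentage points is an assumption chosen to match the intervals quoted in the text.

```python
from collections import Counter

# Hypothetical (area, on-time rate in %) pairs, invented for illustration.
airports = [
    ("area1", 68.0), ("area1", 72.5), ("area1", 78.0),
    ("area2", 76.0), ("area2", 81.0), ("area2", 83.5),
    ("area3", 79.0), ("area3", 86.0), ("area3", 88.5),
]

BIN_WIDTH = 5  # percentage points per histogram bin (assumed)


def bin_label(rate):
    """Return the half-open bin [low, low + BIN_WIDTH) that contains rate."""
    low = int(rate // BIN_WIDTH) * BIN_WIDTH
    return (low, low + BIN_WIDTH)


def frequency_by_area(records):
    """Relative frequency of each bin within each area group."""
    counts = Counter((area, bin_label(rate)) for area, rate in records)
    totals = Counter(area for area, _ in records)
    return {key: n / totals[key[0]] for key, n in counts.items()}


freq = frequency_by_area(airports)
for (area, (low, high)), share in sorted(freq.items()):
    print(f"{area} [{low}%, {high}%): {share:.2f}")
```

Normalizing within each group, rather than over all airports, keeps groups of different sizes comparable.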

Centrality
Degree centrality reflects the importance of a node in the network through its degree, that is, its out-degree and in-degree. In other words, degree centrality reflects the number of nodes directly connected to the node. In the network constructed in this paper, it corresponds to the number of airports directly connected to an airport. The greater the degree centrality of an airport, the more airports are associated with it and the more air routes it has, indicating that the airport is more likely to be a hub airport or to act as a transit station in the whole network.
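As a small illustration, degree centrality on a directed route list can be computed as below. The routes are hypothetical, and the normalization by N - 1 over the sum of in-degree and out-degree is an assumption; under it, values can exceed 1 in a directed network.

```python
# Hypothetical directed routes between made-up airport codes.
routes = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A"), ("B", "C"), ("D", "A")]


def degree_centrality(edges):
    """(in-degree + out-degree) / (N - 1) for every node of a directed graph."""
    nodes = {n for edge in edges for n in edge}
    degree = dict.fromkeys(nodes, 0)
    for origin, dest in set(edges):  # ignore duplicate routes
        degree[origin] += 1
        degree[dest] += 1
    scale = 1 / (len(nodes) - 1)
    return {n: d * scale for n, d in degree.items()}


centrality = degree_centrality(routes)
print(sorted(centrality.items()))
```

Here airport A has two outgoing and three incoming routes among four airports, so its degree centrality is 5/3, the largest in this toy network.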
Closeness centrality reflects the closeness of a node to other nodes. The higher the closeness centrality of a node, the less it needs to rely on additional nodes when passing information to other nodes. In the flight network we constructed, the higher the closeness centrality of an airport, the fewer additional airports are needed when flying from that airport to other airports.
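The following sketch computes closeness centrality as the reciprocal of the summed shortest distances to all reachable nodes. Distances are counted in flight legs (hops) along outgoing routes, which is an assumption, and the route list is hypothetical.

```python
from collections import deque

# Hypothetical directed routes between made-up airport codes.
routes = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]


def hop_distances(adjacency, source):
    """Breadth-first search: fewest legs from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def closeness_centrality(edges):
    """1 / (sum of hop distances to reachable nodes); 0 if nothing is reachable."""
    adjacency = {}
    nodes = set()
    for origin, dest in edges:
        adjacency.setdefault(origin, set()).add(dest)
        nodes.update((origin, dest))
    result = {}
    for node in nodes:
        total = sum(hop_distances(adjacency, node).values())
        result[node] = 1 / total if total > 0 else 0.0
    return result


closeness = closeness_centrality(routes)
print(sorted(closeness.items()))
```

Library implementations may normalize the value or measure distances in the other direction, so absolute numbers can differ from this sketch.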
Betweenness centrality reflects, to some extent, the importance of a node as a mediator. In the constructed flight network, it reflects the degree to which an airport lies between other airports. If an airport has a high betweenness centrality, it lies between many airports and acts as an intermediary for them, so the airport may be more important in the network.
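The "intermediary" role can be made concrete with a short sketch of Brandes' algorithm for unweighted directed graphs. The hub-and-spoke routes are hypothetical, and the score is left unnormalized: it counts, over ordered airport pairs, the fraction of shortest paths that pass through each airport.

```python
from collections import deque

# Hypothetical hub-and-spoke routes: every spoke connects only to hub H.
routes = [("A", "H"), ("H", "A"), ("B", "H"), ("H", "B"), ("C", "H"), ("H", "C")]


def betweenness_centrality(edges):
    """Unnormalized betweenness via Brandes' algorithm (hop-count paths)."""
    adjacency = {}
    nodes = set()
    for origin, dest in edges:
        adjacency.setdefault(origin, set()).add(dest)
        nodes.update((origin, dest))
    score = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # BFS from source, counting shortest paths (sigma) to every node.
        dist = {source: 0}
        sigma = dict.fromkeys(nodes, 0)
        sigma[source] = 1
        preds = {n: [] for n in nodes}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


betweenness = betweenness_centrality(routes)
print(sorted(betweenness.items()))
```

In this toy network every trip between two spokes must pass through H, so H collects all six ordered spoke pairs while the spokes score zero.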
Eigenvector centrality reflects the importance of the nodes connected to a node. The higher the eigenvector centrality of a node, the more important the nodes connected to it. In the flight network we constructed, a higher eigenvector centrality indicates that an airport is connected with more important airports, and that it is more likely to be an important airport itself.
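A minimal power-iteration sketch of this idea follows. The routes are hypothetical, each airport is assumed to inherit importance from the airports that fly into it, and the node's own score is added in every step (a shift by the identity matrix, which leaves the leading eigenvector unchanged) to keep the iteration from oscillating.

```python
from math import sqrt

# Hypothetical two-way routes: A-B, A-C, A-D and B-C.
pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
routes = pairs + [(v, u) for u, v in pairs]


def eigenvector_centrality(edges, iterations=200, tol=1e-10):
    """Power iteration on the incoming-route adjacency structure."""
    nodes = sorted({n for edge in edges for n in edge})
    incoming = {n: [] for n in nodes}
    for origin, dest in set(edges):
        incoming[dest].append(origin)
    x = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # New score = own score + scores of all airports flying into this one.
        new = {n: x[n] + sum(x[m] for m in incoming[n]) for n in nodes}
        norm = sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - x[n]) for n in nodes) < tol:
            return new
        x = new
    return x


eig = eigenvector_centrality(routes)
print(sorted(eig.items()))
```

Airport D has a single neighbor, but that neighbor is the best-connected airport A, so D still receives a nonzero score; B and C score higher because they are linked both to A and to each other.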
The relationship between each centrality and the node punctuality rate is shown in Fig. 3.

Figure 3: Centrality and node punctuality rate
Fig. 3(a) shows the relationship between the degree centrality of nodes and the punctuality rate. The degree centrality of most nodes lies in the interval from 1.2 to 1.8, indicating that most nodes in the constructed network have a large number of routes. In the sparse interval with degree centrality below 1.2, the difference between the maximum and minimum punctuality rates of nodes is small, while in the dense interval with degree centrality above 1.2 this difference is large. A degree centrality below 1.2 means that few airports are associated with the airport and that it has fewer flights and routes. When an emergency that may cause delays occurs, a single delayed flight is unlikely to involve many other flights, so such airports are less sensitive to emergencies and the difference between the maximum and minimum values is smaller. On the contrary, airports with degree centrality above 1.2 have more flights and more routes and are more sensitive to emergencies, so the difference between the maximum and minimum values is larger.
Fig. 3(b) shows the relationship between the closeness centrality of nodes and the punctuality rate. The distribution of closeness centrality is similar to that of degree centrality: in the interval of small values the distribution is sparse and the difference between the maximum and minimum values is small, while in the interval of large values the opposite holds. The greater the degree centrality of a node, the more nodes it is directly connected to and the fewer additional nodes it needs for transmitting information, so its closeness centrality is also greater. Degree centrality and closeness centrality therefore present similar distributions.
Fig. 3(c) shows the relationship between the betweenness centrality of nodes and the on-time rate. The distribution of betweenness centrality is opposite to that of degree centrality: it is dense in the interval of small values and sparse in the interval of large values. The difference between the maximum and minimum values is larger in the interval of small values and smaller in the interval of large values. Because nodes with greater degree centrality are directly connected to more nodes, rather than indirectly related to other nodes through intermediate nodes, the two distributions are opposite. From the analysis above, nodes with larger degree centrality show a larger difference between the maximum and minimum values. Since degree centrality and betweenness centrality have opposite distributions in the constructed directed network, nodes with large degree centrality have small betweenness centrality, and therefore nodes with small betweenness centrality show the larger difference between the maximum and minimum values.
Fig. 3(d) shows the relationship between the eigenvector centrality of nodes and the punctuality rate. It can be seen from the figure that, in the network we constructed, the distribution of eigenvector centrality is close to that of degree centrality: it is dense in the interval of large values and sparse in the interval of small values. The greater the degree centrality of an airport, the more airports it is directly associated with, and the more airports it is associated with, the more likely these are to include important airports; therefore its eigenvector centrality will also be greater. The difference between the maximum and minimum values in the densely distributed region is greater than that in the sparse region. Because eigenvector centrality reflects the importance of the nodes connected to a node, an airport with greater eigenvector centrality is connected to more important airports. When sudden factors cause flight delays at such an airport, the important airports connected to it, and hence more flights, are implicated, so the airport is more sensitive to sudden factors and the difference between the maximum and minimum values is greater.

Correlation
Tab. 2 lists the on-time rate and statistical indicators of some airports, and the correlations are calculated from this table. In the table, routes is the number of routes serving the airport, d_c is the degree centrality, c_c the closeness centrality, b_c the betweenness centrality, e_c the eigenvector centrality, GDP the GDP of the airport's city, population the total population of the city at the end of the year, trade the city's total import and export of goods, and rate the punctuality rate of the airport. Each column in the table is treated as a one-dimensional vector and is first normalized; the correlation between different vectors is then calculated. Three correlation coefficients are calculated in this paper: Pearson, Spearman and Kendall. The Pearson correlation coefficient measures the degree of linear correlation. It is calculated from the variance and covariance of the original data, so it is sensitive to outliers. Even if the Pearson coefficient is 0, this only indicates that there is no linear correlation between the variables; a nonlinear (curve) correlation may still exist. The Spearman and Kendall correlation coefficients are both based on the ranks, that is, the relative sizes, of the observed values. As a more general non-parametric method, the Spearman correlation coefficient is less sensitive to outliers and thus more tolerant. These rank-based coefficients mainly measure the monotonic relationship between variables. Fig. 4 compares the three correlations between the on-time rate and the different indicators.

Figure 4: Correlations between the punctuality rate and different indicators
It can be seen from the figure that the correlations between the different indicators and the on-time rate differ not only in absolute value but also in sign. Among them, the absolute values of the correlations for population and d_c are relatively large, indicating that changes in the on-time rate may be related to these two factors. The on-time rate and population show a strong positive correlation in all three coefficients, indicating that, within a certain range, an increase in population may be accompanied by a rising trend in the flight on-time rate. Similarly, the on-time rate shows the same trend, within a certain range, as the number of routes, b_c and the city's GDP. There is a strong negative correlation between the on-time rate and c_c in all three coefficients, which indicates that, within a certain range, the on-time rate and c_c change in opposite directions. Punctuality is positively correlated with d_c, e_c and trade in the Pearson coefficient, most strongly with trade, while the Spearman and Kendall coefficients for these indicators are negative. As the Pearson coefficient reflects the degree of linear correlation, this indicates that d_c, e_c and trade have a certain linear correlation with the on-time rate, with trade showing the strongest. The Spearman and Kendall coefficients are more general and measure the rank relationship between variables. Therefore, although these three factors have a certain linear correlation with the on-time rate, the negative Spearman and Kendall coefficients indicate that they tend to change in the opposite direction to the on-time rate.
The above analysis is based on flight information, airport information and so on, but the punctuality rate of flights is also related to the route planning of the entire airspace. For example, the length of an air route and whether it is a one-way air route will influence the on-time rate of flights. Therefore, based on air route information and complex network theory, the entire air route network of China is constructed, and the relevant properties of the air route network are calculated and analyzed. The same eight airports are again selected for the correlation analysis between the relevant indicators and the flight on-time rate.
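A positive Pearson coefficient together with a negative rank coefficient can arise when a few extreme observations dominate the linear fit. The following sketch shows this on invented numbers; whether this mechanism explains the values in Tab. 2 is not established here. The `ranks` helper assumes there are no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Linear correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank 1 for the smallest value; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Hypothetical data: five points on a falling line plus one extreme point.
x = [1, 2, 3, 4, 5, 100]
y = [6, 5, 4, 3, 2, 50]

r_pearson = pearson(x, y)                  # dominated by the point (100, 50)
r_spearman = pearson(ranks(x), ranks(y))   # looks only at the ordering
print(r_pearson, r_spearman)
```

Five of the six points fall as x grows, so the rank correlation is negative, while the single far-away point pulls the Pearson coefficient close to +1.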
Tab. 3 lists the route indicators of the relevant airports and the corresponding on-time rates. shortest_l is the average shortest path length from an airport node to the other nodes in the network, and max_f is the average maximum flow between an airport node and the other nodes. In addition, r_D_c, r_C_c, r_b_c and r_e_c are the degree centrality, closeness centrality, betweenness centrality and eigenvector centrality of the node in the air route network. The on-time rate is positively correlated with the average shortest path length in the Pearson coefficient and negatively correlated in the Spearman and Kendall coefficients, indicating that there is a certain linear correlation between them but that, within a certain range, they change in opposite directions. This is because the shorter the shortest path, the shorter the flight time and the less likely a delay is to occur. The on-time rate is negatively correlated with the maximum flow in all three coefficients, because the greater the traffic flow on an air route, the more likely conflicts due to airspace congestion become, which lowers the on-time rate. The four centralities of the air route network show different correlation behavior. There is a strong positive correlation between punctuality and both degree centrality and eigenvector centrality. The Pearson coefficient of closeness centrality is negative, and its Spearman and Kendall coefficients are 0, indicating that the punctuality rate has almost no correlation with closeness centrality in the air route network. The Spearman coefficient of betweenness centrality is 0 and the absolute values of the other two coefficients are small, indicating that the positive correlation between punctuality and betweenness centrality, if any, is weak.
The analysis of the flight network mainly reflects the relationships between airports, while the analysis of the air route network reflects the impact of overall path planning. Fig. 4 shows that higher degree centrality and eigenvector centrality do not necessarily mean a higher on-time rate. Higher values of these two centralities only mean that the airport, or the airports connected to it, is connected with many other airports; serving more cities does not necessarily imply a higher on-time rate. In the route network in Fig. 5, however, the on-time rate is positively correlated with both, because the higher they are, the more paths the current node or its neighbors have, and the more paths there are, the more options are available for adjustment in case of conflict, so that flight delays can be avoided. In Fig. 4 there is a certain relationship between the on-time rate and closeness centrality and betweenness centrality, while in Fig. 5 there is little relationship between the on-time rate and these two, because a node being close to other nodes, or lying between many nodes, is not necessarily related to the number of air routes the node has or to its shortest distance to other airports.
According to the above analysis, to improve the punctuality rate, the number of available flights at an airport should be larger, and the GDP and population of the city where the airport is located should be larger. At the same time, more available routes will also improve punctuality to a certain extent. To achieve a higher punctuality rate, the average shortest path length and the maximum flow from the airport to other nodes should be made smaller.
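The two route indicators shortest_l and max_f can be sketched as follows. The text does not say how path length or flow is measured, so this example assumes hop counts for path length and uses invented segment capacities; the maximum flow is computed with the Edmonds-Karp algorithm.

```python
from collections import deque

# Hypothetical air-route segments with assumed capacities (flights per hour).
segments = {("A", "B"): 3, ("A", "C"): 2, ("B", "C"): 1, ("B", "D"): 2, ("C", "D"): 3}


def mean_shortest_length(edges, source):
    """Average hop count from source to every other reachable node."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for n, d in dist.items() if n != source]
    return sum(others) / len(others) if others else 0.0


def max_flow(capacity, source, sink):
    """Edmonds-Karp: push flow along shortest augmenting paths (source != sink)."""
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {}).setdefault(v, 0)
        residual.setdefault(v, {}).setdefault(u, 0)
        residual[u][v] += c
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Collect the path, find its bottleneck, update residual capacities.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck


def mean_max_flow(capacity, source):
    """Average maximum flow from source to every other node."""
    nodes = {n for edge in capacity for n in edge}
    others = sorted(nodes - {source})
    return sum(max_flow(capacity, source, t) for t in others) / len(others)


shortest_l = mean_shortest_length(segments, "A")
max_f = mean_max_flow(segments, "A")
print(shortest_l, max_f)
```

NetworkX provides `shortest_path_length` and `maximum_flow_value` for the same two quantities on a graph whose edges carry a capacity attribute.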

Conclusion
This paper first uses complex network theory to build the flight information obtained from the Internet into a directed graph. Each airport is a node, each route is an edge, the weight of an edge is the number of flights, and the direction of an edge is from the take-off airport to the landing airport. We calculate the distribution of the flight on-time rate, compute the different centralities of the constructed directed network, analyze the influence of these properties on the flight on-time rate, and finally calculate the correlation between the different indicators and the on-time rate. In addition, we build an air route network and analyze the influence of route parameters on the on-time rate. In the analysis, we do not consider the impact of airlines on the flight on-time rate, and many other characteristics of complex networks could still be considered as factors affecting the on-time rate. The factors influencing the flight punctuality rate based on complex networks still need to be further explored.