Communication opportunities among vehicles are important for data transmission over the Internet of Vehicles (IoV). Mixture models are appropriate to describe complex spatial-temporal data. By calculating the expectation of hidden variables in vehicle communication, Expectation Maximization (EM) algorithm solves the maximum likelihood estimation of parameters, and then obtains the mixture model of vehicle communication opportunities. However, the EM algorithm requires multiple iterations and each iteration needs to process all the data. Thus its computational complexity is high. A parameter estimation algorithm with low computational complexity based on Bin Count (BC) and Differential Evolution (DE) (PEBCDE) is proposed. It overcomes the disadvantages of the EM algorithm in solving mixture models for big data. In order to reduce the computational complexity of the mixture models in the IoV, massive data are divided into relatively few time intervals and then counted. According to these few counted values, the parameters of the mixture model are obtained by using DE algorithm. Through modeling and analysis of simulation data and instance data, the PEBCDE algorithm is verified and discussed from two aspects, i.e., accuracy and efficiency. The numerical solution of the probability distribution parameters is obtained, which further provides a more detailed statistical model for the distribution of the opportunity interval of the IoV.

The concept of the Internet of Vehicles (IoV) is closely related to the Internet of Things (IoT) [

Communication opportunity interval is affected by many factors, including but not limited to vehicle movement model and location distribution, vehicle density, broad geographic distribution, communication accessibility, etc. The previous scholars mainly considered establishing the vehicle movement model and mine the statistical model of communication opportunity interval through theory or simulation. In a Random Waypoint Mobility (RWM) model [

With the development of big data technology, more scholars tended to collect the actual spatial-temporal data sets of vehicle movement, and then used big data mining techniques to determine the statistical model of the data [

For the mixture statistical distribution of spatial-temporal big data, a parameter estimation algorithm with low computational complexity based on Bin Count (BC) and Differential Evolution (DE) (PEBCDE) is proposed. BC is used to count all data in a small number of intervals, and a small number of calculated values are used as statistics for the next step of parameter estimation. DE is used to solve nonlinear optimization problems similar to the Expectation Maximization (EM) algorithm [

The rest of this paper is arranged as follows. Section 2 describes the related work. Sections 3 and 4 introduce the system model and the algorithm. Analysis and simulation results will be described and displayed in Section 5. Section 6 gives the conclusions of this paper and an outlook for future work.

For the estimation of a high-dimensional latent variable model, a novel semi-stochastic variance-reduced gradient designed for the Q-function in the EM algorithm was proposed [

Reference [

The actual spatial-temporal big data often have missing or hidden variables, the traditional methods are difficult to accurately estimate the parameters, and the data may come from different populations. The single distribution model cannot accurately reflect the nature of the data, and it is difficult to mine the effective information of the data. A more reasonable approach is to explore the mixture distribution model of data [

At present, the EM algorithm is usually used for the parameter estimation of mixture statistical distribution modeling [

In this paper, BC and DE are introduced to reduce the computational complexity of mixture distribution parameter estimation. Bin counting and DE based on a small amount of data are used to replace the EM algorithm with a large number of data to realize the optimization of the opportunity interval distribution model of the IoVs.

In this paper, based on the longitude and latitude trajectory data of Beijing taxis in May 2010, the algorithm of data cleaning and extraction of inter-contact times interval is presented [_{i}

where, _{i}

The EM algorithm is widely used in parameter estimation of mixture models. The likelihood function can be simplified by introducing suitable potential data. It first assumes that the known observed data of the total sample number _{i}_{i}_{i}_{ip}

(1) Set the initial value

(2) Step E: Calculate the conditional expectation of the likelihood function based on the complete data set of the hidden variable

where

(3) Step M: Evaluate

The operations of Step E and Step M above are iterated alternately until

The basic flow of a parameter estimation algorithm based on BC and DE proposed in this paper is as follows: (1) First, the total sample data is segmented according to a certain interval, and the number of sampled data in each segment is counted. The statistical probability of each segment is calculated by the mixture distribution probability density function (including parameters), and the probability of event occurrence is calculated by multinomial distribution; (2) the event probability calculated in the previous step is taken as the objective function, and the parameter value at the maximum of the objective function is iterated through DE algorithm.

Suppose the probability density function of the total data set is _{1}, the number of data values in interval _{2}, and so on, the number of data values in interval _{m}

If special cases such as the existence of impulse function in probability distribution are not considered, then the probability of data points in the interval

The probability that _{i}

According to the multinomial distribution, the occurrence probability

The time complexity of this step is

DE algorithm is a heuristic parallel search method based on group difference. In this step, DE algorithm is used to solve the estimated parameters so that

In the

where

Crossover operation increases the diversity of population as follows,

where

After completing the mutation and crossover operation,

The flow of DE algorithm is shown in

The size of the statistic values that DE algorithm needs to process at each step is

This section analyzes and verifies the effectiveness of the proposed algorithm in the system model. Firstly, the correctness of the algorithm is verified and the accuracy of the algorithm is analyzed. Correctness is verified by comparing the similarity between the given model data and the evaluation model data. The accuracy of the algorithm is observed by modeling and analyzing the ideal data. Based on ensuring the effectiveness and availability of the algorithm, we further apply the algorithm to the actual data set to analyze and verify the actual effect of the PEBCDE algorithm on the movement trajectory data of Beijing taxis. Then, we compare the operation efficiency of the traditional EM algorithm and the proposed PEBCDE algorithm under different parameters, and summarize the performance of the algorithm as a whole.

The simulation experiment is implemented in Python (version 3.7.3) and on an AMD Ryzen^{TM}7 3700X processor with 3.60 GHz CPU 16 GB RAM running on the platform Microsoft Window 10.

In this subsection, the simulated data obey a Gaussian mixture distribution with

Name | Population | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

Value | 500 | 15 | 0.3 | 2.0 | 0.8 | 4.0 | 0.6 | 0.5 | 0.75 |

The parameters of the statistical model obtained by the PEBCDE algorithm are

The specific iteration of some parameters is shown in

In the initial stage of data processing, the PEBCDE algorithm needs to segment all the data. The difference in the number of bins leads to different degrees of information loss of part of the raw data, resulting in the different effects on the calculation accuracy. However, the process can also be speeded up, allowing users to adjust the accuracy of calculations to suit their specific needs. Kolmogorov Smirnov (KS) test is used to evaluate the accuracy of the model obtained by the algorithm. The results are shown in

Based on the latitude and longitude trajectory data of Beijing taxis in May 2010, this subsection conducted statistical modeling for the time interval of the IoV communication opportunities. Taking as an example the trajectories of, 100 taxis which were randomly selected from all the data sets, we analyzed the statistical model in the case of the number of vehicles mentioned above.

Through multiple tests and analysis, we believe that the communication opportunity interval between vehicles obeys the mixture of lognormal distribution. The statistical model of different vehicle densities can be fitted with the Gaussian mixture distribution of different parameters.

If the time span of a single iteration is used as the unit, the EM algorithm had the fastest initial speed. After more than 10 iterations, the EM algorithm basically converged, and a relatively ideal result was obtained. By comparison of the results of the PEBCDE algorithm with 8 and 32 bins, it holds that the greater the number of bins is, the faster the PEBCDE algorithm was converges. Then, similar to the EM algorithm, it fell into the near optimal solution after a number of iterations. Here after about 20 iterations, the near optimal solution was obtained.

The EM algorithm requires all data in the data set to participate in the calculation. When the amount of data is very large, the calculation speed is slow and it is difficult to deal with massive big data scene. This subsection analyzes the operation efficiency of the EM algorithm and the proposed PEBCDE algorithm.

When the EM algorithm is used for parameter estimation of Gaussian mixture model, the responsiveness of sub-model _{j}_{i}_{i}

Firstly, let us set the size of the data volume to 1000 and 100000, respectively, to obtain the running times of the EM algorithm and the PEBCDE algorithm. As shown in

In this paper, the PEBCDE algorithm based on BC and DE is proposed to overcome the disadvantages of the EM algorithm in parameter estimation of big data mixture model. Through modeling and analysis of simulation data and instance data, the PEBCDE algorithm is verified from two aspects of accuracy and efficiency. The research shows that the computational complexity of the PEBCDE algorithm is significantly reduced. As the increase of data volume, every iteration round of the EM algorithm needs all the data, while the PEBCDE algorithm only needs to traverse all the data once. Therefore, when the amount of data is large, the running time of the PEBCDE algorithm is far less than that of the EM algorithm.

It is worth noting that the PEBCDE algorithm sacrifices precision to some extent compared with the EM algorithm. This is because differences between data within the same interval are ignored after interval calculation. Therefore, the first few moments of the data can be obtained by improving the PEBCDE algorithm combining moment estimation in the first step, and then applied to DE algorithm in the second stage, to further improve the accuracy of this algorithm.

We gratefully acknowledge anonymous reviewers who read drafts and made many helpful suggestions.