Generating Synthetic Trajectory Data Using GRU

With the rise of mobile networks, user location information plays an increasingly important role in mobile services. Analyzing mobile users' trajectories can support many novel services and applications, such as targeted advertising, location-based social networks, and intelligent navigation. However, privacy concerns limit the sharing of such data: releasing location data can disclose private information such as home addresses, medical records, and living habits. This has promoted the development of trajectory generators, which create synthetic trajectory data by simulating moving objects. Current generation methods have several shortcomings. Predicting the next position in a trajectory depends heavily on historical location data, yet the relationship between trajectory positions tends to be ignored. The most commonly used methods rely only on the probability distribution of users' positions to generate synthetic data. Such statistical methods are coarse, and the availability of their output does not improve as the data volume grows. To address these deficiencies, we propose a new trajectory generation method, the Trajectory Generation Model with RNNs (TGMRNN). It replaces the traditional Markov model with an RNN model to generate trajectory data with higher availability. It also solves the problem that RNNs are unsuitable for continuous location data by representing trajectories as discretized data with the grid method. We have conducted experiments on a real data set, and the results show that TGMRNN outperforms the Markov model.


Introduction
Current intelligent services are data-driven: the provision and optimization of a service rely on massive amounts of user data. When service providers use users' data, they risk violating users' privacy. With the development of communication technology, positioning technology has become widely applied. Mobile phones, computers, cars, and other devices used in daily life are all equipped with positioning technology. The Newzoo Global Mobile Market Report, published in 2020, projected that the number of global smartphone users would reach 3.5 billion in 2020, up 6.7% year-on-year, and 4.1 billion in 2023. Research on location-based services has made significant progress, and three main factors promote its rapid development. The first factor is growing awareness of the importance of human mobility research, which helps guide work on disease transmission, urban planning, and personal location services. The second factor is the rapid development of sensor devices. Our ability to collect location data has improved significantly with the popularization of sensor devices such as smartphones, wearable devices, and onboard sensors. The third factor is the emergence of premium location services. More and more location-based services and applications, such as AutoNavi, Didi, and Meituan, collect users' locations to understand their spatiotemporal mobility patterns and use location to provide users with convenience and entertainment [1]. However, some malicious applications and unreliable third parties, which are installed on these devices and provide a variety of location-based services (LBS), may violate users' location privacy by acquiring and analyzing subscribers' location information [2,3]. Lawbreakers can infer sensitive information about a user, such as home address, medical records, and religion, and may commit crimes based on it. Therefore, users urgently need a reliable and effective location privacy protection mechanism.
Trajectory generators can address these problems: they publish highly available synthetic trajectory data while guaranteeing users' location privacy [4][5][6]. A trajectory generator extracts features and models users' movement patterns to train a model. The synthetic data generated by the trained model should be highly similar to the original data and should remain usable for data mining. At present, the main research question in trajectory generation is how to improve the availability of the generated data while preserving privacy. Researchers have analyzed and mined the location information of mobile objects and produced many results. However, most of this research focuses on analyzing the historical trajectories of moving objects and mining interesting information from them, and only a few studies address trajectory generation technology.
At present, mainstream trajectory generation methods produce synthetic trajectory data from the statistical distribution of the data set, from a prefix tree, or from a Markov model. Researchers have paid more attention to protecting the privacy of the generated data than to its availability, so more research on the availability of the generated data is needed [7,8]. The Markov model used in current studies has two shortcomings. First, it is not expressive enough to process complex data, and its accuracy is relatively low. Second, although the simple low-order Markov model works better than the high-order Markov model in location prediction, a low-order Markov model cannot capture the context of a high-dimensional sequence [9].
To address these problems, we present the Trajectory Generation Model with RNNs (TGMRNN), which analyzes the historical location data of moving objects to generate synthetic trajectory data. Recurrent neural networks (RNNs) can learn and model the hidden patterns contained in the original trajectory data set. Trajectory data is sequential data with long-term temporal dependence, and in practice it must be analyzed for complex and hidden movement patterns. The positions form a sequence, and there is a contextual relation between them, so the model should be able to learn the characteristics of the positions before and after the current position. The goal of this paper is to generate a highly available data set by processing and analyzing historical tracks. RNNs store memory through feedback connections and have a built-in advantage in learning and generating data with long-term time dependence. Therefore, we choose RNNs to generate the synthetic data sets.
The contributions of this paper are summarized as follows:
(1) We transform location data into trajectory sequences with the grid method. Location information consists mainly of latitude and longitude, which are continuous. The grid method divides the geographic space into cells and assigns to each point the identifier of the cell into which it falls. This transforms continuous data into discrete data and converts the problem of continuous location generation into one of sequence generation.
(2) We adopt the GRU model to generate trajectories, so that the trajectory generator can process high-dimensional sequences and preserve the relationship between positions. In contrast to the GRU model, the Markov model is effective only for low-order (first- or second-order) sequences. We design the training and generation process using the many-to-one pattern combined with the N-gram method, which allows the length of the processed sequence to be customized.
(3) We conduct experiments on a real data set, demonstrating the effectiveness of TGMRNN compared with the mainstream model.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the Trajectory Generation Model with RNNs in detail. Section 4 presents the experimental results and performance analysis. Section 5 concludes this paper.

Related Work
This section describes related work on trajectory data generation based on Markov models and RNN models.
Mechanisms for generating trajectory data need to model sequential data. The Markov model is a standard sequence generation method: it models users' past sequence of behaviour to generate their next behaviour. Some studies use the Markov model to predict the next place for trajectory generation. Simmons et al. [10] train a Hidden Markov Model (HMM) for each user to predict which place the user will visit. Liao et al. [11] train a hierarchical Markov model on subscribers' daily trajectories. These methods train the model and predict the future location from a single object's historical trajectory, so the model cannot make a prediction when the user reaches an area that has not been visited before. To overcome the shortcomings of modelling individuals, Xue et al. [12] propose a collective-pattern-based method that divides group tracks into sub-trajectories. These sub-trajectories are combined into longer tracks, and a Markov model is then used to predict the location. However, the collective-pattern-based approach is too coarse-grained in predicting the next area; for example, a first-order Markov model predicts the same next place for all users in the same location.
Among generation methods based on privacy protection, Xue et al. [13] construct a prefix tree from individual tracks and then build a Markov model to predict users' next locations. Chen et al. [14] use a hybrid-granularity prefix tree structure, which is efficient and data-dependent yet differentially private, to generate trajectories. Chen et al. [15] proposed an approach that extracts the essential information of a sequential database as a set of variable-length n-grams with large counts in order to generate and release trajectory data. He et al. [16] used a prefix tree structure and a hierarchical reference system to model users' movement patterns at different velocities. Wang et al. [17] proposed a private trajectories calibration and publication system (PTCP), which generates synthetic trajectories by building a noise-enhanced prefix tree. Xu et al. [18] proposed DP-LTOP, which divides the original trajectory sequence into sub-segments and then selects appropriate locations and segments to form a synthetic trajectory. Wang et al. [19] proposed DP-PSP, which addresses the heterogeneity of trajectories with anchor point clustering and a road segment mechanism to improve the usability of the synthetic trajectories. Gursoy et al. [20] proposed Adatrace, which consists of four steps: feature extraction, synopsis learning, privacy- and utility-preserving noise injection, and generation of differentially private synthetic location traces. It generates synthetic trajectories with a noisy Markov model. That work also points out three threats that are important but that many studies have overlooked: the Bayesian inference threat, the partial sniffing threat, and the outlier leakage threat. Gursoy et al. [21] proposed DP-star, a work similar to Adatrace, which uses the Minimum Description Length metric to normalize the trajectory data and thereby benefit model training. Ou et al. [22] extract reliable segments from sub-trajectories, build an exploration tree, and generate synthetic trajectories. Ghane et al. [23] proposed TGM, which can generate trajectories of arbitrary length and effectively encodes trajectory data into a graphical generative model.
Recurrent Neural Networks (RNNs) have been applied successfully in Natural Language Processing and achieve outstanding performance on sequential-data tasks such as text generation, image captioning [24], and location prediction [25]. An RNN reads a sequence iteratively, processing each element in turn and updating its representation based on the input and the previous state. The connection between the hidden units and their respective projections is preserved. Gating units are often used in RNN models to control the information flow in a more structured manner. A critical factor in determining the applicability of an RNN is the size of the data set, because RNNs generalize poorly on small data volumes [26]. Based on these results and on the characteristics of the sequence generation problem, we adopt the RNN model for trajectory generation. RNNs are well suited to processing sequence data with long-term spatiotemporal dependencies, which makes them a good fit for capturing the inherent attributes of consecutive positions.

TGMRNN Overview and Core Components
This section presents the architecture of TGMRNN.

Preliminaries
RNNs are suitable for processing time-sequential data because they pass the output and state of the current moment to the next moment as input. This chain structure maintains the relationship between moments. However, RNNs suffer from vanishing and exploding gradients, which make it difficult for them to maintain long-term dependence. To solve these problems, researchers have developed many improved models based on the RNN, such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). These models capture long-term dependence by adding memory units and mitigate gradient problems with gating units. Compared with LSTM, GRU has fewer parameters and trains faster.
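For reference, one common formulation of the GRU update is given below. The paper does not state which variant it uses, so the notation is an assumption taken from the standard literature: $x_t$ is the input at step $t$, $h_t$ the hidden state, $W$, $U$, and $b$ learned parameters, $\sigma$ the logistic sigmoid, and $\odot$ the element-wise product.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Here $z_t$ is the update gate and $r_t$ the reset gate. Some references swap the roles of $z_t$ and $1 - z_t$ in the last line. The GRU has two gates and no separate cell state, whereas the LSTM has three gates and a cell state, which is why the GRU needs fewer parameters for the same hidden size.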
The proliferation of digital mobile data, such as GPS tracking, wireless communication records, and social media location records, coupled with the superior predictive power of artificial intelligence, has sparked rapid development in human mobility research. Trajectory generation technology is an essential part of this research. We aim to extract features from a data set containing mobile users' actual location trajectories, build a generation model, and maintain the statistical utility of the data. Research on human mobility covers three main aspects: next-location prediction, people-flow forecasting, and trajectory generation. Mobile communication networks, GPS, and social networks are the primary sources of location data. For example, the mobile communication network has developed into its fifth generation and, through various heterogeneous networks (such as satellite networks), covers almost the entire range of human activity. When we use mobile phones and other mobile communication devices to send and receive information, the devices interact with base stations, and during this interaction the mobile communication network acquires the users' geographic location.
Mobility data describes the movement of a group of individuals during the observation period, usually collected by sensors, and is stored as a spatiotemporal trajectory or flow of movement. An individual trajectory is a group of records, typically each containing the identification, the geographic location, and the timestamp of the location.
The format of a trajectory is defined as follows:
Definition 1 (Trajectory). Let $i$ denote a unit. A trajectory of $i$ is $T_i = [ST_1, ST_2, \ldots, ST_{n_i}]$, which is composed of the $n_i$ sequential locations visited by unit $i$. A spatiotemporal location, denoted by $ST$, contains the geographic location and the timestamp: $ST = (l, t)$. Here $l = (latitude, longitude)$, because the geographic location of a unit is usually represented by latitude and longitude, and $t$ is the timestamp of the corresponding location.
In trajectory generation, we need to discretize the continuous two-dimensional (latitude and longitude) spatial data with an embedding technique. The embedding divides the two-dimensional space into a finite number of independent regions and then maps location points to labels. The grid method is defined as follows:
Definition 2 (Grid method). For the geographic region $A$, a grid method $G$ contains $n$ independent regions, represented as $G = \sum_{i=1}^{n} g_i$, and it satisfies $\forall i \in \{1, \ldots, n\},\ g_i \neq \emptyset$ and $\forall i, j \in \{1, \ldots, n\},\ i \neq j \Rightarrow g_i \cap g_j = \emptyset$. Each region $g_i$ has its own label, so $G$ maps geographic locations to labels.
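As a minimal sketch of Definition 2, the function below maps a point to a cell label on a uniform rectangular grid. The uniform grid, the bounding-box parameters, the row-major integer labels, and the name `make_grid_mapper` are assumptions made for illustration; the paper does not specify how its cells are laid out or labelled.

```python
def make_grid_mapper(lat_min, lat_max, lon_min, lon_max, rows, cols):
    """Return a function mapping (lat, lon) to an integer label in [0, rows * cols).

    The bounding box plays the role of region A; each of the rows * cols
    rectangles is one region g_i. The regions are non-empty and disjoint.
    """
    if not (lat_max > lat_min and lon_max > lon_min):
        raise ValueError("bounding box must have positive extent")
    cell_h = (lat_max - lat_min) / rows
    cell_w = (lon_max - lon_min) / cols

    def to_cell(lat, lon):
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
            return None  # the point lies outside region A
        # min(...) keeps points on the upper boundary inside the last row/column
        r = min(int((lat - lat_min) / cell_h), rows - 1)
        c = min(int((lon - lon_min) / cell_w), cols - 1)
        return r * cols + c  # row-major label (an illustrative choice)

    return to_cell


# Usage on a toy 4 x 4 grid over the box [0, 4] x [0, 4]
to_cell = make_grid_mapper(0.0, 4.0, 0.0, 4.0, rows=4, cols=4)
print(to_cell(0.5, 3.5))  # 3
print(to_cell(4.0, 4.0))  # 15
```

Points outside the bounding box return `None` here so that the caller can decide whether to drop them or clip them to the boundary.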
Our goal is to extract and learn the mobility patterns of users to generate synthetic data.

TGMRNN Model
Consider a data set of actual location trajectories, denoted by $D_{raw}$. We want to design a generative model that learns and models the original location data and generates a synthetic trajectory data set, denoted by $D_{syn}$. The synthetic trajectories in $D_{syn}$ need to be highly similar to those in $D_{raw}$.
We design and develop TGMRNN, a trajectory dataset generator. Fig. 1 illustrates the system architecture of TGMRNN.

Data Pre-Processing
TGMRNN preprocesses the input data set $D_{raw}$. We expect the trajectory data to be collected from users at the same sampling rate and with the same vehicle, and the track locations to be distributed within a specified geographic area. Although publicly available trajectory data sets have no unified format, they generally contain user identification, time, and geographic location, which is sufficient for training. Therefore, before training the model, we preprocess $D_{raw}$ and transform it into our expected format, the standard format. We use the grid method to transform the data, and the process is as follows:
A. Transform the input data into location sequences. Let $P_i$ denote a record in the data set $D_{raw}$, with $D_{raw} = [P_1, P_2, \ldots, P_n]$. Each record $P_i$ contains a geographic location and a time, and some records also include additional fields. We require that a user's locations be collected at the same sampling rate, so the time interval between adjacent locations in a trajectory is constant. We divide $D_{raw}$ into multiple trajectories: if the time interval between two adjacent locations is greater than the interval corresponding to the sampling rate, the records before and after that point are split into two trajectories. In this way $D_{raw}$ is transformed into $D'_{raw}$, which consists of multiple trajectories and is represented as $D'_{raw} = [T_1, T_2, \ldots, T_n]$. Each trajectory $T_i$ is a sequence of locations, $T_i = [(lat_1, lon_1), (lat_2, lon_2), \ldots, (lat_{n_i}, lon_{n_i})]$, where $(lat_j, lon_j)$ is the latitude and longitude of the $j$-th location. $D'_{raw}$ thus preserves the locations of the moving object and the context between the locations.
B. Map the geographic location to the grid cell identity. TGMRNN discretizes the continuous two-dimensional (latitude and longitude) spatial data with the grid method. As shown in Fig. 2, for example, there is a trajectory $T_1$ in the area with $T_1 = [L_1, L_2, L_3, L_4, L_5, L_6]$, where $L_i$ stands for a location of the trajectory. TGMRNN maps each location of the trajectory to the identifier of the grid cell to which it belongs, transforming the location sequence $T_1$ into the cell identification sequence $S_1 = [C_1, C_4, C_5, C_8, C_8, C_9]$, where $C_i$ is short for Cell $i$ in Fig. 2. TGMRNN transforms $D'_{raw}$ into $D_{trajectory}$ by mapping each trajectory $T_i$ to its cell sequence $S_i$ one by one, and we call the format of $D_{trajectory}$ the standard format.
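Steps A and B above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the record layout `(timestamp_seconds, lat, lon)`, the function name `split_into_trajectories`, and the `to_cell` callable that maps a coordinate to a cell label are all hypothetical.

```python
def split_into_trajectories(records, sampling_interval, to_cell):
    """Split one user's records into cell-label sequences (steps A and B).

    records: list of (timestamp_seconds, lat, lon), sorted by timestamp.
    sampling_interval: expected gap in seconds between adjacent samples.
    to_cell: callable (lat, lon) -> cell label (assumed helper).
    """
    trajectories = []
    current = []
    prev_t = None
    for t, lat, lon in records:
        # Step A: a gap larger than the sampling interval starts a new trajectory.
        if prev_t is not None and t - prev_t > sampling_interval:
            trajectories.append(current)
            current = []
        # Step B: replace the coordinate with the label of its grid cell.
        current.append(to_cell(lat, lon))
        prev_t = t
    if current:
        trajectories.append(current)
    return trajectories


# Usage with a toy mapper on unit cells in a grid that is 4 cells wide
toy_to_cell = lambda lat, lon: int(lat) * 4 + int(lon)
records = [(0, 0.5, 0.5), (60, 0.5, 1.5), (120, 1.5, 1.5),
           (600, 2.5, 2.5), (660, 2.5, 3.5)]
print(split_into_trajectories(records, 60, toy_to_cell))  # [[0, 1, 5], [10, 11]]
```

The 480-second gap between the third and fourth records exceeds the 60-second sampling interval, so the records are split into two trajectories at that point.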

Training
As shown in Fig. 3, we use the RNN model, the core of the system, to learn the sequence data. TGMRNN configures the network model to learn the sequences and extends it to the space-time domain for trajectory generation. This paper adopts the GRU model. Like LSTM, which remembers past information and selectively forgets unimportant information to model long-term context and other relations, GRU reduces the vanishing gradient problem while retaining long-term sequence information. TGMRNN divides the trajectory sequence data set into sample sequences. For each input sequence, the corresponding output has the same length but is shifted one position to the right. For example, if the sample sequence is $[C_1, C_2, C_3, C_4, C_5]$, the input is $[C_1, C_2, C_3, C_4]$ and the output is $[C_2, C_3, C_4, C_5]$. The problem can then be treated as a standard classification problem: given the previous RNN state and the input at the current time step, predict the class of the next sequence unit. We then add the optimizer and loss function, set the number of epochs, and train the model. Fig. 4 illustrates the basic recurrent network generation architecture. The process of sequence generation using the trained RNN model is as follows:

Generation Algorithm
(1) Set the start position, initialize the RNN state, and set the length of the trajectory to be generated. The start position and RNN state are used to obtain the predicted distribution of the next position.
(2) The predicted categorical distribution is used to determine the index of the next location, which is then used as the next input to the model.
(3) The returned state is fed back to the model, so the model has more context to draw on than a single location. After each location is predicted, the updated RNN state is fed back into the model. The model thus accumulates context from the previously predicted places and keeps generating new locations until the termination condition is triggered.
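The shift-by-one training pairs from the Training section and steps (1) to (3) above can be sketched as follows. The trained GRU is replaced by a stand-in callable `step(state, cell)` that returns `(probabilities, new_state)`; this interface, the function names, the fixed-length termination, and the choice to sample from the distribution (rather than take its argmax) are assumptions for illustration, since the paper does not specify them.

```python
import random


def make_training_pair(sample):
    """Return (input, target): the target is the input shifted by one position."""
    if len(sample) < 2:
        raise ValueError("a sample needs at least two cells")
    return sample[:-1], sample[1:]


def generate(step, start_cell, length, initial_state=None, rng=None):
    """Generate a cell sequence of the requested length.

    step: callable (state, cell) -> (probs, new_state), where probs is a list
          of probabilities indexed by cell label. It stands in for the trained RNN.
    """
    rng = rng or random.Random()
    sequence = [start_cell]          # step (1): start position and initial state
    state = initial_state
    cell = start_cell
    while len(sequence) < length:    # termination: requested length reached
        probs, state = step(state, cell)   # step (3): the new state is fed back
        # step (2): draw the next cell index from the predicted distribution
        cell = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        sequence.append(cell)
    return sequence


# Usage with a toy stand-in model that always moves to the next of 4 cells
def toy_step(state, cell):
    probs = [0.0] * 4
    probs[(cell + 1) % 4] = 1.0
    return probs, (state or 0) + 1

print(make_training_pair(["C1", "C2", "C3", "C4", "C5"]))
print(generate(toy_step, start_cell=0, length=6))  # [0, 1, 2, 3, 0, 1]
```

Passing the state back into `step` on every iteration is what lets a recurrent model condition on the whole generated prefix instead of only the last cell.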

Experiment
In this section, we report experiments conducted to assess the effectiveness of our generation algorithm.

Settings and Data Set
We ran the experiments on Google Colaboratory, which provided a GPU for TGMRNN. We built TGMRNN with PyTorch 1.8.1 and Python 3.7.10 and compared it with the Markov model. We chose the T-Drive trajectory data sample [27] from the Microsoft T-Drive project. It contains the trajectories of more than 10,000 taxis in Beijing during one week in 2008, comprising 15 million coordinate points and a total track distance of more than 9 million kilometres. This is enough data to set up experiments of different sizes for a comprehensive comparison. We preprocessed the data set to form tracks. After eliminating abnormal data such as missing positions, single-point tracks, and erroneous records, more than 1.8 million tracks with a length greater than one remained. We compared the first-order Markov model with the TGMRNN model because research shows that low-order and high-order Markov models perform similarly. To keep the training time reasonable, we trained the model for 100 epochs.
We varied the number of mobile users and the grid partition size as controlled experiments. We trained with 10, 100, and 1000 moving objects, corresponding to 3215, 20306, and 172419 tracks, respectively.
We trained the Markov model and the TGMRNN model on the same data, generated with each model the same amount of trajectory data as the training data, and then carried out statistical analysis.
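For orientation, a first-order Markov baseline over cell sequences can be sketched as below. The paper does not describe how its Markov baseline was implemented, so the transition-count estimate, the function names, and the rule of stopping when a cell has no observed successor are assumptions.

```python
import random
from collections import Counter, defaultdict


def fit_first_order_markov(sequences):
    """Count transitions cell -> next cell over all training sequences."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions


def sample_markov(transitions, start_cell, length, rng=None):
    """Sample a sequence; the next cell depends only on the current cell."""
    rng = rng or random.Random()
    sequence = [start_cell]
    while len(sequence) < length:
        counts = transitions.get(sequence[-1])
        if not counts:
            break  # no outgoing transition was observed for this cell
        cells = list(counts)
        weights = [counts[c] for c in cells]
        sequence.append(rng.choices(cells, weights=weights, k=1)[0])
    return sequence


# Usage on two toy cell sequences
model = fit_first_order_markov([[1, 2, 3], [1, 2, 4]])
print(dict(model[2]))               # {3: 1, 4: 1}
print(sample_markov(model, 1, 2))   # [1, 2]
```

Because the next cell depends only on the current one, every trajectory that reaches cell 2 faces the same distribution over cells 3 and 4, regardless of how it arrived there.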

Evaluation Metrics
We adopt the Kullback-Leibler divergence (KL-divergence) to evaluate how closely the generated trajectory data set matches the real user trajectory data set. KL-divergence, also called relative entropy, equals the cross-entropy between two probability distributions minus the Shannon entropy of the first. It can therefore be used to evaluate the degree of similarity between data sets.
Suppose $P(x)$ and $Q(x)$ are two probability distributions of the random variable $x$. For discrete random variables, the relative entropy is defined as
$$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
The KL-divergence is nonnegative: $KL(P \| Q) \geq 0$, with equality when $P = Q$. The closer the KL-divergence is to zero, the closer the location distribution of the generated trajectory data set is to that of the entire data set.
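A minimal sketch of this metric over cell-visit distributions is shown below. The function names, the use of the natural logarithm, and the smoothing constant `eps` are assumptions; the paper does not state its logarithm base, how it handles cells with zero probability, or which data set plays the role of $P$.

```python
import math
from collections import Counter


def cell_distribution(sequences, num_cells):
    """Empirical visit frequency of each cell label 0..num_cells-1."""
    counts = Counter(c for seq in sequences for c in seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no cells to count")
    return [counts.get(i, 0) / total for i in range(num_cells)]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), natural logarithm.

    Terms with P(x) = 0 contribute zero. eps is an assumed floor that avoids
    division by zero when Q gives zero probability to a cell visited under P.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Usage on toy data with 3 cells
real = cell_distribution([[0, 1], [1, 2]], num_cells=3)       # [0.25, 0.5, 0.25]
synthetic = cell_distribution([[0, 1], [1, 1]], num_cells=3)  # [0.25, 0.75, 0.0]
print(kl_divergence(real, real))       # 0.0
print(kl_divergence(real, synthetic))  # large, because cell 2 is never generated
```

The second value is large because the synthetic data never visits cell 2; in practice a smoothing scheme for such cells strongly affects the reported number, so it should be fixed before comparing models.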

Evaluation of TGMRNN
We evaluate the performance of the TGMRNN and Markov models.
We first evaluate how the grid size influences TGMRNN and the Markov model, using grid sizes of 25 × 25, 50 × 50, 75 × 75, and 100 × 100. The experimental results are shown in Tabs. 1 and 2. As the grid size increases from 25 × 25 to 100 × 100, the map is divided more finely and the synthetic trajectory data generated by TGMRNN becomes more effective: the KL-divergence declines from 0.223 to 0.204 with 100 objects and from 0.173 to 0.133 with 1000 objects. The result for 1000 objects is clearly better than the result for 100 objects at the same grid size, and the best value occurs with 1000 objects and a 75 × 75 grid. In contrast, changes in the grid size have little impact on the Markov results, where the KL-divergence stabilizes around 0.233 with 100 objects and 0.324 with 1000 objects. As shown in Fig. 5, changes in the grid size do not make a significant difference to the availability of the data generated by the Markov model.
We also evaluate the effect of the number of objects on trajectory generation, using 10, 100, and 1000 objects. The results for 10 objects are not representative because the number of objects is too small. From 100 objects to 1000 objects, the availability of the data generated by the Markov model declines significantly, with the average KL-divergence rising from 0.235 to 0.324, which indicates that the Markov model struggles to model the trajectory data as the amount of data grows. In contrast, the availability of the data generated by TGMRNN rises significantly from 100 objects to 1000 objects. For both 100 and 1000 objects, TGMRNN achieves better data availability than the Markov model at the same grid size and the same number of users.

Conclusion
We propose TGMRNN to generate synthetic trajectory data. It can produce trajectory data for mobile service providers, who need large amounts of user data to improve the quality of their products, while preventing the disclosure of users' private information. TGMRNN transforms continuous location data into discrete sequence data and preserves the relationship between the locations of high-dimensional sequences. TGMRNN trains a GRU model, which has fewer parameters and trains faster, using the many-to-one pattern, and uses the trained model to generate synthetic trajectories by predicting the next location. Experiments show that TGMRNN is effective.
Trajectory generation remains an active research field, and many problems are still to be solved: (1) geographic data sparsity should be addressed to improve the availability of the trained model; (2) the efficiency and effectiveness of processing high-dimensional sequences need to be improved; (3) more measures of the availability of trajectory data are needed; (4) stricter privacy mechanisms, such as differential privacy, should be introduced into trajectory generation; (5) privacy threats such as the Bayesian inference threat, the partial sniffing threat, and the outlier leakage threat should be eliminated.