In this study, we developed software for analyzing vehicle big data, specifically the time-series data of connected vehicles. We designed two software modules: the first derives Pearson correlation coefficients from the collected data, and the second conducts exploratory data analysis of the collected vehicle data. In particular, we analyzed the dangerous driving patterns of motorists based on the safety standards of the Korea Transportation Safety Authority. We also analyzed seasonal fuel efficiency (four seasons) and mileage of vehicles, and identified rapid acceleration, rapid deceleration, sudden stopping (harsh braking), quick starting, sudden left turn, sudden right turn, and sudden U-turn driving patterns of vehicles. We implemented the density-based spatial clustering of applications with noise (DBSCAN) algorithm for trajectory analysis based on GPS (Global Positioning System) data and designed a long short-term memory (LSTM) algorithm and an auto-regressive integrated moving average (ARIMA) model for time-series data analysis. In this paper, we mainly describe the development environment of the analysis software, the structure and data flow of the overall analysis platform, the configuration of the collected vehicle data, and the various algorithms used in the analysis. Finally, we present illustrative results of our analysis, such as the dangerous driving patterns that were detected.
Currently, massive amounts of data are collected and studied in various domains, including engineering, computer science, commerce, security, chemistry, and bio-molecular science. New datasets are constantly being generated and collected in various fields, and the amount of data is growing explosively. For example, global companies such as Google, Facebook, and Alibaba process tens of terabytes to hundreds of petabytes of data per day [
In this section, we explore related research on vehicle big data analysis and the development of analysis software for vehicle big data. To date, little research has been reported on the analysis of OBD and GPS data of vehicles. Reference [
We implemented the DBSCAN (Density-based Spatial Clustering of Applications with Noise) algorithm, the ARIMA (Auto-regressive Integrated Moving Average) model, and an LSTM (Long Short-Term Memory) algorithm to analyze time-series data. DBSCAN is a density-based data clustering algorithm. Because it groups points according to their density, it can readily find clusters with geometric shapes [
In this section, we describe the overall development of the data analysis software. The structure of the entire system is illustrated in
The system consists of data collection, analysis, and visualization software blocks. The data flow from data collection to analysis software as shown in
The analysis software was developed using Python 3.5 and is designed to be compatible with Python 2.7. We used various libraries to develop the analysis software. The Pandas library was used to analyze large amounts of data efficiently; the Matplotlib and Plotly libraries were used to create graphs of the analysis results. In addition, we applied artificial intelligence techniques to our analysis software based on the SciPy, TensorFlow, NumPy, and Scikit-learn libraries as shown in
Library | Version | Library | Version | Library | Version |
---|---|---|---|---|---|
Pandas | 0.23.4 | PyMongo | 3.8.1 | Openpyxl | 2.6.2 |
SciPy | 1.2.0 | Matplotlib | 2.1.1 | Seaborn | 0.9.0 |
Scikit-learn | 0.20.2 | Plotly | 3.10.0 | XlsxWriter | 1.1.8 |
NumPy | 1.16.4 | Folium | 0.8.3 | TensorFlow | 1.13.1 |
In this study, three database platforms, MongoDB, InfluxDB, and OpenTSDB, were used for data collection and for storing the processed data. The raw data are collected and accumulated in each database, and the data processed by the analysis software are stored in a new table or metric (or measurement). InfluxDB is an in-memory database that enables very fast data processing, whereas OpenTSDB can efficiently store and process time-series data [
In this study, we performed exploratory data analysis (EDA) and data preprocessing before analyzing the data. The EDA and preprocessing tasks included removing not-a-number (NaN) values and outliers and creating new tags. In addition, because the collected data comprised only GPS coordinates without azimuth data, we calculated the azimuth from the GPS coordinates of two points [
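The azimuth calculation between two GPS points can be sketched as follows. This is a minimal illustration that assumes the standard forward-azimuth (initial bearing) formula on a sphere; the function name `azimuth_deg` and the sample coordinates are hypothetical, and the exact formula used in the software may differ.

```python
import math


def azimuth_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north.

    Inputs are latitude and longitude in decimal degrees.
    Assumption: spherical Earth, standard forward-azimuth formula.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # East-west and north-south components of the direction vector
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    # atan2 returns a value in (-180, 180]; map it to [0, 360)
    return math.degrees(math.atan2(x, y)) % 360.0


# Hypothetical consecutive GPS fixes (latitude, longitude)
print(azimuth_deg(37.5665, 126.9780, 37.5670, 126.9790))
```

Applying such a function to each pair of consecutive GPS rows would yield an azimuth column, from which changes in heading (for example, for turn detection) could be derived.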
We derived Pearson’s correlation coefficient before analyzing the data. The software module for obtaining the correlation coefficient was developed based on the following formula [
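$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where Cov(X, Y) is the covariance of $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively.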
The procedure for analyzing the correlation coefficient with other parameters based on the driving speed is shown in
The structure and procedure of our analysis software are illustrated in
In particular, we implemented the DBSCAN algorithm, ARIMA model, and LSTM algorithm for time-series data analysis. As mentioned earlier, DBSCAN was used to analyze the trajectory of the vehicle in this study. We used the ARIMA model and LSTM to predict the battery voltage, coolant temperature, and engine oil temperature of the vehicle. However, not all of the implemented algorithms are currently used in the analysis engine; research on how to select the most suitable algorithm for the collected data is in progress.
In this section, we describe representative results of our analysis, including the results of dangerous driving patterns.
The main distribution of the collected data was as follows: The data collection period was from January 2017 to February 2020; the data were collected from three companies. Currently, the size of the dataset is approximately 73.1 GB with 793,138,471 rows (as of February 2020). The types and amounts of data parameters collected by each company are different. The monthly data distribution is shown in
Parameter | Parameter | Parameter | Parameter |
---|---|---|---|
CAR_CD | DTG SIGNAL | CLV (Cylinder Load Value) | ATQ (Actual Engine-Percent Torque) |
CREATE_TIME | DTG_STATE | COT (Coolant Temperature) | EGR (Cmd EGR and EGR err) |
DEPOT_ID | DEVICE_STATE | IAT (Intake Air Temperature) | FTQ (Engine Friction-Percent Torque) |
RSV_NO | RUN_SPEED | OAT (Outside Air Temperature) | CTB (Catalyst Temperature Bank 1) |
STL_CNT | MAX_SPEED | MAP (Manifold Air Pressure) | EST (Time Since Engine Start) |
GPS_SPEED | AVG_SPEED | MAF (Manifold Air Flow) | EFR (Engine Fuel Rate) |
LONGITUDE | ODO (Odometer) | IFC (Instantaneous Fuel Consumption) | AAT (Ambient Air Temperature) |
LATITUDE | ALTITUDE | EVC (Electric Vehicle Charge) | ERT (Engine Reference Torque) |
GPS_TIME | RPM | EVM (Electric Vehicle Charging Mode) | MDT (Monitor Status since DTC cleared) |
BATT_POWER | BAP (Barometric Pressure) | TIS (Time of Idle Status) | DMA (Distance Traveled While MIL is Activated) |
DRV_DISTANCE | AFR (Actual Fuel economy) | TPS (Throttle Position Sensor) | DMC (Distance Traveled since DTC cleared) |
DRV_DISTANCE_DAY | APS (Accelerator Pedal position Sensor) | O2S (Bank1-Sensor1(wide range O2S)-Lambda) | CMV (Control Module Voltage) |
Although data columns related to electric vehicles exist, no data were collected for them. Similarly, no data were collected for certain other fields. In this study, only the valid portion of the collected data was analyzed. Preprocessing tasks, such as removing outliers and calculating the GPS-based azimuth, were performed before the analysis.
We plotted histograms of the collected data to determine their distribution and the composition of the field values as shown in
We analyzed the correlations using our software module, which is based on Pearson’s correlation coefficient formula.
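A minimal sketch of such a correlation computation is given below. It implements Pearson's coefficient (covariance divided by the product of the standard deviations) directly in plain Python; the function name `pearson_r` and the sample speed and RPM values are hypothetical and are not taken from the collected data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples: driving speed (km/h) and engine speed (RPM)
speed = [0, 12, 25, 40, 58, 63]
rpm = [750, 1200, 1600, 1900, 2300, 2400]
print(pearson_r(speed, rpm))
```

With the Pandas library listed above, the same quantity for all column pairs is available through `DataFrame.corr()`, whose default method is Pearson.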
In this section, we present representative results of the various dangerous driving patterns we analyzed.
The dangerous driving behavior standard is that of the Korea Transportation Safety Authority [
We also applied the analysis module, which we had developed earlier using the data received from Company C, to the vehicle data of a second company (Company B) in order to identify dangerous driving patterns in the data of Company B.
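The kind of rule-based detection involved can be sketched as follows for rapid acceleration and rapid deceleration. This is an illustrative sketch only: it assumes speed samples recorded once per second, and the threshold values in the example are placeholders chosen for illustration, not the values defined by the Korea Transportation Safety Authority. The function name `flag_speed_events` is hypothetical.

```python
def flag_speed_events(speeds_kmh, accel_kmh_per_s, decel_kmh_per_s):
    """Return sample indices at which rapid acceleration / deceleration occurs.

    speeds_kmh: speed samples in km/h, assumed to be recorded once per second.
    accel_kmh_per_s, decel_kmh_per_s: positive thresholds on the change in
    speed between consecutive samples (placeholders, not official values).
    """
    accel_idx, decel_idx = [], []
    for i in range(1, len(speeds_kmh)):
        delta = speeds_kmh[i] - speeds_kmh[i - 1]
        if delta >= accel_kmh_per_s:
            accel_idx.append(i)
        elif delta <= -decel_kmh_per_s:
            decel_idx.append(i)
    return accel_idx, decel_idx


# Hypothetical one-second speed trace and placeholder thresholds
trace = [20, 22, 35, 36, 20, 19]
accel, decel = flag_speed_events(trace, accel_kmh_per_s=10, decel_kmh_per_s=10)
print(len(accel), len(decel))
```

Because the thresholds are passed as parameters, the same routine could be reused for the data of a different company by supplying the applicable standard values.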
We analyzed all the data for January 2020 and the results are shown in
The results of the data analysis of Company B are summarized as follows. As in the previous analysis (Company C), rapid deceleration generally occurred more frequently than rapid acceleration. The difference from the previous data analysis is that the number of rapid acceleration and deceleration patterns was significantly higher for vehicle 11. This is expected because the data were collected from logistics vehicles. In other words, a logistics vehicle operated by an experienced driver would not be expected to make quick starts or sudden stops as frequently as a general passenger car driven by a layman. In addition, the drivers of logistics vehicles would be unlikely to accelerate repeatedly because they drive with the fuel efficiency of the vehicle in mind. The results of the sudden left or right turn analysis are shown in graph5 in
We also analyzed the trajectory of a vehicle using GPS data and DBSCAN. For each point in a cluster, the algorithm requires the existence of at least a minimum number of points (MinPts) in an Eps-neighborhood of that point. The formula is defined as follows [
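$$N_{Eps}(p) = \{\, q \in D \mid \mathrm{dist}(p, q) \le Eps \,\}$$

Here, the Eps-neighborhood of a point $p$ is denoted by $N_{Eps}(p)$, $D$ is the set of all points, and $\mathrm{dist}(p, q)$ is the distance between points $p$ and $q$.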
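The density criterion described above can be illustrated with a compact, pure-Python DBSCAN over GPS points. This is a sketch under stated assumptions rather than the implementation used in the software: it uses the haversine great-circle distance in metres as the distance function, a brute-force neighborhood search, and hypothetical coordinates and parameter values.

```python
import math

NOISE = -1


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def region_query(points, i, eps):
    """Indices of all points in the Eps-neighborhood of points[i] (including i)."""
    return [j for j, q in enumerate(points) if haversine_m(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: cluster id 0, 1, ... or NOISE (-1)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical GPS fixes: two dense groups and one isolated point
gps = [(37.5665, 126.9780), (37.5666, 126.9780), (37.5665, 126.9781),
       (35.1796, 129.0756), (35.1797, 129.0756), (35.1796, 129.0757),
       (36.0000, 127.5000)]
labels = dbscan(gps, eps=100.0, min_pts=3)
print(labels)
```

The brute-force search is quadratic in the number of points, so for data on the scale described here a spatially indexed implementation, such as the `DBSCAN` class in the Scikit-learn library listed above, would be the more practical choice.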
We analyzed and predicted the time-series data using the ARIMA and LSTM models in this study. Parameters such as the battery voltage (BATTERY_VOLTAGE), coolant temperature (COOLANT_TEMPER), and engine oil temperature (ENGINE_OIL_TEMPER) of the vehicle were used for analysis and prediction.
The data used for the prediction and analysis were collected from one vehicle of Company B for 2 h (from April 15, 2020, at 22:10 to April 16, 2020, at 00:09:59). Because the data were collected every second, 7200 rows were collected per parameter. The analysis and results for the prediction of the battery voltage data are described here.
Next, we used an LSTM model, in this case a stateful stacked model, to estimate the battery voltage, coolant temperature, and engine oil temperature. As with the ARIMA model, the data comprise 7200 data points covering 2 h; the data structure for training the model is as follows. We trained the model using 3600 data points (the first 1 h) and evaluated it using the data from the next 30 min (1 h to 1 h 30 min). The testing was conducted with the data of the remaining 30 min (1 h 30 min to 2 h). The battery voltage data were scaled using the MinMaxScaler class (maximum value 1, minimum value 0), and the loss function was set to “mean_squared_error.” Adam, an adaptive learning rate optimization algorithm designed specifically for training deep neural networks, was used as the optimizer [
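The data split and scaling described above can be sketched in plain Python as follows. The sketch makes two assumptions that are not stated in the text: the scaling range is derived from the training portion only, and the battery voltage series is replaced by a synthetic stand-in. The function names `split_series` and `min_max_scale` are hypothetical.

```python
import math


def split_series(values, n_train=3600, n_eval=1800):
    """Split a series chronologically into training, evaluation, and test parts."""
    train = values[:n_train]
    evaluation = values[n_train:n_train + n_eval]
    test = values[n_train + n_eval:]
    return train, evaluation, test


def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    span = hi - lo
    if span == 0:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / span for v in values]


# Synthetic stand-in for 2 h of one-second battery voltage samples (7200 rows)
series = [12.0 + 0.5 * math.sin(i / 300.0) for i in range(7200)]

train, evaluation, test = split_series(series)
# Assumption: the scaling range comes from the training portion only
lo, hi = min(train), max(train)
train_s = min_max_scale(train, lo, hi)
eval_s = min_max_scale(evaluation, lo, hi)
test_s = min_max_scale(test, lo, hi)
print(len(train_s), len(eval_s), len(test_s))
```

Deriving the range from the training portion keeps information about later values out of the training inputs; with this choice, scaled evaluation and test values may fall slightly outside the interval from 0 to 1.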
In this paper, we described our big data analysis platform, the structure of data collection, the development environment of the analysis software, the data processing procedures, and the structure of the data analysis software including preprocessing. Using the analysis software we developed, we performed various analyses: correlation analysis; detection of dangerous driving patterns (rapid acceleration, rapid deceleration, quick start, sudden stop, sudden left turn, sudden right turn, and sudden U-turn); mileage and fuel efficiency analysis; and trajectory analysis. In addition, ARIMA and LSTM models were used to predict vehicle data, such as the battery voltage. A summary of the representative results is as follows: Most vehicles had unique driving patterns, such as rapid acceleration and deceleration, and sudden stop and start. According to our analysis, a vehicle with a large number of rapid accelerations also has a large number of rapid decelerations. The fuel efficiency was generally higher in summer than in winter. Moreover, most of the vehicles belonging to one company had similar driving radii in the trajectory analysis. Currently, we are working on expanding our analysis software and are conducting research to develop new business models based on the software and the results of our analyses. In particular, we are improving the performance of the trajectory analysis algorithm so that the driving patterns of automobiles can be analyzed more efficiently. In the future, we aim to identify ways in which to apply deep learning and machine learning algorithms, such as decision trees, support vector machines (SVMs), and LSTM, to automotive data using a variety of approaches. Furthermore, we plan to conduct research on the analysis of unstructured data, such as vehicle-related images and video files, based on artificial intelligence technology. Eventually, we expect to derive more accurate analysis results and create various business models based on the analyzed results.