Crime is expected to rise as the population grows and the income gap between social groups widens. Crime accounts for a significant share of the socioeconomic loss in any society, not only through indirect damage to the social fabric and peace but also through direct negative impacts on the economy, social indicators, and a nation’s reputation. Policing and other preventive resources are limited and have to be utilized effectively. Conventional methods are being superseded by machine learning algorithms capable of making predictions where the relationships between the features and the outcomes are complex, which makes it possible for such algorithms to indicate specific areas that may become criminal hot-spots. These predictions can be used by policymakers and police personnel alike to devise effective and informed strategies that can curtail criminal activities and contribute to the nation’s development. This paper aims to identify the factors that most affected crime in Saudi Arabia by developing a machine learning model that predicts an acceptable output value. Our results show that FAMD as a feature selection method yielded higher accuracy across machine learning classifiers than PCA. The naïve Bayes classifier performed better than the other classifiers with both feature selection methods, with an accuracy of 97.53% for FAMD and 97.10% for PCA.
Criminal activities have been part of human civilization since its inception. The negative impact of crime levels on socioeconomic indicators cannot be overstated. Although there is a significant correlation between socioeconomic indicators and crime levels, the causality in this relationship works in a cycle in which socioeconomic indicators add to the crime level, such as unemployment,
The origin and growth of crime-levels are based on several characteristics; these characteristics can be different income groups, different racial backgrounds, age groups, family structure [
The availability of crime statistics in the public domain has made it practical to use big data and machine learning (ML) techniques for predicting and preventing crime by supporting the optimal allocation of limited police resources. Knowing the likelihood of crime in a particular area, as predicted by a model, helps allocate additional police personnel to known crime hot-spots at a particular time and thereby reduces the likelihood of crime. Although publicly available data has served research purposes well, three main obstacles have remained to its practical application to real-world problems.
The conventional approaches using multivariate analyses [
Several research studies show that the criminal world forms a complex network with its own rules. To predict the crime rate with high accuracy, it is essential to understand the “nature” of a crime. Existing research on the subject shows that criminal activities are related to a significant number of factors (features). A number of studies consider a variety of individual parameters and show how changes in these parameters influence crime rates or criminal activities. Criminal activities are not randomly distributed across cities; they depend on a number of factors that contribute to the existence of a crime hot-spot in a given location. KNN (K-nearest neighbors) has been successfully applied to the spatial analysis of cities to map the neighborhoods that are most likely to be troubled by criminal activities [
Studies have also illustrated that crime levels are significantly correlated with ambient temperature, as an increase in temperature increases the levels of serotonin in our bloodstream, which directly increases human-to-human interaction and impulsivity among the population, causing crime levels to go up in general [
More recent studies that use more sophisticated machine learning algorithms to model the dataset are free from some of the previous studies’ limitations and biases. A comparison of the capabilities of conventional and modern algorithms suggests that machine learning algorithms can model a diverse dataset better than their conventional peers [
The more recent advancements in the availability of data and in the capability of machines to process such massive datasets have given rise to the possibility of using a new range of features, for example, using CNNs (convolutional neural networks) to process images and train models not only on census data but also on Google Street View data [
With the onset of these computing capabilities, the models take in a significantly large number of features to make predictions, which in turn helps make the model more receptive and more useful for real-world implementation. However, this introduces two new and different challenges to deal with when implementing models with such a large number of features:
1) The increase in the number of features causes models to require significantly more computation than conventional models, which means more resources and time must be dedicated to the training process.
2) The introduction of variables or features that do not significantly impact the predictions makes models somewhat less efficient than they would have been without such redundant features.
To address these problems, it is pivotal to be selective when choosing features. Most studies do not attempt feature selection at all; those that do mostly prefer conventional techniques such as correlation [
1) Principal Component Analysis (PCA) [
2) Factor Analysis of mixed data (FAMD) [
The remainder of the article is organized as follows. Section 2 describes the materials and methods, and the results are presented in Section 3. A brief statistical analysis is carried out in Section 4, and the study is concluded in Section 5.
In this research, we consider all types of crimes in Saudi Arabia. We collected event information on all crimes over almost one year (2018) using GDELT [
Several methods are used in the related work for data preprocessing [
where
Skewness values between –2 and +2 are considered acceptable for treating a univariate distribution as approximately normal. Of the numeric features, four were positively skewed. Most research deals with this by using a log transformation, whose goal is to make data conform more closely to the normal distribution by decreasing data variability. A log-plus-one transformation, log(1 + x), which is a variation of the log transformation, was applied to the four features to account for zero values in those columns. The skewness was significantly reduced after the transformation was applied.
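As a rough illustration of the skewness check and the log-plus-one transformation described above, the following sketch uses only the Python standard library. The column values and function names are hypothetical, and the population (moment-based) skewness formula is an assumption, since the exact variant used in this study is not stated.

```python
import math

def skewness(values):
    """Moment-based (population) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log1p_transform(values):
    """Apply log(1 + x), which stays defined when x is zero."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed count column containing a zero (not the paper's data).
num_mentions = [0, 1, 1, 2, 2, 3, 4, 5, 10, 120]

before = skewness(num_mentions)
after = skewness(log1p_transform(num_mentions))
print(f"skewness before: {before:.2f}, after: {after:.2f}")
```

On this toy column the transformation brings the skewness back inside the ±2 band, which is the effect the preprocessing step relies on.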
Next, the nominal features were encoded into numerical form. All columns in numeric form, such as ‘Actor1code’, ‘Actor1Geo_CountryCode’, and ‘ActionGeo_FeatureID’, were then standardized to have μ = 0 and σ = 1. Rescaling was performed using a z-score. Formally, the z-score formula is given as

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the original value of a feature, $\mu$ is the mean of that feature, and $\sigma$ is its standard deviation.
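The encoding and z-score rescaling steps described above can be sketched as follows. This is a minimal standard-library example under stated assumptions: the integer label encoding scheme, the helper names, and the sample actor codes are all hypothetical and are not taken from the study's pipeline.

```python
import math

def label_encode(values):
    """Map each distinct nominal value to an integer (sorted order)."""
    mapping = {level: i for i, level in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def z_scores(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    if sigma == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mu) / sigma for v in values]

# Hypothetical nominal column (illustrative codes only).
actor_codes = ["SAU", "YEM", "SAU", "GOV", "YEM", "SAU"]

encoded, mapping = label_encode(actor_codes)
scaled = z_scores(encoded)
print(mapping)
print([round(z, 3) for z in scaled])
```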
PCA and FAMD are two dimensionality-reduction algorithms that extract patterns of characteristics in the data. Both analyze the attributes in order to reduce the dimensionality of a dataset with as little information loss as possible. We therefore adopted these two techniques to find the attributes most related to the label (EventCode).
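To make the difference between the two techniques more concrete, the sketch below builds a FAMD-style design matrix and extracts its leading component by power iteration. It assumes the usual FAMD formulation: numeric columns are z-scored as in PCA, and each categorical level becomes an indicator column divided by the square root of its proportion and then centered. The toy records and all function names are hypothetical; this is not the implementation used in the study, which would normally rely on a dedicated library.

```python
import math

def standardize(col):
    """Z-score a numeric column (population standard deviation)."""
    n = len(col)
    mu = sum(col) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in col) / n)
    return [(v - mu) / sd for v in col]

def famd_indicators(col):
    """One indicator column per level, scaled by 1/sqrt(p), then centered."""
    n = len(col)
    out = []
    for level in sorted(set(col)):
        p = col.count(level) / n
        raw = [(1.0 if v == level else 0.0) / math.sqrt(p) for v in col]
        mean = sum(raw) / n
        out.append([r - mean for r in raw])
    return out

def leading_component(columns, iters=500):
    """Power iteration on the covariance matrix of the given columns."""
    n, d = len(columns[0]), len(columns)
    cov = [[sum(a * b for a, b in zip(columns[i], columns[j])) / n
            for j in range(d)] for i in range(d)]
    vec = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-uniform start
    for _ in range(iters):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                     for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return vec, eigenvalue, total_variance

# Hypothetical toy records (not the paper's data).
num_mentions = [2, 5, 3, 10, 4, 8]
quad_class = ["conflict", "conflict", "coop", "conflict", "coop", "coop"]
region = ["Riyadh", "Makkah", "Riyadh", "Jizan", "Riyadh", "Makkah"]

columns = ([standardize(num_mentions)]
           + famd_indicators(quad_class) + famd_indicators(region))
vec, eigenvalue, total_variance = leading_component(columns)
print(f"first component explains {eigenvalue / total_variance:.0%} of the variance")
```

With this scaling, a categorical variable with K levels contributes a total variance of K - 1, so the toy matrix has total variance 1 + 1 + 2 = 4. Plain PCA on integer-encoded categories would instead treat the arbitrary codes as magnitudes.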
| PCA | FAMD |
|---|---|
| NumMentions | EventBaseCode |
| Actor1Code | EventRootCode |
| Actor2Code | QuadClass |
| Actor1Geo_CountryCode | Actor1Code |
| Actor2Geo_CountryCode | Actor1Geo_CountryCode |
| EventBaseCode | Actor1Geo_type |
| Month | Actor2Code |
| | Region |
| | Actor2Geo_CountryCode |
| | Month |
Specific classification techniques have been selected, as shown in
Classifier | PCA | FAMD |
---|---|---|
Naive Bayes | 97.10% | 97.53% |
Random Forest | 93.18% | 95.64% |
KNN | 90.35% | 95.03% |
Decision Tree | 86.48% | 91.32% |
Deep Learning | 64.43% | 82.23% |
We analyze the performance of the two dimensionality-reduction techniques with best-practice machine learning algorithms [
The results of our experiments make clear that Naive Bayes achieves an accuracy of about 97% with both techniques, which is higher than the other classifiers. The reason is that Naive Bayes treats the features as independent and so does not rely on modeling the dependence between them. Moreover, it excludes attributes with missing values. The KNN classifier, on the other hand, has lower accuracy than Naive Bayes: 90.35% with PCA and 95.03% with FAMD. This shows how sensitive KNN is to irrelevant variables and data size.
Furthermore, Naïve Bayes, random forest, and KNN generally perform comparably in practice when determining criminal activities with the FAMD technique, with some advantage to Naïve Bayes. Decision trees present a lower correlation coefficient compared with the other algorithms because the branches of a decision tree are more rigid; a decision tree provides precise outcomes only when the test dataset follows the modeled pattern.
Data visualization also helps in analyzing the dataset, so Section 4 analyzes the crime rate per region over time, covering event types such as occupying territory, fighting, expelling or deporting, and conventional and unconventional violence. We also computed the total event count and the weekly event count, considering population, male population, smokers, education, and unemployment as prominent leading factors.
This section provides a general picture of crime trends in Saudi Arabia. It introduces statistics on crime and studies their variation with other statistics to discover possible relationships. We try to identify the areas more prone to crime and the reasons for their high rates of criminal activity. Creating crime scenarios may therefore help prevent crimes, discover relationships and trends, show which areas of Saudi Arabia are safe or dangerous, and predict an acceptable output value.
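The independence assumption mentioned above can be illustrated with a small Gaussian Naive Bayes classifier written with the standard library only. This is a sketch under stated assumptions: the Gaussian likelihood, the variance floor, the two-feature toy data, and the labels `event_a` and `event_b` are all hypothetical, and the Naive Bayes variant actually used in the experiments is not specified in the text.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels, var_floor=1e-9):
    """Estimate a log-prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        cols = list(zip(*members))
        means = [sum(col) / n for col in cols]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(cols, means)]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log-posterior for one row."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        # Independence assumption: per-feature log-likelihoods simply add up.
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
    return max(model, key=log_posterior)

# Hypothetical reduced features (e.g. two component scores) and event labels.
rows = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
        [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
labels = ["event_a", "event_a", "event_a", "event_b", "event_b", "event_b"]

model = fit_gaussian_nb(rows, labels)
print(predict(model, [1.1, 2.1]), predict(model, [3.1, 0.4]))
```

Because each feature contributes its own term to the sum, correlations between features never enter the model, which is why the classifier is unaffected by how they depend on one another.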
Former Attorney General of New Jersey Anne Milgram explained in her TED Talk why smart statistics are the key to fighting crime, using integrated data to analyze the criminal justice system. She observed that there was a shortage of data-driven decision-making and argued for using big data analytics and data science instead of yellow post-it notes to prevent crimes. She said: “use of smart data and statistics in making player decisions was good enough for the Oakland A’s, Milgram figured it would be good enough for the legal system.” She created a team of data scientists to fight crime in a data-based manner with better decision-making. She succeeded in using smart data to reduce murders by 41% and crimes in New Jersey by 26%.
Three regions in Saudi Arabia were identified as the most crime-prone: Riyadh (15,037), Makkah (4,132), and Jizan (2,798). August 2018 had the highest crime rate in the most crime-prone region, with the capital Riyadh recording 2,300 cases.
In Riyadh, a crime was committed three times every 6 hours, the highest rate in Saudi Arabia. One suggested reason for the higher crime rate in Riyadh is that it is the capital, though this assumption was not substantiated. The likely reason that August had the highest crime rate is that the month of Dhul Hajj, when Muslims come from all over the world for pilgrimage, started on August 12. Moreover, the second month (February 2018) had the fewest crimes, and the sixth month had a lower crime rate than the preceding months in all regions.
The oil-rich region of Ash Sharqiyah had the fourth-highest crime count, although we expected it to overtake Jizan because it is a border region; however, it remained in fourth position. Tabuk is a military region on the outer border. A greater military presence means fewer crimes, which explains its lower crime rates.
In
Jizan had the third-highest crime rate in almost all months, ranging from almost 150 in February 2018 to 700 in April 2018. In April 2018 it climbed to second place, overtaking Makkah. February 2018 had the lowest crime rate in Riyadh, Makkah, Jizan, and Ash Sharqiyah. The highest number of crimes in Riyadh, Makkah, and Ash Sharqiyah occurred in August 2018, whereas Jizan recorded its highest number in April 2018.
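The per-region and per-month comparisons above amount to counting events by group. A minimal sketch of that aggregation follows, using `collections.Counter`. The records are hypothetical (region, month) pairs chosen only for illustration and do not reproduce the counts reported here.

```python
from collections import Counter

# Hypothetical (region, month) event records; not the paper's data.
events = [
    ("Riyadh", 8), ("Riyadh", 8), ("Riyadh", 8), ("Riyadh", 2),
    ("Makkah", 8), ("Makkah", 8), ("Makkah", 2),
    ("Jizan", 4), ("Jizan", 4),
]

by_region = Counter(region for region, _ in events)
by_region_month = Counter(events)

def peak_month(region):
    """Month with the most events for one region (ties go to the earlier month)."""
    months = {m: c for (r, m), c in by_region_month.items() if r == region}
    return min(months, key=lambda m: (-months[m], m))

# Regions ordered from most to fewest events.
ranking = [region for region, _ in by_region.most_common()]
print(ranking, {r: peak_month(r) for r in ranking})
```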
The two most predominantly occurring crimes in Makkah are “Use the conventional military force” and “Arrest, detain, or charge with legal action.” Other frequently occurring crimes include “Fight with small arms and light weapons,” “Impose administrative sanctions,” and “Threaten.” The two most frequent crimes in Jizan are “Use the conventional military force” and “Fight with artillery or tanks.” Other significant crimes include “Fight with small arms and light weapons,” “Employ aerial weapons,” and “Threaten”.
In
In
The high crime rate in August is an interesting one. The month of Dhul Hajj started on 12 August, when Muslims from all around the world come to perform the rites of pilgrimage. To examine any possible correlation, we performed a statistical analysis using the online data that the Ministry of Interior publishes on its Twitter account, as shown in
The next
The Hajj count came from Jeddah Islamic Port and King Abdelaziz airport, both located in the Makkah region. There were more than 918 thousand pilgrims, while the event count was 110 per month. As we can see, 11 August had an event count of 39, while 12 August had a count of 70–71.
Smoking is still a big issue worldwide and a significant public health problem. Does smoking influence the psychological and mental state in ways that lead to committing crimes? We can see from
The weather could also be an influential factor in crime rates. Weather data provides the temperature for a given area and time.
This section introduced statistics on crime and studied their variation with other statistics to discover possible relationships. As explained in this section, the crime rate is significantly higher in Riyadh than in most other regions. We also found a significant increase in crime in August 2018 compared with other months, and that crimes related to “Use of conventional military force” and “Arrest, detain or charge with legal action” occurred more frequently than other crimes. Saudi Arabia’s count as both Actor 1 and Actor 2 was higher than that of other countries, and there was a significantly high count with Yemen and Saudi Arabia as actors. Finally, the crime rate decreased as the literacy rate increased across the three regions of Riyadh, Jizan, and Makkah.
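One simple way to examine whether daily pilgrim arrivals and daily event counts move together is the Pearson correlation coefficient. The sketch below is an assumption about the kind of statistic involved, since the text does not name the measure used, and both daily series are hypothetical placeholders rather than the Ministry of Interior figures.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical daily series (arrivals in thousands, events per day).
arrivals = [10, 20, 30, 40, 50]
event_counts = [35, 39, 50, 62, 70]

r = pearson_r(arrivals, event_counts)
print(f"Pearson r = {r:.3f}")
```

A value close to +1 would indicate that the two series rise together, although a correlation on its own does not establish that the pilgrimage causes the increase.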
In this paper, we investigated the factors that influence crime rates in Saudi Arabia. We observed that particular months and regions of Saudi Arabia accounted for most of the events. The extracted dataset was pre-processed and prepared for machine learning. Several machine learning classifiers were trained and the accuracy of each was measured; we then applied some techniques to improve it. We conclude that Naive Bayes is the most suitable for crime classification experiments, and that deep learning needs a much larger dataset to produce a better classifier with high accuracy.
It seems impossible to predict crime, but it can be prevented if the time at which crime happens is known. Our research can be improved in different ways. In the future, we plan to combine it with a risk terrain modeling technique to enhance crime prediction. We also plan to extend the study beyond the local level by collecting a much larger crime dataset from Arab Gulf countries with more features, such as education, population, and weather, and then to analyze those data to predict which areas are most at risk and which are safer. Moreover, more advanced machine learning algorithms, such as artificial neural networks and deep learning, will be implemented to achieve a more balanced approach towards criminal activities.
We would like to thank the Deanship of Scientific Research, Qassim University for funding the publication of this project.