With the worldwide spread of internet-connected devices and the rapid advances made in Internet of Things (IoT) systems, much research has been done on using machine learning methods to recognize IoT sensor data. This is particularly the case for optical character recognition of handwritten scripts. Recognizing text in images has several useful applications, including content-based image retrieval, searching, and document archiving. Arabic is one of the most widely used languages in the world. However, Arabic text recognition in imagery is still very much in a nascent stage, especially for handwritten text. This is mainly due to the language's complexities, different writing styles, variations in the shape of characters, diacritics, and the connected nature of Arabic text. In this paper, two deep learning models are proposed. The first model is based on sequence-to-sequence recognition, while the second is based on a fully convolutional network. To measure the performance of these models, a new dataset, called QTID (Quran Text Image Dataset), was devised. This is the first Arabic dataset that includes Arabic diacritics. It consists of 309,720 different images.
The Internet of Things (IoT) is based on a set of network and physical systems, together with machine intelligence methods that can analyze and infer from data for certain purposes. It seeks to build an intelligent environment that facilitates making the proper decision. IoT applications are particularly required in visual recognition fields such as Intelligent Transport Systems (ITS) and video surveillance.
Optical character recognition (OCR) is the process of converting an image that contains text into machine-readable text. It has many useful applications, including document archiving, searching, content-based image retrieval, automatic number plate recognition, and business card information extraction. OCR can also serve as an assistive tool for blind and visually impaired people. An OCR system typically involves some pre-processing of the input image file, extraction of text areas, and recognition of the extracted text using feature extraction and classification methods. Arabic is a widely spoken language throughout the world, with 420 million speakers. Compared to Latin text recognition, not much research has been done or published on Arabic text recognition, and it remains a topic requiring further study.
The Holy Quran is the religious scripture that Muslims throughout the world follow. Approximately one and a half billion people around the world recite the Holy Quran. Most of the existing versions of the Quran have been published in the Arabic language rather than the Quranic script. The Holy Quran with Othmani font represents the main source for Arabic language rules in the form of a handwritten script. The Othmani font is chosen for three major reasons: (1) it is one of the major grammar sources of the Arabic language, (2) it contains different words, characters, and diacritics from across the Arabic language, and (3) it contains all the recitation styles' letters and vowels. The challenges associated with Quranic scripts can be summarized as follows:
Since OCR processes images, it is beset by long-standing visual computing challenges such as poor-quality images and background noise.
Arabic handwritten text does not follow fixed patterns and depends on the writer's individual style. For instance, handwritten biometric signatures show great dissimilarity even when produced by the same writer.
Broadly speaking, the sizes of Arabic letters in the same script depend on the font and on each letter's position in the word. For this reason, segmenting these letters is not an easy task.
The pronunciation of Arabic letters is controlled by diacritics. A letter can take from four to eight forms, according to the type of diacritic and the position of the letter in the word.
In contrast to English, the number of research studies on Arabic is very small. This prevents new technologies from being applied, since there is a definite shortage of resources.
The most recent paradigms, such as deep learning algorithms, require a massive amount of data to train and evaluate the networks. The recognition of Quranic letters still lacks the availability of large datasets. Intuitively, deep learning algorithms work better with large datasets than with small ones.
System/model | CRR with diacritics (%) | CRR without diacritics (%)
---|---|---
Tesseract 4.0 | 11.40 | 20.70
ABBYY FineReader 12 | 6.15 | 13.80
Quran-seq2seq-Model | 97.60 | 97.05
Quran-Full-CNN-Model | 98.90 | 98.55
In this paper, two deep learning-based techniques built from convolutional neural network (CNN) and long short-term memory (LSTM) networks are proposed to improve Arabic word image text recognition, using the Holy Quran corpus with Othmani font. The Othmani font is chosen for three key reasons: firstly, it is one of the major grammar sources of the Arabic language; secondly, it contains different words, characters, and diacritics from across the Arabic language; and thirdly, the Mus'haf (Holy Quran book) is written in Othmani font, which is a handwritten text that contains various shapes for each character. Each Arabic word is written in a white font on a black background.
The first model, known as Quran-seq2seq-Model, consists of an encoder named Quran-CNN-encoder and a multi-layer LSTM decoder; it is similar to image captioning models. The main contributions of this paper are as follows:
Developing two end-to-end deep learning models that recognize Arabic text images in the Quran Text Image Dataset (QTID).
Creating and evaluating a new dataset, called QTID, taken from the Holy Quran corpus.
Demonstrating experimentally that the proposed models outperform the best OCR engines, such as Tesseract and ABBYY FineReader 12.
In the last few decades, research and commercial organizations have proposed several systems aiming at accurate Arabic OCR for printed and handwritten text. Some of them have achieved a recognition accuracy of 99% or more for printed text, but handwritten recognition is still under development.
Tesseract OCR is a widely used open-source OCR engine that supports Arabic among many other languages.
An offline font-based Arabic OCR system has also been proposed in the literature.
Recognition of handwritten Quranic text is more complex than that of printed Arabic text. Quranic text contains ligatures, overlapped text, and diacritics, and exhibits more writing variations and styles. Further, a letter in the same style may have different aspect ratios. The challenges associated with handwritten Quranic text recognition were outlined in the introduction.
Considering the Arabic word image text recognition problem as a sequence-to-sequence deep learning problem, the proposed methodology is based on encoder-decoder model architectures. The encoder part in both models uses a deep CNN similar to VGG-16.
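As a concrete illustration, the following is a minimal sketch of what a VGG-16-style Quran-CNN-encoder could look like in Keras, the framework used in this paper. The input size, number of blocks, filter counts, and feature dimension are illustrative assumptions rather than the authors' exact configuration.

```python
# Hypothetical sketch of a VGG-16-style convolutional encoder in Keras.
# Input size (64x256 grayscale) and filter counts are assumptions.
from tensorflow.keras import layers, models

def build_quran_cnn_encoder(input_shape=(64, 256, 1)):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # VGG-style stacks: pairs of 3x3 convolutions followed by 2x2 max-pooling.
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    # Collapse the final feature map into a single feature vector.
    x = layers.Flatten()(x)
    features = layers.Dense(512, activation="relu")(x)
    return models.Model(inputs, features, name="Quran-CNN-encoder")
```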
The first model, called Quran-seq2seq-Model, consists of an encoder named Quran-CNN-encoder and a multi-layer LSTM decoder. It is similar to image captioning models, which are trained to maximize the likelihood of the correct character sequence given the image:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,C)} \log p(C \mid I; \theta)$$

where $I$ is the input image, $C$ is the ground truth characters of the image, and $\theta$ denotes the model parameters.
In the training phase of Quran-seq2seq-Model, its parameters are optimized for the inputs and outputs from the training set. To optimize the model’s loss function, the optimization algorithm and the learning rate must be specified. Since this is a multi-class classification problem, the loss function applied is the cross-entropy, which is defined as:
$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{i=1}^{V} y_{t,i} \log \hat{y}_{t,i}$$

where $y_{t,i}$ is the ground-truth one-hot indicator and $\hat{y}_{t,i}$ is the predicted probability of character class $i$ at output position $t$, with $T$ output positions and $V$ character classes.
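To make the decoder side concrete, the sketch below shows one plausible, image-captioning-style reading of the Quran-seq2seq-Model, reusing the hypothetical build_quran_cnn_encoder above. The vocabulary size, maximum word length, and LSTM widths are assumptions, not values from the paper.

```python
# Hypothetical sketch of the multi-layer LSTM decoder on top of the encoder.
from tensorflow.keras import layers, models

VOCAB_SIZE = 100  # assumed: Arabic letters + diacritics + padding symbol
MAX_LEN = 30      # assumed: maximum characters per word image

def build_quran_seq2seq(encoder):
    # Repeat the image feature vector once per output time step.
    x = layers.RepeatVector(MAX_LEN)(encoder.output)
    # Multi-layer LSTM decoder.
    x = layers.LSTM(256, return_sequences=True)(x)
    x = layers.LSTM(256, return_sequences=True)(x)
    # Per-time-step softmax over the character vocabulary (cross-entropy target).
    outputs = layers.TimeDistributed(
        layers.Dense(VOCAB_SIZE, activation="softmax"))(x)
    return models.Model(encoder.input, outputs, name="Quran-seq2seq-Model")
```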
Sequence-to-sequence models work efficiently in text recognition problems such as Arabic handwritten text. However, their character prediction and concatenation time is higher than that of fully convolutional models. Fully convolutional models can make all predictions at once, which makes network training easier and helps produce faster predictions. Further, these models take advantage of GPU parallelization, since they do not need to wait for the previous time step. Besides, the number of parameters in these models is smaller compared to sequence-to-sequence models. However, fully convolutional models are limited to a fixed number of output units.
The Quran-Full-CNN-Model reuses the same Quran-CNN-encoder as discussed in Section 3.1. However, instead of LSTM layers, this model adds a fully connected layer followed by a Softmax activation.
In the training phase of the Quran-Full-CNN-Model, we use the same loss function and other metrics as in the Quran-seq2seq-Model. Similarly, the Adam optimization algorithm was used to optimize the loss function with the same beta 1 and beta 2 values. The learning rate of the model was set to 0.001 and the mini-batch size was 32, with no learning-rate decay.
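A hedged sketch of this variant and of the stated training configuration is shown below, again reusing the hypothetical encoder from earlier. The output head (one dense layer reshaped into a fixed grid of per-position softmaxes) is one plausible reading of "a fully connected layer followed by a Softmax"; the beta values are Keras defaults, which the paper does not specify.

```python
# Hypothetical sketch of the Quran-Full-CNN-Model head and training setup.
from tensorflow.keras import layers, models, optimizers

def build_quran_full_cnn(encoder, max_len=30, vocab_size=100):
    # One dense layer emits logits for every output position at once,
    # then the logits are reshaped into (positions, classes).
    x = layers.Dense(max_len * vocab_size)(encoder.output)
    x = layers.Reshape((max_len, vocab_size))(x)
    outputs = layers.Softmax(axis=-1)(x)
    return models.Model(encoder.input, outputs, name="Quran-Full-CNN-Model")

model = build_quran_full_cnn(build_quran_cnn_encoder())
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),  # default beta_1/beta_2 assumed
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_images, train_labels, batch_size=32, epochs=10)
```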
The Quran-seq2seq and Quran-Full-CNN models, along with their training phases, were implemented using the Keras framework with TensorFlow as the backend. The training phase for the Quran-seq2seq-Model took around 6 h for ten epochs, which reduced the loss value from 32.0463 to 0.0074. The network evaluation metric shows that the recognition process achieves 99.48% accuracy on the training set. Moreover, this evaluation process took 587 s.
The Quran-Full-CNN-Model was implemented with the same development setup as the Quran-seq2seq-Model. However, the training phase took around 2 h to reduce the loss value from 24.0282 to 0.0074 in ten epochs. The network evaluation metric shows that the recognition process achieves 99.41% accuracy on the training set. Apart from this, the evaluation process took 345 s for the whole training set on the same machine. The network training for both models was performed on an Intel Core i7 at 3.80 GHz with a 4 GB GTX 960 Nvidia GPU and 16 GB of DDR5 RAM.
To demonstrate the effectiveness of the Quran-seq2seq and Quran-Full-CNN models, different experiments were conducted on the QTID dataset.
To train, validate, and test the proposed models, a new Arabic text recognition dataset was created. The dataset can be used as a benchmark to measure the current state of Arabic text recognition. Moreover, it is the first Arabic dataset that includes diacritics along with handwritten Arabic words. The Holy Quran corpus with Othmani font was used as the source for the QTID dataset. This font contains different words, characters, and diacritics from across the Arabic language. Moreover, the Mus'haf (Holy Quran book) is written in Othmani font, which is a handwritten text where each character is represented in various shapes.
To evaluate the proposed models, five different evaluation metrics have been used. The first evaluation metric is a character recognition rate (CRR), which is defined as follows:
$$\mathrm{CRR} = \frac{\mathrm{len}(GT) - \mathrm{LevenshteinDistance}(RT, GT)}{\mathrm{len}(GT)}$$

where $RT$ is the recognized text and $GT$ is the ground truth text. The Levenshtein distance function measures the distance between two strings as the minimum number of single-character edits (insertions, deletions, or substitutions). The other four measures are accuracy, average precision, average recall, and average F1 score, which are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN},\qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The F1 score takes the harmonic average of the precision and recall for a specific character.
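The CRR formula above can be computed directly; the following sketch uses a standard dynamic-programming Levenshtein distance so that no external library is assumed.

```python
# Character recognition rate (CRR) per the definition in Section 5.2.
def levenshtein(a: str, b: str) -> int:
    # Classic edit-distance DP over insertions, deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def crr(recognized: str, ground_truth: str) -> float:
    return (len(ground_truth) - levenshtein(recognized, ground_truth)) / len(ground_truth)
```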
The performance of the proposed models on the QTID dataset has been evaluated and compared with state-of-the-art commercial OCR systems. The Quran-seq2seq and Quran-Full-CNN models, Tesseract, and ABBYY FineReader 12 were evaluated using the metrics as described in Section 5.2. Since Tesseract and ABBYY FineReader 12 cannot recognize the Arabic diacritics, an additional test set was created. This additional test set contained the same Arabic text images as in the target test set, but the diacritics were removed from the ground truth text.
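Building the diacritic-free test set amounts to stripping the Arabic combining marks from the ground-truth strings. The helper below is a hypothetical sketch; the exact Unicode range the authors removed is not stated, and U+064B–U+0652 (harakat, tanween, shadda, sukun) plus the superscript alef U+0670 is a common choice.

```python
# Strip Arabic diacritics from ground-truth text (assumed Unicode range).
import re

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def remove_diacritics(text: str) -> str:
    return ARABIC_DIACRITICS.sub("", text)
```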
All images in the test sets were converted to grayscale and fed to the four different models to obtain predictions. All text predictions, along with the ground truth text, were saved in two lists: one for the standard test set with Arabic diacritics and the other for the additional test set without Arabic diacritics. With each model, two evaluations were performed on the test sets with and without Arabic diacritics, which led to eight different lists. The average prediction time for the models developed in this paper was 30 s per image in the test set.
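A minimal sketch of this test-time preprocessing, assuming Pillow for image handling and the input size used in the encoder sketch earlier:

```python
# Convert a test image to grayscale and shape it for model.predict.
import numpy as np
from PIL import Image

def load_test_image(path: str, size=(256, 64)) -> np.ndarray:
    img = Image.open(path).convert("L")         # grayscale
    img = img.resize(size)                      # (width, height)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return arr.reshape(1, size[1], size[0], 1)  # batch of one
```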
The evaluation results using the character recognition rate (CRR) metric for the proposed models and the commercial OCR systems are shown in the table below.
To calculate the overall accuracy, average precision, average recall, and average F1 score, some pre-processing of the predicted text and the ground truth text was done. The predicted and ground truth text in both test sets were aligned using a sequence-alignment algorithm, so that each pair of texts has the same length and each character in the predicted text maps to a character in the ground truth text.
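The paper does not specify the alignment algorithm, but the following sketch shows one way such a character-level alignment could be produced with Python's difflib, padding unmatched positions with a placeholder so both strings end up the same length.

```python
# Pad-align a predicted string against its ground truth using difflib opcodes.
from difflib import SequenceMatcher

PAD = "\u0000"  # placeholder for unmatched character positions

def align(pred: str, truth: str):
    pred_out, truth_out = [], []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, pred, truth).get_opcodes():
        seg_p, seg_t = pred[i1:i2], truth[j1:j2]
        width = max(len(seg_p), len(seg_t))
        pred_out.append(seg_p.ljust(width, PAD))
        truth_out.append(seg_t.ljust(width, PAD))
    return "".join(pred_out), "".join(truth_out)
```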
System/model | CRR with diacritics (%) | CRR without diacritics (%)
---|---|---
Tesseract 4.0 | 10.67 | 17.36
ABBYY FineReader 12 | 2.32 | 5.33
Quran-seq2seq-Model | 95.65 | 95.85
Quran-Full-CNN-Model | 98.50 | 97.95
System/model | Avg precision with diacritics (%) | Avg precision without diacritics (%) | Avg recall with diacritics (%) | Avg recall without diacritics (%)
---|---|---|---|---
Tesseract 4.0 | 56.73 | 50.59 | 18.44 | 14.60
ABBYY FineReader 12 | 34.37 | 37.64 | 4.06 | 6.14
Quran-seq2seq-Model | 91.85 | 97.55 | 87.35 | 97.53
Quran-Full-CNN-Model | 90.46 | 98.66 | 89.64 | 97.41
The average F1 scores of the proposed Quran-seq2seq and Quran-Full-CNN models, with and without Arabic diacritics, are shown in the table below.
System/model | Avg F1 score with diacritics (%) | Avg F1 score without diacritics (%)
---|---|---
Tesseract 4.0 | 27.83 | 22.66
ABBYY FineReader 12 | 7.27 | 10.55
Quran-seq2seq-Model | 89.55 | 95.88
Quran-Full-CNN-Model | 90.05 | 98.03
The results documented in the tables above show that the proposed models substantially outperform the commercial OCR systems on the QTID test sets, both with and without Arabic diacritics.
Optical character recognition systems are supposed to deal with all kinds of languages in imagery and convert them to the corresponding machine-readable text. Arabic text recognition in OCR systems has not yet reached the standard achieved for Latin text. This is mainly due to the language's complexities and other challenges associated with Arabic text. This paper proposed two deep learning-based models to recognize Arabic Quranic word text in images. The first model is a sequence-to-sequence model and the other is a fully convolutional model. A new large-scale dataset named QTID was developed from the words of the Holy Quran to improve the recognition accuracy of Arabic text in images. This is the first Arabic dataset to contain Arabic diacritics. The dataset consists of 309,720 images, which were split into training, validation, and testing sets. Both models were trained and tested on the QTID dataset. To compare the performance of the proposed models, the QTID test set was also evaluated with two commercial OCR systems. The results show that the proposed models outperform the commercial OCR systems. Although the proposed models perform well on QTID, they have some limitations: the text must be at the center of the input image, and it must be written in white on a black background. In the future, more Arabic images with diverse text directions will be included. Arabic word text images with different foreground and background colors will be added. An end-to-end system will be proposed for recognizing sentence-level Arabic text in images. Further, a few more deep learning models will be evaluated on the proposed QTID dataset.