Multi-Domain Deep Convolutional Neural Network for Ancient Urdu Text Recognition System

Deep learning has achieved remarkable success in the field of pattern recognition. In recent years, Urdu character recognition systems have benefited significantly from the effectiveness of deep convolutional neural networks. The majority of research on Urdu text recognition concentrates on formal handwritten and printed Urdu text documents. In this paper, we address the challenging problem of text recognition in ancient Urdu literature documents. Owing to the cursiveness of the script, its complex word formation (ligatures), its context sensitivity, and the lack of an adequate benchmark dataset, recognizing Urdu text in literature documents is much more difficult than in formal Urdu text documents. In this work, we first generated a dataset by extracting the recurrent ligatures from an ancient Urdu fatawa book. Second, we categorized and augmented the ligatures to generate batches of augmented images that improve training efficiency and classification accuracy. Finally, we proposed a multi-domain deep convolutional neural network (CNN) that integrates a spatial-domain CNN and a frequency-domain CNN and learns the modular relations between features originating from the two domain networks to improve classification accuracy. The experimental results show that the proposed network with the augmented dataset achieves an average accuracy of 97.8%, which outperforms the other CNN models in this class. The experimental results also show that well-known benchmark datasets are not appropriate for the recognition of ancient Urdu literature, which is verified with our prepared dataset.


Introduction
Text recognition is regarded as one of the great advances in the field of natural language processing and has attracted many researchers to work on translating handwritten or printed documents into a computer-editable format. Text recognition is also one of the biggest challenges in machine translation because it requires identifying the specific sequence of words that forms the source sentence. Urdu is one of the oldest and principal languages of the Indian subcontinent. Around the world, Urdu is spoken in more than 15 countries by an estimated 280 million or more speakers. Urdu is the national language of Pakistan, an official language of six states of India, and one of the 22 official languages recognized in the Constitution of India. The Urdu language shares characteristics with Arabic, Farsi, and Pashto, and an Urdu speaker can comprehend Hindi without much difficulty. Urdu is written from right to left using an alphabet derived from the Arabic character set. In speech, Urdu is quite similar to Hindi, but it differs in writing. Urdu is written in Arabic script with some additional characters, in different cursive font styles such as Nasta'liq, Kufi, Thuluth, Diwani, Riq'a, and Naskh. Over the last two decades, significant research has been conducted on languages such as English, Chinese, French, Devanagari, and Tamil to digitize existing printed and handwritten documents [1]. The most challenging task, however, is to develop a recognition system for cursive languages such as Arabic, Farsi, Pashto, and Urdu. Deep learning has made a huge impact on computer vision and set the state of the art by providing highly accurate classification results. With the advancement of deep learning, character recognition has become a widely used method for identifying and labeling natural languages.
A prominent class of deep learning models is the convolutional neural network (CNN), which has been shown to be very effective at detecting complex patterns in natural language words. CNNs have become increasingly popular due to their ability to capture highly descriptive and well-ordered image features [1]. One of the prime factors behind the high accuracy of CNNs is that they are trained on massive numbers of labeled samples. ImageNet is the most widely recognized dataset used to train CNN models with error backpropagation [2]. However, obtaining a dataset as extensively labeled as ImageNet remains a big challenge for character recognition in under-resourced languages like Urdu. Training a deep CNN on a small dataset frequently causes convergence issues.
Transfer learning is one of the most promising approaches for training deep CNN models with limited training data. It lets the network learn quickly on new input data with fewer classes [3]. For under-resourced classification tasks such as text recognition, transfer learning is a better option than reducing the network size [4]. Various script identification systems have recently been demonstrated using different network models [5]. For example, Reference [6] presented handwriting recognition of historical documents by transfer learning a CNN-BLSTM-CTC network, in which the authors transfer learned a composite dataset with ground truth and pooled the collective attributes with a new dataset. Similarly, Reference [7] presented a hybrid model for handwriting recognition by transfer learning on a heterogeneous dataset. Character-level text ConvNets for English and Chinese corpora are presented in [8], where transfer learning is used to reuse the meaningful representations that the ConvNets learn from a large-scale dataset. AlexNet, one of the most robust CNNs, pre-trained on the ImageNet dataset, has been exploited for character recognition of many scripts such as Devanagari [9], Malayalam [10], Korean [11], and Tamil [12]. Handwritten Devanagari character recognition was introduced in [13] using layer-wise training of a deep CNN and achieved good results with six different adaptive gradient methods. Latin and Chinese character recognition was presented in [14], where the authors transfer learned a deep CNN trained on digits to recognize upper-case letters. Transfer learning has also supported numeral recognition for several scripts where the network is trained with a small numeral dataset [15]. Handwritten characters of Tamil, one of the oldest languages in the world, have been transfer learned using VGG16 with promising results [16].
One of the state-of-the-art techniques for character recognition by transfer learning in languages of a similar script family, Kannada and Telugu, is presented in [17]. From the literature, we noticed that transfer-learned networks are mostly applied to printed rather than handwritten script, and that CNN-based transfer learning is effectively applied to non-cursive scripts such as Chinese, Latin, Bangla, and Devanagari. Relatively few studies have concentrated on cursive scripts such as Arabic [18], Urdu [19], and Farsi [20]; in these, the experiments are carried out on regular documents and the network models are trained on specific benchmark datasets such as UNHD [21], UPTI [22], EMILLE [23], and WordNet [24]. These datasets consist of handwritten text lines and word images written by various writers. They, and the models trained on them, cannot be applied to the digitization of Urdu literature documents because of the inherent complexity of such documents: the absence of a proper baseline, diagonality, packed loops, incorrect loops, and large disparity among ligatures. To our knowledge, no Urdu ligature image dataset is currently available specifically for the recognition of Urdu text from literature documents.
The key advantage of Urdu literature is that it comes in multiple volumes, and nearly 80% of the text (words) in the first book recurs in the succeeding volumes. We exploited this property to develop a dataset specifically for text recognition in Urdu literature documents. We extracted the ligatures from the first volume of an ancient Urdu fatawa book (Fatawa Azizi), which has four volumes. We then proposed a multi-domain deep CNN for handwritten Urdu ligature classification that exploits both spatial-domain and frequency-domain convolution to learn modular relations between features of Urdu ligature images.
The chief contributions of this work are: (I) a ligature-based image dataset for text recognition in Urdu literature documents; (II) the use of multi-domain CNN features for Urdu text recognition; (III) deep learning for Urdu ligature classification, which eliminates the need to segment individual characters; and (IV) a network that is well suited to real-time recognition for the digitization of ancient literature archives.

Characteristic of Urdu Literature Document Text
Nastalique is an artistic, curvaceous, calligraphic style widely used in writing Urdu literature. The National Language Authority of Pakistan has defined 58 characters in Urdu, 28 of which are derived from Arabic, but only 40 basic characters and one dochashmi-hey are used to form every composite letter. The Urdu Nastalique script is inherently cursive, has four distinct shapes for each character, and is context-sensitive. A ligature is formed by joining two or more characters cursively, together with their diacritics, in a free-flowing structure. A ligature is not a complete word; in most cases it is a part of a word, or a sub-word. A Nastalique word is composed of ligatures and isolated characters.

Segmentation
We prepared the dataset by extracting all possible ligatures from the first volume of the Urdu fatawa book Fatawa Azizi, authored by Shaykh Shah Abdul Aziz Muhaddith Dehlvi (r.a.), which consists of four volumes. The scanned copy of this book was obtained from the Arabic college of our nearby town. The scanned documents had two common issues: noise and skew. As preprocessing, we de-noised the documents using a median filtering scheme and the k-fill method [25] and de-skewed them using Ali's algorithm [26]. To give a clear idea of the physical structure of an Urdu literature document, Fig. 1A shows a sample text page from the fatawa book.
Segmenting lines from a page or a paragraph in an Urdu literature document is a challenging and crucial task. Several methods exist to address the problems involved, and they fall into two categories: top-down and bottom-up approaches [27]. In a top-down approach, a line segmentation algorithm uses large-scale features of a line to determine its boundaries. A bottom-up approach starts from the smallest element of a document image, the pixel, and produces connected components by grouping touching pixels. Based on the nature of our documents, we employed the block covering analysis method for text line segmentation, which is reported in the literature to perform best. The ligature segmentation phase separates text lines into individual words or ligatures. In Urdu literature documents, ligature segmentation relies essentially on the separation between words and sub-words. Because sub-words in the literature script are slanted inconsistently, the space between words is not considered: literature writing is tightly compacted and free spaces are fully occupied. No general assumptions can be made about Urdu literature text. For our analysis, however, we make two basic assumptions that hold to a significant degree for our chosen books. First, although the writing is cursive, it is legible and has an inclination of less than 30 degrees to the vertical. Second, the lines in the document are not straight, but two consecutive lines have a sufficient gap to separate them. We therefore employed a depth-first search-based connected components method for ligature extraction. Training a classifier requires labeled ligature classes. Grouping ligatures manually and labeling them according to the text or ligature is expensive in terms of time and effort. Hence, we perform a semi-automatic clustering of the extracted ligatures.
Mis-grouped ligatures in the clustered classes are then corrected through visual inspection so that the clusters are labeled error-free and can serve as training data. We used the sequential dynamic time warping (DTW) algorithm [28] for ligature clustering. The major advantage of this method is that it does not require the number of ligature classes to be known in advance.
Text Line: the dimensions of the covering blocks are demarcated by measuring the intervals of non-empty lines.
Step-3: Ligature clustering by the dynamic time warping method. A ligature is picked at random and taken as the mean of the first cluster. Each of the remaining ligatures is then selected in turn, and its distance to the mean of every cluster is computed using DTW.
Step-4: If the distance to the nearest cluster is below a predefined threshold, the current ligature is added to that cluster and the cluster mean is updated. Otherwise, a new cluster is created with the current ligature as its center.
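The sequential clustering rule in Step-3 and Step-4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that each ligature has already been reduced to a one-dimensional profile (for example, a projection profile) resampled to a common length, so that a running mean can be kept per cluster. The function names and the threshold value are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def sequential_cluster(profiles, threshold):
    """Assign each profile to the nearest cluster mean, or open a new cluster.

    Assumption: all profiles have the same length, so the cluster mean can be
    updated as an element-wise running average.
    """
    means, counts, labels = [], [], []
    for p in profiles:
        p = np.asarray(p, dtype=float)
        dists = [dtw_distance(p, m) for m in means]
        k = int(np.argmin(dists)) if dists else -1
        if k >= 0 and dists[k] < threshold:
            # Step-4, first branch: join the nearest cluster and update its mean.
            means[k] = (means[k] * counts[k] + p) / (counts[k] + 1)
            counts[k] += 1
            labels.append(k)
        else:
            # Step-4, second branch: the profile becomes the center of a new cluster.
            means.append(p)
            counts.append(1)
            labels.append(len(means) - 1)
    return labels

profiles = [
    [0, 0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0, 0],   # same shape, shifted: DTW absorbs the shift
    [5, 5, 5, 5, 5, 5],
    [5, 5, 6, 5, 5, 5],
]
print(sequential_cluster(profiles, threshold=2.0))
```

Because the number of clusters grows only when a ligature is farther than the threshold from every existing mean, the number of classes does not have to be fixed beforehand; the threshold controls how fine the grouping is.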

Image Augmentation
There are numerous approaches to addressing the difficulties caused by limited resources in deep learning. Image augmentation is a valuable technique in building convolutional neural networks because it expands the size of the training set without acquiring new images. The idea is simply to replicate images with variations so that the model can learn from more samples. Augmentation alters an image in a way that preserves its key features but changes the pixels enough to add some perturbation. Text image augmentations are built from simple basic changes such as horizontal flipping, color space augmentation, and random cropping. To emulate the variation observed in Urdu ligatures, we apply the following five augmentation techniques: geometric transformation, addition of uniform noise, image glitch, grid-based distortion, and horizontal and vertical profile normalization. Our image augmentation sequence is described in Algorithm 2, and Fig. 1E shows the results of the proposed augmentation methods for the Urdu text document.

Algorithm 2: Image Augmentation
Step 1. Geometric transformation: Calculate the control points based on the spatial angle of the ligature.
Step 2. Spatially transform the pixels in the ligature image by physical rearrangement.
Step 3. Assign grey levels to the transformed image.
Step 4. Adding uniform noise: Sample the noise pseudo-randomly from a normal distribution with an adjustable variance and a mean of 0.
Step 5. The background noise is independent and identically distributed. Each ligature is augmented with white Gaussian noise with variances of 0.06 and 0.08.
Step 6. Ligature profile normalization: Normalize the ligature image to compensate for variations in size, using the distance between the upper BL and the UDL as well as between the lower BL and the LDL.
Step 7. Grid-based image distortion: Calculate the control point placement interval and the standard deviation by which the control points are randomly displaced.
Step 8. A mapping is first defined between any two corresponding regular grids with minor geometric constraints and is then optimally extended to the interior of the source grid.
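As an illustration of Steps 4 and 5 only, the following Python sketch adds zero-mean white Gaussian noise with the two variances stated above. It assumes greyscale images scaled to the range [0, 1]; the function names and the fixed seed are hypothetical, and the other augmentation steps are not covered.

```python
import numpy as np

def add_gaussian_noise(image, variance, rng):
    """Add i.i.d. zero-mean Gaussian noise and clip back to the [0, 1] range."""
    noise = rng.normal(loc=0.0, scale=np.sqrt(variance), size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def augment_with_noise(image, variances=(0.06, 0.08), seed=0):
    """Return one noisy copy of the image per variance (values from Step 5)."""
    rng = np.random.default_rng(seed)
    return [add_gaussian_noise(image, v, rng) for v in variances]

# Hypothetical stand-in for a ligature image: a mid-grey 32 x 32 patch.
ligature = np.full((32, 32), 0.5)
noisy_copies = augment_with_noise(ligature)
print(len(noisy_copies), noisy_copies[0].shape)
```

Clipping keeps the augmented image in the valid intensity range; with variances this large, a noticeable share of pixels will saturate at 0 or 1, which may or may not match the behavior of the tool the authors used.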

Proposed Network Architecture
In this work, we propose a multi-domain convolutional neural network for Urdu ligature recognition. Our network integrates a spatial-domain CNN and a frequency-domain CNN. The individual architectures of the spatial-domain and frequency-domain CNNs are discussed below.

Proposed Multi-Domain Network
Traditional CNN models have a huge parameter space because they are much denser, and they therefore require more time to process. In our proposed network, the spatial-domain network is built from fire modules, which make the network lightweight and fast so that it can easily be deployed on edge devices. The frequency-domain and spatial-domain networks run in parallel up to their individual fully connected layers. The frequency-domain network takes as input a histogram of the DCT coefficients of the ligature image, and the spatial-domain network takes the raw ligature image. The features produced by the last fully connected layers of both networks are fused to learn the inter-modal relation of ligature images across the two domains. The fused feature vector is fed to the last fully connected layer, which has 512 neurons, followed by a softmax layer that yields the probability of each ligature belonging to the respective classes. Fig. 2 portrays our proposed multi-domain CNN. Algorithm 2 describes the complete process of our proposed system.
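The fusion stage can be sketched in NumPy as follows. The feature sizes (4096 for the spatial branch, 256 for the frequency branch, 512 fused neurons, 4311 classes) are the values given in the text; everything else is an assumption made for illustration: fusion is taken to be concatenation, the fused layer uses a ReLU, an extra linear projection maps the 512 neurons to the class scores, and the weights are random and untrained. All names are hypothetical.

```python
import numpy as np

def init_params(spatial_dim, freq_dim, hidden_dim, num_classes, seed=0):
    """Random, untrained weights for the fusion layers (illustration only)."""
    rng = np.random.default_rng(seed)
    in_dim = spatial_dim + freq_dim
    return {
        "W_fuse": rng.normal(0.0, 0.01, size=(in_dim, hidden_dim)),
        "b_fuse": np.zeros(hidden_dim),
        "W_out": rng.normal(0.0, 0.01, size=(hidden_dim, num_classes)),
        "b_out": np.zeros(num_classes),
    }

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_classify(spatial_feat, freq_feat, params):
    """Concatenate both feature vectors, apply the fused layer, return class probabilities."""
    fused = np.concatenate([spatial_feat, freq_feat], axis=-1)
    hidden = np.maximum(fused @ params["W_fuse"] + params["b_fuse"], 0.0)  # assumed ReLU
    logits = hidden @ params["W_out"] + params["b_out"]                    # assumed projection
    return softmax(logits)

params = init_params(spatial_dim=4096, freq_dim=256, hidden_dim=512, num_classes=4311)
rng = np.random.default_rng(1)
spatial_feat = rng.normal(size=(2, 4096))   # stand-in for spatial-branch outputs
freq_feat = rng.normal(size=(2, 256))       # stand-in for frequency-branch outputs
probs = fuse_and_classify(spatial_feat, freq_feat, params)
print(probs.shape)
```

Concatenation is the simplest fusion choice; it lets the fused layer weight every spatial and frequency feature jointly, which is one way to read "learning the inter-modal relation" described above.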

Spatial Domain CNN
Our spatial-domain network consists of a convolutional block (Conv-1, ReLU-1, and max pool-1) and four fire modules similar to those of SqueezeNet. The network replaces a portion of the 3 × 3 convolution kernels in the CNN with 1 × 1 convolution kernels, decomposes the original single convolution layer into two layers, and wraps them into a fire module. Our fire module, shown in Fig. 3, has a squeeze layer (1 × 1 filters that decrease the number of input channels seen by the 3 × 3 filters), which reduces the size of the feature map, and an expand layer (a combination of 1 × 1 and 3 × 3 filters that reduces the filter size), which increases the gain. The input layer of the network is 227 × 227 × 3; the initial convolution comes with a ReLU activation and is followed by four fire modules (Fire2 to Fire5), and the network ends with three fully connected layers and a softmax layer.
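To see why a fire module is lighter than a plain 3 × 3 convolution, the parameter counts can be compared with a small Python calculation. The channel sizes below (96 input channels, 16 squeeze filters, 64 + 64 expand filters) are taken from the published SqueezeNet Fire2 configuration and are only an assumption here, since the text does not state the channel sizes of the proposed network.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights plus biases in a standard 2-D convolution layer."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def fire_params(in_channels, squeeze, expand_1x1, expand_3x3):
    """Parameters of a fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand."""
    squeeze_layer = conv_params(in_channels, squeeze, 1)
    expand_a = conv_params(squeeze, expand_1x1, 1)
    expand_b = conv_params(squeeze, expand_3x3, 3)
    return squeeze_layer + expand_a + expand_b

fire = fire_params(in_channels=96, squeeze=16, expand_1x1=64, expand_3x3=64)
plain = conv_params(in_channels=96, out_channels=128, kernel_size=3)
print(fire, plain)  # 11920 110720
```

Both layers map 96 input channels to 128 output channels, but under these assumed sizes the fire module needs roughly nine times fewer parameters, because the 3 × 3 filters only see the 16 squeezed channels.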

Frequency Domain CNN
Our frequency-domain CNN has three convolutional blocks (Conv and Maxpool), and all the max pool layers have a kernel of the same size. The three fully connected layers have 256 neurons each. Before feeding the input layer of the frequency-domain network, we first compute the histogram of DCT coefficients of the input ligature image. For each 8 × 8 image block, DCT coefficients are extracted for each spatial frequency (i, j), and the histogram h(i, j) is built from the occurrences of the absolute quantized DCT values. The input layer takes a vector of 784 elements as input for each ligature image. This feature vector is used to train our frequency-domain CNN.
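A minimal NumPy sketch of the histogram computation is given below. The text does not specify the quantization step, the number of histogram bins, or how the histograms are arranged into the 784-element input vector, so the values used here (`q_step=8.0`, `num_bins=12`) are hypothetical and the sketch does not reproduce that exact layout.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct_histograms(image, block=8, q_step=8.0, num_bins=12):
    """Histogram h(i, j) of absolute quantized DCT values for each frequency (i, j).

    q_step and num_bins are illustrative assumptions, not values from the paper.
    """
    h, w = image.shape
    h, w = h - h % block, w - w % block            # drop incomplete border blocks
    c = dct_matrix(block)
    blocks = image[:h, :w].reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3)          # (rows, cols, block, block)
    coeffs = c @ blocks @ c.T                      # 2-D DCT of every block
    quant = np.abs(np.round(coeffs / q_step)).astype(int)
    quant = np.minimum(quant, num_bins - 1)        # last bin collects large values
    hist = np.zeros((block, block, num_bins))
    for i in range(block):
        for j in range(block):
            hist[i, j] = np.bincount(quant[..., i, j].ravel(), minlength=num_bins)
    return hist

flat_image = np.full((16, 16), 64.0)               # constant image: only DC energy
hist = dct_histograms(flat_image)
print(hist.shape)
```

For a constant image every block has a single non-zero coefficient at frequency (0, 0), so all other histograms concentrate in bin 0; textured ligature strokes spread counts into the higher bins of the higher frequencies.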
After defining the layers of our network, the next stage is to set up the training options for the network. We use the trainingOptions function to define the global training parameters.

MiniBatchSize
A mini-batch is a subset of the training set that is used to evaluate the gradient of the loss function and update the weights; the amount of data included in each weight update within an epoch is known as the batch size. While the use of large mini-batches increases the available computational

MaxEpochs
An epoch is a full pass of the training algorithm over the whole training set. Datasets are generally divided into batches (particularly when the amount of data is very large). With dataset size S, number of epochs O, number of iterations I, and batch size C, the overall relation is S × O = I × C.

InitialLearnRate
The amount by which the weights are updated during training is referred to as the "learning rate." Specifically, the learning rate is a configurable hyper-parameter used in the training of neural networks that has a small positive value, typically in the range between 0.0 and 1.0. The learning rate is used to scale the magnitude of parameter updates. The choice of its value affects two things: 1) how fast the algorithm learns and 2) whether the cost function is minimized or not. The default value is 0.01 for the 'sgdm' solver and 0.001 for the 'rmsprop' and 'adam' solvers.
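Both effects of the learning rate can be seen on a toy problem. The Python sketch below runs plain gradient descent on f(w) = w², which is an illustrative stand-in and not the training procedure of the proposed network; the function name and the chosen rates are hypothetical.

```python
def gradient_descent(learning_rate, w0=1.0, steps=50):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                   # derivative of w**2
        w = w - learning_rate * grad     # update scaled by the learning rate
    return w

for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With a rate of 0.001 the weight has barely moved after 50 steps, with 0.1 it is close to the minimum at 0, and with 1.1 each step overshoots and the iterate diverges, so the cost is never minimized.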

Shuffle
This option shuffles the training data before each training epoch and shuffles the validation data before each network validation. If the mini-batch size does not evenly divide the number of training samples, trainNetwork discards the training data that does not fit into the final complete mini-batch of each epoch. To avoid discarding the same data every epoch, we set the 'Shuffle' value to 'every-epoch'.
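The interaction between shuffling and the discarded remainder can be sketched in Python. This is a NumPy illustration of the behavior described above, not the MATLAB routine itself, and the function name is hypothetical.

```python
import numpy as np

def epoch_batches(num_samples, batch_size, rng):
    """Shuffle sample indices, keep only complete mini-batches, report the rest."""
    order = rng.permutation(num_samples)
    usable = (num_samples // batch_size) * batch_size
    batches = order[:usable].reshape(-1, batch_size)
    dropped = order[usable:]             # samples that do not fill a complete batch
    return batches, dropped

rng = np.random.default_rng(0)
for epoch in range(3):
    batches, dropped = epoch_batches(num_samples=10, batch_size=4, rng=rng)
    print(epoch, batches.shape, sorted(dropped.tolist()))
```

Because the order is drawn anew for every epoch, the leftover samples generally differ from one epoch to the next, so no fixed subset of the training data is permanently excluded.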

Verbose
Indicator to display training progress information.

Step-1: Load the image datastore and partition it with 'splitEachLabel' in the required ratio.
(i) Input to the frequency-domain network: histogram of DCT coefficients of 8 × 8 blocks, with spatial frequency (i, j) and histogram h(i, j).
Step-2: Load the proposed network into 'NET'.
Step-3: Find the layers to transfer learn by layer transfer: fullyConnectedLayer, softmax layer, classification layer.
Step-4: Freeze the weights and biases of the initial layers and change them for the new layers: WeightLearnRateFactor, 50, BiasLearnRateFactor, 50.
Step-6: Initialize training of the NET with the provided dataset and training options.
Step-7: The output of L7 Fc (spatial, 4096) and the output of L6 (frequency, 256) are obtained and fed to the last fully connected layer, which has 512 neurons.
Step-8: Finally, the softmax layer classifies the ligatures into 4311 classes. Accuracy = mean(YPred == YValidation).

Experimental Analysis
Our proposed network is trained on our generated dataset of Urdu literature documents. We used a Zotac NVIDIA 1060 graphics card (CUDA v10.0) with an Intel Core i7 processor operating at 3.60 GHz for faster response. For testing, we interfaced a smartphone with the system using the IP Webcam app, which turns the phone into a network camera with several viewing options and streams video over Wi-Fi without an internet connection. Our system reads the image through the smartphone's IP gateway and resizes it appropriately before feeding it to the input layers of our network for classification. We performed four rounds of tests. In TEST-1, we self-evaluated the generated ligatures in different ratios. In TEST-2, we tested and validated ligatures from the remaining three volumes of the same book. In TEST-3 (cross-validation), our trained network was validated on ligatures from ancient Urdu literature books that were not used in training. In TEST-4, we tested our trained network on formal Urdu handwritten documents and other benchmark Urdu datasets.
TEST-1: The first test involves training our network with the prepared dataset and self-validating it. The training comprises learning more than 4000 ligature classes with around 150 images each. Tab. 3 shows that our proposed method achieved an average recognition accuracy of 97.8% with the augmented dataset and 92% with the raw dataset. This indicates that our proposed network and the augmentation techniques are well suited to Urdu ligature classification.
In TEST-2, we evaluated our network by testing and validating the ligatures of the remaining volumes of the same book. Tab. 4 shows the average recognition accuracy associated with HFL, LFL, and UL. It can be observed from Tab. 4 that the ligatures of volume 1 recur in the succeeding volumes: nearly 95% of the ligatures are repeated in the remaining volumes of the same literature book. Fig. 4 shows the accuracy achieved for ligatures (grouped by number of characters) in all four volumes with the raw and augmented datasets. The smallest ligatures (1 and 2 characters) and the largest ligatures (7 characters) achieve better accuracy than the other ligatures.
In TEST-4, we tested our trained network on formal Urdu handwritten documents obtained from the exam answer papers of 50 students from local Arabic colleges. We also evaluated our trained network on benchmark Urdu datasets such as UNHD, UPTI, EMILLE, and WordNet. Tab. 6 reveals that the ligatures of formal documents are not compatible with experiments on ancient Urdu literature documents.
The training time of a DNN depends primarily on the network structure, the problem domain, the computational resources, and the model parameters. However, the trained models tend to improve significantly on the training error in the problem. As our proposed network has two independent networks in different domains, training on our selected datasets took quite a long time, but the recognition time of our system is very fast compared with the other networks in this domain, as shown in Tab. 7. Tab. 8 shows that our proposed method performs very well against the state of the art in Urdu ligature classification, with a consistent average accuracy of 97.8%. From these comparisons and the evidence above, we conclude that our proposed network with the data augmentation strategies is well suited to the recognition of ancient Urdu literature documents. The results also demonstrate that well-known available datasets are not appropriate for the recognition of ancient Urdu literature, which is verified with our prepared dataset.

Conclusion
In this paper, we addressed the challenging problem of text recognition in ancient Urdu literature documents. One of the key advantages of ancient Urdu literature is that it comes in multiple volumes and more than 80% of the text (words) in the first book recurs in the succeeding volumes. We therefore developed a dataset from the first volume of an Urdu literature book and trained a network to recognize the remaining volumes. For this purpose, we proposed a multi-domain deep CNN that integrates a spatial-domain CNN and a frequency-domain CNN. Our network is lightweight and fast, and it learns the modular relations between features originating from the spatial-domain and frequency-domain networks. The experimental results show that well-known benchmark datasets are not appropriate for the recognition of ancient Urdu literature, which is verified with our prepared dataset, and that our proposed network is well suited to the recognition of ancient Urdu literature documents. The results indicate that our approach can be effective for building a language model that learns Urdu ligatures more simply and compactly. We also conclude that the use of multi-domain network features, which are robust and more significant, enables a higher accuracy gain in the classification of Urdu ligatures. Integrating multi-domain hierarchical features with multi-domain handcrafted features to further improve classification performance is left for future work. The proposed system is a step toward developing an ideal character recognition system for the Urdu script.