Convolutional Neural Network Based Intelligent Handwritten Document Recognition

Abstract: This paper presents a handwritten document recognition system based on a convolutional neural network (CNN). Handwritten document recognition is rapidly attracting the attention of researchers because of its promise as an assistive technology for visually impaired users. The technology is also useful for automatic data entry systems. For the proposed system, a dataset of English-language handwritten character images was prepared. The system was trained on a large set of sample data and tested on sample images of user-defined handwritten documents. The system first performs image pre-processing to prepare the data for training with the CNN. The input document is then segmented using line, word and character segmentation; character segmentation reaches an accuracy of up to 86%. The segmented characters are then sent to the CNN for recognition. Across multiple experiments, the proposed recognition and segmentation technique gives acceptably accurate results on the given dataset: accuracy reaches up to 93% during CNN training and decreases slightly, to 90.42%, during validation.


Introduction
Character recognition is a field in which many machine learning techniques are widely applied. The world is moving towards paperless communication, but there are many fields in which handwritten documents are still shared daily. A significant challenge in handwritten document recognition is the processing of distorted character shapes and varied writing styles. Secondly, character-by-character processing requires proper segmentation techniques. Handwritten words pose several further difficulties. The handwritten characters may not be sharp, and they may not be written in a perfectly straight line. The curves of the characters may not be as consistently smooth as those of printed characters. Different orientations and sizes of handwritten characters can also cause problems in processing. Finally, characters may not always be complete in shape, so they can fall into the wrong category and be misrecognized.
In the last few years, deep learning techniques have been applied successfully in many fields, such as speech recognition, image and text classification, face and facial expression recognition, semantic-based video search and many other areas. Many previously studied problems have been revisited with deep learning to obtain significant results. The proposed method addresses the issues described above by developing an intelligent and efficient handwritten script recognition system. Such systems always demand extensive image processing and pattern recognition techniques. This study uses a deep convolutional neural network to recognize handwritten scripts. The deep convolutional neural network consists of many hidden layers, and these layers comprise many neurons. An extensive dataset of 65000 handwritten character images is first prepared to train and test the proposed approach. A deep convolutional neural network is then trained on this dataset, and a separate testing phase is applied to handwritten scripts to check the recognition accuracy. When a handwritten script is given to the recognition system, it requires proper segmentation of the written lines, words and characters. The proposed approach therefore develops three kinds of segmentation algorithms: line-based, word-based and character-based segmentation of handwritten words. These segmentation techniques separate each character of the script, which is later recognized by the deep convolutional neural network. The recognized handwritten script is immediately converted into an electronic text document. The proposed technique contributes to the field of handwritten document recognition systems.

Related Work
Handwriting recognition has been under investigation since the 1950s, when the application of new digital computer technology to it became a subject of interest. In 1968, Eden suggested the technique known as analysis-by-synthesis. In this method, the author formally proved that all characters consist of a finite number of schematic features. Many researchers have since added valuable contributions to this field.
The work by Grimsdale and Bullingham describes how the process of handwriting recognition can be simplified and sped up. In that paper, the flying-spot technology uses a high-resolution scanner whose spot of light reads or scans an image [1].
In [2], the authors proposed a recognition system whose feature extraction phase aims to describe the pattern with a minimum number of features that discriminate between pattern classes. The paper uses a gradient representation to measure the direction and magnitude of the largest change in intensity in a small neighborhood of every pixel. The gradients are computed with the Sobel operator, a technique mainly used for edge detection, which creates an image emphasizing edges. The recognition accuracy obtained for English characters is 94%. The logical simplicity and ease of use of gradient features explain the technique's popularity for recognition purposes [3].
In [4], the authors proposed handwriting recognition using fuzzy theory. The method consists of two main phases, pattern recognition and feature extraction. The fuzzy technique is first used in pattern recognition to obtain the fuzzy patterns. The problem this method faces with handwritten characters is that every character has a different shape, size and position because of different writing styles [5].
The approach in [6] describes the effects of changing the Artificial Neural Network model used to recognize the characters in the input document. The paper explains the behavior of different neural network models used in Optical Character Recognition. Several parameters are considered, such as the number of hidden layers, the number of neurons in each layer and the number of epochs. The authors use a multilayer feed-forward and backward network to recognize the characters. Their work consists of phases such as pre-processing, character segmentation, normalization and de-skewing [7].
In [8], the authors propose a new technique, 'diagonal based feature extraction', to recognize handwritten alphabets. A multilayer feed-forward neural network is used for this purpose. The system achieves a higher level of accuracy than conventional horizontal or vertical feature extraction [8].
In [9], the authors develop a system using MATLAB. It acquires the image and converts it into a greyscale image. This pre-processed image is then used for recognition. A multilayer perceptron (MLP) neural network recognizes the characters. An MLP is trained with the backpropagation technique, and each neuron of every layer is fully connected to the neurons of the next layer. Every node except the input nodes works as an individual neuron. This decreases the training time and cost [9].
Fig. 1 outlines the proposed method, which consists of multiple sub-phases. The image acquisition phase first obtains the written script from the environment. The image is then passed to the pre-processing stage for grey scaling, binarization and skew correction. The handwritten script is next sent to the stages for line, word and character segmentation. The properly segmented characters are sent to the feature extraction phase. Finally, the configured CNN layers perform the training phase, after which each character is classified into one of 26 character classes.

Image Acquisition
Image acquisition is the primary phase of a handwritten character recognition system. The proposed recognition system takes a scanned handwritten character image as input. To start the process, the user uploads the image of the handwriting.

Pre-Processing
Pre-processing is used to increase the quality of the image. It involves the following operations.

Grayscale Conversion
The original image is first converted into a greyscale image [10]. This step converts the 24-bit RGB values into 8-bit greyscale values.
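The conversion can be illustrated with a short sketch. The paper does not state which weighting is used, so the common luminance weights 0.299, 0.587 and 0.114 are an assumption here, and the function name `rgb_to_gray` is hypothetical.

```python
def rgb_to_gray(image):
    """Convert rows of (R, G, B) tuples (0-255 each) to 8-bit greyscale values.

    Assumption: standard luminance weights; the paper does not give its weights.
    """
    gray = []
    for row in image:
        gray.append([
            # Weighted sum of the three channels, rounded and capped at 255.
            min(255, round(0.299 * r + 0.587 * g + 0.114 * b))
            for r, g, b in row
        ])
    return gray


# A 1 x 3 image: pure red, pure white, pure black.
sample = [[(255, 0, 0), (255, 255, 255), (0, 0, 0)]]
print(rgb_to_gray(sample))  # [[76, 255, 0]]
```

Each pixel shrinks from three 8-bit channels (24 bits) to a single 8-bit intensity, which is the reduction described above.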

Noise Removal
Noise is unwanted information that degrades the quality of the image: pixels in the image have intensity values that differ from the actual pixel values. In image processing, filters are used to eliminate noise from the input image. The proposed method uses a median filter, an efficient technique for removing salt-and-pepper noise [10]. The median filter replaces the current pixel value with the median value of its neighborhood. It first sorts the values of the pixel and its adjacent pixels in numerical order and then replaces the pixel under consideration with the middle value.
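A minimal sketch of median filtering follows. The 3 x 3 window and the edge handling (border pixels are repeated) are assumptions, since the paper does not specify either, and `median_filter3` is a hypothetical name.

```python
def median_filter3(image):
    """Apply a 3 x 3 median filter to a greyscale image given as a list of rows.

    Assumption: coordinates outside the image are clamped to the nearest edge.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    window.append(image[yy][xx])
            window.sort()
            out[y][x] = window[4]  # middle of the 9 sorted values
    return out


# A single "salt" pixel (255) in a uniform region is removed.
noisy = [[10, 10, 10],
         [10, 255, 10],
         [10, 10, 10]]
print(median_filter3(noisy))  # [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
```

Because an isolated outlier never reaches the middle of the sorted window, it is replaced by a typical neighboring value rather than being averaged into its surroundings.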

Binarization
Binarization is part of pre-processing. It converts the greyscale image into a black-and-white image [10]. In the proposed system, binarizing the image also inverts the pixel values. Black is represented by bit 0 and white by bit 1.
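As an illustration, binarization with inversion can be sketched as below. The fixed threshold of 128 is an assumption (the paper does not state how its threshold is chosen), and `binarize_inverted` is a hypothetical name.

```python
def binarize_inverted(gray, threshold=128):
    """Turn a greyscale image into a binary image with inverted polarity.

    Dark ink (value below the threshold) becomes 1 (white foreground) and
    light paper becomes 0 (black background).
    Assumption: a fixed global threshold of 128.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


gray = [[0, 200],
        [127, 128]]
print(binarize_inverted(gray))  # [[1, 0], [1, 0]]
```

The inverted polarity matches the segmentation stages, which expect white foreground pixels on a dark background.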

Segmentation
Segmentation of characters is a crucial step in handwriting recognition because it directly affects the accuracy of the system [11]. The recognition accuracy improves when the characters are correctly segmented. In the proposed work, the segmentation process is divided into three phases: line segmentation, word segmentation and character segmentation.

Line Segmentation
In line segmentation, the input image has dark background pixels and white foreground pixels. Text may touch the top and bottom borders of the image, so the image is first padded to create black space at the top and bottom. This padding helps to calculate the dark centroids in the image. The proposed line segmentation method is based on the projection idea of Algorithm 2.1, the "Projection Profile-Based Algorithm" presented in [11], with some amendments. In that study, the horizontal projection of the image is found, and the maxima and minima of the projection are then used for further processing. In the proposed method, after the horizontal projection is found, the non-text regions are marked. The centroids of those dark regions are then found, and these dark centroids are taken as the segmentation points for lines. This algorithm does not work well for images with skew angles, as shown in Fig. 2a. A skew angle is the clockwise or anti-clockwise orientation of the text baseline with respect to the horizontal frame [12]. Skew can be of the following types: (1) Negative skew: the text baseline runs from the bottom left to the top right, in an upward direction. (2) Zero skew: the text baseline is parallel to the horizontal frame. For skew correction, the algorithm "Skew Detection using Center of Gravity" presented in [13] is used with some modifications. As originally applied, the algorithm leaves a slight skew angle uncorrected, as shown in Fig. 2b. In the proposed method, whenever the line's slope is calculated, it is also checked that the line is straight or approximately straight. The result of this method is shown in Fig. 2c. The proposed Modified Skew Detection using Center of Gravity algorithm is given in Algorithm #1.
Algorithm #1: Modified skew detection using center of gravity
1 Divide the text image into two equal halves vertically.
2 For both the left and right halves, compute the center of gravity.
3 Join the two centers of gravity by a line.
4 Compute the slope of the line.
5 If the value of the slope is between −0.1 and 0.1, then the skew is corrected.
6 Else
(a) Calculate the skew angle by taking the inverse tangent of the slope.
(b) Rotate the image by the detected skew angle.
(c) Repeat steps 1 to 6 until the condition at step 5 becomes true.
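The detection part of Algorithm #1 (steps 1 to 5 and the angle computation) can be sketched as follows. This is only an illustration under stated assumptions: the image is a binary list of rows with foreground pixels equal to 1, a slope within ±0.1 is treated as the stopping condition, the rotation step is left out, and the names `center_of_gravity` and `estimate_skew` are hypothetical.

```python
import math


def center_of_gravity(image, x_start, x_end):
    """Mean (x, y) position of foreground pixels in columns [x_start, x_end)."""
    count = sum_x = sum_y = 0
    for y, row in enumerate(image):
        for x in range(x_start, x_end):
            if row[x]:
                count += 1
                sum_x += x
                sum_y += y
    if count == 0:
        return None  # no ink in this half
    return sum_x / count, sum_y / count


def estimate_skew(image, tolerance=0.1):
    """Return (slope, angle_in_degrees, is_corrected) for a binary text image.

    Rows grow downward, so a positive slope means the baseline descends
    towards the right. Rotation by the detected angle is not performed here.
    """
    width = len(image[0])
    mid = width // 2
    left = center_of_gravity(image, 0, mid)
    right = center_of_gravity(image, mid, width)
    if left is None or right is None:
        return 0.0, 0.0, True  # assumption: nothing to correct without ink
    slope = (right[1] - left[1]) / (right[0] - left[0])
    angle = math.degrees(math.atan(slope))
    return slope, angle, abs(slope) <= tolerance


# Left half has ink on row 0, right half on row 3: a clearly skewed line.
skewed = [[1, 1, 1, 1, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 1, 1, 1, 1]]
slope, angle, ok = estimate_skew(skewed)
print(slope, round(angle, 2), ok)  # 0.75 36.87 False
```

In a full implementation, the image would be rotated by the detected angle and the estimate repeated until the slope falls inside the tolerance.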

Word Segmentation
Line segmentation is followed by the segmentation of words. In word segmentation, morphology is applied to the image [11]. The image is dilated, which connects nearby parts of characters [14]. Dilation is a primitive morphological operation that grows or thickens objects in a binary image. A structuring element, shown in Fig. 4a, controls the precise manner and extent of this thickening. Structuring elements are small sets or sub-images used to probe the image under consideration for features of interest. Dilation is mainly used with line segmentation, but the proposed work uses it for word segmentation. When the image is dilated, the characters within each word become connected, and the connected components can then be easily extracted from the image. Extracting the connected components requires labeling them, for which the proposed work uses the MATLAB bwlabel function. The labeled regions are cropped from the line image, and as a result the words are extracted. Word segmentation is done through Algorithm #3.
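The idea of connecting characters by dilation and then extracting the connected groups can be sketched as below. This is a simplification, not the paper's Algorithm #3: it assumes a single text line, uses a horizontal structuring element of width 2 * radius + 1, and replaces general connected-component labeling with runs of occupied columns. The function names are hypothetical.

```python
def dilate_horizontal(image, radius):
    """Dilate a binary image with a horizontal line of width 2 * radius + 1."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - radius), min(w, x + radius + 1)
            # A pixel is set if any pixel under the structuring element is set.
            out[y][x] = 1 if any(image[y][lo:hi]) else 0
    return out


def column_runs(image):
    """Return (first_column, last_column) for each run of non-empty columns."""
    w = len(image[0])
    occupied = [any(row[x] for row in image) for x in range(w)]
    runs, start = [], None
    for x, on in enumerate(occupied):
        if on and start is None:
            start = x
        elif not on and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, w - 1))
    return runs


def segment_words(line_image, radius=1):
    """Column ranges of words, measured on the dilated line image."""
    return column_runs(dilate_horizontal(line_image, radius))


# Two "words", each made of two strokes one pixel apart, four pixels between words.
line = [[1, 0, 1, 0, 0, 0, 0, 1, 0, 1]]
print(segment_words(line, radius=0))  # [(0, 0), (2, 2), (7, 7), (9, 9)]
print(segment_words(line, radius=1))  # [(0, 3), (6, 9)]
```

Without dilation every stroke is a separate component; with a small radius the strokes of each word merge while the wider gap between words survives. The radius therefore has to be smaller than the word gap and at least as large as the gaps between characters.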

Character Segmentation
In handwritten documents, a negative or positive slant sometimes occurs in the written words, which calls for slant correction [15]. Slant correction helps to perform character segmentation. Handwritten characters can be cursive or untouched, so two kinds of segmentation are considered:
(1) Cursive character segmentation
(2) Untouched character segmentation
Cursive Character Segmentation: In cursive segmentation, the characters are segmented from the word image of cursive handwriting. The main challenge is to avoid mis-segmentation and over-segmentation. Mis-segmentation means that characters that should have been separated are not segmented properly. Over-segmentation means that a single character is segmented as two characters; for example, 'm' can be segmented as two n's, and 'w' can be segmented as two v's. This problem can be avoided through the combination of algorithms presented in [16].
In the first step, the word image is skeletonized, and the vertical projection of the word is calculated by summing each column. The vertical projection can locate ligatures between characters because a ligature has only one foreground pixel in its column. Over-segmentation can sometimes occur in characters such as 'm', 'n', 'u', 'v' and 'w'. For 'm' and 'n', it is avoided by calculating the midpoint of the image height. The image is scanned vertically, column by column; when a white pixel is found, its position is determined. If the pixel lies below the midpoint of the height, the column may be a segmentation column. If the sum of that column is greater than one, the column is discarded. Otherwise, it may be a joining point and is stored as a potential segmentation column. For 'u', 'v' and 'w', over-segmentation is avoided through a distance-based approach: if the distance between the column and the previous segmentation column is less than a given threshold, the column is discarded, and the process is repeated for the next columns. Fig. 5a presents the slanted words and Fig. 5b shows the slant correction, whereas Fig. 6 illustrates the working of Algorithm #4.
4 Store the first Segmentation Point SP(1) at the start of the first character.
5 Find the midpoint of the height of the image by finding the midpoint of the number of rows in the image.
6 Repeat for each column i starting from the 1st SP
6.1 Repeat for each row j starting from row 1
7 For each SP, draw a red line on the word image to visualize the segmentation points.
8 Store 0 in all rows of all Segmentation Points SP.
9 Now all the touched characters are disconnected.
10 For the segmentation of these disconnected characters, the algorithm is given in the following section.
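The column-selection rules described above can be sketched as follows. This is an illustration under assumptions rather than the paper's Algorithm #4: the input is an already skeletonized binary word image (foreground equal to 1), the distance threshold `min_gap` is a hypothetical parameter whose value the paper does not give, and the first segmentation point at the start of the first character is not handled.

```python
def candidate_columns(skeleton, min_gap=3):
    """Return column indices that are plausible cut points in a cursive word.

    A column qualifies when it holds exactly one foreground pixel (a ligature),
    that pixel lies below the vertical midpoint of the image, and the column is
    at least min_gap columns away from the previously accepted column.
    """
    h, w = len(skeleton), len(skeleton[0])
    mid = (h - 1) / 2  # vertical midpoint; larger row index means lower in the image
    points = []
    for x in range(w):
        rows = [y for y in range(h) if skeleton[y][x]]
        if len(rows) != 1:
            continue  # empty column, or more than one stroke crosses it
        if rows[0] <= mid:
            continue  # single pixel in the upper half, e.g. an arch of 'm' or 'n'
        if points and x - points[-1] < min_gap:
            continue  # too close to the previous cut, e.g. inside 'u', 'v' or 'w'
        points.append(x)
    return points


skeleton = [[0, 0, 0, 1, 0, 0, 1, 0],
            [1, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0],
            [1, 0, 0, 0, 1, 0, 0, 0],
            [0, 1, 1, 0, 0, 0, 1, 1]]
print(candidate_columns(skeleton, min_gap=3))  # [1, 4, 7]
print(candidate_columns(skeleton, min_gap=1))  # [1, 2, 4, 7]
```

Column 3 is rejected because its only pixel lies in the upper half, columns 0 and 6 because two strokes cross them, and column 2 only when the distance threshold is active.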
Untouched Character Segmentation: Segmenting untouched characters is much easier than segmenting touched characters. The line segmentation method is reused for this process: it finds the black spaces between characters and separates the characters there. In this type of character segmentation, the vertical projection is calculated instead of the horizontal projection.

Algorithm #5: Untouched character segmentation
1 Pad the left and right sides of the image.
2 Find the vertical projection by summing the image vertically, i.e., find the sum of each column, as shown in Fig. 7b.
3 Find the dark lanes that define the zones between the letters by setting a threshold value, i.e., 0 for a binary image.
4 Label each dark region using the bwlabel function of MATLAB.
5 Find the number of dark regions.
6 Find the centroids of those dark regions using the regionprops function of MATLAB.
7 Store all centroids in an array.
8 Find all x-centroids and y-centroids.
9 Crop the regions of text by assuming a region of text between 2 consecutive x-centroids.
10 Store all the cropped regions in a separate array. Each character is now segmented properly, as shown in Fig. 7c.
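The steps of Algorithm #5 can be sketched in plain Python as below. The sketch stands in for the MATLAB bwlabel and regionprops calls with a simple scan over runs of empty columns, uses integer x-centroids, and assumes a padding width of 2; the name `segment_untouched` and these choices are assumptions made for illustration.

```python
def segment_untouched(word, pad=2):
    """Split a binary word image into characters at the centers of dark lanes.

    Returns (cuts, pieces): the x-centroids of the dark regions in the padded
    image and the sub-images cropped between consecutive centroids.
    """
    # Step 1: pad the left and right sides with background columns.
    padded = [[0] * pad + list(row) + [0] * pad for row in word]
    w = len(padded[0])
    # Step 2: vertical projection, i.e. the sum of each column.
    projection = [sum(row[x] for row in padded) for x in range(w)]
    # Steps 3 to 8: find runs of columns whose sum is 0 and take their centers.
    cuts, start = [], None
    for x in range(w + 1):
        dark = x < w and projection[x] == 0
        if dark and start is None:
            start = x
        elif not dark and start is not None:
            cuts.append((start + x - 1) // 2)  # integer x-centroid of the run
            start = None
    # Steps 9 and 10: crop between consecutive centroids and store the pieces.
    pieces = [[row[a:b] for row in padded] for a, b in zip(cuts, cuts[1:])]
    return cuts, pieces


word = [[1, 1, 0, 0, 1],
        [1, 0, 0, 0, 1]]
cuts, pieces = segment_untouched(word)
print(cuts)         # [0, 4, 7]
print(len(pieces))  # 2
```

Because every cut falls on a column with no ink, cropping between cuts never removes foreground pixels. The same scan applied to row sums instead of column sums gives the dark-centroid line segmentation described earlier.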

Feature Extraction
Feature extraction is the process of detecting the features of interest in an image and storing them for further processing. In image processing, feature extraction is a critical step that allows moving from a pictorial to a data representation. The proposed work uses a Convolutional Neural Network for feature extraction [17].

Creating a Convolutional Neural Network
The proposed method uses a convolutional neural network (CNN). A CNN contains neurons with learnable weights and biases. A CNN can contain many layers, and such multi-layer networks are what is known as deep learning. A CNN is a feed-forward neural network with one or more convolutional layers, followed by one or more fully connected layers as in a simple multilayer neural network. The architecture of a CNN is designed to use the 2D structure of the image as the input of the neural network. The proposed configuration uses the back-propagation technique to train the CNN. A traditional CNN consists of the layers shown in Fig. 8, whereas Fig. 9 shows the layering scheme of the configured CNN in detail.

Training of Convolutional Neural Network
Once configured, the neural network must be trained. For this step, the overall dataset is split into two sections, training data and validation data. The sample data for training consist of handwritten images in different writing styles. A dataset of 65000 handwritten characters was prepared for the training and validation of the neural network. There are 2500 sample images of each letter of the alphabet, for a total of 2500 * 26 = 65000 sample images. Each sample image is a grey-scale image with a white foreground and a black background. The neural network is trained with an initial learning rate of 0.005. The maximum training accuracy of the CNN is recorded as 97%. The trained neural network is saved and then used for the recognition process.

Document Recognition
For image recognition, the neural network comprises four primary operations defined in the next sections.

Convolution:
The CNN uses a convolution operator to extract features from an input image. Convolution preserves the spatial association between pixels by learning image features from sub-matrices of the input image. Every image can be expressed as a matrix of pixel values. For example, suppose an image matrix of size 5 * 5 consisting of binary values and a filter matrix of size 3 * 3. The convolution operator places the filter matrix over the image matrix. Every value of the filter matrix is multiplied by the corresponding value of the image matrix, and the products are summed to give one element of the output; the filter matrix is then moved by one pixel, and the same step is repeated. The number of pixels moved in each step is called the stride; the proposed work uses a stride of one. The resulting matrix is the convolution matrix.
The output of the convolution depends on the following parameters:
(1) Depth: the number of filters.
(2) Stride: the number of pixels by which the filter matrix slides over the image.
(3) Zero-padding: zeros are sometimes padded around the border so that the filter can be applied to the border elements of the input image matrix. This is also known as wide convolution.
In the convolutional layer, the convolution (•) is the dot product between the input image M and the filter matrix N. The output of the whole process is the convolved feature matrix, denoted i_con, which can be calculated using Eqs. (1) and (2).
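The sliding dot product can be illustrated with the 5 * 5 binary image and 3 * 3 filter mentioned in the text. The particular matrix values below are made up for the example, no zero-padding is applied, and `convolve2d` is a hypothetical name. As is usual in CNN layers, the filter is not flipped.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products.

    No zero-padding is used, so the output is smaller than the input.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for y in range(0, ih - kh + 1, stride):
        row = []
        for x in range(0, iw - kw + 1, stride):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output


image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
print(convolve2d(image, kernel))  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

With a stride of one and no padding, a 5 * 5 input and a 3 * 3 filter give a 3 * 3 output, since the filter fits in (5 - 3) + 1 = 3 positions along each axis.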
Non-Linearity (ReLU): ReLU stands for Rectified Linear Unit; it introduces non-linearity. ReLU is an element-wise operation: it is applied to each pixel and replaces all negative pixel values in the extracted feature maps with zero, so only non-negative values remain in a feature map.
The mathematical model of the ReLU function is a piecewise non-linear operator that outputs the maximum of zero and its input. The ReLU function is represented as ReLU(•). Its output is the rectified feature map, i_rec, which can be calculated using Eq. (3).
Pooling Step: Pooling is a sub-sampling or downsampling process. It reduces the dimensionality of the feature maps while preserving the essential information. This study uses max pooling for spatial pooling. In max pooling, a 2 * 2 spatial window is defined, and the maximum value within that window is selected.
Mathematically, the max-pooling function Pool(•) can be defined as shown in Eq. (4).
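Both operations are simple enough to show on a small feature map. In this sketch the 2 * 2 window follows the text, while the stride of 2 (non-overlapping windows) is an assumption, and the function names are hypothetical.

```python
def relu(feature_map):
    """Replace every negative value with zero, element by element."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value of each size x size window.

    Assumption: windows do not overlap (stride equal to the window size).
    """
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[y + i][x + j]
                 for i in range(size) for j in range(size))
             for x in range(0, w - size + 1, stride)]
            for y in range(0, h - size + 1, stride)]


feature_map = [[1, -2, 3, 0],
               [-1, 5, -6, 2],
               [0, -3, 4, -4],
               [7, 1, -1, -8]]
rectified = relu(feature_map)
print(rectified)            # [[1, 0, 3, 0], [0, 5, 0, 2], [0, 0, 4, 0], [7, 1, 0, 0]]
print(max_pool(rectified))  # [[5, 3], [7, 4]]
```

The 4 * 4 map is reduced to 2 * 2 while the strongest response in each window is kept.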
Fully Connected Layer: The fully connected layer is used for classification. Fully connected means that every neuron of a layer is connected to every neuron of the next layer. The output of the convolution and pooling layers is fed to the fully connected layers, which perform the classification.
Softmax Layer: A softmax layer trained on the handwritten alphabets outputs a separate probability for each of the 26 letters, and the probabilities add up to one. The softmax activation function in the output layer of a deep neural network thus expresses a categorical distribution over the class labels, giving the probability that an input belongs to each label.
The softmax function, Softmax(•), is a multiclass classifier. The ith probabilistic output of the function can be calculated using Eq. (5).
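For illustration, the standard softmax definition, in which the ith output is exp(z_i) divided by the sum of exp(z_j) over all classes, can be sketched as follows. The scores used here are made-up values, and mapping class index 0 to 'a' is an assumption about label order.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for the 26 letter classes: class 2 has the highest score.
scores = [0.0] * 26
scores[2] = 4.0
probabilities = softmax(scores)
best = max(range(26), key=lambda i: probabilities[i])
print(chr(ord('a') + best))  # c
```

The classification layer then reports the label with the highest probability.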
Classification Layer: The classification layer classifies the output obtained from the softmax layer.

Obtained Results
MATLAB 2017b was used to measure the overall accuracy of the proposed work on the dataset. In this study, the accuracy of the proposed method and dataset is demonstrated using the CNN algorithm.

Application View
The application view is shown in Fig. 10. In the first stage, the user uploads the image for the image acquisition process. The pre-processing stage then performs all the pre-processing phases defined in Section 3.2. After pre-processing, the user chooses the segmentation type (cursive or untouched) according to the input image. Next, the system performs all the segmentation phases defined in Section 3.3. The CNN is then configured automatically for the recognition of the segmented characters. On completion of the recognition phase, the recognized characters are stored in a simple notepad file for verification of the system.
The simulated system consists of seven kinds of layers in addition to the input layer: convolutional, batch normalization, ReLU, max-pooling, fully connected, softmax and classification layers. There are three convolutional layers: the first comprises 64 filters, the second 56 filters and the last 40 filters. The system thus has three batch normalization layers, three ReLU layers, three max-pooling layers, one fully connected layer, one softmax layer and one classification layer. The max-pooling layers have a kernel size of 2 * 2.
Fig. 11 shows that the system's accuracy improves for cursive handwriting when each character is skew-corrected and evenly spaced. Tab. 1 shows the overall accuracy of the segmentation phase of the system for each character. Fig. 12 shows the system performance during uncorrected cursive segmentation: the system segments the characters incorrectly if they are too skewed and unevenly spaced. Tab. 2 gives the average accuracy of the system for the recognition of each character in the validation step. The accuracy of the system depends on the writing style of the writer.
If the writing style is untouched, the system performs much better than the performance reported in Tab. 2. An English script is given to the system for validation; the script can be in cursive or untouched writing style. The given image is first passed through pre-processing. After segmentation, each character is sent separately to the system for verification.

Discussion
The average accuracy of the proposed solution reached 90%, which is an excellent result for handwritten document recognition on the given dataset. This study will be helpful for the design and development of handwritten Optical Character Recognition systems in the future. The recognition accuracy of previous studies and of the proposed method is compared in Tab. 3.
In this study, the essential contributions of the authors are:
• Designing a valid dataset that can be used to train systems efficiently for both printed and handwritten documents.
• Designing new algorithms for line, word and character segmentation of cursive and non-cursive handwriting.
• Finding every possible writing style for every letter of the English alphabet, including its joining style with other letters.

Conclusion
This paper proposed a Convolutional Neural Network and different segmentation approaches for the recognition of handwritten English documents. The proposed technique was trained and tested on a user-defined dataset prepared for the proposed system. The experimental results show that the proposed technique provides a high accuracy rate. The proposed system currently achieves 90.42% accuracy in the validation phase. The decrease in accuracy relative to training is due to many factors, such as distorted strokes, varying character sizes and thicknesses, different writing styles and the illumination of the writing. In future, the accuracy can be improved further by modifying the line, word and character segmentation techniques and by including more intermediate layers and filters in the convolutional neural network.