Computers, Materials & Continua
Denoising Letter Images from Scanned Invoices Using Stacked Autoencoders
1Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, 84428, Saudi Arabia
2Department of Electronics, College of Engineering Chengannur, Kerala Technological University, Chengannur, 689121, India
*Corresponding Author: Samah Ibrahim Alshathri. Email: email@example.com
Received: 08 August 2021; Accepted: 09 September 2021
Abstract: Invoice document digitization is crucial for efficient management in industries. Scanned invoice images are often noisy for various reasons, which reduces OCR (optical character recognition) detection accuracy. In this paper, letter data obtained from images of invoices are denoised using a modified autoencoder-based deep learning method. A stacked denoising autoencoder (SDAE) is implemented with two hidden layers each in the encoder network and the decoder network. In order to capture the most salient features of the training samples, an undercomplete autoencoder is designed with non-linear encoder and decoder functions. This autoencoder is regularized for the denoising application using a combined loss function that considers both mean square error and binary cross entropy. A dataset of 59,119 letter images, containing both English letters (upper and lower case) and numerals (0 to 9), is prepared from many scanned invoice images and Windows TrueType (.ttf) files and is used for training the neural network. Performance is analyzed in terms of Signal to Noise Ratio (SNR), Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM) and Universal Image Quality Index (UQI) and compared with other filtering techniques, namely the nonlocal means filter, anisotropic diffusion filter, Gaussian filter and mean filter. The denoising performance of the proposed SDAE is also compared with an existing SDAE that uses a single loss function, in terms of SNR and PSNR values. The results show the superior performance of the proposed SDAE method.
Keywords: Stacked denoising autoencoder (SDAE); optical character recognition (OCR); signal to noise ratio (SNR); universal image quality index (UQI); structural similarity index (SSIM)
Digitizing paper documents is a crucial step in business process automation. This process helps industries to efficiently manage large volumes of documents. The images obtained by scanning the paper documents are converted into a digital format using OCR (optical character recognition) software. During scanning, noise can enter the images in the form of background noise or blurred and faded letters caused by dirt on the paper or lens, watermarks, moisture on the lens, or physical handling of the paper. Transmission errors and compression methods also add noise to the images. This can result in significant image degradation and affects OCR detection accuracy. It is therefore essential to use efficient image denoising techniques as a preprocessing step to remove the noise and recover the text information from the degraded image. Invoices are important documents whose processing needs to be automated in almost all industries. Many old invoices (receipts) are damaged by physical handling and dirt, and OCR fails to detect the letters in these noisy invoices accurately. Invoice data denoising is therefore required as a preprocessing step before OCR detection. Most existing filtering methods are not efficient for this type of letter image under high noise conditions. This work focuses on an autoencoder-based deep learning technique for invoice letter image denoising. We prepared a dataset of 59,119 letter images obtained from different scanned invoice images and developed a modified stacked denoising autoencoder model with a combined loss function for letter denoising. A detailed comparison is made with existing denoising filters and a standard autoencoder-based method at different noise levels.
Image denoising techniques have attracted researchers for half a century, and denoising remains a challenging and open task [2–5]. Spatial domain methods consist of linear filters, which blur edges and remove fine details, and non-linear filters, which preserve edge information while suppressing noise. Denoising filters in the literature include Wiener filtering, morphological techniques, vector median filtering, the non-local algorithm, etc. [8–19]. However, these methods do not produce good results for document images.
Machine learning methods for image denoising include sparsity-based methods, dictionary learning, total variation regularization, gradient histogram estimation and preservation (GHEP), etc. [20–27]. These methods perform reasonably well but have drawbacks, such as manual parameter setting and the need for computationally expensive optimization techniques.
Deep learning techniques are a part of machine learning and have significant applications in many fields [29–34]. The application of deep learning to image denoising has gained much attention in recent years [35–39]. However, most deep learning methods for image denoising are highly data dependent, and an architecture designed to remove one particular type of noise will not work for another noise distribution. Deep learning based denoising of invoice data requires a large training dataset. No such public dataset of invoice letters is currently available, and this hinders research in this direction.
In this paper, a modified stacked denoising autoencoder is implemented and used for receipt data denoising. The proposed autoencoder design captures the most salient features of the training samples. An undercomplete autoencoder is designed with non-linear encoder and decoder functions. This autoencoder is regularized for the denoising application using a combined loss function that considers both mean square error and binary cross entropy error. Two-level stacking is used to increase the efficiency of the network. A dataset of 59,119 letter images is prepared from different scanned invoice (receipt) images and Windows TrueType (.ttf) files for training the network. Its performance is compared with other denoising filters in terms of SSIM, SNR (dB), PSNR (dB) and UQI values at different noise levels. The denoising performance of the proposed SDAE with the combined loss function is also compared with a standard SDAE with a single loss function in terms of SNR and PSNR values. The results show that the proposed method has better denoising performance.
Autoencoders are artificial neural networks that are capable of learning lower-dimensional features of the input data. They are trained with an unsupervised criterion. The input is a vector p ∈ [0, 1]^d, which can be a patch of an image. This input is mapped to a hidden representation q ∈ [0, 1]^d. Here d represents the dimensionality of the vector space. The mapping is given by Eq. (1)
where s is a nonlinear function, W is a weight matrix and b is a bias vector. This hidden representation is mapped back to a vector y ∈ [0, 1]^d in order to obtain the reconstructed input data. This reverse mapping is given by Eq. (2)
where W′ and b′ are the weight matrix and bias vector of the reverse mapping, respectively.
The model parameters are optimized to minimize the cost function, which is the average reconstruction error, given by Eq. (3)
where L is a loss function.
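Written out in the standard autoencoder form that matches the symbols defined above, Eqs. (1)–(3) can be expressed as follows. The parameter sets θ = {W, b} and θ′ = {W′, b′}, the sample index i, the sample count n, and the use of the same nonlinearity s in the reverse mapping are notational assumptions introduced here for illustration.

```latex
\begin{align}
q &= s(Wp + b) \tag{1}\\
y &= s(W'q + b') \tag{2}\\
\theta^{*}, \theta'^{*} &= \arg\min_{\theta,\,\theta'} \frac{1}{n}\sum_{i=1}^{n} L\bigl(p^{(i)}, y^{(i)}\bigr) \tag{3}
\end{align}
```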
This network adapts itself to extract features from images, so hand-coded feature descriptors are not needed. Autoencoders can be used for classification and denoising applications. The autoencoder architecture is shown in Fig. 1. It consists of three parts: an encoder, one hidden layer and a decoder. The encoder takes an input vector (p) and produces a feature map (q), which is a compressed representation of the input data. The decoder reconstructs the output vector (y). During each training phase, a loss function is calculated and its value is minimized so that the reconstructed data resembles the original input.
2.1 Denoising Autoencoders
A denoising autoencoder can learn to remove noise from the input image. It can also prevent overfitting in classification tasks by preventing the network from memorizing examples from the training set. For denoising purposes, instead of using the input and the reconstructed output to compute the loss, the loss is calculated using the ground truth image and the reconstructed image, as shown in Fig. 2.
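The difference from a plain autoencoder can be illustrated with a minimal sketch: the network receives the noisy vector, but the loss is scored against the clean ground truth. The helper names (`add_noise`, `denoising_loss`), the thresholding stand-in for a trained network, and the use of mean squared error alone are assumptions made for illustration, not the authors' implementation.

```python
import random

def add_noise(clean, sigma, rng):
    """Corrupt a clean vector with zero-mean Gaussian noise, clipped to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in clean]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def denoising_loss(model, clean, sigma, rng):
    """Feed the noisy vector to the model, but score against the clean one."""
    noisy = add_noise(clean, sigma, rng)
    reconstructed = model(noisy)
    return mse(clean, reconstructed)

def threshold_model(vec):
    """Hypothetical stand-in for a trained network: binarize at 0.5."""
    return [1.0 if v > 0.5 else 0.0 for v in vec]

rng = random.Random(0)
clean_patch = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
loss = denoising_loss(threshold_model, clean_patch, sigma=0.1, rng=rng)
```

In an actual training loop the stand-in model would be replaced by the network, and this loss would be minimized over many noisy and clean pairs.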
The mapping function for the denoising autoencoder is given by Eq. (4)
where r is a random vector and s is a nonlinear function. The reverse mapping for the denoising autoencoder is given by Eq. (5)
where W′ and b′ are the weight matrix and bias vector of the reverse mapping, respectively. The cost function is
The second term in the cost function is used to minimize correlations between input images.
2.2 Stacked Autoencoders
Stacked autoencoders are obtained by stacking one autoencoder layer after another. A composition of several levels of nonlinearity in a neural network can efficiently model complex relationships between variables. Each layer produces a higher-level representation from the lower-level representation output by the previous layer. This technique can efficiently detect important structures (features) in the input patterns. A new encoding function is learnt by the network in each hidden layer and passed to the next level for learning another encoding function. The structure of the stacked autoencoder is shown in Fig. 3.
The input is a vector x, which is passed through the hidden layers, and y is the decoded vector of the output layer. The encoder maps the input data x into a hidden representation (code), and the decoder reconstructs the input data from this hidden representation. Here h1 (first hidden layer) represents the hidden encoder vector calculated from x, and h2 (second hidden layer) represents the second hidden encoder vector calculated from h1. Similarly, h3 and h4 are the two hidden layers in the decoder section, which represent the hidden decoded vectors formed from the code generated by the encoder. The encoding process in each layer is as follows:
where hn represents the hidden encoder vector in the nth hidden layer, f is the encoding function, Wn represents the weight matrix of the encoder in the nth hidden layer, and bn is the bias vector in the nth hidden layer.
Similarly, in the decoder, hn represents the hidden decoder vector in the nth hidden layer, g is the decoding function, W′n represents the weight matrix of the decoder in the nth hidden layer, and b′n is the bias vector in the nth decoder hidden layer.
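Consistent with these symbol definitions, the layer-wise encoding and decoding steps can be written in the following standard form. The decoder symbols W′_n and b′_n, the convention h_0 = x, and the indexing of the decoder layers as n = 3, 4 (following h3 and h4 in the description of Fig. 3) are assumptions made here for illustration.

```latex
\begin{align}
h_n &= f(W_n h_{n-1} + b_n), \qquad n = 1, 2, \quad h_0 = x \\
h_n &= g(W'_n h_{n-1} + b'_n), \qquad n = 3, 4
\end{align}
```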
End-to-end pre-training and ladder-wise pre-training are the two methods of training stacked autoencoders. After all the hidden layers are trained, the backpropagation algorithm is used to minimize the cost function and update the weights. The rectified linear unit (ReLU) activation function is applied after each hidden layer vector calculation. ReLU does not suffer from gradient diffusion or vanishing gradient problems. The ReLU function is f(x) = max(0, x).
A sigmoid activation function is used in the output layer.
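As a small illustration of the two activations named here, the sketch below implements ReLU and a numerically stable sigmoid in plain Python. The function names and sample values are chosen for illustration only.

```python
import math

def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

pre_activations = (-2.0, 0.0, 3.5)
hidden = [relu(v) for v in pre_activations]      # hidden-layer activation
output = [sigmoid(v) for v in pre_activations]   # output-layer activation
```

The sigmoid keeps every output pixel inside (0, 1), which matches image data scaled to the [0, 1] range.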
The methodology of the work is shown in Fig. 4. First, the dataset for training, testing and cross validation is generated. An autoencoder is then designed and trained on this dataset. The ability of the autoencoder to remove additive noise is tested with external noisy letter images.
The steps in the experiment are detailed in Fig. 5. This section is broadly divided into generation of the dataset, development of the stacked denoising autoencoder and testing.
4.1 Generation of Data Set
Invoices are often generated by Windows-based systems, so it is logical to train the autoencoder with Windows fonts and letter sizes. A Python script is written to extract Windows TrueType (.ttf) files, which are used to generate images of lower-case and upper-case English letters and numerals. All available fonts at 12-point size are used to generate synthetic images of dimension 60 × 40. The 62 letters and numerals are stored in 62 folders, each containing 711 images, with the folder name as the label. This synthetic dataset contains 44,082 images.
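The class count and the size of the synthetic set follow directly from these figures, as the short check below shows. The variable names are illustrative.

```python
import string

# 26 lower-case letters + 26 upper-case letters + 10 digits = 62 classes.
classes = list(string.ascii_lowercase + string.ascii_uppercase + string.digits)

images_per_class = 711  # images per folder, as stated in the text
synthetic_total = len(classes) * images_per_class
```

Here `synthetic_total` evaluates to 44,082, in agreement with the stated size of the synthetic dataset.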
Another Python script reads in scanned images of invoices and draws contours around the letters. The resulting text boxes are separated, labelled and added to the respective folders, augmenting the synthetic dataset to yield 59,119 images. The pixel values of the images are converted to a Python array along with the labels, a controlled amount of noise is added, and the result is pickled to form the training, test and cross-validation data for the autoencoder.
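The final splitting and pickling step could look roughly like the sketch below. The 80/10/10 split, the function name `split_and_pickle`, and the toy four-pixel samples are assumptions for illustration; the split ratios actually used are not stated here.

```python
import pickle
import random

def split_and_pickle(samples, train_pct=80, test_pct=10, seed=0):
    """Shuffle (pixels, label) pairs, split them, and pickle each split to bytes."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_test = len(items) * test_pct // 100
    splits = {
        "train": items[:n_train],
        "test": items[n_train:n_train + n_test],
        "validation": items[n_train + n_test:],
    }
    return {name: pickle.dumps(part) for name, part in splits.items()}

# Toy stand-in data: ten tiny "images" of four pixels each, with a label.
toy = [([i / 10.0] * 4, str(i)) for i in range(10)]
blobs = split_and_pickle(toy)

# Each blob can be written to disk and later restored with pickle.loads.
restored = {name: pickle.loads(blob) for name, blob in blobs.items()}
```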
4.2 Development of Autoencoder
The stacked denoising autoencoder is implemented in Python using the PyTorch deep learning library. Pickled noisy images of size 59,119 × 60 × 40 are fed to the input of the stacked denoising autoencoder. The Adam optimizer is used with a learning rate of 10^−3 and a batch size of 16. The network is trained for 100 epochs on an HPCC with NVIDIA Tesla K20m GPU hardware.
During each epoch, the mean square error and cross entropy error are calculated, and this combined loss is backpropagated so that the optimizer can update the weights of the network. Two hidden layers with 512 and 128 neurons, respectively, are included in the encoder section. Another two hidden layers with 128 and 512 neurons, respectively, are included in the decoder section. The ReLU activation function is used after each hidden layer, and a sigmoid activation function is used at the output layer. Additive Gaussian noise with zero mean and a variance of 20% of the peak signal value is used to generate the noisy versions of the letter images.
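A minimal sketch of a combined loss of this kind is given below in plain Python. The equal-weight sum of the two terms, the weight parameters, and the clamping constant `eps` are assumptions for illustration; how the two errors are weighted in the trained model is not specified in the text.

```python
import math

def mse_loss(pred, target):
    """Mean squared error over all pixels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce_loss(pred, target, eps=1e-7):
    """Binary cross entropy over all pixels; predictions are clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(1.0 - eps, max(eps, p))
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def combined_loss(pred, target, mse_weight=1.0, bce_weight=1.0):
    """Weighted sum of MSE and BCE (equal weights assumed by default)."""
    return mse_weight * mse_loss(pred, target) + bce_weight * bce_loss(pred, target)

target = [0.0, 1.0, 1.0, 0.0]   # clean pixels
good = [0.1, 0.9, 0.8, 0.2]     # reconstruction close to the target
bad = [0.6, 0.4, 0.5, 0.7]      # reconstruction far from the target
```

Both terms decrease as the reconstruction approaches the clean image, so `combined_loss(good, target)` is smaller than `combined_loss(bad, target)`.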
The system shown in Fig. 6 is tested with noisy letter images of known variance. End to end pre-training is done using 59,119 noisy letter images. A comparative study with other filters is done in terms of Peak Signal to Noise Ratio (PSNR), Signal to Noise Ratio (SNR), Structural Similarity Index (SSIM) and Universal Image Quality Index (UQI).
5 Results and Analysis
SDAE (stacked denoising autoencoder) outputs with the combined loss function for a randomly selected letter image at noise variances of 20%, 40% and 60% (of the peak signal value) are shown in Fig. 7. Here the letter “C” from a scanned invoice, corrupted with 20% to 60% white Gaussian noise, is given as input, and the noise is removed in all cases. The Python 3 plotting library matplotlib is used for plotting all graphs and figures.
The autoencoder output for a set of input letter images corrupted by 20% noise variance is shown in Fig. 8. The characters “y, m, M, 8, o, 7, h and 3” with 20% noise are tested. All images were denoised with 100% detection accuracy.
The autoencoder output for the same set of input letter images corrupted by 40% noise variance is shown in Fig. 9. It is evident from the figure that most letters were denoised well, but the number “3” did not retain its shape at this noise level.
The autoencoder output for input letter images corrupted by 60% noise variance is shown in Fig. 10. It is observed that the proposed method works well even under deep noise levels. The noise is removed and the letters closely resemble the original letters.
5.1 Comparison with Other Denoising Filters
It is essential to compare the proposed method with other denoising filters for these invoice letter images at different noise levels. Results of other filters for a randomly selected invoice image representing the number “4” corrupted by 20% noise variance are shown in Fig. 11 for comparison. The NLM filter shows good denoising at this noise level, but the performance of the anisotropic diffusion filter and the Gaussian filter is poor. SDAE removes the noise completely and outperforms all other filters.
Results of the other filters for the same image corrupted by 60% noise variance are shown in Fig. 12. At this noise level, no information is visible in the noisy input image. SDAE still recovers the number “4” and shows stable performance, whereas the NLM-based method fails at this level. At this high noise level, none of the other filters performs better than SDAE.
The visual quality of the SDAE output is assessed using the following measures:
1. Signal to Noise Ratio (SNR)
2. Peak Signal to Noise Ratio (PSNR)
3. Structural Similarity Index (SSIM)
4. Universal Image Quality Index (UQI)
5.1.1 Improvement in Signal to Noise Ratio
The SNR is expressed as
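One common definition, consistent with reporting SNR in dB, is given below. The symbols x (clean reference image), x̂ (denoised image) and the pixel indices (i, j) are notational assumptions; the exact variant used for the reported values is not restated here.

```latex
\mathrm{SNR} = 10 \log_{10}\!\left( \frac{\sum_{i,j} x(i,j)^{2}}{\sum_{i,j} \bigl( x(i,j) - \hat{x}(i,j) \bigr)^{2}} \right) \ \text{dB}
```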
The signal to noise ratio improvement of the various denoising methods for Gaussian noise of zero mean and different noise variances is shown in Tab. 1. The SNR improvement for SDAE is consistently 10–12 dB above that of the other filters, even under deep noise, which supports the visual quality seen in Figs. 11 and 12. The visual quality of the NLM filter, anisotropic diffusion filter and Gaussian filter, however, is not stable at high noise variances, as is evident from the SSIM values.
5.1.2 Peak Signal to Noise Ratio
Peak signal to noise ratio (PSNR) is the ratio between the maximum possible power of an image and the power of corrupting noise that affects the quality of its representation. PSNR is defined as follows:
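In terms of the peak intensity M and the root mean square error (RMSE) between the reference and the denoised image, the standard form of this definition is the following; it is given here as the conventional expression rather than as a quotation of the original equation.

```latex
\mathrm{PSNR} = 20 \log_{10}\!\left( \frac{M}{\mathrm{RMSE}} \right) \ \text{dB}
```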
Here, M is the maximum possible intensity level in an image (with the minimum intensity level taken to be 0) and RMSE is the root mean square error. Tab. 2 shows the PSNR values for the various filters under different noise levels. The values for SDAE are above those of the other filters, indicating its superior performance in noise removal.
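Both measures can be computed with a few lines of plain Python, as sketched below. The sketch assumes pixel values scaled to [0, 1] (so the peak level defaults to 1.0), a reference image with nonzero energy, and the common energy-ratio definition of SNR; the function names are illustrative.

```python
import math

def snr_db(reference, estimate):
    """SNR in dB: reference energy over the energy of the residual error."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal == 0:
        raise ValueError("reference image has zero energy")
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)

def psnr_db(reference, estimate, max_level=1.0):
    """PSNR in dB: 20 * log10(peak level / RMSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 20.0 * math.log10(max_level / math.sqrt(mse))

reference = [0.0, 1.0, 1.0, 0.0]   # clean pixels
denoised = [0.1, 0.9, 1.0, 0.0]    # imperfect reconstruction
snr_value = snr_db(reference, denoised)
psnr_value = psnr_db(reference, denoised)
```

For 8-bit images, `max_level=255` would be passed instead of the default.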
5.1.3 Structural Similarity Index (SSIM)
The structural similarity index (SSIM) represents the “visual quality” of the image. It quantifies the degree of preservation of the overall structure of the image. The similarity index between the images x and y is given as
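The standard SSIM expression from the literature, written with the means, variances, covariance and stabilizing constants of the two images, is shown below. The symbol names μ, σ and c follow the usual convention and are assumed here.

```latex
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^{2} + \mu_y^{2} + c_1)(\sigma_x^{2} + \sigma_y^{2} + c_2)}
```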
The parameters μx and μy are the means and σx² and σy² are the variances of x and y, respectively. σxy is the covariance between x and y. c1 and c2 are nonzero constants. When x and y are identical, SSIM is unity, and it degrades as the structural differences between x and y increase.
An SSIM comparison of five denoising methods (SDAE, NLM, Gaussian filter, mean filter and anisotropic diffusion filter) is shown in Fig. 13. The stable performance of SDAE is apparent from this graph. At 10% Gaussian noise, NLM has a slightly higher SSIM index, but its performance drops drastically as the noise level increases. The anisotropic diffusion filter shows a stable value, but its SSIM is lower than that of SDAE. SDAE attains high SSIM values in the range of 0.998. The Gaussian and mean filters show lower SSIM values.
5.1.4 Universal Image Quality Index (UQI)
UQI is designed by modelling any image distortion as a combination of three factors: loss of correlation, luminance distortion, and contrast distortion. A comparison of the Universal Image Quality Index for the different methods is shown in Fig. 14. The NLM filter performs well only at low noise levels. SDAE shows a stable and better Universal Image Quality Index even at high noise levels. The Gaussian filter has better UQI values than the anisotropic diffusion filter at lower noise levels, but it is less stable than the anisotropic diffusion method. The mean filter has the lowest UQI values.
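The three factors combine into a single ratio, Q = 4·σxy·μx·μy / ((σx² + σy²)(μx² + μy²)), which the sketch below evaluates over a whole image at once. Computing the index globally rather than over sliding windows, and the function name `uqi`, are simplifying assumptions for illustration.

```python
def uqi(x, y):
    """Universal image quality index from global statistics of two flat images."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    denom = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denom == 0:
        raise ValueError("index undefined for constant or all-zero images")
    return 4.0 * cov_xy * mean_x * mean_y / denom

reference = [0.1, 0.9, 0.8, 0.2]
degraded = [0.3, 0.6, 0.7, 0.4]   # same mean, reduced contrast
q_same = uqi(reference, reference)
q_degraded = uqi(reference, degraded)
```

The index equals 1 only when the two images are identical, and the reduced-contrast example scores lower.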
5.2 Comparison with Standard SDAE
The proposed stacked denoising autoencoder with the combined MSE and BCE loss function is compared with a standard stacked denoising autoencoder with a single binary cross entropy loss function in terms of signal to noise ratio (SNR) and peak signal to noise ratio (PSNR). Comparison results at two different noise levels for a single selected letter ‘X’ are shown in Tab. 3. The results show that the proposed method has good denoising capability even at higher noise levels.
The proposed method for denoising letter images from invoice documents using a modified stacked denoising autoencoder (SDAE) is observed to give excellent signal to noise ratio, structural similarity index and universal quality index values, even under extremely noisy conditions. A combined loss function that considers both mean square error and binary cross entropy is used to regularize the denoising function. The undercomplete representation of the autoencoder used in this method gives better feature extraction properties. A dataset of 59,119 letter images, containing both English letters (upper and lower case) and numerals (0 to 9), is prepared from many scanned invoice images and Windows TrueType (.ttf) files and is used for training the neural network. Since SDAE is an unsupervised deep learning method, no labels are required for training the network. No manual parameter tuning is necessary, in contrast to other denoising filters. The denoised letters have a better chance of being detected by OCR methods. The SDAE denoising performance in terms of SNR, PSNR, SSIM and UQI values is compared with the non-local means filter, anisotropic diffusion filter, Gaussian filter and mean filter. The proposed SDAE method is also compared with a standard SDAE in terms of SNR and PSNR values. An SSIM value as high as 0.998912 is obtained even at extreme noise levels. One disadvantage is the long time taken to train the network, although once the model is saved it can be reused. Another disadvantage is that, because of the limited number of training samples for some of the 62 classes, the shapes of some letters may be deformed at extreme noise levels. These issues may be addressed in future research. New regularization and optimization methods can also be incorporated to increase the denoising performance.
Acknowledgement: This research was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-track Research Funding Program.
Funding Statement: This research was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-track Research Funding Program.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.