Accurate handwriting recognition has been a challenging computer vision problem, because static feature analysis of the text pictures is often inadequate to account for high variance in handwriting styles across people and poor image quality of the handwritten text. Recently, by introducing machine learning, especially convolutional neural networks (CNNs), the recognition accuracy of various handwriting patterns is steadily improved. In this paper, a deep CNN model is developed to further improve the recognition rate of the MNIST handwritten digit dataset with a fast-converging rate in training. The proposed model comes with a multi-layer deep arrange structure, including 3 convolution and activation layers for feature extraction and 2 fully connected layers (i.e., dense layers) for classification. The model’s hyperparameters, such as the batch sizes, kernel sizes, batch normalization, activation function, and learning rate are optimized to enhance the recognition performance. The average classification accuracy of the proposed methodology is found to reach 99.82% on the training dataset and 99.40% on the testing dataset, making it a nearly error-free system for MNIST recognition.
The Modified National Institute of Standards and Technology (MNIST) handwritten digit database, one of the most important areas of research in pattern recognition, has excellent research and practical value. Generally speaking, handwriting classification techniques can be divided into either statistical feature-related methods or structural feature-related approaches [
In comparison to these approaches that rely on manual feature extraction, convolutional neural network (CNN)-based deep learning (DL) architectures, which is also known as deep neural networks (DNNs), can automatically extract the implicit correlation within and amongst data to find useful patterns [
This paper is organized as follows. The key CNN design method is introduced in Section 2, focusing on the design of feature extraction, classification method, hyper-parameter optimization, over-fitting prevention strategy, and model verification strategy. The model structure design and optimization analysis, particularly the experimental results of model performance validation, are discussed in Section 3. Finally, the fourth part summarizes the paper.
The core steps of CNN-based recognition model primarily include extraction, classification yield, and backpropagation to alter parameters in the network. The overall algorithm design process is presented in
CNNs rely on multiple convolutional layers and non-linear layers (e.g., activation layer) for feature extraction, and key features can be retained through feature reduction techniques (e.g., max pooling, etc.). Detail steps can be specified as follows: 1) convolution layers can be employed multiple times throughout the DNN; each convolution layer can be individually implemented by the Keras/Tensorflow built-in two-dimensional (2D) convolution function, with appropriate kernel size, stride steps, input and output channels; “Padding” is set to ‘SAME’ so that the output of the convolution function remains the same size as the input; 2) initialize each neuron/kernel in each CNN layer with a weight matrix and a bias scalar; to avoid the vanishing gradient problem and break the symmetry of neurons, a truncated normal distribution with a standard deviation of 0.1 is employed to initialize the weight matrix; in addition, for sake of avoiding the “death” of the neuron node (i.e., the output is always 0), a small positive number (e.g., 0.1) needs to be used to initialize the bias term; 3) batch normalization may be exploited to regulate the convoluted results within a certain range, to facilitate the following Rectified Linear Unit (ReLU) as the activation function to extract the features; 4) max pooling and/or other data reduction techniques are optionally applied to reduce the data dimension and computational cost, as well as to retain the key features.
The output of the last convolutional layer includes the extracted features, which are flattened into a one-dimensional tensor and then forwarded to two fully connected (i.e., dense) layers. A Softmax layer is connected after the last fully connected layer to predict the likelihood of each category for each sample. The class with the highest probability is chosen as the sample’s predicted category.
In this study, three convolution layers are exploited for feature extraction (
Once a CNN backbone architecture is established, detailed hyperparameters (e.g., batch size, learning rate, etc.) need to be tweaked to determine the best shape of this CNN model to fit the training and validation dataset, with a hope of achieving the best performance (e.g., accuracy, loss, etc.) upon other general data of the same kind.
The training batch size frequently has a significant impact on the training outcomes. Too small batch size results in long training time and even may generate severe gradient oscillations, which will make the models converge slowly or even impossible. In contrast, too large value may lure the model to converge to a local rather than a global optimal value. Consequently, it is challenging to get the ultimate ideal solution through training. Thus, by appropriately adjusting the batch size, it is possible to effectively increase memory utilization, increase the parallel efficiency of high-dimensional matrix multiplication, reduce training time and the likelihood of gradient oscillation, pushing the model convergence to the optimal results. Empirically, the global batch size can be set as Batch_size = 128*Number of Accelerators (e.g., parallel network replicas); tf.data will automatically split the global batch size among all replicas.
When an alternative mini-batch gradient descent method is available for training because of the intriguing project’s noise, employing a fixed learning rate leads the error to decline and evolves toward the local minimum, but may eventually oscillate around rather than truly converges at the global optimal point. Thus, the learning rate should be gradually decreased throughout the training process to minimize this problem. The research uses the attenuation of learning rate to strike a compromise between rapid convergence and stability. A relatively large initial training step size can be utilized to speed up the training convergence at early stage, and then it is gradually decreased to approach the global optimum, or swing back and forth in the vicinity of the optimum, as the strategic alignment and hidden layers are intensified to improve the training accuracy at convergence. This prevents gradient descent from falling into local minimum values or even gradient divergence.
Generally speaking, deep learning networks are susceptible to overfitting problems: the model’s performance on both the training and validation datasets improves over the training iterations, but drops on the validation dataset after a certain number of iterations while it continues to rise on the training dataset. This indicates that the model fits the training sample too closely, limiting its generalizability. In addition, encoding sample labels (e.g., one-hot encoding) may unnecessarily increase the difference between the full probability and zero probability categories, causing the prediction model to rely excessively on the predicted category, thereby increasing the likelihood of overfitting.
Common approaches to ameliorate this issue includes 1) reasonable data fitting and 2) using one or more “Dropout” layers to nullify a random portion of the trained parameters in the CNN after each training epoch. This work analyzes the incorrect predictions on the dataset to avoid overfitting and achieve the highest classification accuracy. Whenever the model’s performance on the validation set begins to deteriorate during the training process, the model’s training is interrupted and the trained parameters are preserved to avoid the overfitting induced by excessive training. By doing so, a model with the minimum validation loss and potentially better test outcomes can be achieved. In this study, if the validation loss stops dropping and starts to converge or increase in two consecutive training epochs, the training process should be stopped, and the epoch when the local validation loss is the lowest, or validation accuracy is the highest, must be noted to ease the process of reloading model parameters for model deployment.
In order to evaluate the built model, the cross-entropy,
MNIST database is one of the foremost classical imaging datasets within the field of machine learning, and is broadly utilized for benchmarking in image classification. The MNIST database contains 60,000 training samples and 10,000 testing samples (
Dataset object | Sample amount | Role |
---|---|---|
Data_sets.train | 55,000 | Training dataset |
Data_sets.validation | 5,000 | Validation dataset |
Data_sets.test | 10,000 | Testing dataset |
Placeholder and parameter setting.
A placeholder is created for each input image and its label: The design of convolutional layer and activation layer.
In the first layer, the size of the convolution kernel is set to be 3 × 3, and the weight and bias terms are initialized. The output channel is 12 to extract 12 different features. Then, the inner product of the convolution kernel and the input is computed, and the bias term is added to the convolution result. A batch normalization layer is employed to regulate the convolution results, followed by an ReLU activation function/layer for non-linear processing and feature extraction. The second layer is also a combination of convolution, batch normalization and activation functions, but with a kernel size of 6*6 and a stride of (2, 2) in the convolution layer, which reduces the image tensor size to a half, i.e., 14*14. In addition, the number of features is increased to 24. The third layer is similar to the second in kernel size and stride, but is extended to 32 features in total. In the end, the output of the third layer is flattened to a 1* (7*7*32) = 1*1568 tensor, fed to the following fully connected layers for classifications. Fully connected layers and dropout layer for classification and overfitting reduction.
The first fully connected layer has 200 hidden nodes, and an ReLU activation function is applied afterwards to make the input with bias terms have nonlinear characteristics. The Dropout is employed during training to randomly discard 40% of the trained neurons (i.e., weights and biases) to reduce overfitting. The output of the second dense layer is connected to the Softmax classifier to obtain the probability of each category of classification, and the class with the highest probability is selected as the predicted class of the corresponding sample. When the neural network model is validated, all nodes are retained to obtain the best predictive classification performance.
As aforementioned that a variable learning rate should be considered to achieve the optimum training outcomes, in this paper, exponential decay is used as the learning rate decay method. The initial learning rate is set at 0.001 and the learning rate decay factor is set at 0.99. As a result, the recognition accuracy of the model has been significantly improved by 2.9% on the test set, while the loss has been drastically reduced. The improvements in accuracy of various numbers/classes are also observed. This shows that the learning rate decay can effectively improve the recognition accuracy of the MNIST handwritten dataset.
By defining the cross-entropy loss function and a small value of the initial learning rate (i.e., 1*10−3), the Adam optimizer is used to automatically to minimize the loss function during the training process. In this process, the loss will be backpropagated to adjust the network parameters to better fit the training sample data. Here, the batch size is set to 1000, which means 1000 training samples are sent to the model for training with random gradient descent. The proper batch size can reduce computational overhead while generalize the overall characteristics of the dataset. The dropout rate is set to 0.4. One can see that the training loss and accuracy converge within 10 iterations (
Epoch | Training accuracy | Training loss | Learning rate |
---|---|---|---|
1 | 0.9598 | 0.1311 | 1.00e-02 |
2 | 0.9874 | 0.0400 | 5.01e-03 |
3 | 0.9927 | 0.0238 | 2.52e-03 |
4 | 0.9950 | 0.0153 | 1.20e-03 |
5 | 0.9966 | 0.0115 | 6.25e-04 |
6 | 0.9975 | 0.0090 | 3.12e-04 |
7 | 0.9980 | 0.0078 | 1.56e-04 |
8 | |||
9 | 0.9981 | 0.0068 | 3.90e-05 |
10 | 0.9939 | 0.0070 | 1.95e-05 |
The testing dataset is finally examined to verify the entire training and validation process, and achieve an accuracy of 99.40% and loss of 0.0171. The performance of the model on the test set resembles the training results. On the other hand, the accuracy rates vary on different digits. For example, 97% of number “6” are correctly classified while “1” reaches 100%.
Methods | Train. acc. | Train. loss | Val. acc. | Val. loss | Test. acc. | Test. loss |
---|---|---|---|---|---|---|
Googlenet [ |
0.98 | 0.0064 | 0.97 | 0.0029 | 0.98 | 0.0014 |
InceptionV3 [ |
0.95 | 0.0020 | 0.95 | 0.0042 | 0.96 | 0.0013 |
Xception [ |
0.96 | 0.0035 | 0.96 | 0.0141 | 0.96 | 0.0019 |
VGG16 [ |
0.96 | 0.0065 | 0.97 | 0.0391 | 0.97 | 0.0024 |
Basic CNNs | 0.98 | 0.0237 | 0.97 | 0.0578 | 0.98 | 0.0901 |
Proposed | 0.99 | 0.0260 | 0.99 | 0.0174 | 0.99 | 0.0136 |
This study elaborates the details of developing a light-weight DNN for handwriting digital recognition/classification using the MNIST dataset. The CNN-based backbone architecture and the methodology of numerous hyperparameter optimizations, including batch size, learning rate, etc., are explored in detail to enhance the training and testing outcomes. The developed neural network model is evaluated upon Keras/Tensorflow framework, and the overall accuracy of the developed model can reach 99.4% on the MNIST dataset, compatible with the results from other
We would like to thank the editors and anonymous reviewers for their thoughtful comments and efforts towards improving our manuscript.
The authors received no specific funding for this study.
The authors declare that they have no conflicts of interest to report regarding the present study.