|Computer Modeling in Engineering & Sciences|
N-SVRG: Stochastic Variance Reduction Gradient with Noise Reduction Ability for Small Batch Samples
School of Information Science and Engineering, Fudan University, Shanghai, 200433, China
*Corresponding Author: Lirong Zheng. Email: email@example.com
Received: 01 September 2021; Accepted: 11 October 2021
Abstract: The machine learning model converges slowly and has unstable training since large variance by random using a sample estimate gradient in SGD. To this end, we propose a noise reduction method for Stochastic Variance Reduction gradient (SVRG), called N-SVRG, which uses small batches samples instead of all samples for the average gradient calculation, while performing an incremental update of the average gradient. In each round of iteration, a small batch of samples is randomly selected for the average gradient calculation, while the average gradient is updated by rounding of the past model gradients during internal iterations. By suitably reducing the batch size B, the memory storage as well as the number of iterations can be reduced. The experiments are compared with the state-of-the-art Mini-Batch SGD, AdaGrad, RMSProp, SVRG and SCSG, and it is demonstrated that N-SVRG outperforms SVRG and SASG, and is on par with SCSG. Finally, by exploring the relationship between the small values of different parameters n. B and k and the effectiveness of the algorithm, we prove that our N-SVRG algorithm has some stability and can achieve sufficient accuracy even in the case of small batch size. The advantages and disadvantages of various methods are experimentally compared, and the stability of N-SVRG is explored by parameter settings.
Keywords: Machine learning; SGD; SVRG; memory storage
The variance problem introduced by the stochastic nature of the SGD algorithm becomes the main problem of optimization algorithms nowadays. The introduction of variance makes SGD reach only sublinear convergence speed with a fixed step size , while the stochastic algorithm accuracy is positively related to the sampling variance, and when the variance tends to 0, the deviation of the algorithm will also be 0. In this case, the SGD can still be fast even with a large step size convergence. Therefore, how to reduce the variance of the stochastic gradient in the stochastic algorithm has become an important issue studied by scholars .
For SGD variance problem, there are three mainstream methods to reduce the variance of sampling at present that include importance sampling, hierarchical sampling method and control variable method. The objective function in machine learning is usually solved using the Batch Gradient Descent (BGD) or SGD . BGD algorithm computes the gradients of all samples for each iteration to perform the weight update, and the latter randomly selects one training sample at a time to update the parameters by computing the sample gradients. Then came the improved Mini-Batch SGD (MBGD) algorithm, where MBGD computes the gradient and performs weight update by randomly selecting m data samples in the original data for each iteration. SGD has the advantage that each step relies only on a simple random sample gradient, so the computational consumption is only a fraction of that of the standard GD . However, it has the disadvantage that a constant step size leads to slow convergence in the case of variance introduced by randomness.
In recent years, many scholars have carried out research based on SGD algorithm, for example, momentum algorithm , which is based on the idea of gradient momentum accumulation and uses exponential weighted average to reduce the swing amplitude of the gradient. Two concepts of velocity and friction are introduced into momentum algorithm. The function of the gradient is to change velocity, and friction is to gradually reduce the velocity. In the whole training process, the increase and attenuation of momentum are simulated to achieve the purpose of convergence. AdaGrad (Adaptive Gradient) is proposed to learn by gradually decreasing the step size . By accumulating gradients, the learning rate of each weight is related to the value of their previous gradient. However, the consequence of accumulating gradients is that as the training increases, the step size decreases and eventually the training stalls. To address the training stagnation problem that occurs with AdaGrad, Hinton, G. proposed RMSProp (Root Mean Square Prop) to calculate the cumulative gradient using a moving average method that only accumulates the gradient of one window, making the change in step size adapt to the current gradient thus achieving better optimization [7,8]. And for SGD stochasticity introduces the variance problem, where SVRG is used to correct the gradient used for each model update using the global average gradient information . Theoretical analysis and experiments demonstrated that SVRG produced linear convergence with reducing variance. Subsequently, Zhao et al.  proposed the SASG (Stochastic Average Gradient Average) algorithm, which has the same thematic idea as SVRG, with the difference that a piece of memory is used to store the original gradients of all samples, and the global average gradient is updated by constantly updating the original gradients during training, which requires a large amount of memory consumption throughout the training . The SCSG (Stochastically Controlled Sto-chastic Gradient) algorithm was proposed, SCSG is to calculate the average gradient by randomly selecting a part of the sample gradient as the global gradient, but when performing the weight update, randomly selecting the number of updates will make the calculation more variable and tedious, and the computation is large . Subsequently, a series of algorithms  such as the novel Mini-Batch SCSG [14,15], b-NICE, SAGA [16,17] were generated based on the idea of variance reduction.
However, there is another structural risk minimization problem in machine learning, which is composed of “loss function + regularization term”, and different forms of regularization terms lead to different complex problems, such as Overlapping group lasso, Graph-guided fused lasso etc. , which are very complex for SGD-based theoretical approaches, while the ADMM algorithm is applied to a wider range of models and its excellent performance proves itself to be an effective optimization tool. Several variance reduction algorithms have been proposed in combination with ADMM, including SAG-ADMM , SDCA-ADMM , and SVRG-ADMM . All three algorithms are improved algorithms generated based on the update strategy of ADMM. Then  proposed the APA-SVRG, which is trained on the basis of SVRG using proximity averaging, and the experimental results show that the APA-SVRG algorithm is comparable to SVRG-ADMM. The neighborhood stochastic L-BFGS  methods, on the other hand, an improved algorithm based on Newton’s method, which is trained mainly on loss functions that are smooth functions. Meanwhile, the specular stochastic sub-gradient descent  method is used to train on the loss function that is a non-smooth function. However, these schemes compute the gradient unbiased estimation using the average gradient of a small batch of samples, which cannot reduce the total complexity linearly, and the computational cost and memory consumption increase, and the computation and update of the gradient of a small batch of samples increase the computational consumption of the whole algorithm .
In order to address the above challenges, we propose a noise reduction method of stochastic gradient method, and then use the idea of small sample average gradient instead of global average gradient to design the algorithm N-SVRG that selects small samples for training while updating the average gradient to achieve variance reduction, and introduce the algorithm flow and convergence analysis of N-SVRG algorithm in detail, and compare it with the mainstream Mini-Batch SGD The N-SVRG algorithm is compared with the mainstream Mini-Batch SGD, AdaGrad, RMSProp, SVRG and SCSG algorithms, and it is proved that the N-SVRG algorithm outperforms SVRG, SASG and other algorithms, and is equal to SCSG. Finally, by exploring the relationship between the small values of different parameters n. B and k and the effectiveness of the algorithm, we prove that the N-SVRG algorithm has some stability and can achieve sufficient accuracy even in the case of low back size. Experimentally comparing the advantages and disadvantages of various methods and exploring the stability of the N-SVRG algorithm through parameter settings.
The contributions of this paper are as follows:
* We propose N-SVRG that uses small batches samples instead of all samples for the average gradient calculation, while performing an incremental update of the average gradient. By suitably reducing the batch size B, the memory storage as well as the number of iterations can be reduced.
* The convergence analysis of the proposed algorithm shows that the algorithm can stably converge to a certain lower limit, and find the saddle point with smooth gradient descent in the whole training process of neural network. It also enables the algorithm in this paper to reduce the computational cost and memory burden.
* The experiments are compared with the state-of-the-art demonstrate that N-SVRG outperforms baseline. Exploring the relationship between the small values of different parameters n. B and k and the effectiveness of the algorithm, we prove that our N-SVRG algorithm has some stability and can achieve sufficient accuracy even in the case of low batch size.
For the supervised learning problem in machine learning [2,26]: assume that there is a functional model l in the space L for each model input x, i.e., there is a prediction l(x). That is, given n training data where denotes the input sample data and denotes the corresponding training labels. Our goal is to find the prediction function that minimizes the loss due to inaccurate predictions. However, since the predictions l(x) and will differ each time the model predicts, one uses the loss function to assess the difference between the two. Since each sample corresponds to a loss function, the empirical risk is the average of these n sample loss functions.
where w represents the parameters in the function model, and represents the loss function corresponding to sample . Formulas (1) and (2) clearly show how the expected risk and empirical risk depend on the error function, sample space, sample set, etc.
Empirical risk minimization is solving for w such that the empirical risk is minimized:
However, during the training process, the model may have a strong performance capability because it has a large number of parameters itself, but the data provided for training are insufficient. Insufficient data for training, then the model may overfit during a large number of repeated training sessions, i.e., “the learned model is so well suited to a specific set of data that it does not reliably fit other data or future observations”. In order to reduce the impact of overfitting [4,27] to some extent, a regularization term is added to the empirical risk to limit the complexity of the model, which constitutes a structural risk minimization  in the form of “loss function + regularization term” problem:
where λ > 0 is a hyperparameter that balances the weight of the regularization term by its own numerical size, and the larger λ is set, the heavier the penalty on the weight. r(w) picks different forms depending on the effect, which include L1 parametrization, L2 parametrization , and L~∞ parametrization. The most common form is the L2 parametrization, i.e., r(w) = ‖ w ‖ 2, which can be calculated using .
The loss function in machine learning is a non-negative real function , where the common loss functions are log Logi Loss, Squared Loss, 0-1 Loss, and Loss and Cross Entropy Error Loss (CEL) . As shown below:
Logarithmic loss function:
Mean square error loss function:
0-1 Loss function:
Cross-entropy error loss function:
The use of these loss functions should be chosen according to the specific problem, for example, the cross-entropy error loss function can be used with the Softmax function to form a Softmax-with-Loss layer for training and learning the classification problem .
3 Related Concepts
To facilitate the subsequent analysis and study, we give the relevant basic concepts that will be covered in the paper as well as the definitions. There is an iterative sequence , with two adjacent iteration points , , satisfying the following conditions:
1) If μ = 1, the algorithm achieves sublinear convergence;
2) If 0 < μ < 1, the algorithm achieves linear convergence ;
3) If μ = 0, the algorithm achieves superlinear convergence.
Let set , any and θ ∈ [0,1] has , the set C is called a convex set.
Let set be a convex set and f be a function defined on set C. For any set, exists:
f (x) is a convex function defined on a convex set C.
For a function f(y) with step η > 0, its proximal operator at the point x  (Proximal Operator) is defined as:
If is a schematic function on the convex set C
That is, the proximity operator of the function f(y) at x can be viewed as the Euclidean projection of x on the set C.
Assume (Lipschitz continuous target gradient) that the objective function F: is continuously differentiable, the gradient function of F is Lipschitz continuous and the Lipschitz constant L > 0, i.e.,
Ensures that the gradient of F does not change arbitrarily fast with respect to the parameter vector. This assumption is essential for the convergence analysis of most gradient-based methods; without it, the gradient will not be a good indication of the distance to reduce F.
Assumption 1: If the function satisfies L smooth, then for any , and with gradient at point x, there exists L > 0 such that
The constant L is called the Lipschitz constant and is generally used to measure the smoothness of a function. The function satisfies L smoothness in case it itself satisfies the Lipschitz continuity condition. For a fixed , L is a fixed value. This condition places a restriction on the variation of the value of the function.
If the function is strongly convex, then for any , and with gradient at point x, there exists γ > 0 such that
4 The Proposed Algorithm
4.1 Controlled Variable Method
This section describes the specific implications of the control variables approach . It is assumed that we need to use Monte Carlo  sampling method to estimate the expectation μ of the random variable X, i.e., E[X] = μ. Also assume that we have been able to be able to estimate relatively easily the expectation τ of another random variable Y, i.e., E[Y] = τ. We construct a new random variable:
where is the corresponding coefficient, it can be seen from Eq. (17) that Z remains an unbiased estimator of μ  and that the variance of Z:
It can be easily obtained by calculation that Var(Z) is minimum at , when
where . It can be seen from (18) and (19). The new random variables Z and X are unbiased estimates of μ as well, but the variance of the random variable Z is smaller than that of X. Therefore, it is better to use the random variable Z as an unbiased estimate of μ than X. From the formula, it can be seen that the variance of Z is sufficiently small as long as the random variable X is guaranteed to show a certain correlation with Y , so Y is also called the control variable of X. This is the control variable method.
SVRG is a new variance reduction method formed based on the control variable method, which itself does not construct control variables from similar samples, but from the perspective of optimizing variables. Its iterative approach is:
where denotes the model parameters, η denotes the update step, is the global average gradient which is calculated using the model parameters obtained at the end of the previous cycle, denotes the gradient of the function f over the sample with respect to parameter , is the deviation of the control variables from the unbiased estimate, and is the corrected unbiased estimate, using to update .
SVRG itself is a periodic algorithm, with the increase of iterations, and gradually converge to , that is: , , when , so and are getting closer and closer, and because is the control variable of , according to the theory, when and the closer, the higher the correlation between the two, then the variance will converge to 0, so that the variance can be reduced to 0 by the control variable method, so the SVRG algorithm achieves linear convergence.
The SVRG algorithm steps are as follows:
Step 1: Given the initialization weight , the number of SGD steps m, the number of outer loop iterations T and the learning step η, initialize t = 0;
Step 2: Calculate the global average gradient and initialize ;
Step 3: j = 0, select a random sample , calculate , cycle Step 3 until , , go to the next step;
Step 4: t = t + 1, go to Step 2 until t = T, output: Choice 1: ; Choice 2: . From the pseudo-code, we can see that the main calculations of the algorithm are in Steps 2 and 3, where Step 2 is to calculate the average gradient and Step 3 is used to calculate the corrected unbiased estimate of the gradient from to rounds iteratively at a computational cost of n + 2m.
Assuming that all are convex functions and satisfy Eqs. (19) and (20) and γ > 0, and assuming that m is large enough, then
Expectation of geometric convergence  for SVRG:
Proof: For any i, consider that there is
science , so ,
Because , therefore , therefore
When i = 1,2, …, n, the cumulative inequality (21) is added. And using , we obtain
Give way available
Make available to
The first inequality uses , the second inequality uses
, the third inequality uses (27). Existing
The first inequality uses the previously obtained inequality , and the second inequality uses the P(w)-convex property. By accumulating the j = 1, … m, stages we obtain
The second inequality uses the strong convexity property, thus obtaining
So get , cite as evidence.
5 N-SVRG Algorithm
The SASG algorithm is a memory-consuming algorithm in the SVRG family of algorithms. Although the computational cost of the SVRG algorithm is huge when computing the average of all sample gradients, it can be computed in parallel during the computation because it can use matrix operations itself. the difference between the SASG and SVRG algorithms is that the SASG algorithm requires one memory block to store all sample gradients, which not only increases the computational cost (because matrix parallelism cannot be used), but also increases the memory burden of the computer. In order to reduce the computational cost and memory burden, we designed the back decimation update gradient [2,33].
5.1 Algorithm Introduction
In the SASG algorithm, a piece of memory is used to store the original gradients of all samples, while a new gradient modeled with the current new weights is computed after passing thereby achieving the purpose of updating . However, its memory consumption is too large, and the computational cost increases, which makes it impractical to use the SASG algorithm in the face of the complexity of the model due to the large size of the data and the huge parameters, etc. The SCSG algorithm is mainly designed to reduce the computational cost. In the SCSG algorithm, it does not need to calculate the gradient of all samples, but only needs to consistently sample a small batch of samples and calculate the average gradient instead of the global sample gradient, so that it can obtain the same training accuracy and convergence speed as SVRG [15,34].
The small batch average gradient in SCSG and the SGD update approach in the SASG algorithm inspired our algorithm. n-SVRG algorithm discards the calculation of the global average gradient for all samples and consistently samples a small batch of samples from the sample, and replaces the global average sample gradient for all by calculating the batch average gradient, i.e., , so that we can reduce the computational cost. At the same time, we store the gradients of these samples, as in the SASG algorithm, the batch sample gradients are modified after each update of the weight, and then the average gradient is recalculated. But we change the update form of SASG average gradient from to , that is, the sample gradient calculated by the past model is directly rounded off, and the gradient variance is reduced by directly rounding of the sample gradient. From the perspective of the control variables method, it can be interpreted as follows: the mean of the original weight gradient is used as the estimator, and the variance is reduced by gradually decreasing the number of samples. So inevitably, the number of iterations to compute the unbiased estimate of the gradient is correspondingly reduced to less than B, which indirectly reduces the computational cost.
The N-SVRG algorithm steps are as follows:
Step 1: Given the initialization model parameters , the number of outer cycles T, the step size η, the number of SGD steps k and the size of each batch B, initialize t = 0;
Step 2: Randomly and consistently sample a batch of , where size ;
Step 3: Calculate the gradient for all samples in one by one and deposit it in , initializing , ;
Step 4: Calculate and select a random sample ;
Step 5: Compute and remove the gradient of sample ij by means of , j = j + 1, looping Steps 4 and 5 until j = k, , going to the next step;
Step 6: t = t + 1, go to Step 2 until t = T, output: .
As can be seen from Step 5 to , is finite and no longer embraced after deletion, to avoid repeating deletions by mistake, the random sorting of the samples in B will be used in the program programming to draw them sequentially. The algorithm steps show that the main computational cost of the N-SVRG algorithm is in Step 3 and Step 5, where the number of computations in Step 3 is B and Step 5 contains an inner loop with k, because k < B, so the maximum computational cost of the algorithm is 2 B.
5.2 Algorithm Convergence Analysis
For simplicity, we only consider the case where each is convex and smooth and P(w) is a strongly convex problem. There are the following assumptions:
is a convex function with L-Lipschitz gradient 
Assume that P(w) is a strongly convex function
The expectation in Step 3 of the SVRG algorithm is , but the variance problem makes non-zero. The variance of is reduced by using in the update rule of the algorithm.
In the algorithm of this paper, the same SVRG is wanted to be achieved by updating . Consider the jth loop in the jth.
From the literature it is clear that
We can take the expectation for and get
Since is used in the algorithm instead of and , so
The first three inequalities are used and the fourth inequality is used .
Set also ; This has
The first inequality uses and the second inequality uses the P(w)-convex property. By accumulating the j = 1, …, k stages we get
The second inequality uses the strongly convex property (38), and we thus obtain
We have the convergence of N-SVRG
The convergence of the N-SVRG algorithm shows that the convergence rate of the algorithm is related to the selection of parameters B and k.
Regarding the selection of parameters B and k on the stability of the algorithm and the convergence speed exploration we will explain in detail in the experimental section.
6 Experimental Design and Experimental Results
6.1 Experimental Data
MNIST  is a handwritten digit image dataset that was collected with the aim of logarithmically enabling the recognition of handwritten digits, as shown in Fig. 1. This dataset has been widely used in machine learning and deep learning since 1998 by Yan LeCun in the LeNet-5 network to test the effectiveness of algorithms such as SVM, Linear Classifiers, Neural Nets, KNN [31,35], etc.
The dataset consists of a training set and a test set, where the training set contains 60,000 handwritten digital images and the digital labels corresponding to the digital images. Similarly, the test set contains a total of 10,000 handwritten digital images and digital labels corresponding to the digital images. The data set is composed of digital images from 0 to 9, each of which is a single-channel gray scale image of 28 pixels by 28 pixels, and each pixel is a uint 8 data type, i.e., an 8-bit unsigned integer with a value between 0 and 255. Each image is labeled with “1”, “3”, “5”, etc.
6.2 Experimental Setup
6.2.1 Data Pre-Processing
The experimental dataset is a MNIST dataset. In the experimental preprocessing stage, we need to regularize and expand the images in one dimension so that when we use batch processing, we can transform a batch of digital image samples into a two-dimensional array for computation. At the same time, we encode the data labels as one-hot. One-hot encoding is an array where the correct solution label is 1 and all others are 0. For example, like label ‘7’ in one-hot encoding array is 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0.
6.2.2 Network Structure
All program networks are structured as 728 × 40 × 10, the input layer is 728 neurons with input function only, the implicit layer is 40 neurons containing the Relu function, and the output layer is a Softmax-with-Loss layer with 10 neurons. In the output layer we use a Softmax-with-Loss layer for learning.
6.2.3 Experimental Basis
The algorithm is done by Python3.6, for the implementation of the algorithm for the structure in [15,17] is consulted. The device used in this paper is CPU: Inter(R) Core(TM) i5-7500 CPU@3.40 GHz with 16.00 GB of RAM.
The experiment consists of two parts in total. The first part is mainly the evaluation comparison between the algorithm of this paper and the current mainstream algorithm; the second part is mainly the exploration of the relationship between the parameters of the algorithm of this paper.
6.3 Algorithm Comparison
The N-SVRG algorithm is programmed in PyCharm using Python3.6, and is compared with the mainstream Mini-Batch SGD, AdaGrad, RMSProp, SVRG and SCSG algorithms.
Parameter setting of each algorithm: Mini-Batch SGD:
Maximum number of iterations 4800, per batch size 250;
AdaGrad: 4800 iterations maximum, 250 per batch size;
RMSProp: 4800 iterations maximum, 250 per batch size;
SVRG: Maximum epoch count 20 times, SGD steps 5000;
SCSG: Maximum number of iterations 4800, batch size 250, number of SGD steps 250;
N-SVRG: Maximum number of iterations 4800, batch size 250, number of SGD steps 249.
The following points can be seen in Fig. 2 and Table 1:
(1) N-SVRG, RMSProp and SCSG all reach around 65% accuracy by the first epoch of iteration, while SVRG, Mini-Batch SGD and AdaGrad only reach below 40%. As the iteration proceeds, N-SVRG, RMSProp and SCSG reach 80% or more accuracy after only 7 epochs, SVRG reaches 65%, and Mini-Batch SGD and AdaGrad reach only 50%. Batch SGD and AdaGrad.
(2) N-SVRG, RMSProp and SCSG all achieve high accuracy smoothly and quickly. In terms of accuracy, the difference between N-SVRG and SCSG is 0.26%, and the difference between RMSProp and SCSG is 1.13%. The detection accuracy of SVRG is slightly lower than the previous three algorithms, but it is still better than Mini-Batch SGD and AdaGrad.
6.4 Stability Exploration
The stability of the N-SVRG algorithm is mainly affected by two factors: the size of each batch B and the number of SGD steps k. The size of B directly affects the computational cost and storage cost of the algorithm, while an appropriate B can enable the algorithm to achieve better accuracy while converging quickly, and the number of SGD steps k has a similar effect with B. Also, the size relationship between k and B is what we need to discuss. Inspired by , this paper focuses on two aspects of the experimental investigation, namely, the relationship between the number of training samples B and n ratio and the relationship between B and k the ratio.
B and n ratio exploration:
In the experiment to explore the effect of B and n scaling relationship on the algorithm, in order to make the experimental sample size diversity, we need to take 10,000, 8,000, 4,000 samples from MNIST data one by one for training and B ∈ n/2, n/4, n/8, n/16, n/32, set k = B − 1 for the experiment.
The following can be seen from Fig. 3 and Table 2:
(1) The proportional relationship between B and n has a certain relationship with the effect of the algorithm, the larger B is the smaller the training accuracy, and the longer the training time will be because of the number of substitutions due to the setting of k;
(2) Relatively best accuracy at and ;
(3) Although it can be seen from the detection accuracy that the size of B has a relationship with the convergence speed and the accuracy of the algorithm, especially and , it has less impact on the stability of the algorithm and still obtains relatively good results with a simple network structure and strong robustness.
In the experiments exploring the effect of the relationship between B and k scaling on the algorithm, we directly use the complete MNIST dataset and consistently sample small batches of samples of sizes 250, 500, and 1000, respectively. while we set the number of SGD steps k ∈ B-1, 0.8 × B 0.6 × B 0.4 × B 0.2 × B for the experiments.
The following can be seen in Fig. 4 and Table 3:
(1) The relationship between the size of k and B is positively related to the accuracy of the algorithm, the closer k is to B, the higher the testing accuracy of the algorithm, and vice versa, the accuracy decreases, and this phenomenon does not change with the size of B.
(2) From the cross-observation of the data, it is clear that when k is a constant (k = 200), the accuracy decreases with the increase of B, which confirms the above point.
(3) Comparing the results of two experiments, Experiment B with n-proportional probe and Experiment B with k-proportional probe, we can see that the N-SVRG algorithm is robust and maintains high experimental results for general optimization problems when the batch size is too small.
Through the experiments we can find that the N-SVRG algorithm is outstanding in convergence speed and accuracy compared with Mini-Batch SGD, AdaGrad and SVRG, and is comparable to RMSProp and SCSG. We can know that N-SVRG algorithm is a relatively stable algorithm, although there are two parameters B and k affect the accuracy of the algorithm, from the experiment we can see that the change of B shows a trend of the smaller the value, the higher the accuracy of the algorithm, this nature is the algorithm is suitable for general optimization problems.
In this paper, we propose a noise reduction method for SVRG, which uses a small batch of samples instead of all samples for average gradient computation and incremental update of the average gradient. The experiments are compared with the mainstream Mini-Batch SGD, AdaGrad, RMSProp, SVRG and SCSG algorithms, and it is proved that the N-SVRG algorithm outperforms SVRG and SASG, and is equal to SCSG. Finally, by exploring the relationship between the small values of different parameters n, B and k and the algorithm effect, we prove that the N-SVRG algorithm has some stability.
Funding Statement: This work was supported by the National Natural Science Foundation of China under Grant 62076066.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|