Open Access
ARTICLE
Gradient Descent with Time-Decaying Regularization for Training Linear Neural Networks
1Departamento de Ingeniería en Control y Automatización, Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME), Unidad Zacatenco, Instituto Politécnico Nacional, Unidad Profesional Adolfo López Mateos. Av. Luis Enrique Erro S/N, Gustavo A. Madero, Zacatenco, Ciudad de México, México
2 Departamento de Control Automático, Centro de Investigación y de Estudios Avanzados (CINVESTAV) del Instituto Politécnico Nacional, Unidad Zacatenco, Av. Instituto Politécnico Nacional No. 2508, Col. San Pedro Zacatenco, Ciudad de México, México
3 Facultad de Ingeniería, Universidad Anáhuac México, Campus Norte, Huixquilucan, Estado de México, México
4 Sección de Estudios de Posgrado e Investigación, Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas (UPIITA), Instituto Politécnico Nacional, Av IPN 2580, La Laguna Ticoman, G. A. M., Ciudad de México, México
* Corresponding Author: César Ulises Solís-Cervantes. Email:
(This article belongs to the Special Issue: Computational Modeling, Simulation, and Algorithmic Methods for Dynamical Systems)
Computer Modeling in Engineering & Sciences 2026, 147(1), 26 https://doi.org/10.32604/cmes.2026.077726
Received 16 December 2025; Accepted 25 February 2026; Issue published 27 April 2026
Abstract
Many linear-in-parameters models arising in identification and control can be expressed as single-layer artificial neural networks (ANNs) with linear activation, enabling online learning via first-order optimization. In practice, however, standard gradient descent often exhibits slow convergence, large intermediate weights, and stagnation when the regressor data are ill-conditioned or computations are performed under finite precision. This paper proposes Gradient Descent with Time-Decaying Regularization (GD-TDR), a training algorithm that augments the quadratic loss with a regularization term whose weight decays exponentially in time. The proposed schedule enforces uniform strong convexity during early iterations, effectively mitigating neural-paralysis-like behavior associated with flat directions, while asymptotically vanishing so that the unregularized least-squares solution is recovered. A convergence theorem for GD-TDR is established and a concise pseudocode implementation is provided. Numerical and embedded experiments on an online identification problem of a Chua-type chaotic oscillator demonstrate that GD-TDR converges faster and avoids stagnation compared to standard gradient descent, without introducing the steady-state bias characteristic of fixed quadratic regularization.
Keywords
1 Introduction
Artificial neural networks (ANNs) have been extensively studied as flexible function approximators and parametric models across a wide range of scientific and engineering tasks. General surveys and application-driven reviews document the breadth of ANN deployments and motivate their use in identification and control, particularly when explicit first-principles modeling is difficult or when data-driven adaptation is required [1–3]. Foundational treatments established the core modeling paradigms and training principles, including linear and nonlinear network structures and their algorithmic implementations [4–6]. In this classical view, many practical learning rules can be interpreted as iterative optimization procedures acting on a quadratic or near-quadratic objective, a perspective that remains central to modern online learning formulations.
A large portion of the ANN training literature is rooted in incremental (first-order) updates. Recent theoretical analyses of learning dynamics in deep linear networks provide explicit characterizations of transient behavior and the interaction between initialization, regularization, and optimization geometry under gradient-based training [7,8]. Within the quadratic-loss setting, basic gradient-descent rules and their stochastic variants remain canonical examples of first-order adaptation mechanisms and highlight how data statistics, conditioning, and step-size choices shape stability and speed of learning [9,10]. In parallel, modern energy-based learning continues to emphasize the role of objective shaping and conditioning in learnability [11]. These works collectively support the view that optimization geometry—not only model expressiveness—plays a decisive role in whether training proceeds smoothly or becomes trapped in slow transient regimes.
Beyond standard multilayer architectures, several specialized ANN families have been developed for robustness, interpretability, and control-oriented deployment. Radial basis function networks and their robust variants provide a well-established pathway to stable approximation under uncertainty and noise [12,13]. In adaptive and self-learning control, ANN-based schemes have been reported for real-time compensation and online tuning, where training must remain stable under streaming data and limited numerical precision [14]. Related approaches in intelligent control also include fuzzy and cerebellar-model architectures that stress adaptation under nonlinearities and disturbances [15,16]. Complementary lines of research continue to refine computationally efficient first-order training, including recent advances in stochastic gradient descent variants [17].
A particularly demanding class of identification problems arises in nonlinear and chaotic dynamics, where sensitivity to initial conditions and measurement noise can degrade learning reliability. Recent studies illustrate both the feasibility of parameter identification in chaotic systems and the numerical challenges of learning from chaotic trajectories [18–20]. These challenges are amplified in embedded or resource-constrained implementations, where finite-precision arithmetic and strict real-time requirements can exacerbate ill-conditioning and lead to slow or stagnant learning. Recent reviews on hardware realizations of neural methods, including FPGA-oriented implementations and embedded control applications, highlight the practical importance of training rules that remain stable and well-conditioned under limited precision. Consistent with these trends, widely used embedded platforms and rapid-prototyping toolchains have enabled end-to-end experimental validation of online learning strategies on microcontrollers [21,22].
Despite the breadth of architectures and applications, an enduring challenge in first-order online training is the susceptibility to slow plateaus and weight growth when the regressor is ill-conditioned or when optimization directions become nearly flat. Classical quadratic (L2/Tikhonov) regularization is a standard remedy to improve conditioning, yet fixed regularization may introduce steady-state bias when the target objective is the unregularized least-squares criterion. This motivates strategies that improve early-stage conditioning while preserving asymptotic fidelity to the original objective, which is the central perspective adopted in this work.
1.2 Description and Main Contributions
Single-layer ANNs with a linear activation function are equivalent to linear regression models and are widely used to represent linear-in-parameters structures in system identification, adaptive filtering, and control. In these applications, the model output can be written as a linear combination of known regressors and unknown parameters, so that training reduces to minimizing a least-squares functional. Although a closed-form solution exists for batch least squares, embedded and real-time settings often require iterative, lightweight, and online algorithms. For this reason, first-order methods based on gradient descent remain attractive due to their low computational complexity and ease of implementation [2,4,17].
In practice, however, standard gradient descent may perform poorly when the regressor data are ill-conditioned or nearly rank-deficient. These situations are frequent in online identification problems with delayed signals, correlated regressors, or limited excitation. The resulting cost surface can contain nearly flat directions, which leads to slow progress, long plateaus, and very large intermediate weights. On finite-precision hardware, such dynamics can manifest as training stagnation and numerical instabilities that are commonly described as neural-paralysis-like plateau behavior in the neural-network literature [23]. In the linear setting considered here, the effect is not caused by saturation of nonlinear activation functions, but rather by loss of curvature and poor conditioning of the quadratic objective.
A standard remedy is to augment the least-squares loss with a quadratic (Tikhonov) regularizer, which enforces strong convexity and penalizes large weights. Fixed regularization, however, introduces a bias: the minimizer of the regularized problem does not generally coincide with the minimizer of the original least-squares cost. This trade-off is particularly undesirable in identification tasks, where asymptotic accuracy is essential.
This work proposes GD-TDR (Gradient Descent Algorithm with Regularizer—Time Decay), a first-order scheme that interpolates between these two extremes. The algorithm employs a quadratic regularizer whose coefficient decays exponentially over time. As a result, the early iterations benefit from improved curvature and bounded weights, while the regularization vanishes asymptotically and the algorithm recovers the minimizer of the unregularized least-squares functional.
The main contributions of this work are summarized as follows:
• a unified analytical framework that explicitly connects classical gradient descent, gradient descent with fixed quadratic (L2) regularization, and the proposed Gradient Descent with Time-Decaying Regularization (GD-TDR) through a single decay parameter, thereby clarifying their structural similarities and fundamental differences;
• a rigorous convergence theorem establishing that the time-decaying regularization enforces uniform strong convexity during the transient phase while asymptotically recovering the minimizer of the unregularized least-squares problem;
• a concise and self-contained pseudocode implementation of GD-TDR that directly reflects the theoretical development and facilitates reproducible implementation;
• a comprehensive numerical and embedded validation, including online parameter identification of a Chua-type chaotic oscillator and a real-time implementation on an STM32F4-Nucleo® microcontroller, demonstrating accelerated convergence and mitigation of stagnation without steady-state bias.
The paper is organized as follows. Section 2 introduces the linear ANN model and the least-squares training objective. Section 3 compares gradient-descent training schemes and presents GD-TDR. Section 4 states and proves the convergence theorem and provides the GD-TDR pseudocode. Section 5 shows how common identification models can be written in linear ANN form. Sections 6 and 7 report the numerical and embedded validation, respectively. Section 8 concludes the paper and outlines future work.
2 Single-Layer Linear ANN and Least-Squares Training
2.1 Network Model
Consider a single-layer ANN with linear activation (purelin) and no bias. Its output is
$\hat{y} = w^{\top} x = \sum_{i=1}^{n} w_i x_i,$
where the input vector is $x = [x_1, \ldots, x_n]^{\top} \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$ is the vector of trainable weights.
Figure 1: Single-layer ANN with linear activation.
2.2 Batch Least-Squares Objective
Given a finite data set $\{(x_k, y_k)\}_{k=1}^{N}$, training minimizes the least-squares functional
$J(w) = \tfrac{1}{2} \sum_{k=1}^{N} \left( y_k - w^{\top} x_k \right)^2.$
Let $X = [x_1, \ldots, x_N]^{\top} \in \mathbb{R}^{N \times n}$ and $Y = [y_1, \ldots, y_N]^{\top}$. Then
$J(w) = \tfrac{1}{2} \| Y - X w \|^2.$
The gradient and Hessian of $J$ with respect to $w$ are
$\nabla J(w) = X^{\top}(X w - Y), \qquad \nabla^2 J(w) = X^{\top} X.$
The Hessian is positive semidefinite. It is positive definite (and hence $J$ is strongly convex) if and only if $X$ has full column rank, i.e., $\operatorname{rank}(X) = n$.
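To make this rank condition concrete, the following Python sketch (the paper's own implementations use MATLAB/Simulink; the matrices here are purely illustrative) checks strong convexity by inspecting the smallest eigenvalue of the Hessian $X^{\top}X$:

```python
import numpy as np

# Two regressor matrices: one well-conditioned, one nearly rank-deficient.
X_good = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_bad  = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-8], [2.0, 2.0]])

for name, X in [("good", X_good), ("bad", X_bad)]:
    H = X.T @ X                       # Hessian of the least-squares cost
    eigs = np.linalg.eigvalsh(H)
    # J is strongly convex iff the smallest eigenvalue is strictly positive.
    print(name, eigs.min())
```

In the nearly rank-deficient case the smallest eigenvalue is numerically zero, which is precisely the flat-direction situation that motivates the regularization discussed in Section 3.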
3 Training Schemes and Algorithmic Comparisons
This section summarizes three closely related first-order schemes: classical gradient descent (GD), gradient descent with a fixed quadratic regularizer (GD-QR), and the proposed time-decaying regularized scheme (GD-TDR). Only GD-TDR is presented in pseudocode form (Section 4).
3.1 Classical Gradient Descent (GD)
Using (2), the GD update with step size $\eta > 0$ is
$w_{k+1} = w_k - \eta \nabla J(w_k) = w_k + \eta X^{\top}(Y - X w_k),$
where $k$ denotes the iteration index and $w_0$ is a given initialization.
3.2 Gradient Descent with Fixed Quadratic Regularization (GD-QR)
A standard approach to improve conditioning is to add a quadratic regularizer,
$J_{\lambda}(w) = J(w) + \tfrac{\lambda}{2} \| w \|^2,$
where $\lambda > 0$ is a fixed regularization weight. The gradient is $\nabla J_{\lambda}(w) = \nabla J(w) + \lambda w$,
so the update becomes
$w_{k+1} = (1 - \eta \lambda)\, w_k - \eta \nabla J(w_k).$
Fixed regularization controls weight growth and improves curvature, but the minimizer of $J_{\lambda}$, namely $w_{\lambda}^{*} = (X^{\top}X + \lambda I)^{-1} X^{\top} Y$, does not coincide with the least-squares minimizer, so a steady-state bias is introduced.
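This bias can be seen directly from the closed-form minimizers. The short Python sketch below (illustrative data; the regularization weight is arbitrary) compares the Tikhonov solution with the unregularized least-squares solution on noiseless data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless data: the LS minimizer equals w_true

A, b = X.T @ X, X.T @ y
lam = 1.0                           # fixed (arbitrary) regularization weight
w_ls  = np.linalg.solve(A, b)                    # unregularized minimizer
w_reg = np.linalg.solve(A + lam * np.eye(3), b)  # Tikhonov minimizer

print(np.linalg.norm(w_ls - w_true))   # essentially zero: no bias
print(np.linalg.norm(w_reg - w_true))  # strictly positive: steady-state bias
```

Shrinking `lam` toward zero shrinks the bias toward zero, which is exactly the mechanism GD-TDR exploits over time.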
3.3 Proposed GD-TDR (Time-Decaying Quadratic Regularization)
GD-TDR replaces the constant regularization weight $\lambda$ by the exponentially decaying sequence
$\lambda_k = \lambda_0 e^{-\gamma k}, \qquad \lambda_0 > 0, \; \gamma > 0.$
The weight update becomes
$w_{k+1} = w_k - \eta \left( \nabla J(w_k) + \lambda_k w_k \right).$
Two limiting cases highlight the relationship among the three schemes:
• Classical GD: setting $\lambda_0 = 0$ gives $\lambda_k \equiv 0$ and recovers the standard GD update.
• Fixed regularization: setting $\gamma = 0$ gives $\lambda_k \equiv \lambda_0$ and recovers GD-QR.
Therefore, GD-TDR provides a continuous mechanism to improve conditioning early in training while asymptotically removing the regularization bias. The theoretical properties of this scheme are stated in Theorem 1.
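The three schemes can therefore be written as one update rule in which the decay parameters select the variant. The following Python sketch (illustrative problem sizes and hyperparameters, not the paper's experimental settings) implements this unified update:

```python
import numpy as np

def train(X, y, eta, lam0, gamma, iters):
    """Unified update: lam0=0 -> GD; gamma=0 -> GD-QR; otherwise GD-TDR."""
    w = np.zeros(X.shape[1])
    for k in range(iters):
        lam_k = lam0 * np.exp(-gamma * k)       # time-decaying regularization weight
        grad = X.T @ (X @ w - y) + lam_k * w    # regularized gradient
        w -= eta * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true

w_gd  = train(X, y, eta=0.003, lam0=0.0, gamma=0.0,  iters=2000)   # classical GD
w_tdr = train(X, y, eta=0.003, lam0=5.0, gamma=0.01, iters=2000)   # GD-TDR
print(np.linalg.norm(w_gd - w_true), np.linalg.norm(w_tdr - w_true))
```

Because the regularization vanishes, both runs converge to the same unregularized least-squares solution; a `gamma=0` run with the same `lam0` would instead settle at a biased estimate.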
4 Theoretical Properties and GD-TDR Pseudocode
We restate the objective in compact form. Define
$A = X^{\top} X, \qquad b = X^{\top} Y,$
and let
$J(w) = \tfrac{1}{2} w^{\top} A w - b^{\top} w + c,$
with $c = \tfrac{1}{2} \| Y \|^2$ a constant that does not affect the minimizer.
Theorem 1 (Online GD-TDR): Let $A = X^{\top} X \succeq 0$ with largest eigenvalue $L = \lambda_{\max}(A)$, and let $w^{*}$ denote a minimizer of $J$.
For $\lambda_0 > 0$ and $\gamma > 0$, define
$\lambda_k = \lambda_0 e^{-\gamma k}$
and
$J_k(w) = J(w) + \tfrac{\lambda_k}{2} \| w \|^2.$
(a) For every $k \geq 0$, the regularized objective $J_k$
satisfies
$\nabla^2 J_k(w) = A + \lambda_k I \succeq \lambda_k I \succ 0$
for all $w \in \mathbb{R}^n$; that is, $J_k$ is $\lambda_k$-strongly convex.
(b) Suppose that $\eta \in \left(0, \tfrac{2}{L + \lambda_0}\right)$.
Consider the GD-TDR iteration
$w_{k+1} = w_k - \eta \left( \nabla J(w_k) + \lambda_k w_k \right).$
Then the iterates remain bounded and converge to a minimizer $w^{*}$ of the unregularized cost $J$.
(c) For each $k$, the unique minimizer $w_k^{*}$ of $J_k$ satisfies $\| w_k^{*} - w^{*} \| \leq \tfrac{\lambda_k}{\lambda_{\min}(A) + \lambda_k} \| w^{*} \|$.
In particular, when $A \succ 0$, $w_k^{*} \to w^{*}$ as $k \to \infty$, so the regularization bias vanishes asymptotically.
Proof: (a) Since $\nabla^2 J_k(w) = A + \lambda_k I$ and $A \succeq 0$, every eigenvalue of $\nabla^2 J_k(w)$ is at least $\lambda_k > 0$.
Hence $J_k$ is $\lambda_k$-strongly convex for every $k$, which proves (a).
(b) When $\eta \in \left(0, \tfrac{2}{L + \lambda_0}\right)$, the eigenvalues of $I - \eta (A + \lambda_k I)$ lie strictly inside $(-1, 1)$ for every $k$.
Thus each GD-TDR step is a contraction toward the unique minimizer $w_k^{*}$ of $J_k$. Gradient descent on a strongly convex, smooth function with a step size in $\left(0, \tfrac{2}{L + \lambda_0}\right)$ therefore drives $\| w_k - w_k^{*} \| \to 0$; combined with $w_k^{*} \to w^{*}$ from part (c), this yields $w_k \to w^{*}$.
(c) By strong convexity of $J_k$, its minimizer is $w_k^{*} = (A + \lambda_k I)^{-1} b$, so $w_k^{*} - w^{*} = -\lambda_k (A + \lambda_k I)^{-1} w^{*}$, and the stated bound follows from $\| (A + \lambda_k I)^{-1} \| \leq (\lambda_{\min}(A) + \lambda_k)^{-1}$. $\blacksquare$
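Part (a) is easy to verify numerically. The Python sketch below (an illustrative matrix; the rank deficiency is forced on purpose) confirms that $A + \lambda_k I$ stays uniformly positive definite even when $A$ itself is singular:

```python
import numpy as np

# Numerical check of Theorem 1(a): eigenvalues of A + lam_k*I are >= lam_k.
rng = np.random.default_rng(4)
X = rng.standard_normal((20, 5))
X[:, 4] = X[:, 3]                 # duplicate a column: A becomes singular
A = X.T @ X
lam0, gamma = 2.0, 0.1            # illustrative schedule parameters
for k in range(5):
    lam_k = lam0 * np.exp(-gamma * k)
    min_eig = np.linalg.eigvalsh(A + lam_k * np.eye(5)).min()
    assert min_eig >= lam_k - 1e-9    # uniform strong convexity at iteration k
print("strong-convexity bound holds for all checked k")
```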
Remark 1: The decay rule $\lambda_k = \lambda_0 e^{-\gamma k}$ interpolates continuously between classical GD ($\lambda_0 = 0$) and GD-QR ($\gamma = 0$): $\lambda_0$ sets the initial curvature boost, while $\gamma$ determines how quickly the regularization, and hence its bias, is removed.

To visualize these differences, Table 1 summarizes the key aspects of the three schemes.

5 Application to Online Parameter Identification
This section shows how common linear identification models can be written as single-layer linear ANNs, which allows applying GD-TDR directly.
5.1 Discrete Transfer Functions
Consider a discrete transfer function
$G(z) = \dfrac{b_1 z^{-1} + \cdots + b_m z^{-m}}{1 + a_1 z^{-1} + \cdots + a_n z^{-n}},$
with unknown coefficients $a_1, \ldots, a_n$ and $b_1, \ldots, b_m$.
Defining the regressor and parameter vectors
$x_k = [-y_{k-1}, \ldots, -y_{k-n}, u_{k-1}, \ldots, u_{k-m}]^{\top}, \qquad w = [a_1, \ldots, a_n, b_1, \ldots, b_m]^{\top},$
the model becomes
$y_k = w^{\top} x_k,$
which is exactly the single-layer linear ANN of Section 2, so GD-TDR applies without modification.
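As an illustration of this construction, the Python sketch below builds the delayed regressors for a hypothetical second-order ARX model (coefficients chosen arbitrarily) and recovers them by batch least squares:

```python
import numpy as np

# Hypothetical second-order ARX model: y_k = -a1*y_{k-1} - a2*y_{k-2} + b1*u_{k-1}
a1, a2, b1 = -1.5, 0.7, 1.0       # stable example coefficients (illustrative)
rng = np.random.default_rng(2)
u = rng.standard_normal(500)
y = np.zeros(500)
for k in range(2, 500):
    y[k] = -a1 * y[k - 1] - a2 * y[k - 2] + b1 * u[k - 1]

# Stack regressors x_k = [-y_{k-1}, -y_{k-2}, u_{k-1}] as rows of X.
X = np.column_stack([-y[1:-1], -y[:-2], u[1:-1]])
t = y[2:]                                   # targets y_k
w = np.linalg.lstsq(X, t, rcond=None)[0]    # batch LS estimate of [a1, a2, b1]
print(w)
```

With noiseless data and a persistently exciting input, the batch solution recovers the coefficients exactly; the same regressor construction feeds the online GD-TDR update.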

Figure 2: Single-layer ANN representation for parameter identification.
5.2 Discrete State-Space Models
For a discrete state-space (DSS) system
$x_{k+1} = A x_k + B u_k,$
with appropriate dimensions, the right-hand side is linear in the unknown entries of $A$ and $B$, so each state component can be written as the output of a linear ANN with regressor $[x_k^{\top}, u_k^{\top}]^{\top}$.
A practical issue is causality: to update parameters at time $k$, the target $x_{k+1}$ is not yet available. The regression is therefore shifted one step, using the delayed signals $x_{k-1}$ and $u_{k-1}$ to predict the current state $x_k$ (Fig. 4).
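The delayed-regressor idea can be sketched as follows in Python (a hypothetical stable 2-state system; the matrices are illustrative), recovering $[A \mid B]$ by regressing the current state on the delayed state and input:

```python
import numpy as np

# Hypothetical 2-state DSS system x_{k+1} = A x_k + B u_k.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
rng = np.random.default_rng(3)
N = 300
x = np.zeros((N, 2)); u = rng.standard_normal((N, 1))
for k in range(N - 1):
    x[k + 1] = A @ x[k] + B @ u[k]

# Causal (delayed) regression: at time k, predict x_k from x_{k-1}, u_{k-1}.
Phi = np.hstack([x[:-1], u[:-1]])                     # rows [x_{k-1}^T, u_{k-1}^T]
Theta = np.linalg.lstsq(Phi, x[1:], rcond=None)[0].T  # estimate of [A | B]
print(Theta)
```

Each row of `Theta` corresponds to one state equation, i.e., to one linear neuron trained on the shared delayed regressor.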

Figure 3: DSS model viewed as a linear ANN mapping.

Figure 4: Delay-based scheme for DSS parameter identification.
Remark 2: Although the focus is on linear-in-parameters models, the same idea can be used to identify local linearizations of nonlinear systems around operating points, provided the regressors are constructed accordingly.
6 Numerical Validation
The numerical validation considers online parameter identification of a chaotic system whose dynamics are equivalent to those of a Chua-type oscillator. Chaotic trajectories provide a demanding excitation pattern for adaptive algorithms and are known to expose slow transients and stagnation effects in gradient-based learning [19,20].
The Chua oscillator is described by
$\dot{x} = \alpha \left( y - x - \varphi(x) \right), \qquad \dot{y} = x - y + z, \qquad \dot{z} = -\beta y,$
where $\varphi(x) = m_1 x + \tfrac{1}{2}(m_0 - m_1)\left( |x + 1| - |x - 1| \right)$ is the piecewise-linear nonlinearity.
For the parameter values $\alpha = 15.6$, $\beta = 25$, $m_0 = -8/7$, and $m_1 = -5/7$, the system exhibits the well-known double-scroll chaotic attractor shown in Fig. 5.

Figure 5: Trajectory of Chua dynamics (chaotic attractor).
To pose an identification problem that is linear in the unknown parameters, the dynamics are rewritten as
$\dot{x} = \alpha d_1, \qquad \dot{y} = d_2, \qquad \dot{z} = \beta d_3,$
with $d_1 = y - x - \varphi(x)$, $d_2 = x - y + z$, and $d_3 = -y$.
In this form, the signals $d_1$ and $d_3$ act as known regressors computed from the measured states, and the unknown parameters $\alpha$ and $\beta$ enter linearly, so the identification reduces to training a single-layer linear ANN.
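Assuming noiseless derivative measurements (the paper instead estimates the parameters online with GD-TDR), this linear-in-parameters structure can be sketched in Python: simulate the oscillator, form the regressors $d_1$ and $d_3$, and estimate $\alpha$ and $\beta$ by least squares:

```python
import numpy as np

# Chua parameters (values from the Appendix) and piecewise-linear nonlinearity.
alpha, beta, m0, m1 = 15.6, 25.0, -8/7, -5/7
phi = lambda x: m1*x + 0.5*(m0 - m1)*(abs(x + 1) - abs(x - 1))

# Simulate one trajectory with forward Euler (small dt for the fast dynamics).
dt, N = 1e-4, 50_000
x, y, z = 0.1, 0.0, 0.0
D1, D3, XP, ZP = [], [], [], []
for _ in range(N):
    d1, d2, d3 = y - x - phi(x), x - y + z, -y
    xp, yp, zp = alpha*d1, d2, beta*d3        # "measured" derivatives (noiseless)
    D1.append(d1); D3.append(d3); XP.append(xp); ZP.append(zp)
    x, y, z = x + dt*xp, y + dt*yp, z + dt*zp

# Linear-in-parameters estimates: xdot = alpha*d1 and zdot = beta*d3.
D1, D3, XP, ZP = map(np.array, (D1, D3, XP, ZP))
alpha_hat = (D1 @ XP) / (D1 @ D1)
beta_hat  = (D3 @ ZP) / (D3 @ D3)
print(alpha_hat, beta_hat)
```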
Fig. 6 shows the main program used in the numerical simulations.

Figure 6: Main program in the MATLAB/Simulink® environment.
In the reported tests, the learning factor was set to
Fig. 7 shows the convergence of the estimated parameters towards the target values, while Fig. 8 reports the norm of the estimation error. The final identified parameters for GD-TDR, GD and GD-QR are listed in (34) to (36), respectively.

Figure 7: Convergence of GD, GD-QR and GD-TDR parameters.

Figure 8: Norm of the parameter-estimation error.
GD and GD-TDR algorithms converge to high-accuracy estimates; however, GD-TDR exhibits markedly faster convergence and substantially shorter stagnation transients. In the considered experiment, GD requires approximately
GD-QR mitigates oscillations during the convergence process but yields the poorest parameter convergence (
From an algorithmic viewpoint, these improvements can be interpreted through Theorem 1: the decaying regularization increases the curvature of the objective during the early iterations, eliminating the nearly flat directions responsible for plateaus, while its exponential decay removes the regularization bias so that the iterates still converge to the unregularized least-squares estimates.
7 Embedded Implementation on a Microcontroller
To evaluate suitability for low-processing-capacity platforms, GD-TDR was implemented on an STM32F4-Nucleo® microcontroller. The implementation was developed in the MATLAB/Simulink® environment and deployed using the Waijung® toolkit. Figs. 9 and 10 show the board-level program configuration, which follows the same signal flow as the numerical setup and adds serial-communication blocks for monitoring. In particular, the contents and programming of the block labeled Chua in Fig. 9 are shown in Fig. A1 of Appendix A.

Figure 9: Program configuration for the STM32F4-Nucleo® board.

Figure 10: CPU monitoring program.
Experimental Results
In this work, neuron-paralysis-like behavior is quantitatively assessed through a combination of indicators, including prolonged plateaus in the loss evolution, persistent attenuation of the effective gradient norm, and excessive transient growth of the parameter vector prior to convergence.
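These indicators can be computed from logged training histories. The Python sketch below (function name and thresholds are hypothetical, not taken from the paper) returns one boolean per indicator:

```python
import numpy as np

def paralysis_indicators(loss_hist, grad_hist, w_norm_hist, window=50, tol=1e-6):
    """Heuristic neuron-paralysis indicators (hypothetical thresholds)."""
    loss = np.asarray(loss_hist)
    grad = np.asarray(grad_hist)
    wn = np.asarray(w_norm_hist)
    plateau = np.abs(loss[-window:] - loss[-1]).max() < tol   # flat loss tail
    attenuated = grad[-window:].mean() < tol                  # vanishing gradient norm
    overshoot = wn.max() > 10 * wn[-1]                        # transient weight growth
    return bool(plateau), bool(attenuated), bool(overshoot)

# Flat loss, tiny gradients, no weight overshoot -> first two indicators fire.
print(paralysis_indicators([1.0]*100, [1e-9]*100, [1.0]*100))  # (True, True, False)
```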
Fig. 11 reports the convergence behavior observed on the microcontroller. The results are qualitatively consistent with the numerical simulations, indicating that the proposed method preserves its robustness against stagnation even under limited precision and memory.

Figure 11: Dynamics of convergence of the GD-TDR weights on the STM32F4-Nucleo® board.
8 Conclusions and Future Work
This paper introduced GD-TDR, a time-decaying quadratically regularized gradient-descent algorithm for training single-layer linear ANNs. The method is motivated by online identification problems in which the regressor data can be ill-conditioned and standard gradient descent may suffer from long plateaus, large intermediate weights, and neural-paralysis-like stagnation on finite-precision hardware. GD-TDR addresses this issue by enforcing strong convexity early in training through a quadratic penalty and then removing the penalty asymptotically via an exponential decay schedule.
A convergence theorem was provided that formalizes the key mechanism: for every iteration index the regularized objective remains strongly convex, so flat directions are eliminated, and under standard step-size conditions the iterates converge to the minimizer of the original (unregularized) least-squares cost as the regularization vanishes. The numerical validation on online identification of a Chua-type chaotic oscillator and the implementation on an STM32F4-Nucleo® microcontroller confirm that the proposed scheme converges faster than conventional gradient descent and significantly reduces stagnation transients, while preserving high identification accuracy.
Future work will focus on three directions. First, the decay schedule
Acknowledgement: The authors would like to thank Professor Alexander Poznyak for his valuable review of the work, as well as Bruce Dickinson for his motivation throughout the development of this research. They also acknowledge and are grateful for the funding provided by the IPN-SIP (SIP 20250023, 20250424, 20251300, 20251721, and 20253411), SECIHTI (CF-2023-I-1635), and the Sistema Nacional de Investigadores e Investigadoras (SNII) of Mexico.
Funding Statement: Funding was provided by the IPN-SIP (SIP 20250023, 20250424, 20251300, 20251721, 20253411 and MULTI-2026-0035), SECIHTI (CF-2023-I-1635), and the Sistema Nacional de Investigadores e Investigadoras (SNII) of Mexico.
Author Contributions: Conceptualization, Sergio Isai Palomino-Resendiz and César Ulises Solís-Cervantes; methodology, Diego Alonso Flores-Hernández and Sergio Isai Palomino-Resendiz; software, César Ulises Solís-Cervantes, Sergio Isai Palomino-Resendiz and Luis Alberto Cantera-Cantera; validation, Luis Alberto Cantera-Cantera and Jorge de Jesús Morales-Mercado; formal analysis, César Ulises Solís-Cervantes, Sergio Isai Palomino-Resendiz and Diego Alonso Flores-Hernández; investigation, Sergio Isai Palomino-Resendiz; resources, César Ulises Solís-Cervantes, Sergio Isai Palomino-Resendiz and Diego Alonso Flores-Hernández; data curation, Luis Alberto Cantera-Cantera and Jorge de Jesús Morales-Mercado; writing—original draft preparation, Sergio Isai Palomino-Resendiz; writing—review and editing, Sergio Isai Palomino-Resendiz and César Ulises Solís-Cervantes; visualization, Sergio Isai Palomino-Resendiz; supervision, Sergio Isai Palomino-Resendiz and César Ulises Solís-Cervantes; project administration, Sergio Isai Palomino-Resendiz; funding acquisition, Diego Alonso Flores-Hernández and Sergio Isai Palomino-Resendiz. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The data that support the findings of this study are available from the Corresponding Author, César Ulises Solís-Cervantes, upon reasonable request.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
| The following abbreviations are used in this manuscript: | |
| ANN | Artificial Neural Network |
| GD | Gradient Descent |
| GD-QR | Gradient Descent with Quadratic Regularization |
| GD-TDR | Gradient Descent with Time-Decaying Regularization |
| NP | Neural Paralysis |
| SLM | Stagnation in Local Minima |
The following MATLAB Function block implements (31).
function [xp, yp, zp] = fcn(x, y, z)
% Chua dynamics in the linear-in-parameters form of (31)
alpha = 15.6; beta = 25;            % parameters to be identified
m0 = -8/7; m1 = -5/7;               % slopes of the piecewise-linear nonlinearity
phi = m1*x + 0.5*(m0 - m1)*(abs(x + 1) - abs(x - 1));
d1 = y - x - phi;                   % regressor multiplying alpha
d2 = x - y + z;
d3 = -y;                            % regressor multiplying beta
xp = alpha*d1; yp = d2; zp = beta*d3;
end

Figure A1: Contents of the block called Chua of the main program.
References
1. Pillonetto G, Aravkin A, Gedon D, Ljung L, Ribeiro AH, Schön TB. Deep networks for system identification: a survey. Automatica. 2025;171(7):111907. doi:10.1016/j.automatica.2024.111907. [Google Scholar] [CrossRef]
2. Dong Q, Liu L, Wang P, Zhang L, Zhang J. Neural network-based parametric system identification: a comprehensive review. Int J Syst Sci. 2023;54(13):2676–88. doi:10.1080/00207721.2023.2241957. [Google Scholar] [CrossRef]
3. Yu P, Wan H, Zhang B, Wu Q, Zhao B, Xu C, et al. Review on system identification, control, and optimization based on artificial intelligence. Mathematics. 2025;13(6):952. doi:10.3390/math13060952. [Google Scholar] [CrossRef]
4. Prince SJD. Understanding deep learning. Cambridge, MA, USA: MIT Press; 2023. [Google Scholar]
5. Bishop CM, Bishop H. Deep learning: foundations and concepts. Cham, Switzerland: Springer; 2024. [Google Scholar]
6. Murphy KP. Probabilistic machine learning: an introduction. Cambridge, MA, USA: MIT Press; 2022. [Google Scholar]
7. Braun L, Dominé CCJ, Fitzgerald JE, Saxe AM. Exact learning dynamics of deep linear networks with prior knowledge. In: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). Red Hook, NY, USA: Curran Associates, Inc.; 2022. p. 6615–29. [Google Scholar]
8. Ziyin L, Li B, Meng X. Exact solutions of a deep linear network. In: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). Red Hook, NY, USA: Curran Associates, Inc.; 2022. p. 24446–58. [Google Scholar]
9. Gambella C, Ghaddar B, Naoum-Sawaya J. Optimization problems for machine learning: a survey. Eur J Oper Res. 2021;290(3):807–28. doi:10.1016/j.ejor.2020.08.045. [Google Scholar] [CrossRef]
10. Ahn K, Zhang J, Sra S. Understanding the unstable convergence of gradient descent. In: Proceedings of the 39th International Conference on Machine Learning (ICML). London, UK: PMLR; 2022. p. 247–57. [Google Scholar]
11. Xie J, Gao R, Nijkamp E, Zhu S-C, Wu YN. Cooperative training of fast thinking initializer and slow thinking solver for conditional learning. IEEE Trans Pattern Anal Mach Intell. 2022;44(8):3957–73. doi:10.1109/TPAMI.2021.3069023. [Google Scholar] [PubMed] [CrossRef]
12. Wurzberger J, Schwenker F. Learning in deep radial basis function networks. Entropy. 2024;26(5):368. doi:10.3390/e26050368. [Google Scholar] [PubMed] [CrossRef]
13. Kalina J, Vidnerová P, Janáček P. Highly robust training of regularized radial basis function networks. Kybernetika. 2024;60(1):38–59. doi:10.14736/kyb-2024-1-0038. [Google Scholar] [CrossRef]
14. Wang D, Gao N, Liu D, Li J, Lewis F. Recent progress in reinforcement learning and adaptive dynamic programming for advanced control applications. IEEE/CAA J Autom Sin. 2024;11(1):18–36. doi:10.1109/JAS.2023.123843. [Google Scholar] [PubMed] [CrossRef]
15. Le T-L, Huynh T-T, Hong S-K, Lin C-M. Hybrid neural network cerebellar model articulation controller design for non-linear dynamic time-varying plants. Front Neurosci. 2020;14:695. doi:10.3389/fnins.2020.00695. [Google Scholar] [PubMed] [CrossRef]
16. Razmi M, Macnab CJB. Near-optimal neural-network robot control with adaptive gravity compensation. Neurocomputing. 2020;389(6):83–92. doi:10.1016/j.neucom.2020.01.026. [Google Scholar] [CrossRef]
17. Tian Y, Zhang Y, Zhang H. Recent advances in stochastic gradient descent in deep learning. Mathematics. 2023;11(3):682. doi:10.3390/math11030682. [Google Scholar] [CrossRef]
18. Liang C, Ma W, Ma C, Guo L. Harnessing machine learning for identifying parameters in fractional chaotic systems. Appl Math Comput. 2025;500(2):129454. doi:10.1016/j.amc.2025.129454. [Google Scholar] [CrossRef]
19. Chakraborty S, Größmann AH, Benner P. Divide and conquer: learning chaotic dynamical systems using deep neural networks. Comput Methods Appl Mech Eng. 2024;430(8):117442. doi:10.1016/j.cma.2024.117442. [Google Scholar] [CrossRef]
20. Mikhaeil JM, Monfared Z, Durstewitz D. On the difficulty of learning chaotic dynamics with RNNs. In: NIPS’22: Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates, Inc.; 2022. p. 11297–312. [Google Scholar]
21. Mohan N, Hosni A, Atef M. Neural networks implementations on FPGA for biomedical applications: a review. SN Computer Sci. 2024;5(8):1004. doi:10.1007/s42979-024-03381-4. [Google Scholar] [CrossRef]
22. Majumdar P. Spiking neural networks: a comprehensive review of diverse applications, research progress, challenges and future research directions. Evol Syst. 2025;16(4):125. doi:10.1007/s12530-025-09755-0. [Google Scholar] [CrossRef]
23. Achour EM, Malgouyres F, Gerchinovitz S. The loss landscape of deep linear neural networks: a second-order analysis. J Mach Learn Res. 2024;25:242. [Google Scholar]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

