Open Access
ARTICLE
Gradient Descent with Time-Decaying Regularization for Training Linear Neural Networks
1Departamento de Ingeniería en Control y Automatización, Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME), Unidad Zacatenco, Instituto Politécnico Nacional, Unidad Profesional Adolfo López Mateos. Av. Luis Enrique Erro S/N, Gustavo A. Madero, Zacatenco, Ciudad de México, México
2 Departamento de Control Automático, Centro de Investigación y de Estudios Avanzados (CINVESTAV) del Instituto Politécnico Nacional, Unidad Zacatenco, Av. Instituto Politécnico Nacional No. 2508, Col. San Pedro Zacatenco, Ciudad de México, México
3 Facultad de Ingeniería, Universidad Anáhuac México, Campus Norte, Huixquilucan, Estado de México, México
4 Sección de Estudios de Posgrado e Investigación, Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas (UPIITA), Instituto Politécnico Nacional, Av IPN 2580, La Laguna Ticoman, G. A. M., Ciudad de México, México
* Corresponding Author: César Ulises Solís-Cervantes. Email:
(This article belongs to the Special Issue: Computational Modeling, Simulation, and Algorithmic Methods for Dynamical Systems)
Computer Modeling in Engineering & Sciences 2026, 147(1), 26 https://doi.org/10.32604/cmes.2026.077726
Received 16 December 2025; Accepted 25 February 2026; Issue published 27 April 2026
Abstract
Many linear-in-parameters models arising in identification and control can be expressed as single-layer artificial neural networks (ANNs) with linear activation, enabling online learning via first-order optimization. In practice, however, standard gradient descent often exhibits slow convergence, large intermediate weights, and stagnation when the regressor data are ill-conditioned or computations are performed under finite precision. This paper proposes Gradient Descent with Time-Decaying Regularization (GD-TDR), a training algorithm that augments the quadratic loss with a regularization term whose weight decays exponentially in time. The proposed schedule enforces uniform strong convexity during early iterations, effectively mitigating neural-paralysis-like behavior associated with flat directions, while asymptotically vanishing so that the unregularized least-squares solution is recovered. A convergence theorem for GD-TDR is established and a concise pseudocode implementation is provided. Numerical and embedded experiments on an online identification problem of a Chua-type chaotic oscillator demonstrate that GD-TDR converges faster and avoids stagnation compared to standard gradient descent, without introducing the steady-state bias characteristic of fixed quadratic regularization.
Keywords
1 Introduction
Artificial neural networks (ANNs) have been extensively studied as flexible function approximators and parametric models across a wide range of scientific and engineering tasks. General surveys and application-driven reviews document the breadth of ANN deployments and motivate their use in identification and control, particularly when explicit first-principles modeling is difficult or when data-driven adaptation is required [1–3]. Foundational treatments established the core modeling paradigms and training principles, including linear and nonlinear network structures and their algorithmic implementations [4–6]. In this classical view, many practical learning rules can be interpreted as iterative optimization procedures acting on a quadratic or near-quadratic objective, a perspective that remains central to modern online learning formulations.
A large portion of the ANN training literature is rooted in incremental (first-order) updates. Recent theoretical analyses of learning dynamics in deep linear networks provide explicit characterizations of transient behavior and the interaction between initialization, regularization, and optimization geometry under gradient-based training [7,8]. Within the quadratic-loss setting, basic gradient-descent rules and their stochastic variants remain canonical examples of first-order adaptation mechanisms and highlight how data statistics, conditioning, and step-size choices shape stability and speed of learning [9,10]. In parallel, modern energy-based learning continues to emphasize the role of objective shaping and conditioning in learnability [11]. These works collectively support the view that optimization geometry—not only model expressiveness—plays a decisive role in whether training proceeds smoothly or becomes trapped in slow transient regimes.
Beyond standard multilayer architectures, several specialized ANN families have been developed for robustness, interpretability, and control-oriented deployment. Radial basis function networks and their robust variants provide a well-established pathway to stable approximation under uncertainty and noise [12,13]. In adaptive and self-learning control, ANN-based schemes have been reported for real-time compensation and online tuning, where training must remain stable under streaming data and limited numerical precision [14]. Related approaches in intelligent control also include fuzzy and cerebellar-model architectures that stress adaptation under nonlinearities and disturbances [15,16]. Complementary lines of research continue to refine computationally efficient first-order training, including recent advances in stochastic gradient descent variants [17].
A particularly demanding class of identification problems arises in nonlinear and chaotic dynamics, where sensitivity to initial conditions and measurement noise can degrade learning reliability. Recent studies illustrate both the feasibility of parameter identification in chaotic systems and the numerical challenges of learning from chaotic trajectories [18–20]. These challenges are amplified in embedded or resource-constrained implementations, where finite-precision arithmetic and strict real-time requirements can exacerbate ill-conditioning and lead to slow or stagnant learning. Recent reviews on hardware realizations of neural methods, including FPGA-oriented implementations and embedded control applications, highlight the practical importance of training rules that remain stable and well-conditioned under limited precision. Consistent with these trends, widely used embedded platforms and rapid-prototyping toolchains have enabled end-to-end experimental validation of online learning strategies on microcontrollers [21,22].
Despite the breadth of architectures and applications, an enduring challenge in first-order online training is the susceptibility to slow plateaus and weight growth when the regressor is ill-conditioned or when optimization directions become nearly flat. Classical quadratic (L2/Tikhonov) regularization is a standard remedy to improve conditioning, yet fixed regularization may introduce steady-state bias when the target objective is the unregularized least-squares criterion. This motivates strategies that improve early-stage conditioning while preserving asymptotic fidelity to the original objective, which is the central perspective adopted in this work.
1.2 Description and Main Contributions
Single-layer ANNs with a linear activation function are equivalent to linear regression models and are widely used to represent linear-in-parameters structures in system identification, adaptive filtering, and control. In these applications, the model output can be written as a linear combination of known regressors and unknown parameters, so that training reduces to minimizing a least-squares functional. Although a closed-form solution exists for batch least squares, embedded and real-time settings often require iterative, lightweight, and online algorithms. For this reason, first-order methods based on gradient descent remain attractive due to their low computational complexity and ease of implementation [2,4,17].
In practice, however, standard gradient descent may perform poorly when the regressor data are ill-conditioned or nearly rank-deficient. These situations are frequent in online identification problems with delayed signals, correlated regressors, or limited excitation. The resulting cost surface can contain nearly flat directions, which leads to slow progress, long plateaus, and very large intermediate weights. On finite-precision hardware, such dynamics can manifest as training stagnation and numerical instabilities that are commonly described as neural-paralysis-like plateau behavior in the neural-network literature [23]. In the linear setting considered here, the effect is not caused by saturation of nonlinear activation functions, but rather by loss of curvature and poor conditioning of the quadratic objective.
A standard remedy is to augment the least-squares loss with a quadratic (Tikhonov) regularizer, which enforces strong convexity and penalizes large weights. Fixed regularization, however, introduces a bias: the minimizer of the regularized problem does not generally coincide with the minimizer of the original least-squares cost. This trade-off is particularly undesirable in identification tasks, where asymptotic accuracy is essential.
This work proposes GD-TDR (Gradient Descent Algorithm with Regularizer—Time Decay), a first-order scheme that interpolates between these two extremes. The algorithm employs a quadratic regularizer whose coefficient decays exponentially over time. As a result, the early iterations benefit from improved curvature and bounded weights, while the regularization vanishes asymptotically and the algorithm recovers the minimizer of the unregularized least-squares functional.
The main contributions of this work are summarized as follows:
• a unified analytical framework that explicitly connects classical gradient descent, gradient descent with fixed quadratic (L2) regularization, and the proposed Gradient Descent with Time-Decaying Regularization (GD-TDR) through a single decay parameter, thereby clarifying their structural similarities and fundamental differences;
• a rigorous convergence theorem establishing that the time-decaying regularization enforces uniform strong convexity during the transient phase while asymptotically recovering the minimizer of the unregularized least-squares problem;
• a concise and self-contained pseudocode implementation of GD-TDR that directly reflects the theoretical development and facilitates reproducible implementation;
• a comprehensive numerical and embedded validation, including online parameter identification of a Chua-type chaotic oscillator and a real-time implementation on an STM32F4-Nucleo® microcontroller, demonstrating accelerated convergence and mitigation of stagnation without steady-state bias.
The paper is organized as follows. Section 2 introduces the linear ANN model and the least-squares training objective. Section 3 compares gradient-descent training schemes and presents GD-TDR. Section 4 states and proves the convergence theorem and provides the GD-TDR pseudocode. Section 5 shows how common identification models can be written in linear ANN form. Sections 6 and 7 report the numerical and embedded validation, respectively. Section 8 concludes the paper and outlines future work.
2 Single-Layer Linear ANN and Least-Squares Training
2.1 Network Model
Consider a single-layer ANN with linear activation (purelin) and no bias. Its output is
$\hat{y} = w^{\top} x = \sum_{i=1}^{n} w_i x_i,$
where the input vector is $x = [x_1, \ldots, x_n]^{\top} \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$ is the vector of trainable weights.
Figure 1: Single-layer ANN with linear activation.
2.2 Batch Least-Squares Objective
Given a finite data set $\{(x_k, y_k)\}_{k=1}^{N}$, training minimizes the least-squares functional
$J(w) = \tfrac{1}{2} \sum_{k=1}^{N} \left( y_k - w^{\top} x_k \right)^2.$
Let $X = [x_1, \ldots, x_N]^{\top} \in \mathbb{R}^{N \times n}$ and $Y = [y_1, \ldots, y_N]^{\top}$. Then
$J(w) = \tfrac{1}{2} \| Y - X w \|^2.$
The gradient and Hessian of $J$ with respect to $w$ are
$\nabla J(w) = X^{\top}(X w - Y), \qquad \nabla^2 J(w) = X^{\top} X.$
The Hessian is positive semidefinite. It is positive definite (and hence $J$ is strongly convex) if and only if $X$ has full column rank, i.e., $\operatorname{rank}(X) = n$.
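To make this rank condition concrete, the following Python sketch (the paper's own implementations use MATLAB/Simulink; the matrices here are purely illustrative) checks strong convexity by inspecting the smallest eigenvalue of the Hessian $X^{\top}X$:

```python
import numpy as np

# Two regressor matrices: one well-conditioned, one nearly rank-deficient.
X_good = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_bad  = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-8], [2.0, 2.0]])

for name, X in [("good", X_good), ("bad", X_bad)]:
    H = X.T @ X                       # Hessian of the least-squares cost
    eigs = np.linalg.eigvalsh(H)
    # J is strongly convex iff the smallest eigenvalue is strictly positive.
    print(name, eigs.min())
```

In the nearly rank-deficient case the smallest eigenvalue is numerically zero, which is precisely the flat-direction situation that motivates the regularization discussed in Section 3.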
3 Training Schemes and Algorithmic Comparisons
This section summarizes three closely related first-order schemes: classical gradient descent (GD), gradient descent with a fixed quadratic regularizer (GD-QR), and the proposed time-decaying regularized scheme (GD-TDR). Only GD-TDR is presented in pseudocode form (Section 4).
3.1 Classical Gradient Descent (GD)
Using (2), the GD update with step size $\eta > 0$ is
$w_{k+1} = w_k - \eta \nabla J(w_k) = w_k + \eta X^{\top}(Y - X w_k),$
where $k$ denotes the iteration index and $w_0$ is a given initialization.
3.2 Gradient Descent with Fixed Quadratic Regularization (GD-QR)
A standard approach to improve conditioning is to add a quadratic regularizer,
$J_{\lambda}(w) = J(w) + \tfrac{\lambda}{2} \| w \|^2,$
where $\lambda > 0$ is a fixed regularization weight. The gradient is $\nabla J_{\lambda}(w) = \nabla J(w) + \lambda w$,
so the update becomes
$w_{k+1} = (1 - \eta \lambda)\, w_k - \eta \nabla J(w_k).$
Fixed regularization controls weight growth and improves curvature, but the minimizer of $J_{\lambda}$, namely $w_{\lambda}^{*} = (X^{\top}X + \lambda I)^{-1} X^{\top} Y$, does not coincide with the least-squares minimizer, so a steady-state bias is introduced.
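This bias can be seen directly from the closed-form minimizers. The short Python sketch below (illustrative data; the regularization weight is arbitrary) compares the Tikhonov solution with the unregularized least-squares solution on noiseless data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless data: the LS minimizer equals w_true

A, b = X.T @ X, X.T @ y
lam = 1.0                           # fixed (arbitrary) regularization weight
w_ls  = np.linalg.solve(A, b)                    # unregularized minimizer
w_reg = np.linalg.solve(A + lam * np.eye(3), b)  # Tikhonov minimizer

print(np.linalg.norm(w_ls - w_true))   # essentially zero: no bias
print(np.linalg.norm(w_reg - w_true))  # strictly positive: steady-state bias
```

Shrinking `lam` toward zero shrinks the bias toward zero, which is exactly the mechanism GD-TDR exploits over time.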
3.3 Proposed GD-TDR (Time-Decaying Quadratic Regularization)
GD-TDR replaces the constant regularization weight $\lambda$ by the exponentially decaying sequence
$\lambda_k = \lambda_0 e^{-\gamma k}, \qquad \lambda_0 > 0, \; \gamma > 0.$
The weight update becomes
$w_{k+1} = w_k - \eta \left( \nabla J(w_k) + \lambda_k w_k \right).$
Two limiting cases highlight the relationship among the three schemes:
• Classical GD: setting $\lambda_0 = 0$ gives $\lambda_k \equiv 0$ and recovers the standard GD update.
• Fixed regularization: setting $\gamma = 0$ gives $\lambda_k \equiv \lambda_0$ and recovers GD-QR.
Therefore, GD-TDR provides a continuous mechanism to improve conditioning early in training while asymptotically removing the regularization bias. The theoretical properties of this scheme are stated in Theorem 1.
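The three schemes can therefore be written as one update rule in which the decay parameters select the variant. The following Python sketch (illustrative problem sizes and hyperparameters, not the paper's experimental settings) implements this unified update:

```python
import numpy as np

def train(X, y, eta, lam0, gamma, iters):
    """Unified update: lam0=0 -> GD; gamma=0 -> GD-QR; otherwise GD-TDR."""
    w = np.zeros(X.shape[1])
    for k in range(iters):
        lam_k = lam0 * np.exp(-gamma * k)       # time-decaying regularization weight
        grad = X.T @ (X @ w - y) + lam_k * w    # regularized gradient
        w -= eta * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true

w_gd  = train(X, y, eta=0.003, lam0=0.0, gamma=0.0,  iters=2000)   # classical GD
w_tdr = train(X, y, eta=0.003, lam0=5.0, gamma=0.01, iters=2000)   # GD-TDR
print(np.linalg.norm(w_gd - w_true), np.linalg.norm(w_tdr - w_true))
```

Because the regularization vanishes, both runs converge to the same unregularized least-squares solution; a `gamma=0` run with the same `lam0` would instead settle at a biased estimate.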
4 Theoretical Properties and GD-TDR Pseudocode
We restate the objective in compact form. Define
$A = X^{\top} X, \qquad b = X^{\top} Y,$
and let
$J(w) = \tfrac{1}{2} w^{\top} A w - b^{\top} w + c,$
with $c = \tfrac{1}{2} \| Y \|^2$ a constant that does not affect the minimizer.
Theorem 1 (Online GD-TDR): Let $A = X^{\top} X \succeq 0$ with largest eigenvalue $L = \lambda_{\max}(A)$, and let $w^{*}$ denote a minimizer of $J$.
For $\lambda_0 > 0$ and $\gamma > 0$, define
$\lambda_k = \lambda_0 e^{-\gamma k}$
and
$J_k(w) = J(w) + \tfrac{\lambda_k}{2} \| w \|^2.$
(a) For every $k \geq 0$, the regularized objective $J_k$
satisfies
$\nabla^2 J_k(w) = A + \lambda_k I \succeq \lambda_k I \succ 0$
for all $w \in \mathbb{R}^n$; that is, $J_k$ is $\lambda_k$-strongly convex.
(b) Suppose that $\eta \in \left(0, \tfrac{2}{L + \lambda_0}\right)$.
Consider the GD-TDR iteration
$w_{k+1} = w_k - \eta \left( \nabla J(w_k) + \lambda_k w_k \right).$
Then the iterates remain bounded and converge to a minimizer $w^{*}$ of the unregularized cost $J$.
(c) For each $k$, the unique minimizer $w_k^{*}$ of $J_k$ satisfies $\| w_k^{*} - w^{*} \| \leq \tfrac{\lambda_k}{\lambda_{\min}(A) + \lambda_k} \| w^{*} \|$.
In particular, when $A \succ 0$, $w_k^{*} \to w^{*}$ as $k \to \infty$, so the regularization bias vanishes asymptotically.
Proof: (a) Since $\nabla^2 J_k(w) = A + \lambda_k I$ and $A \succeq 0$, every eigenvalue of $\nabla^2 J_k(w)$ is at least $\lambda_k > 0$.
Hence $J_k$ is $\lambda_k$-strongly convex for every $k$, which proves (a).
(b) When $\eta \in \left(0, \tfrac{2}{L + \lambda_0}\right)$, the eigenvalues of $I - \eta (A + \lambda_k I)$ lie strictly inside $(-1, 1)$ for every $k$.
Thus each GD-TDR step is a contraction toward the unique minimizer $w_k^{*}$ of $J_k$. Gradient descent on a strongly convex, smooth function with a step size in $\left(0, \tfrac{2}{L + \lambda_0}\right)$ therefore drives $\| w_k - w_k^{*} \| \to 0$; combined with $w_k^{*} \to w^{*}$ from part (c), this yields $w_k \to w^{*}$.
(c) By strong convexity of $J_k$, its minimizer is $w_k^{*} = (A + \lambda_k I)^{-1} b$, so $w_k^{*} - w^{*} = -\lambda_k (A + \lambda_k I)^{-1} w^{*}$, and the stated bound follows from $\| (A + \lambda_k I)^{-1} \| \leq (\lambda_{\min}(A) + \lambda_k)^{-1}$. $\blacksquare$
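Part (a) is easy to verify numerically. The Python sketch below (an illustrative matrix; the rank deficiency is forced on purpose) confirms that $A + \lambda_k I$ stays uniformly positive definite even when $A$ itself is singular:

```python
import numpy as np

# Numerical check of Theorem 1(a): eigenvalues of A + lam_k*I are >= lam_k.
rng = np.random.default_rng(4)
X = rng.standard_normal((20, 5))
X[:, 4] = X[:, 3]                 # duplicate a column: A becomes singular
A = X.T @ X
lam0, gamma = 2.0, 0.1            # illustrative schedule parameters
for k in range(5):
    lam_k = lam0 * np.exp(-gamma * k)
    min_eig = np.linalg.eigvalsh(A + lam_k * np.eye(5)).min()
    assert min_eig >= lam_k - 1e-9    # uniform strong convexity at iteration k
print("strong-convexity bound holds for all checked k")
```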
Remark 1: The decay rule $\lambda_k = \lambda_0 e^{-\gamma k}$ interpolates continuously between classical GD ($\lambda_0 = 0$) and GD-QR ($\gamma = 0$): $\lambda_0$ sets the initial curvature boost, while $\gamma$ determines how quickly the regularization, and hence its bias, is removed.

To visualize these differences, Table 1 summarizes the key aspects of the three schemes.

5 Application to Online Parameter Identification
This section shows how common linear identification models can be written as single-layer linear ANNs, which allows applying GD-TDR directly.
5.1 Discrete Transfer Functions
Consider a discrete transfer function
$G(z) = \dfrac{b_1 z^{-1} + \cdots + b_m z^{-m}}{1 + a_1 z^{-1} + \cdots + a_n z^{-n}},$
with unknown coefficients $a_1, \ldots, a_n$ and $b_1, \ldots, b_m$.
Defining the regressor and parameter vectors
$x_k = [-y_{k-1}, \ldots, -y_{k-n}, u_{k-1}, \ldots, u_{k-m}]^{\top}, \qquad w = [a_1, \ldots, a_n, b_1, \ldots, b_m]^{\top},$
the model becomes
$y_k = w^{\top} x_k,$
which is exactly the single-layer linear ANN of Section 2, so GD-TDR applies without modification.
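As an illustration of this construction, the Python sketch below builds the delayed regressors for a hypothetical second-order ARX model (coefficients chosen arbitrarily) and recovers them by batch least squares:

```python
import numpy as np

# Hypothetical second-order ARX model: y_k = -a1*y_{k-1} - a2*y_{k-2} + b1*u_{k-1}
a1, a2, b1 = -1.5, 0.7, 1.0       # stable example coefficients (illustrative)
rng = np.random.default_rng(2)
u = rng.standard_normal(500)
y = np.zeros(500)
for k in range(2, 500):
    y[k] = -a1 * y[k - 1] - a2 * y[k - 2] + b1 * u[k - 1]

# Stack regressors x_k = [-y_{k-1}, -y_{k-2}, u_{k-1}] as rows of X.
X = np.column_stack([-y[1:-1], -y[:-2], u[1:-1]])
t = y[2:]                                   # targets y_k
w = np.linalg.lstsq(X, t, rcond=None)[0]    # batch LS estimate of [a1, a2, b1]
print(w)
```

With noiseless data and a persistently exciting input, the batch solution recovers the coefficients exactly; the same regressor construction feeds the online GD-TDR update.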

Figure 2: Single-layer ANN representation for parameter identification.
5.2 Discrete State-Space Models
For a discrete state-space (DSS) system
$x_{k+1} = A x_k + B u_k,$
with appropriate dimensions, the right-hand side is linear in the unknown entries of $A$ and $B$, so each state component can be written as the output of a linear ANN with regressor $[x_k^{\top}, u_k^{\top}]^{\top}$.
A practical issue is causality: to update parameters at time $k$, the target $x_{k+1}$ is not yet available. The regression is therefore shifted one step, using the delayed signals $x_{k-1}$ and $u_{k-1}$ to predict the current state $x_k$ (Fig. 4).
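The delayed-regressor idea can be sketched as follows in Python (a hypothetical stable 2-state system; the matrices are illustrative), recovering $[A \mid B]$ by regressing the current state on the delayed state and input:

```python
import numpy as np

# Hypothetical 2-state DSS system x_{k+1} = A x_k + B u_k.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
rng = np.random.default_rng(3)
N = 300
x = np.zeros((N, 2)); u = rng.standard_normal((N, 1))
for k in range(N - 1):
    x[k + 1] = A @ x[k] + B @ u[k]

# Causal (delayed) regression: at time k, predict x_k from x_{k-1}, u_{k-1}.
Phi = np.hstack([x[:-1], u[:-1]])                     # rows [x_{k-1}^T, u_{k-1}^T]
Theta = np.linalg.lstsq(Phi, x[1:], rcond=None)[0].T  # estimate of [A | B]
print(Theta)
```

Each row of `Theta` corresponds to one state equation, i.e., to one linear neuron trained on the shared delayed regressor.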

Figure 3: DSS model viewed as a linear ANN mapping.

Figure 4: Delay-based scheme for DSS parameter identification.
Remark 2: Although the focus is on linear-in-parameters models, the same idea can be used to identify local linearizations of nonlinear systems around operating points, provided the regressors are constructed accordingly.
6 Numerical Validation
The numerical validation considers online parameter identification of a chaotic system whose dynamics are equivalent to those of a Chua-type oscillator. Chaotic trajectories provide a demanding excitation pattern for adaptive algorithms and are known to expose slow transients and stagnation effects in gradient-based learning [19,20].
The Chua oscillator is described by
$\dot{x} = \alpha \left( y - x - \varphi(x) \right), \qquad \dot{y} = x - y + z, \qquad \dot{z} = -\beta y,$
where $\varphi(x) = m_1 x + \tfrac{1}{2}(m_0 - m_1)\left( |x + 1| - |x - 1| \right)$ is the piecewise-linear nonlinearity.
For the parameter values $\alpha = 15.6$, $\beta = 25$, $m_0 = -8/7$, and $m_1 = -5/7$, the system exhibits the well-known double-scroll chaotic attractor shown in Fig. 5.

Figure 5: Trajectory of Chua dynamics (chaotic attractor).
To pose an identification problem that is linear in the unknown parameters, the dynamics are rewritten as
$\dot{x} = \alpha d_1, \qquad \dot{y} = d_2, \qquad \dot{z} = \beta d_3,$
with $d_1 = y - x - \varphi(x)$, $d_2 = x - y + z$, and $d_3 = -y$.
In this form, the signals $d_1$ and $d_3$ act as known regressors computed from the measured states, and the unknown parameters $\alpha$ and $\beta$ enter linearly, so the identification reduces to training a single-layer linear ANN.
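Assuming noiseless derivative measurements (the paper instead estimates the parameters online with GD-TDR), this linear-in-parameters structure can be sketched in Python: simulate the oscillator, form the regressors $d_1$ and $d_3$, and estimate $\alpha$ and $\beta$ by least squares:

```python
import numpy as np

# Chua parameters (values from the Appendix) and piecewise-linear nonlinearity.
alpha, beta, m0, m1 = 15.6, 25.0, -8/7, -5/7
phi = lambda x: m1*x + 0.5*(m0 - m1)*(abs(x + 1) - abs(x - 1))

# Simulate one trajectory with forward Euler (small dt for the fast dynamics).
dt, N = 1e-4, 50_000
x, y, z = 0.1, 0.0, 0.0
D1, D3, XP, ZP = [], [], [], []
for _ in range(N):
    d1, d2, d3 = y - x - phi(x), x - y + z, -y
    xp, yp, zp = alpha*d1, d2, beta*d3        # "measured" derivatives (noiseless)
    D1.append(d1); D3.append(d3); XP.append(xp); ZP.append(zp)
    x, y, z = x + dt*xp, y + dt*yp, z + dt*zp

# Linear-in-parameters estimates: xdot = alpha*d1 and zdot = beta*d3.
D1, D3, XP, ZP = map(np.array, (D1, D3, XP, ZP))
alpha_hat = (D1 @ XP) / (D1 @ D1)
beta_hat  = (D3 @ ZP) / (D3 @ D3)
print(alpha_hat, beta_hat)
```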
Fig. 6 shows the main program used in the numerical simulations.

Figure 6: Main program in the MATLAB/Simulink® environment.
In the reported tests, the learning factor was set to
Fig. 7 shows the convergence of the estimated parameters towards the target values, while Fig. 8 reports the norm of the estimation error. The final identified parameters for GD-TDR, GD and GD-QR are listed in (34) to (36), respectively.

Figure 7: Convergence of GD, GD-QR and GD-TDR parameters.

Figure 8: Norm of the parameter-estimation error.
GD and GD-TDR algorithms converge to high-accuracy estimates; however, GD-TDR exhibits markedly faster convergence and substantially shorter stagnation transients. In the considered experiment, GD requires approximately
GD-QR mitigates oscillations during the convergence process but yields the poorest parameter convergence (
From an algorithmic viewpoint, these improvements can be interpreted through Theorem 1: the decaying regularization increases the curvature of the objective during the early iterations, eliminating the nearly flat directions responsible for plateaus, while its exponential decay removes the regularization bias so that the iterates still converge to the unregularized least-squares estimates.
7 Embedded Implementation on a Microcontroller
To evaluate suitability for low-processing-capacity platforms, GD-TDR was implemented on an STM32F4-Nucleo® microcontroller. The implementation was developed in the MATLAB/Simulink® environment and deployed using the Waijung® toolkit. Figs. 9 and 10 show the board-level program configuration, which follows the same signal flow as the numerical setup and adds serial-communication blocks for monitoring. In particular, the contents and programming of the block labeled Chua in Fig. 9 are shown in Fig. A1 of Appendix A.

Figure 9: Program configuration for the STM32F4-Nucleo® board.

Figure 10: CPU monitoring program.
Experimental Results
In this work, neuron-paralysis-like behavior is quantitatively assessed through a combination of indicators, including prolonged plateaus in the loss evolution, persistent attenuation of the effective gradient norm, and excessive transient growth of the parameter vector prior to convergence.
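These indicators can be computed from logged training histories. The Python sketch below (function name and thresholds are hypothetical, not taken from the paper) returns one boolean per indicator:

```python
import numpy as np

def paralysis_indicators(loss_hist, grad_hist, w_norm_hist, window=50, tol=1e-6):
    """Heuristic neuron-paralysis indicators (hypothetical thresholds)."""
    loss = np.asarray(loss_hist)
    grad = np.asarray(grad_hist)
    wn = np.asarray(w_norm_hist)
    plateau = np.abs(loss[-window:] - loss[-1]).max() < tol   # flat loss tail
    attenuated = grad[-window:].mean() < tol                  # vanishing gradient norm
    overshoot = wn.max() > 10 * wn[-1]                        # transient weight growth
    return bool(plateau), bool(attenuated), bool(overshoot)

# Flat loss, tiny gradients, no weight overshoot -> first two indicators fire.
print(paralysis_indicators([1.0]*100, [1e-9]*100, [1.0]*100))  # (True, True, False)
```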
Fig. 11 reports the convergence behavior observed on the microcontroller. The results are qualitatively consistent with the numerical simulations, indicating that the proposed method preserves its robustness against stagnation even under limited precision and memory.

Figure 11: Dynamics of convergence of the GD-TDR weights on the STM32F4-Nucleo® board.
8 Conclusions and Future Work
This paper introduced GD-TDR, a time-decaying quadratically regularized gradient-descent algorithm for training single-layer linear ANNs. The method is motivated by online identification problems in which the regressor data can be ill-conditioned and standard gradient descent may suffer from long plateaus, large intermediate weights, and neural-paralysis-like stagnation on finite-precision hardware. GD-TDR addresses this issue by enforcing strong convexity early in training through a quadratic penalty and then removing the penalty asymptotically via an exponential decay schedule.
A convergence theorem was provided that formalizes the key mechanism: for every iteration index the regularized objective remains strongly convex, so flat directions are eliminated, and under standard step-size conditions the iterates converge to the minimizer of the original (unregularized) least-squares cost as the regularization vanishes. The numerical validation on online identification of a Chua-type chaotic oscillator and the implementation on an STM32F4-Nucleo® microcontroller confirm that the proposed scheme converges faster than conventional gradient descent and significantly reduces stagnation transients, while preserving high identification accuracy.
Future work will focus on three directions. First, the decay schedule
Acknowledgement: The authors would like to thank Professor Alexander Poznyak for his valuable review of the work, as well as Bruce Dickinson for his motivation throughout the development of this research. They also acknowledge and are grateful for the funding provided by the IPN-SIP (SIP 20250023, 20250424, 20251300, 20251721, and 20253411), SECIHTI (CF-2023-I-1635), and the Sistema Nacional de Investigadores e Investigadoras (SNII) of Mexico.
Funding Statement: Funding was provided by the IPN-SIP (SIP 20250023, 20250424, 20251300, 20251721, 20253411 and MULTI-2026-0035), SECIHTI (CF-2023-I-1635), and the Sistema Nacional de Investigadores e Investigadoras (SNII) of Mexico.
Author Contributions: Conceptualization, Sergio Isai Palomino-Resendiz and César Ulises Solís-Cervantes; methodology, Diego Alonso Flores-Hernández and Sergio Isai Palomino-Resendiz; software, César Ulises Solís-Cervantes, Sergio Isai Palomino-Resendiz and Luis Alberto Cantera-Cantera; validation, Luis Alberto Cantera-Cantera and Jorge de Jesús Morales-Mercado; formal analysis, César Ulises Solís-Cervantes, Sergio Isai Palomino-Resendiz and Diego Alonso Flores-Hernández; investigation, Sergio Isai Palomino-Resendiz; resources, César Ulises Solís-Cervantes, Sergio Isai Palomino-Resendiz and Diego Alonso Flores-Hernández; data curation, Luis Alberto Cantera-Cantera and Jorge de Jesús Morales-Mercado; writing—original draft preparation, Sergio Isai Palomino-Resendiz; writing—review and editing, Sergio Isai Palomino-Resendiz and César Ulises Solís-Cervantes; visualization, Sergio Isai Palomino-Resendiz; supervision, Sergio Isai Palomino-Resendiz and César Ulises Solís-Cervantes; project administration, Sergio Isai Palomino-Resendiz; funding acquisition, Diego Alonso Flores-Hernández and Sergio Isai Palomino-Resendiz. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The data that support the findings of this study are available from the Corresponding Author, César Ulises Solís-Cervantes, upon reasonable request.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
| The following abbreviations are used in this manuscript: | |
| ANN | Artificial Neural Network |
| GD | Gradient Descent |
| GD-QR | Gradient Descent with Quadratic Regularization |
| GD-TDR | Gradient Descent with Time-Decaying Regularization |
| NP | Neural Paralysis |
| SLM | Stagnation in Local Minima |
The following MATLAB Function block implements (31).
function [xp, yp, zp] = fcn(x, y, z)
% Chua dynamics in the linear-in-parameters form of (31)
alpha = 15.6; beta = 25;            % parameters to be identified
m0 = -8/7; m1 = -5/7;               % slopes of the piecewise-linear nonlinearity
phi = m1*x + 0.5*(m0 - m1)*(abs(x + 1) - abs(x - 1));
d1 = y - x - phi;                   % regressor multiplying alpha
d2 = x - y + z;
d3 = -y;                            % regressor multiplying beta
xp = alpha*d1; yp = d2; zp = beta*d3;
end

Figure A1: Contents of the block called Chua of the main program.
References
1. Pillonetto G, Aravkin A, Gedon D, Ljung L, Ribeiro AH, Schön TB. Deep networks for system identification: a survey. Automatica. 2025;171(7):111907. doi:10.1016/j.automatica.2024.111907. [Google Scholar] [CrossRef]
2. Dong Q, Liu L, Wang P, Zhang L, Zhang J. Neural network-based parametric system identification: a comprehensive review. Int J Syst Sci. 2023;54(13):2676–88. doi:10.1080/00207721.2023.2241957. [Google Scholar] [CrossRef]
3. Yu P, Wan H, Zhang B, Wu Q, Zhao B, Xu C, et al. Review on system identification, control, and optimization based on artificial intelligence. Mathematics. 2025;13(6):952. doi:10.3390/math13060952. [Google Scholar] [CrossRef]
4. Prince SJD. Understanding deep learning. Cambridge, MA, USA: MIT Press; 2023. [Google Scholar]
5. Bishop CM, Bishop H. Deep learning: foundations and concepts. Cham, Switzerland: Springer; 2024. [Google Scholar]
6. Murphy KP. Probabilistic machine learning: an introduction. Cambridge, MA, USA: MIT Press; 2022. [Google Scholar]
7. Braun L, Dominé CCJ, Fitzgerald JE, Saxe AM. Exact learning dynamics of deep linear networks with prior knowledge. In: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). Red Hook, NY, USA: Curran Associates, Inc.; 2022. p. 6615–29. [Google Scholar]
8. Ziyin L, Li B, Meng X. Exact solutions of a deep linear network. In: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). Red Hook, NY, USA: Curran Associates, Inc.; 2022. p. 24446–58. [Google Scholar]
9. Gambella C, Ghaddar B, Naoum-Sawaya J. Optimization problems for machine learning: a survey. Eur J Oper Res. 2021;290(3):807–28. doi:10.1016/j.ejor.2020.08.045. [Google Scholar] [CrossRef]
10. Ahn K, Zhang J, Sra S. Understanding the unstable convergence of gradient descent. In: Proceedings of the 39th International Conference on Machine Learning (ICML). London, UK: PMLR; 2022. p. 247–57. [Google Scholar]
11. Xie J, Gao R, Nijkamp E, Zhu S-C, Wu YN. Cooperative training of fast thinking initializer and slow thinking solver for conditional learning. IEEE Trans Pattern Anal Mach Intell. 2022;44(8):3957–73. doi:10.1109/TPAMI.2021.3069023. [Google Scholar] [PubMed] [CrossRef]
12. Wurzberger J, Schwenker F. Learning in deep radial basis function networks. Entropy. 2024;26(5):368. doi:10.3390/e26050368. [Google Scholar] [PubMed] [CrossRef]
13. Kalina J, Vidnerová P, Janáček P. Highly robust training of regularized radial basis function networks. Kybernetika. 2024;60(1):38–59. doi:10.14736/kyb-2024-1-0038. [Google Scholar] [CrossRef]
14. Wang D, Gao N, Liu D, Li J, Lewis F. Recent progress in reinforcement learning and adaptive dynamic programming for advanced control applications. IEEE/CAA J Autom Sin. 2024;11(1):18–36. doi:10.1109/JAS.2023.123843. [Google Scholar] [PubMed] [CrossRef]
15. Le T-L, Huynh T-T, Hong S-K, Lin C-M. Hybrid neural network cerebellar model articulation controller design for non-linear dynamic time-varying plants. Front Neurosci. 2020;14:695. doi:10.3389/fnins.2020.00695. [Google Scholar] [PubMed] [CrossRef]
16. Razmi M, Macnab CJB. Near-optimal neural-network robot control with adaptive gravity compensation. Neurocomputing. 2020;389(6):83–92. doi:10.1016/j.neucom.2020.01.026. [Google Scholar] [CrossRef]
17. Tian Y, Zhang Y, Zhang H. Recent advances in stochastic gradient descent in deep learning. Mathematics. 2023;11(3):682. doi:10.3390/math11030682. [Google Scholar] [CrossRef]
18. Liang C, Ma W, Ma C, Guo L. Harnessing machine learning for identifying parameters in fractional chaotic systems. Appl Math Comput. 2025;500(2):129454. doi:10.1016/j.amc.2025.129454. [Google Scholar] [CrossRef]
19. Chakraborty S, Größmann AH, Benner P. Divide and conquer: learning chaotic dynamical systems using deep neural networks. Comput Methods Appl Mech Eng. 2024;430(8):117442. doi:10.1016/j.cma.2024.117442. [Google Scholar] [CrossRef]
20. Mikhaeil JM, Monfared Z, Durstewitz D. On the difficulty of learning chaotic dynamics with RNNs. In: NIPS’22: Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates, Inc.; 2022. p. 11297–312. [Google Scholar]
21. Mohan N, Hosni A, Atef M. Neural networks implementations on FPGA for biomedical applications: a review. SN Computer Sci. 2024;5(8):1004. doi:10.1007/s42979-024-03381-4. [Google Scholar] [CrossRef]
22. Majumdar P. Spiking neural networks: a comprehensive review of diverse applications, research progress, challenges and future research directions. Evol Syst. 2025;16(4):125. doi:10.1007/s12530-025-09755-0. [Google Scholar] [CrossRef]
23. Achour EM, Malgouyres F, Gerchinovitz S. The loss landscape of deep linear neural networks: a second-order analysis. J Mach Learn Res. 2024;25:242. [Google Scholar]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

