Estimating Weibull Parameters Using Least Squares and Multilayer Perceptron vs. Bayes Estimation

Abstract: The Weibull distribution (WD) is regarded as one of the most useful members of the family of failure distributions in reliability and lifetime modeling. One of the most commonly used techniques for estimating its parameters is the ordinary least squares (OLS) method. In this study, we propose an approach, called the OLSMLP, that builds on the OLS method and combines it with a multilayer perceptron (MLP) neural network. The MLP addresses the heteroscedasticity that distorts the estimation of the parameters of the WD in the presence of outliers, and eases the difficulty of determining weights in the case of the weighted least squares (WLS) method. A second method is proposed by incorporating a weight into the general entropy (GE) loss function used to estimate the parameters of the WD, yielding a modified loss function (WGE). Furthermore, a Monte Carlo simulation is performed to examine the performance of the proposed OLSMLP method in comparison with approximate Bayesian estimation using the weighted GE loss function (BLWGE). The results of the simulation show that the two proposed methods produce good estimates even for small sample sizes. In addition, in terms of mean squared error and computation time, the techniques proposed here are typically preferable to the other available methods.

The probability density function (PDF) of the two-parameter WD is given by:

f(x; λ, ϑ) = (λ/ϑ)(x/ϑ)^(λ−1) exp[−(x/ϑ)^λ], x ≥ 0.  (1)

The cumulative distribution function (CDF) and the survival function S of the WD can be expressed as

F(x; λ, ϑ) = 1 − exp[−(x/ϑ)^λ],  (2)

S(x; λ, ϑ) = 1 − F(x; λ, ϑ) = exp[−(x/ϑ)^λ],  (3)

where the parameters ϑ and λ represent the scale and the shape of the distribution, respectively.
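As a quick numerical sanity check of Eqs. (1)-(3), the parameterization above matches SciPy's weibull_min distribution with c = λ and scale = ϑ. The following is a minimal sketch; the parameter values are arbitrary:

    import numpy as np
    from scipy.stats import weibull_min

    lam, theta = 2.0, 1.5                          # shape and scale, arbitrary
    x = np.linspace(0.01, 5.0, 200)

    pdf = weibull_min.pdf(x, c=lam, scale=theta)   # Eq. (1)
    cdf = weibull_min.cdf(x, c=lam, scale=theta)   # Eq. (2)
    sf  = weibull_min.sf(x, c=lam, scale=theta)    # Eq. (3)

    assert np.allclose(sf, 1.0 - cdf)              # S(x) = 1 - F(x)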
Several approaches to estimating the parameters of the WD have been proposed [13]. They can generally be classified as graphical or numerical [14].
In addition to computational methods, many studies in the literature have used neural networks (NNs) to predict the parameters of the WD in many areas, such as the method developed by Jesus, which applies Weibull and ANN analysis to predict the shelf life and acidity of vacuum-packed fresh cheese [23]. In survival analysis, Achraf constructed a deep neural network model called DeepWeiSurv, under the assumption that the distribution of survival times follows a finite mixture of two-parameter WDs [24]. In another work, in the field of electric power generation, an artificial NN (ANN) and the q-Weibull distribution were applied to the survival function of brushes in hydroelectric generators [25].
In the proposed method, we solve the problem whereby the reliability of the OLS method is compromised by outliers through the introduction of a pre-trained neural network after the linearization of the CDF. The remaining sections of this paper are organized as follows: Section 2 provides a review of different numerical and graphical methods for estimating the parameters of the WD, such as the MLE, OLS, WLS, and BLGE. In Section 3 we present the proposed methods. To evaluate their appropriateness in comparison with competing methods, the relevant performance metrics are covered in Section 4. The results are discussed in Section 5. Finally, the conclusions of this study are provided in Section 6.

Review of Numerical and Graphical Methods for Estimating Parameters of WD
The most commonly used approaches to estimate the parameters λ and ϑ of the WD are described below.

Maximum Likelihood Estimator (MLE)
Let (x_1, x_2, x_3, . . . , x_n) be a set of n random lifetimes from the WD defined by Eq. (1). Then, the likelihood function Lf and its corresponding logarithm for the given sample observations are shown in Eqs. (4) and (5), respectively [28]:

Lf(λ, ϑ) = ∏_{i=1}^{n} (λ/ϑ)(x_i/ϑ)^(λ−1) exp[−(x_i/ϑ)^λ],  (4)

ln Lf = n ln λ − nλ ln ϑ + (λ − 1) ∑_{i=1}^{n} ln x_i − ∑_{i=1}^{n} (x_i/ϑ)^λ.  (5)

The partial derivatives of Eq. (5) with respect to the variables ϑ and λ are given by:

∂ ln Lf/∂ϑ = −nλ/ϑ + (λ/ϑ) ∑_{i=1}^{n} (x_i/ϑ)^λ = 0,  (6)

∂ ln Lf/∂λ = n/λ − n ln ϑ + ∑_{i=1}^{n} ln x_i − ∑_{i=1}^{n} (x_i/ϑ)^λ ln(x_i/ϑ) = 0.  (7)

Solving Eq. (6) gives the MLE estimator ϑ̂_MLE of ϑ:

ϑ̂_MLE = ((1/n) ∑_{i=1}^{n} x_i^λ)^(1/λ).  (8)

The parameter λ can then be obtained by using any numerical method, such as the Newton-Raphson method.
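For concreteness, the following is a minimal Python sketch of Eqs. (6)-(8): a Newton-Raphson iteration on the profile likelihood equation for λ (obtained by substituting Eq. (8) into Eq. (7)), followed by the closed-form scale estimate. The starting value and tolerances are assumptions of this sketch, not part of the original method description:

    import numpy as np

    def weibull_mle(x, tol=1e-10, max_iter=100):
        """Newton-Raphson on the profile equation for the shape lambda,
        then the closed-form scale estimate of Eq. (8)."""
        x = np.asarray(x, dtype=float)
        n, logx = len(x), np.log(x)
        lam = 1.0                                        # assumed starting value
        for _ in range(max_iter):
            xl = x ** lam
            A, B, C = xl.sum(), (xl * logx).sum(), (xl * logx**2).sum()
            g = B / A - 1.0 / lam - logx.mean()          # Eq. (7) with Eq. (8) substituted
            dg = (C * A - B**2) / A**2 + 1.0 / lam**2    # derivative of g (always > 0)
            step = g / dg
            lam -= step
            if abs(step) < tol:
                break
        theta = ((x ** lam).sum() / n) ** (1.0 / lam)    # Eq. (8)
        return lam, theta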

Ordinary Least Squares Method (OLS)
To estimate the parameters of the WD, the OLS method is extensively used in mathematics and engineering problems [16]. We can obtain a linear relationship between the parameters by taking the logarithm of Eq. (2) twice:

ln[−ln(1 − F(x; ϑ, λ))] = λ ln x − λ ln ϑ.  (9)

Then, Eq. (9) can be written as

Y_i = α_0 + β X_i + ε_i,  (10)

with Y_i = ln[−ln(1 − F(x_(i); ϑ, λ))], X_i = ln x_(i), β = λ, and α_0 = −λ ln ϑ. Let X_(1), X_(2), X_(3), . . . , X_(n) be the order statistics of X_1, X_2, X_3, . . . , X_n, and let x_(1) < x_(2) < x_(3) < . . . < x_(n) be the ordered observations in a random sample of size n. To estimate the values of the cumulative distribution function F(x_(i); ϑ, λ), we use the mean rank method as follows:

F̂(x_(i)) = i/(n + 1).  (11)

The estimates α̂_0 and β̂ of the regression parameters α_0 and β minimize the function

Q(α_0, β) = ∑_{i=1}^{n} (Y_i − α_0 − β X_i)².  (12)

Therefore, the estimates α̂_0 and β̂ of the parameters α_0 and β are given by

β̂ = [n ∑ X_i Y_i − ∑ X_i ∑ Y_i] / [n ∑ X_i² − (∑ X_i)²],  α̂_0 = Ȳ − β̂ X̄,  (13)

and the estimates λ̂_OLS and ϑ̂_OLS of the parameters λ and ϑ are given by

λ̂_OLS = β̂,  ϑ̂_OLS = exp(−α̂_0/β̂).  (14)
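A compact Python sketch of Eqs. (9)-(14) follows; the mean-rank positions and variable names mirror the derivation above:

    import numpy as np

    def weibull_ols(x):
        """OLS (rank regression) estimates of the WD parameters."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        F = np.arange(1, n + 1) / (n + 1.0)      # mean rank, Eq. (11)
        X, Y = np.log(x), np.log(-np.log(1.0 - F))
        beta, alpha0 = np.polyfit(X, Y, 1)       # minimizes Eq. (12)
        return beta, np.exp(-alpha0 / beta)      # Eq. (14): (lambda, theta)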

Weighted Least Squares Method (WLS)
In the WLS estimate, the parameters λ and ϑ are the values that minimize the function:

Q_W(λ, ϑ) = ∑_{i=1}^{n} W_i (Y_i − α_0 − β X_i)².  (15)

The biggest challenge in the application of the WLS is in finding the weights W_i in Eq. (15). We use the delta method [29] to find them: with p_i = i/(n + 1),

Var(Y_(i)) ≈ Var(F̂(x_(i))) / [(1 − p_i) ln(1 − p_i)]²,  Var(F̂(x_(i))) = p_i(1 − p_i)/(n + 2).  (16)

Hence, taking W_i proportional to 1/Var(Y_(i)), the weights can be written as follows:

W_i = (n + 2) [(1 − p_i) ln(1 − p_i)]² / [p_i(1 − p_i)].  (17)

Minimizing Q_W(λ, ϑ) as in Eqs. (13) and (14), but with weighted sums, we obtain the WLS estimates of λ and ϑ as

λ̂_WLS = β̂_W,  ϑ̂_WLS = exp(−α̂_0,W/β̂_W).  (18)
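The sketch below runs the same regression with the delta-method weights; the weight formula follows Eq. (17), and the constant factor (n + 2) is dropped because it cancels in the weighted fit (an implementation assumption):

    import numpy as np

    def weibull_wls(x):
        """WLS estimates with the delta-method weights of Eq. (17)."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        p = np.arange(1, n + 1) / (n + 1.0)
        X, Y = np.log(x), np.log(-np.log(1.0 - p))
        W = ((1.0 - p) * np.log(1.0 - p))**2 / (p * (1.0 - p))  # Eq. (17), constant dropped
        # Closed-form weighted normal equations for slope and intercept
        Sw, Sx, Sy = W.sum(), (W * X).sum(), (W * Y).sum()
        Sxx, Sxy = (W * X * X).sum(), (W * X * Y).sum()
        beta = (Sw * Sxy - Sx * Sy) / (Sw * Sxx - Sx**2)
        alpha0 = (Sy - beta * Sx) / Sw
        return beta, np.exp(-alpha0 / beta)                     # Eq. (18)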

Approximate Bayes Estimator
In this section, the approximate Bayesian estimator of the parameters λ and ϑ of the WD under a GE loss function is discussed. We assume a non-informative (vague) prior according to [30]:

π(λ, ϑ) ∝ 1/(λϑ), λ > 0, ϑ > 0.

The parameters λ and ϑ are estimated using Lindley's approximation technique. The posterior expectation E of a function u(λ, ϑ) is given by Eq. (22) [31]:

E[u(λ, ϑ) | x] = ∫∫ u(λ, ϑ) e^{ℓ(λ,ϑ) + ρ(λ,ϑ)} dλ dϑ / ∫∫ e^{ℓ(λ,ϑ) + ρ(λ,ϑ)} dλ dϑ,  (22)

where ℓ is the log-likelihood of Eq. (5) and ρ = ln π(λ, ϑ). Moreover, it can be asymptotically estimated by Lindley's expansion:

E[u | x] ≈ u + (1/2) ∑_i ∑_j (u_ij + 2 u_i ρ_j) σ_ij + (1/2) ∑_i ∑_j ∑_k ∑_l ℓ_ijk σ_ij σ_kl u_l,  (23)

evaluated at the MLEs, where subscripts on u, ρ, and ℓ denote partial derivatives with respect to the parameters, and σ_ij = element (i, j) of the covariance matrix of the parameter estimators. Writing Eq. (23) out for the two parameters (λ, ϑ) yields the working form referred to below as Eq. (24).
To apply the Lindley approximation of Eq. (24) to estimate the parameters of the WD, the partial derivatives of the log-likelihood in Eq. (5) with respect to λ and ϑ, up to the third order, are required, as is ρ from Eq. (23). The elements σ_ij of the covariance matrix are given by the inverse of the observed Fisher information matrix, evaluated at the MLEs (λ̂_MLE, ϑ̂_MLE).

Estimates Based on General Entropy Loss Function
The general entropy loss function L(φ̂, φ) for a parameter φ is expressed in the following form [32]:

L(φ̂, φ) ∝ (φ̂/φ)^q − q ln(φ̂/φ) − 1, q ≠ 0,  (25)

where φ̂ is an estimate of φ. The Bayes estimator of φ, denoted by φ̂_GE, is the value of φ̂ that minimizes the posterior expectation of Eq. (25):

φ̂_GE = [E_φ(φ^{−q})]^{−1/q},  (26)

provided that E_φ(φ^{−q}) exists and is finite. The BLGE λ̂_BLGE of λ is found by evaluating E_λ(λ^{−q}) with the Lindley approximation of Eq. (24); in the same way, the BLGE ϑ̂_BLGE of ϑ is found from E_ϑ(ϑ^{−q}).

Proposed Methods
In the following sections, we describe the proposed BLWGE and OLSMLP methods.

Weighted General Entropy Loss Function
The WGE loss function is proposed as a weighted form of the GE loss function:

L_W(φ̂, φ) ∝ w(φ)[(φ̂/φ)^q − q ln(φ̂/φ) − 1],  (27)

where φ̂ represents the estimate that minimizes the posterior expectation of the loss function (Eq. (27)), and w(φ) represents the proposed weight function as expressed by Eq. (28):

w(φ) = φ^{−z}.  (28)

Based on the posterior distribution of the parameter φ, and by using the WGE function given in Eqs. (27) and (28), the BLWGE of the parameter φ is φ̂_BLWGE as presented in Eq. (29):

φ̂_BLWGE = [E_φ(φ^{−(z+q)}) / E_φ(φ^{−z})]^{−1/q},  (29)

provided that E_φ(φ^{−z}) and E_φ(φ^{−(z+q)}) exist and are finite, where E_φ represents the posterior expectation.
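The derivation behind Eq. (29) is a one-step minimization: differentiating the posterior risk under Eqs. (27) and (28) with respect to φ̂ and setting the result to zero gives

∂/∂φ̂ E_φ[φ^{−z}((φ̂/φ)^q − q ln(φ̂/φ) − 1)] = q φ̂^{q−1} E_φ[φ^{−(z+q)}] − (q/φ̂) E_φ[φ^{−z}] = 0,

so that φ̂^q = E_φ(φ^{−z}) / E_φ(φ^{−(z+q)}), which is exactly φ̂_BLWGE = [E_φ(φ^{−(z+q)}) / E_φ(φ^{−z})]^{−1/q}.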
We note that the GE is a special case of the WGE when z = 0 in Eq. (29).

Estimates of Parameters of WD Based on Weighted General Entropy Loss Function
Based on the WGE and by using Eq. (29), the approximate Bayes estimator λ̂_BLWGE for the shape parameter λ is

λ̂_BLWGE = [E_λ(λ^{−(z+q)}) / E_λ(λ^{−z})]^{−1/q},  (33)

where the posterior expectations E_λ(λ^{−z}) and E_λ(λ^{−(z+q)}) are evaluated with the Lindley approximation of Eq. (24). Similarly, the weighted Bayes estimator ϑ̂_BLWGE for the scale parameter ϑ is given by Eq. (34):

ϑ̂_BLWGE = [E_ϑ(ϑ^{−(z+q)}) / E_ϑ(ϑ^{−z})]^{−1/q}.  (34)

Ordinary Least Squares and the Multilayer Perceptron Neural Network (OLSMLP)
As previous studies have shown [14,33], graphical methods yield the smallest standard deviation (STD) in the parameter λ, and are consequently more accurate than numerical methods. Moreover, graphical estimation methods are more accurate for small sample sizes [14]. However, these methods, especially the OLS, are sensitive to outliers and to specific residual behavior [34]. To solve these problems, many studies have proposed different methods, such as the iterative weighting method based on the modified OLS [34], the WLS, and many other methods based on the WLS [35]. A major challenge in these methods is determining the weights.

Proposed Method to Estimate Parameters of WD
We now describe the proposed method, which is divided into two main parts: the linearization of the CDF, and the application of a feedforward network with backpropagation to estimate the values of λ and ϑ of the WD.
The OLS method takes the CDF defined in Eq. (2) and linearizes it as described in Eq. (10). It then determines the coefficients α_0 and β via linear regression, using the slope and the intercept. The assumptions underlying this computation can be violated by even a few outliers.
Therefore, instead of using the slope and the intercept, we propose applying Algorithm 1, as described below.

Application of Proposed Model to Estimate Parameters of WD
The steps used to evaluate the parameters of the WD from the input CSV files are described by Algorithm 1.

Algorithm 1: OLSMLP estimation of the parameters of the WD
Input: Three comma-separated value (CSV) files containing the matrices X_i and Y_i, and their corresponding parameters (shape and scale) SC_i.
Output: The predicted shape λ̂_OLSMLP and scale ϑ̂_OLSMLP for the test set.
1: Normalize the input matrices X_i, Y_i, and SC_i separately using RobustScaler followed by MinMaxScaler.
2: Split the normalized X_i, Y_i, and SC_i into random training and test subsets.
3: Create the neural network model (define the input layer, hidden layer, and output layer).
4: Compile the model and fit it to the data.
5: Predict λ̂_OLSMLP and ϑ̂_OLSMLP for the test set.
6: Evaluate the performance of the proposed model.

Steps 2, 3, 4, and 5 are explained in more detail in the following subsections.

• Data Normalization
Normalization is an essential preprocessing step for a neural network [36,37]. Before training the model, the input data are scaled in a preliminary phase using the RobustScaler norm, which rescales each feature using the median and the interquartile range, as described by Eq. (38); this removes the influence of outliers. Following this, the MinMaxScaler, defined by Eq. (39), is applied to the output of the RobustScaler and scales all the data features to the range [0, 1]:

X_i^sr = (X_i − median(X)) / IQR(X),  (38)

X_i^sm = (X_i^sr − min(X^sr)) / (max(X^sr) − min(X^sr)),  (39)

where X is a feature vector, X_i is an element of the feature X, X_i^sr is the rescaled element obtained by using RobustScaler, and X_i^sm is the rescaled element obtained by using MinMaxScaler.
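In scikit-learn terms, Eqs. (38) and (39) chain as follows. This is a minimal sketch; the placeholder matrix stands in for any of X_i, Y_i, or SC_i:

    import numpy as np
    from sklearn.preprocessing import RobustScaler, MinMaxScaler

    X = np.random.weibull(2.0, size=(1000, 10))   # placeholder feature matrix

    X_sr = RobustScaler().fit_transform(X)        # Eq. (38): median/IQR scaling
    X_sm = MinMaxScaler().fit_transform(X_sr)     # Eq. (39): rescale to [0, 1]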

• Structure of the Proposed Neural Network
To estimate the parameters of the WD, we propose using a multilayer perceptron (MLP), which is a feedforward network with backpropagation [38]. As shown in Fig. 1, the proposed network consists of an input layer (with n neurons), a hidden layer (with k neurons), and an output layer (with m neurons that yield the Weibull parameters as the output of the network) [39]. In our architecture, we use the rule whereby "the number of hidden neurons k should be 2/3 times the size of the input layer, plus the size of the output layer" [38-40].
The hyperbolic tangent activation function (tanh) is proposed here for the hidden layer, and the sigmoid function for the output layer. Both are used frequently in feedforward networks, and are suitable for shallow networks as well as for prediction and mapping applications [38,41].
The objective of our neural network is a model that performs well on both the training and the test datasets. For this reason, we add a well-known regularization layer, as described in the next section.
• Regularization

Regularization is a technique that can prevent overfitting [37,38]. A number of regularization techniques have been developed in the literature, such as L1 and L2 regularization, bagging, and dropout. In the proposed structure, we use dropout, a well-known technique that randomly "drops out" (omits) hidden neurons of the neural network, making them unavailable during part of the training [38,42]. This reduces the co-adaptation between neurons, which results in less overfitting [38].
• Optimization Algorithm

The optimization of deep networks is an active area of research [43]. The most popular gradient-based optimization algorithms are Adagrad, Momentum, RMSProp, Adam, AdaDelta, AdaMax, Nadam, and AMSGrad [38,43,44]. We chose Nadam due to its superiority in supervised machine learning over the other techniques, especially for deep networks [43]. Moreover, it combines the strengths of the Nesterov accelerated gradient (NAG) and the adaptive moment estimation (Adam) algorithms, as described in [44]:

m_t = β₁ m_{t−1} + (1 − β₁) g_t,  v_t = β₂ v_{t−1} + (1 − β₂) g_t²,

θ_{t+1} = θ_t − (α / (√v̂_t + ε)) (β₁ m̂_t + (1 − β₁) g_t / (1 − β₁^t)),

where m̂_t = m_t / (1 − β₁^t) and v̂_t = v_t / (1 − β₂^t) are the bias-corrected first and second moment estimates of the gradient g_t.

To evaluate the proposed methods with respect to other methods, we used two statistical tools, the mean squared error (MSE) and the mean absolute percentage error (MAPE) [5], in addition to the computation time.

Dataset Description
We generated 250,000 random data points from the WD for different parameter combinations, with values of ϑ ranging from 1 to 299 and values of λ ranging from 0.5 to 100. For each shape/scale pair, we generated 10,000 samples of different sizes n = 10, 20, 30, 40, and 50.
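The sketch below shows how one such sample can be drawn and converted into the linearized (X_i, Y_i) pair used as network input; the conversion mirrors the OLS preprocessing of Eqs. (9)-(11), and the particular shape/scale values are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def make_example(lam, theta, n):
        """One ordered WD sample and its linearized (X, Y) representation."""
        x = np.sort(theta * rng.weibull(lam, size=n))   # scale * standard Weibull draw
        F = np.arange(1, n + 1) / (n + 1.0)             # mean-rank positions, Eq. (11)
        return np.log(x), np.log(-np.log(1.0 - F))

    X_i, Y_i = make_example(2.5, 30.0, 50)
    SC_i = np.array([2.5, 30.0])                        # target (shape, scale) pair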
We used the same dataset for the neural network in the training phase, but applied only one sample per shape/scale pair. This was unlike the other methods (MLE, OLS, WLS, BLGE, and BLWGE), which used 10,000 samples to estimate the parameters of the WD. The dataset was divided into two subsets. The first, referred to as the training dataset, was used to fit the model and was characterized by known inputs and outputs. The second, referred to as the test dataset, was used to evaluate the fitted machine learning model by making predictions on data for which the expected outputs were withheld. We chose the train-test procedure for our experiments because the available dataset was sufficiently large.

Parameter Selection for OLSMLP
In all experiments, we trained the model with Google Colaboratory (GPU) for 25 epochs. We used the Nadam optimizer with a learning rate of α_nad = 0.001; the terms representing the momentum decay, scaling decay, and smoothing were kept at their default values: β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁷. A dropout with a ratio of 0.6 was applied to the hidden layer. As described in Section 3, the hidden and output layers used the tanh and sigmoid activation functions, respectively. The loss function used to estimate the loss of the model was the mean squared error.
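Putting these choices together, a minimal Keras sketch of the training setup could look as follows. The layer sizes, activations, dropout ratio, optimizer settings, and loss follow the text; the input length n and treating the sample length as the input dimension are assumptions of this sketch:

    import tensorflow as tf

    n, m = 50, 2                        # assumed input length and two outputs
    k = int(round(2 * n / 3)) + m       # hidden-size rule quoted in Section 3

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n,)),
        tf.keras.layers.Dense(k, activation="tanh"),    # hidden layer
        tf.keras.layers.Dropout(0.6),                   # dropout ratio 0.6
        tf.keras.layers.Dense(m, activation="sigmoid"), # shape and scale outputs
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Nadam(
            learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7),
        loss="mse",
    )
    # model.fit(X_train, SC_train, epochs=25) would then run the training.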

Parameter Selection of BLGE and BLWGE
In all experiments, the parameters of the BLWGE and BLGE were determined empirically. The values of the weights q and z of the BLWGE were −3 and 6, respectively; for the BLGE, q = 1.5.

Fig. 2 shows the evolution of the average MSE as a function of the sample size n. The MSE decreased quasi-linearly from n = 10 to n = 40 for all methods. Fig. 2 also shows that the BLWGE, WLS, BLGE, and MLE had lower MSE values than the OLS for the different sample sizes. We can also deduce that the WLS, GE, and MLE gave similar results, with a slightly better start for the MLE at n = 10.

Effect of Sample Size on Estimation of WD Parameters Using Proposed Method
To illustrate how the sample size affects the calculation of the MSE, Fig. 3 shows the evolution of the MSE as a function of the sample size n from 10 to 50. The main observations from Figs. 3 and 4 and Tabs. 1 and 2 are as follows:
1. The MLE and WLS behaved similarly, as shown in Tab. 1: their MSE values decreased gradually when the shape value increased at a fixed scale. Conversely, when the scale value increased at a fixed shape, the MSE increased.
2. The behavior of the OLS and GE was the opposite of that of the MLE and WLS. As depicted in Tab. 1, the MSE increased when the shape increased (at a fixed scale), and decreased when the scale increased (at a fixed shape).
3. The BLWGE and the OLSMLP behaved similarly in terms of scale estimation, as shown in Tab. 1.
4. All methods had the same global variation function, as shown in Fig. 4 and Tab. 2.
5. The MLE was globally slightly superior to the other methods in terms of scale estimation, but had the worst shape estimation, as shown in Tab. 2.
6. The proposed MLP neural network estimated the scale acceptably, better than some methods. By contrast, it outperformed all other methods in terms of shape estimation most of the time.

Conclusion
This study proposed a method to estimate the parameters of the WD based on the OLS graphical method and the MLP neural network. The MLP solves the problems caused by the presence of outliers and eases the difficulty of determining the weights in the WLS method. It yielded acceptable results in simulations, especially in terms of shape estimation, and is faster than the MLE, BLGE, and BLWGE.
We also proposed a second method (BLWGE), in which we introduced a weight into the GE loss function. The results of simulations showed that the BLWGE yielded good results, especially in terms of shape estimation, compared with the other methods.