Reliability Analysis Based on Optimization Random Forest Model and MCMC

: Based on the rapid simulation of Markov Chain on samples in failure region, a novel method of reliability analysis combining Monte Carlo Markov Chain (MCMC) with random forest algorithm was proposed. Firstly, a series of samples distributing around limit state function are generated by MCMC. Then, the samples were taken as training data to establish the random forest model. Afterwards, Monte Carlo simulation was used to evaluate the failure probability. Finally, examples demonstrate the proposed method possesses higher computational efficiency and accuracy.


Introduction
With increasing functionality and complexity of modern products, the structural reliability analysis becomes an emerging research focus in the field of reliability engineering. Failure probability can be expressed as where x denotes the random variable in reliability study, f (x) denotes the combined probability density function, g(x) is performance function, where the inequality g > 0 represents the safe state and g < 0 represents the failure state. When g is 0, this state is called limit state which divides the whole domain the safe region and the failure region. The integral denotes the total probability of joint probability density function in failure region. Because f (x) is usually unknown, it is difficult to calculate the failure probability directly from Eq. (1), especially for high-dimension problems or implicit problems. In recent decades, many methods about solving the integral function have been proposed, including sampling method based on Monte Carlo simulation and approximation method based on performance function, such as the first-order reliability methods (FORM) and the second-order reliability method (SORM) [1][2][3][4][5]. The first-order or second-order reliability method makes a first-order function or second-order function to replace the corresponding limit state function, and the analytical solution requires to solve an optimization function of the shortest distance between limit state and origin in standard normal space [6]. However, the existing approximation methods are unstable if the performance function is nonlinear, or has multiple design points and failure modes [7,8].
Moreover, there are several approximate limit state functions combined surrogate model with Monte Carlo simulation to improve computational efficiency, such as polynomial response surface method (RSM) [9,10], polynomial chaos expansion (PCE) [11,12], Kriging model [13][14][15][16], support vector machine (SVM) [17][18][19], fault tree model [20] and artificial neural network algorithm (ANN) [21,22], etc. By means of surrogate model, the solution of integral function could be obtained with high computational accuracy and little computation cost. It has been become a research focus. Echard et al. [23] adopted the active learning strategy to establish Kriging model adaptively. Compared with conventional sampling methods, smaller samples were used to construct the Kriging model. Ghosh et al. [18] proposed a SVM model to substitute the performance function, and combined with Monte Carlo simulation to calculate the structural safety. However, the inherent parameters of the SVM model is difficult to determine, such as kernel parameters and the penalty factor. Cheng et al. [24] demonstrated a neural network based method combined with the uniform sampling method for predicting failure probability of structures. Oparaji et al. [25] applied the neural network to the reliability and sensitivity analysis of complex nonlinear systems, where the robustness of artificial neural network models was improved by Bayesian theory and model averaging technique. Although ANN has many advantages, including excellent learning capability, fitting for complex nonlinear function with chosen accuracy and so on, ANN also have several disadvantages, such as effect of subjective factors, large generalization errors and low convergence rate. With increasing model complexity, the nonlinear effect of system functions is enhanced.
As a kind of integrated learning algorithm based on decision tree for classification, random forest algorithm has been widely used for addressing issues like data classification and investigation [26]. The advantages of random forest regression model are that it has a better generalization ability and insensitive for abnormal data in samples. Besides, there are other advantages of random forest algorithm, including high accuracy, a fast convergence, less overfitting and so on. In terms of regression problems, the option of hyper-parameters for random forest algorithm has a significant effect on the accuracy of predication, and a further optimization is necessary.
Meanwhile, it requests good samples to construct the surrogate model. The conventional methods usually used the random sampling or design of experiments technique generate samples. For small failure probability event, it will take a lot of samples to establish a high precision surrogate model, because there should be at least one point in the failure domain. Therefore, it is requested an efficient algorithm which can generate samples in both failure region and the safe region.
Recently, the coupling algorithm of Markov Chain Monte Carlo (MCMC) is a powerful tool to simulate samples with random distribution [27,28]. MCMC simulation was adopted to generate samples in failure region. Regarding the generated samples as training data, this brings about an algorithm with higher accuracy [29].
In this study, a combined method called Markov Chain Random Forest (MCRF) algorithm, is adopted to adaptively generate samples. Based on the generated samples, the random forest model is established and improved to determine the best parameters of random forest.
Then, the reliability analysis is performed by Monte Carlo simulation with the random forest model. The work aims to use small number of calling performance function for high precision reliability analysis.

Monte Carlo Simulation
Here, the limit state of system can be expressed by g(x), where x is variables affecting structural performances. Meanwhile, the joint probability density of each variable is denoted as f x (x), thus the failure probability of the whole system can be represented as, where F denotes the failure domain, namely Eq.
(3) could be solved by transforming it into the mathematical expectation of samples, and the mathematical expectation can be obtained via the mean of samples. According to Law of Large Numbers, the unbiased estimatorP f and its variance var P f can be calculated by Eqs. (4) and (5), respectively.
where x i is the ith sample.

Adaptive Sampling Based on Markov Chain
In this section, samples in important region around the limit state are generated by Markov Chain Monet Carlo simulation. Then the generated samples are regarded as training samples which are used as the design of experiments to training random forest model. The random forest model is adopted to approximate the original limit state function. Here, the method for samples generated by Markov Chain is introduced as following.
Step (1): Select the initial state for Markov Chain can be determined from engineering or simple computation.
Step (2): Select appropriate stationary distribution function The stationary distribution function is determined as conditional distribution density function to generate the failure region samples as, As the length of Markov Chain increases, the stationary distribution function ensure that the generated state points tend to fit into the selected target distribution function, namely generating samples in failure region.
Step (3): Select proposal distribution function The proposal distribution function of Markov Chain directly affects the transforming method from current state to alternate state, which also has an influence on the distance between the candidate state and current state. Based on the basic principle of Markov Chain, proposal distribution adopting the current state x (j) as sampling center has a symmetric property. Here, the proposal distribution is a normal distribution, written as, where ε denotes the candidate state and x (j) represents the current state. The vector V is the standard variance vector of original joint probability density function, which affects the sampling region of Markov Chain. In order to enhance the convergence of Markov Chain, the gradual changing strategy is adopted. Setting the initial value of V to be V 0 and taking coefficient γ (γ < 1) make the standard variance gradually decreasing during the simulation. Those settings can enable Markov Chain simulation to seek its important zone of limit state more quickly, and enable more samples of simulation locating around the important domain and the limit state function.
Step (4): Determine the jth state x (j) of Markov Chain Based on previous state x (j−1) , the jth state x (j) of Markov Chain can be determined by combining its proposal distribution with the Metropolis-Hastings criterion. Firstly, candidate state points ε are generated by the proposal distribution f * ε | x (j−1) , then the ratio between stationary distribution function f (ε | F) of candidate state ε and its previous distribution function f x (j−1) | F can be calculated by Eq. (8), According to Metropolis-Hastings criterion, candidate state ε is regarded as the jth state takes the value ε of on the probability of r, or takes the value of x (j−1) on the probability of 1 − r, written as, where α is a random number subjected to uniform distribution in the interval [0, 1].
For reliability analysis, each of the candidate state ε, whether it is regarded as a Markov Chain state or not, is evaluated by calculating the performance function. Thus, all candidate states ε are used for training the random forest model, which makes most of valid information during Markov process. The proposed Markov Chain is adaptive to generate samples around the limit state function, so the random forest model has a better approximate accuracy for calculating failure probabilities.

Random Forest Algorithm
Random forest algorithm is a kind of machine learning regression algorithm. Random forest algorithm is a coupling model assembled with b-regressive trees θ, which can be labeled as {h (x, θ k ) , k = 1, 2, . . . , b}. By calculating the average of b-regressive trees, the prediction of random forest regressive model could be obtained. Once the samples are generated by the MCMC, the random forest model is constructed based on these samples. The establishing process of random forest algorithm for reliability analysis is described as follows.
Step (1) Step (2): Construct the regressive tree. From k variables among sub-branch points of each trees, select randomly m try (m try < k) variables as candidate branch variables, then estimate their optimal branch variables based on the branch-weight criterion.
Step (3): Each regressive tree branches grows unceasingly, recursively from top to bottom. The coefficient n tree -value of each tree, is regarded as the termination criterion of regressive growth.
Step (4): The produced b regressive trees construct regressive model of random forest. And the prediction of new model can be evaluated by the prediction accuracy of OOB, which is measured by the variance average of examination set of original data. Assuming the sample number of OOB to be m, then where y i denotes the real value of dependent variable among OOB,ŷ i denotes the corresponding predicted value obtained by random forest regressive model,σ 2 y denotes the variance of predictions.
For random forest regressive model, the effect of independent variable on dependent variable can be evaluated by variable importance measure (VIM). Here, VIM is an assessment approach based on the mean square error reduction of random permutation and reduction of model accuracy to evaluate the influence of independent variables. The calculation process of mean square error reduction is given as following.
(1) Establish regressive tree for each training sample set, and use the established regressive tree to predict the OOB data. Then, the mean square error of b-OOB data, labeled as MSE 1 , MSE 2 , . . . , MSE b , can be obtained.
(2) Due to the random selection of variables during constructing of regressive trees branch, variables X i of OOB data sets are replaced randomly to assemble a new examination set of OOB data. Using the established random forest regressive model to predict the new examination set, the mean square variances of OOB data can be obtained after replacing randomly. The mean square errors can be assembled as the following matrix, ⎡ (3) Substring the corresponding mean error of ith row for the above matrix from MSE 1 , MSE 2 , . . . , MSE b , respectively. Then divide these mean values by their standard deviation, the VIM can be obtained, which is expressed as Eq. (13), A large value means the independent variable is important for dependent variable. The key idea of the VIM is to apply random disturbance to each variable, and observe the change of accuracy on the established model. Based on the reduction of accuracy, the importance of variable can be evaluated. If the model's accuracy increases after a disturbance of variable decreases, it indicates that this specific variable possesses a higher importance.

Improved Random Forest Algorithm
The number of regressive trees n tree and number of predictors sampled for splitting at each node m try are important parameters for random forest algorithm. Partial observed values of independent variables are generated randomly by Bootstrap resampling technique during the construction process of random forest regressive tress. Partial explanatory variables selected randomly determine the branch number of regressive trees, which can create hundreds of regressive trees. The data is divided into n tree training samples, and n tree decision tree models are trained simultaneously. The splitting number m try at each node can also affect the accuracy of results. Meanwhile the parameter m try depends on n tree , making the regressive model sensitive. Therefore, the sequential quadratic programming (SQP) is adopted to search the best parameter combination of n tree and m try . The objective function is the mean square error (MSE) of the random forest model.

The Integral Process of the MCRF Method
The integral process of the Markov Chain, random forest modeling, and Monte Carlo simulation is shown in Fig. 1. Firstly, based on the engineering experience or numerical computation, an initial point is generated, which is located in failure region. Secondly, the adaptive sampling is performed by the Markov Chain simulation. During iteration, all the points at which the performance function is evaluated, are retained. Then these points obtained by the Markov Chain are used as the design of experiments, and the random forest model is constructed. The parameters of the random forest are optimized by SQP method. Finally, the failure probability is calculated by the Monte Carlo simulation and random forest.

Model Application
In this section, cases including nonlinear function with cross terms, exponential-nonlinear function, and an engineering problem, sealing ring of aircraft, are employed to verify the effectiveness of the proposed method.

Case I: Nonlinear Function with Cross Terms
Limit state function with cross terms, as a limit state nonlinear function with cross terms, is shown as Eq. (14), where x 1 , x 2 denote independent random variable subjected to the standard normal distribution, respectively. Using the generated samples to construct the random forest model. The random forest model is optimized by SQP. The parameter m try is set as 39 and the n tree is set as 500. The iteration of objective function is shown in Fig. 3. As can be seen, the RF model converged in about 37 iterations.
The prediction values and training values of random forest model are shown in Fig. 4. From  Fig. 4, these prediction values agree well with that of expectation, indicating the trained random forest model has a higher accuracy to replace their corresponding original performance function for reliability analysis. Based on these prediction values, it is possible to apply random forest model to calculate output values and to obtain their failure probability through Monte Carlo Simulation. Aiming to validate the effectiveness of proposed method, the failure probability is calculated by crude Monte Carlo Simulation. Then, the first order reliability method (FORM) method, SVM, ANN and subset simulation [30] are listed in Tab. 1, respectively.  The parameters used in the methods are explained as follows. For SVM, the RBF kernel function is applied, the parameters of the kernel are obtained by the grid searching method, which is 256 (as penalty factor) and 0.0625 (as radius of the kernel function), respectively. For ANN, the architecture is 25-10-5, namely, 25 neurons in the input layers, 10 neurons in the hidden layers and 5 neurons in the output layers. The learning rate is 0.1, and the max iteration number is 200. The activity function is sigmoid function. The standard subset simulation code is adopted here [30], the number of samples in each simulation level is 2000, the conditional probability is 0.1, and the maximum number of simulation levels is 10.
Here, the failure probability is obtained by 3 × 10 5 times of Monte Carlo simulations, which is considered as a benchmark solution. Compared with Monte Carlo simulation, results by the proposed method have a higher accuracy. Using the proposed model, less times of performance functions are conducted. Moreover, it needs not to calculate the design point for reliability analysis by the proposed RF method, which improves the stability of the algorithm.

Case II: Exponent Nonlinear Function
In this section, an exponent nonlinear function is validated [31], which is expressed by Eq. (15) as, where all basic random variables are independent with each other and obey standard normal distribution. By randomly selecting initial state points of Markov Chain, some samples are generated during the process of Markov Chain, and are shown in Fig. 5.  It shows, the failure probability reaches 3.59 × 10 −4 after 10 7 Monte Carlo Simulations, which can be regarded as the benchmark solution of this problem. The calculation obtained by the proposed model is very similar to that by FORM method. But compared with FORM method, the proposed method requires less callings of performance function, meaning that the proposed method has a higher computational efficiency than FORM method. Compared with SVM and ANN, the proposed method has much higher precision and can be used in probability analysis. The result by standard subset simulation method is also better than the FORM, SVM, and ANN, however it takes a large amount of calling performance function. The subset simulation combined with surrogate model will be effective to reduce the number of performance calling.

Case III: Sealing Ring
The reliability analysis of sealing is carried out as the last example [32]. The sealing equipment is the aircraft hydraulic system, including an O-ring, a piston rod, a seal groove etc. The simulation model is shown in Fig. 6, where the finite element model is performed in ANSYS. The diameter D, elastic modulus E, the amount of compression l and working oil pressure P are taken as random parameters, whose distributions are listed in Tab. 3.  The failure model of sealing presumes that the maximum contact pressure σ to be greater than the working oil pressure P between the ring and cylinder. Therefore, the performance function of the sealing can be written as, 1, 2, 3, 4) (16) where x i is the random vector for reliability analysis, x = (E, d, l, P); σ (x i ) is the contact pressure, and p (x i ) is the working oil pressure, where P = x 4 .
The results of failure probability calculated by literature and the proposed method are listed in Tab. 4. As can be seen, the literature uses 350 samples to establish the neural network and obtain the failure probability, while here only 200 samples are used to construct the random forest model and get the failure probability. The relative error between the proposed method and literature is less than 5%, indicating the proposed method is effective.

Conclusion
In this paper, a novel reliability analysis method combined with Markov Chain, random forest model and Monte Carlo simulation was proposed. Firstly, a series of samples, including rejected samples and accepted samples, are generated by Markov Chain. Then, based on the samples by Markov Chain, random forest model is established to substitute for the real performance function. Next, according to the distribution of random variables, the reliability analysis was performed by Monte Carlo simulation based on the random forest model. After validation by cases, the proposed random forest model has shown a higher accuracy and efficiency. Compared with Monte Carlo simulations, the proposed model could significantly reduce the calling times of performance functions, while holding a higher calculation accuracy. Compared with the first order reliability method, the proposed method does not require to calculate the gradient of limit state function, which reduces the explicit requirement of limit state function. Compared with SVM, ANN and subset simulation, the proposed method holds a higher precision in terms of failure probability.
The proposed method in this study makes a full use of the adaptability of Markov Chain and the excellent nonlinear fitting ability of random forest model, which reduces the dependence of conventional approximation method on experimental design. The high accuracy prediction of sampling simulation could be realized by means of random forest model, ensuring the calculation accuracy of reliability analysis and enhancing the calculation effectiveness. Application examples of the proposed model shows it is effectiveness for engineering. The proposed method is applicable in low dimension problems as demonstrated in this paper, and it will be a promising method for high dimension problems. In further study, it should be developed for high dimensional problems. Moreover, the parallel computing should be also adopted to further improve the computing efficiency.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.