Predicting the Electronic and Structural Properties of Two-Dimensional Materials Using Machine Learning

: Machine-learning (ML) models are novel and robust tools to estab-lish structure-to-property connection on the basis of computationally expensive ab-initio datasets. For advanced technologies, predicting novel materials and identifying their specification are critical issues. Two-dimensional (2D) materials are currently a rapidly growing class which show highly desirable properties for diverse advanced technologies. In this work, our objective is to search for desirable properties, such as the electronic band gap and total energy, among others, for which the accelerated prediction is highly appeal-ing, prior to conducting accurate theoretical and experimental investigations. Among all available componential methods, gradient-boosted (GB) ML algorithms are known to provide highly accurate predictions and have shown great potential to predict material properties based on the importance of features. In this work, we applied the GB algorithm to a dataset of electronic and structural properties of 2D materials in order to predict the specification with high accuracy. Conducted statistical analysis of the selected features identifies design guidelines for the discovery of novel 2D materials with desired properties.


Introduction
Traditional methods for discovering new materials with desirable properties, such as the empirical trial and error methods using the existing structures and density functional theory (DFT) based method, have exceedingly high computational cost and typically require long research [1][2][3][4]. Machine learning methods can substantially reduce the computational costs and shorten the development cycle [5][6][7]. As an example, machine learning interatomic potentials (MLIPs) have been found as accurate alternatives for DFT computations with substantially lower computational cost [8,9]. MLIPs have been also proposed as alternatives for density functional perturbation theory simulations and for assessing the phononic and complex thermal properties [10][11][12].
In the paradigm of materials informatics and accelerated materials discovery, the choice of feature set (i.e., attributes that capture aspect of lattice, chemistry and/or electronic structure) is critical [29,30]. Ideally, the feature sets should provide a simple physical basis for extracting major structural and chemical trends and enable rapid predictions of new material properties [31][32][33][34]. Machine learning has offered also novel ways to study the mechanical responses of various systems [35][36][37]. Here we show that machine learning methods can show superior ability in classification and prediction of properties of a rapidly extending class of materials, find the most important feature in classification and identifying the materials, thereby providing a feature selection approach for connecting among the components. In fact, in comparison with computational simulations, machine learning can identify patterns in large high-dimensional data sets effectively, extract useful information quickly [38] and discover hidden connections [38]. Typically, data processing consists of two sections: Data selection and feature engineering. The validity of data in both quality and quantity is a vital point in data selection. After data selection, one should extract suitable characteristics for the predicted target, which is called feature engineering. Feature engineering is the process of extracting features from raw data to enable the application of algorithms. In recent decades, the amount of scientific data collected approaches in the search for novel functional materials. In recent years, numerous extensive scientific data have been offered available in the form of online databases, for crystal structures [39,40] and materials properties [41][42][43]. In contrast to pure data-mining approaches, which focus on extracting knowledge from existing data [44], machine-learning approaches try to predict target properties directly [44], where a highly non-linear map between a crystal structure and its functional property of interest is approximated. In this context, machine learning offers an attractive framework for screening large collections of materials. In other words, developing an accurate machine learning model can substantially accelerate the identification of specific properties of materials, as the prediction of the property of interest for a given crystal structure bypasses computationally expensive ab-initio calculations.
In this work, our objective is to develop a machine learning model to predict important features of 2D-materials, including electronic band gap (Eg), work function (Ew), total energy (Et), unit-cell area (Aucell), fermi level (EFermi) and density of states (DOS), for which commonly employed DFT based approaches are computationally expensive. Since gradient-boosted (GB) algorithm provides highly accurate predictions both classification and regression, we used this tool in our exploration. We then consider a set of 2D materials and apply the developed methodology to identify the basic characteristic and relationship between different specifications.

Methods
Gradient boosted (GB) is an ensemble ML algorithm that uses multiple decision trees (DTs) as base learners [45]. Every DT is not independent, because a new added DT increases emphasis on the misclassified samples attained by previous DTs. It can be noticed that the residual of former DTs is taken as the input for the next DT. Then, the added DT is used to reduce residual, so that the loss decreases following the negative gradient direction in each iteration. Finally, the prediction result is determined based on the sum of results from all DTs. GB is a flexible non-parametric statistical learning technique for classification and regression. The GB method typically consists of three steps: feature selection [46,47], DT generation [48] and DT pruning [49]. Among them, the objective of feature selection is to retain features that exhibit sufficient classification performance and the objective of pruning is to make the tree simpler and thus more generalizable. Often features do not contribute equally to predict the target response. When interpreting a model, the first question usually is: what are those important features and how do they contributing in predicting the target response? A gradient boosted regression tree (GBRT) model derives this information from the fitted regression trees which intrinsically perform feature selection by choosing appropriate split points. Feature selection is the process of choosing a subset of features, from a set of original features, based on a specific selection criterion. The main advantages of feature selection are: (1) Reduction in the computational time of the algorithm, (2) Improvement in predictive performance, (3) Identification of relevant features, (4) Improved data quality and (5) Saving resources in subsequent phases of data collection. We can access this information via the instance attribute.
First, after collecting data we use the feature important algorithm for determining the valuable features with an impact on each other. In this regard, there are a few challenging aspects, such as heterogeneous features (different scales and distributions) and non-linear feature interactions. Furthermore, the data may contain some extreme values, such as the flat bands that appear in DOS with sharp peaks, therefore such a data set requires a robust regression technique [49]. In Fig. 1, the workflow for the present study is presented, which includes; collecting data properties, feature selection, classification and regression. For the classification process, the outcome of interest was binary with two categories: (i) Metal and (ii) Nonmetal. Dividing the materials to these two subsections is crucial for some computational and experimental approaches. For a binary outcome, a significant difference in the values of a continuous input variable for each outcome value shows that the variable can distinguish between the two outcome values. For the Gradient Boosting classification algorithm, the attribute feature importance ranks the importance of each feature for the constructed model. For each feature, the value of feature importance is between 0 and 1, where 0 means completely useless and 1 means perfect prediction. The sum of feature importance is always equal to 1. For analyzing the binary classification, we use The receiver operating characteristic (ROC) which is a measure of a classifier's predictive quality that compares and visualizes the tradeoff between the model's sensitivity and specificity. When plotted, a ROC curve displays true positive rate on the y-axis and the false positive rate on the x-axis on both a global average and a per-class basis. The ideal point is therefore the top-left corner of the plot, false positives are zero and true positives are one. This leads to another metric, area under the curve (AUC), which is a computation of the relationship between false positives and true positives. The higher the AUC, the better the model generally is. However, it is also important to inspect the steepness of the curve, as this describes the maximization of the true positive rate while minimizing the false positive rate. Besides, based on the optimal hyperparameters configuration, the regression models were constructed using different training set for each variable. Then, the test set was adopted to evaluate the performance of each model. The precision, recall and mean absolute error was utilized to analyze prediction results of each level and the accuracy values were used to assess the overall prediction performance. Based on the sensitivity analysis of indicators, the relative importance of each indicator was obtained. In GB for gaining the high accuracy results, some parameters must be chosen so that they generally improve the performance of the algorithm by reducing overfitting [50]. These hyperparameters must be selected for the accurate optimization of results. Most of ML algorithms contain hyperparameters that need to be tuned. These hyperparameters should be adjusted based on the dataset rather than specifying manually. Generally, the hyperparameters search methods include grid search. The central challenge in machine learning is that we must perform well on new, previously unseen inputs, not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization. We require that the model learn from known examples and generalize from those known examples to new examples in the future. We use k-fold cross-validation only to estimate the ability of the model to generalize to new data. Tuning the hyperparameters is a very essential issue in GB predictions, because of the trivial overfitting problem. The general tuning strategy for exploring GB hyperparameters builds onto the basic and stochastic GB tuning strategies. There are different hyperparameters that we can tune and the parameters are different from base learner to base learner. We introduced some hyperparameters as usual in machine learning, because it is tedious to optimize them, especially knowing their correlation. In tree-based learners, which are the most common ones in GB applications, the following are the most commonly tuned hyperparameters. Learning-rate governs how quickly the model fits the residual error using additional base learners. If it is a smaller learning rate, it needs more boosting rounds and more time to achieve the same reduction in residual error, as one with a larger learning rate. In other words, it determines the contribution of each tree on the outcome and controls how quickly the algorithm proceeds down the gradient descent (learns) its values ranges from 0 to 1, with typical values between 0.001-0.3. Smaller values make the model robust to the specific characteristics of each tree, thus allowing to generalize better. Smaller values also facilitate to stop before overfitting, however, they increase the risk of not reaching the optimum with a fixed number of trees and are therefore more computationally demanding, noting that this hyperparameter is also called shrinkage. The most important regularization technique for GB is shrinkage, the idea is basically to conduct slow learning by shrinking the predictions of each tree by smaller learning-rate, helpful to achieve re-enforce concepts. A lower learning-rate requires a higher number of estimators to get to the same level of training error. Generally, the smaller this value, the more accurate the model can be developed, but it also requires more trees in the sequence. The two other hyperparameters, the subsample and the number of estimators are regularization hyperparameters. Subsampling is the fraction of the total training set that can be used in any boosting round. A low value may lead to under-fitting issues. In contrary, a very high value can lead to over-fitting issue. Besides regularization, increasing the number of training examples and reducing the number of input and features, are more advantages for fixing overfitting. The depth of the individual trees is one aspect of model complexity. The depth of the trees controls the degree of feature interactions that our model can fit. For example, if one wants to capture the interaction between a feature latitude and a feature longitude our trees need a depth of at least two to capture this. Unfortunately, the degree of feature interactions is not known in advance but it is usually fine to assume that it is fairly low in practice, a depth of 4-6 usually gives the best results. In scikit-learn [51], we can constrain the depth of the trees using the maximum depth argument. Despite interpolating these parameters two more common overfitting and underfitting problems for the GB algorithm addressed the unfavorable results. We can address underfitting by increasing the capacity of the model. Capacity refers to the ability of a model to fit a variety of functions, more capacity means that a model can fit more types of functions for mapping inputs to outputs. Increasing the capacity of a model is easily achieved by changing the structure of the model, such as adding more layers and/or more nodes to layers. Because an underfit model is so easily addressed, it is more common to have an over-fit model. Both train-test splits and k-fold cross-validation are examples of resampling methods, although both of these methods cannot overcome the overfitting, they can reduce these complications, but the regularization is one of the best methods for the resistance of data for overfitting. An over-fit model is easily diagnosed by monitoring the performance of the model during training by evaluating it on both a training dataset and on a holdout validation dataset. To get highly accurate predictions, we employed the iteration process for optimizing hyperparameters: first, we choose loss based on our problem at hand (i.e., target metric), then pick number of estimators as large as (computationally) possible (e.g., 3000). Next, we tune the maximum depth, learning-rate, subsample and maximum features via iteration process in a loop that we can achieve to less difference between the training and test deviance (find the Apendix). All other parameters were left to their default values.

Results and Discussions
The considered dataset of 2D-materials considered in this work includes more than 1,500 compounds, taken from the recent work of Haastrup et al. [50]. In the aforementioned work, authors calculated structural, thermodynamic, elastic, electronic, magnetic and optical properties of an extensive ensemble of 2D-materials by state-of-the-art DFT within the many-body perturbation theory. In this paper, we model the structure-property relationship via machine learning for 2D materials. Among these data, we select thermodynamically and dynamically stable compounds. Worthy to note that dynamically stable lattices do not show imaginary frequencies in the phonon dispersions. With considered stabilities criteria, 981 compounds were selected. In these types of datasets, the materials vary from a range of nonmetal up to metal that these sorts kinds of materials are applicable in many research areas. The kind of crystal structure distributed over more than 50 different structures, also the unit cell and internal coordinates of the atoms are relaxed in both a spin-paired (NM), ferromagnetic (FM) and anti-ferromagnetic (AFM) configuration. The projected density of states (PDOS) is another label that is a useful tool for identifying which atomic orbitals comprise a band. The electron density is determined self-consistently on a uniform k-point grid of density 12.0/Å. From this density, the PBE band structure is computed non-self consistently at 400 K-points distributed along the band path. Also, The fermi level is calculated using the PBE xc-functional including spin-orbit coupling (SOC) for all metallic compounds in the database. For metallic compounds, the work function is obtained as the difference between the Fermi energy and the asymptotic value of the electrostatic potential in the vacuum region. Electronic polarizability, Plasma frequencies and speed of sound are other labels that we consider for the prediction of our targets. Fig. 2a shows histograms for some of the features and the response. We can see that they are quite different. The band gap and density of states are left-skewed, the area of the unit cell is bi-modal and the total energy is right-skewed. As shown in Fig. 2a the total energy varies a range from 100 ev to 0 and the work function ranged from 2.5-7.5 ev that describes the non-equal distribution of datasets. For showing the relationship between these features, Correlation is a well-known similarity measure between two features. If two features are linearly dependent, then their correlation coefficient is ±1. If the features are uncorrelated, the correlation coefficient is 0. The association between the features is found out by using the correlation method. Because the GB algorithm is not capable in calculating the important score of two attributes in datasets, the pearson's correlation method is used for finding the association between the continuous features and the class feature. According to Fig. 2b, the correlation between the datasets is shown, that for each pair of features x and y, the square of the correlation coefficient R 2 were calculated and the intensity of color show the intensity of correlation. For a prediction of the classification based on machine learning, we discuss the gradient boosting model, considering 15 features for predicting targets. The total of 15 features includes the following: The heat of formation (HOF), band gap (Eg), density of states at the fermi energy (DOS), total energy (Et), fermi level (EFermi), speed of sound (x) (SoSx), speed of sound (y) (SoSy), work function (Ew), static electronic polarizability (x) (SePx), static electronic polarizability (y) (SePy), direct band gap (DirEg), stiffness tensor, 11-component (St11com), 2D plasma frequency (x) (2DPfx), 2D plasma frequency (y) (2DPfy), area of unit-cell (AUcell). Because among these features some attributes such as the density of states, two-dimension plasma frequency in x and y direction, the direct band gap and band gap have a direct impact on classification on materials in two such groups so we neglect these categories although with attention to these features the effect of other features decreases. With 10-fold cross-validation train the models using 90% materials (training set) and evaluate their performance using mean absolute error (MAE) for the remaining 10% materials (test set). we focus on loss functions such as MAE, which is less sensitive to outliers and also to be able to apply the method to a classification problem with a loss function such as deviance, or log loss. For the feature engineering process, we consider 10 important features that the importance of each feature shown in Fig. 3a. A novel candidate material is first classified as a metal or a nonmetal. If the material is classified as a nonmetal, Eg is predicted, here as classification as a metal implies that the material has no Eg (Eg = 0). The accuracy of the metal/nonmetal classifier is reported as the AUC of the ROC plot (Fig. 3b). Acquired ROC curve illustrates the model's ability to differentiate between metallic and nonmetallic input two-dimension materials. It plots the prediction rate for nonmetal (correctly vs. incorrectly predicted) throughout the full spectrum of possible prediction thresholds. An area of 1.0 represents a perfect test, whereas an area of 0.5 characterizes a random guess (the dashed line). The model shows excellent external predictive power with the area under the curve at 0.98, a nonmetal-prediction success rate (sensitivity) of 0.98, a metal-prediction success rate (specificity) of 0.98 and an overall classification accuracy (R-squared) of 0.93 (Fig. 3a). For the complete set of 980 2D lattice, this corresponds to about 20 misclassified materials, including 10 misclassified metals and 10 misclassified nonmetals. The model exhibits a positive bias toward predicting metals, where bias refers to whether the ML model tends to over-or under-estimate the predicted property. This low false-metal rate is acceptable as the model is unlikely to misclassify a novel and potentially interesting nonmetal as a metal. Overall, the metal classification model is robust enough to handle the full complexity of the periodic table. The direction has about 50% direct impact on classification in contrast to the other features. For the regression process, the visualization result of feature importance in this paper is shown in Figs. 4a-4f, which displays the relative importance of the 14 most influential predictor variables for each target. Since these measures are relative, a value of 1 was assigned to the most important predictor and the others were scaled accordingly. There is a clear differential effect between the models. For instance, for band gap target, static electron polarization x and density of states have more percent in feature values relate to other variables but in contrast, these two variables are far less important in the total energy model. The work function is the most relevant predictor in the fermi level model, while it is far less important in the area of the unit cell model. Among the other influential predictors in the area of the unit cell, stiffness tensor and total energy are more important features for identifying the target. For the density of state, we find the presence of an area of the unit cell, two-dimension plasma frequency in x and y-direction. For the work function, the fermi level is the most influential predictor, followed by the stiffness tensor and the area of the unit cell. Figs. 4a-4f show how the feature in the prediction of the targets changes under the process of extracting key-features. Remarkably, we found that accuracy-score increases almost monotonically in all cases, which indicates that a feature with a higher feature score has stronger predictive power. Through this process, we selected the most important features for each target property. For Eg, the score abruptly increases when the number of features is larger than five (Fig. 4a). This means that at least five features are required to predict band gap within MAE of 0.16 eV. In the case of DOS, seven features comprise a minimal set to predict the density of states within an MAE of 0.53 eV (Fig. 4d). Tab. 1 showed the minimum selected features for the best result in both MSE and accuracy scores. Key-features extracted from the feature importance score of the GBRT method can have various implications for a fast search for new material. For instance, selected key-features can be utilized as an optimal set of features in another type of ML algorithms such as classification. With a smaller number of features, as seen in Fig. 4 the prediction accuracy is lower than that with all features. However, it is notable that regression with the top key-features selected for targets provides a good approximation to that with the full features. The results of the variables regression model are plotted in Fig. 5. Fig. 5a presents the prediction of Eg, that the results show that the MAE of test sets for Eg is 0.16 eV. Even though the number of the current dataset was limited, it is noteworthy that the accuracy of the predictive model of Eg, Et, Aucell, Efermi. Besides, a statistical profile of these predictions, along with that of the six other regression models, is provided in Tab. 1, which includes metrics such as the MAE and coefficient of determination (R2). Similar to the classification model, the GBRT model exhibits a positive predictive bias. The biggest errors in DOS, come from materials with narrow band gaps, that is, the scatter in the lower-left corner in Fig. 5c. similar results for the other five targets is shown the power of the prediction but in the density of the state and work function the less accuracy among the variables exists. The outcome of the interpolating of parameters was shown in Tab. 2. The result can be slightly improved by decreasing the learning rate and adjusting subsample related to it. With attention to these hyperparameters, the performance measure reported by 10-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.  In comparison, we plot the residual gradient. boosting regression with train test split that the accuracy of the train and test data, showing not any different and the R 2 -squared are similar to the 10-fold validation.

Concluding Remarks
In this work, we show that machine learning can be effectively employed to rapidly characterize the specification and predict the intrinsic properties of 2D materials. In particular, it is confirmed that the use of feature engineering in the machine learning models enhances the accuracy, computational cost and prediction power. However, we note that an explicit relationship between features and properties is still difficult to obtain directly from the feature importance score and additional steps are required. The models developed in this work are capable of predicting electrical and structural properties for diverse 2D lattices with accuracies compatible with DFT methods. This is in fact a promising finding for the use in high-throughput material screening and could be valuable in influencing engineering decisions and bring a new dynamic to materials discovery.