Recently, machine learning-based techniques have been developed to automate the classification of wafer map defect patterns in semiconductor manufacturing. Existing approaches to wafer map pattern classification either learn the image directly through a convolutional neural network or apply an ensemble method after extracting image features. This study aims to classify wafer map defects more effectively and to derive an algorithm that remains robust even for datasets with insufficient defect patterns. Because the number of defects in the actual process may be limited, insufficient data are first generated using a convolutional auto-encoder (CAE), and the augmented data are verified using the structural similarity index measure (SSIM). After handcrafted features are extracted, a boosted stacking ensemble model that integrates four base-level classifiers with an extreme gradient boosting classifier as the meta-level classifier is designed and trained on the augmented data for final prediction. Since the proposed algorithm outperforms existing ensemble classifiers even with insufficient defect patterns, the results of this study will contribute to improving product quality and yield in the actual semiconductor manufacturing process.
A wafer is a basic unit created to evaluate electrical properties during semiconductor manufacturing [
In the actual semiconductor manufacturing process, defects occur very rarely. In general, there are very few cases with detectable defect patterns when collecting process data, and most of the data are in a normal state. Since a model must learn the few defect patterns from this imbalanced dataset, classification accuracy is poor and training is time consuming [
In order to improve wafer map pattern classification accuracy, this study proposes a Boosted Stacking Ensemble Machine Learning (BSEML) algorithm that applies data augmentation to insufficient defect patterns. Given a training dataset, data augmentation is first performed through CAE-based model learning. Features are then extracted through handcrafted feature extraction techniques based on density, Radon, and geometry properties. The extracted feature vectors are combined to construct a BSEML model that performs the final prediction. The contributions of this study are as follows.
The effectiveness of the proposed technique was verified using wafer datasets collected from semiconductor manufacturers. Computational efficiency was increased by extracting the key defect pattern information hidden in the original image using various feature extraction techniques. Data augmentation was performed using a CAE-based model to solve the problems of insufficient defect patterns and class imbalance, and the accuracy of the proposed model was improved using the augmented data.
The rest of this study is structured as follows. Section 2 briefly describes the techniques used in related studies. Section 3 introduces the proposed algorithm. Section 4 describes the data structure and experimental methods. Sections 5 and 6 present the results of the study and the conclusions, respectively.
In the past few years, there have been many studies that have applied machine learning to wafer map pattern classification. These are largely divided into two types based on the method of extracting the features of the wafer map and classifying the defects.
Ref. No | Method | Ensemble method | Input feature | Input shape | Classifier | Data processing |
---|---|---|---|---|---|---|
[ | MFE | - | Wafer map | 30 | SVM | EOL test |
[ | MFE | - | Features | 53 | JLNDA-FD | Denoising |
[ | MFE | Bagging | Features | 4 | DT | - |
[ | MFE | Voting | Features | 66 | LR, RF, GBM, ANN | - |
[ | MFE | Stacking | Spatial | 10 | AB, ET, XGB | - |
[ | CNN | - | Wafer map | (286, 400) | CNN | Simulated generation |
[ | CNN | - | Wafer map | (100, 100) | DCNN | Noise reduction |
[ | CNN | - | Wafer map | (256, 256) | CNN | Contrast, binarization |
[ | CNN | Stacking | Wafer map | (300, 300) | ECNN | - |
[ | CNN | - | Wafer map | (64, 64) | CNN | CAE |
[ | CNN | - | Wafer map | (224, 224) | CBAM | C-Mean filtering |
[ | CNN | - | Wafer map | (416, 416) | YOLO | - |
Proposed | MFE | Boosted Stacking | Features | (32, 32), 59 | DT, SVM, RF, KNN, XGB | CAE |
The first method is to extract handcrafted features and train an off-the-shelf classifier. The most commonly used properties for feature extraction in this approach are density, geometry, and Radon features [
The NB model is based on Bayes’ theorem and learns very quickly compared to existing learning algorithms. In particular, it allows easy and quick prediction in multi-class classification that is probabilistically independent [
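To illustrate the speed and simplicity of NB described above, the following is a minimal sketch using scikit-learn's `GaussianNB` on a toy two-class problem (the data and library choice are illustrative assumptions, not part of this study's experiments):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy 2-class problem: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Gaussian NB fits one mean/variance per feature per class and applies
# Bayes' theorem under the feature-independence assumption.
clf = GaussianNB().fit(X, y)
print(clf.predict([[0.1, 0.0], [2.9, 3.1]]))  # -> [0 1]
```

Because each class-conditional density is estimated independently per feature, training reduces to computing per-class means and variances, which is why NB trains so quickly.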
The second method is a CNN-based raw image classification method. As shown in
The ensemble system is constructed based on principles such as reliability estimation, data fusion, and unbalanced data processing. The performance of an ensemble system depends on the accuracy of individual classifiers and the number of base-level classifiers included [
In recent years, increasing interest in ensemble techniques has led to the emergence of various ensemble-based algorithms such as Voting, Bagging, Boosting, AdaBoost, XGBoost, and Mixture of Experts (MoE) [
There are three voting methods for deriving the result: majority, hard, and soft voting. Experimental verification has shown that the soft voting ensemble method performs best for deriving the final result [
This section describes the technique proposed in this study in detail.
The feature extraction technique reduces the two-dimensional array of the wafer map, which exists as an image, to a one-dimensional array. This dimension reduction not only decreases the amount of computation but also vectorizes the important feature information into a one-dimensional vector [
First, the density-based feature extraction technique calculates how densely defects occur in each section of the wafer map [
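A minimal density-feature sketch in NumPy, assuming the WM-811K convention that defective dies are coded as 2 (the grid size and coding here are illustrative assumptions, not this study's exact settings):

```python
import numpy as np

def density_features(wafer, grid=(4, 4), defect_val=2):
    """Fraction of defective dies in each grid cell of the wafer map.

    `wafer` is a 2D array where (by assumption) defect_val marks a
    defective die; returns a flat vector of per-region defect densities.
    """
    h, w = wafer.shape
    rows = np.array_split(np.arange(h), grid[0])
    cols = np.array_split(np.arange(w), grid[1])
    feats = []
    for r in rows:
        for c in cols:
            block = wafer[np.ix_(r, c)]
            feats.append(np.mean(block == defect_val))
    return np.array(feats)

wafer = np.zeros((32, 32), dtype=int)
wafer[0:8, 0:8] = 2          # defect cluster in the top-left region
f = density_features(wafer)
print(f.shape)               # (16,) — one density per grid cell
```

Each region contributes one scalar, so a 4×4 grid compresses a 32×32 map into a 16-dimensional density vector.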
Second, the Radon-based feature extraction technique generates a two-dimensional representation of the wafer map by applying the projection-based Radon transform [
Here,
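A simplified sketch of projection-based Radon features, using `scipy.ndimage.rotate` to form projections at a few angles and summarizing each projection by its mean and standard deviation (the angles and summary statistics are illustrative assumptions; the exact Radon features used in this study may differ):

```python
import numpy as np
from scipy.ndimage import rotate

def radon_features(wafer, angles=(0, 45, 90, 135)):
    """Simplified Radon transform: rotate the map and take column sums
    (projections), then summarize each projection with mean and std."""
    wafer = wafer.astype(float)
    projections = [rotate(wafer, a, reshape=False, order=1).sum(axis=0)
                   for a in angles]
    return np.array([[p.mean(), p.std()] for p in projections]).ravel()

wafer = np.zeros((32, 32))
wafer[:, 16] = 1.0                   # a vertical line of defects
feats = radon_features(wafer)
print(feats.shape)                   # (8,) — mean and std per angle
```

A line-shaped defect produces a sharp peak in the projection perpendicular to it, so the per-angle statistics discriminate directional patterns such as scratches.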
Third, a geometry-based feature extraction technique is used to evaluate the geometric properties of each wafer map [
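A hedged sketch of a few geometric descriptors (area, centroid, and eccentricity from second-order moments), again assuming defective dies are coded as 2; the specific geometry features of this study are not reproduced here:

```python
import numpy as np

def geometry_features(wafer, defect_val=2):
    """Area, centroid, and eccentricity of the defect region
    (coding of defect_val is an assumption)."""
    ys, xs = np.nonzero(wafer == defect_val)
    area = len(xs)
    if area == 0:
        return np.zeros(4)
    cy, cx = ys.mean(), xs.mean()
    # Second-order central moments -> eccentricity of the best-fit ellipse.
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    common = np.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    l1 = (mu20 + mu02 + common) / 2
    l2 = (mu20 + mu02 - common) / 2
    ecc = np.sqrt(1 - l2 / l1) if l1 > 0 else 0.0
    return np.array([area, cy, cx, ecc])

wafer = np.zeros((32, 32), dtype=int)
wafer[10, 5:25] = 2                  # a thin horizontal scratch
area, cy, cx, ecc = geometry_features(wafer)
print(area, ecc)                     # area 20; eccentricity near 1 (a line)
```

An elongated scratch yields eccentricity near 1, while a compact center cluster yields a value near 0, which is why such descriptors separate Scratch from Center patterns.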
A decision tree (DT), also called a classification and regression tree because it is used in both classification and regression analysis, is a classification model that partitions the independent variable space by sequentially applying various rules. When predicting target variables or solving classification problems, the model makes it possible to check which explanatory variable is the most important influencing factor and determines the detailed prediction and classification criteria for each explanatory variable [
A random forest (RF) is a bagging ensemble algorithm that trains several DT models and aggregates their results to make a prediction. The bagging ensemble algorithm trains individual DT models on datasets sampled with replacement from the original dataset. In addition, DT is based on the measure of uncertainty called entropy, which is expressed by the following expression [
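The entropy referred to above can be computed as follows (a standard Shannon-entropy sketch, not code from this study):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2(p_i)) of a label distribution,
    the impurity measure used to evaluate decision-tree splits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy([0, 0, 1, 1]))  # -> 1.0, a 50/50 split is maximally impure
print(entropy([0, 0, 0, 0]))  # -> 0.0, a pure node
```

A DT split is chosen to maximize the reduction in entropy (information gain) between the parent node and the weighted child nodes.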
KNN is an algorithm used to classify new data by referring to its nearest neighbors in the training set. KNN classification is expressed as follows [
For the input data
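A minimal from-scratch sketch of KNN classification by majority vote among the k nearest neighbors, assuming Euclidean distance:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances
    nearest = y_train[np.argsort(d)[:k]]           # labels of k closest points
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]                 # most frequent label

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.5]), k=3))  # -> 0
```

Since there is no training phase, all computation happens at prediction time, which is why KNN can be slow on large feature sets.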
SVM is an algorithm that performs classification using support vectors and hyperplanes. The data are classified by maximizing the margin between the separating hyperplane and the support vectors while minimizing the error [
The proposed BSEML model is an ensemble technique combining base-level classifiers to improve prediction performance [
The base-level classifier output is then provided to the meta-level classifier to make final predictions [
XGB is one of the most popular tree-based ensemble learning algorithms and is based on the principle of boosting. A strong prediction model is built by weighting the learning errors of the weak learners and reflecting them sequentially in the next learning model. Although the model is based on the gradient boosting machine (GBM), it addresses the slow execution time and lack of regularization that are the weaknesses of GBM [
In this experiment, the meta-level classifier increased the accuracy of the final predictions by applying weights to the predictions of weak learner models among the base classifiers and performing parallel learning.
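As an illustrative sketch of sequential boosting, the following uses scikit-learn's `GradientBoostingClassifier` as a stand-in (the XGBoost library itself is not assumed here; the synthetic data are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a simple separable rule

# Each new shallow tree is fitted to the residual errors of the
# ensemble built so far, so errors are corrected sequentially.
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                 random_state=0).fit(X, y)
print(gbm.score(X, y))                    # high training accuracy
```

XGBoost follows the same additive principle but adds regularization terms and a parallelized, optimized implementation.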
[Algorithm 1: BSEML training. Each base-level classifier is learned on the training data; a weight distribution is initialized, and the weights of the weak learners are determined and updated iteratively; finally, the meta-level classifier is learned on the base-level outputs.]
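The training procedure in the listing above can be approximated with scikit-learn's `StackingClassifier`. The sketch below uses the four base-level classifiers named in this study (DT, SVM, RF, KNN) but substitutes `GradientBoostingClassifier` for the XGB meta-level classifier, so it is an approximation of BSEML rather than the authors' exact algorithm; the data are synthetic:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)

base = [("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5))]

# The meta-level boosting classifier is trained on out-of-fold
# base-level predictions (cv=5), then the base models are refitted.
model = StackingClassifier(estimators=base,
                           final_estimator=GradientBoostingClassifier(
                               random_state=0),
                           cv=5).fit(X, y)
print(model.score(X, y))   # high training accuracy on this easy problem
```

Training the meta-level classifier on out-of-fold predictions prevents it from simply memorizing base-level overfitting, which is the core idea of stacked generalization.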
Once the BSEML model is trained, it can be utilized to classify wafer map patterns. Given wafer map
The WM-811K dataset, obtained in an actual industrial process, was used in this study; the dataset is publicly available in [
As feature extraction was not possible for wafer maps with fewer than 100 array elements, four abnormal wafer maps were removed. These four abnormal wafer maps were found to belong to the None class. Therefore, the number of wafer maps in the dataset was reduced to 172,946.
Class Index | Defect pattern | Wafer |
---|---|---|
1 | Center | 4294 |
2 | Donut | 555 |
3 | Edge-local | 5189 |
4 | Edge-ring | 9680 |
5 | Local | 3593 |
6 | Near-full | 149 |
7 | Random | 866 |
8 | Scratch | 1193 |
9 | None | 147427
Total | | 172946
In a dataset acquired from an actual process, the amount of data differs for each defect class, and in severe cases, the data are biased toward only the majority class. Machine learning algorithms generally assume that the classes are balanced. For a dataset with class imbalance, the model does not learn precisely and is biased toward the class that occupies a large proportion of the dataset [
The WM-811K dataset used in this study is imbalanced. The None class accounts for more than 90% of the total data, and the Donut and Near-full classes have insufficient defect patterns.
Therefore, in order to expand the number of defect images in the dataset and improve the generalization ability of the model, a data augmentation method based on CAE was used [
CAE is a variant of convolutional neural networks that is used as a tool for unsupervised learning of convolution filters [
However, unlike a standard AE, which completely ignores the 2D image structure, a CAE is a feature extractor that can learn directly from two-dimensional images [
SSIM was used to compare the difference between the original wafer image data and the augmented wafer image data. SSIM is a method designed to evaluate visual similarity rather than numerical error. SSIM specializes in deriving the structural information of the image and compares the degree of distortion of the structural information [
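A simplified, single-window SSIM sketch in NumPy (standard implementations such as `skimage.metrics.structural_similarity` average SSIM over local windows; this global version only illustrates the formula):

```python
import numpy as np

def ssim_global(x, y, L=1.0):
    """Global (single-window) SSIM combining luminance, contrast, and
    structure comparisons; L is the dynamic range of pixel values."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2   # stabilizing constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + C1) * (2 * cov + C2)) /
            ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2)))

img = np.random.default_rng(0).random((32, 32))
print(ssim_global(img, img))   # -> 1.0 for identical images
```

An SSIM of 1.0 indicates structurally identical images; the threshold of 0.9 used below retains only augmented wafer maps that preserve the original defect structure.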
In
In
In
By comparing raw image data and augmented image data on the SSIM scale, augmented images with an SSIM value of 90% or more were used as input to the feature extraction model [
In this study, four dataset cases were constructed from the training dataset for performance evaluation. Since the defect classes are severely imbalanced in the original data, three levels of augmentation were performed to solve this problem.
Case index | Defect pattern type | Data augmentation |
---|---|---|
Case 1 (Original) | Center | 90 |
Donut | 12 | |
Edge-Loc | 285 | |
Edge-ring | 31 | |
Loc | 297 | |
Near-full | 23 | |
Random | 74 | |
Scratch | 65 | |
None | 13,489 | |
Case 2 (30% augmentation) | Center | 630 |
Donut | 80 | |
Edge-Loc | 888 | |
Edge-ring | 527 | |
Loc | 891 | |
Near-full | 96 | |
Random | 592 | |
Scratch | 576 | |
None | 13,489 | |
Case 3 (40% augmentation) | Center | 900 |
Donut | 252 | |
Edge-Loc | 1,184 | |
Edge-ring | 806 | |
Loc | 1,188 | |
Near-full | 268 | |
Random | 888 | |
Scratch | 864 | |
None | 13,489 | |
Case 4 (50% augmentation) | Center | 1,170 |
Donut | 302 | |
Edge-Loc | 1,480 | |
Edge-ring | 1,054 | |
Loc | 1,485 | |
Near-full | 324 | |
Random | 1,110 | |
Scratch | 1,080 | |
None | 13,489 | |
This experiment was performed using Python 3.6 in an Ubuntu 12.04 environment, and handcrafted feature extraction was performed with the scikit-image library [
Macro-average
The confusion matrix is a table that supports the visualization of the performance of a trained classification algorithm in a classification problem. Each row of the matrix denotes an instance of the predicted class, and each column presents an instance of the actual class. The confusion matrix used in this experiment was normalized for effective analysis [
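The row/column convention described above, together with a macro-averaged F1 computed from the matrix, can be sketched as follows (illustrative code, not this study's implementation):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = predicted class, columns = actual class (the convention
    used in this study's confusion matrices)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[p, t] += 1
    return cm

def macro_f1(cm):
    """Unweighted mean of per-class F1 scores (macro-average)."""
    f1s = []
    for c in range(cm.shape[0]):
        tp = cm[c, c]
        prec = tp / cm[c, :].sum() if cm[c, :].sum() else 0.0
        rec = tp / cm[:, c].sum() if cm[:, c].sum() else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

cm = confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], 3)
print(macro_f1(cm))
```

Because macro-averaging weights every class equally, it rewards models that classify the rare defect classes well rather than only the dominant None class.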
The proposed model was compared with two basic classifiers and four ensemble-based models.
Metric | Case | SVM | KNN | Voting | Stacking | Bagging | Boosting | BSEML |
---|---|---|---|---|---|---|---|---|
1 | 0.457 | 0.274 | 0.471 | 0.517 | 0.432 | 0.398 | ||
2 | 0.535 | 0.647 | 0.668 | 0.821 | 0.807 | 0.851 | ||
3 | 0.545 | 0.761 | 0.765 | 0.867 | 0.833 | 0.872 | ||
4 | 0.526 | 0.756 | 0.753 | 0.891 | 0.843 | 0.901 | ||
1 | 0.788 | 0.836 | 0.879 | 0.922 | 0.898 | 0.907 | ||
2 | 0.839 | 0.877 | 0.885 | 0.926 | 0.912 | 0.923 | ||
3 | 0.806 | 0.892 | 0.895 | 0.930 | 0.957 | 0.927 | ||
4 | 0.799 | 0.873 | 0.895 | 0.934 | 0.961 | 0.933 |
Case index | Defect | SVM | KNN | Voting | Stacking | Bagging | Boosting | BSEML |
---|---|---|---|---|---|---|---|---|
Case 1 | Center | 0.558 | 0.176 | 0.378 | 0.390 | 0.400 | 0.475 | |
Donut | 0.323 | 0.333 | 0.452 | 0.365 | 0.331 | 0.436 | ||
Edge-Loc | 0.424 | 0.208 | 0.466 | 0.543 | 0.454 | 0.534 | ||
Edge-Ring | 0.400 | 0.316 | 0.400 | 0.545 | 0.222 | 0.308 | ||
Loc | 0.277 | 0.148 | 0.292 | 0.323 | 0.276 | 0.336 | ||
Near-Full | 0.167 | 0.556 | 0.933 | 0.933 | 0.667 | 0.222 | ||
Random | 0.650 | 0.080 | 0.545 | 0.515 | 0.595 | 0.727 | ||
Scratch | 0.267 | 0.132 | 0.346 | 0.296 | 0.320 | 0.261 | ||
None | 0.969 | 0.978 | 0.980 | 0.981 | 0.979 | 0.933 | ||
Case 2 | Center | 0.403 | 0.580 | 0.603 | 0.861 | 0.868 | 0.869 | |
Donut | 1.000 | 0.821 | 0.951 | 1.000 | 1.000 | 1.000 | ||
Edge-Loc | 0.276 | 0.353 | 0.416 | 0.605 | 0.754 | 0.627 | ||
Edge-Ring | 0.703 | 0.790 | 0.889 | 0.934 | 0.955 | 0.958 | ||
Loc | 0.192 | 0.262 | 0.246 | 0.512 | 0.494 | 0.565 | ||
Near-Full | 0.371 | 0.636 | 0.389 | 0.778 | 0.982 | 0.830 | ||
Random | 0.385 | 0.726 | 0.807 | 0.845 | 0.926 | 0.926 | ||
Scratch | 0.514 | 0.684 | 0.741 | 0.863 | 0.908 | 0.909 | ||
None | 0.970 | 0.971 | 0.969 | 0.967 | 0.975 | 0.975 | ||
Case 3 | Center | 0.435 | 0.756 | 0.694 | 0.871 | 0.883 | 0.891 | |
Donut | 0.994 | 0.871 | 0.887 | 1.000 | 1.000 | 0.994 | ||
Edge-Loc | 0.273 | 0.561 | 0.524 | 0.680 | 0.789 | 0.700 | ||
Edge-Ring | 0.714 | 0.891 | 0.905 | 0.969 | 0.970 | 0.971 | ||
Loc | 0.285 | 0.441 | 0.464 | 0.658 | 0.667 | 0.639 | ||
Near-Full | 0.382 | 0.673 | 0.762 | 0.824 | 0.859 | 0.878 | ||
Random | 0.371 | 0.857 | 0.856 | 0.918 | 0.935 | 0.898 | ||
Scratch | 0.386 | 0.825 | 0.797 | 0.909 | 0.905 | 0.906 | ||
None | 0.971 | 0.967 | 0.974 | 0.976 | 0.969 | 0.975 | ||
Case 4 | Center | 0.405 | 0.718 | 0.665 | 0.891 | 0.923 | 0.891 | |
Donut | 1.000 | 0.886 | 0.940 | 0.995 | 0.995 | 0.995 | ||
Edge-Loc | 0.297 | 0.557 | 0.567 | 0.758 | 0.785 | 0.759 | ||
Edge-Ring | 0.699 | 0.865 | 0.912 | 0.971 | 0.976 | 0.972 | ||
Loc | 0.234 | 0.489 | 0.483 | 0.693 | 0.741 | 0.693 | ||
Near-Full | 0.350 | 0.630 | 0.588 | 0.861 | 0.915 | 0.885 | ||
Random | 0.424 | 0.803 | 0.943 | 0.953 | 0.921 | 0.916 | ||
Scratch | 0.354 | 0.796 | 0.912 | 0.918 | 0.917 | 0.971 | ||
None | 0.971 | 0.966 | 0.967 | 0.976 | 0.976 | 0.947 |
The proposed model presents good performance for all defect classes. Such results indicate that the proposed model increases the
In this study, an algorithm that combines the reinforcement of insufficient defect patterns with a high-performing hybrid model was proposed. The proposed method performs data augmentation on the image-type wafer map using a CAE, and features are subsequently extracted by applying density-based, geometry-based, and Radon-based feature extraction methods. This feature extraction improved the efficiency of the wafer defect identification system by providing detailed information about the wafer map and reducing the amount of computation required for learning. Then, four machine learning classifiers were stacked, and an ensemble model was built using the XGB classifier as the meta-level classifier. The proposed method demonstrated classification performance superior to that of the base-level classifiers and ensemble models and showed robustness against insufficient defects. The effectiveness of the proposed method was verified experimentally using real datasets.
The improved classification performance demonstrated in this study is expected to contribute significantly to the stable automation of wafer map classification, leading to improvements in product quality and yield in the actual semiconductor manufacturing process. Based on the proposed model, it will be possible to develop models that guarantee robust performance in various manufacturing domains, and to develop models optimized for a given domain by applying actual datasets from various manufacturing fields.