Automated hyperparameter optimization (HPO) is a crucial and time-consuming part of the automatic generation of efficient machine learning models. Previous studies can be classified into two major categories in terms of reducing training overhead: (1) sampling promising hyperparameter configurations and (2) pruning non-promising configurations. Adaptive sampling and resource scheduling are combined to reduce cost, increasing the number of evaluations spent on more promising configurations to find the best model in a given time. That is, these strategies aim to identify the best-performing models at an early stage within a certain deadline. Although such time and resource constraints are significant for designing HPO strategies, previous studies focused only on parallel exploration efficiency using resource awareness. In this study, we propose a novel diversification strategy for HPO, which exploits dynamic hyperparameter space allocation for a sampler according to the remaining time budget. We provide a simple yet effective method to accelerate the maturity of the sampler that is independent of the sampling algorithm. Compared to previous resource-aware solutions, our solution achieves better performance via both time and resource awareness. We demonstrate the performance gains of our solution on several well-known HPO benchmarks. Furthermore, we implement it on our high-performance computing AI convergence platform. Considering the different types of users, both a fully automated HPO service based on graphical user interface (GUI) interfaces and HPO job management via Python application programming interfaces (APIs) on JupyterLab are publicly served on the platform.
Automated machine learning (AutoML) automates the whole machine learning (ML) pipeline, which includes data collection, preprocessing, feature extraction, feature selection, model training, validation, and model outcome integration in various business processes without human intervention; it aims to reduce the demand for human experts and achieve optimal performance on a given task or dataset. Recently, several businesses in the artificial intelligence (AI) space have started employing AutoML owing to data heterogeneity, model sensitivity, and/or service efficiency; AutoML can be used in several applications such as speech recognition and image recognition. For example, Google AutoML provides several services according to the type of user dataset, such as natural language, vision, and tables. By contrast, Amazon SageMaker offers data preprocessing that enables the user to easily build, train, integrate, and deploy AutoML models using a given dataset at any scale. In addition, several studies have focused on developing and employing model compression, which is particularly relevant in large data applications, by reducing the size of neural networks without diminishing the model’s accuracy, to develop ML models that reflect realistic solutions (e.g., real-time diagnostics, inferences on a mobile device). The most crucial part of AutoML is automated model generation, owing to the critical selection process of a wide range of hyperparameters that greatly affects the model’s architecture, regularization, and optimization. Thus, data scientists often spend a significant amount of time tuning hyperparameters to obtain the best-performing model. Over the past decade, a variety of automated hyperparameter optimization (HPO) techniques have been proposed, which aim to improve a model’s performance by choosing the right set of hyperparameters, i.e., by finding the globally optimal configuration.
Previous studies have applied Bayesian optimization to the HPO problem, including sequential model-based Bayesian optimization (SMBO) algorithms such as Spearmint [
To the best of our knowledge, only the recent research by Liaw et al. [
In this study, we first design a sampling strategy under time awareness. We propose a novel diversification strategy that exploits dynamic hyperparameter space allocation. According to the remaining time, an application-level scheduler dynamically restricts the hyperparameter space by shrinking it and then gradually expanding it. This simple yet effective method accelerates the maturity of a sampling model and is independent of the type of sampling algorithm. It can be combined with ASHA as a pruner for efficient parallel evaluation, so that the solution achieves better performance at a given time, via both time and resource awareness, than existing solutions.
The rest of this paper is organized as follows. Section 2 reviews related work on the overall HPO schemes that motivated the proposed scheme. Section 3 presents the system architecture and diversification strategies of the proposed scheme, and Section 4 compares its performance with that of previous schemes on well-known HPO benchmarks. Section 5 introduces the systemic implementation of the proposed scheme as a practical web service on the HPC AI convergence platform, including challenging issues and design principles. Section 6 concludes the study.
In conventional ML, hyperparameters are a set of values that affect how ML algorithms, including support vector machines, random forests, and deep learning, fit a model, and they are set prior to the learning process. Moreover, recent HPO studies have conducted in-depth analyses of various model configurations using hyperparameters, including the simultaneous handling of multiple algorithms, the use of different data sampling, and the search for an excellent network architecture of a deep learning model, to automatically build the best-performing model that is selected as the best configuration among possible candidates. Strictly speaking, these are not sets of hyperparameters, but many researchers exploit such configurations in HPO problems. Therefore, in this study, we also consider them hyperparameters for convenience. We also cite previous studies that have discussed major challenges of HPO, such as sampling configurations, pruning configurations, and parallelization.
The hyperparameter configuration space, also known as the search space, organizes all the hyperparameters and is where configurations are sampled. The goal of a sampling algorithm is to find the best-performing configuration out of all possible candidates. Sampling strategies can be classified into non-adaptive and adaptive solutions. Grid search is one of the simplest and most intuitive non-adaptive solutions, in which sets of parameters are selected at regular intervals from a manually specified subset of the search space to build models. Since grid search defines the search space as a grid of all possible hyperparameter values and evaluates every position in the grid, it is unlikely to find the optimal value unless it happens to lie on the grid. Random Search [
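The contrast between the two non-adaptive strategies can be illustrated with a minimal sketch; the toy objective and its minimum location are illustrative assumptions, not taken from the paper:

```python
import itertools
import random

def objective(x, y):
    # Toy objective whose minimum lies at (0.37, 0.73), off any coarse grid.
    return (x - 0.37) ** 2 + (y - 0.73) ** 2

def grid_search(resolution):
    # Evaluate every position of a regular grid over [0, 1]^2.
    axis = [i / (resolution - 1) for i in range(resolution)]
    return min(objective(x, y) for x, y in itertools.product(axis, axis))

def random_search(n_trials, seed=0):
    # Evaluate n_trials uniformly sampled points over [0, 1]^2.
    rng = random.Random(seed)
    return min(objective(rng.random(), rng.random()) for _ in range(n_trials))

best_grid = grid_search(resolution=8)   # 64 evaluations at fixed positions
best_rand = random_search(n_trials=64)  # 64 evaluations at random positions
```

With the same evaluation budget, grid search can only get as close to the optimum as the grid spacing allows, while random search is not tied to fixed positions, which is the usual argument for preferring it.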
An adaptive solution samples a configuration that is expected to perform better based on the sampling history to date and the corresponding evaluation scores. Several Bayesian optimization (BO)-based algorithms have been proposed and improved over the past decade, such as Spearmint, SMAC, and TPE. BO-based algorithms consist of a surrogate model, which estimates the evaluation score instead of actually evaluating the sampled configuration, and an acquisition function, which suggests a new sample for discovering promising configurations using the results of the surrogate model. The surrogate model usually employs the Gaussian process (GP), random forest [
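The TPE idea mentioned above can be sketched in one dimension as follows; this is a hedged illustration of the general technique, not the cited implementation, and the bandwidth, quantile, and candidate counts are illustrative assumptions:

```python
import random
from statistics import NormalDist

def kde(points, bandwidth=0.1):
    # Kernel density estimate: one Gaussian kernel centred on each observation.
    kernels = [NormalDist(mu=p, sigma=bandwidth) for p in points]
    return lambda x: sum(k.pdf(x) for k in kernels) / len(kernels)

def tpe_suggest(history, gamma=0.25, n_candidates=24, seed=0):
    # history: list of (x, loss) pairs over a [0, 1] search range.
    # Split observations into a "good" fraction and the remainder, model each
    # set with a density, and return the candidate maximising l(x) / g(x).
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda t: t[1])
    n_good = max(1, int(gamma * len(ranked)))
    l = kde([x for x, _ in ranked[:n_good]])   # density of good configurations
    g = kde([x for x, _ in ranked[n_good:]])   # density of the remainder
    candidates = [rng.random() for _ in range(n_candidates)]
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))

rng = random.Random(1)
history = [(x, (x - 0.6) ** 2) for x in (rng.random() for _ in range(20))]
suggestion = tpe_suggest(history)
```

The density ratio plays the role of the acquisition function: regions where good observations concentrate and bad ones do not are sampled next.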
In addition, most adaptive samplers suffer from low accuracy when the sampling history is insufficient; in the worst cases, this results in performance lower than that of a random draw. We refer to this as the sampling maturity problem and describe it in detail in the proposed scheme.
If we build all models using an entire set of hyperparameter candidates, a large computing time is required. For example, to optimize nine types of hyperparameters for the LeNet-1 model using the Modified National Institute of Standards and Technology (MNIST) dataset, 20,000 models need to be generated in Cho et al. [
The number of parameters employed in ML models increases as the tasks, which are to be performed by ML, become increasingly sophisticated and complex. Increasing the number of parameters inevitably requires a larger amount of data to mature a model. Therefore, to understand an ML model with a complex structure such as deep learning, a given dataset should be iteratively learned. In other words, given a dataset, we might consider it as the cost of training the ML model. For example, if the cost of training a dataset is
In past years, several studies have been proposed to realize a successful parallelization of HPO. Since a sampler suggests nearly identical combinations given the same sampling history, whether parallelized or not, naive parallelization alone cannot improve HPO performance. Therefore, in parallel BO algorithms, various methods have been proposed to diversify the sampling results. For example, Contal et al. [
Population-based training (PBT) [
In this section, we first look at the factors that may affect the maturity of the sampler for HPO problems and then discuss the strategies to effectively control them.
We refer to a study as a set of trials in which each individual trial evaluates the objective function, which contains the ML model training and validation processes via the testing dataset, using a specific hyperparameter configuration. Each trial exploits a different configuration sampled by its own sampling algorithm to optimize the objective function. We assume that each study has a static budget, including the number of computing resources and the time limit for finishing the optimization process. Although each trial runs independently when parallelized, all trials can be finished simultaneously by their scheduler.
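The study/trial structure and static time budget described above can be sketched as follows; the objective and sampler are toy stand-ins for a real model-training objective, and all names are illustrative:

```python
import random
import time

def objective(config):
    # Stand-in for model training + validation; returns a score to minimise.
    return (config["lr"] - 0.01) ** 2 + (config["depth"] - 6) ** 2 * 1e-6

def run_study(sample, time_budget_s, rng):
    # A study is a sequence of trials, each evaluating one sampled
    # configuration, stopped when the static time budget is exhausted.
    deadline = time.monotonic() + time_budget_s
    best = (float("inf"), None)
    while time.monotonic() < deadline:
        config = sample(rng)           # each trial gets its own configuration
        score = objective(config)
        best = min(best, (score, config), key=lambda t: t[0])
    return best

sampler = lambda rng: {"lr": rng.uniform(1e-4, 1e-1), "depth": rng.randint(2, 12)}
best_score, best_config = run_study(sampler, time_budget_s=0.05, rng=random.Random(0))
```

In a parallelized study, the same deadline would be shared by all workers, which is what allows the scheduler to finish every trial at once.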
A sampler, which selects a configuration from the hyperparameter space, relies completely on the previous evaluations of the objective function. It is therefore theoretically identical to the problem of designing a recommendation model that predicts a promising new configuration by identifying the correlation between configurations and the evaluation scores of the objective function using the dataset of finished evaluations. Unlike general ML modeling, where a large dataset is prepared in advance and models are designed based on it, samplers in HPO problems must predict values in real time with almost no data. Thus, intuitive descriptors, which can affect the maturity of a sampler, are defined as
Alternatively, we can alleviate this problem by temporarily limiting the search space the sampler has to search by decreasing
We now introduce GES, which runs under recognition of the time budget for the early maturing of an HPO sampler. GES divides the training steps of a model into multiple steps according to the given budget and partially restricts the given search space at each step. As shown in Algorithm 1, GES first operates by exploiting both the hyperparameter importance among well-known parameters and their default initial values, providing a guided search that utilizes well-known criteria, whereas conventional samplers commonly set the search direction through random observations. In particular, if the type of task or the nature of the data is similar, we can expect a faster search than searching from scratch.
The budget is used as an important criterion for gradual expanding. GES divides the entire search space into several steps based on the budget, referring to them as the search dimension
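The shrink-then-expand mechanism can be sketched as follows. This is a hedged illustration of the idea, not the paper's exact Algorithm 1: the step thresholds, the linear width rule, and the policy of pinning not-yet-opened dimensions at their defaults are illustrative assumptions.

```python
def ges_space(full_space, defaults, importance, elapsed_frac, n_steps=4):
    # Early in the budget, only the most important hyperparameters are
    # searched, in ranges shrunk around their defaults; both the dimension
    # and the ranges expand step by step as the budget is consumed.
    # full_space: {name: (low, high)}; importance: names, most -> least.
    step = min(int(elapsed_frac * n_steps), n_steps - 1)
    n_dims = max(1, round(len(importance) * (step + 1) / n_steps))
    width_frac = (step + 1) / n_steps        # fraction of full range searched
    space = {}
    for name in importance[:n_dims]:
        low, high = full_space[name]
        half = (high - low) * width_frac / 2
        centre = defaults[name]
        space[name] = (max(low, centre - half), min(high, centre + half))
    # Dimensions not yet opened stay pinned at their default values.
    fixed = {n: defaults[n] for n in importance[n_dims:]}
    return space, fixed

full = {"eta": (0.0, 1.0), "max_depth": (1, 15), "subsample": (0.1, 1.0)}
dflt = {"eta": 0.3, "max_depth": 6, "subsample": 1.0}
rank = ["eta", "max_depth", "subsample"]
early, pinned_early = ges_space(full, dflt, rank, elapsed_frac=0.0)
late, pinned_late = ges_space(full, dflt, rank, elapsed_frac=0.99)
```

At the first step only the top-ranked parameter is searched in a narrow band around its default; by the last step every parameter is open over (up to) its full range.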
Since sampling operates independently of both pruning and parallelization, the sampler performance is evaluated in a single-processor, non-pruning environment. The features of the proposed scheme are analyzed using a simple test function, and then the performance of the HPO framework is compared on several well-known regression datasets.
The Beale function is multimodal with sharp peaks at the corners of the input domain. The global minimum
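For reference, the Beale function and its known global minimum, f(3, 0.5) = 0 on the usual [-4.5, 4.5]^2 domain, can be written directly:

```python
def beale(x, y):
    # Beale test function; global minimum f(3, 0.5) = 0 on [-4.5, 4.5]^2.
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)
```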
As shown in
The 6-dimensional Hartmann function has six local minima, and the function is usually evaluated on the hypercube
We use UCI regression datasets for the HPO benchmark. The test dataset was generated by randomly selecting 20% of the entire dataset before model training, and 5-fold cross-validation was applied. In
Hyperparameter Name | Type | Range | Default | Rank
---|---|---|---|---
eta | Continuous <float> | [0.0, 1.0] | 0.3 | 1
Overall, the proposed method using GES showed the best performance in all simulations. The relative performance of TPE and BO varies with the type of dataset used, but both performed well on most problems. Since the sampling seed sequence is the same, TPE initially operates in the same way as RS; however, after the sampler learns the relationships between parameters to some extent, its results become superior to those of RS. CMA showed lower performance than the other samplers owing to the difficulty of maturing within the given search range.
Dataset | Algorithm | avg. Trial time (sec) | avg. Study time (sec) | xTimes compared to RS
---|---|---|---|---
Boston House Prices | GESTPE (ours) | 1.58 | 79.06 | 1.378312
Power Plant | GESTPE (ours) | 50.93 | 2546.82 | 1.810518
Concrete | GESTPE (ours) | 2.64 | 131.9 | 1.39385
Parkinsons | GESTPE (ours) | 17.52 | 876.02 | 1.696628
ENB2021: y1 | GESTPE (ours) | 1.58 | 79.05 | 1.233037
ENB2021: y2 | GESTPE (ours) | 1.69 | 84.79 | 1.320305
Naval: gt_c_decay | GESTPE (ours) | 10.33 | 516.39 | 4.508775
Naval: gt_t_decay | GESTPE (ours) | 9.18 | 459 | 4.137372
Protein | GESTPE (ours) | 53.69 | 2684.75 | 2.122433
Yacht Hydrodynamics | GESTPE (ours) | 1.01 | 50.39 | 1.319801
Slice Localization | GESTPE (ours) | 963.45 | 48172.77 | 1.831621
Total | GESTPE (ours) | N/A | N/A |
We describe the systemic implementation of the proposed scheme to provide a practical web service on our HPC AI convergence platform. We first introduce several challenging issues and then suggest design principles to solve them.
Before designing and developing the HPO scheme on the online platform, it is important to consider the users. If the goal is to automatically design high-performance AI models that anyone can easily build, the HPO interface should be simple. For example, in the case of novice users who want to automatically generate AI models using an AutoML service, if they first need to learn how to select valid hyperparameters according to a given task and data type and how to determine the search ranges of these parameters, the utilization of the service will drop significantly owing to the complexity of the process. Meanwhile, if the goal is to automatically maximize a pre-defined objective function, HPO should be ready for the model developer without any concern about the infrastructure. If possible, HPO schemes such as parallel training and hyperparameter search should be provided as simple APIs, such as Python APIs or REST APIs, to facilitate an easier means of developing the best-performing AI model in the same development environment. For example, users should be able to do everything in a Jupyter notebook if possible.
As mentioned above, the quality and quantity of configurations consumed have a close influence on the maturity of the sampler. In terms of quantity, a certain number of evaluations is required to mature the sampler, and the cost depends on some parameters regarding training complexity such as epochs, learning rate, the number of layers, and the length of layers. In terms of quality, the larger the hyperparameter space, the more configurations the sampler requires to mature.
To date, various sampling algorithms have been proposed to select better (promising) configurations, but no single one performs best in all cases. That is, the performance of a sampler may vary depending on the given task and dataset type. In addition, in most cases, ML modeling is closely related to data generation and pre-processing, except when a given dataset starts from a defined competition (e.g., a Kaggle competition). Thus, the HPO process might be repeated several times. In other words, when data or new features are added, the whole HPO process has to be redone from scratch.
The entire search range changes depending on how the hyperparameter space (search space) is set, resulting in a rapid change in the amount of computation required to obtain the desired level of model. For example, the best model using the MNIST dataset can achieve an accuracy of up to 99%; therefore, it is not difficult to develop a model that achieves more than 90% accuracy. However, in the case of HPO, depending on how the search range is specified, there is a high possibility that it will take a few hours to find a model with 99% accuracy, or that none will be found at all even after a few days of GPU time. In addition, if the search range is too wide, it takes a significant amount of time for the sampler to mature, and in some cases, even if a large number of resources are consumed, the desired level of results may not be obtained at all.
Effectively performing HPO using parallel resources is perhaps the most crucial factor in achieving practical AutoML. In particular, parallelization efficiency in HPO goes beyond reducing the overall exploration time by simultaneously evaluating multiple configurations: because the resources given to us are always finite, it determines whether a model that satisfies the requirements can be created within a single project. As shown in the case of task dependency, it is virtually impossible to wait months for an HPO task that might be repeated several times from scratch.
One of the most difficult aspects of designing an ML model is estimating the amount of resources and effort required to develop the desired level of model. Since HPO is a best-effort solution for finding a semi-optimal hyperparameter configuration, the optimization is expected to stop after either a given number of configuration evaluations or a given amount of resource time.
We introduce, in detail, the suggested principles to help deal with the above challenges in the high-performance computing (HPC) AI convergence platform.
First, note that to use the HPO services on the platform, we provide GUI interfaces for novice users and Python APIs for expert users, which are considered different types of HPO users. On the platform, users have personal storage managed by JupyterLab so that they can easily upload their own datasets via the web interface. The system supports a CSV-formatted structured dataset, or a CSV file containing relative file paths, in the GUI-based HPO interface. After a dataset is selected, the hyperparameters to optimize, the task type, the algorithm to be modelled, and the resource budgets, such as seconds and the number of nodes for training models, are chosen. These selections are managed by a metadata file. Subsequently, Python scripts and a job file are automatically generated. We exploit the SLURM workload manager to schedule jobs, and submitted jobs can be managed via the web interfaces. Users can also submit their HPO jobs on JupyterLab directly via Python APIs. With the well-known HPO framework OPTUNA [
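The metadata-to-job-file step above could look roughly like the following sketch. The metadata field names, script layout, and `run_hpo.py` entry point are illustrative assumptions, not the platform's actual ones; only the `#SBATCH` directive syntax is standard SLURM.

```python
def make_slurm_job(meta):
    # Turn a (hypothetical) GUI metadata record into a minimal SLURM script.
    h, rem = divmod(meta["deadline_s"], 3600)
    m, s = divmod(rem, 60)
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name=hpo-{meta['study_name']}",
        f"#SBATCH --nodes={meta['nodes']}",
        f"#SBATCH --time={h:02d}:{m:02d}:{s:02d}",
        f"srun python run_hpo.py --dataset {meta['dataset']}"
        f" --algorithm {meta['algorithm']}",
    ])

meta = {"study_name": "demo", "dataset": "train.csv",
        "algorithm": "xgboost", "nodes": 2, "deadline_s": 3600}
script = make_slurm_job(meta)
```

Generating the script from a single metadata file keeps the GUI and the Python-API paths consistent: both ultimately produce the same kind of batch job.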
Second, diversification is achieved through an ensemble of sampling algorithms in our HPO scheme to handle the sampler maturity issue. Since the internal characteristics of the sampling algorithms are very different, the ensemble rule is set in consideration of the given time constraint. For example, BO-based algorithms, such as GP, generally propose configurations that are expected to perform better than those generated by the other algorithms; however, their computational cost grows as the number of consumed configurations increases, which is a drawback of these algorithms. Since the common TPE algorithm infers each parameter independently, it is very fast compared to tightly coupled algorithms such as GP or CMA-ES; however, TPE generally requires many configurations to achieve the desired level of results. In the case of CMA-ES, effective sampling can be performed once sufficient configurations are available, but high performance is difficult to expect in early-stage sampling. Therefore, on the platform, we propose a progressive scheme that sequentially applies BO, TPE, and CMA-ES in consideration of parallel resources. Through this method, the task dependency of the sampler is removed and diversification is performed.
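The progressive hand-over can be sketched as a simple phase schedule over the elapsed fraction of the time budget; the 30%/70% thresholds here are illustrative assumptions, not the platform's tuned values:

```python
def pick_sampler(elapsed_frac):
    # Progressive ensemble sketch: BO gives strong early suggestions, TPE is
    # cheap per suggestion mid-run, CMA-ES works best once many
    # configurations have accumulated. Thresholds are illustrative.
    if elapsed_frac < 0.3:
        return "BO"
    if elapsed_frac < 0.7:
        return "TPE"
    return "CMA-ES"

schedule = [pick_sampler(t / 10) for t in range(10)]
```

Each worker consults the schedule when it needs a new suggestion, so the ensemble requires no synchronization between parallel trials.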
Third, we suggest several methods for handling both sampler maturity and search space sensitivity. In terms of configuration quality, the proposed scheme includes the GES strategy, which limits the hyperparameter search range according to the given time to accelerate the maturity of the sampler and dynamically allocates the hyperparameter space. Such a heuristic method can help the sampler quickly grasp the relationship between the reduced dimensions of hyperparameters and their evaluation scores. Subsequently, it can gradually expand the search range to find the global optimum effectively. In terms of configuration quantity, we add some parameters to the hyperparameter space that are strongly correlated with evaluation time. For example, the epoch count, the learning rate, or even the sampling rate of the dataset can be added to increase the number of finished evaluations in a certain time period. As previously stated, a gain in hyperparameter optimization efficiency can be obtained even by adjusting the sampling of a dataset. Given limited time, it may be better to conduct several quick searches than to try to train on the full dataset with a specific configuration. Additionally, guided sampling can be used: if a user knows some good configurations in advance, the user can add them in the early stage of the HPO process. This helps to quickly reach an approximation close to the desired evaluation score in the most common situation, where time constraints exist.
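Guided sampling amounts to evaluating user-supplied configurations before falling back to the sampler's normal strategy. A minimal sketch (the names and random fallback are illustrative assumptions):

```python
import random

def make_sampler(seed_configs, space, seed=0):
    # Guided sampling sketch: user-supplied known-good configurations are
    # consumed first, then the sampler falls back to its normal strategy
    # (here: uniform random over space = {name: (low, high)}).
    rng = random.Random(seed)
    queue = list(seed_configs)
    def suggest():
        if queue:
            return queue.pop(0)          # evaluate user hints first
        return {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
    return suggest

suggest = make_sampler(
    seed_configs=[{"eta": 0.3}, {"eta": 0.1}],
    space={"eta": (0.0, 1.0)},
)
first, second, third = suggest(), suggest(), suggest()
```

Mature HPO frameworks expose a similar mechanism; for example, Optuna's `Study.enqueue_trial` queues a fixed configuration to be evaluated before sampled ones.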
Fourth, ASHA is used as the pruner in the proposed scheme, since it is more important to increase the utilization of each resource in a massively parallel computing environment such as HPC. As mentioned above, the complexity of the AI models to be optimized increases gradually nowadays, and the number of resources required to solve them increases rapidly; therefore, the utilization of each computing resource should be maximized through aggressive early stopping. This is also related to the sampler maturity issue. Synchronous pruning algorithms such as HyperBand and Successive Halving (SH) determine the amount of resources to allocate through comparisons of median values, even when their samplers select a set of non-promising configurations. In other words, since such synchronous promotion largely depends on the maturity of the sampler, we adopted the asynchronous method as the pruner.
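ASHA's asynchronous promotion rule can be sketched as follows. This is a simplification for illustration: real ASHA promotes the best *unpromoted* configuration on a rung, whereas this sketch simply checks whether the configuration just finished is in the top 1/eta of results seen so far on its rung.

```python
def asha_step(rungs, config, score, eta=3):
    # Record `score` for `config` at its current rung, then decide
    # immediately (no synchronisation barrier, unlike Successive Halving)
    # whether to promote it to the next rung or stop it.
    rung = config["rung"]
    rungs.setdefault(rung, []).append(score)
    results = sorted(rungs[rung])
    k = max(1, len(results) // eta)
    if score <= results[k - 1]:          # top-1/eta so far: promote
        config["rung"] = rung + 1
        return True                       # continue training at larger budget
    return False                          # prune this configuration

rungs = {}
promoted = [asha_step(rungs, {"rung": 0}, s) for s in (0.9, 0.5, 0.7, 0.2)]
```

Because each worker makes the decision the moment its trial reports a score, no worker ever idles waiting for a rung to fill up, which is what drives the resource utilization argument above.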
Finally, the proposed scheme determines all detailed strategies based on the deadline allowed for an HPO job. By using several diversification methods that consider both time and resource awareness, the proposed scheme has been implemented on our platform as a practical HPO service.
This section discusses a case study in which an ML model is developed using several HPO services on the HPC AI convergence platform.
On the top left of
Currently, the supported algorithms are: multiple linear regression (MLR), support vector machine, random forest, boost tree, XGBoost, LightGBM, and deep learning (PyTorch-based). The job submission is completed by setting the budget such as the time deadline (seconds) and the number of nodes. The use-case diagram is shown in
As shown in
Additionally, the
AutoML can save a significant amount of the time and resources required to build high-performance models by automating time-consuming processes in model optimization, such as hyperparameter tuning. In addition, since the performance of most AI models depends greatly on, and is highly sensitive to, the selection of the model's hyperparameters, AutoML's efficiency in hyperparameter selection is particularly noteworthy. Various methods have been proposed for efficient hyperparameter optimization over the past decade, mainly focusing on finding the best combination within a given input algorithm and hyperparameter search category; in other words, an AutoML problem requires an appropriate hyperparameter search space to be prepared in advance. In addition, recent studies on large-scale parallelization of hyperparameter optimization have focused on exploration efficiency to improve productivity within a given resource constraint. However, studies that consider the time constraint are few. From the user's perspective, the time constraint plays a crucial role in setting the direction for solving this problem; thus, this perspective must be reflected in a practical AutoML service. Since most HPO sampling strategies proposed to date are based on sampling history datasets, their samplers can understand configuration datasets and recommend promising new configuration sets only after sufficient sampling has been performed. In other words, the amount of metadata composed of evaluated configurations tends to determine the performance of the sampler. Unfortunately, the amount of time and resources we can use is always limited; therefore, we need a method to adjust the amount of metadata obtainable in a given time according to the type and size of the problem.
In this study, we propose a novel diversification strategy for HPO that exploits dynamic hyperparameter space allocation for a sampler according to the remaining time budget. In the proposed scheme, the search range of a specific model starts from the hyperparameters mainly used for a given task, with their well-known initial values as defaults, and the optimal value is then searched for by expanding the dimension of the search space over time. Fundamentally, the goal of AutoML is to lower entry barriers for novice users or to automate repeated procedures for intermediate and/or high-level users; therefore, its interface should be as simple as possible. Also, the details for optimizing a model, such as programming code and job scripts, should be automatically generated and easily managed. The performance evaluations show that our solution outperforms existing approaches. We developed an easy-to-use AutoML service on our HPC AI convergence platform, considering several challenging issues in designing a practical hyperparameter optimization service.
In the future, we plan to develop a wider range of efficient AutoML services, including automated feature engineering (AFE) as well as existing HPO problems, and build them into our platform. In particular, for an ML model that uses large-scale structured data such as scientific data, not only performance but also the explainability and interpretability that provide the basis for model predictions are important. Therefore, we plan to expand the existing service into an explainable one. Moreover, we further plan to study complex tasks beyond the previous simple regression or classification tasks, which require simultaneous multiple model optimization (i.e., automated design of generative models [