Open Access
ARTICLE
Impact of Dataset Size on Machine Learning Regression Accuracy in Solar Power Prediction
1 Department of Electrical and Electronic Engineering, International University of Business Agriculture and Technology, Dhaka, 1230, Bangladesh
2 Department of Electrical and Electronic Engineering, Dhaka University of Engineering and Technology, Gazipur, 1707, Bangladesh
3 Department of Electrical Engineering, INTI International University, Persiaran Perdana BBN, Putra Nilai, Nilai, 71800, Malaysia
4 Department of Mechanical Engineering, International University of Business Agriculture and Technology, Dhaka, 1230, Bangladesh
5 Department of Sustainable and Renewable Energy Engineering, University of Sharjah, Sharjah, 27272, United Arab Emirates
* Corresponding Author: Mamdouh Assad. Email:
(This article belongs to the Special Issue: Advances in Renewable Energy Systems: Integrating Machine Learning for Enhanced Efficiency and Optimization)
Energy Engineering 2025, 122(8), 3041-3054. https://doi.org/10.32604/ee.2025.066867
Received 19 April 2025; Accepted 19 June 2025; Issue published 24 July 2025
Abstract
Understanding how dataset size affects regression models can improve the accuracy of solar power forecasts and help renewable energy systems operate closer to their potential. This research examines the influence of dataset size on the accuracy and reliability of regression models for solar power prediction. The study analyzes data from two solar panels, aSiMicro03036 and aSiTandem72-46, over 7, 14, 17, 21, 28, and 38 days. Each dataset comprises five independent parameters and one dependent parameter and is split 80–20 for training and testing. Results indicate that Random Forest consistently outperforms the other models, achieving the highest correlation coefficient of 0.9822 and the lowest Mean Absolute Error (MAE) of 2.0544 on the aSiTandem72-46 panel with 21 days of data. For the aSiMicro03036 panel, the best MAE of 4.2978 was obtained with the k-Nearest Neighbor (k-NN) algorithm, implemented in Weka as the instance-based learner IBk and trained on 17 days of data. Regression performance for most models (excluding IBk) stabilizes at 14 days or more. Compared to the 7-day dataset, increasing to 21 days reduced the MAE by around 20% and improved correlation coefficients by around 2.1%, highlighting the value of moderate dataset expansion. These findings suggest that datasets spanning 17 to 21 days, with 80% used for training, can significantly enhance the predictive accuracy of solar power generation models.
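The evaluation quantities named in the abstract (an 80–20 split, MAE, the correlation coefficient, and the percentage change between two dataset sizes) can be illustrated with a minimal sketch. This is not the authors' Weka workflow: the function names, the shuffled split, the fixed seed, and the toy numbers below are all assumptions made for illustration only.

```python
import math
import random


def train_test_split_80_20(rows, seed=0):
    """Shuffle rows and split them 80% train / 20% test.

    Shuffling with a fixed seed is an assumption of this sketch; the
    abstract only states that the data were split 80-20.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def relative_change_percent(old, new):
    """Percentage change from old to new (negative means a reduction)."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    # Toy values, not measurements from the study.
    actual = [10.0, 20.0, 30.0, 40.0]
    predicted = [12.0, 18.0, 33.0, 39.0]
    # MAE = (2 + 2 + 3 + 1) / 4 = 2.0
    print("MAE:", mean_absolute_error(actual, predicted))
    print("Correlation:", correlation_coefficient(actual, predicted))
    # A hypothetical MAE falling from 5.0 to 4.0 is a 20% reduction.
    print("MAE change (%):", relative_change_percent(5.0, 4.0))
```

Under these definitions, a lower MAE and a correlation coefficient closer to 1 both indicate better agreement between predicted and measured power, which is how the comparisons across dataset sizes in the abstract can be read.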
Copyright © 2025 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

