TY  - EJOU
AU  - Karim, S. M. Rezaul
AU  - Hossain, Md. Shouquat
AU  - Akter, Khadiza
AU  - Sarker, Debasish
AU  - Kabir, Md. Moniul
AU  - Assad, Mamdouh
TI  - Impact of Dataset Size on Machine Learning Regression Accuracy in Solar Power Prediction
T2  - Energy Engineering
PY  - 2025
VL  - 122
IS  - 8
SN  - 1546-0118
AB  - Knowing the influence of the size of datasets for regression models can help in improving the accuracy of a solar power forecast and make the most out of renewable energy systems. This research explores the influence of dataset size on the accuracy and reliability of regression models for solar power prediction, contributing to better forecasting methods. The study analyzes data from two solar panels, aSiMicro03036 and aSiTandem72-46, over 7, 14, 17, 21, 28, and 38 days, with each dataset comprising five independent and one dependent parameter, and split 80–20 for training and testing. Results indicate that Random Forest consistently outperforms other models, achieving the highest correlation coefficient of 0.9822 and the lowest Mean Absolute Error (MAE) of 2.0544 on the aSiTandem72-46 panel with 21 days of data. For the aSiMicro03036 panel, the best MAE of 4.2978 was reached using the k-Nearest Neighbor (k-NN) algorithm, which was set up as instance-based k-Nearest neighbors (IBk) in Weka after being trained on 17 days of data. Regression performance for most models (excluding IBk) stabilizes at 14 days or more. Compared to the 7-day dataset, increasing to 21 days reduced the MAE by around 20% and improved correlation coefficients by around 2.1%, highlighting the value of moderate dataset expansion. These findings suggest that datasets spanning 17 to 21 days, with 80% used for training, can significantly enhance the predictive accuracy of solar power generation models.
KW  - Correlation coefficients; dataset size; machine learning; mean absolute error; regression; solar power prediction
DO  - 10.32604/ee.2025.066867