Open Access
ARTICLE
A Composite Loss-Based Autoencoder for Accurate and Scalable Missing Data Imputation
Department of Computer Engineering, Faculty of Engineering, Eskisehir Technical University, Eskisehir, 26555, Türkiye
* Corresponding Author: Cahit Perkgoz. Email:
Computers, Materials & Continua 2026, 86(1), 1-21. https://doi.org/10.32604/cmc.2025.070381
Received 15 July 2025; Accepted 23 September 2025; Issue published 10 November 2025
Abstract
Missing data poses a major challenge in data analysis, especially in high-dimensional datasets, where it often leads to biased conclusions and degraded model performance. In this study, we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision. The proposed loss combines (i) a guided, masked mean squared error that focuses on missing entries; (ii) a noise-aware regularization term that improves resilience against data corruption; and (iii) a variance penalty that encourages expressive yet stable reconstructions. We evaluate the proposed model under four missingness mechanisms (Missing Completely at Random, Missing at Random, Missing Not at Random, and Missing Not at Random with quantile censorship) with systematically varied feature counts, sample sizes, and missingness ratios ranging from 5% to 60%. Four publicly available real-world datasets (Stroke Prediction, Pima Indians Diabetes, Cardiovascular Disease, and Framingham Heart Study) were used, and the results show that the proposed model consistently outperforms baseline methods, including traditional and deep learning-based techniques. An ablation study reveals the additive value of each component of the loss function. Additionally, we assessed the downstream utility of the imputed data through classification tasks, in which datasets imputed by the proposed method yielded the highest area under the receiver operating characteristic curve across all scenarios. The model demonstrates strong scalability and robustness, with performance improving as dataset size and feature count increase. These results underscore the capacity of the proposed method to produce imputations that are not only numerically accurate but also semantically useful, making it a promising solution for robust data recovery in clinical applications.
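The three loss components named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function name `composite_loss`, the weights `alpha`, `beta`, and `gamma`, the use of a second reconstruction from a noise-corrupted input for the noise-aware term, and the per-feature variance matching used for the variance penalty. It should not be read as the authors' implementation.

```python
import numpy as np


def composite_loss(x_true, x_hat, x_hat_noisy, miss_mask,
                   alpha=1.0, beta=0.1, gamma=0.01):
    """Illustrative composite loss (hypothetical forms and weights).

    x_true      : (n, d) ground-truth values, known during training
    x_hat       : (n, d) reconstruction from the masked input
    x_hat_noisy : (n, d) reconstruction from a noise-corrupted masked input
    miss_mask   : (n, d) array, 1 where an entry was masked out, else 0
    """
    miss = np.asarray(miss_mask, dtype=float)

    # (i) Masked MSE: squared error averaged over masked entries only.
    n_miss = max(miss.sum(), 1.0)  # avoid division by zero
    masked_mse = float((miss * (x_hat - x_true) ** 2).sum() / n_miss)

    # (ii) Noise-aware term (assumed form): reconstructions from clean
    # and noise-corrupted inputs should agree.
    noise_term = float(np.mean((x_hat - x_hat_noisy) ** 2))

    # (iii) Variance penalty (assumed form): per-feature variance of the
    # reconstruction should stay close to that of the target.
    var_term = float(np.mean((x_hat.var(axis=0) - x_true.var(axis=0)) ** 2))

    total = alpha * masked_mse + beta * noise_term + gamma * var_term
    parts = {"masked_mse": masked_mse, "noise": noise_term, "variance": var_term}
    return total, parts


# Tiny usage example: one masked entry, reconstructed as 6 instead of 4.
x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.0, 2.0], [3.0, 6.0]])
mask = np.array([[0, 0], [0, 1]])
total, parts = composite_loss(x_true, x_hat, x_hat, mask)
```

In this sketch the weights control how much each term contributes; dropping a term by setting its weight to zero mirrors the kind of ablation the abstract mentions.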
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

