Open Access

ARTICLE

AI Model Compression Methods: A Distribution-Aware Residual Entropy Quantization

Nikita Sakovich1, Dmitry Aksenov1, Ekaterina Pleshakova1,*, Sergey Gataullin1,2
1 MIREA—Russian Technological University, Institute of Advanced Technologies and Industrial Programming, 78 Vernadsky Avenue, Moscow, Russia
2 Social Modeling Lab, Central Economics and Mathematics Institute, Russian Academy of Sciences, 47 Nakhimovsky Prospekt, Moscow, Russia
* Corresponding Author: Ekaterina Pleshakova. Email: email

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.079522

Received 22 January 2026; Accepted 02 April 2026; Published online 08 May 2026

Abstract

We introduce DARE-Q (Distribution-Aware Residual Entropy Quantization), a post-training quantization method for neural network weights designed to reduce bit-width with minimal degradation of model quality. Unlike traditional approaches that optimize only the mean squared error of the weight approximation, DARE-Q additionally considers the entropy of the quantization residual, allowing control over the statistical properties of the resulting error. The method is based on channel-wise symmetric uniform quantization, with scales selected by minimizing a combined loss function that includes L2 distortion and an entropy regularization term. DARE-Q is implemented as a compact DAREQuantLinear module that can be integrated into standard transformer pipelines without changing the inference logic or requiring custom kernels. The experimental analysis was conducted on the language models facebook/opt-125m and facebook/opt-350m, which contain approximately 125 million and 350 million parameters, respectively. Model quality was assessed with the standard perplexity (PPL) metric computed on the wikitext-2-raw-v1 dataset. DARE-Q is completely data-free and requires neither model retraining nor calibration data, which makes it the only viable option in privacy-sensitive or confidential environments where access to the original training data is restricted, precisely the setting in which methods such as GPTQ and AWQ cannot be applied. The observed increase in PPL relative to data-dependent baselines reflects this fundamental trade-off rather than a shortcoming of the approach. Through per-channel scale selection and the combined loss function, DARE-Q provides a flexible trade-off between approximation accuracy and the structure of the quantization error, making it an attractive algorithmic basis for further improvement of model compression methods.
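To make the scale-selection idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of channel-wise symmetric uniform quantization whose per-channel scale is chosen by minimizing L2 distortion plus an entropy term computed on the residual. The function names (quantize_channel, residual_entropy, dare_q_scale), the candidate-scale grid, the histogram-based entropy estimator, the penalty weight lam, and the choice to penalize (rather than reward) residual entropy are all illustrative assumptions; this is not the authors' DAREQuantLinear implementation.

import torch

def quantize_channel(w, scale, bits=4):
    # Symmetric uniform quantization of one weight channel at a given scale.
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale

def residual_entropy(residual, num_bins=64):
    # Shannon entropy (in nats) of a histogram of the quantization residual.
    hist = torch.histc(residual, bins=num_bins)
    p = hist / hist.sum().clamp_min(1.0)
    p = p[p > 0]
    return -(p * p.log()).sum()

def dare_q_scale(w, bits=4, lam=0.01, num_candidates=32):
    # Search a small grid of per-channel scales and keep the one that
    # minimizes L2 distortion plus an entropy penalty on the residual.
    qmax = 2 ** (bits - 1) - 1
    base = w.abs().max().clamp_min(1e-8) / qmax   # max-abs scale as starting point
    best_scale, best_loss = base, float("inf")
    for frac in torch.linspace(0.5, 1.0, num_candidates):
        scale = base * frac
        residual = w - quantize_channel(w, scale, bits)
        loss = (residual ** 2).sum() + lam * residual_entropy(residual)
        if loss.item() < best_loss:
            best_loss, best_scale = loss.item(), scale
    return best_scale

# Usage: quantize each output channel of a linear layer's weight independently.
weight = torch.randn(256, 512)                    # [out_features, in_features]
w_q = torch.stack([quantize_channel(row, dare_q_scale(row), bits=4) for row in weight])
print("max abs reconstruction error:", (weight - w_q).abs().max().item())

Treating each output channel independently mirrors the channel-wise formulation in the abstract: each channel gets its own scale, so the combined objective can trade approximation accuracy against the residual's statistical structure locally, without calibration data or retraining.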

Keywords

Artificial intelligence; large language models; mathematical optimization methods; model compression; quantization methods; information theory; high-performance computing