Open Access
ARTICLE
Quantized Transformers in Practice: Benchmarking Full- and Low-Precision LLMs across Two Processors
Economic Informatics and Cybernetics Department, Bucharest University of Economic Studies, Calea Dorobanţi 15-17, district 1, Bucharest, Romania
* Corresponding Author: Adela Bâra. Email:
Computers, Materials & Continua 2026, 87(3), 91 https://doi.org/10.32604/cmc.2026.078985
Received 12 January 2026; Accepted 20 February 2026; Issue published 09 April 2026
Abstract
Quantization has emerged as an important technique for enabling efficient deployment of large language models (LLMs) by reducing their memory and computational requirements. This research evaluates INT8 quantization on several state-of-the-art LLMs (GPT-2, LLaMA-2-7B-Chat, and Qwen1.5-1.8B-Chat) across two hardware configurations, an NVIDIA RTX 4070 Laptop GPU and an RTX 4080 Laptop GPU, and two tasks: text generation and code generation. By comparing quantized INT8 models with their FP16 counterparts and a human-written reference, we quantify the trade-offs between performance and efficiency using standard natural language generation metrics (BLEU, ROUGE-1, ROUGE-L) and semantic analysis via GPT-4o and Gemini 2.5 Flash (Google). The results reveal that INT8 post-training quantization (PTQ), hereafter referred to as INT8, substantially reduces inference time and memory footprint, with minimal impact on topical relevance but a notable decline in lexical precision, fluency, and structural coherence. The extent of quality degradation varies by model size and architecture, with smaller models demonstrating greater resilience to quantization. Furthermore, we identify several limitations in quantized outputs, including reduced expressiveness, while highlighting their suitability for resource-constrained or real-time applications, such as robots monitoring safety standards in manufacturing environments. On average, INT8 quantization yields a 3.4× speedup over FP16 inference across all tested models and GPUs (excluding configurations affected by CPU offloading), with the largest gains observed in large models such as LLaMA-2-7B-Chat. The results also indicate that structured code generation is slightly more sensitive to INT8 quantization than explanatory text generation.
Keywords
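To illustrate the mechanism behind the trade-off the abstract describes, the sketch below shows symmetric per-tensor INT8 post-training quantization of a weight matrix in NumPy. This is a minimal, illustrative example only; the paper's actual quantization pipeline and calibration scheme are not specified here, and the function names are our own.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor PTQ: map floats to [-127, 127] int8 codes."""
    scale = float(np.abs(w).max()) / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor for comparison against FP16/FP32."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)  # stand-in for an LLM weight block

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# INT8 storage is 1 byte/weight vs. 2 for FP16: a 2x memory reduction,
# at the cost of rounding error bounded by half a quantization step.
max_err = float(np.abs(w - w_hat).max())
assert q.dtype == np.int8
assert max_err <= scale / 2 + 1e-6
```

The per-tensor scale is the simplest choice; production INT8 schemes typically use per-channel scales and outlier handling, which is one reason quality degradation varies across model sizes and architectures as reported above.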
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.