Open Access
ARTICLE
Low-Complexity Hardware Architecture for Batch Normalization of CNN Training Accelerator
Department of Electronics, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju, 28664, Republic of Korea
* Corresponding Author: Hyung-Won Kim. Email:
Computers, Materials & Continua 2025, 84(2), 3241-3257. https://doi.org/10.32604/cmc.2025.063723
Received 22 January 2025; Accepted 28 May 2025; Issue published 03 July 2025
Abstract
On-device Artificial Intelligence (AI) accelerators capable of not only inference but also training neural network models are in increasing demand in industrial AI, where frequent retraining is required to keep pace with production changes. Batch normalization (BN) is fundamental to training convolutional neural networks (CNNs), but its implementation in compact accelerator chips remains challenging due to computational complexity, particularly in calculating statistical parameters and gradients across mini-batches. Existing accelerator architectures either compromise the training accuracy of CNNs through approximations or require substantial computational resources, limiting their practical deployment. We present a hardware-optimized BN accelerator that maintains training accuracy while significantly reducing computational overhead through three novel techniques: (1) resource sharing across the forward and backward passes for efficient hardware utilization, (2) interleaved buffering to reduce dynamic random-access memory (DRAM) access latency, and (3) zero-skipping to minimize gradient computation. Implemented on a VCU118 Field Programmable Gate Array (FPGA) at 100 MHz and validated with You Only Look Once version 2-tiny (YOLOv2-tiny) on the PASCAL Visual Object Classes (VOC) dataset, our normalization accelerator achieves a 72% reduction in processing time and 83% lower power consumption compared to a software normalization implementation on a 2.4 GHz Intel Central Processing Unit (CPU), while maintaining accuracy (0.51% mean Average Precision (mAP) drop at 32-bit floating point (FP32), 1.35% at 16-bit brain floating point (bfloat16)). When integrated into a neural processing unit (NPU), the design demonstrates 63% and 97% performance improvements over AMD CPU and Reduced Instruction Set Computing-V (RISC-V) implementations, respectively. These results show that standard batch normalization can be implemented efficiently in hardware without sacrificing accuracy, enabling practical, power-saving on-device CNN training with significantly reduced computational requirements.
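For context on the mini-batch statistics and gradients mentioned above, the following is a brief sketch of the standard BN forward- and backward-pass computations in the textbook formulation of Ioffe and Szegedy; it is background material, not equations reproduced from this article. Here $x_i$ is the $i$-th activation in a mini-batch of size $m$, $\gamma$ and $\beta$ are the learnable scale and shift, $\epsilon$ is a small constant, and $L$ is the training loss:

\begin{align*}
% forward pass: per-channel mini-batch statistics and normalization
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i,
& \sigma_B^2 &= \frac{1}{m}\sum_{i=1}^{m}\left(x_i-\mu_B\right)^2, \\
\hat{x}_i &= \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}},
& y_i &= \gamma\,\hat{x}_i+\beta, \\
% backward pass: gradient reductions over the mini-batch
\frac{\partial L}{\partial \gamma} &= \sum_{i=1}^{m}\frac{\partial L}{\partial y_i}\,\hat{x}_i,
& \frac{\partial L}{\partial \beta} &= \sum_{i=1}^{m}\frac{\partial L}{\partial y_i}, \\
\frac{\partial L}{\partial x_i} &= \frac{\gamma}{m\sqrt{\sigma_B^2+\epsilon}}
  \left(m\,\frac{\partial L}{\partial y_i}
  -\sum_{j=1}^{m}\frac{\partial L}{\partial y_j}
  -\hat{x}_i\sum_{j=1}^{m}\frac{\partial L}{\partial y_j}\,\hat{x}_j\right). &&
\end{align*}

Every backward-pass term is a reduction over the entire mini-batch, which is the kind of computation the abstract identifies as the main hardware cost of BN training.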
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.