Open Access

ARTICLE

Effective Token Masking Augmentation Using Term-Document Frequency for Language Model-Based Legal Case Classification

Ye-Chan Park1, Mohd Asyraf Zulkifley2, Bong-Soo Sohn3, Jaesung Lee4,*

1 Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Republic of Korea
2 Department of Electrical, Electronic and Systems Engineering, Universiti Kebangsaan Malaysia, Bangi, 43600, Malaysia
3 School of Computer Science and Engineering, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
4 AI/ML Innovation Research Center, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea

* Corresponding Author: Jaesung Lee.

Computers, Materials & Continua 2026, 87(1), 36 https://doi.org/10.32604/cmc.2025.074141

Abstract

Legal case classification assigns legal documents to predefined categories, which facilitates legal information retrieval and case management. However, real-world legal datasets often suffer from class imbalance because case types are unevenly distributed across legal domains. This imbalance biases model performance: accuracy is high for overrepresented categories, whereas minority classes are classified poorly. To address this issue, we propose a data augmentation method that selectively masks unimportant terms within a document while preserving terms that are key from the perspective of the legal domain. This approach enhances data diversity and improves the generalization capability of conventional models. In our experiments, the proposed augmentation strategy consistently improved accuracy and F1 score across all evaluated models, validating its effectiveness for legal case classification.
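The masking idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state the scoring formula, so the code below assumes a plain inverse-document-frequency score as a stand-in for term importance; the function names `document_frequency` and `mask_unimportant`, the `mask_ratio` parameter, and the `[MASK]` placeholder are all hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def document_frequency(corpus):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df


def mask_unimportant(tokens, df, n_docs, mask_ratio=0.3, mask_token="[MASK]"):
    """Replace the lowest-scoring tokens with a mask token.

    ASSUMPTION: importance is approximated by a smoothed inverse document
    frequency, so terms that occur in many documents are masked first.
    This is an illustrative stand-in, not the scoring used in the article.
    """
    def score(tok):
        return math.log((1 + n_docs) / (1 + df.get(tok, 0)))

    n_mask = int(len(tokens) * mask_ratio)
    # Rank token positions from least to most important; ties keep text order.
    order = sorted(range(len(tokens)), key=lambda i: (score(tokens[i]), i))
    to_mask = set(order[:n_mask])
    return [mask_token if i in to_mask else t for i, t in enumerate(tokens)]
```

Under these assumptions, an augmented copy of a minority-class document would be produced by calling `mask_unimportant(tokens, document_frequency(corpus), len(corpus))` and adding the result to the training set with the original label. Ranking deterministically by score keeps the sketch reproducible; a sampling-based variant would yield more diverse copies.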

Keywords

Legal case classification; class imbalance; data augmentation; token masking; legal NLP

Cite This Article

APA Style
Park, Y., Zulkifley, M.A., Sohn, B., Lee, J. (2026). Effective Token Masking Augmentation Using Term-Document Frequency for Language Model-Based Legal Case Classification. Computers, Materials & Continua, 87(1), 36. https://doi.org/10.32604/cmc.2025.074141
Vancouver Style
Park Y, Zulkifley MA, Sohn B, Lee J. Effective Token Masking Augmentation Using Term-Document Frequency for Language Model-Based Legal Case Classification. Comput Mater Contin. 2026;87(1):36. https://doi.org/10.32604/cmc.2025.074141
IEEE Style
Y. Park, M. A. Zulkifley, B. Sohn, and J. Lee, “Effective Token Masking Augmentation Using Term-Document Frequency for Language Model-Based Legal Case Classification,” Comput. Mater. Contin., vol. 87, no. 1, pp. 36, 2026. https://doi.org/10.32604/cmc.2025.074141



Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.