Open Access

ARTICLE

OCR-Assisted Masked BERT for Homoglyph Restoration towards Multiple Phishing Text Downstream Tasks

Hanyong Lee#, Ye-Chan Park#, Jaesung Lee*

Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Republic of Korea

* Corresponding Author: Jaesung Lee. Email: email
# These authors contributed equally to this work

Computers, Materials & Continua 2025, 85(3), 4977-4993. https://doi.org/10.32604/cmc.2025.068156

Abstract

Restoring texts corrupted by visually perturbed homoglyph characters presents significant challenges to conventional Natural Language Processing (NLP) systems, primarily due to ambiguities arising from characters that appear visually similar yet differ semantically. Traditional text restoration methods struggle with these homoglyph perturbations due to limitations such as a lack of contextual understanding and difficulty in handling cases where one character maps to multiple candidates. To address these issues, we propose an Optical Character Recognition (OCR)-assisted masked Bidirectional Encoder Representations from Transformers (BERT) model specifically designed for homoglyph-perturbed text restoration. Our method integrates OCR preprocessing with a character-level BERT architecture, where OCR preprocessing transforms visually perturbed characters into their approximate alphabetic equivalents, significantly reducing multi-correspondence ambiguities. Subsequently, the character-level BERT leverages bidirectional contextual information to accurately resolve remaining ambiguities by predicting intended characters based on surrounding semantic cues. Extensive experiments conducted on realistic phishing email datasets demonstrate that the proposed method significantly outperforms existing restoration techniques, including OCR-based, dictionary-based, and traditional BERT-based approaches, achieving a word-level restoration accuracy of up to 99.59% in fine-tuned settings. Additionally, our approach exhibits robust performance in zero-shot scenarios and maintains effectiveness under low-resource conditions. Further evaluations across multiple downstream tasks, such as part-of-speech tagging, chunking, toxic comment classification, and homoglyph detection under conditions of severe visual perturbation (up to 40%), confirm the method’s generalizability and applicability. 
Our proposed hybrid approach, combining OCR preprocessing with character-level contextual modeling, represents a scalable and practical solution for mitigating visually adversarial text attacks, thereby enhancing the security and reliability of NLP systems in real-world applications.
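The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `CANDIDATES` table is a hypothetical stand-in for the OCR step, the choice to mask only characters with several candidates is an assumption, and the contextual model that would fill each `[MASK]` is omitted. The `word_accuracy` helper shows one plausible reading of word-level restoration accuracy (exact match per whitespace-separated word); the paper's exact definition may differ.

```python
MASK = "[MASK]"

# Hypothetical candidate table standing in for OCR output (an assumption,
# not the paper's mapping). Each perturbed character maps to one or more
# plausible Latin letters.
CANDIDATES = {
    "\u0430": ["a"],       # Cyrillic small a
    "\u0435": ["e"],       # Cyrillic small ie
    "\u043e": ["o"],       # Cyrillic small o
    "\u0131": ["i"],       # Latin small dotless i
    "\u04cf": ["l", "I"],  # Cyrillic small palochka: ambiguous
}


def normalize(text):
    """Stage 1: map each character to its Latin equivalent.

    Characters with a single candidate are substituted directly.
    Characters with several candidates become MASK, to be resolved by a
    character-level contextual model (not included in this sketch).
    """
    tokens = []
    for ch in text:
        options = CANDIDATES.get(ch)
        if options is None:
            tokens.append(ch)          # not a known homoglyph
        elif len(options) == 1:
            tokens.append(options[0])  # unambiguous substitution
        else:
            tokens.append(MASK)        # left for contextual prediction
    return tokens


def word_accuracy(restored, reference):
    """Fraction of whitespace-separated words that match exactly."""
    restored_words = restored.split()
    reference_words = reference.split()
    if not reference_words or len(restored_words) != len(reference_words):
        raise ValueError("texts must have the same non-zero word count")
    matches = sum(r == g for r, g in zip(restored_words, reference_words))
    return matches / len(reference_words)


tokens = normalize("p\u0430yp\u0430\u04cf")
score = word_accuracy("verify your paypai account",
                      "verify your paypal account")
```

Here `tokens` ends with a single `[MASK]` in place of the ambiguous final character, which is where bidirectional context would be needed to choose between the candidate letters.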

Keywords

Homoglyph attack; text restoration; token-level correction; character-level BERT; OCR-assisted NLP

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.