OCR-Assisted Masked BERT for Homoglyph Restoration towards Multiple Phishing Text Downstream Tasks

Hanyong Lee; Ye-Chan Park; Jaesung Lee

doi:10.32604/cmc.2025.068156

ARTICLE

Hanyong Lee^#, Ye-Chan Park^#, Jaesung Lee^*

Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Republic of Korea

* Corresponding Author: Jaesung Lee. Email: email
# These authors contributed equally to this work

Computers, Materials & Continua 2025, 85(3), 4977-4993. https://doi.org/10.32604/cmc.2025.068156

Received 22 May 2025; Accepted 07 August 2025; Issue published 23 October 2025

Abstract

Restoring texts corrupted by visually perturbed homoglyph characters presents significant challenges to conventional Natural Language Processing (NLP) systems, primarily due to ambiguities arising from characters that appear visually similar yet differ semantically. Traditional text restoration methods struggle with these homoglyph perturbations due to limitations such as a lack of contextual understanding and difficulty in handling cases where one character maps to multiple candidates. To address these issues, we propose an Optical Character Recognition (OCR)-assisted masked Bidirectional Encoder Representations from Transformers (BERT) model specifically designed for homoglyph-perturbed text restoration. Our method integrates OCR preprocessing with a character-level BERT architecture, where OCR preprocessing transforms visually perturbed characters into their approximate alphabetic equivalents, significantly reducing multi-correspondence ambiguities. Subsequently, the character-level BERT leverages bidirectional contextual information to accurately resolve remaining ambiguities by predicting intended characters based on surrounding semantic cues. Extensive experiments conducted on realistic phishing email datasets demonstrate that the proposed method significantly outperforms existing restoration techniques, including OCR-based, dictionary-based, and traditional BERT-based approaches, achieving a word-level restoration accuracy of up to 99.59% in fine-tuned settings. Additionally, our approach exhibits robust performance in zero-shot scenarios and maintains effectiveness under low-resource conditions. Further evaluations across multiple downstream tasks, such as part-of-speech tagging, chunking, toxic comment classification, and homoglyph detection under conditions of severe visual perturbation (up to 40%), confirm the method’s generalizability and applicability. 
Our proposed hybrid approach, combining OCR preprocessing with character-level contextual modeling, represents a scalable and practical solution for mitigating visually adversarial text attacks, thereby enhancing the security and reliability of NLP systems in real-world applications.
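
The two-stage idea described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the lookup table `HOMOGLYPHS` stands in for the OCR step, and a word-frequency table `VOCAB` stands in for the character-level BERT that picks among the remaining candidates. Both tables, all function names, and the sample text are hypothetical and chosen only for illustration. The sketch also includes a simple word-level accuracy measure of the kind the abstract reports.

```python
from itertools import product

# Hypothetical stand-in for the OCR step: each perturbed character maps to
# the Latin letters it could plausibly represent. A single letter means the
# mapping is unambiguous; several letters mean one-to-many correspondence.
HOMOGLYPHS = {
    "\u0430": "a",   # Cyrillic small a
    "\u0435": "e",   # Cyrillic small ie
    "\u043e": "o",   # Cyrillic small o
    "\u0440": "p",   # Cyrillic small er
    "1": "li",       # digit one may stand for "l" or "i"
    "|": "li",       # vertical bar may stand for "l" or "i"
}

# Hypothetical stand-in for the contextual model: a tiny word-frequency
# table. A real system would score candidates with a character-level
# masked language model instead.
VOCAB = {"verify": 5, "your": 9, "paypal": 4, "account": 7, "click": 6}


def restore_word(word):
    """Resolve one word: normalize characters, then pick the best candidate."""
    chars = list(word.lower())
    options = [HOMOGLYPHS.get(ch, ch) for ch in chars]

    best, best_score = None, 0
    for combo in product(*options):
        guess = "".join(combo)
        score = VOCAB.get(guess, 0)
        if score > best_score:
            best, best_score = guess, score
    if best is not None:
        return best

    # No known word matched: apply only the unambiguous mappings and leave
    # ambiguous characters (such as a genuine digit "1") untouched.
    return "".join(opt if len(opt) == 1 else ch
                   for ch, opt in zip(chars, options))


def restore(text):
    """Restore a whitespace-tokenized text word by word."""
    return " ".join(restore_word(w) for w in text.split())


def word_accuracy(restored, reference):
    """Fraction of reference words reproduced exactly at the same position."""
    got, want = restored.split(), reference.split()
    if not want:
        return 1.0
    return sum(a == b for a, b in zip(got, want)) / len(want)
```

For example, calling `restore("ver1fy y\u043eur p\u0430yp\u0430| \u0430cc\u043eunt")` resolves the ambiguous `1` to `i` in the first word and the ambiguous `|` to `l` in the third, because only those choices yield words in the toy vocabulary. A token such as `2021` is left unchanged, since no candidate matches and the fallback keeps ambiguous characters as they are.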

Keywords

Homoglyph attack; text restoration; token-level correction; character-level BERT; OCR-assisted NLP

Cite This Article

APA Style

Lee, H., Park, Y., Lee, J. (2025). OCR-Assisted Masked BERT for Homoglyph Restoration towards Multiple Phishing Text Downstream Tasks. Computers, Materials & Continua, 85(3), 4977–4993. https://doi.org/10.32604/cmc.2025.068156

Vancouver Style

Lee H, Park Y, Lee J. OCR-Assisted Masked BERT for Homoglyph Restoration towards Multiple Phishing Text Downstream Tasks. Comput Mater Contin. 2025;85(3):4977–4993. https://doi.org/10.32604/cmc.2025.068156

IEEE Style

H. Lee, Y. Park, and J. Lee, “OCR-Assisted Masked BERT for Homoglyph Restoration towards Multiple Phishing Text Downstream Tasks,” Comput. Mater. Contin., vol. 85, no. 3, pp. 4977–4993, 2025. https://doi.org/10.32604/cmc.2025.068156

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
