Open Access

ARTICLE

A Study on Re-Identification of Natural Language Data Considering Korean Attributes

Segyeong Bang#, Soeun Kim#, Gaeun Ahn, Hyemin Hong, Junhyoung Oh*

Department of Information Security, College of Future Industry Convergence, Seoul Women’s University, Seoul, 01797, Republic of Korea

* Corresponding Author: Junhyoung Oh
# These authors contributed equally to this work

Computers, Materials & Continua 2025, 85(3), 4629-4643. https://doi.org/10.32604/cmc.2025.068221

Abstract

This study analyzes the risk of re-identification in Korean text data and proposes a secure, ethical approach to data anonymization. Concerns over data privacy have grown since the ‘Lee Luda’ AI chatbot incident. The Personal Information Protection Commission of Korea inspected AI services and uncovered 850 cases of personal information in user input datasets, highlighting the need for pseudonymization standards. Current anonymization techniques remove explicit personal data such as names, phone numbers, and addresses, but linguistic features such as writing habits and language-specific traits can still identify individuals when combined with other data. To address this, we analyzed 50,000 Korean text samples from the X platform, focusing on language-specific features for authorship attribution. Unlike English, Korean has flexible syntax, honorifics, syllabic and grapheme patterns, and referential terms. We used these linguistic characteristics to improve re-identification accuracy. Our experiments combined five machine learning models, six stopword processing methods, and four morphological analyzers. Using a tokenizer that captures word frequency and order together with the LSTM model, the OKT morphological analyzer, and stopword removal, we achieved a maximum authorship attribution accuracy of 90.51%. This result demonstrates the significant role of Korean linguistic features in re-identification. The findings underscore the risk of re-identification through language data and call for a re-evaluation of anonymization methods that considers linguistic traits rather than simply removing personal information.
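As an illustration of what a tokenizer that captures word frequency and order might look like, the following is a minimal sketch and not the authors' implementation. It assumes the text has already been split into morphemes (for example by a morphological analyzer such as OKT, which is not run here), and the function names, toy sentences, stopword set, and padding length are all hypothetical. Tokens are ranked by corpus frequency, and each document becomes an integer sequence that preserves token order, which is the kind of input a sequence model such as an LSTM typically consumes.

```python
from collections import Counter

def fit_vocab(token_lists, stopwords=frozenset()):
    """Map each non-stopword token to an integer rank (1 = most frequent)."""
    counts = Counter(
        tok for toks in token_lists for tok in toks if tok not in stopwords
    )
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: idx for idx, (tok, _) in enumerate(ranked, start=1)}

def encode(tokens, vocab, stopwords=frozenset(), max_len=6):
    """Convert tokens to rank ids in original order, then pad with 0."""
    ids = [vocab[t] for t in tokens if t not in stopwords and t in vocab]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))

# Hypothetical, pre-tokenized toy documents (assumed analyzer output).
docs = [
    ["오늘", "날씨", "가", "좋다"],
    ["오늘", "은", "영화", "가", "좋다"],
]
stopwords = frozenset({"가", "은"})  # illustrative stopword set only

vocab = fit_vocab(docs, stopwords)
encoded = [encode(doc, vocab, stopwords) for doc in docs]
print(vocab)
print(encoded)
```

In this sketch, frequency is reflected in the integer assigned to each token and order is reflected in the position of each integer within the sequence; stopwords are dropped before both steps.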

Keywords

Re-identification; data anonymization; authorship attribution; Korean text

Cite This Article

APA Style
Bang, S., Kim, S., Ahn, G., Hong, H., Oh, J. (2025). A Study on Re-Identification of Natural Language Data Considering Korean Attributes. Computers, Materials & Continua, 85(3), 4629–4643. https://doi.org/10.32604/cmc.2025.068221
Vancouver Style
Bang S, Kim S, Ahn G, Hong H, Oh J. A Study on Re-Identification of Natural Language Data Considering Korean Attributes. Comput Mater Contin. 2025;85(3):4629–4643. https://doi.org/10.32604/cmc.2025.068221
IEEE Style
S. Bang, S. Kim, G. Ahn, H. Hong, and J. Oh, “A Study on Re-Identification of Natural Language Data Considering Korean Attributes,” Comput. Mater. Contin., vol. 85, no. 3, pp. 4629–4643, 2025. https://doi.org/10.32604/cmc.2025.068221



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.