A Study on Re-Identification of Natural Language Data Considering Korean Attributes

Segyeong Bang; Soeun Kim; Gaeun Ahn; Hyemin Hong; Junhyoung Oh

doi:10.32604/cmc.2025.068221

Open Access

ARTICLE

Segyeong Bang^#, Soeun Kim^#, Gaeun Ahn, Hyemin Hong, Junhyoung Oh^*

Department of Information Security, College of Future Industry Convergence, Seoul Women’s University, Seoul, 01797, Republic of Korea

* Corresponding Author: Junhyoung Oh. Email: email
# These authors contributed equally to this work

Computers, Materials & Continua 2025, 85(3), 4629-4643. https://doi.org/10.32604/cmc.2025.068221

Received 23 May 2025; Accepted 26 August 2025; Issue published 23 October 2025

Abstract

This study analyzes the risk of re-identification in Korean text data and proposes a secure, ethical approach to data anonymization. Concerns over data privacy have grown since the ‘Lee Luda’ AI chatbot incident. When the Personal Information Protection Commission of Korea inspected AI services, it uncovered 850 cases of personal information in user input datasets, which highlights the need for pseudonymization standards. Current anonymization techniques remove personal data such as names, phone numbers, and addresses, but linguistic features such as writing habits and language-specific traits can still identify individuals when combined with other data. To address this, we analyzed 50,000 Korean text samples from the X platform, focusing on language-specific features for authorship attribution. Unlike English, Korean has flexible syntax, honorifics, syllabic and grapheme patterns, and referential terms, and we used these characteristics to improve re-identification accuracy. Our experiments combined five machine learning models, six stopword processing methods, and four morphological analyzers. Using a tokenizer that captures word frequency and order together with the LSTM model, the OKT morphological analyzer, and stopword removal, we achieved a maximum authorship attribution accuracy of 90.51%. This result demonstrates the significant role of Korean linguistic features in re-identification. The findings underscore the risk of re-identification through language data and call for a re-evaluation of anonymization methods that considers linguistic traits rather than only removing personal information.
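
As a rough illustration of what "a tokenizer that captures word frequency and order" can mean, the sketch below maps each token to an integer ranked by corpus frequency and keeps the tokens in their original order before padding, which is the kind of input a sequence model such as an LSTM consumes. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fit_vocab` and `encode`, the sample posts, the stopword set, and `max_len` are all hypothetical, and the posts are pre-tokenized by hand where a real pipeline would call a morphological analyzer.

```python
from collections import Counter


def fit_vocab(token_lists):
    """Map each token to an integer rank: 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, then by the token itself so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ranked, start=1)}


def encode(tokens, vocab, stopwords=frozenset(), max_len=8):
    """Drop stopwords, convert tokens to ids in their original order, then right-pad with 0."""
    ids = [vocab[t] for t in tokens if t not in stopwords and t in vocab]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical, hand-tokenized posts (assumption: a morphological analyzer
# would normally produce these token lists).
posts = [
    ["오늘", "날씨", "진짜", "좋다"],
    ["진짜", "오늘", "너무", "피곤하다"],
    ["날씨", "진짜", "춥다"],
]

vocab = fit_vocab(posts)
# Hypothetical stopword set, used only to show the removal step.
encoded = [encode(p, vocab, stopwords={"너무"}, max_len=5) for p in posts]

for row in encoded:
    print(row)
```

The frequency rank preserves how often an author uses a word, while the position in each padded sequence preserves word order; both are signals a sequence classifier could use for authorship attribution.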

Keywords

Re-identification; data anonymization; authorship attribution; Korean text

Cite This Article

APA Style

Bang, S., Kim, S., Ahn, G., Hong, H., & Oh, J. (2025). A Study on Re-Identification of Natural Language Data Considering Korean Attributes. Computers, Materials & Continua, 85(3), 4629–4643. https://doi.org/10.32604/cmc.2025.068221

Vancouver Style

Bang S, Kim S, Ahn G, Hong H, Oh J. A Study on Re-Identification of Natural Language Data Considering Korean Attributes. Comput Mater Contin. 2025;85(3):4629–4643. https://doi.org/10.32604/cmc.2025.068221

IEEE Style

S. Bang, S. Kim, G. Ahn, H. Hong, and J. Oh, “A Study on Re-Identification of Natural Language Data Considering Korean Attributes,” Comput. Mater. Contin., vol. 85, no. 3, pp. 4629–4643, 2025. https://doi.org/10.32604/cmc.2025.068221

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
