Open Access

ARTICLE

SciCN: A Scientific Dataset for Chinese Named Entity Recognition

Jing Yang, Bin Ji, Shasha Li*, Jun Ma, Jie Yu

College of Computer, National University of Defense Technology, Changsha, 410073, China

* Corresponding Author: Shasha Li.

Computers, Materials & Continua 2024, 78(3), 4303-4315. https://doi.org/10.32604/cmc.2023.035594

Abstract

Named entity recognition (NER) is a fundamental task in information extraction (IE) and has attracted considerable research attention in recent years. Abundant annotated English NER datasets have significantly advanced NER research for English. By contrast, far less effort has gone into Chinese NER research, especially in the scientific domain, owing to the scarcity of Chinese NER datasets. To alleviate this problem, we present SciCN, a Chinese scientific NER dataset that contains entity annotations for the titles and abstracts of 3,500 scientific papers. We manually annotate a total of 62,059 entities, classified into six types. Compared with English scientific NER datasets, SciCN is larger and more diverse: it contains more paper abstracts, and these abstracts are drawn from more research fields. To investigate the properties of SciCN and provide baselines for future research, we adapt a number of previous state-of-the-art Chinese NER models and evaluate them on SciCN. Experimental results show that SciCN is more challenging than other Chinese NER datasets. In addition, previous studies have demonstrated the effectiveness of using lexicons to enhance Chinese NER models. Motivated by this, we provide a scientific domain-specific lexicon. Validation results demonstrate that our lexicon delivers larger performance gains than lexicons from other domains. We hope that the SciCN dataset and the lexicon will serve as a benchmark for NER in the Chinese scientific domain and support future research. The dataset and lexicon are available at: .




This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.