Open Access

ARTICLE

Beyond Accuracy: Evaluating and Explaining the Capability Boundaries of Large Language Models in Syntax-Preserving Code Translation

Yaxin Zhao1, Qi Han2, Hui Shu2, Yan Guang2,*

1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou, 450001, China
2 Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou, 450001, China

* Corresponding Author: Yan Guang.

(This article belongs to the Special Issue: AI-Powered Software Engineering)

Computers, Materials & Continua 2026, 86(2), 1-24. https://doi.org/10.32604/cmc.2025.070511

Abstract

Large Language Models (LLMs) are increasingly applied to code translation. However, existing evaluation methodologies suffer from two major limitations: (1) high overlap between test data and pretraining corpora, which introduces significant bias into performance evaluation; and (2) mainstream metrics that focus primarily on surface-level accuracy and fail to uncover the underlying factors constraining model capability. To address these issues, this paper presents TCode (Translation-Oriented Code Evaluation benchmark), a complexity-controllable, contamination-free benchmark dataset for code translation, together with a dedicated static-feature sensitivity evaluation framework. The dataset controls complexity along multiple dimensions, including syntactic nesting and expression intricacy, enabling both broad coverage and fine-grained differentiation of sample difficulty; this design supports precise evaluation of model capabilities across a wide spectrum of translation challenges. The proposed framework introduces a correlation-driven analysis mechanism based on static program features, enabling predictive modeling of translation success from two perspectives: Code Form Complexity (e.g., code length and character density) and Semantic Modeling Complexity (e.g., syntactic depth, control-flow nesting, and type-system complexity). Empirical evaluations across representative LLMs, including Qwen2.5-72B and Llama3.3-70B, show that even state-of-the-art models achieve over 80% compilation success on simple samples but drop sharply to below 40% accuracy on complex cases. Further correlation analysis indicates that Semantic Modeling Complexity alone accounts for up to 60% of the variance in translation success, and that static program features exhibit nonlinear threshold effects that mark clear capability boundaries. This study departs from the traditional accuracy-centric evaluation paradigm and, for the first time, systematically characterizes the capabilities of large language models in translation tasks through the lens of static program features. The findings provide actionable insights for model refinement and training-strategy development.
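To make the abstract's correlation-driven analysis concrete, the sketch below shows one minimal way such a pipeline could look. It is not the paper's released tooling: the feature set, the sample snippets, and the binary success labels are illustrative assumptions, and the real TCode features (e.g., type-system complexity) would require a richer extractor than Python's ast module provides.

```python
# Minimal sketch (hypothetical, not the authors' code) of correlating static
# program features with a binary translation-success outcome per sample.
import ast
from statistics import mean, pstdev

def static_features(source: str) -> dict[str, float]:
    """Compute a few illustrative static features of a Python snippet.

    Code Form Complexity: code length, character density.
    Semantic Modeling Complexity: syntactic depth, control-flow nesting.
    """
    tree = ast.parse(source)

    def depth(node: ast.AST) -> int:
        # Maximum depth of the abstract syntax tree.
        return 1 + max((depth(c) for c in ast.iter_child_nodes(node)), default=0)

    def cf_nesting(node: ast.AST, level: int = 0) -> int:
        # Deepest nesting of control-flow constructs (if/for/while/try).
        level += int(isinstance(node, (ast.If, ast.For, ast.While, ast.Try)))
        return max([level] + [cf_nesting(c, level) for c in ast.iter_child_nodes(node)])

    lines = source.splitlines() or [""]
    non_ws = len(source.replace(" ", "").replace("\n", ""))
    return {
        "code_length": float(len(source)),
        "char_density": non_ws / len(lines),
        "syntactic_depth": float(depth(tree)),
        "cf_nesting": float(cf_nesting(tree)),
    }

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson r; with a binary y this is the point-biserial correlation."""
    mx, my, sx, sy = mean(xs), mean(ys), pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

# Illustrative usage: correlate each feature with translation success (1/0).
samples = ["x = 1\n", "for i in range(3):\n    if i:\n        print(i)\n"]
success = [1.0, 0.0]  # hypothetical compile/pass outcomes per sample
feats = [static_features(s) for s in samples]
for name in feats[0]:
    r = pearson([f[name] for f in feats], success)
    print(f"{name}: r = {r:+.2f}")
```

On a real benchmark, each feature's correlation with the success label would be computed over hundreds of samples per complexity tier, which is what allows threshold effects (sharp drops past a feature value) to be detected.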

Keywords

Large language models (LLMs); code translation; compiler testing; program analysis; complexity-based evaluation

Cite This Article

APA Style
Zhao, Y., Han, Q., Shu, H., & Guang, Y. (2026). Beyond Accuracy: Evaluating and Explaining the Capability Boundaries of Large Language Models in Syntax-Preserving Code Translation. Computers, Materials & Continua, 86(2), 1–24. https://doi.org/10.32604/cmc.2025.070511
Vancouver Style
Zhao Y, Han Q, Shu H, Guang Y. Beyond Accuracy: Evaluating and Explaining the Capability Boundaries of Large Language Models in Syntax-Preserving Code Translation. Comput Mater Contin. 2026;86(2):1–24. https://doi.org/10.32604/cmc.2025.070511
IEEE Style
Y. Zhao, Q. Han, H. Shu, and Y. Guang, “Beyond Accuracy: Evaluating and Explaining the Capability Boundaries of Large Language Models in Syntax-Preserving Code Translation,” Comput. Mater. Contin., vol. 86, no. 2, pp. 1–24, 2026. https://doi.org/10.32604/cmc.2025.070511



Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.