TY  - EJOU
AU  - Feng, Xiaorong 
AU  - Gao, Ying 
AU  - Shi, Leyu 

TI  - Large Language Model-Based Representations of Heterogeneous Graphs for Vulnerability Detection
T2  - Computers, Materials & Continua

PY  - 
VL  - 
IS  - 
SN  - 1546-2226

AB  - Open source software has become a fundamental component of modern software ecosystems, supporting a wide range of critical applications in operating systems, cloud services, embedded systems, and security-sensitive infrastructures. However, the rapid growth of open source projects also brings increasingly serious security challenges. Many widely used C/C++ components still contain hidden vulnerabilities, and attackers are no longer limited to exploiting traditional memory-related bugs such as buffer overflows or use-after-free errors. In recent years, non-memory logic flaws, including improper authentication, incorrect state transitions, flawed boundary checks, and insecure API usage, have become more prevalent and more difficult to detect using conventional static analysis or pattern-matching methods. To address these limitations, this study proposes a novel vulnerability detection framework that combines the semantic understanding capability of large language models (LLMs) with the structural representation ability of heterogeneous graph learning. Specifically, we construct a Heterogeneous Vulnerability Graph (HeVG) to explicitly model multiple types of code structures in C/C++ programs, including syntax, control flow, data dependency, and function-call relationships. By representing source code as a heterogeneous graph, the proposed framework can capture both local code patterns and long-range dependencies that are essential for identifying complex vulnerabilities. In addition, a cross-modal alignment mechanism is introduced to effectively fuse code-text semantic features extracted by LLMs with graph-based structural representations. This enables the model to jointly understand what the code means and how different program elements interact. Experimental results show that the proposed approach achieves state-of-the-art performance, reaching 95.61% accuracy in single-file vulnerability detection and 90.97% accuracy in cross-file vulnerability detection. Further analysis demonstrates that the framework is particularly effective in detecting complex logic vulnerabilities and maintains strong generalization ability across different projects. These results indicate that integrating LLMs with heterogeneous graph learning provides a promising direction for more accurate and robust open source software vulnerability detection.
KW  - Large language model
KW  - graph neural network
KW  - open source software security
KW  - vulnerability detection
KW  - heterogeneous vulnerability graph

DO  - 10.32604/cmc.2026.082481