Open Access
REVIEW
Binary Code Similarity Detection: Retrospective Review and Future Directions
School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, 100876, China
* Corresponding Author: Baojiang Cui. Email:
Computers, Materials & Continua 2025, 85(3), 4345-4374. https://doi.org/10.32604/cmc.2025.070195
Received 10 July 2025; Accepted 10 September 2025; Issue published 23 October 2025
Abstract
Binary Code Similarity Detection (BCSD) is vital for vulnerability discovery, malware detection, and software security, especially when source code is unavailable. Yet, it faces challenges from semantic loss, recompilation variations, and obfuscation. Recent advances in artificial intelligence—particularly natural language processing (NLP), graph representation learning (GRL), and large language models (LLMs)—have markedly improved accuracy, enabling better recognition of code variants and deeper semantic understanding. This paper presents a comprehensive review of 82 studies published between 1975 and 2025, systematically tracing the historical evolution of BCSD and analyzing the progressive incorporation of artificial intelligence (AI) techniques. Particular emphasis is placed on the role of LLMs, which have recently emerged as transformative tools in advancing semantic representation and enhancing detection performance. The review is organized around five central research questions: (1) the chronological development and milestones of BCSD; (2) the construction of AI-driven technical roadmaps that chart methodological transitions; (3) the design and implementation of general analytical workflows for binary code analysis; (4) the applicability, strengths, and limitations of LLMs in capturing semantic and structural features of binary code; and (5) the persistent challenges and promising directions for future investigation. By synthesizing insights across these dimensions, the study demonstrates how LLMs reshape the landscape of binary code analysis, offering unprecedented opportunities to improve accuracy, scalability, and adaptability in real-world scenarios. This review not only bridges a critical gap in the existing literature but also provides a forward-looking perspective, serving as a valuable reference for researchers and practitioners aiming to advance AI-powered BCSD methodologies and applications.
1 Introduction
Binary code similarity assessment involves the comparison of two or more binary code segments—ranging from basic blocks and functions to entire programs—to discern their similarities and differences. This process is crucial, particularly when the source code is inaccessible, a common situation for commercial-off-the-shelf (COTS) software, legacy systems, and malware [1]. In such instances, Binary Code Similarity Detection (BCSD) serves as a vital tool, enabling security professionals to identify vulnerabilities, detect malware patterns, and enhance the overall security posture. However, determining binary code similarity is inherently complex due to the loss of semantic information—such as function names, variable names, comments, and data structure definitions—during the compilation process. Moreover, even if the source code remains unchanged, recompilation might result in substantially different binary outputs because of variations in compilers, optimization levels, or target architectures. Additionally, obfuscation techniques deliberately obscure the code structure, further complicating the analysis. Despite these challenges, BCSD is indispensable in system security, particularly in the realms of vulnerability detection [2–6] and malware analysis [7–11]. It facilitates the identification of reused vulnerable functions across multiple binaries, aids in generating patches and remediation, and assists in classifying similarities between known and unknown malware samples, thereby enhancing malware clustering, behavior analysis, and family classification. These applications underscore the necessity for robust, semantically aware BCSD techniques to improve the accuracy and scalability of contemporary software security analysis.
The burgeoning advancements in artificial intelligence (AI), particularly in the fields of natural language processing (NLP) [12,13] and graph representation learning (GRL) [14,15], have significantly propelled the development of AI-powered BCSD technologies. A multitude of studies [16–20] have explored the application of widely used neural networks, including Convolutional Neural Networks (CNN) [6], Graph Neural Networks (GNN) [21], and Long Short-Term Memory (LSTM) networks [22], to address the challenges inherent in BCSD. In stark contrast to traditional methods such as symbolic execution, graph matching, and hashing [23–27], AI-driven approaches not only enhance detection accuracy but also significantly improve the recognition of complex code variants, malware patterns, and cross-architecture discrepancies. The incorporation of large language models (LLMs) has further advanced BCSD by facilitating a deeper understanding and representation of code [28,29]. LLMs adeptly capture subtle semantic similarities between code segments, thereby augmenting model accuracy and enhancing malware detection capabilities. Their pre-training and fine-tuning capabilities facilitate more effective transfer learning, boosting adaptability and generalization across diverse compilation environments and architectures [30–34]. These technological strides position AI-based BCSD as a more efficient and reliable solution, especially in contexts where source code remains inaccessible, ultimately enhancing the accuracy and scalability of modern software security analysis.
Research into BCSD and LLMs has progressed substantially, as evidenced by numerous scholarly reviews [1,35]. BCSD is pivotal in areas such as software security, malware analysis, and vulnerability detection, where it identifies similarities across binary code segments. LLMs have achieved remarkable success across various NLP tasks, including text generation, translation, and sentiment analysis. The recent application of LLMs in code analysis is garnering increased attention due to their proficiency in modeling complex semantic structures and relationships. Specifically, LLMs have been extensively applied to program synthesis, code generation, and bug detection, demonstrating significant potential to automate and enhance diverse software engineering tasks. Common applications encompass code testing [36–39], where LLMs automatically generate test cases to improve test coverage and error detection; vulnerability patching [40–42], where they produce patches to rectify known vulnerabilities, thereby minimizing manual intervention and enhancing efficiency; vulnerability detection [43–45], where LLMs analyze extensive code patterns to pinpoint potential security vulnerabilities, thus bolstering code security; and binary recovery [46–49], where LLMs aid in restoring lost source code, facilitating reverse engineering and malware analysis. Through these applications, LLMs not only improve the efficiency of software development but also enhance the security and reliability of software systems. Despite the growing intersection of BCSD and LLMs, to the best of our knowledge, no review has yet specifically focused on the integration and potential of LLMs within BCSD. As modern software grows increasingly complex, traditional BCSD methods face substantial challenges, including code obfuscation, polymorphism, and architectural differences. These challenges underscore a critical research gap. In contrast, LLMs, with their advanced semantic understanding, offer novel approaches to surmount these obstacles and substantially improve detection accuracy.
This paper conducts an exhaustive review of 82 seminal papers published between 1975 and 2025, focusing on BCSD. This review critically examines traditional BCSD methods, AI-enhanced BCSD approaches, code analysis using LLMs, and LLM-based BCSD techniques. An analysis of traditional methods provides foundational insights into the evolution of BCSD, while the evaluation of AI-driven approaches highlights significant recent advancements. The exploration of LLM capabilities in code analysis underscores their potential for enhancing code understanding and substantiates their utility in BCSD applications. Additionally, this review offers practical perspectives on the implementation of current LLM-based BCSD methods. Drawing on the historical and contemporary developments in these technologies, the review addresses five pivotal research questions:
• RQ1: What are the historical and developmental stages of BCSD technology?
• RQ2: How can the technical roadmap for AI-driven BCSD be systematically articulated?
• RQ3: What constitutes the general workflow of AI-driven BCSD, and how can it be effectively summarized?
• RQ4: What is the applicability of LLMs to BCSD tasks?
• RQ5: What challenges are encountered in LLM-based BCSD systems, and what potential future research directions could be pursued in this domain?
Paper Structure. The organization of this paper is as follows: Section 2 introduces the fundamental concepts of BCSD, reviewing its origins and technological progression, and addressing RQ1. Section 3 delves into the integration of AI within BCSD, offering an in-depth analysis of recent advancements in machine learning and deep learning that enhance detection accuracy. It addresses RQ2, which pertains to the development of an AI-driven BCSD framework, and RQ3, which discusses the general workflow of AI-based BCSD. Section 4 examines LLMs and their applications in BCSD, focusing on two representative LLM-based methods and addressing RQ4 concerning their applicability to BCSD tasks. Section 5 discusses the principal challenges faced by LLM-based BCSD systems across the preprocessing, representation, and embedding stages, including issues of data ambiguity, inaccurate feature extraction, limited control-flow modeling, and scalability. It also proposes future research directions aimed at improving data processing, incorporating domain-specific knowledge, enhancing model robustness and scalability, and optimizing few-shot and zero-shot learning for more effective vulnerability detection. Section 6 analyzes the limitations of this paper, while Section 7 concludes the paper by summarizing the key findings of the review.
2 Fundamentals and Evolution of Binary Code Similarity Detection
This section offers an in-depth examination of BCSD, starting with its fundamental principles, including the compilation process, syntactic variations, and primary challenges in detecting similarity. It then traces the development of BCSD techniques, from initial syntax-based differencing methods to advanced semantic similarity assessments, ultimately leading to the integration of AI, specifically through deep learning models.
2.1 Fundamentals of Binary Code Similarity Detection
This subsection delineates the core principles of BCSD. It covers the compilation process, the syntactic discrepancies in analogous binaries, the challenges inherent in detecting similarities, and the growing utilization of BCSD in fields such as vulnerability assessment, malware analysis, and patch generation. Recent studies particularly emphasize its role in detecting vulnerabilities.
Compilation Process: Binary code comprises machine-level instructions, produced through a standardized compilation process, that are directly executable by a CPU. As depicted in Fig. 1, the compilation sequence initiates with source code files that are processed by compilers such as GCC, Clang, or MSVC. This processing involves the selection of optimization levels (e.g., -O0, -O2, -O3) and the specification of target platform parameters, which may include the instruction set architecture (e.g., x86, ARM, MIPS), word length (32-bit or 64-bit), and operating system (e.g., Windows, Linux, macOS). Each source file is transformed into an object file containing machine code. These object files are subsequently submitted to a linker, which resolves external references and amalgamates them into the final binary output, which could be an executable or a library file. Additionally, developers can extend the compilation pipeline to produce polymorphic variants of the same source code by implementing modifications both before and after compilation. Source code transformations are applied pre-compilation, affecting both input and output in source code format, whereas binary code transformations are applied post-compilation, altering the produced binary files directly.

Figure 1: Illustration of the binary code compilation process. This figure details how source code is transformed into an executable via the processes of compilation and linking, and highlights the potential for generating polymorphic variants through source- or binary-level transformations
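To make the effect of these build parameters concrete, the illustrative Python sketch below compiles one source file at two optimization levels and contrasts the resulting disassembly. It assumes gcc and objdump are available on the system, and sample.c is a placeholder for any small C source file; the two binaries are semantically identical, yet their instruction streams typically diverge.

import subprocess
from pathlib import Path

SOURCE = Path("sample.c")  # hypothetical input: any small C source file

def build_variant(opt_level: str, out: str) -> None:
    """Compile the same source with a given optimization level."""
    subprocess.run(["gcc", opt_level, "-o", out, str(SOURCE)], check=True)

def disassemble(binary: str) -> str:
    """Return the textual disassembly produced by objdump."""
    result = subprocess.run(["objdump", "-d", binary],
                            capture_output=True, text=True, check=True)
    return result.stdout

build_variant("-O0", "sample_O0")
build_variant("-O3", "sample_O3")

# Same source, same semantics -- yet the instruction streams usually differ.
print(disassemble("sample_O0") == disassemble("sample_O3"))  # typically False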
Syntax Differences: Homologous binary code denotes binaries derived from the same source code. Despite originating from identical source input, these binaries can demonstrate substantial syntactic variations due to the compilation process. These variations may affect aspects such as instruction selection, register allocation, control-flow organization, and function inlining strategies, thereby creating binaries that are structurally different yet semantically equivalent.
These differences in syntax often originate from the nuances of the compilation process, which include variations in compiler settings and target platforms. Switching between compilers, such as transitioning from GCC to Clang, impacts code generation. Similarly, altering optimization levels, for instance, from -O0 to -O3, modifies instruction sequences and introduces efficiency-centric transformations. The activation of platform-specific instruction set extensions, such as SSE for x86 architectures or NEON for ARM, further customizes the code to suit specific hardware. These changes lead to a heterogeneity in binary structures, which is a common challenge in cross-platform development and when targeting diverse hardware architectures.
Furthermore, syntax discrepancies can be deliberately introduced through advanced compilation techniques. Pre-compilation source-level transformations, including variable renaming, insertion of dead code, and rewriting of control structures, pre-emptively alter the source code, resulting in distinctive binary outcomes. Post-compilation, binary-level transformations like instruction substitution, insertion of junk code, and control-flow flattening enhance the structural diversity of equivalent binaries. These strategies are prevalent in commercial software to safeguard intellectual property and in security applications to thwart static analysis tools, thereby generating polymorphic binary variants.
Impact of Syntax Differences: The detection of binary code similarity is significantly complicated by syntax differences in equivalent binaries, even when the originating source code is identical. These discrepancies, stemming from variations in compiler choices, optimization levels, instruction set extensions, and transformation techniques, impede traditional binary analysis methods from consistently identifying semantically equivalent code across different compilations.
Variations in instruction selection, register allocation, and control-flow configurations result in structurally unique binaries that execute identical functionalities. These structural variations challenge direct binary comparisons, as the representations can differ substantially while retaining identical underlying semantics. Moreover, extended compilation techniques, encompassing both source-level and binary-level transformations, further complicate this issue by deliberately obscuring the relationships between original and altered binaries, thereby complicating the identification of corresponding code segments.
Given the obstacles presented by syntactical differences in equivalent binaries, BCSD confronts difficulties in establishing precise correspondences between functionally identical code transformed into varied binary formats. Traditional binary comparison techniques frequently fail due to these introduced variations during the compilation process. Thus, the development of advanced methodologies capable of discerning semantic similarities despite syntactic disparities is imperative for enhancing the efficacy of BCSD. BCSD proves indispensable in scenarios where identifying semantically equivalent code is crucial, such as in software security for detecting vulnerabilities in recompiled or obfuscated binaries, in malware analysis for identifying reused malicious components across different variants, and in software maintenance and reverse engineering for generating patches and conducting binary diffs. These applications highlight the necessity for robust, semantics-aware BCSD techniques adept at managing syntactic variability.
Trends in BCSD Application Scenarios: An analysis of 33 representative BCSD studies conducted between 1999 and 2019 delineates three principal application scenarios, as encapsulated in Table 1. Detailed classifications are delineated below:
• Patch generation and analysis concentrates on detecting alterations between binary versions to identify patches, security enhancements, and facilitate the development of proof-of-concept exploits for earlier software iterations.
• Vulnerability detection involves scrutinizing potential vulnerabilities by juxtaposing known defective binary segments against extensive binary repositories, even across disparate platforms characterized by varying instruction set architectures.
• Malware analysis employs BCSD to discern and categorize malware samples, pinpoint polymorphic or obfuscated variants, and trace malware evolution through methods such as clustering and lineage analysis.
The trends in BCSD application scenarios, illustrated in Fig. 2, reveal a significant temporal shift. Both patch generation and analysis, and malware analysis were among the initial applications, with research in these domains beginning to expand steadily from the early 2000s. These applications have been pivotal, addressing critical issues such as the identification of software modifications and the classification of malware. Vulnerability detection emerged later, in the mid-2000s, but has since exhibited substantial growth: the volume of studies in this realm escalated markedly post-2015, making vulnerability detection the preeminent application scenario in BCSD at present. This growth reflects an escalating recognition of the importance of identifying vulnerabilities in binary code, particularly in the face of code reuse and platform disparities, and shows how vulnerability detection, once a peripheral area, has swiftly become the central focus of BCSD research in contemporary software security.

Figure 2: Trends in BCSD application scenarios over time. Over the years, BCSD application scenarios have evolved, with vulnerability detection becoming increasingly predominant
BCSD technology holds significant application value in identifying homologous vulnerabilities in IoT firmware. The specific process is as follows: First, the target firmware is unpacked to extract the binary files, providing raw data for subsequent analysis. Next, individual functions within the firmware are extracted and compared using BCSD tools to detect the homology between functions in the target firmware’s binary files and known vulnerable binary functions. This process efficiently identifies potentially dangerous functions in the firmware, especially since many IoT devices reuse code or rely on similar codebases. Through BCSD, potential sources of vulnerabilities can be quickly located, enabling the identification of security risks in firmware without access to the source code. Finally, the identified candidate functions can be confirmed through manual verification or automated validation tools. This BCSD-based vulnerability detection process significantly enhances the efficiency and accuracy of IoT firmware security testing, making it particularly suitable for complex, resource-constrained embedded devices.
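A skeletal Python sketch of this pipeline is given below. All three helper functions are hypothetical placeholders: in practice they would wrap a firmware unpacker (e.g., binwalk), a disassembler script, and a trained BCSD similarity model.

def unpack_firmware(image_path: str) -> list[bytes]:
    """Extract candidate binaries from a firmware image (stub)."""
    raise NotImplementedError

def extract_functions(binary: bytes) -> list[bytes]:
    """Lift a binary into per-function byte sequences (stub)."""
    raise NotImplementedError

def similarity(fn_a: bytes, fn_b: bytes) -> float:
    """BCSD similarity score in [0, 1] between two functions (stub)."""
    raise NotImplementedError

def find_vulnerable_functions(image_path: str,
                              vulnerable_fns: list[bytes],
                              threshold: float = 0.9) -> list[tuple[bytes, bytes]]:
    """Flag firmware functions that closely match known-vulnerable ones."""
    hits = []
    for binary in unpack_firmware(image_path):
        for fn in extract_functions(binary):
            for vuln in vulnerable_fns:
                if similarity(fn, vuln) >= threshold:
                    hits.append((fn, vuln))  # candidates for later validation
    return hits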
2.2 Evolution of Binary Code Similarity Detection
This subsection delineates the three-stage evolution of BCSD, encompassing its origin in syntactic code differencing (1975–1999), the subsequent development of semantic similarity detection (2000–2015), and the recent maturity phase driven by AI and deep learning models (2016–2024). These stages are systematically summarized and illustrated in Fig. 3, which provides a visual overview of the historical progression and methodological transitions in BCSD research.

Figure 3: The Three-Stage Evolution of BCSD. The evolution of BCSD is categorized into three stages: the origin stage (1975–1999), the development stage (2000–2015), and the maturity stage, propelled by AI (2016–2024)
Stage I (1975–1999): The origins of BCSD are rooted in source-level differencing techniques. In 1975, the introduction of SCCS [70] marked a significant milestone by formalizing the process of source code differencing. This development profoundly influenced subsequent approaches to binary analysis, emphasizing structural comparisons between different versions of programs. A major advance came in 1991, when Reichenberger introduced a pioneering binary differencing technique that operated directly on compiled binaries and computed byte-level differences, thereby accommodating various binary formats [71]. The field evolved further in 1999 with the advent of Baker’s EXEDIFF [50], which expanded the scope from mere byte-level comparison to an instruction-level analysis. This innovation marked the beginning of binary similarity detection, with EXEDIFF pioneering the evaluation of both structural and behavioral similarities across binary segments, as opposed to simply identifying exact differences. Initial efforts predominantly centered on syntactic similarities, progressively encompassing semantic similarities. These explorations delved into how attributes such as control flow, instruction patterns, and basic data flow could reveal functional equivalences, thereby establishing the groundwork for semantic-aware detection frameworks.
Stage II (2000–2015): The evolution of BCSD transitioned from syntactic comparisons to a more robust semantic-aware analysis, driven by an enhanced understanding of binary structures, control/data flow modeling, and symbolic reasoning. This period saw the formal recognition of binary similarity detection as an independent research discipline, underscoring the inadequacies of syntax-based approaches and the importance of capturing deeper behavioral equivalences. The shift toward representation commenced in 2004 with Thomas’s introduction of call graph isomorphism as a method for binary similarity detection, which facilitated the modeling of interprocedural structures [51]. In 2005, Kruegel introduced a novel graph-coloring method that annotated control flow graphs (CFGs) to reflect function semantics, thereby extending the analysis beyond mere control flow to include behavioral characteristics [72]. A significant breakthrough occurred in 2008 with Gao’s BinHunt [24], which was the first BCSD approach to compute semantic similarity using symbolic execution and constraint solving, thereby verifying behavioral equivalence. In 2013, Wei’s RENDEZVOUS [73] shifted the focus towards function-level similarity, thus enhancing the scalability and practical applications of BCSD. Subsequently, in 2014, David’s TRACY [57] introduced tracelets—short, meaningful execution paths extracted from CFGs—that facilitated targeted vulnerability detection. A key milestone was reached in 2015 with Pewny’s MULTI-MH [59], which supported cross-architecture binary function matching by abstracting functions based on input-output semantics. These advancements heralded the systematic integration of semantic reasoning into BCSD, transitioning from syntax-based matching to an emphasis on semantic-driven similarity detection. While many semantic models remained handcrafted during this era, they set the stage for the subsequent integration of AI-driven methods in BCSD.
Stage III (2016–2024): The maturation of BCSD is epitomized by the transition from expert-dependent, manually-crafted models to automated, learning-based semantic representations. This shift, propelled by breakthroughs in AI, notably deep learning, has evolved from reliance on static heuristics to the adoption of neural architectures capable of learning multi-level representations. Such developments enable the effective capture of code behavior across varied environments and obfuscation techniques. Noteworthy advancements include the introduction of Genius in 2016 [62], which utilized Attributed Control Flow Graphs (ACFGs) for delineating function semantics, and the deployment of Gemini in 2017 [3], which merged GNNs with a Siamese network to refine binary function matching. In 2019, Asm2Vec [65] applied NLP techniques to binary instruction sequences to address the out-of-vocabulary (OOV) problem, while SAFE embedded functions with a self-attentive recurrent network. In 2020, BinaryAI [74] enhanced function matching across architectures through the integration of multiple modalities, followed by jTrans in 2022 [75], which pioneered jump-aware representation learning for improved semantic precision. The year 2023 witnessed kTrans [76] advancing with knowledge-aware pretraining, and in 2024, He et al. [77] proposed a semantics-guided graph representation that enhanced semantic fidelity by modeling low-level conventions. These developments signify the full maturation of BCSD as an AI-driven discipline, incorporating graph-based reasoning, sequence modeling, cross-modal embedding, and semantic supervision to significantly enhance accuracy, robustness, and scalability. This stage also underscores the growing convergence of software engineering, machine learning, and program analysis.
Answering RQ1
The evolution of BCSD can be delineated into three distinct stages: inception, development, and maturity. The initial stage (1975–1999) was marked by the advent of BCSD in code differencing, primarily employing syntactic features for similarity detection focused on static structural comparisons. The ensuing stage (2000–2015) represented a shift from syntactic to semantic similarity detection, leveraging expert knowledge to extract semantic features from code, thus advancing the state of the art. The final stage (2016–2024) reflects the maturity of BCSD, where the integration of deep learning, graph-based reasoning, and sequence modeling enables a more precise capture of complex code behaviors, culminating in the transformation from expert-driven to AI-driven models.
3 Integrating AI into BCSD: A Roadmap and Workflow
This section delineates the AI-driven BCSD roadmap, describing its evolution from direct and indirect comparison methods to deep learning-based representation. It also outlines the BCSD workflow through preprocessing, representation, and embedding stages, emphasizing contemporary techniques such as graph-based and code-based embedding methods.
3.1 Roadmap of AI-Driven Binary Code Similarity Detection
This subsection presents the AI-driven roadmap of BCSD, tracing its evolution from direct and indirect comparison methods to hash- and embedding-based representations, and ultimately to advanced graph- and code-oriented embeddings, as illustrated in Fig. 4.

Figure 4: Roadmap of AI-Driven BCSD. The diagram delineates the AI-driven BCSD roadmap, progressing from direct/indirect comparison to hash/embedding-based representations, and finally to graph/code embeddings
Layer I: BCSD techniques are fundamentally categorized into two paradigms: direct and indirect comparison. These paradigms diverge markedly in terms of input format, modeling strategies, computational efficiency, and practical applicability.
Direct comparison methods assess the similarity between binary code segments by utilizing raw instruction sequences, handcrafted features, or learned representations. Models such as Bayesian networks [78], CNNs [6], GMNs [79], and FNNs [80] provide flexibility and interpretability. However, they necessitate exhaustive pairwise comparisons, leading to significant computational expenses and limited scalability. To alleviate this, optimization techniques such as tree-based indexes [11,81], LSH [54,82–84], Bloom filters [73], and distributed search frameworks are employed. Largely because of these costs, indirect comparison has emerged as the predominant approach in recent BCSD research, owing to its scalability and efficiency when handling large datasets.
Conversely, indirect comparison encodes code segments into low-dimensional embeddings, with similarity ascertained through metrics such as Euclidean distance or cosine similarity. This method compresses input features into condensed representations, facilitating efficient comparisons via distance metrics, and enabling scalable one-to-many comparisons. For instance, to compare a new function with an existing dataset, each function in the repository is initially mapped to its respective low-dimensional representation—a one-time operation. The same procedure is then applied to the new function, and the resultant embeddings are compared using efficient techniques such as approximate nearest neighbors.
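The following minimal NumPy sketch illustrates this one-to-many scheme with random stand-in embeddings: the repository is embedded once, after which each query reduces to a cheap cosine-similarity scan that an approximate-nearest-neighbor index would accelerate further.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: in practice each vector is produced once, by a trained
# model, from one binary function in the repository.
repository = rng.normal(size=(10_000, 128))  # one-time embedding of the corpus
query = rng.normal(size=128)                 # embedding of the new function

def cosine_scores(matrix: np.ndarray, vec: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec)
    return matrix @ vec / norms

scores = cosine_scores(repository, query)
top5 = np.argsort(scores)[-5:][::-1]  # best candidate matches; an ANN index
print(top5, scores[top5])             # would make this lookup sublinear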
Layer II: Within indirect comparison techniques for BCSD, two distinct technical approaches have emerged: hash-based and embedding-based methods. Both strategies aim to map binary code to compact, low-dimensional representations to facilitate efficient similarity computations. However, they differ fundamentally in their underlying assumptions, representation granularity, and capacity to encapsulate code semantics.
Hash-based methods produce fingerprints or digests of binary code using algorithms designed for approximate matching. Unlike conventional cryptographic hashes, fuzzy hashes [85–87] yield similar outputs for analogous inputs, enabling swift lookups. Nevertheless, these methods are susceptible to structural noise, such as instruction reordering or compiler transformations, which can diminish their efficacy in robust function-level similarity detection. To overcome this, function-level adaptations like FunctionSimSearch [88] normalize instruction sequences to foster more meaningful comparisons. Despite their operational efficiency, hash-based methods primarily capture shallow syntactic similarities and are generally inadequate for modeling complex semantic nuances.
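As a brief illustration, and assuming the ssdeep Python bindings are installed (the file paths are hypothetical), a context-triggered piecewise hash can be computed and compared as follows; the resulting score reflects byte-level resemblance rather than semantic equivalence.

import ssdeep  # fuzzy-hash bindings; assumed to be installed

# Hypothetical paths to two builds of the same library.
a = open("libfoo_v1.so", "rb").read()
b = open("libfoo_v2.so", "rb").read()

h_a, h_b = ssdeep.hash(a), ssdeep.hash(b)
print(h_a)                       # a context-triggered piecewise hash (CTPH)
print(ssdeep.compare(h_a, h_b))  # 0-100 match score; similar bytes score high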
Embeddings represent a prevalent form of low-dimensional representation extensively utilized in machine learning. These embeddings map semantically similar inputs to proximate points within a low-dimensional space, irrespective of their disparities in the original representation. The fundamental goal of machine learning models is to devise embeddings that amplify similarity for analogous functionalities and reduce it for non-analogous ones. The literature delineates two primary categories of embeddings: those concentrating on summarizing the code for each function and those that encapsulate the structure of their corresponding graphs. The escalating sophistication of methods based on embeddings has increasingly shifted the focus in BCSD towards exploiting machine learning for semantic alignment.
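A minimal sketch of this training objective, written in PyTorch with random stand-in embeddings, is the margin-based contrastive loss commonly used with Siamese encoders: homologous pairs are pulled together in cosine space while unrelated pairs are pushed apart.

import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     label: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """label = 1 for homologous function pairs, 0 for unrelated pairs."""
    sim = F.cosine_similarity(emb_a, emb_b)     # per-pair similarity in [-1, 1]
    pos = label * (1.0 - sim)                   # pull homologous pairs together
    neg = (1.0 - label) * F.relu(sim - margin)  # push unrelated pairs apart
    return (pos + neg).mean()

# Toy batch; real embeddings would come from a shared (Siamese) encoder.
emb_a = torch.randn(8, 128, requires_grad=True)
emb_b = torch.randn(8, 128, requires_grad=True)
label = torch.randint(0, 2, (8,)).float()
contrastive_loss(emb_a, emb_b, label).backward()  # gradients train the encoder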
Layer III: In the realm of embedding-based techniques for BCSD, two predominant approaches have surfaced: graph-based and code-based embeddings. Both strategies map binary functions into semantic vector spaces, thereby facilitating efficient, architecture-agnostic similarity detection. Nonetheless, they fundamentally diverge in their dependency on structural abstractions and modeling capabilities.
Graph-based approaches encapsulate the execution semantics of binary code, primarily employing CFGs [62] to represent function blocks and control transitions. With the augmenting demand for more nuanced semantic representations, ACFGs [3] were introduced, enabling the utilization of GNNs [3,79,89,90] to capture both topological and semantic patterns. This approach evolved into Semantic-Oriented Graphs (SOGs) [77], which integrate higher-order semantic data such as symbolic dependencies and function behaviors, thereby capturing implicit semantics that simpler methods overlook.
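The toy networkx sketch below conveys the idea of an ACFG: basic blocks become nodes annotated with simple statistical features (the attribute names here are illustrative), control transfers become edges, and a GNN would subsequently propagate the node features along the edges and pool them into a single function-level embedding.

import networkx as nx

acfg = nx.DiGraph()
acfg.add_node("bb0", n_calls=1, n_arith=3, n_transfer=2, n_strings=0)
acfg.add_node("bb1", n_calls=0, n_arith=5, n_transfer=1, n_strings=1)
acfg.add_node("bb2", n_calls=2, n_arith=0, n_transfer=1, n_strings=0)
acfg.add_edge("bb0", "bb1")  # branch taken
acfg.add_edge("bb0", "bb2")  # fall-through
acfg.add_edge("bb1", "bb2")

for block, features in acfg.nodes(data=True):
    print(block, features)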
Conversely, code-based methods, inspired by NLP, interpret binary code as sequences of tokens. Initial models like Asm2Vec [65] employed word2vec frameworks, whereas subsequent seq2seq [81] models facilitated cross-architecture translation, enhancing semantic alignment. Transformer-based models such as Order Matters [90] and Trex [91] have furthered semantic reasoning and transferability across different compilers and optimizations.
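As a simplified illustration of this token-based view (Asm2Vec itself uses a PV-DM paragraph-vector model, so plain word2vec is only an approximation), normalized instruction sequences can be fed to gensim's Word2Vec to obtain instruction embeddings:

from gensim.models import Word2Vec

# Each "sentence" is one function's normalized instruction sequence;
# registers and immediates are abstracted to tame the OOV problem.
functions = [
    ["push rbp", "mov rbp, rsp", "mov REG, IMM", "call FUNC", "ret"],
    ["push rbp", "mov rbp, rsp", "xor REG, REG", "call FUNC", "ret"],
]

model = Word2Vec(sentences=functions, vector_size=64, window=3,
                 min_count=1, epochs=50, seed=0)
print(model.wv.most_similar("call FUNC", topn=2))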
Both graph-based and code-based embeddings have demonstrated robust performance in BCSD tasks, including clone detection, malware variant identification, and patch analysis. Graph-based embeddings excel in structural fidelity and robustness but confront challenges in complexity and scalability. In contrast, code-based embeddings, benefiting from advances in deep learning and NLP, offer superior scalability, ease of deployment, and enhanced integration potential with other modalities such as comments and vulnerability descriptions. Consequently, code-based embeddings hold increasing promise for future BCSD research, aligning with trends in multi-modal analysis, zero-shot reasoning, and explainable security, thereby positioning them at the forefront of next-generation binary similarity detection.
Answering RQ2
AI-driven BCSD employs a variety of technological strategies across several layers. Initially, it distinguishes between direct comparison methods that depend on unprocessed instructions and entail significant computational expenses, and indirect comparison methods that employ low-dimensional embeddings to achieve scalable similarity detection. The latter are further divided into hash-based techniques, which focus on speed but offer limited semantic depth, and embedding-based approaches that utilize machine learning to discern more complex semantic relationships. Moreover, BCSD integrates both graph-based embeddings, which emphasize structural fidelity, and code-based embeddings, inspired by NLP, enhancing both scalability and adaptability. This development marks a transition from traditional symbolic methods to more advanced deep learning-driven representation, thereby improving both the accuracy and scalability of detection.
3.2 Workflow of AI-Driven Binary Code Similarity Detection
This subsection describes the BCSD workflow through its three key stages: (1) preprocessing, where input binaries are normalized to isolate stable, fundamental features; (2) representation, in which the normalized code is organized for neural network analysis; and (3) embedding, where neural networks transform these structured representations into low-dimensional vector embeddings for subsequent similarity assessment. Further, it differentiates contemporary BCSD techniques into graph-based and code-based embedding methods, as depicted in Fig. 5. The ensuing discussion elucidates how each method outlined in Table 2 progresses through these stages.

Figure 5: AI-driven BCSD technical workflow. The depicted workflow encompasses three stages: preprocessing, representation, and embedding. Here, preprocessed code is transformed into low-dimensional vectors, which are then utilized for similarity computation through both graph-based and code-based methods
Several prominent methods illustrate this workflow. Order Matters [90], for instance, introduces a semantic-aware neural network leveraging ACFGs and hierarchical pre-training, demonstrating strong performance in vulnerability detection and malware analysis. Asteria [92] applies TreeLSTM on ASTs for cross-platform similarity, excelling in IoT security analysis. BinDeep [22] adopts LSTM and Siamese networks for instruction-level embeddings, while PalmTree [93] employs large-scale self-supervised pre-training to capture instruction semantics. BinShot [94] advances one-shot similarity detection with a BERT-based framework, and XBA [95] aligns binary dependency graphs for cross-platform robustness. Similarly, jTrans [75] enhances Transformers with control-flow awareness, VulHawk [96] integrates entropy-based adaptation for cross-environment detection, and FASER [97] utilizes Long-document Transformers for scalable cross-platform search.
Building on these foundations, advanced frameworks have emerged: Asteria-Pro [98] integrates domain knowledge for efficiency and accuracy; sem2vec [99] extracts semantic traces robust to compiler variation; SOG [77] incorporates semantics-guided graph representations; and CLAP [28] aligns code with natural language descriptions for few-shot and zero-shot applications. Collectively, these methods exemplify the diversity of AI-driven BCSD techniques and highlight how the three-stage workflow underpins innovations across different architectures and application contexts.
Diverse AI-driven BCSD methodologies vary in their preprocessing, representation, and embedding phases, yet they share several overarching similarities. During the preprocessing phase, each methodology employs distinct tools—such as IDA Pro, radare2, Binary Ninja, and angr—for tasks including feature extraction, IR generation, and symbolic execution. For example, IDA Pro is extensively utilized in methodologies like Order Matters and Asteria for disassembling binaries and extracting instruction sequences, while radare2 is employed in FASER to generate IR for multi-platform analysis. Binary Ninja facilitates the generation and processing of instruction sequences in PalmTree, and angr is used in sem2vec to analyze control flow and data flow through symbolic execution.
In the representation phase, methodologies transform preprocessed data into semantically meaningful abstractions, such as ACFGs, ASTs, instruction sequences, and graph representations. These abstractions capture control flow, syntax, and semantic relationships within the code. For instance, Order Matters and VulHawk employ ACFGs to depict control flow dependencies between functions, while Asteria and Asteria-Pro utilize ASTs to capture syntactic structures, aiding in cross-platform vulnerability detection. XBA generates BDGs to model function dependencies, and methodologies such as BinDeep, PalmTree, and BinShot represent code as instruction sequences, enhancing their suitability for training deep learning models.
In the embedding phase, various deep learning techniques—such as GNNs, LSTMs, CNNs, and Transformers—are applied to convert representations into low-dimensional vectors for similarity computation. For instance, Order Matters and VulHawk utilize GNNs for graph embeddings to encapsulate topological relationships, BinDeep combines LSTM and CNN for processing instruction sequences and computing similarity, and PalmTree and BinShot employ Transformers or masked language modeling (MLM) to generate semantically rich instruction embeddings. Despite these methodological variances, all approaches strive to leverage deep learning models to capture richer semantic information from binary code, thereby enhancing similarity detection across platforms and architectures, and addressing challenges arising from compiler optimizations and hardware architecture differences.
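To make the preprocessing stage concrete, the short sketch below uses the Capstone disassembler (a lightweight, scriptable alternative to the tools named above) to lift raw x86-64 bytes into instructions and apply a coarse normalization that discards concrete operands:

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

CODE = b"\x55\x48\x89\xe5\x48\x83\xec\x10\xc3"  # push rbp; mov rbp, rsp; sub rsp, 0x10; ret

md = Cs(CS_ARCH_X86, CS_MODE_64)
normalized = []
for insn in md.disasm(CODE, 0x1000):
    # Coarse normalization: keep the mnemonic and drop concrete operands, so
    # register allocation and immediate values do not dominate the comparison.
    normalized.append(insn.mnemonic)
print(normalized)  # ['push', 'mov', 'sub', 'ret']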
Answering RQ3
AI-driven BCSD typically follows a structured workflow comprising three principal stages: preprocessing, representation, and embedding. In the preprocessing stage, a variety of tools including IDA Pro, radare2, Binary Ninja, and angr are utilized to extract features, generate intermediate representations, or conduct symbolic execution. Subsequently, during the representation stage, the preprocessed data is transformed into semantically meaningful structures such as ACFG, AST, or sequences of instructions. These structures encapsulate control flow, syntax, and semantic relationships. In the final stage, embedding, advanced deep learning techniques such as GNN, LSTM, CNN, and Transformers are employed to convert these detailed representations into low-dimensional vectors. These vectors facilitate the computation of similarity, thereby enhancing cross-platform and cross-architecture detection capabilities.
4 Leveraging Large Language Models for Enhanced BCSD: Roles and Recent Techniques
This section introduces mainstream LLMs, providing an overview of their key characteristics and recent advancements. It further explores their applications in code analysis, emphasizing how LLMs have been instrumental in enhancing tasks such as code completion, vulnerability detection, and binary code similarity analysis. The section concludes with a detailed presentation of two exemplary LLM-based BCSD methods, elucidating their methodologies, benefits, and contributions to the domain of binary code analysis.
4.1 Overview of Mainstream Large Language Models
LLMs represent a pivotal development in the evolution of language models. Predominantly based on the Transformer architecture, these models are distinguished by their scalability and typically comprise hundreds of billions of parameters, trained on extensive datasets. This substantial scale endows them with robust generalization capabilities, which in turn enables superior performance across a wide spectrum of tasks, especially in NLP. This subsection presents an overview of notable LLMs, as summarized in Table 3, laying the groundwork for the analysis of their applications in code analysis.
Types of LLMs. LLMs can be broadly categorized into two distinct types: General LLMs and Code LLMs. General LLMs are engineered to address a wide spectrum of NLP tasks such as text generation, summarization, translation, and question answering. Their versatility makes them applicable across various domains. These models are trained on extensive textual datasets, with notable examples including BERTOverflow [102], PLBART [104], GPT-4 [115], and DeepSeek [117]. They are predominantly used in content creation, customer service, and general text processing. Conversely, Code LLMs are specifically optimized for programming and software engineering tasks, excelling in code completion, summarization, vulnerability detection, and function search. They are trained on data from programming languages, with models such as CodeBERT [100], CuBERT [101], GraphCodeBERT [103], CodeT5 [105], CodeGPT [106], Codex [107], Copilot [108], VulBERTa [109], UniXcoder [110], NatGen [111], Polycoder [32], Incoder [112], CodeGen [113], CCBERT [114], and Code Llama [116] demonstrating specialized proficiency in understanding and generating code. The fundamental distinction between these categories lies in their training data and task-specific optimizations: General LLMs are designed for a broad array of natural language tasks, whereas Code LLMs are tailor-made for programming languages, thereby offering enhanced performance in software-related tasks. This differentiation is essential for grasping their specialized roles within the realms of NLP and code-related applications.
Structure of LLMs. LLMs generally adhere to one of three principal architectures: Encoder-only, Encoder-Decoder, and Decoder-only. The Encoder-only models, such as CodeBERT [100], CuBERT [101], BERTOverflow [102], GraphCodeBERT [103], VulBERTa [109], and CCBERT [114], focus on extracting features from input data and excel in tasks like text classification and function search by concentrating on generating representations without sequential output. Encoder-Decoder models, such as PLBART [104], CodeT5 [105], UniXcoder [110], NatGen [111], and Incoder [112], integrate both encoding and decoding processes and are ideally suited for tasks requiring comprehension and generation of output, such as code summarization and translation. Decoder-only models, like CodeGPT [106], Codex [107], Copilot [108], Polycoder [32], CodeGen [113], GPT-4 [115], and Code Llama [116], specialize in generating sequences from a given context and are particularly adept at generative tasks, including text generation and code completion. Each architecture is purpose-built to optimize for specific applications: Encoder-only models excel in representation learning, Encoder-Decoder models are adept at both understanding input and generating output, and Decoder-only models focus on generative tasks.
Parameter Count of LLMs. The parameter count in LLMs is a critical determinant of the models’ capacity and performance. As illustrated in Table 3, high-capacity models such as GPT-4 [115] with 1.76 trillion parameters and DeepSeek [117] with 671 billion parameters represent the forefront of general LLMs’ capabilities, enabling them to handle complex tasks with remarkable proficiency. Conversely, many specialized Code LLMs, such as CodeBERT [100] and CuBERT [101], feature fewer parameters (e.g., 125 M or 110 M) and are optimized for code-related tasks, providing high efficiency within smaller architectures yet remaining highly effective for specific programming challenges. The parameter count directly influences a model’s generalization abilities; larger models typically deliver superior performance across a wider array of tasks but require more computational resources for training and inference.
Window Size of LLMs. The context window size, or the number of tokens a model can process simultaneously, is another essential characteristic of LLMs. Larger context windows enable models to comprehend and generate longer text sequences, which is particularly beneficial in applications that depend on long-range dependencies, such as elaborate code generation or extensive summarization tasks. For instance, models like GPT-4 [115] and Code Llama [116] are capable of accommodating context windows up to 8192 and 100K tokens, respectively, allowing them to manage larger contexts and more complex code patterns effectively. In contrast, models with smaller window sizes, such as CodeBERT [100] and CuBERT [101], are generally confined to processing up to 512 tokens, which, while adequate for many coding tasks, may not capture long-range dependencies as efficiently as their larger counterparts.
4.2 Overview of LLM-Based Approaches in Code Analysis
LLMs have demonstrated substantial potential in the realm of code analysis, attributed to their capacity to understand and generate intricate patterns across both natural and programming languages. Extensive research has been conducted to explore their utility in code analysis, yielding notable enhancements in areas such as fuzzing, bug detection, and code recovery. Table 4 provides a summary of key approaches, emphasizing pertinent details such as the application domain, code type (either source code or binary code), and various applications of LLMs including fuzzing, bug repair, bug detection, and code recovery.
Code Type of LLM-Based Code Analysis. LLM-based code analysis methodologies can be classified according to the type of input code: source code (SC) and binary code (BC). Predominantly, existing approaches are skewed towards source code analysis, which capitalizes on the abundant syntactic and semantic information inherent in high-level programming languages. Representative works such as COMFORT [36], VulRepair [40], CODAMOSA [37], Qtypist [38], TITANFUZZ [39], ARP-LLM [41], CEDAR [42], VD-LLM [44], and SCALE [45] utilize diverse LLMs such as GPT-2, GPT-3, Codex, and CodeT5 to address tasks like fuzzing, bug repair, and bug detection. Conversely, an increasing body of research has been extending LLM capabilities to binary code analysis, where source-level information is less accessible. Noteworthy examples include RESYM [46], ProRec [47], GENNM [48], and SYMGEN [49], employing models like StarCoder, CODEART, and CodeLlama for code recovery in stripped binaries. This nascent trend underscores the adaptability of LLMs to more abstract and information-sparse representations and points to a growing interest in cross-architecture binary analysis. While source code analysis continues to predominate, the field of binary code analysis with LLMs is evolving rapidly, reflecting the models’ capacity to handle increasingly complex and less information-dense inputs.
Application of LLM-Based Code Analysis. LLMs have been deployed in a broad array of code analysis tasks, notably within four principal categories: fuzzing (Fuzz), bug repair (BR), bug detection (BD), and code recovery (CR). In fuzzing, several methodologies such as COMFORT [36], CODAMOSA [37], Qtypist [38], and TITANFUZZ [39] employ LLMs including GPT-2, Codex, and GPT-3 to generate or direct test inputs, frequently utilizing zero-shot or prompt-based scenarios to enhance generalization. Significant advancements have been made in bug repair, with systems like VulRepair [40], ARP-LLM [41], and CEDAR [42] applying models such as CodeT5 and Codex to autonomously rectify coding issues, utilizing fine-tuning and few-shot learning approaches. For bug detection, techniques like VD-LLM [44] and SCALE [45] leverage GPT-3 in zero-shot configurations to pinpoint software vulnerabilities. The field of code recovery, particularly concerning binary code analysis, has recently gained prominence, with initiatives like RESYM [46], ProRec [47], GENNM [48], and SYMGEN [49] employing advanced models such as StarCoder, CODEART, and CodeLlama, primarily through fine-tuning, to deduce semantics from stripped or obfuscated binaries. Overall, LLMs increasingly underpin a vast spectrum of code analysis tasks, with fuzzing and code recovery emerging as particularly dynamic areas. While initial approaches favored source code and simpler prompting techniques, recent endeavors have expanded into binary analysis and increasingly rely on task-specific fine-tuning for enhanced performance and generalization.
LLM-Based Code Analysis. In the domain of LLM-based code analysis, a variety of foundational language models have been employed, ranging from earlier iterations such as GPT-2 to more contemporary, code-centric models like Codex, CodeT5, StarCoder, CODEART, and CodeLlama. The GPT series has been deployed across a multitude of tasks. For instance, COMFORT [36] utilized GPT-2 for fuzzing purposes, whereas Qtypist [38], VD-LLM [44], and SCALE [45] employed GPT-3 for both fuzzing and bug detection. These applications typically leverage prompt-based or zero-shot approaches, necessitating minimal fine-tuning. Codex has been prominently featured in various studies such as CODAMOSA [37], TITANFUZZ [39], ARP-LLM [41], and CEDAR [42], mainly for tasks related to fuzzing and bug repair, utilizing zero-shot or few-shot techniques to exhibit significant generalizability. CodeT5 was specifically employed in VulRepair [40] for source-level bug repair and was fine-tuned to enhance task-specific performance. In the context of binary code analysis, increasingly specialized models have been introduced recently—such as StarCoder [46], CODEART [47], and CodeLlama [48,49]—primarily fine-tuned for complex code recovery tasks. These developments suggest a shift from general-purpose language models to more specialized, code-focused LLMs, with fine-tuning emerging as the predominant strategy for tasks that demand profound semantic understanding or involve binary analysis.
Utilization of LLMs for Code Analysis. The integration of LLMs into code analysis is characterized by four principal methodologies: fine-tuning (FT), zero-shot (ZS), prompt-based (P), and few-shot (FS) learning. Fine-tuning is primarily employed for tasks that necessitate specific adaptations, as evidenced in COMFORT [36] which utilizes GPT-2 for fuzzing, VulRepair [40] employing CodeT5 for bug repair, and a series of recent endeavors in binary code recovery such as RESYM [46], ProRec [47], GENNM [48], and SYMGEN [49], all of which harness the capabilities of fine-tuned models like StarCoder, CODEART, and CodeLlama. Conversely, zero-shot applications are widely adopted for their ability to generalize without further training, as demonstrated by models like CODAMOSA [37], TITANFUZZ [39], ARP-LLM [41], VD-LLM [44], and SCALE [45], predominantly utilizing Codex or GPT-3. Prompt-based approaches are illustrated through applications such as Qtypist [38], where GPT-3 directs fuzzing via specific input prompts without necessitating model retraining. Few-shot learning is utilized in CEDAR [42] to improve bug repair efficacy with a minimal set of examples. Notably, fine-tuning is preferred for binary code analysis and recovery tasks due to its precision in task-specific scenarios, whereas zero-shot and prompt-based methods are more common in source code analysis, offering flexibility and straightforward deployment. This delineation exemplifies the trade-off between adaptability and generality inherent in the strategic deployment of LLMs.
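A minimal zero-shot prompting sketch is shown below using the OpenAI Python client; the model name is illustrative, and any chat-capable LLM client follows the same pattern. No fine-tuning or in-context examples are supplied, which is what characterizes the zero-shot configuration.

from openai import OpenAI  # any chat-capable LLM client follows this pattern

client = OpenAI()  # assumes an API key is configured in the environment

PROMPT = """You are a reverse engineer. Given two disassembled functions,
answer only "similar" or "different", with one sentence of justification.

Function A:
{fn_a}

Function B:
{fn_b}
"""

def zero_shot_similarity(fn_a: str, fn_b: str, model: str = "gpt-4o") -> str:
    # The model name above is illustrative, not prescriptive.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(fn_a=fn_a, fn_b=fn_b)}],
    )
    return response.choices[0].message.content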
4.3 Comparative Analysis of Typical LLM-BCSD Systems
When introducing LLMs in BCSD, it is crucial to compare LLM-based methods with traditional AI approaches to provide historical context and underscore their evolution. Traditional AI methods in BCSD typically rely on rule-based algorithms and feature engineering, which struggle with the complexity and variability of binary code and require extensive manual intervention. In contrast, LLMs, such as the GPT series, leverage deep learning to autonomously learn patterns and syntax from vast datasets, allowing them to effectively tackle complex similarity detection tasks. The transition from rule-based to deep learning-based methods has markedly improved the accuracy, efficiency, and scalability of BCSD, representing a significant breakthrough in binary code analysis.
The deployment of LLMs in BCSD offers innovative prospects for bridging the chasm between low-level binary code and high-level semantic insights. This paper conducts a thorough comparative analysis of two LLM-based BCSD techniques—Bin2SrcSim and CLAP—elucidating their distinctive methodologies, objectives, and contexts of application. Both approaches capitalize on the capabilities of LLMs, yet they adopt different strategies to optimize binary code analysis, each addressing specific challenges. As illustrated in Fig. 6, LLMs fulfill two complementary functions within the contemporary AI-driven BCSD frameworks. The Bin2SrcSim approach emphasizes source-level similarity analysis, where LLMs are fine-tuned to transform disassembled pseudocode into high-level source code representations, facilitating similarity assessments through established textual metrics such as cosine and Jaccard similarity. In contrast, the CLAP strategy focuses on enhancing the quality and transferability of binary code embeddings by employing language supervision. Through aligning binary instructions with their corresponding semantic descriptions in natural language, CLAP promotes effective knowledge transfer across diverse compilation environments and supports robust few-shot and zero-shot similarity detection capabilities.

Figure 6: Workflow of LLM-Driven BCSD. The distinct roles of Bin2SrcSim and CLAP in the preprocessing, representation, and embedding stages are designed to enhance binary code analysis through source code recovery and knowledge transfer
Bin2SrcSim [29]: leverages LLMs to bridge the semantic gap between binaries and source code, pioneering source code recovery for BCSD. The framework fine-tunes LLMs at the function level to transform assembly or intermediate representations into high-level code approximations. This enables the use of established similarity metrics, such as cosine and Jaccard similarity, for assessing functional equivalence. Operating at the source level enhances interpretability and semantic clarity, allowing analysts to understand binary behavior even without original source code. It also improves consistency across compilers, optimization levels, and architectures, supporting robust similarity analysis for tasks such as vulnerability detection, reverse engineering, and malware analysis. Furthermore, the fine-tuned model generalizes across languages and binary formats, and can be incrementally refined with feedback or additional training data, making Bin2SrcSim a versatile tool for legacy systems and security-critical applications.
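The similarity metrics Bin2SrcSim applies at the source level are standard and straightforward to reproduce. The self-contained sketch below computes Jaccard similarity over token sets and cosine similarity over bag-of-token counts for two hypothetical recovered functions.

import math
from collections import Counter

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

def cosine(a: list[str], b: list[str]) -> float:
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

# Hypothetical token streams from two LLM-recovered functions.
fn_a = "int fact ( int n ) { return n <= 1 ? 1 : n * fact ( n - 1 ) ; }".split()
fn_b = ("int factorial ( int x ) { if ( x <= 1 ) return 1 ; "
        "return x * factorial ( x - 1 ) ; }").split()

print(jaccard(set(fn_a), set(fn_b)), cosine(fn_a, fn_b))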
CLAP [28]: adopts a different strategy by generating semantically enriched embeddings through supervised alignment between binary instructions and natural language annotations. This alignment allows the model to capture both structural and functional intent, producing transferable embeddings that generalize across compilers, optimizations, and architectures. A scalable data generation module automatically synthesizes diverse assembly snippets with descriptive annotations (e.g., “This function computes the factorial of an integer”), enabling robust pretraining. The resulting embeddings perform strongly in few-shot and zero-shot scenarios, where conventional methods often fail, and are highly effective for cross-architecture binary analysis. Beyond BCSD, CLAP supports natural language queries for tasks such as identifying cryptographic functions, integrating binary analysis with broader NLP workflows. By combining LLM capabilities with semantic alignment, CLAP offers a scalable and flexible solution for security and software analysis.
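The language supervision in CLAP is, at its core, a contrastive alignment between an assembly encoder and a text encoder. The following PyTorch sketch shows the symmetric CLIP-style objective that such alignment typically uses; the embedding dimensions, temperature, and random stand-in tensors are illustrative assumptions, not CLAP's exact training configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(asm_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # Normalize so dot products are cosine similarities; shapes: (batch, dim).
    asm = F.normalize(asm_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = asm @ txt.t() / temperature      # pairwise asm-text similarities
    labels = torch.arange(asm.size(0))        # i-th snippet matches i-th description
    # Symmetric cross-entropy: align assembly-to-text and text-to-assembly.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Random stand-ins for encoder outputs of 8 (assembly, description) pairs.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```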
The Bin2SrcSim and CLAP methodologies delineate distinct approaches to BCSD utilizing LLMs. Bin2SrcSim concentrates on source-level similarity by reconstructing high-level code from disassembled binaries, employing similarity metrics such as cosine and Jaccard to ascertain functional equivalence and ensure semantic consistency. This enhancement in interpretability renders it ideal for applications such as binary code matching, vulnerability detection, and reverse engineering. Conversely, CLAP generates semantically enriched binary embeddings through supervised alignment with natural language, capturing both structural and functional attributes. It particularly excels in cross-architecture and cross-optimization tasks, notably in few-shot and zero-shot learning environments. While Bin2SrcSim prioritizes functional equivalence, CLAP amplifies transferability, rendering both methods complementary within AI-driven BCSD frameworks. A comparison of these methods is provided in Table 5.

Answering RQ4
LLMs have advanced rapidly with the adoption of the Transformer architecture, offering strong scalability and task generalization. Beyond driving breakthroughs in NLP, they are increasingly applied to code analysis, including both source and binary code. In BCSD, approaches such as Bin2SrcSim leverage LLMs to map disassembled binaries to high-level constructs, improving interpretability and vulnerability detection. Similarly, CLAP aligns binary code with natural language to generate semantically enriched embeddings, enhancing cross-architecture transferability and achieving superior performance in few-shot and zero-shot scenarios. Overall, LLMs play an increasingly critical role in tackling complex challenges in binary code analysis.
5 Challenges and Future Directions in BCSD
This section explores the challenges faced by LLM-based BCSD during the preprocessing, representation, and embedding stages. It also addresses issues such as data type identification, feature extraction, and scalability. Additionally, this section delineates future research directions aimed at refining preprocessing techniques, enhancing LLM architectures, and developing more scalable embedding strategies. The discussion concludes by underscoring the potential of hybrid approaches and few-shot learning in surmounting these challenges, thereby augmenting the accuracy and robustness of BCSD.
5.1 Challenges in LLM-Based BCSD
This subsection discusses the challenges faced by LLMs in binary code analysis, focusing on the preprocessing, representation, and embedding stages that hinder accurate vulnerability detection and analysis.
Challenges in Preprocessing. During decompilation, LLMs often struggle to infer precise data types, which causes substantial information loss and reduces overall accuracy. For example, evaluations of the LLM4Decompile framework showed notable gains in readability but still reported 20–30% discrepancies in type inference compared to human experts [118]. Similarly, DeGPT achieved a 24.4% reduction in analysts’ cognitive burden on Ghidra outputs, yet persistent type identification errors remained [119]. The reliability of preprocessing tools such as IDA Pro, radare2, and Binary Ninja is therefore crucial. Incomplete feature extraction in these tools can propagate errors downstream, as confirmed by the BinMetric benchmark, which reported error rates of 15–25% in tasks such as function identification [120]. Real-world binaries add further complexity, with external dependencies, nested function calls, and intricate control flows often exceeding LLM context limits. LLM-powered static taint analysis, for instance, showed failure rates above 40% when handling large binaries with complex flows [121]. In addition, LLMs frequently misinterpret variables with identical names across different scopes, resulting in incomplete or inaccurate representations. ReSym experiments confirmed this limitation, showing that direct LLM prompting reduced accuracy in variable and type recovery by 1.5–16.5% compared to hybrid approaches [46].
Challenges in Representation. LLMs also face difficulties in generating representations that capture both structural and functional semantics after preprocessing. This problem is pronounced in cases involving complex control flows, deep nesting, or race conditions. For example, stripped binary analysis with CodeLlama-34b yielded low F1-scores (27.59%) in function name recovery, reflecting the lack of semantic cues in binaries [122]. Large-scale code inputs present further challenges. Experiments with GPT-4 on code samples exceeding 10,000 lines revealed accuracy drops and inconsistent outputs, including incomplete reasoning and repetition [123]. These shortcomings often lead to oversimplification of vulnerabilities and logical flows, resulting in high false positive rates. A multi-model evaluation reported false positives exceeding 30%, with patched code mistakenly flagged as vulnerable due to lost context in multi-function data flows [124]. Mapping low-level binaries to high-level constructs also remains difficult: in LLM4Decompile, re-executability rates on HumanEval-Decompile and ExeBench were only 45.4% and 18.0%, respectively, largely due to errors in reconstructing high-level structures such as nested loops [118].
Challenges in Embedding. LLM embeddings are often non-deterministic: identical inputs can yield variable outputs, undermining reproducibility in tasks like vulnerability detection and patch repair. Evaluations across multiple LLMs reported inconsistent classifications of CWEs such as out-of-bounds writes and SQL injection, even under low-temperature settings [124]. Hallucinations further exacerbate this problem. A study on LLM-powered code generation found pass rates below 10% on affected snippets, with intent-conflicting and context-inconsistent hallucinations directly causing vulnerabilities [125]. Scalability is another challenge. Token length limitations restrict LLMs from processing large binaries, forcing code partitioning and risking incomplete analysis. For instance, in WannaCry ransomware analysis, token constraints caused overlooked details and partial detection of malicious behavior. Adversarial robustness is also limited: minor perturbations can mislead LLM outputs. Explainer-guided attacks achieved higher success rates than prior methods, demonstrating efficient transferability in vulnerability detection [126]. Finally, performance in few-shot and zero-shot settings remains uneven. While fine-tuned models like CodeLlama reached F1-scores of 0.82, general-purpose LLMs showed variable results, especially in rare or complex vulnerability types [124].
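As one concrete illustration of the partitioning that token limits force, the sketch below splits a long disassembly listing into overlapping windows that each fit a context budget. The token-counting proxy and window sizes are illustrative assumptions; a real pipeline would count tokens with the target model's tokenizer.

```python
def partition(disasm_lines: list[str], max_tokens: int = 2048,
              overlap_lines: int = 8) -> list[list[str]]:
    # Split disassembly into context-sized windows, carrying a few trailing
    # lines into the next window so flows that cross a boundary (e.g., a
    # compare feeding a later branch) retain some context.
    windows, current, count = [], [], 0
    for line in disasm_lines:
        n = len(line.split())  # crude proxy for the model tokenizer
        if current and count + n > max_tokens:
            windows.append(current)
            current = current[-overlap_lines:]
            count = sum(len(l.split()) for l in current)
        current.append(line)
        count += n
    if current:
        windows.append(current)
    return windows
```

Even with overlap, any analysis that needs a whole-program view (such as multi-function data flows) remains fragmented, which is exactly the incomplete-analysis risk noted above.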
5.2 Future Research Directions
This subsection delineates future research trajectories aimed at enhancing LLM-based BCSD by focusing on the improvement of preprocessing techniques, representation models, and embedding strategies. These enhancements aim to address current challenges and augment the accuracy, robustness, and scalability of LLMs in binary code analysis.
Direction on Preprocessing. To tackle issues related to inaccurate data type identification and information loss, future research should concentrate on the development of advanced preprocessing techniques that improve the accuracy of decompilation and feature extraction. Employing hybrid approaches that amalgamate traditional static analysis with machine learning-based techniques may enhance the extraction of stable, low-level features and mitigate errors stemming from external dependencies and complex control flows. Moreover, enhancing the management of variable scope and context tracking in LLMs could reduce the risk of errors during preprocessing.
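As a minimal illustration of this hybrid direction, the sketch below pairs cheap, stable static features (a normalized byte histogram and a printable-string count) with an off-the-shelf classifier. The feature set, model choice, and toy training data are illustrative assumptions, not a prescribed design.

```python
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def static_features(blob: bytes) -> np.ndarray:
    # Features that survive compiler and optimization changes reasonably well.
    hist = np.bincount(np.frombuffer(blob, dtype=np.uint8), minlength=256)
    hist = hist / max(len(blob), 1)                   # normalized byte histogram
    n_strings = len(re.findall(rb"[ -~]{4,}", blob))  # printable strings >= 4 chars
    return np.concatenate([hist, [n_strings / max(len(blob), 1)]])

# Toy training set: two labeled byte blobs standing in for extracted functions.
X = np.stack([static_features(b"\x55\x48\x89\xe5" * 16),
              static_features(b"hello, world\x00" * 8)])
y = np.array([0, 1])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
```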
Direction on Representation. To address the limitations inherent in representing complex code structures, future research should investigate more sophisticated LLM architectures, designed to capture the structural and functional characteristics of binary code effectively. Incorporating domain-specific knowledge, such as control flow graphs or abstract syntax trees, into LLMs would enhance the preservation of critical context and elevate the accuracy of vulnerability detection. Additionally, advancing techniques that map low-level binary data to high-level programming constructs more effectively could lead to more precise detection of subtle vulnerabilities, thereby ensuring a more accurate functional equivalence between the source and binary code.
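One lightweight way to supply such domain knowledge is to attach explicit control-flow features to whatever the LLM sees. The sketch below builds a small CFG with networkx and derives simple structural signals; the block names and edges are hypothetical, and in practice they would come from a disassembler such as IDA Pro or Ghidra.

```python
import networkx as nx

# Hypothetical basic blocks of a function containing a single loop.
cfg = nx.DiGraph()
cfg.add_edges_from([("entry", "loop_head"), ("loop_head", "loop_body"),
                    ("loop_body", "loop_head"), ("loop_head", "exit")])

features = {
    "num_blocks": cfg.number_of_nodes(),
    "num_edges": cfg.number_of_edges(),
    # Cyclomatic complexity for a single connected CFG: E - N + 2.
    "cyclomatic": cfg.number_of_edges() - cfg.number_of_nodes() + 2,
    "has_loop": not nx.is_directed_acyclic_graph(cfg),
}
print(features)  # {'num_blocks': 4, 'num_edges': 4, 'cyclomatic': 2, 'has_loop': True}
```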
Direction on Embedding. In light of the challenges posed by non-deterministic behavior and hallucinations in embeddings, future work should investigate methodologies to augment the consistency and reliability of LLM-based embedding models. Employing approaches such as reinforcement learning or adversarial training could mitigate the effects of hallucinations and enhance the robustness of LLMs against adversarial attacks. Moreover, addressing scalability issues by overcoming token limitations—possibly through the implementation of memory-augmented models or hierarchical embeddings—would permit LLMs to process larger and more complex codebases more effectively. Further research into few-shot and zero-shot learning techniques, especially for software security tasks, could facilitate more efficient fine-tuning and model adaptation specific to binary analysis tasks.
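A simple instance of the hierarchical-embedding idea is to embed each context-sized window independently and pool the window vectors into one function- or binary-level vector, as sketched below. The embed_window stand-in is hypothetical; a real system would call an LLM encoder there.

```python
import numpy as np

def embed_window(window_tokens: list[str], dim: int = 256) -> np.ndarray:
    # Placeholder encoder: deterministic pseudo-random vector per window.
    seed = abs(hash(" ".join(window_tokens))) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def hierarchical_embed(windows: list[list[str]]) -> np.ndarray:
    # Level 1: window embeddings; level 2: mean-pool into one vector,
    # unit-normalized so downstream cosine search is well defined.
    vecs = np.stack([embed_window(w) for w in windows])
    pooled = vecs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)
```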
Direction on Data Scarcity. To address the challenge of data scarcity in BCSD, where the availability of labeled binary code datasets remains limited, future research should focus on exploring advanced techniques such as semi-supervised learning, data augmentation, and transfer learning. These approaches can help create more robust models by leveraging small amounts of labeled data and larger pools of unlabeled data, thereby improving the overall effectiveness of BCSD. Additionally, strategies for synthesizing realistic binary code samples through generative models or other innovative methods could further alleviate data scarcity and contribute to more effective model training.
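One label-preserving augmentation that requires no extra labeling is a consistent register permutation over assembly text: the result is syntactically different but, for the purposes of similarity training, the same sample. The sketch below is a toy x86-64 illustration that deliberately ignores calling-convention constraints.

```python
import random
import re

GPRS = ["rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9"]

def rename_registers(asm: str, seed: int = 0) -> str:
    # Apply one consistent permutation in a single pass so that chained
    # replacements cannot corrupt the mapping.
    mapping = dict(zip(GPRS, random.Random(seed).sample(GPRS, len(GPRS))))
    return re.sub(r"\b(" + "|".join(GPRS) + r")\b",
                  lambda m: mapping[m.group(1)], asm)

print(rename_registers("mov rax, rbx\nadd rax, rcx"))
```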
Answering RQ5
Concerning LLM-based BCSD, persistent challenges arise across preprocessing, representation, and embedding stages. In preprocessing, LLMs struggle to accurately identify complex binary data types, often causing information loss. During representation, they frequently fail to capture structural and functional characteristics, particularly in complex control flows. At the embedding stage, non-determinism and scalability constraints limit performance on large codebases. Addressing these issues requires refined preprocessing techniques, improved feature extraction, and scalable model architectures. Incorporating domain knowledge and advancing few-shot or zero-shot learning can further enhance vulnerability detection accuracy. Moreover, the scarcity of labeled binary datasets restricts effective training. Future research should therefore explore semi-supervised learning, data augmentation, and transfer learning to mitigate data limitations and improve model robustness.
LLM-based BCSD is still in its early stages, and as such, this paper does not offer a comprehensive overview of all existing LLM-based BCSD methods. Given the rapidly evolving nature of the field, many methodologies are still under development, and their practical applicability has not yet been fully established. This review thus emphasizes significant advancements and foundational techniques, highlighting the studies most representative of the current research landscape.
Moreover, it is important to acknowledge several challenges in the field that are critical for future research. These include dataset bias, which can skew the results and limit the generalizability of methods; reproducibility challenges, which hinder the verification of findings and the adoption of new techniques; and the computational costs associated with large-scale LLM training, which remain a significant barrier to the widespread adoption of LLM-based BCSD approaches. These factors, often underexplored in current literature, are essential considerations for assessing the scalability and efficacy of LLM-based methods.
As the field progresses, it is anticipated that future research will address these limitations, refining existing methodologies and offering deeper insights into the application of LLM-based BCSD across diverse environments. The findings presented herein should be considered as a snapshot of the field at this particular juncture, with the understanding that new methods and solutions will likely emerge as the field matures.
6 Conclusion
BCSD has undergone substantial evolution, transitioning from elementary syntactic comparisons to sophisticated AI-driven methodologies that can comprehensively capture the complex semantic relationships inherent in contemporary software systems. The integration of LLMs has been instrumental in this progression, significantly enhancing the capability to detect vulnerabilities and malware patterns through deeper semantic understanding and increased scalability across architectures. This review systematically delineates the evolution of BCSD, tracing its journey from its inception in code differencing to its present sophistication, which leverages deep learning and graph-based reasoning. Furthermore, we explore the layered framework of AI-driven BCSD, comprising preprocessing, representation, and embedding stages, each of which plays a crucial role in achieving more accurate and scalable similarity detection.
Although LLMs demonstrate considerable potential in propelling BCSD forward, several challenges remain at the preprocessing, representation, and embedding stages. Issues such as data type identification, information loss, and scalability constraints continue to pose significant hurdles, particularly in the realm of complex software security tasks. Nonetheless, LLMs represent a promising pathway for enhancing BCSD, with forthcoming research focused on surmounting these challenges and refining their application in binary code analysis.
This paper provides an exhaustive review of the current state of BCSD and its integration with LLMs, addressing pivotal questions related to its evolution, frameworks, applicability, and prospective research directions. Looking ahead, continuous advancements in LLMs and their amalgamation with BCSD are anticipated to yield more robust, efficient, and scalable solutions, ultimately fortifying the security and reliability of modern software systems.
Acknowledgement: Not applicable.
Funding Statement: The authors received no specific funding for this study.
Author Contributions: Shengjia Chang contributed to the conceptualization, literature survey, methodology, formal analysis, investigation, original draft preparation, and visualization. Baojiang Cui was responsible for resources, supervision, project administration, and contributed to reviewing and editing the manuscript. Shaocong Feng contributed to the literature survey, methodology, formal analysis, investigation, and manuscript review and editing. All authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials: This survey is based entirely on previously published scientific literature. All data and materials discussed or analyzed in this review were obtained from the publicly available research articles cited throughout the manuscript. No new primary data were generated or analyzed as part of this study.
Ethics Approval: This article does not contain any studies with human participants or animals performed by any of the authors.
Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.
References
1. Haq IU, Caballero J. A survey of binary code similarity. ACM Comput Surv. 2022;54(3):1–38. doi:10.1145/3446371. [Google Scholar] [CrossRef]
2. Gao J, Yang X, Fu Y, Jiang Y, Sun J. VulSeeker: a semantic learning based vulnerability seeker for cross-platform binary. In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering; 2018 Sep 3–7; Montpellier, France. p. 896–9. doi:10.1145/3238147.3240480. [Google Scholar] [CrossRef]
3. Xu X, Liu C, Feng Q, Yin H, Song L, Song D. Neural network-based graph embedding for cross-platform binary code similarity detection. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security; 2017 Oct 30–Nov 3; Dallas, TX, USA. p. 363–76. doi:10.1145/3133956.3134018. [Google Scholar] [CrossRef]
4. David Y, Partush N, Yahav E. FirmUp: precise static detection of common vulnerabilities in firmware. ACM SIGPLAN Notices. 2018;53(2):392–404. doi:10.1145/3296957.3177157. [Google Scholar] [CrossRef]
5. Shirani P, Collard L, Agba BL, Lebel B, Debbabi M, Wang L, et al. Binarm: scalable and efficient detection of vulnerabilities in firmware images of intelligent electronic devices. In: Detection of Intrusions and Malware, and Vulnerability Assessment: 15th International Conference, DIMVA 2018; 2018 Jun 28–29; Saclay, France. Cham, Switzerland: Springer International Publishing; 2018. p. 114–38. [Google Scholar]
6. Liu B, Huo W, Zhang C, Li W, Li F, Piao A, et al. αDiff: cross-version binary code similarity detection with DNN. In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering; 2018 Sep 3–7; Montpellier, France. p. 667–78. doi:10.1145/3238147.3238199. [Google Scholar] [CrossRef]
7. Bruschi D, Martignoni L, Monga M. Detecting self-mutating malware using control-flow graph matching. In: Detection of intrusions and malware & vulnerability assessment. Berlin/Heidelberg, Germany: Springer; 2006. p. 129–43. doi:10.1007/11790754_8. [Google Scholar] [CrossRef]
8. Cesare S, Xiang Y, Zhou W. Control flow-based malware variant detection. IEEE Trans Dependable Secure Comput. 2014;11(4):307–17. doi:10.1109/TDSC.2013.40. [Google Scholar] [CrossRef]
9. Lindorfer M, Di Federico A, Maggi F, Comparetti PM, Zanero S. Lines of malicious code: insights into the malicious software industry. In: Proceedings of the 28th Annual Computer Security Applications Conference; 2012 Dec 3–7; Orlando, FL, USA. p. 349–58. doi:10.1145/2420950.2421001. [Google Scholar] [CrossRef]
10. Jang J, Woo M, Brumley D. Towards automatic software lineage inference. In: 22nd USENIX Security Symposium (USENIX Security 13); 2013 Aug 14–16; Washington, DC, USA. p. 81–96. [Google Scholar]
11. Ming J, Xu D, Wu D. Memoized semantics-based binary diffing with application to malware lineage inference. In: ICT systems security and privacy protection. Cham, Switzerland: Springer International Publishing; 2015. p. 416–30. doi:10.1007/978-3-319-18467-8_28. [Google Scholar] [CrossRef]
12. Hirschberg J, Manning CD. Advances in natural language processing. Science. 2015;349(6245):261–6. doi:10.1126/science.aaa8685. [Google Scholar] [PubMed] [CrossRef]
13. Manning CD, Schütze H. Foundations of statistical natural language processing. London, UK: MIT Press; 1999. [Google Scholar]
14. Chen F, Wang YC, Wang B, Kuo CJ. Graph representation learning: a survey. APSIPA Trans Signal Inf Process. 2020;9(1):e15. doi:10.1017/atsip.2020.13. [Google Scholar] [CrossRef]
15. Wang X, Jiang Y, Bach N, Wang T, Huang Z, Huang F, et al. Automated concatenation of embeddings for structured prediction. arXiv:2010.05006. 2020. [Google Scholar]
16. Duan Y, Li X, Wang J, Yin H. DeepBinDiff: learning program-wide code representations for binary diffing. In: Proceedings of the 2020 Network and Distributed System Security Symposium; 2020 Feb 23–26; San Diego, CA, USA. doi:10.14722/ndss.2020.24311. [Google Scholar] [CrossRef]
17. Fu L, Ji S, Liu C, Liu P, Duan F, Wang Z, et al. Focus: function clone identification on cross-platform. Int J Intelligent Sys. 2022;37(8):5082–112. doi:10.1002/int.22752. [Google Scholar] [CrossRef]
18. Ling X, Wu L, Wang S, Ma T, Xu F, Liu AX, et al. Multilevel graph matching networks for deep graph similarity learning. IEEE Trans Neural Netw Learning Syst. 2023;34(2):799–813. doi:10.1109/tnnls.2021.3102234. [Google Scholar] [PubMed] [CrossRef]
19. Peng D, Zheng S, Li Y, Ke G, He D, Liu T. How could neural networks understand programs? In: Proceedings of the 38th International Conference on Machine Learning; 2021 Jul 18–24; Online. [Google Scholar]
20. Yang J, Fu C, Liu XY, Yin H, Zhou P. Codee: a tensor embedding scheme for binary code search. IEEE Trans Softw Eng. 2022;48(7):2224–44. doi:10.1109/tse.2021.3056139. [Google Scholar] [CrossRef]
21. Fan W, Ma Y, Li Q, He Y, Zhao E, Tang J, et al. Graph neural networks for social recommendation. In: The World Wide Web Conference. New York, NY, USA: The Association for Computing Machinery (ACM); 2019. p. 417–26. doi:10.1145/3308558.3313488. [Google Scholar] [CrossRef]
22. Tian D, Jia X, Ma R, Liu S, Liu W, Hu C. BinDeep: a deep learning approach to binary code similarity detection. Expert Syst Appl. 2021;168:114348. doi:10.1016/j.eswa.2020.114348. [Google Scholar] [CrossRef]
23. Feng Q, Wang M, Zhang M, Zhou R, Henderson A, Yin H. Extracting conditional formulas for cross-platform bug search. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. New York, NY, USA: The Association for Computing Machinery (ACM); 2017. p. 346–59. doi:10.1145/3052973.3052995. [Google Scholar] [CrossRef]
24. Gao D, Reiter MK, Song D. BinHunt: automatically finding semantic differences in binary programs. In: Information and communications security. Berlin/Heidelberg, Germany: Springer; 2008. p. 238–55. doi:10.1007/978-3-540-88625-9_16. [Google Scholar] [CrossRef]
25. Jin W, Chaki S, Cohen C, Gurfinkel A, Havrilla J, Hines C, et al. Binary function clustering using semantic hashes. In: 2012 11th International Conference on Machine Learning and Applications; 2012 Dec 12–15; Boca Raton, FL, USA. p. 386–91. doi:10.1109/ICMLA.2012.70. [Google Scholar] [CrossRef]
26. Lakhotia A, Dalla Preda M, Giacobazzi R. Fast location of similar code fragments using semantic ‘juice’. In: Proceedings of the 2nd ACM SIGPLAN Program Protection and Reverse Engineering Workshop. New York, NY, USA: The Association for Computing Machinery (ACM); 2013. p. 1–6. doi:10.1145/2430553.2430558. [Google Scholar] [CrossRef]
27. Luo L, Ming J, Wu D, Liu P, Zhu S. Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. New York, NY, USA: The Association for Computing Machinery (ACM); 2014. p. 389–400. doi:10.1145/2635868.2635900. [Google Scholar] [CrossRef]
28. Wang H, Gao Z, Zhang C, Sha Z, Sun M, Zhou Y, et al. CLAP: learning transferable binary code representations with natural language supervision. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY, USA: The Association for Computing Machinery (ACM); 2024. p. 503–15. doi:10.1145/3650212.3652145. [Google Scholar] [CrossRef]
29. Wan B, Wang S, Wei Z, Huang J, Hu C. Binary code similarity detection via LLM-based source code conversion. IEEE Internet Things J. 2025. doi:10.1109/JIOT.2025.3579231. [Google Scholar] [CrossRef]
30. Lu S, Guo D, Ren S, Huang J, Svyatkovskiy A, Blanco A, et al. CodeXGLUE: a machine learning benchmark dataset for code understanding and generation. arXiv:2102.04664. 2021. [Google Scholar]
31. Chen M, Tworek J, Jun H, Yuan Q, de Oliveira Pinto HP, Kaplan J, et al. Evaluating large language models trained on code. arXiv:2107.03374. 2021. [Google Scholar]
32. Xu FF, Alon U, Neubig G, Hellendoorn VJ. A systematic evaluation of large language models of code. In: Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. New York, NY, USA: The Association for Computing Machinery (ACM); 2022. p. 1–10. doi:10.1145/3520312.3534862. [Google Scholar] [CrossRef]
33. Rozière B, Gehring J, Gloeckle F, Sootla S, Gat I, Tan XE, et al. Code llama: open foundation models for code. arXiv:2308.12950. 2023. [Google Scholar]
34. Li R, Allal LB, Zi Y, Muennighoff N, Kocetkov D, Mou C, et al. Starcoder: may the source be with you! arXiv:2305.06161. 2023. [Google Scholar]
35. Marcelli A, Graziano M, Ugarte-Pedrero X, Fratantonio Y, Mansouri M, Balzarotti D. How machine learning is solving the binary function similarity problem. In: 31st USENIX Security Symposium (USENIX Security 22); 2022 Aug 10–12; Boston, MA, USA. p. 2099–116. [Google Scholar]
36. Ye G, Tang Z, Tan SH, Huang S, Fang D, Sun X, et al. Automated conformance testing for JavaScript engines via deep compiler fuzzing. In: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. New York, NY, USA: The Association for Computing Machinery (ACM); 2021. p. 435–50. doi:10.1145/3453483.3454054. [Google Scholar] [CrossRef]
37. Lemieux C, Inala JP, Lahiri SK, Sen S. CodaMosa: escaping coverage plateaus in test generation with pre-trained large language models. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE); 2023 May 14–20; Melbourne, VIC, Australia. p. 919–31. doi:10.1109/ICSE48619.2023.00085. [Google Scholar] [CrossRef]
38. Liu Z, Chen C, Wang J, Che X, Huang Y, Hu J, et al. Fill in the blank: context-aware automated text input generation for mobile GUI testing. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE); 2023 May 14–20; Melbourne, VIC, Australia. p. 1355–67. doi:10.1109/ICSE48619.2023.00119. [Google Scholar] [CrossRef]
39. Deng Y, Xia CS, Peng H, Yang C, Zhang L. Large language models are zero-shot fuzzers: fuzzing deep-learning libraries via large language models. In: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY, USA: The Association for Computing Machinery (ACM); 2023. p. 423–35. doi:10.1145/3597926.3598067. [Google Scholar] [CrossRef]
40. Fu M, Tantithamthavorn C, Le T, Nguyen V, Phung D. VulRepair: a T5-based automated software vulnerability repair. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY, USA: The Association for Computing Machinery (ACM); 2022. p. 935–47. doi:10.1145/3540250.3549098. [Google Scholar] [CrossRef]
41. Fan Z, Gao X, Mirchev M, Roychoudhury A, Tan SH. Automated repair of programs from large language models. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE); 2023 May 14–20; Melbourne, VIC, Australia. p. 1469–81. doi:10.1109/ICSE48619.2023.00128. [Google Scholar] [CrossRef]
42. Nashid N, Sintaha M, Mesbah A. Retrieval-based prompt selection for code-related few-shot learning. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE); 2023 May 14–20; Melbourne, VIC, Australia. p. 2450–62. doi:10.1109/ICSE48619.2023.00205. [Google Scholar] [CrossRef]
43. Cheshkov A, Zadorozhny P, Levichev R. Evaluation of ChatGPT model for vulnerability detection. arXiv:2304.07232. 2023. [Google Scholar]
44. Das Purba M, Ghosh A, Radford BJ, Chu B. Software vulnerability detection using large language models. In: 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW); 2023 Oct 9–12; Florence, Italy. p. 112–9. doi:10.1109/ISSREW60843.2023.00058. [Google Scholar] [CrossRef]
45. Wen XC, Gao C, Gao S, Xiao Y, Lyu MR. SCALE: constructing structured natural language comment trees for software vulnerability detection. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY, USA: The Association for Computing Machinery (ACM); 2024. p. 235–47. doi:10.1145/3650212.3652124. [Google Scholar] [CrossRef]
46. Xie D, Zhang Z, Jiang N, Xu X, Tan L, Zhang X. ReSym: harnessing LLMs to recover variable and data structure symbols from stripped binaries. In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: The Association for Computing Machinery (ACM); 2024. p. 4554–68. doi:10.1145/3658644.3670340. [Google Scholar] [CrossRef]
47. Su Z, Xu X, Huang Z, Zhang K, Zhang X. Source code foundation models are transferable binary analysis knowledge bases. In: The Thirty-Eighth Annual Conference on Neural Information Processing Systems; 2024 Dec 10–15; Vancouver, BC, Canada. p. 112624–55. [Google Scholar]
48. Xu X, Zhang Z, Su Z, Huang Z, Feng S, Ye Y, et al. Unleashing the power of generative model in recovering variable names from stripped binary. In: Proceedings of the 2025 Network and Distributed System Security Symposium; 2025 Feb 24–28; San Diego, CA, USA. doi:10.14722/ndss.2025.240276. [Google Scholar] [CrossRef]
49. Jiang L, Jin X, Lin Z. Beyond classification: inferring function names in stripped binaries via domain adapted LLMs. In: Proceedings of the 2025 Network and Distributed System Security Symposium; 2025 Feb 24–28; San Diego, CA, USA. doi:10.14722/ndss.2025.240797. [Google Scholar] [CrossRef]
50. Baker BS, Manber U, Muth R. Compressing differences of executable code. In: ACM SIGPLAN Workshop on Compiler Support for System Software (WCSS). Princeton, NJ, USA: Citeseer; 1999. [Google Scholar]
51. Flake H. Structural comparison of executable objects. In: Detection of Intrusions and Malware and Vulnerability Assessment, GI SIG SIDAR Workshop, DIMVA 2004. Bonn, Germany: Köllen Verlag; 2004. p. 161–74. [Google Scholar]
52. Dullien T, Rolles R. Graph-based comparison of executable objects (English version). In: Actes du Symposium SSTIC05; 2005. p. 1–13. [Google Scholar]
53. Hu Y, Zhang Y, Li J, Gu D. Cross-architecture binary semantics understanding via similar code comparison. In: 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER); 2016 Mar 14–18; Osaka, Japan. p. 57–67. doi:10.1109/SANER.2016.50. [Google Scholar] [CrossRef]
54. Huang H, Youssef AM, Debbabi M. BinSequence: fast, accurate and scalable binary code reuse detection. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. New York, NY, USA: The Association for Computing Machinery (ACM); 2017. p. 155–66. doi:10.1145/3052973.3052974. [Google Scholar] [CrossRef]
55. Xu Z, Chen B, Chandramohan M, Liu Y, Song F. SPAIN: security patch analysis for binaries towards understanding the pain and pills. In: 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE); 2017 May 20–28; Buenos Aires, Argentina. p. 462–72. doi:10.1109/ICSE.2017.49. [Google Scholar] [CrossRef]
56. Kargén U, Shahmehri N. Towards robust instruction-level trace alignment of binary code. In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE); 2017 Oct 30–Nov 3; Urbana, IL, USA. p. 342–52. doi:10.1109/ASE.2017.8115647. [Google Scholar] [CrossRef]
57. David Y, Yahav E. Tracelet-based code search in executables. ACM SIGPLAN Notices. 2014;49(6):349–60. doi:10.1145/2666356.2594343. [Google Scholar] [CrossRef]
58. Pewny J, Schuster F, Bernhard L, Holz T, Rossow C. Leveraging semantic signatures for bug search in binary programs. In: Proceedings of the 30th Annual Computer Security Applications Conference. New York, NY, USA: The Association for Computing Machinery (ACM); 2014. p. 406–15. doi:10.1145/2664243.2664269. [Google Scholar] [CrossRef]
59. Pewny J, Garmany B, Gawlik R, Rossow C, Holz T. Cross-architecture bug search in binary executables. In: 2015 IEEE Symposium on Security and Privacy; 2015 May 17–21; San Jose, CA, USA. p. 709–24. doi:10.1109/SP.2015.49. [Google Scholar] [CrossRef]
60. Eschweiler S, Yakdan K, Gerhards-Padilla E. discovRE: efficient cross-architecture identification of bugs in binary code. In: Proceedings of the 2016 Network and Distributed System Security Symposium; 2016 Feb 21–24; San Diego, CA, USA. doi:10.14722/ndss.2016.23185. [Google Scholar] [CrossRef]
61. David Y, Partush N, Yahav E. Statistical similarity of binaries. ACM SIGPLAN Notices. 2016;51(6):266–80. doi:10.1145/2980983.2908126. [Google Scholar] [CrossRef]
62. Feng Q, Zhou R, Xu C, Cheng Y, Testa B, Yin H. Scalable graph-based bug search for firmware images. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: The Association for Computing Machinery (ACM); 2016. p. 480–91. doi:10.1145/2976749.2978370. [Google Scholar] [CrossRef]
63. Chandramohan M, Xue Y, Xu Z, Liu Y, Cho CY, Tan HBK. BinGo: cross-architecture cross-OS binary search. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. New York, NY, USA: The Association for Computing Machinery (ACM); 2016. p. 678–89. doi:10.1145/2950290.2950350. [Google Scholar] [CrossRef]
64. David Y, Partush N, Yahav E. Similarity of binaries through re-optimization. In: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. New York, NY, USA: The Association for Computing Machinery (ACM); 2017. p. 79–94. doi:10.1145/3062341.3062387. [Google Scholar] [CrossRef]
65. Ding SHH, Fung BCM, Charland P. Asm2Vec: boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In: 2019 IEEE Symposium on Security and Privacy (SP); 2019 May 19–23; San Francisco, CA, USA. p. 472–89. doi:10.1109/sp.2019.00003. [Google Scholar] [CrossRef]
66. Massarelli L, Di Luna GA, Petroni F, Querzoni L, Baldoni R. SAFE: self-attentive function embeddings for binary similarity. In: Detection of Intrusions and Malware, and Vulnerability Assessment: 16th International Conference, DIMVA 2019. Cham, Switzerland: Springer International Publishing; 2019. [Google Scholar]
67. Hu X, Chiueh TC, Shin KG. Large-scale malware indexing using function-call graphs. In: Proceedings of the 16th ACM Conference on Computer and Communications Security. New York, NY, USA: The Association for Computing Machinery (ACM); 2009. p. 611–20. doi:10.1145/1653662.1653736. [Google Scholar] [CrossRef]
68. Hu X, Bhatkar S, Griffin K, Shin KG. MutantX-S: scalable malware clustering based on static features. In: USENIX ATC’13: Proceedings of the 2013 USENIX conference on Annual Technical Conference; 2013 Jun 26–28; San Jose, CA, USA. p. 187–98. [Google Scholar]
69. Kim T, Lee YR, Kang B, Im EG. Binary executable file similarity calculation using function matching. J Supercomput. 2019;75(2):607–22. doi:10.1007/s11227-016-1941-2. [Google Scholar] [CrossRef]
70. Rochkind MJ. The source code control system. IEEE Trans Softw Eng. 1975;SE-1(4):364–70. doi:10.1109/TSE.1975.6312866. [Google Scholar] [CrossRef]
71. Reichenberger C. Delta storage for arbitrary non-text files. In: Proceedings of the 3rd International Workshop on Software Configuration Management. New York, NY, USA: The Association for Computing Machinery (ACM); 1991. p. 144–52. doi:10.1145/111062.111081. [Google Scholar] [CrossRef]
72. Valdes A, Zamboni D. Recent advances in intrusion detection. In: 8th International Symposium, RAID 2005. Berlin/Heidelberg, Germany: Springer; 2006. doi:10.1007/11663812. [Google Scholar] [CrossRef]
73. Khoo WM, Mycroft A, Anderson R. Rendezvous: a search engine for binary code. In: 2013 10th Working Conference on Mining Software Repositories (MSR); 2013 May 18–19; San Francisco, CA, USA. p. 329–38. doi:10.1109/MSR.2013.6624046. [Google Scholar] [CrossRef]
74. Tencent. BinaryAI Python SDK [Internet]. [cited 2025 Sep 9]. Available from: https://github.com/binaryai/sdk. [Google Scholar]
75. Wang H, Qu W, Katz G, Zhu W, Gao Z, Qiu H, et al. jTrans: jump-aware transformer for binary code similarity detection. In: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY, USA: The Association for Computing Machinery (ACM); 2022. p. 1–13. doi:10.1145/3533767.3534367. [Google Scholar] [CrossRef]
76. Zhu W, Wang H, Zhou Y, Wang J, Sha Z, Gao Z, et al. kTrans: knowledge-aware transformer for binary code embedding. arXiv:2308.12659. 2023. [Google Scholar]
77. He H, Lin X, Weng Z, Zhao R, Gan S, Chen L, et al. Code is not natural language: unlock the power of semantics-oriented graph representation for binary code similarity detection. In: 33rd USENIX Security Symposium (USENIX Security 24); 2024 Aug 14–16; Philadelphia, PA, USA. [Google Scholar]
78. Alrabaee S, Shirani P, Wang L, Debbabi M. FOSSIL: a resilient and efficient system for identifying FOSS functions in malware binaries. ACM Trans Priv Secur. 2018;21(2):1–34. doi:10.1145/3175492. [Google Scholar] [CrossRef]
79. Li Y, Gu C, Dullien T, Vinyals O, Kohli P. Graph matching networks for learning the similarity of graph structured objects. In: Proceedings of the 36th International Conference on Machine Learning; 2019 Jun 9–15; Long Beach, CA, USA. p. 3835–45. [Google Scholar]
80. Shalev N, Partush N. Binary similarity detection using machine learning. In: Proceedings of the 13th Workshop on Programming Languages and Analysis for Security. New York, NY, USA: The Association for Computing Machinery (ACM); 2018. p. 42–7. doi:10.1145/3264820.3264821. [Google Scholar] [CrossRef]
81. Shirani P, Wang L, Debbabi M. BinShape: scalable and robust binary library function identification using function shape. In: Detection of intrusions and malware, and vulnerability assessment. Cham, Switzerland: Springer International Publishing; 2017. p. 301–24. doi:10.1007/978-3-319-60876-1_14. [Google Scholar] [CrossRef]
82. Ding SHH, Fung BCM, Charland P. Kam1n0: MapReduce-based assembly clone search for reverse engineering. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: The Association for Computing Machinery (ACM); 2016. p. 461–70. doi:10.1145/2939672.2939719. [Google Scholar] [CrossRef]
83. Farhadi MR, Fung BCM, Charland P, Debbabi M. BinClone: detecting code clones in malware. In: 2014 Eighth International Conference on Software Security and Reliability (SERE); 2014 Jun 30–Jul 2; San Francisco, CA, USA. p. 78–87. doi:10.1109/SERE.2014.21. [Google Scholar] [CrossRef]
84. Nouh L, Rahimian A, Mouheb D, Debbabi M, Hanna A. Binsign: fingerprinting binary functions to support automated analysis of code executables. In: ICT Systems Security and Privacy Protection: 32nd IFIP TC 11 International Conference, SEC 2017. Cham, Switzerland: Springer International Publishing; 2017. p. 341–55. [Google Scholar]
85. Pagani F, Dell’Amico M, Balzarotti D. Beyond precision and recall: understanding uses (and misuses) of similarity hashes in binary analysis. In: Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy. New York, NY, USA: The Association for Computing Machinery (ACM); 2018. p. 354–65. doi:10.1145/3176258.3176306. [Google Scholar] [CrossRef]
86. Onieva JA, Pérez Jiménez P, López J. Malware similarity and a new fuzzy hash: compound Code Block Hash (CCBHash). Comput Secur. 2024;142(1):103856. doi:10.1016/j.cose.2024.103856. [Google Scholar] [CrossRef]
87. Sarantinos N, Benzaïd C, Arabiat O, Al-Nemrat A. Forensic malware analysis: the value of fuzzy hashing algorithms in identifying similarities. In: 2016 IEEE Trustcom/BigDataSE/ISPA; 2016 Aug 23–26; Tianjin, China. p. 1782–7. doi:10.1109/TrustCom.2016.0274. [Google Scholar] [CrossRef]
88. Dullien T. Searching statically-linked vulnerable library functions in executable code [Internet]. [cited 2025 Sep 9]. Available from: https://googleprojectzero.blog. [Google Scholar]
89. Massarelli L, Di Luna GA, Petroni F, Querzoni L, Baldoni R. Investigating graph embedding neural networks with unsupervised features extraction for binary analysis. In: Proceedings of the 2019 Workshop on Binary Analysis Research; 2019 Feb 24; San Diego, CA, USA. doi:10.14722/bar.2019.23020. [Google Scholar] [CrossRef]
90. Yu Z, Cao R, Tang Q, Nie S, Huang J, Wu S. Order matters: semantic-aware neural networks for binary code similarity detection. Proc AAAI Conf Artif Intell. 2020;34(1):1145–52. doi:10.1609/aaai.v34i01.5466. [Google Scholar] [CrossRef]
91. Pei K, Xuan Z, Yang J, Jana S, Ray B. Trex: learning execution semantics from micro-traces for binary similarity. arXiv:2012.08680. 2020. [Google Scholar]
92. Yang S, Cheng L, Zeng Y, Lang Z, Zhu H, Shi Z. Asteria: deep learning-based AST-encoding for cross-platform binary code similarity detection. In: 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2021 Jun 21–24; Taipei, Taiwan. p. 224–36. doi:10.1109/DSN48987.2021.00036. [Google Scholar] [CrossRef]
93. Li X, Qu Y, Yin H. PalmTree: learning an assembly language model for instruction embedding. In: Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: The Association for Computing Machinery (ACM); 2021. p. 3236–51. doi:10.1145/3460120.3484587. [Google Scholar] [CrossRef]
94. Ahn S, Ahn S, Koo H, Paek Y. Practical binary code similarity detection with BERT-based transferable similarity learning. In: Proceedings of the 38th Annual Computer Security Applications Conference. New York, NY, USA: The Association for Computing Machinery (ACM); 2022. p. 361–74. doi:10.1145/3564625.3567975. [Google Scholar] [CrossRef]
95. Kim G, Hong S, Franz M, Song D. Improving cross-platform binary analysis using representation learning via graph alignment. In: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY, USA: The Association for Computing Machinery (ACM); 2022. p. 151–63. doi:10.1145/3533767.3534383. [Google Scholar] [CrossRef]
96. Luo Z, Wang P, Wang B, Tang Y, Xie W, Zhou X, et al. VulHawk: cross-architecture vulnerability detection with entropy-based binary code search. In: Proceedings of the 2023 Network and Distributed System Security Symposium; 2023 Feb 27–Mar 3; San Diego, CA, USA. doi:10.14722/ndss.2023.24415. [Google Scholar] [CrossRef]
97. Collyer J, Watson T, Phillips I. FASER: binary code similarity search through the use of intermediate representations. arXiv:2310.03605. 2023. [Google Scholar]
98. Yang S, Dong C, Xiao Y, Cheng Y, Shi Z, Li Z, et al. Asteria-pro: enhancing deep learning-based binary code similarity detection by incorporating domain knowledge. ACM Trans Softw Eng Methodol. 2024;33(1):1–40. doi:10.1145/3604611. [Google Scholar] [CrossRef]
99. Wang H, Ma P, Wang S, Tang Q, Nie S, Wu S. sem2vec: semantics-aware assembly tracelet embedding. ACM Trans Softw Eng Methodol. 2023;32(4):1–34. doi:10.1145/3569933. [Google Scholar] [CrossRef]
100. Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, et al. CodeBERT: a pre-trained model for programming and natural languages. arXiv:2002.08155. 2020. [Google Scholar]
101. Kanade A, Maniatis P, Balakrishnan G, Shi K. Learning and evaluating contextual embedding of source code. In: ICML’20: Proceedings of the 37th International Conference on Machine Learning; 2020 Jul 13–18; Online. p. 5110–21. [Google Scholar]
102. He J, Zhou X, Xu B, Zhang T, Kim K, Yang Z, et al. Representation learning for stack overflow posts: how far are we? ACM Trans Softw Eng Methodol. 2024;33(3):1–24. doi:10.1145/3635711. [Google Scholar] [CrossRef]
103. Guo D, Ren S, Lu S, Feng Z, Tang D, Liu S, et al. GraphCodeBERT: pre-training code representations with data flow. arXiv:2009.08366. 2020. [Google Scholar]
104. Ahmad WU, Chakraborty S, Ray B, Chang KW. Unified pre-training for program understanding and generation. arXiv:2103.06333. 2021. [Google Scholar]
105. Wang Y, Wang W, Joty S, Hoi SCH. CodeT5: identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv:2109.00859. 2021. [Google Scholar]
106. CodeGPT. CodeGPT: AI agents for software development; 2025 [Internet]. [cited 2025 Sep 9]. Available from: https://codegpt.co/. [Google Scholar]
107. OpenAI. OpenAI Codex; 2021 [Internet]. [cited 2025 Sep 9]. Available from: https://openai.com/codex/. [Google Scholar]
108. Microsoft. Microsoft Copilot; 2023 [Internet]. [cited 2025 Sep 9]. Available from: https://copilot.microsoft.com/. [Google Scholar]
109. Hanif H, Maffeis S. VulBERTa: simplified source code pre-training for vulnerability detection. In: International Joint Conference on Neural Networks (IJCNN); 2022 Jul 18–23; Padua, Italy. p. 1–8. doi:10.1109/IJCNN55064.2022.9892280. [Google Scholar] [CrossRef]
110. Guo D, Lu S, Duan N, Wang Y, Zhou M, Yin J. UniXcoder: unified cross-modal pre-training for code representation. arXiv:2203.03850. 2022. [Google Scholar]
111. Chakraborty S, Ahmed T, Ding Y, Devanbu PT, Ray B. NatGen: generative pre-training by “naturalizing” source code. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY, USA: The Association for Computing Machinery (ACM); 2022. p. 18–30. doi:10.1145/3540250.3549162. [Google Scholar] [CrossRef]
112. Fried D, Aghajanyan A, Lin J, Wang S, Wallace E, Shi F, et al. InCoder: a generative model for code infilling and synthesis. arXiv:2204.05999. 2022. [Google Scholar]
113. Nijkamp E, Pang B, Hayashi H, Tu L, Wang H, Zhou Y, et al. CodeGen: an open large language model for code with multi-turn program synthesis. arXiv:2203.13474. 2022. [Google Scholar]
114. Zhou X, Xu B, Han D, Yang Z, He J, Lo D. CCBERT: self-supervised code change representation learning. In: 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME); 2023 Oct 1–6; Bogotá, Colombia. p. 182–93. doi:10.1109/ICSME58846.2023.00028. [Google Scholar] [CrossRef]
115. OpenAI. GPT-4; 2023 Mar 14 [Internet]. [cited 2025 Sep 9]. Available from: https://openai.com/index/gpt-4-research/. [Google Scholar]
116. Meta. Code Llama. 2023 [Internet]. [cited 2025 Sep 9]. Available from: https://codellama.dev/about. [Google Scholar]
117. DeepSeek. 2025 [Internet]. [cited 2025 Sep 9]. Available from: https://www.deepseek.com/. [Google Scholar]
118. Tan H, Luo Q, Li J, Zhang Y. LLM4Decompile: decompiling binary code with large language models. arXiv:2403.05286. 2024. [Google Scholar]
119. Hu P, Liang R, Chen K. DeGPT: optimizing decompiler output with LLM. In: Proceedings of the 2024 Network and Distributed System Security Symposium; 2024 Feb 26–Mar 1; San Diego, CA, USA. doi:10.14722/ndss.2024.24401. [Google Scholar] [CrossRef]
120. Shang X, Chen G, Cheng S, Wu B, Hu L, Li G, et al. BinMetric: a comprehensive binary analysis benchmark for large language models. arXiv:2505.07360. 2025. [Google Scholar]
121. Liu P, Sun C, Zheng Y, Feng X, Qin C, Wang Y, et al. LLM-powered static binary taint analysis. ACM Trans Softw Eng Methodol. 2025;34(3):1–36. doi:10.1145/3711816. [Google Scholar] [CrossRef]
122. Shang X, Cheng S, Chen G, Zhang Y, Hu L, Yu X, et al. How far have we gone in binary code understanding using large language models. In: 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME); 2024 Oct 6–11; Flagstaff, AZ, USA. p. 1–12. doi:10.1109/ICSME58944.2024.00012. [Google Scholar] [CrossRef]
123. Fang C, Miao N, Srivastav S, Liu J, Zhang R, Fang R, et al. Large language models for code analysis: do LLMs really do their job? In: 33rd USENIX Security Symposium (USENIX Security 24); 2024 Aug 14–16; Philadelphia, PA, USA. p. 829–46. [Google Scholar]
124. Ullah S, Han M, Pujar S, Pearce H, Coskun A, Stringhini G. LLMs cannot reliably identify and reason about security vulnerabilities (yet?): a comprehensive evaluation, framework, and benchmarks. In: 2024 IEEE Symposium on Security and Privacy (SP); 2024 May 19–23; San Francisco, CA, USA. p. 862–80. doi:10.1109/SP54263.2024.00210. [Google Scholar] [CrossRef]
125. Liu F, Liu Y, Shi L, Huang H, Wang R, Yang Z, et al. Exploring and evaluating hallucinations in LLM-powered code generation. arXiv:2404.00971. 2024. [Google Scholar]
126. Chen M, Zhu T, Zhang M, He Y, Lin M, Li P, et al. Explainer-guided targeted adversarial attacks against binary code similarity detection models. arXiv:2506.05430. 2025. [Google Scholar]
Copyright © 2025 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.