Open Access
ARTICLE
Graph-Augmented Multi-Agent Robust Root Cause Analysis in AIOps
1 Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co., Ltd., Nanjing, China
2 School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
* Corresponding Authors: Haodong Zou. Email: ; Long Yuan. Email:
(This article belongs to the Special Issue: Multimodal Learning for Big Data)
Computers, Materials & Continua 2026, 88(1), 40 https://doi.org/10.32604/cmc.2026.077908
Received 19 December 2025; Accepted 13 March 2026; Issue published 08 May 2026
Abstract
Root cause analysis (RCA), which leverages multi-modal observability data (including metrics, traces, and logs) to identify the fundamental source of system failures, is critical for ensuring the reliability of complex microservice systems. Traditionally, RCA has relied on human engineers to manually correlate these fragmented signals, which is a labor-intensive and error-prone process. Although recent AIOps advancements, particularly those leveraging Large Language Models (LLMs), aim to automate this workflow, they remain constrained by limitations. Existing methods often rely on single-modal data, restricting diagnostic comprehensiveness. Furthermore, approaches that utilize multi-modal data typically depend on simplistic temporal alignment, which fails to capture complex semantic relationships, or directly employ LLMs, which are prone to hallucinations and lack reliability. To address these issues, we propose a novel Graph-Augmented Multi-Agent Framework that synergizes the structural rigor of graph topology with the advanced semantic reasoning capabilities of LLMs. Our approach operates in two distinct phases designed to mimic human expert problem-solving. First, in the Anomaly Fusion Graph Construction phase, we employ a hybrid alignment strategy to bridge the gap between unstructured logs and structured traces. An LLM serves as a “semantic arbitrator” to resolve ambiguities in high-concurrency scenarios, creating a unified knowledge environment where each node is enriched with comprehensive health insights. Second, the Multi-Agent Collaborative Reasoning phase deploys a team of specialized agents to simulate human Site Reliability Engineering (SRE) workflows. A Navigator Agent efficiently narrows the search space via calculated fault gradients, while a Diagnoser Agent performs deep semantic analysis. Crucially, a Verifier Agent enforces an Adversarial Validation Protocol to mitigate hallucinations through rigorous counterfactual reasoning. Extensive experiments conducted on five diverse datasets demonstrate the robustness and effectiveness of our approach. The results show that our framework achieves an average F1-score of 88.4%, significantly outperforming state-of-the-art baselines by 4.6 percentage points, proving its ability to synthesize multi-modal information into actionable diagnostic insights.
1 Introduction
Modern microservice systems have become increasingly complex due to dynamic interactions and evolving runtime environments [1–4]. These systems often consist of hundreds or even thousands of fine-grained, interdependent subsystems, where issues in any one component can easily lead to performance problems at the top level. Therefore, to ensure system reliability, it is crucial to localize the root cause of these issues in a timely manner [5,6]. However, localizing the root cause is challenging due to the intricate dependencies between subsystems [7,8]. Each request typically follows complex invocation chains involving multiple components, such as services, service instances, and hosts, and the interactions between them. Moreover, the dynamic nature of these interactions, combined with the heterogeneity of system components, makes pinpointing the root cause a highly non-trivial task [9,10]. As illustrated in Fig. 1, the traditional RCA workflow relies heavily on human engineers to manually correlate fragmented signals from multi-modal data, a process that is often labor-intensive and error-prone.

Figure 1: The traditional, human-centric workflow for microservice root cause analysis (RCA). Multi-modal data (Traces, Logs, Metrics) from multiple infrastructure layers (Service, Container, Physical) triggers alerts. These alerts are converted into tickets for on-call engineers, who must manually correlate fragmented signals to localize the root cause.
To mitigate the inefficiencies of manual diagnosis, Artificial Intelligence for IT Operations (AIOps) has emerged as a paradigm shift, aiming to automate anomaly detection and root cause localization. Extensive research has been devoted to automated root cause analysis based on multi-modal data [7,10–13]. These approaches leverage multi-modal signals to capture different aspects of system behavior: metrics provide quantitative indicators of system health (e.g., CPU utilization, response time), traces reveal the invocation paths and dependencies across services, and logs record detailed event sequences and error messages. By integrating these multi-modal data sources, AIOps methods can enable comprehensive diagnosis of complex failures in microservice environments. Recent advancements include graph-based approaches that model service dependencies [14,15], machine learning techniques for anomaly detection [16,17], and causal inference methods that distinguish root causes from symptoms [6,9].
Challenges. Despite these advancements, achieving robust RCA remains challenging, primarily due to two key obstacles:
• The Intricacy of Multi-Modal Semantic Alignment. While multi-modal data is available, effectively fusing it remains an open problem due to the significant semantic gap between multi-modal signals [18,19]. Metrics provide quantitative trends, traces offer structural dependencies, while logs contain unstructured semantic details. Existing approaches often rely on rigid timestamp matching or shallow feature concatenation, which are insufficient to resolve the ambiguity when mapping unstructured logs to specific trace spans. This misalignment fails to capture the true causal dependencies between system anomalies and error logs, resulting in a fragmented understanding of the failure context [14,20].
• Cognitive Deficits of LLMs in Structural Reasoning. Although LLMs demonstrate impressive general knowledge, they fundamentally lack the spatial reasoning and rigorous logic required for traversing complex service topologies [2]. In scenarios with deep fault propagation chains, unconstrained LLMs often suffer from “causal confusion”—mistaking high-level symptoms for root causes due to their reliance on surface-level textual patterns rather than underlying structural logic. Furthermore, without a mechanism for self-correction or adversarial verification, LLMs are prone to confident hallucinations, generating plausible but non-existent failure explanations that are unacceptable in critical IT operations [5,8].
Our Approach. To address these challenges, we propose a Graph-Augmented Multi-Agent Framework for automated Root Cause Analysis. Unlike black-box models that directly predict root causes from raw data, our approach simulates the cognitive workflow of a human SRE team, combining the structural rigor of graph topology with the semantic reasoning of LLMs. The framework operates in two phases:
• Anomaly Fusion Graph Construction. To overcome the semantic gap between multi-modal data sources, we construct the Anomaly Fusion Graph (AFG) as a unified knowledge environment. This phase employs a novel multi-modal alignment strategy that integrates semantic similarity with spatio-temporal consistency to precisely map unstructured error logs to their corresponding trace spans. An LLM serves as a “semantic arbitrator” to resolve ambiguities in high-concurrency scenarios, ensuring accurate data association. The resulting graph transforms fragmented telemetry into a navigable topology, where each node is enriched with an LLM-synthesized health report that encapsulates both statistical anomalies and semantic failure contexts.
• Multi-Agent Collaborative Reasoning. We model the root cause localization as a sequential decision-making process executed by a team of specialized agents. The Diagnoser Agent performs deep-dive analysis on local nodes, utilizing Chain-of-Thought reasoning to differentiate between root causes and propagated symptoms. The Navigator Agent functions as a topology scout, calculating a “fault gradient” based on latency, error rates, and traffic volatility to prioritize the search path. To mitigate LLM hallucinations, the Verifier Agent adopts an Adversarial Validation Protocol, challenging hypotheses through consistency checks and counterfactual reasoning. Orchestrating this collaboration is the Coordinator Agent, which manages a global shared context and implements a stack-based backtracking mechanism to dynamically prune incorrect branches and ensure robust convergence.
Contributions. The main contributions of this paper are summarized as follows:
• We propose a novel Graph-Augmented Multi-Agent Framework that integrates the structural rigor of graph topology with the semantic reasoning of LLMs for robust RCA.
• We introduce an Adversarial Verification Mechanism and Dynamic Backtracking Strategy, significantly reducing LLM hallucinations and preventing the search from getting trapped in local optima.
• We construct a high-quality Anomaly Fusion Graph via a hybrid Log-Trace alignment technique, effectively solving the data fragmentation problem in high-concurrency scenarios.
• Extensive experiments demonstrate that our framework achieves superior accuracy and robustness compared to state-of-the-art baselines.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the preliminaries and formalizes the RCA problem. Section 4 details the AFG construction and Multi-Agent reasoning framework. Section 5 presents the experimental evaluation. Finally, Section 6 concludes the paper.
2 Related Work
2.1 Multi-Modal Data Fusion and Graph Construction
In microservice observability, the integration of multi-modal data—metrics, logs, and traces—is fundamental for constructing a comprehensive system view. Early approaches often analyzed these modalities in isolation, which limits the ability to capture complex fault propagation across service boundaries. To address this, recent research has increasingly focused on graph-based representations that fuse structural and behavioral data.
DeepTraLog [7] and LMGD [11] represent a class of methods that construct dependency graphs from distributed traces and log sequences, employing Graph Neural Networks (GNNs) to detect abnormal execution patterns. Twin-Graph [14] advances this paradigm by introducing attention mechanisms to weigh the importance of different observability sources within the graph. Similarly, Wang et al. [10] explore hybrid graph architectures augmented with multi-modal data to enhance detection accuracy. However, a common limitation of these approaches is their reliance on implicit fusion. Data sources are often concatenated into feature vectors for black-box deep learning models, lacking an explicit, navigable topology that preserves the semantic relationships between unstructured logs and structured traces. In contrast, our Anomaly Fusion Graph (AFG) explicitly aligns logs with trace spans using LLM Arbitration, creating a transparent knowledge environment that supports interpretable reasoning.
2.2 Automated Reasoning Strategies
Once the system data is modeled, the core challenge of Root Cause Analysis (RCA) becomes the reasoning strategy used to pinpoint the fault origin. Traditional AIOps methods largely rely on statistical correlation and causal inference. Approaches like Chain-of-Event [8] and MULAN [9] automatically derive event graphs or causal graphs to infer fault propagation paths. Yu et al. [21] employ sophisticated causal learning to localize root causes in multi-modal data. Han et al. [5] and Chen et al. [6] further advocate for holistic analysis, applying causal inference across the complete observability spectrum.
While these methods excel at processing numerical metrics and structured events, they often struggle with the semantic gap inherent in unstructured data. They can identify where a correlation break occurs but often fail to explain why, as they cannot comprehend the rich semantic context provided by error logs. Furthermore, they typically lack the interactive reasoning capabilities of human operators. Our work bridges this gap by introducing a Multi-Agent Framework. Unlike static causal models, our agents simulate the dynamic cognitive processes of SREs, employing Chain-of-Thought reasoning, hypothesis verification, and backtracking, to provide not just a localization result, but a semantically grounded diagnosis.
3 Preliminaries
In this section, we formally define the multi-modal data generated by microservice systems, including metrics, logs, and traces. We then introduce the Anomaly Fusion Graph ($\mathcal{G}$), whose construction is detailed in Section 4.1, and formalize the RCA problem.
We consider a microservice system $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ composed of $N$ interdependent services, which continuously emits three modalities of observability data during operation.
Definition 1 (Metric Data): Let $\mathcal{M} = \{M_1, M_2, \ldots, M_N\}$ denote the metric data, where each $M_i$ is a collection of time series $\{(t_k, v_k)\}$ recording quantitative health indicators of service $s_i$ (e.g., CPU utilization, response time), with $t_k$ a timestamp and $v_k$ the observed value.
Definition 2 (Log Data): Let $\mathcal{L} = \{l_1, l_2, \ldots, l_P\}$ denote the log data, where each entry $l_p = (t_p, s_p, c_p)$ consists of a timestamp $t_p$, the emitting service $s_p$, and the unstructured message content $c_p$ describing a runtime event or error.
Definition 3 (Trace Data): A trace $T \in \mathcal{T}$ is a set of spans $\{o_1, o_2, \ldots, o_Q\}$ capturing the end-to-end execution of a single request, where each span $o_q = (id_q, pid_q, s_q, t_q, d_q, st_q)$ records its span identifier, parent-span identifier, executing service, start time, duration, and status, respectively.
The aforementioned data modalities provide complementary views of the system status: metrics and logs capture the local health of individual services, while traces reveal the dynamic invocation dependencies between them. However, analyzing these data in isolation is insufficient for effective Root Cause Analysis (RCA), as faults often propagate across services through these dependencies.
Problem Definition. Given the multi-modal observability data $\mathcal{D} = (\mathcal{M}, \mathcal{L}, \mathcal{T})$ collected from the system $\mathcal{S}$ during a failure window, the goal of RCA is to identify the service $s^{*} \in \mathcal{S}$ at which the failure originated.
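To ground these definitions, the following minimal sketch models the three modalities as plain Python records. It is an illustration only: the field names (e.g., parent_id, status) are assumptions for exposition rather than the exact schema of our implementation, and later sketches in this paper reuse these types.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricPoint:
    """One sample of a metric time series (cf. Definition 1)."""
    service: str       # emitting service
    name: str          # e.g., "cpu_usage", "response_time"
    timestamp: float   # epoch seconds
    value: float

@dataclass
class LogEntry:
    """One unstructured log line (cf. Definition 2)."""
    service: str
    timestamp: float
    level: str         # e.g., "INFO", "ERROR"
    content: str       # raw message text

@dataclass
class Span:
    """One span of a distributed trace (cf. Definition 3)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None for the root span of a request
    service: str
    start: float               # epoch seconds
    duration: float            # seconds
    status: str                # e.g., "OK", "ERROR"
```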
4 Methodology
We propose a Graph-Augmented Multi-Agent Framework for root cause analysis (RCA) in microservice systems. Unlike traditional pipeline approaches that treat data processing and reasoning as isolated stages, our framework establishes a dynamic symbiosis between a structured knowledge environment—the Anomaly Fusion Graph ($\mathcal{G}$)—and a team of specialized LLM-driven agents that reason over it.
The framework operates on a Perceive-Reason-Act loop. The architecture is composed of two core layers:
• Anomaly Fusion Graph Construction: This layer transforms raw, multi-modal observability streams (metrics, logs, traces) into a structured graph $\mathcal{G}$, in which each node carries an LLM-synthesized health report and each edge carries the traffic statistics of a service dependency.
• Multi-Agent Collaborative Reasoning: We employ a team of specialized agents to emulate human SRE roles: the Diagnoser assesses local service states, the Navigator determines the propagation path, the Verifier enforces robust decision-making through multi-perspective consensus, and the Coordinator manages the global shared context and backtracking logic. This collaborative architecture enables the system to autonomously navigate the complex dependency topology, perform root cause analysis, and accurately pinpoint the origin of cascading failures.
4.1 Anomaly Fusion Graph Construction
To facilitate the agents’ diagnostic process, we construct the Anomaly Fusion Graph ($\mathcal{G}$) as a unified knowledge environment. Fig. 2 illustrates the construction pipeline, which consists of multi-modal data alignment followed by graph construction.

Figure 2: The construction process of the anomaly fusion graph. It illustrates the transformation from raw abnormal data to a structured graph via multi-modal data alignment (integrating semantic similarity and spatio-temporal consistency with LLM arbitration) and graph construction (Topology extraction and LLM-based summarizer).
Definition 4 (Anomaly Fusion Graph): The Anomaly Fusion Graph is defined as a directed attributed graph $\mathcal{G} = (V, E, X_V, X_E)$, where $V$ is the set of service nodes, $E \subseteq V \times V$ is the set of directed invocation edges extracted from traces, $X_V$ assigns each node an LLM-synthesized health report, and $X_E$ assigns each edge its traffic statistics and interaction logs.
Finally, $\mathcal{G}$ serves as the shared knowledge environment over which the agents of Section 4.2 conduct their collaborative reasoning.
4.1.1 Multi-Modal Data Alignment
The primary objective of data alignment is to overcome the fragmentation inherent in raw observability data, where logs, traces, and metrics lack explicit cross-references. For instance, a log entry often does not natively point to its corresponding trace span, and metric anomalies are not automatically linked to the requests that triggered them. To bridge this semantic gap, we employ a two-stage alignment strategy. This approach aims to fuse scattered signals into unified anomaly events, thereby providing a complete context for the subsequent graph construction and agent reasoning. The strategy progressively narrows down associations from coarse spatial-temporal grouping to fine-grained semantic matching.
Proximity Evaluation. To efficiently handle high volumes of telemetry data, we first perform a coarse-grained grouping based on service locality and temporal proximity. Let $a_i$ and $a_j$ denote two anomaly instances with timestamps $t_i$ and $t_j$; they are placed in the same group if $\mathrm{svc}(a_i) = \mathrm{svc}(a_j)$ and $|t_i - t_j| \le \delta_t$, where $\mathrm{svc}(\cdot)$ returns the emitting service of an anomaly and $\delta_t$ is the time-window threshold.
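The coarse-grained grouping can be realized in a single pass per service. The sketch below assumes anomaly instances carrying service and timestamp fields and an illustrative window of $\delta_t = 30$ s; both are assumptions for exposition, not tuned values from our experiments.

```python
from collections import defaultdict
from typing import Dict, List

def group_by_proximity(anomalies: List[dict], delta_t: float = 30.0) -> List[List[dict]]:
    """Coarse-grained grouping: same service, timestamps within delta_t seconds.

    Each anomaly is a dict with at least 'service' and 'timestamp' keys.
    """
    by_service: Dict[str, List[dict]] = defaultdict(list)
    for a in anomalies:
        by_service[a["service"]].append(a)

    groups: List[List[dict]] = []
    for events in by_service.values():
        events.sort(key=lambda a: a["timestamp"])
        current = [events[0]]
        for a in events[1:]:
            if a["timestamp"] - current[-1]["timestamp"] <= delta_t:
                current.append(a)          # still inside the temporal window
            else:
                groups.append(current)     # close the group, start a new one
                current = [a]
        groups.append(current)
    return groups
```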
Hybrid Score Calculation & LLM Arbitration. While proximity grouping is effective for aggregating continuous metric signals, it is insufficient for discrete log events. In high-concurrency environments, multiple spans often execute simultaneously on the same service, as shown in Fig. 3. A simple temporal overlap could erroneously link a failure log to an unrelated successful span, thereby corrupting the causal evidence required for diagnosis. To resolve this log-span association problem and map logs to their exact execution context, we calculate a hybrid association score for each candidate log-span pair, combining the semantic similarity between the log content and the span’s operation context with their spatio-temporal and status consistency.

Figure 3: The challenge of log-trace alignment in high-concurrency scenarios. Multiple spans (e.g., Span 2, 3, 4) execute concurrently on the same service. A simple temporal match is ambiguous; for instance, Log 4 could temporally belong to either Span 1 or Span 5, necessitating semantic analysis for precise association.
Matches with scores exceeding a high-confidence threshold $\theta_h$ are accepted directly, while ambiguous pairs whose scores fall between a lower bound $\theta_l$ and $\theta_h$ are escalated to the LLM, which acts as a “semantic arbitrator” and adjudicates the association (see the LLM Arbitration prompt in Appendix A).
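The sketch below illustrates one plausible realization of the hybrid scoring and escalation logic, reusing the record types from Section 3. The weights, decay constant, and thresholds ($\theta_h = 0.85$, $\theta_l = 0.5$) are illustrative assumptions, and embed stands in for any sentence-embedding model rather than a specific library call.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_score(log: LogEntry, span: Span, embed: Callable[[str], Vector],
                 w_sem: float = 0.6, w_time: float = 0.3, w_status: float = 0.1,
                 tau: float = 5.0) -> float:
    """Hybrid association score in [0, 1] for one candidate log-span pair."""
    semantic = cosine(embed(log.content), embed(f"{span.service} {span.status}"))
    temporal = math.exp(-abs(log.timestamp - span.start) / tau)  # decays with time gap
    status = 1.0 if (log.level == "ERROR") == (span.status == "ERROR") else 0.0
    return w_sem * semantic + w_time * temporal + w_status * status

def associate(log: LogEntry, spans: List[Span], embed: Callable[[str], Vector],
              theta_h: float = 0.85, theta_l: float = 0.5) -> Tuple[str, Optional[Span]]:
    """Accept confident matches directly; escalate ambiguous ones to the arbitrator."""
    best = max(spans, key=lambda s: hybrid_score(log, s, embed))
    score = hybrid_score(log, best, embed)
    if score >= theta_h:
        return "accept", best       # high-confidence statistical match
    if score >= theta_l:
        return "arbitrate", best    # borderline: defer to the LLM prompt in Appendix A
    return "reject", None           # no plausible span for this log
```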
Algorithm 1 details the execution flow of this multi-modal alignment strategy. The process begins with Anomaly Extraction and Coarse-grained Grouping (Lines 1–14). The algorithm first identifies anomaly instances from raw metrics, logs, and traces to form a unified anomaly set $\mathcal{A}$, and partitions it into coarse-grained groups via the proximity rule above. The subsequent fine-grained stage computes hybrid association scores for the candidate log-span pairs within each group, accepts high-confidence matches directly, and escalates ambiguous cases to LLM arbitration, yielding the fused anomaly groups used for graph construction.

With the fused anomaly groups established, we synthesize the final graph structure and attributes. Our framework constructs a semantic graph where nodes and edges store explicit natural language descriptions and statistical facts. This design allows LLM-based agents to reason directly over the graph content without information loss.
Topology Extraction. The graph topology is derived directly from trace data $\mathcal{T}$: whenever a parent span executing on service $s_i$ invokes a child span on a different service $s_j$, a directed edge $(s_i, s_j)$ is added to $E$, yielding the service dependency topology of the failure window.
Node Attribute Synthesis. We enrich each node $v \in V$ with a health report produced by an LLM-based summarizer, which condenses the node’s fused anomaly groups (metric alerts, error logs, and abnormal spans) into a concise natural-language description (see the Node Health Summarization template in Appendix A).
This report serves as the primary context for the Diagnoser Agent to analyze the service’s local health state.
Edge Attribute Extraction. For each dependency edge $(s_i, s_j) \in E$, we aggregate the spans traversing it into traffic statistics (request count, average latency, and error rate) and attach the relevant interaction logs, providing the evidence the Navigator Agent later uses to compute fault gradients.
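As an illustration of the topology and edge-attribute extraction, the following sketch derives service-level edges from parent-child span pairs and aggregates per-edge traffic statistics. It assumes the Span records sketched in Section 3 and is not the exact implementation.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

def build_topology(spans: Iterable[Span]) -> Dict[Tuple[str, str], dict]:
    """Derive service-level edges and their traffic statistics from trace spans.

    A parent->child span pair whose services differ induces a directed call
    edge; the per-edge request count, average latency, and error rate later
    feed the Navigator's fault gradient.
    """
    by_id = {s.span_id: s for s in spans}
    edge_spans: Dict[Tuple[str, str], List[Span]] = defaultdict(list)
    for s in by_id.values():
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent and parent.service != s.service:
            edge_spans[(parent.service, s.service)].append(s)

    return {
        edge: {
            "traffic": len(calls),                            # request count
            "avg_latency": mean(c.duration for c in calls),   # seconds
            "error_rate": sum(c.status == "ERROR" for c in calls) / len(calls),
        }
        for edge, calls in edge_spans.items()
    }
```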
Algorithm 2 formalizes the complete graph construction pipeline. It starts by invoking the Multi-modal Data Alignment (Line 1) to generate fused anomaly groups. It then extracts the service topology from traces, invokes the LLM summarizer to synthesize each node’s health report, and computes the traffic statistics attached to each edge, returning the complete Anomaly Fusion Graph $\mathcal{G}$.

4.2 Multi-Agent Collaborative Reasoning
In complex microservice environments, manual root cause analysis typically relies on the expertise of Site Reliability Engineers (SREs). These engineers iteratively traverse service dependencies, analyze local logs, and verify hypotheses. While effective, this human-centric process is labor-intensive and time-consuming, often leading to prolonged Mean Time To Repair (MTTR). To automate this cognitive workflow, we propose a Multi-Agent Collaborative Reasoning framework that simulates human expert decision-making. Building upon the constructed Anomaly Fusion Graph $\mathcal{G}$, we cast root cause localization as a sequential decision-making process over the graph topology, executed by a team of specialized agents.
We decompose the reasoning task into four specialized roles, effectively separating state evaluation, path selection, and hypothesis verification, as illustrated in Fig. 4. Detailed prompts are provided in Appendix A.

Figure 4: The multi-agent collaborative reasoning architecture. The coordinator serves as the central hub, orchestrating the diagnostic workflow through three key interactions: (1) dispatching the diagnoser for state evaluation, (2) invoking the verifier for hypothesis validation, and (3) directing the navigator for topology traversal. This collaborative loop simulates a human SRE team’s decision-making process.
Diagnoser Agent (State Evaluator). This agent functions as the termination condition of the search, emulating a clinical analyst who examines a patient’s symptoms. Upon visiting a node $v$, it examines the node’s health report and local error logs via Chain-of-Thought reasoning, distinguishing intrinsic evidence (e.g., stack traces, resource saturation) from extrinsic symptoms (e.g., timeouts inflicted by upstream failures), and classifies the node as Root Cause, Propagator, or Healthy with an associated confidence score.
Note that the score is an LLM-reported confidence rather than a calibrated probability; it is treated as a heuristic signal, and a Root Cause classification is never finalized without the Verifier’s validation described below.
Navigator Agent (Transition Policy). This agent implements the transition policy of the search. When the current node is classified as a Propagator, the Navigator ranks its outgoing dependency edges by a “fault gradient” computed from latency deviation, error propagation rate, and traffic volatility, and directs the search toward the most suspicious upstream candidate.
This mechanism ensures that the search path strictly follows the causal chain of fault propagation, avoiding irrelevant branches.
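A minimal sketch of the fault-gradient ranking is shown below, combining the three signals named in the Navigator prompt (latency deviation, error propagation, traffic volatility) over the edge statistics produced by build_topology() above. The 0.4/0.4/0.2 weighting and the baseline handling are assumptions for exposition.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]   # (caller service, callee service)

def fault_gradient(edges: Dict[Edge, dict], baseline: Dict[Edge, dict],
                   visited: Set[str]) -> List[str]:
    """Rank downstream candidates by suspicion, most suspicious first."""
    ranked = []
    for (src, dst), cur in edges.items():
        if dst in visited:
            continue                                 # ruled out earlier in the search
        base = baseline.get((src, dst), cur)         # fall back to current stats
        lat_dev = max(0.0, cur["avg_latency"] / max(base["avg_latency"], 1e-9) - 1.0)
        err = cur["error_rate"]                      # error propagation signal
        vol = abs(cur["traffic"] - base["traffic"]) / max(base["traffic"], 1)
        suspicion = 0.4 * lat_dev + 0.4 * err + 0.2 * vol
        ranked.append((suspicion, dst))
    return [dst for _, dst in sorted(ranked, reverse=True)]
```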
Verifier Agent (Consensus-Based Validator). To mitigate LLM hallucination risks, the Verifier Agent operates under an Adversarial Validation Protocol. Unlike the Diagnoser which seeks to confirm a hypothesis, the Verifier actively attempts to falsify it. It employs a Multi-Perspective Ensemble strategy with three cognitive angles: (1) Consistency Check: verifying if the hypothesis contradicts any “Golden Signals” (e.g., claiming CPU overload when CPU usage is flat); (2) Causal Sufficiency: checking if the root cause sufficiently explains the propagation chain observed along the current search path; and (3) Counterfactual Reasoning: adopting a devil’s-advocate stance and asking whether the downstream failures would still occur if the hypothesized root cause were healthy.
The diagnosis is finalized only if the consensus score aggregated across the three perspectives exceeds an acceptance threshold; otherwise, the hypothesis is rejected and control returns to the Coordinator for backtracking.
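The consensus rule can be as simple as majority voting over the three adversarial checks. The sketch below assumes equal weighting and a two-thirds acceptance threshold; both are illustrative choices, not the calibrated values of our system.

```python
def consensus(consistency_ok: bool, sufficiency_ok: bool,
              counterfactual_ok: bool, threshold: float = 2 / 3) -> bool:
    """Aggregate the Verifier's three adversarial checks into one verdict.

    Each argument is the boolean verdict parsed from the corresponding
    prompt in Appendix A; the hypothesis is finalized only when the
    consensus score clears the acceptance threshold.
    """
    votes = [consistency_ok, sufficiency_ok, counterfactual_ok]
    score = sum(votes) / len(votes)     # consensus score in {0, 1/3, 2/3, 1}
    return score >= threshold           # e.g., at least two of three checks pass
```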
Coordinator Agent (Memory & Control). Acting as the central orchestrator, the Coordinator manages the Shared Diagnostic Context (SDC), a dynamic blackboard where agents exchange insights. This shared memory ensures all agents reason over a consistent global view. The Coordinator maintains a Global State Board tracking: a Visited Set to prevent cycles, a Hypothesis Stack (LIFO) recording decision points for backtracking, and a Contextual Summary providing a running narrative of findings (e.g., “Service A is slow, but Service B is healthy”).
The Coordinator operates as a State-Aware Controller, executing the reasoning loop in Algorithm 3. It implements two key mechanisms. (1) Dynamic Context Injection: assembling prompts with the Contextual Summary, ensuring agents are aware of previous findings (e.g., preventing the Navigator from suggesting ruled-out paths). (2) Stack-Based Backtracking: when hitting a dead end, the Coordinator pops the last branching point from the Hypothesis Stack and redirects the search to the next-best candidate, pruning incorrect branches.
The collaboration follows a structured Investigate-and-Walk protocol, mimicking “tracing the error back to its source,” as formalized in Algorithm 3. At each step, the Coordinator first invokes the Diagnoser (Step 1) for local diagnosis on the current node. If the node is hypothesized as the Root Cause, the Verifier is dispatched (Step 2) for adversarial validation; if it is a Propagator, the Navigator is directed (Step 3) to select the next node along the fault gradient. Whenever a hypothesis is refuted or a dead end is reached, the Coordinator pops the Hypothesis Stack and backtracks to the next-best candidate; the loop terminates once a verified root cause is found.
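The following sketch condenses the Investigate-and-Walk loop with stack-based backtracking. The agent calls are passed in as plain functions, and the step budget is an assumed safeguard rather than a parameter taken from Algorithm 3.

```python
from typing import Callable, List, Optional, Set

def investigate_and_walk(entry_service: str,
                         diagnose: Callable[[str], str],
                         verify: Callable[[str, List[str]], bool],
                         navigate: Callable[[str, Set[str]], List[str]],
                         max_steps: int = 20) -> Optional[str]:
    """Coordinator loop: diagnose, verify, walk the fault gradient, backtrack."""
    visited: Set[str] = set()
    stack: List[List[str]] = []   # Hypothesis Stack: alternatives per branch point
    path: List[str] = []
    current = entry_service
    for _ in range(max_steps):
        visited.add(current)
        path.append(current)
        state = diagnose(current)                          # Step 1: Diagnoser verdict
        if state == "Root Cause" and verify(current, path):
            return current                                 # Step 2: Verifier accepts
        candidates = navigate(current, visited) if state == "Propagator" else []
        if candidates:
            stack.append(candidates[1:])                   # remember next-best branches
            current = candidates[0]                        # Step 3: follow fault gradient
            continue
        while stack and not stack[-1]:                     # unwind exhausted branch points
            stack.pop()
        if not stack:
            return None                                    # search space exhausted
        current = stack[-1].pop(0)                         # backtrack to next-best candidate
    return None
```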

Fig. 5 illustrates the agent collaboration through a real failure case from the AIOps dataset. When order-service reports timeouts, the workflow proceeds as follows: (1) The Diagnoser identifies extrinsic symptoms (“Connection timeout to db-pool”) and labels it as a Propagator rather than the root cause. (2) The Navigator follows the fault gradient along the dependency edge toward db-pool, whose latency and error rate have spiked. (3) On db-pool, the Diagnoser finds intrinsic evidence of connection exhaustion and hypothesizes it as the Root Cause. (4) The Verifier’s consistency, sufficiency, and counterfactual checks concur, and the Coordinator finalizes db-pool as the diagnosis.

Figure 5: Running example of tracing a database connection failure.
5 Experimental Evaluation
This section presents a comprehensive evaluation of our Graph-Augmented Multi-Agent Framework for root cause localization. We assess the framework’s ability to accurately pinpoint fault origins in complex microservice environments, validating the effectiveness of the Anomaly Fusion Graph and the collaborative reasoning mechanism. All experiments were conducted on a server equipped with an Intel Xeon Gold 6248R 3.0 GHz CPU, 512 GB of RAM, and four NVIDIA A100 GPUs.
5.1 Experimental Setup
Datasets. We evaluate our framework on five complementary datasets, shown in Table 1, covering different scales and scenarios of microservice and distributed systems. The first dataset is from the widely-used Train Ticket benchmark [1,7,22], utilizing its V0.2.0 release with 45 services to generate a dataset with 73 faults across 14 cases, resulting in over 132,000 traces and 7.7 million log messages. The second dataset utilizes the Online Boutique system [21], into which 56 faults (42 resource issues and 14 code defects) were injected to obtain traces, logs, and metrics. The third dataset is MSDS [23], a multi-modal dataset collected from an OpenStack distributed system with over 80 fault instances, containing metrics from 5 physical nodes (7 metrics per node including CPU and RAM usage), distributed logs with 23 features, and traces with complete span information. The fourth dataset is from the AIOps Challenge 2025 [24], which provides a comprehensive collection of observability data from a microservice system with 400 fault instances. It includes fine-grained infrastructure metrics (e.g., CPU, memory, disk I/O) from Prometheus, application performance monitoring (APM) metrics from DeepFlow, distributed traces from Jaeger, and container logs from Filebeat. The fifth dataset is self-collected from the OpenTelemetry Demo [25], which contains 45 fault scenarios (CPU contention, network delays, service crashes) injected over 7 days with 5-min durations each.

Baselines. To rigorously evaluate the contribution of our proposed framework, we compare it against two representative baseline categories. The first is DeepTraLog [7], a state-of-the-art graph-based root cause analysis framework. It constructs a service dependency graph and utilizes Graph Neural Networks (GNNs) to fuse traces and logs for anomaly detection and localization. This baseline represents the current advanced deep learning approach in the field. The second baseline is Direct LLM Diagnosis. This baseline serves as a naive upper bound for unstructured LLM usage, demonstrating the limitations of direct LLM application (context window constraints, noise sensitivity, hallucinations) and highlighting the necessity of our AFG construction and multi-agent framework. We note that while recent causal-graph methods (e.g., Nezha [21]) exist, they focus on statistical causal discovery from numerical metrics, whereas our framework emphasizes LLM-driven semantic reasoning over unstructured content. These represent fundamentally different technical approaches. Our baseline selection emphasizes the contrast between graph-based deep learning (DeepTraLog) and LLM-based semantic reasoning approaches.
Evaluation Metrics. We employ standard information retrieval metrics to evaluate the accuracy of root cause localization. For each fault instance, we compare the root cause service identified by the framework against the ground truth. Specifically, we calculate Precision as the proportion of correctly identified root causes among all diagnoses produced by the system ($\mathrm{Precision} = TP/(TP+FP)$), Recall as the proportion of ground-truth root causes that are successfully identified ($\mathrm{Recall} = TP/(TP+FN)$), and the F1-score as the harmonic mean of the two ($F1 = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}/(\mathrm{Precision}+\mathrm{Recall})$).
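Under one natural reading of these definitions, where a missing diagnosis counts against Recall but not Precision, the metrics can be computed as follows; the dictionary-based bookkeeping is an illustrative assumption.

```python
from typing import Dict, Optional, Tuple

def localization_scores(predictions: Dict[str, Optional[str]],
                        ground_truth: Dict[str, str]) -> Tuple[float, float, float]:
    """Precision / Recall / F1 for service-level root cause localization.

    Both dicts map a fault id to a root-cause service; a prediction of
    None means the system produced no diagnosis for that fault.
    """
    tp = sum(1 for fid, svc in predictions.items()
             if svc is not None and svc == ground_truth.get(fid))
    produced = sum(1 for svc in predictions.values() if svc is not None)
    precision = tp / produced if produced else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```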
Implementation Details. We implement our framework in Python 3.9. The Anomaly Fusion Graph is constructed using Drain3 for log parsing and OpenTelemetry for trace processing. For the agent backend, we utilize the OpenAI GPT-4 API. The default thresholds (the temporal grouping window $\delta_t$ and the association thresholds $\theta_h$ and $\theta_l$ introduced in Section 4.1) are set empirically on a held-out subset of each dataset.
5.2 Experimental Results
5.2.1 Root Cause Localization Performance
We compare the performance of our Graph-Augmented Multi-Agent Framework against the baselines across all five datasets. The results are summarized in Table 2.

Results Analysis. As shown in Table 2, our framework significantly outperforms both baselines. DeepTraLog achieves an average F1-score of 83.8%, whereas our framework reaches 88.4%, an average improvement of 4.6 percentage points; Direct LLM Diagnosis trails both, reflecting the context-window constraints, noise sensitivity, and hallucination issues noted above. The gain stems from structuring fragmented telemetry into the Anomaly Fusion Graph and constraining the LLM’s reasoning through specialized, mutually verifying agents.
5.2.2 Anomaly Fusion Graph Construction Quality
The quality of the Anomaly Fusion Graph is a prerequisite for effective multi-agent reasoning. We evaluate the accuracy of associating unstructured logs with distributed traces, which is the core step in graph construction. We compare our method against two baselines: Temporal Window, which links logs to traces based solely on timestamps, and Semantic-Temporal, which combines time windows with keyword matching.
Analysis. Table 3 presents the log-trace association results. Our method achieves a 97.5% average F1-score, significantly outperforming both baselines. Temporal Window suffers from low precision (65.9%) under high concurrency, where multiple requests overlap temporally. Semantic-Temporal improves on this but still fails when logs lack explicit identifiers. Our approach leverages semantic embeddings and structural correlation to resolve such ambiguities, ensuring high-quality graph construction.

5.2.3 Robustness Analysis: Efficacy of Consensus-Based Verification
To evaluate the impact of our Consensus-based Validator on system robustness, we compare three verification strategies across the AIOps dataset. This experiment specifically isolates the contribution of the multi-perspective ensemble mechanism against simpler alternatives.
• No Verification: The system relies solely on the Diagnoser’s initial judgment.
• Single-Pass Verification: The Verifier employs a standard, single-prompt validation (“Is this the root cause?”).
• Consensus-Based Verification (Ours): The Verifier aggregates results from Symptom Consistency, Causal Sufficiency, and Counterfactual Reasoning checks.
Analysis. As shown in Table 4, No Verification suffers from low Precision (79.5%), indicating high false positives where the Diagnoser is misled by local symptoms. Single-Pass Verification improves Precision to 84.2% by filtering obvious errors. Our Consensus-based Verification achieves the highest performance (88.1% Precision, 89.1% F1-Score). The multi-perspective ensemble resolves ambiguous cases by requiring counterfactual and causal consistency checks simultaneously. This demonstrates that the consensus mechanism significantly enhances robustness against hallucinations.

5.2.4 Ablation Study: Component Contribution
To quantify the contribution of each core component in our multi-agent framework, we conduct a comprehensive ablation study. We report the average performance across all five datasets to provide a holistic view. The variants are defined as follows:
• Full Framework: The complete system with all agents and mechanisms enabled.
• w/o Verifier Agent: The secondary validation step is removed; the Diagnoser’s decision is considered final.
• w/o Dynamic Backtracking: The Coordinator disables the backtracking mechanism; the search terminates if a dead end is reached.
• w/o Heuristic Navigation: The Navigator selects the next service node randomly from the dependencies, ignoring the edge attribute-based suspicion ranking.
Analysis. Table 5 demonstrates that every component is essential. (1) Verifier Agent: Its removal causes the most significant Precision drop (−8.5%), confirming its role in filtering false positives caused by local symptom noise. (2) Dynamic Backtracking: Disabling it primarily hurts Recall (−13.6%), as the system cannot recover from wrong turns and therefore misses root causes. (3) Heuristic Navigation: Replacing it with a random walk degrades F1 to 75.3%, highlighting the importance of edge attributes in guiding the search along fault propagation paths.

5.2.5 Runtime and Cost Analysis
We report inference latency and token consumption across all datasets. Table 6 summarizes the results.

The multi-agent reasoning phase dominates the total latency (avg. 18.1 s), involving 3–5 agent calls per fault. The AFG construction phase (avg. 3.3 s) includes data alignment and LLM summarization. Token consumption averages 9.7K tokens per fault, with the AIOps dataset requiring the most (12.3K) due to its larger scale. LLM arbitration is triggered for 13.8%–20.3% of log-trace pairs across datasets, with the highest frequency (20.3%) in the high-concurrency AIOps dataset. While arbitration increases token consumption compared to pure statistical methods, it improves alignment precision by 18–22 percentage points (see Table 3). At GPT-4 Turbo pricing, the average cost per fault is approximately $0.15.
5.2.6 Limitations and Failure Case Analysis
We analyze failure cases where our framework produced incorrect diagnoses to understand its limitations. Table 7 summarizes the failure distribution.

The primary failure mode is ambiguous propagation chains (59% of failures), where multiple services fail simultaneously with similar symptoms such as cascading timeouts, causing the Diagnoser to misidentify a propagator as the root cause. The second major cause is insufficient log semantics (28%), where generic log messages like “Error occurred” lack discriminative features, leading to inconclusive verification results. Regarding backtracking efficiency, in dense topologies with fan-out greater than 10, backtracking degrades to near-exhaustive search; on average it is triggered 1.8 times per fault, but in 12% of AIOps instances with complex dependencies, it exceeds 5 iterations and increases latency by approximately 40%. For LLM arbitration, which achieves 94.2% accuracy overall, errors mainly occur with domain-specific jargon absent from training data and extremely short time windows (<10 ms) where temporal features become unreliable.
6 Conclusion
In this paper, we proposed a Graph-Augmented Multi-Agent Framework for robust root cause analysis in AIOps. By constructing an Anomaly Fusion Graph through LLM-arbitrated log-trace alignment, we transform scattered telemetry signals into a unified semantic topology. Building upon this graph, a collaborative team of specialized agents—Diagnoser, Navigator, Verifier, and Coordinator—operates via the structured Investigate-and-Walk protocol, employing Chain-of-Thought reasoning, adversarial verification, and dynamic backtracking to ensure diagnostic robustness. Extensive experiments on multiple datasets demonstrate that our framework achieves superior accuracy compared to baselines.
Acknowledgement: This work is supported by the Science and Technology Project of State Grid Corporation of Jiangsu Electric Power—Research on “Key Technologies Research and Application Service for Large-Scale Heterogeneous Monitoring Data Processing in Information Systems” (No. SGJSXT00XTJS2500275).
Funding Statement: This work is supported by the Science and Technology Project of State Grid Corporation of Jiangsu Electric Power—Research on “Key Technologies Research and Application Service for Large-Scale Heterogeneous Monitoring Data Processing in Information Systems” (No. SGJSXT00XTJS2500275).
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Haodong Zou; methodology, Haodong Zou; software, Haodong Zou and Yichen Zhao; validation, Xin Chen, Ling Wang and Jinghang Yu; formal analysis, Haodong Zou and Long Yuan; investigation, Yichen Zhao and Xin Chen; resources, Jinghang Yu; data curation, Haodong Zou and Yichen Zhao; writing—original draft preparation, Haodong Zou; writing—review and editing, Haodong Zou, Ling Wang and Jinghang Yu; visualization, Yichen Zhao and Xin Chen; supervision, Jinghang Yu; project administration, Jinghang Yu; funding acquisition, Jinghang Yu. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: Data are unavailable due to privacy or ethical restrictions.
Ethics Approval: Not applicable.
Conflicts of Interest: Authors Yichen Zhao, Xin Chen, Ling Wang and Jinghang Yu were employed by the company State Grid Jiangsu Electric Power Co., Ltd. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
| RCA | Root Cause Analysis |
| AIOps | Artificial Intelligence for IT Operations |
| LLM | Large Language Model |
| AFG | Anomaly Fusion Graph |
Appendix A: Prompt Templates
This appendix presents the prompt templates used by each agent in our framework.
LLM Arbitration Prompt Template
You are a semantic arbitrator for microservice log-trace alignment. Given an error log and candidate trace spans, determine which span most likely corresponds to the log.
Error Log: {log_content}
Timestamp: {log_timestamp}
Service: {log_service}
Candidate Spans:
{span_1}: {span_1_details} (timestamp: {span_1_time})
{span_2}: {span_2_details} (timestamp: {span_2_time})
...
Based on semantic similarity, spatio-temporal consistency (same service and close time), and error/status consistency, assign a suspicion score to each candidate span (0–1) and select the most likely span.
Output format (JSON only):
{
"span_id": "<best_span_id>",
"score": 0.0-1.0,
"rationale": ["brief reason 1", "brief reason 2"]
}
Node Health Summarization Prompt Template
Summarize the following anomaly events for service {service_name} into a concise health report:
Metric Anomalies: {metric_alerts}
Error Logs: {error_logs}
Trace Anomalies: {trace_spans}
Generate a structured report that summarizes the observed symptoms:
1. Primary symptoms (what is failing/degrading)
2. Resource/utilization anomalies (if any)
3. Error patterns (types, frequency, and representative messages)
4. Trace/latency signals (timeouts, retries, slow spans) if present
Format: [Service: {service_name}] [Health: Normal/Degraded/Anomalous]
[Summary: {concise_description}]
Coordinator Agent Prompt Template
You are the Coordinator (Memory & Control). Maintain the investigation state and decide the next action.
State:
- Current Service: {current_service}
- Current Path: {current_path}
- Visited Services: {visited_services}
- Hypothesis Stack: {stack_state}
- Contextual Summary: {current_summary}
Inputs (optional):
- Diagnoser Output: {diagnoser_output}
- Verifier Output: {verifier_output}
- Navigator Output: {navigator_output}
Task:
1. Update the contextual summary using the newest agent outputs.
2. Decide the next action:
- DIAGNOSE: run Diagnoser on current service
- VERIFY: run Verifier on current root-cause hypothesis
- NAVIGATE: pick next service from Navigator ranking
- BACKTRACK: pop hypothesis stack and try next candidate
- STOP: finalize root cause
Output (JSON only):
{
"action": "DIAGNOSE|VERIFY|NAVIGATE|BACKTRACK|STOP",
"next_service": "<service_name_or_null>",
"update_summary": "<one-sentence summary>",
"update_stack": "<push/pop/none>",
"final_root": "<service_name_or_null>"
}
Diagnoser Agent Prompt Template
You are a root cause diagnosis agent (State Evaluator). Analyze the service health report and local error logs, then classify the service state as Root Cause/Propagator/Healthy.
Service: {service_name}
Health Report: {health_report}
Local Error Logs: {local_logs}
Current Path: {current_path}
Contextual Summary: {contextual_summary}
Reasoning steps (Chain-of-Thought): (1) Examine the health report for intrinsic errors (stack traces, unhandled exceptions, resource saturation e.g. CPU/Memory) indicating this node is the culprit; (2) Identify extrinsic symptoms (timeouts, connection refusals, circuit breaker events) indicating this node is a victim of upstream failures; (3) Correlate with local error logs and filter noise; (4) Determine state: Root Cause (origin of failure), Propagator (victim passing on the error), or Healthy.
Output (JSON only):
{
"state": "Root Cause|Propagator|Healthy",
"confidence": 0.0-1.0,
"reasoning": "step-by-step analysis",
"evidence": ["short indicator 1", "short indicator 2"]
}
Navigator Agent Prompt Template
You are a navigation agent (Transition Policy). You are invoked when the current node is classified as a Propagator. Based on the outgoing edges and their attributes, determine the most likely source of the fault by ranking candidate services—construct a fault gradient toward the direction of steepest anomaly intensity.
Current Service: {current_service}
Outgoing Edges (traffic statistics: request count, latency, error rate;
interaction logs per edge):
- {service_1}: Latency = {latency_1}, ErrorRate = {err_rate_1},
Traffic = {traffic_1}, Logs = {relevant_logs_1}
- {service_2}: Latency = {latency_2}, ErrorRate = {err_rate_2},
Traffic = {traffic_2}, Logs = {relevant_logs_2}
...
Contextual Summary: {contextual_summary}
Visited Services: {visited_set}
Rank candidates by suspicion score using: (1) Latency Deviation (response time degradation vs. baseline); (2) Error Propagation Rate (ratio of failed requests matching downstream error signature); (3) Traffic Volatility (spikes or drops in request volume). Respond with the ranked list of service names (most suspicious first).
Verifier Agent Prompt: Consistency Check
Act as an adversarial validator: try to falsify the hypothesis by checking whether it contradicts the observed “Golden Signals” (CPU, memory, latency, error rate). If you find any contradiction, mark as inconsistent.
Hypothesis: Service {service_name} is the root cause.
Reasoning: {diagnoser_reasoning}
Golden Signals: {metric_data}
Example: If CPU usage is normal but the hypothesis claims CPU overload, mark as inconsistent.
Output: {"consistent": true/false, "contradictions": [...]}
Verifier Agent Prompt: Causal Sufficiency
Evaluate if the identified root cause is theoretically sufficient to explain the entire observed failure propagation chain (path P):
Root Cause: {service_name}
Propagation Chain: {path}
Downstream Failures: {downstream_services}
Determine if the root cause can theoretically cause all observed downstream failures. Consider dependency relationships and error types.
Output: {"sufficient": true/false, "explanation": "reasoning"}
Verifier Agent Prompt: Counterfactual Reasoning
Adopt a Devil’s Advocate stance: test whether the root cause is necessary for the observed failures. If the hypothesized root cause service were healthy, would the downstream failures still occur?
Hypothesis: Service {service_name} is the root cause.
Downstream Services: {downstream_list}
Error Patterns: {error_patterns}
Answer: If {service_name} were healthy, would {downstream_service_1} still fail? Why or why not? Repeat for each downstream service.
Output: {"necessary": true/false, "counterfactual_analysis": "..."}
References
1. Zhou X, Peng X, Xie T, Sun J, Ji C, Li W, et al. Fault analysis and debugging of microservice systems: industrial survey, benchmark system, and empirical study. IEEE Trans Softw Eng. 2018;47(2):243–60. doi:10.1109/tse.2018.2887384.
2. Wang T, Qi G. A comprehensive survey on root cause analysis in (micro) services: methodologies, challenges, and trends. arXiv:2408.00803. 2024.
3. Meng L, Shao Y, Yuan L, Lai L, Cheng P, Li X, et al. A survey of distributed graph algorithms on massive graphs. ACM Comput Surv. 2024;57(2):1–39. doi:10.1145/3694966.
4. Zhang S, Xia S, Fan W, Shi B, Xiong X, Zhong Z, et al. Failure diagnosis in microservice systems: a comprehensive survey and analysis. ACM Trans Softw Eng Methodol. 2025;35(1):1–55. doi:10.1145/3715005.
5. Han Y, Du Q, Huang Y, Li P, Shi X, Wu J, et al. Holistic root cause analysis for failures in cloud-native systems through observability data. IEEE Trans Serv Comput. 2024;17(6):3789–802. doi:10.1109/tsc.2024.3478759.
6. Chen S, Long X, Fan J, Jin G. A causal inference-based root cause analysis framework using multi-modal data in large-complex system. Reliab Eng Syst Saf. 2026;265(A):111520.
7. Zhang C, Peng X, Sha C, Zhang K, Fu Z, Wu X, et al. DeepTraLog: trace-log combined microservice anomaly detection through graph-based deep learning. In: 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE); 2022 May 25–27; Pittsburgh, PA, USA. Piscataway, NJ, USA: IEEE; 2022. p. 623–34.
8. Yao Z, Pei C, Chen W, Wang H, Su L, Jiang H, et al. Chain-of-event: interpretable root cause analysis for microservices through automatically learning weighted event causal graph. In: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. New York, NY, USA: ACM; 2024. p. 50–61.
9. Zheng L, Chen Z, He J, Chen H. MULAN: multi-modal causal structure learning and root cause analysis for microservice systems. In: Proceedings of the ACM Web Conference 2024. New York, NY, USA: ACM; 2024. p. 4107–16.
10. Wang P, Zhang X, Cao Z. Anomaly detection for microservice system via augmented multimodal data and hybrid graph representations. Inf Fusion. 2025;118:103017. doi:10.1016/j.inffus.2025.103017.
11. Liu X, Liu Y, Wei M, Xu P. LMGD: log-metric combined microservice anomaly detection through graph-based deep learning. IEEE Access. 2024;12:186510–9.
12. Zhang C, Dong Z, Peng X, Zhang B, Chen M. Trace-based multi-dimensional root cause localization of performance issues in microservice systems. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE; 2024. p. 1–12.
13. Gu S, Rong G, Ren T, Zhang H, Shen H, Yu Y, et al. TrinityRCL: multi-granular and code-level root cause localization using multiple types of telemetry data in microservice systems. IEEE Trans Softw Eng. 2023;49(5):3071–88.
14. Huang J, Yang Y, Yu H, Li J, Zheng X. Twin graph-based anomaly detection via attentive multi-modal learning for microservice system. In: 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). Piscataway, NJ, USA: IEEE; 2023. p. 66–78.
15. Pazho AD, Noghre GA, Purkayastha AA, Vempati J, Martin O, Tabkhi H. A survey of graph-based deep learning for anomaly detection in distributed systems. IEEE Trans Knowl Data Eng. 2023;36(1):1–20. doi:10.1109/tkde.2023.3282898.
16. Xu H, Pang G, Wang Y, Wang Y. Deep isolation forest for anomaly detection. IEEE Trans Knowl Data Eng. 2023;35(12):12591–604. doi:10.1109/tkde.2023.3270293.
17. Xiang H, Zhang X, Hu H, Qi L, Dou W, Dras M, et al. OptIForest: optimal isolation forest for anomaly detection. arXiv:2306.12703. 2023.
18. Chalapathy R, Chawla S. Deep learning for anomaly detection: a survey. arXiv:1901.03407. 2019.
19. Kwon D, Kim H, Kim J, Suh SC, Kim I, Kim KJ. A survey of deep learning-based network anomaly detection. Cluster Comput. 2019;22(Suppl 1):949–61. doi:10.1007/s10586-017-1117-8.
20. Lee C, Yang T, Chen Z, Su Y, Lyu MR. Eadro: an end-to-end troubleshooting framework for microservices on multi-source data. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). Piscataway, NJ, USA: IEEE; 2023. p. 1750–62.
21. Yu G, Chen P, Li Y, Chen H, Li X, Zheng Z. Nezha: interpretable fine-grained root causes analysis for microservices on multi-modal observability data. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY, USA: ACM; 2023. p. 553–65.
22. Zhou X, Peng X, Xie T, Sun J, Xu C, Ji C, et al. Benchmarking microservice systems for software engineering research. In: ICSE ’18: Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings. New York, NY, USA: ACM; 2018. p. 323–4.
23. Soldani J, Brogi A. Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: a survey. ACM Comput Surv (CSUR). 2022;55(3):1–39. doi:10.1145/3501297.
24. AIOps Challenge 2025. 2025 [cited 2026 Mar 10]. Available from: https://www.aiops.cn/gitlab/aiops-live-benchmark/aiopschallenge2025.
25. Blanco DG. Practical OpenTelemetry. New York, NY, USA: Apress; 2023.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

