Open Access

ARTICLE

Graph-Augmented Multi-Agent Robust Root Cause Analysis in AIOps

Haodong Zou1,*, Yichen Zhao1, Xin Chen1, Ling Wang1, Jinghang Yu1, Long Yuan2,*
1 Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co., Ltd., Nanjing, China
2 School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
* Corresponding Author: Haodong Zou. Email: email; Long Yuan. Email: email
(This article belongs to the Special Issue: Multimodal Learning for Big Data)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.077908

Received 19 December 2025; Accepted 13 March 2026; Published online 03 April 2026

Abstract

Root cause analysis (RCA), which leverages multi-modal observability data (metrics, traces, and logs) to identify the fundamental source of system failures, is critical for ensuring the reliability of complex microservice systems. Traditionally, RCA has relied on human engineers to manually correlate these fragmented signals, a labor-intensive and error-prone process. Although recent AIOps advances, particularly those leveraging Large Language Models (LLMs), aim to automate this workflow, they remain limited in two respects. First, many existing methods rely on single-modal data, which restricts diagnostic comprehensiveness. Second, approaches that do use multi-modal data either depend on simplistic temporal alignment, which fails to capture complex semantic relationships, or employ LLMs directly, which leaves them prone to hallucinations and unreliable. To address these issues, we propose a novel Graph-Augmented Multi-Agent Framework that combines the structural rigor of graph topology with the semantic reasoning capabilities of LLMs. Our approach operates in two phases designed to mimic human expert problem-solving. First, in the Anomaly Fusion Graph Construction phase, we employ a hybrid alignment strategy to bridge the gap between unstructured logs and structured traces. An LLM serves as a “semantic arbitrator” that resolves alignment ambiguities in high-concurrency scenarios, producing a unified knowledge environment in which each node is enriched with comprehensive health insights. Second, the Multi-Agent Collaborative Reasoning phase deploys a team of specialized agents to simulate human Site Reliability Engineering (SRE) workflows. A Navigator Agent narrows the search space using calculated fault gradients, while a Diagnoser Agent performs deep semantic analysis. Crucially, a Verifier Agent enforces an Adversarial Validation Protocol that mitigates hallucinations through rigorous counterfactual reasoning.
Extensive experiments on five diverse datasets demonstrate the robustness and effectiveness of our approach. Our framework achieves an average F1-score of 88.4%, outperforming state-of-the-art baselines by 4.6% and demonstrating its ability to synthesize multi-modal information into actionable diagnostic insights.
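
The navigate-then-verify loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' method: the abstract does not define "fault gradient", so it is taken here to be the increase in a fused anomaly score along a call edge; the LLM-based Diagnoser is omitted; and the Verifier is reduced to a simple counterfactual-style check that rejects a candidate whose own downstream dependencies are anomalous. The names `Node`, `fault_gradient`, `navigate`, `verify`, and `root_cause`, the scores, and the threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One service in a toy anomaly graph (hypothetical structure)."""
    name: str
    anomaly_score: float                        # assumed fused health score in [0, 1]
    calls: list = field(default_factory=list)   # names of downstream services


def fault_gradient(graph, src, dst):
    # Assumed definition: how much the anomaly score rises from src to dst.
    return graph[dst].anomaly_score - graph[src].anomaly_score


def navigate(graph, entry):
    # Navigator stand-in: greedily follow the steepest positive gradient
    # until no unvisited downstream service looks worse than the current one.
    path, current = [entry], entry
    while True:
        candidates = [d for d in graph[current].calls if d not in path]
        if not candidates:
            return path
        best = max(candidates, key=lambda d: fault_gradient(graph, current, d))
        if fault_gradient(graph, current, best) <= 0:
            return path
        path.append(best)
        current = best


def verify(graph, candidate, threshold=0.5):
    # Verifier stand-in: if any dependency of the candidate is itself
    # anomalous, the candidate may only be a victim, so reject it.
    return all(graph[d].anomaly_score < threshold for d in graph[candidate].calls)


def root_cause(graph, entry):
    # Walk toward the fault, then accept the endpoint only if it passes the check.
    candidate = navigate(graph, entry)[-1]
    return candidate if verify(graph, candidate) else None


# Toy call graph with made-up scores.
graph = {
    "gateway": Node("gateway", 0.3, ["orders", "users"]),
    "orders": Node("orders", 0.6, ["db", "cache"]),
    "users": Node("users", 0.1),
    "db": Node("db", 0.9),
    "cache": Node("cache", 0.2),
}

print(navigate(graph, "gateway"))
print(root_cause(graph, "gateway"))
```

In this toy graph the search moves from `gateway` through `orders` to `db`, and `db` passes the check because it has no anomalous dependencies, whereas `orders` would be rejected as a candidate. A real system would replace the score comparison with the graph-enriched evidence and LLM reasoning the abstract describes.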

Keywords

Data fusion; anomaly fusion graph; AIOps