A Grounded Multi-Agent Multimodal Large Language Model Framework for Interpretable Risk Assessment in Driving Scenes
Chien-Hao Tseng1, Min-Yu Chen1, Meng-Wei Lin1, Jyh-Horng Wu1, Chung-I Huang2,*
1 National Center for High-Performance Computing, National Institutes of Applied Research, Hsinchu City, Taiwan
2 Department of Management Information Systems, National Chung Hsing University, Taichung City, Taiwan
* Corresponding Author: Chung-I Huang. Email:
Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.083337
Received 02 April 2026; Accepted 25 May 2026; Published online 23 June 2026
Abstract
Context-aware driving assistance must do more than detect objects: it has to identify the cues that materially affect risk, separate observable evidence from inference, and produce recommendations that humans can audit. This paper presents a grounded multi-agent multimodal large language model (MLLM) framework for interpretable risk assessment in driving scenes. The framework decomposes reasoning into four stages (context relevance evaluation, visual interpretation, factual verification with anomaly extraction, and risk assessment with action recommendation) so that the final advisory is generated only from a verified intermediate representation rather than directly from a free-form scene description. We evaluate the framework on a manually labeled benchmark derived from BDD100K covering traffic-sign interpretation, traffic-density assessment, and pedestrian–vehicle interaction risk. The benchmark contains 600 frames with three-rater annotation and majority-vote labels (Fleiss' κ = 0.79 on risk levels); we explicitly discuss the implications of this scale for generalization and complement it with a multi-backbone stress test. Across five independent runs, the proposed framework improves risk accuracy from 74.3 ± 0.9% to 84.8 ± 0.6% and macro-F1 from 72.8 ± 1.1% to 83.1 ± 0.7% over a single-agent MLLM baseline. The hallucination rate, defined as the fraction of outputs containing at least one entity, attribute, or relation that has no visual support in the source frame, drops from 18.7% to 8.9%, and the actionability score, a five-point human rating averaged over usefulness, specificity, and visual consistency, rises from 3.62 to 4.28. McNemar tests confirm that the gain in risk accuracy is statistically significant (p < 0.001). The framework is intended as a semantic decision-support layer for explainable advanced driver-assistance systems and human-centered autonomous-driving interfaces.
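The evaluation quantities named in the abstract can be made concrete with a minimal Python sketch. It is not the authors' evaluation code. The function names, the input layout (one list entry per frame or per output), and the tie handling in the majority vote are assumptions introduced for illustration. The abstract also does not say which variant of the McNemar test was used; the exact binomial form is assumed here.

```python
from collections import Counter
from math import comb


def majority_vote(labels):
    """Return the label chosen by most raters, or None on a tie.

    Returning None on a tie is an assumption; the abstract does not
    say how ties among the three raters were resolved.
    """
    (top, count), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == count:
        return None
    return top


def hallucination_rate(unsupported_counts):
    """Fraction of outputs with at least one unsupported item.

    unsupported_counts[i] is the number of entities, attributes, or
    relations in output i that have no visual support in the frame.
    """
    flagged = sum(1 for c in unsupported_counts if c > 0)
    return flagged / len(unsupported_counts)


def actionability(usefulness, specificity, visual_consistency):
    """Mean of three five-point human ratings for one output."""
    return (usefulness + specificity + visual_consistency) / 3


def mcnemar_exact(baseline_correct, proposed_correct):
    """Two-sided exact McNemar p-value from paired per-frame correctness.

    Only discordant pairs matter: b counts frames the baseline got
    right and the proposed system got wrong, c counts the reverse.
    """
    pairs = list(zip(baseline_correct, proposed_correct))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

In this sketch the hallucination rate is counted per output rather than per unsupported item, matching the "at least one" wording of the definition: an output with three unsupported items contributes the same as an output with one.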
Keywords
Autonomous driving; context-aware risk assessment; multimodal large language model; multi-agent system; interpretable reasoning; driving-scene understanding