A Grounded Multi-Agent Multimodal Large Language Model Framework for Interpretable Risk Assessment in Driving Scenes

Chien-Hao Tseng¹, Min-Yu Chen¹, Meng-Wei Lin¹, Jyh-Horng Wu¹, Chung-I Huang²,*
¹ National Center for High-Performance Computing, National Institutes of Applied Research, Hsinchu City, Taiwan
² Department of Management Information Systems, National Chung Hsing University, Taichung City, Taiwan
* Corresponding Author: Chung-I Huang. Email: email

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.083337

Received 02 April 2026; Accepted 25 May 2026; Published online 23 June 2026

Abstract

Context-aware driving assistance must do more than detect objects: it has to identify the cues that materially affect risk, separate observable evidence from inference, and produce recommendations that humans can audit. This paper presents a grounded multi-agent multimodal large language model (MLLM) framework for interpretable risk assessment in driving scenes. The framework decomposes reasoning into four stages—context relevance evaluation, visual interpretation, factual verification with anomaly extraction, and risk assessment with action recommendation—so that the final advisory is generated only from a verified intermediate representation rather than directly from a free-form scene description. We evaluate the framework on a manually labeled benchmark derived from BDD100K covering traffic-sign interpretation, traffic-density assessment, and pedestrian–vehicle interaction risk. The benchmark contains 600 frames with three-rater annotation and majority-vote labels (Fleiss’ κ=0.79 on risk levels); we explicitly discuss the implications of this scale for generalization and complement it with a multi-backbone stress test. Across five independent runs, the proposed framework improves risk accuracy from 74.3±0.9% to 84.8±0.6% and macro-F1 from 72.8±1.1% to 83.1±0.7% over a single-agent MLLM baseline. The hallucination rate—defined as the fraction of outputs containing at least one entity, attribute, or relation that has no visual support in the source frame—drops from 18.7% to 8.9%, and the actionability score—a five-point human rating averaged over usefulness, specificity, and visual consistency—rises from 3.62 to 4.28. McNemar tests confirm that the gain in risk accuracy is statistically significant (p<0.001). The framework is intended as a semantic decision-support layer for explainable advanced driver-assistance systems and human-centered autonomous-driving interfaces.
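The label aggregation and the two human-audited metrics defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the per-output count of unsupported items, the rating dictionary keys, and the handling of rater ties (returned as None) are all hypothetical choices, since the abstract does not specify them.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the label chosen by most raters, or None on a tie.

    Tie handling is an assumption; the abstract only states that
    majority-vote labels were used with three raters.
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def hallucination_rate(unsupported_counts):
    """Fraction of outputs with at least one unsupported item.

    Each element is the number of entities, attributes, or relations
    in one output that have no visual support in the source frame
    (a hypothetical input format).
    """
    if not unsupported_counts:
        raise ValueError("at least one output is required")
    flagged = sum(1 for n in unsupported_counts if n > 0)
    return flagged / len(unsupported_counts)


def actionability_score(ratings):
    """Mean over outputs of the average of three five-point criteria.

    Averaging the three criteria per output first and then across
    outputs is an assumed aggregation order.
    """
    per_output = [
        mean((r["usefulness"], r["specificity"], r["visual_consistency"]))
        for r in ratings
    ]
    return mean(per_output)


if __name__ == "__main__":
    print(majority_vote(["high", "high", "low"]))
    print(hallucination_rate([0, 2, 0, 1]))
    print(actionability_score([
        {"usefulness": 4, "specificity": 4, "visual_consistency": 4},
        {"usefulness": 5, "specificity": 3, "visual_consistency": 4},
    ]))
```

Counting an output as hallucinated as soon as it contains one unsupported item follows the "at least one" wording of the definition, so the rate is insensitive to how many unsupported items a single output contains.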

Keywords

Autonomous driving; context-aware risk assessment; multimodal large language model; multi-agent system; interpretable reasoning; driving-scene understanding
