TY  - EJOU
AU  - Borovčak, Karlo 
AU  - Babac, Marina Bagić 
AU  - Mornar, Vedran 

TI  - Evaluating Open-Source LLM Agents for SQL Generation and Structured Analytics on Relational Databases
T2  - Computers, Materials & Continua

PY  - 
VL  - 
IS  - 
SN  - 1546-2226

AB  - This study examines the potential of open-source foundation models for structured data analytics, with particular emphasis on SQL generation and business-oriented interpretation in single-agent and multi-agent large language model (LLM) systems. The proposed framework addresses a practical problem in analytics-intensive environments, where natural-language requests must be translated into executable, semantically appropriate SQL queries and subsequently interpreted in a form useful for business decision-making. The system is evaluated in two complementary settings: a custom SQL test suite designed around realistic marketing and e-commerce analytics tasks, and the public Spider benchmark, which supports comparison with prior text-to-SQL research and enables assessment of cross-domain generalization. The analysis includes Mistral, Devstral, Qwen2.5-Coder, and Qwen3. On the custom SQL test suite, performance was assessed using exact match, safe SQL rate, and an independent semantic judge score. Qwen2.5-Coder achieved the strongest overall result, reaching an independent semantic score of 90.14% while maintaining a 98.59% safe SQL rate. Qwen3 followed with a semantic score of 77.46% and completely safe SQL generation. These results indicate that in domain-specific analytics settings, strict query-level matching alone is too conservative to capture practical model usefulness, since semantically appropriate SQL queries may differ substantially from the reference formulation. The Spider benchmark results provide complementary evidence regarding broader model behavior. Qwen2.5-Coder achieved the highest single-agent execution accuracy (72.44%), whereas Devstral obtained the strongest single-agent exact-match score (28.14%). Qwen3 remained competitive and delivered the lowest single-agent latency (0.41 s) among the evaluated models. At the architectural level, the effect of multi-agent decomposition was not uniform: it yielded modest gains in execution accuracy for some model families, but reduced performance for others, while consistently increasing latency and token consumption. Taken together, the findings show that open-source LLM agents can provide effective support for structured analytics, but that their performance depends strongly on model family, prompting strategy, and agent architecture. More broadly, the study demonstrates that the evaluation of text-to-SQL systems benefits from combining benchmark-based metrics, domain-oriented semantic assessment, and efficiency-aware analysis, thereby offering a more realistic basis for the deployment of open-source LLM systems in analytics-intensive environments.
KW  - Large language models (LLMs); multi-agent systems; SQL agents; text-to-SQL

DO  - 10.32604/cmc.2026.078330