Open Access
ARTICLE
Rethinking Chart Understanding Using Multimodal Large Language Models
Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies, no. 6 Piaţa Romană, Bucharest, 010374, Romania
* Corresponding Author: Simona-Vasilica Oprea. Email:
Computers, Materials & Continua 2025, 84(2), 2905-2933. https://doi.org/10.32604/cmc.2025.065421
Received 12 March 2025; Accepted 06 June 2025; Issue published 03 July 2025
Abstract
Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges, including layout complexity in unstructured formats, limitations in recognizing visual elements, difficulties in correlating different parts of a document, and domain-specific semantics. Simply extracting text is not sufficient; advanced reasoning capabilities are essential for analyzing content and answering questions accurately. This paper aims to evaluate the ability of Large Language Models (LLMs) to correctly answer questions about various types of charts, comparing their performance when using images as input versus directly parsing PDF files. To retrieve the images from the PDFs, ColPali, a retrieval model leveraging state-of-the-art vision-language models, is used to identify the page containing the relevant chart for each question. Google's Gemini multimodal models were used to answer a set of questions through two approaches: 1) processing images derived from the PDF documents and 2) directly utilizing the content of the same PDFs. Our findings underscore the limitations of traditional OCR-based approaches to visually rich document understanding (VrDU) and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks. Through structured benchmarking of chart question answering (CQA) across input formats, our work contributes to the advancement of chart understanding (CU) and the broader field of multimodal document analysis. Using two diverse and information-rich sources, the World Health Statistics 2024 report by the World Health Organization and the Global Banking Annual Review 2024 by McKinsey & Company, we examine the performance of multimodal LLMs across different input modalities, comparing their effectiveness in processing charts as images versus parsing content directly from the PDFs. These documents were selected for their multimodal nature, combining dense textual analysis with varied visual representations, and thus presenting realistic challenges for vision-language models. This comparison aims to assess how advanced models perform with different input formats and to determine whether an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning.
Keywords
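To make the compared setups concrete, the sketch below illustrates one way the two input formats described in the abstract could be queried with the Gemini API: a single page image (as a ColPali-style retriever would select) on one side, and the full PDF on the other. This is a minimal illustration, not the authors' code; the model name, file paths, and the example question are assumptions.

```python
# Minimal sketch (illustrative only) of the two question-answering setups
# compared in the paper: chart-as-image input vs. direct PDF input.
# Model name, file paths, and the example question are assumed for illustration.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

question = "According to the chart, which region shows the largest change between 2000 and 2019?"

# 1) Image-based input: a ColPali-style retriever would first select the PDF page
#    most relevant to the question; here that page is assumed to be pre-rendered.
page_image = Image.open("retrieved_chart_page.png")
image_answer = model.generate_content([page_image, question])

# 2) PDF-based input: the same document is uploaded and queried directly.
pdf_file = genai.upload_file("world_health_statistics_2024.pdf")
pdf_answer = model.generate_content([pdf_file, question])

print("Answer from page image:", image_answer.text)
print("Answer from full PDF:  ", pdf_answer.text)
```

Comparing the two responses against a gold answer for each question is the basis of the benchmarking approach summarized above.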
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.