Open Access

ARTICLE

Rethinking Chart Understanding Using Multimodal Large Language Models

Andreea-Maria Tanasă, Simona-Vasilica Oprea*

Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies, no. 6 Piaţa Romană, Bucharest, 010374, Romania

* Corresponding Author: Simona-Vasilica Oprea

Computers, Materials & Continua 2025, 84(2), 2905-2933. https://doi.org/10.32604/cmc.2025.065421

Abstract

Extracting data from visually rich documents and charts using traditional OCR-based parsing poses multiple challenges: layout complexity in unstructured formats, limitations in recognizing visual elements, the need to correlate different parts of a document, and domain-specific semantics. Simply extracting text is not sufficient; advanced reasoning capabilities are essential to analyze content and answer questions accurately. This paper evaluates the ability of Large Language Models (LLMs) to correctly answer questions about various types of charts, comparing their performance when using images as input versus directly parsing PDF files. To retrieve the images from the PDFs, ColPali, a retrieval model leveraging state-of-the-art vision-language models, is used to identify the page containing the relevant chart for each question. Google's Gemini multimodal models were used to answer a set of questions through two approaches: 1) processing images derived from the PDF documents and 2) directly utilizing the content of the same PDFs. Our findings underscore the limitations of traditional OCR-based approaches to visually rich document understanding (VrDU) and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks. Through structured benchmarking of chart question answering (CQA) across input formats, our work contributes to the advancement of chart understanding (CU) and the broader field of multimodal document analysis. Using two diverse and information-rich sources, the World Health Statistics 2024 report by the World Health Organization and the Global Banking Annual Review 2024 by McKinsey & Company, we examine the performance of multimodal LLMs across input modalities. These documents were selected for their multimodal nature, combining dense textual analysis with varied visual representations and thus presenting realistic challenges for vision-language models. The comparison assesses how advanced models perform with different input formats and determines whether an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning.
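The abstract outlines a two-stage pipeline: ColPali retrieves the PDF page most relevant to each question, and a Gemini model then answers the question either from that page rendered as an image or from the PDF content directly. Below is a minimal sketch of such a pipeline, assuming the colpali-engine, pdf2image, and google-generativeai Python packages; the model identifiers, file name, and example question are illustrative assumptions, not the paper's exact configuration.

# A minimal sketch of the two-stage pipeline described in the abstract.
# Model identifiers ("vidore/colpali-v1.2", "gemini-1.5-flash"), the file
# name, and the question are assumptions for illustration only.
import torch
from pdf2image import convert_from_path
from colpali_engine.models import ColPali, ColPaliProcessor
import google.generativeai as genai

question = "Which region had the highest life expectancy in 2024?"  # example query
pdf_path = "world_health_statistics_2024.pdf"                       # hypothetical file

# --- Stage 1: ColPali retrieval of the page most relevant to the question ---
colpali = ColPali.from_pretrained(
    "vidore/colpali-v1.2", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

pages = convert_from_path(pdf_path)  # one PIL image per PDF page
with torch.no_grad():
    # For long reports, embed pages in small batches to limit memory use.
    page_embeddings = colpali(**processor.process_images(pages).to(colpali.device))
    query_embeddings = colpali(**processor.process_queries([question]).to(colpali.device))

# Late-interaction (MaxSim) scores between the query and every page.
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
best_page = scores[0].argmax().item()

# --- Stage 2: Gemini answers the question under both input modalities ---
genai.configure(api_key="YOUR_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-flash")

# Approach 1: the retrieved page rendered as an image.
image_answer = gemini.generate_content([question, pages[best_page]]).text

# Approach 2: the full PDF uploaded and parsed directly.
pdf_file = genai.upload_file(pdf_path)
pdf_answer = gemini.generate_content([question, pdf_file]).text

print("Image-based answer:", image_answer)
print("PDF-based answer:  ", pdf_answer)

Comparing the image-based and PDF-based answers over a full question set is, in essence, the benchmarking the abstract describes.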

Keywords

Chart understanding; large language models; multimodal models; PDF extraction

Cite This Article

APA Style
Tanasă, A.-M., & Oprea, S.-V. (2025). Rethinking Chart Understanding Using Multimodal Large Language Models. Computers, Materials & Continua, 84(2), 2905–2933. https://doi.org/10.32604/cmc.2025.065421
Vancouver Style
Tanasă AM, Oprea SV. Rethinking Chart Understanding Using Multimodal Large Language Models. Comput Mater Contin. 2025;84(2):2905–2933. https://doi.org/10.32604/cmc.2025.065421
IEEE Style
A.-M. Tanasă and S.-V. Oprea, “Rethinking Chart Understanding Using Multimodal Large Language Models,” Comput. Mater. Contin., vol. 84, no. 2, pp. 2905–2933, 2025. https://doi.org/10.32604/cmc.2025.065421



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.