Open Access
ARTICLE
KG-HoT: Knowledge-Grounded Hybrid Chain-of-Thought for Geometry Problem Solving
1 School of Artificial Intelligence and Computer Science (School of Software), Northwest Normal University, Lanzhou, China
2 Beijing Jinghang Research Institute of Computing and Communication, Beijing, China
* Corresponding Author: Meihuizi Jia. Email:
Computers, Materials & Continua 2026, 88(2), 81 https://doi.org/10.32604/cmc.2026.080333
Received 06 February 2026; Accepted 06 May 2026; Issue published 15 June 2026
Abstract
Large language models (LLMs) have demonstrated considerable ability in solving various tasks via Chain-of-Thought (CoT) prompting, which has prompted extensive research into their application to complex mathematical reasoning problems. However, current research on mathematical reasoning with CoT predominantly focuses on textual mathematical tasks, such as math word problems, while paying limited attention to multimodal geometric scenarios. To bridge this gap, we propose KG-HoT, a model that harnesses the generative and comprehension capabilities of multimodal large language models (MLLMs) to enhance complex geometric problem-solving in multimodal systems. Our knowledge-grounded approach enables MLLMs to generate hybrid chains-of-thought operating on dual tracks, language-based reasoning and program-based reasoning, which serve as teaching signals for smaller models. Furthermore, we design an instruction tuning framework that trains these dual reasoning tracks collaboratively within a unified architecture, enabling mutual enhancement and efficient knowledge distillation for complex geometric problem solving. Extensive experimental results demonstrate that KG-HoT achieves superior performance compared to existing approaches on multiple geometry problem-solving benchmarks.
Geometry Problem Solving (GPS) represents a pivotal and long-standing challenge within artificial intelligence, requiring the integration of advanced mathematical reasoning, geometric visual understanding, and domain knowledge application. Its importance and complexity have drawn increasing attention from both Computer Vision (CV) and Natural Language Processing (NLP) communities [1–3].
Research in geometry problem solving has evolved from rule-based matching and symbolic reasoning to deep learning approaches. Symbolic solvers [2,4–6] typically employ syntactic parsers to translate problems and diagrams into formal languages, then execute symbolic reasoning using techniques such as path searching and condition matching. As shown in Fig. 1a, the geometric elements, relational constraints, and target objective are first transformed into formal language representations, followed by systematic derivation to yield the final solution. Despite their interpretability, these methods suffer from complex rule engineering requirements and poor generalization to real-world unstructured data. To address these limitations, neural solvers [1,3,7,8] employ hybrid encoders to jointly embed diagrams and text, generating solution programs through sequence modeling. As illustrated in Fig. 1b, the approach begins with predefining basic geometric relations (e.g., Equal, Double), arithmetic operators (e.g., Add, Minus), and constants. Subsequently, the model generates a solution program, for instance “minus C_3 N_0 minus C_2 V_0”, which upon execution yields the final result. Despite achieving some success, annotating solution programs requires domain expert guidance, making the process expensive and time-consuming. Moreover, these models demonstrate limited understanding of geometric knowledge and generate solution programs without explicit reasoning processes.

Figure 1: Comparison of three approaches for geometric reasoning. (a) symbolic solvers using predefined rules, (b) neural solvers with solution programs, (c) human solving with step-by-step theorem-based reasoning.
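To make the neural-solver paradigm concrete, the following is a minimal sketch of how a flat solution program such as "minus C_3 N_0 minus C_2 V_0" could be executed. It assumes that every operator is binary, that N_i indexes numbers from the problem text, C_i indexes predefined constants, and V_i indexes intermediate results. The constant table, the input number, and the function name `run_program` are hypothetical and chosen only for illustration.

```python
# Hypothetical operator table: only the two arithmetic operators named in the
# text. A real solver would define more operators and geometric relations.
OPERATORS = {
    "add": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
}


def run_program(tokens, numbers, constants):
    """Execute a flat operator/operand sequence and return the last value.

    N_i -> i-th number taken from the problem text
    C_i -> i-th predefined constant
    V_i -> result of the i-th executed step
    """
    results = []  # V_0, V_1, ...

    def resolve(token):
        kind, index = token.split("_")
        table = {"N": numbers, "C": constants, "V": results}[kind]
        return table[int(index)]

    pos = 0
    while pos < len(tokens):
        # Assumption: each step is one operator followed by two operands.
        op = OPERATORS[tokens[pos]]
        left = resolve(tokens[pos + 1])
        right = resolve(tokens[pos + 2])
        results.append(op(left, right))
        pos += 3
    return results[-1]


# Assumed constant table (C_0..C_3) and a single problem number N_0 = 110.
constants = [30.0, 60.0, 90.0, 180.0]
program = "minus C_3 N_0 minus C_2 V_0".split()
answer = run_program(program, numbers=[110.0], constants=constants)
print(answer)  # V_0 = 180 - 110 = 70, V_1 = 90 - 70 = 20
```

The program itself carries no explanation of why these two subtractions are valid, which is the interpretability gap the paragraph above points out.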
Recent breakthroughs in large language models [9,10] have introduced new research paradigms for mathematical reasoning problems [11–14]. These models advance mathematical reasoning tasks, such as math word problems, through Chain-of-Thought (CoT) reasoning techniques that generate intermediate reasoning steps before deriving final answers. However, text-only approaches are inherently limited in understanding geometric diagrams and reasoning about spatial relationships. While several studies have explored Multimodal Chain-of-Thought (MCoT) reasoning [15–18], these efforts primarily target natural science domains. Multimodal geometry problems present unique challenges due to their complex textual descriptions, specialized geometric information, and implicit domain knowledge. When solving complex geometry problems, human experts seamlessly integrate textual and diagrammatic analysis, apply relevant geometric theorems, and construct solution paths. As shown in Fig. 1c, humans start from the given condition
To enhance multimodal geometry problem-solving capabilities and effectively bridge the gap between existing methods and human cognitive processes, we propose the Knowledge-grounded Hybrid Chain-of-Thought (KG-HoT) paradigm for Geometry Problem Solving. Our approach consists of two complementary stages: 1) generating knowledge-grounded hybrid reasoning chains from multimodal large language models as teaching signals; 2) distilling these signals into lightweight student models for efficient geometric reasoning. In the teaching signal generation stage, we construct diverse reasoning chains through zero-shot instructions. Specifically, guided by geometric knowledge, we build knowledge paths from concepts to theorems to visual mappings, and implement Python functions for programmatic theorem invocation. By integrating geometric problems, diagram information, standard answers, and geometric knowledge, we guide MLLMs to generate two complementary reasoning chains: 1) Knowledge-grounded Language Chain-of-Thought (KG-CoT), featuring a three-tier progressive structure comprising geometric concept chains (identifying fundamental concepts), theorem chains (determining reasoning rules), and mapping chains (establishing knowledge-diagram associations), ultimately producing solution paths; 2) Knowledge-grounded Program Chain-of-Thought (KG-PoT), transforming reasoning into executable Python sequences that combine semantic clarity with computational precision. To achieve knowledge transfer, we propose a multi-chain learning strategy that jointly optimizes KG-CoT and KG-PoT within a unified framework, enabling synergistic enhancement through instruction tuning. The two complementary reasoning paths, natural language and programmatic, are grounded in shared geometric knowledge and collectively strengthen the student model’s problem-solving capabilities. Experimental results demonstrate significant performance improvements across multiple geometric benchmarks.
Our contributions are summarized as follows:
• We propose KG-HoT, a knowledge distillation framework that transfers geometric reasoning capabilities from large models to lightweight models.
• We design a multi-chain joint learning strategy that combines the three-tier progressive reasoning of KG-CoT with the programmatic verification of KG-PoT, enhancing both the accuracy and interpretability of lightweight models in geometry problem solving.
• We conduct comprehensive experiments across multiple geometry benchmarks to evaluate our approach. Results demonstrate that KG-HoT achieves significant performance improvements over baseline methods and delivers competitive results compared to existing advanced models.
Multimodal geometric reasoning is a challenging mathematical reasoning task, which can be categorized into symbolic solvers [2,4–6] and neural solvers [1,3,7,8,19]. Most neural solvers follow the encoder-decoder framework and focus on dataset construction. Many studies [1,3,7] leverage pre-training and fine-tuning to improve performance. Chen et al. [1] build the GeoQA dataset and propose NGS, the first deep learning method for GPS, with two pre-training tasks for better cross-modal alignment. Chen et al. [7] present the unified benchmark UniGeo and introduce Geoformer, a multi-task transformer for geometric calculation and proving, enhanced by math expression pre-training. Zhang et al. [3] construct the fine-grained PGPS9K dataset and design PGPSNet with an MLM-based semantic pre-training strategy, which converts diagrams into text clauses for effective feature representation. Liang et al. [8] propose UniMath to handle diverse multimodal math problems, and augment the vocabulary with tokenized image representations from a trainable VQ-VAE [20,21]. Ning et al. [19] develop a symbol-character aware model, using self-supervised learning and masked image modeling to improve diagram understanding. These methods generate fixed solution programs, but expert annotation leads to high costs. Recently, large language models have advanced geometric problem solving. Zhao et al. [22] systematically survey GPS in the large model era, covering benchmarks, multimodal parsing, and reasoning paradigms. Gao et al. [23] release the Geo170K dataset with over 170K geometric image-caption and question-answer pairs, and develop G-LLaVA for multimodal GPS. In addition, chain-of-thought prompting has become a key technique for multi-step logical reasoning. GeomVerse [24] validates that CoT fine-tuning significantly boosts reasoning ability on complex geometry problems, and GNS [25] further improves accuracy and interpretability by combining symbolic parsing with CoT. 
How to more effectively apply geometric knowledge for GPS remains an important open direction.
2.2 Chain-of-Thought Reasoning
Chain-of-Thought (CoT) reasoning has emerged as a crucial paradigm for enhancing complex reasoning capabilities in large language models. Wei et al. [11] pioneer CoT prompting, significantly improving model performance on arithmetic, commonsense, and symbolic reasoning tasks through intermediate reasoning steps. Kojima et al. [26] advance this with zero-shot CoT, showing that simply adding “Let’s think step by step” can elicit reasoning capabilities. To enhance CoT robustness, researchers explore multiple directions. Wang et al. [27] propose Self-Consistency, which improves answer reliability through multi-path sampling and voting mechanisms. Zhang et al. [28] construct diverse exemplars via automatic clustering. Yao et al. [29] and Besta et al. [30] introduce Tree-of-Thoughts (ToT) and Graph-of-Thoughts (GoT), respectively, extending linear reasoning to tree and graph structures. In multimodal scenarios, Zhang et al. [18] propose Multimodal-CoT, employing a two-stage framework for sequential rationale generation and answer inference. Rose et al. [16] introduce Visual Chain-of-Thought, incorporating visual information to address logical gaps in text-based reasoning. However, existing work primarily focuses on optimizing single reasoning chains, with limited exploration of synergies between different chain types. We propose a hybrid chain-of-thought approach for more effective geometry problem solving through the complementary fusion of language and program reasoning chains.
We observe that state-of-the-art MLLMs, such as GPT, Claude, and DeepSeek, although capable of understanding general visual tasks, have difficulty with geometric problems, even those that are simple for humans. As shown in Fig. 2, GPT-4o and DeepSeek yield erroneous results of 90° and 60°, respectively. Taking GPT-4o as an example, when supplemented with geometric theorems (Fig. 2, left), the model achieves improved reasoning but still struggles with visual interpretation; when given explicit mappings between theorems and visual elements (Fig. 2, right), the model successfully produces correct answers. This progressive improvement indicates that model performance hinges on the integration of geometric domain knowledge and visual cues. Therefore, developing effective geometric problem-solving frameworks requires deep integration of geometric knowledge with visual reasoning.

Figure 2: Geometric reasoning by multimodal large language models with progressive knowledge enhancement. Left: Models with geometric theorems show improved but imperfect reasoning. Right: Models with theorem-visual mappings achieve correct solutions.
We propose KG-HoT, a knowledge distillation framework that leverages a teacher-student architecture to transfer geometric reasoning capabilities effectively. As depicted in Fig. 3, the framework operates in two distinct stages. (1) Knowledge-grounded hybrid chain-of-thought generation, where a multimodal large language model serves as the teacher model to extract problem-relevant theorems and properties, and then generates two complementary reasoning chains based on domain knowledge: a language-based chain-of-thought that describes reasoning steps in natural language, and a program-based chain-of-thought that encodes geometric relationships into Python programs for computational solving. (2) Hybrid chain-based knowledge distillation, where the hybrid chains serve as supervision signals to train a lightweight student model through instruction tuning, enabling resource-efficient geometric reasoning.

Figure 3: KG-HoT framework.
Large language models have accumulated comprehensive knowledge bases and superior linguistic comprehension through extensive pre-training on large-scale corpora. Leveraging this foundation, our KG-HoT model employs geometric knowledge to guide multimodal large language models in generating high-quality hybrid chains-of-thought through carefully designed prompt templates. Specifically, we first collect key geometric knowledge as follows:
• Knowledge concepts: Basic theoretical knowledge essential for solving specific problem types (e.g., Parallel Lines, Inscribed Angles, Pythagorean theorem). The knowledge concept candidate set is denoted as
• Theorems: Rigorously proven mathematical propositions with general validity (e.g., Alternate interior angles formed by parallel lines are equal). All collected theorems are denoted as:
• Python function candidates: For each knowledge concept and its corresponding theorems, we predefine a series of Python functions whose names consist of the knowledge concept name and a theorem summary (e.g.,
We collect 48 candidate geometric knowledge concepts, 137 candidate geometric theorems, and 137 candidate Python functions. All collected knowledge concepts, theorems, and corresponding Python functions are available in the data/collection folder at https://github.com/jmhz24/HG-HoT. Fig. 4 illustrates representative examples of the collected knowledge.

Figure 4: Samples of geometric knowledge mapping between knowledge concepts, theorems, and Python functions.
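The concept-to-theorem-to-function mapping described above can be sketched as a small data structure. This is an illustrative sketch only: the class names, the snake_case naming helper, and the example function are assumptions and are not taken from the released collection. The only constraint taken from the text is that a function name combines the knowledge concept name with a theorem summary.

```python
from dataclasses import dataclass, field


@dataclass
class Theorem:
    statement: str
    function_name: str  # concept name + theorem summary, per the naming rule


@dataclass
class KnowledgeConcept:
    name: str
    theorems: list = field(default_factory=list)


def make_function_name(concept, summary):
    """Join concept name and theorem summary into one identifier.

    The snake_case format is an assumption made for this sketch.
    """
    return "_".join(f"{concept} {summary}".lower().split())


def parallel_lines_alternate_interior_angles_equal(angle):
    """Hypothetical theorem function: the alternate interior angle equals the given angle."""
    return angle


parallel = KnowledgeConcept("Parallel Lines")
parallel.theorems.append(
    Theorem(
        statement="Alternate interior angles formed by parallel lines are equal",
        function_name=make_function_name(
            "Parallel Lines", "alternate interior angles equal"
        ),
    )
)

# One entry of a concept-indexed knowledge base.
knowledge_base = {parallel.name: parallel}
print(knowledge_base["Parallel Lines"].theorems[0].function_name)
```

Indexing by concept name lets a prompt builder look up only the theorems and functions tied to the concepts identified for a given problem.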
3.2.1 Knowledge-Grounded Language-Based CoT Generation
Given the extensive knowledge scope required for geometric problems, we adopt a two-step zero-shot prompting strategy to mitigate potential interference from irrelevant knowledge in multimodal large language models. This approach comprises: (1) knowledge concept identification and (2) high-quality chain-of-thought generation. Knowledge concepts refer to fundamental geometric principles (e.g., Inscribed Angle; Parallel Lines) that are essential for problem-solving. For knowledge concept identification, we design a knowledge selection/generation prompt template as follows, into which the
[Instruction]; Question: [
In this template,
Subsequently, we construct the following prompt template for language-based chain-of-thought generation based on the identified knowledge concepts set:
[Basic Instruction]; Question: [
The prompt template design incorporates multiple key slots. Note that each theorem set

Figure 5: Sample of language-based chain-of-thought prompt template.
Ultimately, this template guides MLLMs to produce a language-based chain
3.2.2 Knowledge-Grounded Program-Based CoT Generation
The program-based chain-of-thought is designed to facilitate student models’ learning of reasoning processes through program abstraction while establishing multimodal mappings between program variables and visual elements. Building upon the knowledge concepts from Section 3.2.1, our program-based chain-of-thought prompt template incorporates the following core components:
[Basic Instruction]; Question: [
The template reuses the basic slots from the language-based version (problem, simple solution process, answer, knowledge concepts) but replaces theorem candidates with Python function candidates. Note that each Python function set

Figure 6: Sample of program-based chain-of-thought prompt template.
This template guides MLLMs to produce an executable programmatic chain
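As a rough illustration of how the two templates share slots, the sketch below assembles a language-based and a program-based prompt from the same problem, solution, answer, and knowledge concepts, and swaps only the candidate block. The slot labels, instruction wording, and helper names are hypothetical; the actual templates are those shown in Figs. 5 and 6, and the diagram would be passed to the MLLM separately as an image.

```python
def format_candidates(concepts, candidates):
    """List the candidates (theorems or functions) of each identified concept."""
    lines = []
    for concept in concepts:
        for item in candidates.get(concept, []):
            lines.append(f"- {concept}: {item}")
    return "\n".join(lines)


def build_prompt(instruction, question, solution, answer,
                 concepts, candidates, candidate_label):
    """Fill the shared slots, then append the chain-specific candidate block."""
    return "\n".join([
        instruction,
        f"Question: {question}",
        f"Solution: {solution}",
        f"Answer: {answer}",
        "Knowledge concepts: " + ", ".join(concepts),
        f"{candidate_label}:",
        format_candidates(concepts, candidates),
    ])


# Hypothetical example problem and candidate sets.
question = "AB is parallel to CD and angle 1 is 50 degrees. Find angle 2."
solution = "Angle 2 equals angle 1 as alternate interior angles."
concepts = ["Parallel Lines"]
theorems = {"Parallel Lines": [
    "Alternate interior angles formed by parallel lines are equal"]}
functions = {"Parallel Lines": [
    "parallel_lines_alternate_interior_angles_equal(angle)"]}

cot_prompt = build_prompt(
    "Explain the solution step by step using the theorems below.",
    question, solution, "50", concepts, theorems, "Theorem candidates")
pot_prompt = build_prompt(
    "Write a Python program that solves the problem using the functions below.",
    question, solution, "50", concepts, functions, "Python function candidates")
print(pot_prompt)
```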
To fully leverage the complementary advantages of different reasoning paradigms, we integrate language-based and program-based chains-of-thought into unified training samples

Figure 7: Sample of hybrid chain-of-thought combining KG-CoT (language-based reasoning) and KG-PoT (program-based reasoning).
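For readers unfamiliar with program-based chains, the following sketch shows what an executable KG-PoT style sequence might look like: each step calls a theorem function and binds the result to a variable named after a diagram element. The problem, the function names, and the variable names are hypothetical and are not copied from Fig. 7.

```python
def inscribed_angles_half_of_central_angle(central_angle):
    """Inscribed angle theorem: an inscribed angle is half the central
    angle subtending the same arc."""
    return central_angle / 2


def triangle_angle_sum_third_angle(angle_a, angle_b):
    """Triangle angle sum: the three interior angles add up to 180 degrees."""
    return 180 - angle_a - angle_b


# Hypothetical problem: in circle O, central angle AOB is 100 degrees, point C
# lies on the major arc AB, and angle CAB is 60 degrees. Find angle ABC.
angle_AOB = 100
angle_CAB = 60

# Step 1: angle ACB is inscribed on the same arc AB as central angle AOB.
angle_ACB = inscribed_angles_half_of_central_angle(angle_AOB)

# Step 2: the remaining interior angle of triangle ABC.
angle_ABC = triangle_angle_sum_third_angle(angle_ACB, angle_CAB)
print(angle_ABC)
```

Because variable names mirror diagram labels, each line of the program can be read both as a computation and as a statement about the figure.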
Inspired by Orca2 [31], we employ the Prompt Erasure technique, where student models receive only task descriptions and teacher-generated hybrid chains-of-thought, without exposure to the detailed prompts used to guide the teacher model. This design aims to enable student models not only to learn reasoning steps but also, more importantly, to autonomously select appropriate reasoning strategies. Therefore, during training, we provide only generic instructions such as: “Please analyze the given problem, incorporate the image information, and provide solution steps with the answer.”
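A minimal sketch of Prompt Erasure as described above: the detailed teacher prompt is discarded and the student sees only the generic instruction, the problem, the diagram, and the teacher-generated hybrid chain as the target. The record fields, the file name, and the order in which the two chains are joined are assumptions made for illustration.

```python
GENERIC_INSTRUCTION = (
    "Please analyze the given problem, incorporate the image information, "
    "and provide solution steps with the answer."
)


def make_student_sample(teacher_record):
    """Build one student training sample from a teacher generation record.

    The detailed teacher prompt is deliberately not copied over.
    """
    # Assumption: the language chain precedes the program chain in the target.
    target = teacher_record["kg_cot"] + "\n" + teacher_record["kg_pot"]
    return {
        "instruction": GENERIC_INSTRUCTION,
        "question": teacher_record["question"],
        "image": teacher_record["image"],
        "target": target,
    }


# Hypothetical teacher record.
record = {
    "teacher_prompt": "[Basic Instruction]; Question: ...; Theorem candidates: ...",
    "question": "AB is parallel to CD and angle 1 is 50 degrees. Find angle 2.",
    "image": "diagram_0001.png",
    "kg_cot": "Concept: Parallel Lines. Theorem: alternate interior angles are equal.",
    "kg_pot": "angle_2 = parallel_lines_alternate_interior_angles_equal(50)",
}
sample = make_student_sample(record)
print(sorted(sample))
```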
The student model training follows this format: inputs comprise generic instructions
where
During inference, the model receives generic instructions, problem descriptions, and geometric diagrams. The visual encoder (CLIP ViT-L/14 in LLaVA-v1.5) extracts diagram features, and a two-layer MLP projection layer maps them to the embedding dimension of the large language model. The projected features are then concatenated with the text embeddings to form multimodal representations. Based on these representations, the model autoregressively generates output sequences, as shown in Eq. (2):
where
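The projection-and-concatenation step of the inference pipeline can be sketched in plain Python as follows. The toy dimensions, the random weights, the GELU activation between the two linear layers, and the ordering of visual tokens before text tokens are all assumptions made for illustration; the real model uses the pretrained LLaVA-v1.5 components.

```python
import math
import random


def linear(x, weight, bias):
    """Apply one linear layer to a feature vector (weight: out_dim x in_dim)."""
    return [
        sum(xi * wi for xi, wi in zip(x, row)) + b
        for row, b in zip(weight, bias)
    ]


def gelu(v):
    """Exact GELU activation (assumed nonlinearity of the projector)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def project(patch_features, w1, b1, w2, b2):
    """Two-layer MLP that maps each visual patch into the LLM embedding space."""
    projected = []
    for patch in patch_features:
        hidden = [gelu(v) for v in linear(patch, w1, b1)]
        projected.append(linear(hidden, w2, b2))
    return projected


def random_weights(rng, in_dim, out_dim):
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16  # toy sizes, far smaller than the real model

w1, b1 = random_weights(rng, VISION_DIM, LLM_DIM), [0.0] * LLM_DIM
w2, b2 = random_weights(rng, LLM_DIM, LLM_DIM), [0.0] * LLM_DIM

patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(3)]

visual_tokens = project(patches, w1, b1, w2, b2)
# Concatenate along the sequence axis to form the multimodal input.
multimodal_sequence = visual_tokens + text_embeddings
print(len(multimodal_sequence), len(multimodal_sequence[0]))
```

After projection, visual and text tokens share one embedding width, so the language model can attend over them as a single sequence.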
We validate KG-HoT through comprehensive experiments on GeoQA and GeoQA+ datasets, conducting baseline comparisons, ablation, and further analysis.
• GeoQA [1] contains 5010 geometry multiple-choice problems from Chinese mathematics examinations for grades 6–12, covering angle calculation, length calculation, and other types (e.g., area calculation). Each problem is annotated with simple solution guidance and manually crafted solution programs. The dataset is split into training/validation/test sets with a 7:1.5:1.5 ratio. We utilize the English version provided by [7].
• GeoQA+ [32] extends the GeoQA training set with an additional 2518 problems, totaling 7528, while retaining the original validation and test sets. Compared to GeoQA, this dataset introduces more challenging problems with expanded knowledge coverage and difficulty gradients, maintaining consistent annotation schemes.
Table 1 presents the data splits for both datasets.

Following [1,7,32], we employ accuracy as the evaluation metric and beam search (beam size = 10) for prediction generation. To ensure a fair comparison, answer options are not provided during either training or inference. GPT-4V (gpt-4-vision-preview, temperature = 0.7) serves as the teacher model for generating hybrid chains-of-thought, while the 7B-parameter LLaVA-v1.5 [33] functions as the student model, featuring an architecture comprising the Vicuna-7B-v1.5 language model, CLIP ViT-L/14 visual encoder, and a two-layer MLP projection layer. Both datasets and prompt templates are in English, with the input image resolution set to
To evaluate the comprehensiveness of our 48-concept knowledge base, we analyzed its coverage on the test sets. The knowledge base achieves 98.54% sample-level coverage (98.54% of test samples contain at least one relevant concept) and 99.94% concept-level coverage (99.94% of all required concept instances are covered). For the small fraction of out-of-scope cases, our prompt templates explicitly instruct models to leverage their inherent knowledge, ensuring robust performance across all test problems.
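The two coverage figures can be computed as in the sketch below, which follows the definitions given in the parentheses above. The per-sample concept annotations and the small knowledge base are made-up toy data, and the function name `coverage` is hypothetical.

```python
def coverage(required_per_sample, knowledge_base):
    """Return (sample-level, concept-level) coverage.

    sample-level : share of samples with at least one required concept
                   present in the knowledge base
    concept-level: share of all required concept instances that are present
                   in the knowledge base
    """
    covered_samples = sum(
        1 for required in required_per_sample
        if any(concept in knowledge_base for concept in required)
    )
    total_instances = sum(len(required) for required in required_per_sample)
    covered_instances = sum(
        1 for required in required_per_sample
        for concept in required
        if concept in knowledge_base
    )
    return (covered_samples / len(required_per_sample),
            covered_instances / total_instances)


# Toy data: "Vector Dot Product" is deliberately outside the knowledge base.
kb = {"Parallel Lines", "Inscribed Angles", "Pythagorean theorem"}
test_samples = [
    ["Parallel Lines"],
    ["Inscribed Angles", "Pythagorean theorem"],
    ["Vector Dot Product"],
    ["Parallel Lines", "Vector Dot Product"],
]
sample_level, concept_level = coverage(test_samples, kb)
print(sample_level, concept_level)
```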
We employ LoRA (Low-Rank Adaptation) [34] for efficient fine-tuning on all linear transformation layers (rank = 128, alpha = 256, dropout = 0.05) and the AdamW optimizer [35] (lr = 2e−4,
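To give a sense of what rank = 128 and alpha = 256 imply, the sketch below computes the LoRA scaling factor alpha / rank and compares adapter size with full fine-tuning for one square linear layer. The 4096 x 4096 layer shape is an assumption based on a typical 7B-parameter language model, and the helper names and toy matrices are illustrative only.

```python
def lora_scaling(rank, alpha):
    """LoRA scales the low-rank update B @ A by alpha / rank."""
    return alpha / rank


def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full weight matrix vs. the two LoRA factors."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter


def adapted_weight(weight, lora_b, lora_a, scaling):
    """Compute W' = W + scaling * (B @ A) with matrices as nested lists.

    weight: rows x cols, lora_b: rows x rank, lora_a: rank x cols.
    """
    rows, cols, rank = len(weight), len(weight[0]), len(lora_a)
    return [
        [
            weight[i][j]
            + scaling * sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


scaling = lora_scaling(rank=128, alpha=256)
full, adapter = lora_param_counts(4096, 4096, rank=128)  # assumed layer shape
print(scaling, adapter / full)

# Toy 2 x 2 layer with rank 1: with B initialised to zero, the layer is unchanged.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]
unchanged = adapted_weight(W, [[0.0], [0.0]], A, scaling)
updated = adapted_weight(W, [[1.0], [2.0]], A, scaling)
```

Under the assumed layer shape, the adapter trains 6.25% of the parameters that full fine-tuning of the same layer would update.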
We select the following 10 baseline models and provide human performance on the dataset:
• FiLM [36], which introduces Feature-wise Linear Modulation layers applying affine transformations to intermediate features based on input conditions for conditional visual reasoning.
• RN [37], which designs a plug-and-play Relation Networks module for relational reasoning across visual QA, textual QA, and complex reasoning tasks.
• MCAN [38], which enhances visual question answering performance through deep Modular Co-Attention Networks.
• Seq2Prog+Diagram [1], which builds on the Seq2Prog framework inspired by [39], using attention-based GRU for text encoding, ResNet for image features, and feature concatenation for multimodal fusion.
• BERT2Prog + Diagram [1], which replaces the encoder in Seq2Prog+Diagram with BERT, as derived from Chen et al. [1].
• NGS [1], which introduces Neural Geometry Solver fusing multimodal features via attention and employing two pre-training tasks (geometric jigsaw position and element prediction) for enhanced text-diagram representation.
• Geoformer [7], which unifies geometric calculation and proof as sequence generation with mathematical expression pre-training.
• DPE-GPS [32], which augments training data and employs Dual Parallel text Encoders for processing problem texts of varying lengths.
• SCA-GPS [19], which proposes Symbol Character-Aware modeling enhanced through self-supervised learning and masked image modeling for geometric diagram understanding.
• LLaVA*, which directly applies LLaVA [33] to geometry datasets, trained on GeoQA/GeoQA+ for solution program generation with identical parameters to KG-HoT.
Tables 2 and 3 present performance comparisons across models on GeoQA and GeoQA+, respectively. Bold values in the tables indicate the best performance in each column. KG-HoT achieves the best performance across all problem types, with improvements of 7.2% and 7.0% over the same-architecture LLaVA* baseline, validating the effectiveness of hybrid chain-of-thought learning. The comparative results on GeoQA reveal distinct differences between two modeling paradigms among the baseline models. FiLM, RN, and MCAN employ VQA classification approaches and perform worse, indicating that classification paradigms are unsuitable for complex geometric reasoning. In contrast, solution program generation models (BERT2Prog+Diagram through LLaVA*) demonstrate significant performance gains. Among these, early feature concatenation methods (Seq2Prog, BERT2Prog) show limited effectiveness; NGS and Geoformer utilize the T5 framework, achieving 4.0% and 1.5% improvements through pre-training, respectively; DPE-GPS and SCA-GPS design specialized modules for long problem sequences and symbolic problems, respectively, further enhancing performance. LLaVA*, leveraging powerful foundational capabilities, surpasses all specialized models without custom design, outperforming DPE-GPS and SCA-GPS by 4.1% and 2.7%, respectively. Compared to these models, our proposed KG-HoT achieves a 7.2% improvement. GeoQA+ augments the original dataset with 2518 challenging problems, forming a mixed training set. As shown in Table 3, KG-HoT achieves 75.1% accuracy on this dataset, leading all other methods. Compared to manual annotation of solution programs by experts, our method simply uses prompt templates to generate training data from large models, significantly reducing annotation costs. This approach enables models to learn both human-readable reasoning steps and formalized computational processes while maintaining performance, facilitating broader application deployment.


To validate the contributions of the different chain-of-thought components, we conduct an ablation study that removes one component at a time. Table 4 presents results when removing language-based and program-based chains-of-thought, respectively. When using only program-based chains (removing language-based reasoning), accuracy drops by 6.0% and 5.8% on GeoQA and GeoQA+, respectively, indicating the importance of natural language reasoning for semantic understanding. Similarly, when using only language-based chains (removing program-based reasoning), we observe more significant decreases, highlighting the necessity of structured computation for precise numerical calculations. These consistent results across both datasets confirm that our proposed hybrid chain-of-thought training strategy requires both reasoning paradigms as essential components, with each contributing unique capabilities that jointly enhance the model’s geometric reasoning performance.

4.3.1 Analysis of Hybrid Chain-of-Thought Effectiveness
To verify the synergistic effects of hybrid chains of thought, we design two comparative experiments: 1) KG-HoT with hybrid training (decoding programs only) vs. KG-PoT with program-only training; 2) KG-HoT with hybrid training (decoding language only) vs. KG-CoT with language-only training. As shown in Table 5, hybrid chain-of-thought training outperforms single-type chain training even under a single-output paradigm: for program generation, KG-HoT surpasses KG-PoT by an average of 3.4% (with angle problems showing the highest improvement at 4.6%); for language generation, KG-HoT exceeds KG-CoT by an average of 2.7%. This bidirectional improvement demonstrates synergistic gains between the two chains of thought during training, where learning from both paradigms simultaneously enhances each individual reasoning pathway.

4.3.2 Analysis of Multi-Chain Joint Learning Effectiveness
To validate the effectiveness of end-to-end hybrid chain-of-thought training, we compare it with two-stage training approaches in which auxiliary information is first generated and then used for final reasoning. As shown in Tables 6 and 7, end-to-end KG-HoT significantly outperforms two-stage methods across both output paradigms. For program generation, KG-HoT exceeds “Q-K; QK-PoT” (first generating knowledge concepts), “Q-KT; QKT-PoT” (first generating knowledge concepts and theorems), and “Q-CoT; QCoT-PoT” (first generating language-based chain-of-thought) by 2.5%, 3.7%, and 5.3%, respectively. For language-based generation, KG-HoT similarly surpasses “Q-PoT; QPoT-CoT”. The performance degradation in two-stage methods primarily stems from error propagation, where generation errors in the first stage accumulate and impair second-stage reasoning. In contrast, our end-to-end hybrid training avoids such cascading errors through mutual guidance between different chain-of-thought paradigms, effectively enhancing geometric reasoning capabilities. To illustrate these pipeline variants, we include examples of each permutation’s outputs in the examples folder of https://github.com/jmhz24/HG-HoT.


4.3.3 Generalization to Additional Benchmarks
To evaluate generalizability beyond GeoQA/GeoQA+, we conduct zero-shot tests on two external benchmarks: Geometry3K (601 questions) [40] and MathVista-Geometry (208 geometry problems from MathVista testmini) [41]. As shown in Table 8, our model achieves accuracies of 51.2% on Geometry3K and 62.0% on MathVista-Geometry. We further dissect source-wise performance within MathVista-Geometry in Fig. 8. The subset includes 62 problems each from GeoQA+, UniGeo and Geometry3K, and 22 from GEOS. Model performance declines consistently across sources, dropping from 72.6% on GeoQA+ to 45.5% on GEOS, due to inherent discrepancies in problem design and presentation. Two main reasons account for this performance disparity. First, Geometry3K contains minimal-text questions (e.g., “Find x”) with geometric information embedded solely in diagrams, whereas our training data from GeoQA/GeoQA+ features rich textual descriptions that effectively trigger knowledge retrieval. Second, GEOS features comparative questions (e.g., “Which is greatest?”) demanding multi-choice reasoning. As we exclude answer options during training to avoid implicit guidance, our model is not optimized for such comparative reasoning formats, resulting in degraded performance. These results verify the superiority of our method on text-rich geometric problems. Meanwhile, they highlight the necessity of breaking the reliance on text-only retrieval triggers. Future work will further enhance the model capability to cope with visually implicit problems and comparative reasoning tasks, so as to strengthen generalization across diverse geometric benchmarks.


Figure 8: Question distribution and generalization performance across four geometry datasets in the MathVista-Geometry subset.
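As a consistency check on the source-wise breakdown, the sketch below converts the reported percentages back into question counts. It assumes each reported accuracy is a rounded ratio of integer counts. The combined figure for UniGeo and Geometry3K is implied by this arithmetic and is not a reported result.

```python
# Question counts per source within MathVista-Geometry, as stated in the text.
sources = {"GeoQA+": 62, "UniGeo": 62, "Geometry3K": 62, "GEOS": 22}
total = sum(sources.values())

# Accuracies explicitly reported in the text.
overall_accuracy = 0.620
reported = {"GeoQA+": 0.726, "GEOS": 0.455}

# Assumption: percentages are rounded from integer numbers of correct answers.
correct_overall = round(overall_accuracy * total)
correct_known = {name: round(acc * sources[name]) for name, acc in reported.items()}

# What is left for the two sources whose accuracy is not quoted in the text.
remaining_questions = total - sum(sources[name] for name in reported)
remaining_correct = correct_overall - sum(correct_known.values())
implied_accuracy = remaining_correct / remaining_questions

print(total, correct_overall, correct_known)
print(remaining_correct, remaining_questions, round(implied_accuracy, 3))
```

Under that rounding assumption, the implied combined accuracy on the two remaining sources falls between the GeoQA+ and GEOS figures, in line with the described decline across sources.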
While our hybrid chain-of-thought framework combines linguistic and programmatic reasoning with elaborate data generation and quality control, several limitations remain to be discussed. First, the current framework prioritizes reliable training data construction but lacks explicit self-correction during inference. Although our knowledge-guided generation and multi-stage verification (answer checking, length filtering, iterative validation and manual annotation) effectively mitigate hallucinations in data creation, the model cannot perform real-time error detection for intermediate reasoning steps at inference time. Second, all experiments are conducted on high-quality public datasets with well-organized problem descriptions and standard geometric diagrams. Our framework is therefore not equipped to handle ill-defined, ambiguous or incomplete real-world cases with vague statements, missing diagrams or imprecise visual inputs. Third, the model adopts a simple heuristic to select outputs between language and programmatic reasoning based on answer-option validation, without dynamically evaluating the reliability of each reasoning chain during inference. These limitations point to promising future directions for improving the framework’s adaptability, reasoning reliability and practical applicability in real-world scenarios.
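The third limitation refers to an answer-option validation heuristic without giving its details. The sketch below shows one plausible form of such a rule, purely as an assumption: prefer the program result when it matches an answer option, otherwise fall back to the language-based answer. The function names, the tolerance, and the priority order are hypothetical.

```python
import math


def matches_option(value, options, tol=1e-2):
    """True if the value agrees with any answer option within a tolerance."""
    return value is not None and any(
        math.isclose(value, option, abs_tol=tol) for option in options
    )


def select_answer(program_value, language_value, options):
    """Pick between the two chains by checking them against the options.

    Assumed priority: program result first, then the language answer.
    A value of None means that chain failed to produce a number.
    """
    if matches_option(program_value, options):
        return program_value, "program"
    if matches_option(language_value, options):
        return language_value, "language"
    fallback = program_value if program_value is not None else language_value
    return fallback, "unvalidated"


options = [10.0, 20.0, 30.0, 40.0]
print(select_answer(20.0, 25.0, options))
print(select_answer(None, 30.0, options))
print(select_answer(17.0, 19.0, options))
```

A rule of this kind never inspects the reasoning steps themselves, which is why the text lists the lack of dynamic reliability estimation as a limitation.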
In this work, we propose a knowledge-grounded hybrid chain-of-thought (KG-HoT) framework to tackle high annotation costs and poor interpretability in multimodal geometry problem solving. Our method leverages knowledge-grounded prompts to generate two complementary reasoning chains. The language-based chain conducts progressive three-level reasoning consisting of knowledge recall, theorem application, and solution derivation for interpretability, while the program-based chain enables precise mathematical computation. The two chains are distilled end-to-end into lightweight models for mutual reinforcement and synergistic learning. Experimental results validate that KG-HoT achieves competitive performance across benchmarks and greatly reduces annotation costs. Future work will incorporate self-correction and real-time error detection for intermediate reasoning steps, adapt the framework to ill-defined and ambiguous geometric problems, develop advanced selection strategies for linguistic and programmatic reasoning, and explore knowledge representation to support complex geometric reasoning and cross-domain generalization.
Acknowledgement: The authors sincerely thank their co-authors for their continuous technical support, guidance, and assistance with the experimental equipment used in this work.
Funding Statement: This work is supported by the Youth Science and Technology of Gansu Province (26JRRA493, 24JRRA148), the Northwest Normal University Young Teachers Research Capacity Promotion Plan (NWNU-LKQN2027-19, NWNU-LKQN2024-22).
Author Contributions: Meihuizi Jia: Conceptualization, Methodology, Software, Writing—original draft, Funding acquisition. Hongyan Ran: Supervision, Writing—review & editing, Funding acquisition. Shanshan Li: Supervision. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The code and supplementary resources, including geometric knowledge concepts, theorems, and corresponding Python functions, are publicly available at https://github.com/jmhz24/HG-HoT.
Ethics Approval: This study involved no human participants or animal experiments, and thus ethical review and approval were not required.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Chen J, Tang J, Qin J, Liang X, Liu L, Xing E, et al. GeoQA: a geometric question answering benchmark towards multimodal numerical reasoning. In: Findings of the Association for Computational Linguistics (ACL-IJCNLP). Kerrville, TX, USA: Association for Computational Linguistics; 2021. p. 513–23.
2. Seo M, Hajishirzi H, Farhadi A, Etzioni O, Malcolm C. Solving geometry problems: combining text and diagram interpretation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2015 Sep 17–21; Lisbon, Portugal. p. 1466–76.
3. Zhang ML, Yin F, Liu CL. A multi-modal neural geometric solver with textual clauses parsed from diagram. In: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI); 2023 Aug 19–25; Macao, China. p. 3374–82.
4. Sachan M, Dubey K, Xing E. From textbooks to knowledge: a case study in harvesting axiomatic knowledge from textbooks to solve geometry problems. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2017 Sep 9–11; Copenhagen, Denmark. p. 773–84.
5. Zhang X, Zhu N, He Y, Zou J, Huang Q, Jin X, et al. FormalGeo: the first step toward human-like IMO-level geometric automated reasoning. arXiv:2310.18021. 2023.
6. Trinh TH, Wu Y, Le QV, He H, Luong T. Solving olympiad geometry without human demonstrations. Nature. 2024;625:476–82.
7. Chen J, Li T, Qin J, Lu P, Lin L, Chen C, et al. Unifying geometry logical reasoning via reformulating mathematical expression. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2022 Dec 7–11; Abu Dhabi, United Arab Emirates. p. 3313–23.
8. Liang Z, Yang T, Zhang J, Zhang X. UniMath: a foundational and multimodal mathematical reasoner. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2023 Dec 6–10; Singapore. p. 7126–33.
9. Yue X, Qu X, Zhang G, Fu Y, Huang W, Sun H, et al. MAmmoTH: building math generalist models through hybrid instruction tuning. In: The Twelfth International Conference on Learning Representations (ICLR); 2024 May 7–11; Vienna, Austria.
10. Comanici G, Bieber E, Schaekermann M, Pasupat I, Sachdeva N, Dhillon I, et al. Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261. 2025.
11. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst. 2022;35:24824–37. doi:10.52202/068431-1800.
12. Chen W, Ma X, Wang X, Cohen WW. Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks. Trans Mach Learn Res. 2023 [cited 2026 Apr 1]. Available from: https://openreview.net/forum?id=YfZ4ZPt8zd.
13. Yu J, He R, Ying Z. Thought propagation: an analogical approach to complex reasoning with large language models. In: The Twelfth International Conference on Learning Representations (ICLR); 2024 May 7–11; Vienna, Austria.
14. Wang PY, Liu TS, Wang C, Li Z, Wang Y, Yan S, et al. A survey on large language models for mathematical reasoning. ACM Comput Surv. 2026;58(8):209. doi:10.1145/3786333.
15. Yao Y, Li Z, Zhao H. Beyond chain-of-thought, effective graph-of-thought reasoning in large language models. arXiv:2305.16582. 2023.
16. Rose D, Himakunthala V, Ouyang A, He R, Mei A, Lu Y, et al. Visual chain of thought: bridging logical gaps with multimodal infillings. arXiv:2305.02317. 2023.
17. Wang L, Hu Y, He J, Xu X, Liu N, Liu H, et al. T-SciQ: teaching multimodal chain-of-thought reasoning via large language model signals for science question answering. In: Thirty-Eighth AAAI Conference on Artificial Intelligence; 2024 Feb 20–27; Vancouver, BC, Canada. p. 19162–70.
18. Zhang Z, Zhang A, Li M, Zhao H, Karypis G, Smola A. Multimodal chain-of-thought reasoning in language models. Trans Mach Learn Res. 2024. doi:10.59350/73qcj-wyt28.
19. Ning M, Wang QF, Huang K, Huang X. A symbolic characters aware model for solving geometry problems. In: Proceedings of the 31st ACM International Conference on Multimedia; 2023 Oct 29–Nov 3; Ottawa, ON, Canada. p. 7767–75.
20. van den Oord A, Vinyals O, Kavukcuoglu K. Neural discrete representation learning. In: 31st Conference on Neural Information Processing Systems (NeurIPS); 2017 Dec 4–9; Long Beach, CA, USA. p. 6309–18.
21. Razavi A, van den Oord A, Vinyals O. Generating diverse high-fidelity images with VQ-VAE-2. In: 33rd Conference on Neural Information Processing Systems (NeurIPS); 2019 Dec 8–14; Vancouver, BC, Canada. p. 14866–76.
22. Zhao Y, Wang X, Liu J, King I, Huang Z. Towards geometry problem solving in the large model era: a survey. arXiv:2506.02690. 2025.
23. Gao J, Pi R, Zhang J, Ye J, Zhong W, Wang Y, et al. G-LLaVA: solving geometric problem with multi-modal large language model. arXiv:2312.11370. 2023.
24. Kazemi M, Alvari H, Anand A, Wu J, Chen X, Soricut R. GeomVerse: a systematic evaluation of large models for geometric reasoning. arXiv:2312.12241. 2023.
25. Ning M, Zhou Z, Wang Q, Huang X, Huang K. GNS: solving plane geometry problems by neural-symbolic reasoning with multi-modal LLMs. In: Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence, Fifteenth Symposium on Educational Advances in Artificial Intelligence; 2025 Feb 25–Mar 4; Philadelphia, PA, USA. p. 24957–65.
26. Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. Adv Neural Inf Process Syst. 2022;35:22199–213. doi:10.52202/068431-1613.
27. Wang X, Wei J, Schuurmans D, Le QV, Chi EH, Narang S, et al. Self-consistency improves chain of thought reasoning in language models. In: The Eleventh International Conference on Learning Representations (ICLR); 2023 May 1–5; Kigali, Rwanda.
28. Zhang Z, Zhang A, Li M, Smola A. Automatic chain of thought prompting in large language models. In: The Eleventh International Conference on Learning Representations (ICLR); 2023 May 1–5; Kigali, Rwanda.
29. Yao S, Yu D, Zhao J, Shafran I, Griffiths TL, Cao Y, et al. Tree of thoughts: deliberate problem solving with large language models. In: 37th Conference on Neural Information Processing Systems 2023 (NeurIPS); 2023 Dec 10–16; New Orleans, LA, USA. p. 11809–22.
30. Besta M, Blach N, Kubicek A, Gerstenberger R, Podstawski M, Gianinazzi L, et al. Graph of thoughts: solving elaborate problems with large language models. In: Thirty-Eighth AAAI Conference on Artificial Intelligence; 2024 Feb 20–27; Vancouver, BC, Canada. p. 17682–90.
31. Magister LC, Mallinson J, Adamek J, Malmi E, Severyn A. Teaching small language models to reason. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); 2023 Jul 9–14; Toronto, ON, Canada. p. 1773–81.
32. Cao J, Xiao J. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In: Proceedings of the 29th International Conference on Computational Linguistics (COLING); 2022 Oct 12–17; Gyeongju, Republic of Korea. p. 1511–20.
33. Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. In: 37th Conference on Neural Information Processing Systems (NeurIPS); 2023 Dec 10–16; New Orleans, LA, USA.
34. Yu Y, Yang CHH, Kolehmainen J, Shivakumar PG, Gu Y, Ren SRR, et al. Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition. In: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU); 2023 Dec 16–20; Taipei, Taiwan. p. 1–8.
35. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR); 2015 May 7–9; San Diego, CA, USA.
36. Perez E, Strub F, De Vries H, Dumoulin V, Courville A. FiLM: visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2018 Feb 2–7; New Orleans, LA, USA. p. 3942–51.
37. Santoro A, Raposo D, Barrett DG, Malinowski M, Pascanu R, Battaglia P, et al. A simple neural network module for relational reasoning. In: 31st Conference on Neural Information Processing Systems (NeurIPS); 2017 Dec 4–9; Long Beach, CA, USA. p. 4967–76.
38. Yu Z, Yu J, Cui Y, Tao D, Tian Q. Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15–20; Long Beach, CA, USA. p. 6281–90.
39. Amini A, Gabriel S, Lin S, Koncel-Kedziorski R, Choi Y, Hajishirzi H. MathQA: towards interpretable math word problem solving with operation-based formalisms. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT); 2019 Jun 2–7; Minneapolis, MN, USA. p. 2357–67.
40. Lu P, Gong R, Jiang S, Qiu L, Huang S, Liang X, et al. Inter-GPS: interpretable geometry problem solving with formal language and symbolic reasoning. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP); 2021 Aug 1–6; Virtual. p. 6774–86.
41. Lu P, Bansal H, Xia T, Liu J, Li C, Hajishirzi H, et al. MathVista: evaluating mathematical reasoning of foundation models in visual contexts. In: The Twelfth International Conference on Learning Representations (ICLR); 2024 May 7–11; Vienna, Austria.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

