Open Access
ARTICLE
Intra-Video Temporal-Aware RAG: A Self-Contained Framework for Video-Based Question Answering
1 Department of Computing and Technology, Islamabad Campus, Iqra University, Islamabad, Pakistan
2 School of Computing, Ulster University, Belfast, UK
3 Department of Computer Networks and Communication, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa, Saudi Arabia
4 Department of Computer Science, CTL Eurocollege, Limassol, Cyprus
5 Department of Computer Science, American University of Cyprus, Larnaca, Cyprus
6 International Digital Economy College, Minjiang University, Fuzhou, China
* Corresponding Author: Munam Ali Shah. Email:
(This article belongs to the Special Issue: Generative Artificial Intelligence and Large Language Models: Methods, Architectures, and Applications)
Computers, Materials & Continua 2026, 88(2), 96 https://doi.org/10.32604/cmc.2026.081534
Received 09 March 2026; Accepted 21 April 2026; Issue published 15 June 2026
Abstract
Lecture videos are widely used in modern education, yet answering questions from them remains challenging. Relevant information is often distributed across time and expressed through multiple modalities, including speech, slides, and visual content. Existing VideoQA approaches, including recent retrieval-augmented generation (RAG) methods, typically rely on static text representations or global video features. Consequently, they may retrieve evidence that is semantically relevant but temporally misaligned, leading to inaccurate or weakly grounded responses. In addition, dependence on external knowledge sources can introduce hallucinations and reduce reliability in educational settings. To address these limitations, we propose a temporally aware, intra-video RAG framework tailored for lecture videos. The approach aligns automatic speech transcripts and visual captions into timestamped segments and performs retrieval constrained by temporal boundaries. Retrieved segments are further refined using a cross-encoder before answer generation, ensuring that responses are grounded in the correct portions of the video. We evaluate the proposed method on the LectQA-Vid dataset, consisting of 100 lecture videos and 3000 temporally annotated questions. Experimental results demonstrate improved factual alignment and robustness over non-temporal baselines, highlighting the importance of temporal grounding in lecture VideoQA.
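The retrieval stages named in the abstract (timestamped segments, temporally constrained retrieval, reranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Segment` fields, the function names, and the time-window interface are assumptions made for illustration, and simple token overlap stands in for both the embedding similarity and the cross-encoder that a real system would use.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Segment:
    """One timestamped unit of a lecture video (hypothetical schema)."""
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds
    speech: str    # speech transcript text for this window
    caption: str   # visual caption text for this window

    @property
    def text(self) -> str:
        # Speech and caption are merged into one retrievable string.
        return f"{self.speech} {self.caption}"


def tokens(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def overlap_score(query: str, text: str) -> float:
    # Jaccard overlap: a toy stand-in for embedding similarity.
    q, t = tokens(query), tokens(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(query: str, segments: list[Segment],
             window: tuple[float, float], top_k: int = 2) -> list[Segment]:
    """Rank only the segments that overlap the given time window."""
    lo, hi = window
    in_window = [s for s in segments if s.start < hi and s.end > lo]
    ranked = sorted(in_window,
                    key=lambda s: overlap_score(query, s.text),
                    reverse=True)
    return ranked[:top_k]


def rerank(query: str, candidates: list[Segment],
           pair_scorer: Callable[[str, str], float]) -> list[Segment]:
    """Reorder candidates by a joint (query, segment) score.

    In a real system pair_scorer would wrap a cross-encoder model;
    here any callable returning a float is accepted.
    """
    return sorted(candidates,
                  key=lambda s: pair_scorer(query, s.text),
                  reverse=True)


def coverage(query: str, text: str) -> float:
    # Fraction of query tokens found in the segment (toy pair scorer).
    q = tokens(query)
    return len(q & tokens(text)) / len(q) if q else 0.0
```

In this sketch, a segment outside the requested window is never scored, so a passage that is lexically similar but belongs to a different part of the lecture cannot be returned. How the temporal boundaries are actually chosen in the proposed framework is not specified in the abstract.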
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.