Intra-Video Temporal-Aware RAG: A Self-Contained Framework for Video-Based Question Answering

Sumaira Shafiq; Naveed Ejaz; Munam Shah; Rashid Kamal; Adnan Sohail; Sheraz Aslam

doi:10.32604/cmc.2026.081534

Open Access

ARTICLE


Sumaira Shafiq¹, Naveed Ejaz², Munam Ali Shah³,*, Rashid Kamal², Adnan Sohail¹, Sheraz Aslam⁴,⁵,⁶

1 Department of Computing and Technology, Islamabad Campus, Iqra University, Islamabad, Pakistan
2 School of Computing, Ulster University, Belfast, UK
3 Department of Computer Networks and Communication, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa, Saudi Arabia
4 Department of Computer Science, CTL Eurocollege, Limassol, Cyprus
5 Department of Computer Science, American University of Cyprus, Larnaca, Cyprus
6 International Digital Economy College, Minjiang University, Fuzhou, China

* Corresponding Author: Munam Ali Shah. Email: email

(This article belongs to the Special Issue: Generative Artificial Intelligence and Large Language Models: Methods, Architectures, and Applications)

Computers, Materials & Continua 2026, 88(2), 96 https://doi.org/10.32604/cmc.2026.081534

Received 09 March 2026; Accepted 21 April 2026; Issue published 15 June 2026

Abstract

Lecture videos are widely used in modern education, yet answering questions from them remains challenging. Relevant information is often distributed across time and expressed through multiple modalities, including speech, slides, and visual content. Existing VideoQA approaches, including recent retrieval-augmented generation (RAG) methods, typically rely on static text representations or global video features. Consequently, they may retrieve evidence that is semantically relevant but temporally misaligned, leading to inaccurate or weakly grounded responses. In addition, dependence on external knowledge sources can introduce hallucinations and reduce reliability in educational settings. To address these limitations, we propose a temporally aware, intra-video RAG framework tailored for lecture videos. The approach aligns automatic speech transcripts and visual captions into timestamped segments and performs retrieval constrained by temporal boundaries. Retrieved segments are further refined using a cross-encoder before answer generation, ensuring that responses are grounded in the correct portions of the video. We evaluate the proposed method on the LectQA-Vid dataset, consisting of 100 lecture videos and 3000 temporally annotated questions. Experimental results demonstrate improved factual alignment and robustness over non-temporal baselines, highlighting the importance of temporal grounding in lecture VideoQA.
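The pipeline summarized above (timestamped alignment of speech and captions, followed by temporally constrained retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed 30-second window, the `Segment` class, the `align` and `retrieve` helpers, and the token-overlap score are all assumptions chosen for illustration. The toy score stands in for the embedding retrieval and cross-encoder refinement that the abstract mentions, and the sample transcript and captions are invented.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # window start, seconds
    end: float    # window end, seconds
    text: str     # speech and caption text falling in the window


def align(transcript, captions, window=30.0):
    """Bucket (start_sec, text) items from both modalities into fixed windows."""
    buckets = {}
    for start, text in sorted(list(transcript) + list(captions)):
        buckets.setdefault(int(start // window), []).append(text)
    return [
        Segment(i * window, (i + 1) * window, " ".join(texts))
        for i, texts in sorted(buckets.items())
    ]


def overlap_score(query, text):
    """Toy relevance score (assumption): number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, segments, t_min, t_max, k=2, score=overlap_score):
    """Rank only the segments that overlap the time range [t_min, t_max]."""
    in_range = [s for s in segments if s.start < t_max and s.end > t_min]
    return sorted(in_range, key=lambda s: score(query, s.text), reverse=True)[:k]


# Invented sample data: (start time in seconds, text)
transcript = [(5.0, "gradient descent updates the weights"),
              (40.0, "the learning rate controls step size"),
              (95.0, "gradient descent can overshoot the minimum")]
captions = [(10.0, "slide: gradient descent formula"),
            (42.0, "slide: learning rate schedule")]

segments = align(transcript, captions)
hits = retrieve("gradient descent", segments, t_min=0.0, t_max=60.0, k=1)
```

In this toy run the segment starting at 90 s also mentions gradient descent, but it lies outside the requested time range and is never scored. That is the kind of semantically relevant but temporally misaligned evidence the temporal constraint is meant to exclude. A real system would presumably pass a learned scorer through the `score` parameter.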

Keywords

Video question answering; retrieval-augmented generation; temporal grounding; multimodal retrieval; educational videos; Whisper ASR; visual captioning; large language models; explainable AI; timestamped evidence

Cite This Article

APA Style

Shafiq, S., Ejaz, N., Shah, M.A., Kamal, R., Sohail, A. et al. (2026). Intra-Video Temporal-Aware RAG: A Self-Contained Framework for Video-Based Question Answering. Computers, Materials & Continua, 88(2), 96. https://doi.org/10.32604/cmc.2026.081534

Vancouver Style

Shafiq S, Ejaz N, Shah MA, Kamal R, Sohail A, Aslam S. Intra-Video Temporal-Aware RAG: A Self-Contained Framework for Video-Based Question Answering. Comput Mater Contin. 2026;88(2):96. https://doi.org/10.32604/cmc.2026.081534

IEEE Style

S. Shafiq, N. Ejaz, M. A. Shah, R. Kamal, A. Sohail, and S. Aslam, “Intra-Video Temporal-Aware RAG: A Self-Contained Framework for Video-Based Question Answering,” Comput. Mater. Contin., vol. 88, no. 2, pp. 96, 2026. https://doi.org/10.32604/cmc.2026.081534


Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
