Open Access

ARTICLE

Intra-Video Temporal-Aware RAG: A Self-Contained Framework for Video-Based Question Answering

Sumaira Shafiq1, Naveed Ejaz2, Munam Ali Shah3,*, Rashid Kamal2, Adnan Sohail1, Sheraz Aslam4,5,6
1 Department of Computing and Technology, Islamabad Campus, Iqra University, Islamabad, Pakistan
2 School of Computing, Ulster University, Belfast, UK
3 Department of Computer Networks and Communication, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa, Saudi Arabia
4 Department of Computer Science, CTL Eurocollege, Limassol, Cyprus
5 Department of Computer Science, American University of Cyprus, Larnaca, Cyprus
6 International Digital Economy College, Minjiang University, Fuzhou, China
* Corresponding Author: Munam Ali Shah.
(This article belongs to the Special Issue: Generative Artificial Intelligence and Large Language Models: Methods, Architectures, and Applications)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.081534

Received 09 March 2026; Accepted 21 April 2026; Published online 03 June 2026

Abstract

Lecture videos are widely used in modern education, yet answering questions from them remains challenging. Relevant information is often distributed across time and expressed through multiple modalities, including speech, slides, and visual content. Existing VideoQA approaches, including recent retrieval-augmented generation (RAG) methods, typically rely on static text representations or global video features. Consequently, they may retrieve evidence that is semantically relevant but temporally misaligned, leading to inaccurate or weakly grounded responses. In addition, dependence on external knowledge sources can introduce hallucinations and reduce reliability in educational settings. To address these limitations, we propose a temporally aware, intra-video RAG framework tailored for lecture videos. The approach aligns automatic speech transcripts and visual captions into timestamped segments and performs retrieval constrained by temporal boundaries. Retrieved segments are further refined using a cross-encoder before answer generation, ensuring that responses are grounded in the correct portions of the video. We evaluate the proposed method on the LectQA-Vid dataset, consisting of 100 lecture videos and 3000 temporally annotated questions. Experimental results demonstrate improved factual alignment and robustness over non-temporal baselines, highlighting the importance of temporal grounding in lecture VideoQA.
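
The pipeline summarized in the abstract (timestamped alignment of transcript and caption text, followed by retrieval restricted to a temporal window) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the article: the `Segment` class, the fixed 30-second window, the `align` and `retrieve` helpers, and the sample data are all hypothetical. A simple token-overlap score stands in for the article's retriever and cross-encoder reranker, which are not specified in this section.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # merged transcript and caption text


def align(transcript, captions, window=30.0):
    """Merge (timestamp, text) items from both modalities into fixed windows.

    The fixed window length is an illustrative assumption.
    """
    buckets = {}
    for t, text in list(transcript) + list(captions):
        buckets.setdefault(int(t // window), []).append((t, text))
    segments = []
    for idx in sorted(buckets):
        items = sorted(buckets[idx])  # chronological order inside the window
        merged = " ".join(text for _, text in items)
        segments.append(Segment(idx * window, (idx + 1) * window, merged))
    return segments


def overlap_score(query, text):
    """Fraction of query tokens found in the text (stand-in for a learned scorer)."""
    q = set(query.lower().split())
    d = set(text.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve(query, segments, t_min, t_max, k=2):
    """Return the top-k segments that overlap the interval [t_min, t_max]."""
    in_window = [s for s in segments if s.end > t_min and s.start < t_max]
    ranked = sorted(in_window, key=lambda s: overlap_score(query, s.text),
                    reverse=True)
    return ranked[:k]


# Hypothetical sample data: (timestamp in seconds, text)
transcript = [
    (5.0, "gradient descent updates the weights"),
    (40.0, "the learning rate controls step size"),
    (95.0, "gradient descent can diverge"),
]
captions = [
    (10.0, "slide: gradient descent"),
    (45.0, "slide: learning rate"),
]

segments = align(transcript, captions)
early_hits = retrieve("gradient descent", segments, t_min=0.0, t_max=60.0, k=1)
late_hits = retrieve("gradient descent", segments, t_min=90.0, t_max=120.0, k=1)
```

In this toy setup, the same query returns different segments depending on the temporal window: the segment near the start of the lecture for the first call and the one beginning at 90 seconds for the second. This is the behaviour the abstract contrasts with retrieval that is semantically relevant but temporally misaligned.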

Keywords

Video question answering; retrieval-augmented generation; temporal grounding; multimodal retrieval; educational videos; Whisper ASR; visual captioning; large language models; explainable AI; timestamped evidence