Intra-Video Temporal-Aware RAG: A Self-Contained Framework for Video-Based Question Answering

Sumaira Shafiq¹, Naveed Ejaz², Munam Ali Shah³,*, Rashid Kamal², Adnan Sohail¹, Sheraz Aslam⁴,⁵,⁶
1 Department of Computing and Technology, Islamabad Campus, Iqra University, Islamabad, Pakistan
2 School of Computing, Ulster University, Belfast, UK
3 Department of Computer Networks and Communication, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa, Saudi Arabia
4 Department of Computer Science, CTL Eurocollege, Limassol, Cyprus
5 Department of Computer Science, American University of Cyprus, Larnaca, Cyprus
6 International Digital Economy College, Minjiang University, Fuzhou, China
* Corresponding Author: Munam Ali Shah. Email: email
(This article belongs to the Special Issue: Generative Artificial Intelligence and Large Language Models: Methods, Architectures, and Applications)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.081534

Received 09 March 2026; Accepted 21 April 2026; Published online 03 June 2026

Abstract

Lecture videos are widely used in modern education, yet answering questions from them remains challenging. Relevant information is often distributed across time and expressed through multiple modalities, including speech, slides, and visual content. Existing VideoQA approaches, including recent retrieval-augmented generation (RAG) methods, typically rely on static text representations or global video features. Consequently, they may retrieve evidence that is semantically relevant but temporally misaligned, leading to inaccurate or weakly grounded responses. In addition, dependence on external knowledge sources can introduce hallucinations and reduce reliability in educational settings. To address these limitations, we propose a temporally aware, intra-video RAG framework tailored for lecture videos. The approach aligns automatic speech transcripts and visual captions into timestamped segments and performs retrieval constrained by temporal boundaries. Retrieved segments are further refined using a cross-encoder before answer generation, ensuring that responses are grounded in the correct portions of the video. We evaluate the proposed method on the LectQA-Vid dataset, consisting of 100 lecture videos and 3000 temporally annotated questions. Experimental results demonstrate improved factual alignment and robustness over non-temporal baselines, highlighting the importance of temporal grounding in lecture VideoQA.
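The pipeline described above (timestamped alignment of transcript and captions, followed by temporally constrained retrieval) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Segment`, `align`, `overlap_score`, and `retrieve`, the fixed-window alignment, and the sample data are all hypothetical, and a toy token-overlap score stands in for both the retriever and the cross-encoder reranker.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # merged transcript and caption text


def align(transcript, captions, window=30.0):
    """Bucket (timestamp, text) items from both sources into fixed windows.

    Fixed-size windows are an assumption made for illustration only.
    """
    buckets = {}
    for t, text in list(transcript) + list(captions):
        buckets.setdefault(int(t // window), []).append((t, text))
    segments = []
    for idx in sorted(buckets):
        items = sorted(buckets[idx])  # chronological order inside a window
        merged = " ".join(text for _, text in items)
        segments.append(Segment(idx * window, (idx + 1) * window, merged))
    return segments


def overlap_score(query, text):
    """Toy relevance score: fraction of query tokens found in the text.

    Stand-in for a learned retriever or cross-encoder (assumption).
    """
    q = set(query.lower().split())
    d = set(text.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve(query, segments, t_min, t_max, k=2):
    """Keep segments overlapping [t_min, t_max], then rank them by score."""
    in_range = [s for s in segments if s.end > t_min and s.start < t_max]
    ranked = sorted(in_range, key=lambda s: overlap_score(query, s.text),
                    reverse=True)
    return ranked[:k]


# Hypothetical sample data: (timestamp in seconds, text)
transcript = [
    (5.0, "gradient descent updates the weights"),
    (65.0, "the learning rate controls step size"),
    (125.0, "gradient descent can overshoot"),
]
captions = [
    (10.0, "slide showing a loss curve"),
    (70.0, "slide titled learning rate"),
]

segments = align(transcript, captions, window=60.0)
hits = retrieve("gradient descent", segments, t_min=100.0, t_max=180.0, k=1)
for s in hits:
    print(f"[{s.start:.0f}s-{s.end:.0f}s] {s.text}")
```

In this sketch the first window also mentions gradient descent, but it lies outside the queried time range and is never ranked. That is the kind of semantically relevant but temporally misaligned evidence the abstract says the temporal constraint is meant to exclude.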

Keywords

Video question answering; retrieval-augmented generation; temporal grounding; multimodal retrieval; educational videos; Whisper ASR; visual captioning; large language models; explainable AI; timestamped evidence
