Submission Deadline: 31 January 2027
Dr. Bacha Rehman
Email: bacha.rehman@solent.ac.uk
Affiliation: School of Technology and Maritime Industries, Southampton Solent University, Southampton, United Kingdom
Research Interests: affective computing, hybrid DNNs, multimodal systems, computer vision applications

Dr. Sadaqat Rehman
Email: s.rehman15@salford.ac.uk
Affiliation: School of Science, Engineering & Environment, Manchester, UK
Research Interests: deep learning, machine learning, image classification

Foundation models—large-scale neural networks trained on broad data via self-supervision—have emerged as a transformative paradigm in artificial intelligence. In computer vision, models such as CLIP, DINO, SAM, and BLIP have demonstrated unprecedented generalization by learning transferable visual representations from massive unlabelled or weakly labelled data. Concurrently, the integration of vision with language, audio, and action modalities is giving rise to large multimodal systems that enable robust visual intelligence across diverse real-world scenarios. These advances are reshaping fields from autonomous systems and robotics to healthcare and human-computer interaction.
This Special Issue aims to advance the theory, architecture, and application of foundation models and large multimodal systems specifically oriented toward visual intelligence. We seek high-quality contributions on novel pre-training strategies, efficient adaptation mechanisms (e.g., prompt learning, parameter-efficient fine-tuning), cross-modal representation learning, trustworthy and explainable visual AI, and deployment on resource-constrained devices. The issue particularly welcomes work that connects core visual perception with human-centred applications, including affective computing, emotion recognition, assistive technologies, and human-robot interaction, while remaining open to broader computer vision and multimodal learning research.
· Self-supervised and weakly supervised visual pre-training at scale
· Vision-language, vision-audio, and vision-action foundation models
· Efficient adaptation, prompt engineering, and in-context learning for visual tasks
· Multimodal fusion, alignment, and unified representation learning
· Trustworthiness, explainability, fairness, and bias mitigation in visual foundation models
· Model compression, quantization, and edge-AI deployment for visual systems
· Applications in human-robot interaction, emotion recognition, smart environments, and healthcare imaging

