Foundation Models and Large Multimodal Systems for Visual Intelligence

Submission Deadline: 31 January 2027

Guest Editor(s)

Dr. Bacha Rehman

Email: bacha.rehman@solent.ac.uk

Affiliation: School of Technology and Maritime Industries, Southampton Solent University, Southampton, United Kingdom

Research Interests: affective computing, hybrid DNNs, multimodal systems, computer vision applications


Dr. Sadaqat Rehman

Email: s.rehman15@salford.ac.uk

Affiliation: School of Science, Engineering & Environment, Manchester, UK

Research Interests: deep learning, machine learning, image classification


Summary

Foundation models—large-scale neural networks trained on broad data via self-supervision—have emerged as a transformative paradigm in artificial intelligence. In computer vision, models such as CLIP, DINO, SAM, and BLIP have demonstrated unprecedented generalization by learning transferable visual representations from massive unlabelled or weakly labelled data. Concurrently, the integration of vision with language, audio, and action modalities is giving rise to large multimodal systems that enable robust visual intelligence across diverse real-world scenarios. These advances are reshaping fields from autonomous systems and robotics to healthcare and human-computer interaction.

This Special Issue aims to advance the theory, architecture, and application of foundation models and large multimodal systems specifically oriented toward visual intelligence. We seek high-quality contributions on novel pre-training strategies, efficient adaptation mechanisms (e.g., prompt learning, parameter-efficient fine-tuning), cross-modal representation learning, trustworthy and explainable visual AI, and deployment on resource-constrained devices. The issue particularly welcomes work that connects core visual perception with human-centered applications, including affective computing, emotion recognition, assistive technologies, and human-robot interaction, while remaining open to broader computer vision and multimodal learning research.

Topics of interest include, but are not limited to:

· Self-supervised and weakly supervised visual pre-training at scale
· Vision-language, vision-audio, and vision-action foundation models
· Efficient adaptation, prompt engineering, and in-context learning for visual tasks
· Multimodal fusion, alignment, and unified representation learning
· Trustworthiness, explainability, fairness, and bias mitigation in visual foundation models
· Model compression, quantization, and edge-AI deployment for visual systems
· Applications in human-robot interaction, emotion recognition, smart environments, and healthcare imaging


Keywords

foundation models, multimodal learning, visual intelligence, computer vision, vision-language models, self-supervised learning, affective computing, human-robot interaction, neural architecture, transfer learning.
