Foundation Models and Large Multimodal Systems for Visual Intelligence

Submission Deadline: 31 January 2027

Guest Editor(s)

Dr. Bacha Rehman

Email: bacha.rehman@solent.ac.uk

Affiliation: School of Technology and Maritime Industries, Southampton Solent University, Southampton, United Kingdom

Research Interests: affective computing, hybrid DNNs, multimodal systems, computer vision applications

Dr. Sadaqat Rehman

Email: s.rehman15@salford.ac.uk

Affiliation: School of Science, Engineering & Environment, University of Salford, Manchester, UK

Research Interests: deep learning, machine learning, image classification

Summary

Foundation models—large-scale neural networks trained on broad data via self-supervision—have emerged as a transformative paradigm in artificial intelligence. In computer vision, models such as CLIP, DINO, SAM, and BLIP have demonstrated unprecedented generalization by learning transferable visual representations from massive unlabelled or weakly labelled data. Concurrently, the integration of vision with language, audio, and action modalities is giving rise to large multimodal systems that enable robust visual intelligence across diverse real-world scenarios. These advances are reshaping fields from autonomous systems and robotics to healthcare and human-computer interaction.

This Special Issue aims to advance the theory, architecture, and application of foundation models and large multimodal systems specifically oriented toward visual intelligence. We seek high-quality contributions on novel pre-training strategies, efficient adaptation mechanisms (e.g., prompt learning, parameter-efficient fine-tuning), cross-modal representation learning, trustworthy and explainable visual AI, and deployment on resource-constrained devices. The issue particularly welcomes work that connects core visual perception with human-centred applications, including affective computing, emotion recognition, assistive technologies, and human-robot interaction, while remaining open to broader computer vision and multimodal learning research.

Topics of interest include, but are not limited to:

· Self-supervised and weakly supervised visual pre-training at scale
· Vision-language, vision-audio, and vision-action foundation models
· Efficient adaptation, prompt engineering, and in-context learning for visual tasks
· Multimodal fusion, alignment, and unified representation learning
· Trustworthiness, explainability, fairness, and bias mitigation in visual foundation models
· Model compression, quantization, and edge-AI deployment for visual systems
· Applications in human-robot interaction, emotion recognition, smart environments, and healthcare imaging

Keywords

foundation models, multimodal learning, visual intelligence, computer vision, vision-language models, self-supervised learning, affective computing, human-robot interaction, neural architecture, transfer learning.
