Multimodal Vision with Large Language Models

Submission Deadline: 31 December 2026

Guest Editor(s)

Prof. Dr. Ahmad Taher Azar

Email: aazar@psu.edu.sa

Affiliation: Automated Systems and Computing Lab (ASCL), Prince Sultan University, Riyadh, Saudi Arabia

Research Interests: artificial intelligence, robotics, control theory & applications, reinforcement learning, computational intelligence

Dr. Weiwei Jiang

Email: jww@bupt.edu.cn

Affiliation: School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China

Research Interests: artificial intelligence, machine learning, big data, wireless communication and edge computing

Summary

The field of artificial intelligence is undergoing a major shift as large language models (LLMs) converge with computer vision. This fusion, known as multimodal learning, is pushing the boundaries of how machines perceive, interpret, and interact with the visual world. This special issue, titled "Multimodal Vision with Large Language Models," is dedicated to exploring that synergy. It aims to collect cutting-edge research that moves beyond siloed, single-modality approaches toward unified models that reason jointly over visual and textual data to achieve a deeper, more contextual, and semantically grounded understanding.

We seek to investigate the full spectrum of this integration, from novel neural architectures that blend visual and linguistic features to training paradigms that leverage the complementary strengths of both modalities. Capabilities of interest include generating natural language descriptions from images, answering intricate questions about visual content, following open-ended linguistic instructions to manipulate visual data, and creating realistic imagery from text prompts. This issue will also address the critical challenges inherent in this endeavor, including computational efficiency, the mitigation of hallucinations in model outputs, data scarcity, and the development of robust evaluation metrics.

Topics of Interest:
We invite the submission of high-quality, original research articles and comprehensive reviews on topics including, but not limited to:
· Architectures for Multimodal Fusion: Novel models for integrating visual encoders (e.g., ViT, CNN) with large language models (e.g., GPT, LLaMA).
· Vision-Language Pre-training (VLP): Strategies for large-scale pre-training on aligned image-text data and efficient fine-tuning for downstream tasks.
· Generative Multimodal Models: Advanced text-to-image generation, text-guided image/video editing, and controllable synthesis.
· Interpretable and Explainable Multimodal AI: Techniques to understand and visualize the reasoning processes of complex vision-language models.
· Efficiency and Scalability: Methods for compressing, distilling, and accelerating large-scale multimodal models for practical deployment.
· Reasoning and Knowledge Grounding: Enhancing models with commonsense reasoning, factual knowledge, and spatial understanding for complex visual question answering (VQA) and dialogue.
· Domain-Specific Applications:
  · Healthcare: Automated radiology report generation, medical visual question answering.
  · Autonomous Systems: Enhanced scene understanding and decision-making for robotics and self-driving cars.
  · Accessibility: Advanced tools for image captioning and visual assistance for the visually impaired.

This special issue will serve as a pivotal platform for researchers at the forefront of multimodal AI. It provides an international forum to present groundbreaking work that bridges the computer vision and natural language processing communities. By bringing together diverse expertise, this collection aims to define the state of the art, address fundamental challenges, and chart the future direction of intelligent systems that can truly see and talk about our world. Contributions to this issue will be instrumental in shaping the next wave of AI applications and research.
