Huu-Tuong Ho1,#, Luong Vuong Nguyen1,#, Minh-Tien Pham1, Quang-Huy Pham1, Quang-Duong Tran1, Duong Nguyen Minh Huy2, Tri-Hai Nguyen3,*
CMC-Computers, Materials & Continua, Vol. 82, No. 2, pp. 1733-1756, 2025, DOI: 10.32604/cmc.2025.060363
17 February 2025
Abstract: In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across natural language processing tasks such as visual question answering, and computer vision applications such as image captioning and image-text retrieval, highlighting their adaptability to complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess the strengths, limitations, and applicability of VLMs across tasks, while examining challenges such as scalability, data quality, and fine-tuning complexity.