TY - EJOU
AU - Elbedwehy, Samar
AU - Medhat, T.
AU - Hamza, Taher
AU - Alrahmawy, Mohammed F.
TI - Enhanced Image Captioning Using Features Concatenation and Efficient Pre-Trained Word Embedding
T2 - Computer Systems Science and Engineering
PY - 2023
VL - 46
IS - 3
SN -
AB - One of the open problems in Computer Vision is the automatic generation of textual descriptions for images, commonly known as image captioning. Deep Learning techniques have made significant progress in this area. The typical architecture of an image captioning system consists mainly of an image feature extraction subsystem followed by a lingual caption generation subsystem. This paper aims to find optimized models for these two subsystems. For the image feature extraction subsystem, the research tested eight different concatenations of pairs of vision models to identify the one that produces the most expressive feature vector for an image. For the lingual caption generation subsystem, this paper tested three different pre-trained language embedding models: GloVe (Global Vectors for Word Representation), BERT (Bidirectional Encoder Representations from Transformers), and TaCL (Token-aware Contrastive Learning), to select the most accurate among them. Our experiments showed that an image captioning system that uses a concatenation of the two Transformer-based models SWIN (Shifted Window) and PVT (Pyramid Vision Transformer) as the image feature extractor, combined with the TaCL language embedding model, outperforms all other tested combinations.
KW - Image captioning
KW - Word embedding
KW - Concatenation
KW - Transformer
DO - 10.32604/csse.2023.038376
ER -