Lirong Yin¹, Lei Wang¹, Siyu Lu²,*, Ruiyang Wang², Youshuai Yang², Bo Yang², Shan Liu², Ahmed AlSanad³, Salman A. AlQahtani³, Zhengtong Yin⁴, Xiaolu Li⁵, Xiaobing Chen⁶, Wenfeng Zheng³,*
CMES-Computer Modeling in Engineering & Sciences, Vol.141, No.1, pp. 87-106, 2024, DOI:10.32604/cmes.2024.051083
Abstract: This study addresses the limitations of Transformer models in image feature extraction, particularly their lack of inductive bias for visual structures. Compared to Convolutional Neural Networks (CNNs), Transformers are more sensitive to optimizer hyperparameters, which reduces training stability and slows convergence. To tackle these challenges, we propose the Convolution-based Efficient Transformer Image Feature Extraction Network (CEFormer) as an enhancement of the Transformer architecture. Our model incorporates E-Attention, depthwise separable convolution, and dilated convolution to introduce crucial inductive biases, such as translation invariance, locality, and scale invariance, into the Transformer…