Computer Systems Science & Engineering
Image Inpainting Detection Based on High-Pass Filter Attention Network
1Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha, 410114, China
2School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, 410114, China
3School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, 411004, China
4Department of Computer Science, Texas Tech University, Lubbock, 79409, USA
*Corresponding Author: Feng Li. Email: firstname.lastname@example.org
Received: 13 January 2022; Accepted: 11 March 2022
Abstract: Deep learning has greatly improved image inpainting. Inpainting was originally intended to repair damaged content such as broken photos and artifacts. However, it can also be used maliciously, for example to destroy evidence. Detecting and localizing inpainting operations is therefore essential. Recent research has applied a high-pass filtered fully convolutional network (HPFCN) to image inpainting detection with good results. However, that method does not consider the spatial location and channel information of the feature maps. To address these shortcomings, we introduce squeeze-and-excitation (SE) blocks and propose a high-pass filter attention fully convolutional network (HPACN). In feature extraction, we apply concurrent spatial and channel attention (scSE) to enhance feature extraction and obtain more information. Channel attention (cSE) is introduced in upsampling to enhance detection and localization. The experimental results show that the proposed method achieves an improvement on ImageNet.
Keywords: Image inpainting detection; spatial attention; channel attention; full convolutional network; high-pass filter
1 Introduction
Image inpainting completes the damaged or missing parts of an image from the information that remains, or removes a target object, so that the inpainted region blends into its scene and background and the result looks natural. It is hard for an observer to spot the damaged or modified regions with the naked eye, as shown in Fig. 1.
With the advancement of science and technology and the development of modern society, digital images have become an inseparable part of people's lives, and tampered images, including inpainted ones, increasingly appear in daily life and study. Image inpainting plays an important role in many areas. In the photography and film industries, it can achieve certain special effects; in the restoration of cultural relics, it can repair damaged frescoes; in medical imaging, it can provide a basis for medical diagnosis. Precisely because image inpainting has penetrated so many aspects of people's lives, its use should be regulated, and inpainting detection is particularly important.
In recent years, image inpainting has developed significantly [1–13]. Traditional inpainting methods based on diffusion or patches [1–5] can either fill only very limited holes or cannot inpaint complex scenes or missing objects that do not appear elsewhere in the image. Inpainting methods based on deep learning [6–13] largely overcome these shortcomings: the inpaintable area is larger, image details are predicted well, and objects absent from the remaining image can be generated automatically.
Methods based on deep learning achieve relatively good results in image inpainting tasks, and image editing software and related techniques make it easy to manipulate images. Both make the forensics process more difficult, so inpainting detection needs further development. Processed images may be used maliciously by criminals, for example to delete objects from evidence or to remove visible copyright watermarks. Because such manipulation distorts the real information in the image, it harms people's daily life and the normal political, economic and social order. Forensics for image inpainting and tampering can determine whether an image under examination has been tampered with and which regions have been inpainted or tampered with, which helps people identify the real information the image expresses. The study of image inpainting forensics therefore has great social significance and value.
Although image inpainting itself has made satisfactory progress, much work remains on image tampering detection, especially the detection of deep image inpainting. Many detection methods either handle only traditional inpainting or are not accurate enough [14–16]. As neural networks have continued to develop and improve, they have also been used in many detection studies. Li et al. combined Convolutional Neural Networks (CNN) with a high-pass filtering architecture for image inpainting detection and achieved satisfactory performance, but the model did not consider the spatial and channel correlations of the feature maps. Meanwhile, attention mechanisms have been widely used in many fields, which encourages researchers to explore them to further improve the performance of network models. Attention mechanisms can excite more information from feature maps and integrate well into other studies [18–21]. Inspired by this, we combine high-pass filtered fully convolutional networks with spatial and channel attention for image inpainting detection.
The main contributions of this paper are as follows. First, we propose a deep learning-based high-pass filtered attention fully convolutional network to detect and localize image regions altered by deep inpainting operations. Second, the method uses scSE in the feature extraction stage to enhance feature extraction, and during upsampling it pays more attention to channel information to enhance detection and localization. Finally, our model achieves promising results on the public dataset ImageNet.
The rest of this article is organized as follows. Section 2 reviews image inpainting, image inpainting detection methods and attention mechanisms. Section 3 presents the proposed network framework and its implementation. Section 4 introduces the experimental environment and discusses and analyzes the experimental results. Section 5 summarizes the work and discusses future research.
2 Related Work
2.1 Image Inpainting
Many methods have been developed for image inpainting. These include traditional methods based on diffusion or patches, as well as methods based on deep learning. Among diffusion-based methods, Li et al. analyzed the diffusion process, found the variation of the Laplacian image, and determined the inpainting area according to the channel information, but the detailed texture of the image could not be completed. Sridevi et al. proposed an image inpainting algorithm that used the Fourier transform to remove noise and blur and processed image boundaries well, but it cannot deal with regions containing curved structures. Among patch-based methods, Barnes et al. proposed a randomized algorithm that uses incremental updates to fill in missing regions by iteratively searching for similar patches. Ružić et al. proposed contextual textures based on Markov random fields to speed up the candidate patch search. Although patch-based methods enlarge the inpaintable area by filling it with similar patches from the surrounding region, they cannot repair complex scenes or objects that do not appear in the rest of the image. Zeng et al. proposed a method that determines patch priority based on saliency mapping and gray entropy, but its results were unclear for larger areas.
To overcome the shortcomings of traditional methods, learning-based methods have been developed. Pathak et al. were the first to use neural networks for image inpainting. There are also methods based on Generative Adversarial Networks (GAN) [7–9]. Partial convolution uses only the information in valid regions, and Yu et al. introduced gated convolution on this basis. More recently, Wang et al. performed image inpainting by adaptively selecting features and normalization, and Wang et al. introduced a parallel multi-resolution network for image inpainting.
2.2 Image Inpainting Detection
Several inpainting detection methods have been developed. Among traditional methods, Chang et al. search for inpainted regions by computing similar blocks between regions. Liang et al. also search for inpainted regions, by computing similarity hashes, but this approach is limited to detecting simple operations without post-processing. With the improvement of neural networks, Zhu et al. used a CNN to detect patch-based image inpainting. Li et al. combined high-pass filtering with a CNN for image inpainting detection, which is the work most closely related to this paper. As network architectures continue to develop and improve, networks from other fields have been adopted for inpainting detection. Zhang et al. used a feature pyramid network for forensics, but the shortcomings of this method are obvious: it handles only small regions and only diffusion-based inpainting. Wang et al. applied Mask R-CNN, originally designed for object detection, to inpainting detection, which increased the types of data that can be detected. Although current methods can accomplish certain forensic tasks, there is still a long way to go in improving how well detection methods generalize across datasets.
2.3 Attention Mechanism
Recent research shows that attention mechanisms in deep learning have been widely used in many fields [20–21,27,28]. Neural networks remain the most frequently used approach in computer vision [29–35], and self-attention is only gradually entering this body of research, either complementing existing structures or replacing them entirely. Even so, attention mechanisms have long been a popular technique.
The SE block is the simplest kind of attention mechanism. Hu et al. did not use attention in the strict sense; instead, they recalibrated the weights of feature-map channels. By using self-attention to model the interdependence between convolutional feature channels, they re-weighted the channel responses in a given CNN layer. Roy et al. added spatial excitation to channel excitation and proposed the concurrent spatial and channel SE block (scSE), which recalibrates feature maps along both the spatial and channel dimensions. As a result, more information is excited from the feature maps and satisfactory results are achieved, and the block can be easily applied to other fields. Inspired by this, we introduce the SE block into a high-pass filtered fully convolutional network for image inpainting detection.
3 Proposed Method
Recent studies include a large body of image forensics work based on convolutional neural networks. These methods keep improving, and the trained models achieve very good detection results and generalize well to other fields. Given the outstanding performance of HPFCN in image inpainting detection, we adopt its main structure and add an attention mechanism to build a deep learning-based image inpainting detection network.
3.1 The Proposed Framework
We propose three model variants. The first uses channel and spatial information in the feature extraction stage, introducing scSE to enhance feature extraction and obtain more information; we call it HPACN-sc. The second pays more attention to the relationships between channels during upsampling, introducing channel attention (cSE) to enhance detection and localization; we call it HPACN-c. The third adds both scSE and cSE as described above; we call it HPACN-scc. The network structure of the proposed inpainting detection method is shown in Fig. 2.
3.2 Three Kinds of SE Blocks
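For a feature map X, a new feature map U is obtained by the transformation $F_{tr}: X \to U$, with $X \in \mathbb{R}^{H \times W \times C'}$ and $U \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the spatial height and width, $C'$ and $C$ are the corresponding channel numbers, and $F_{tr}$ is the convolution operator. An SE block processes U to obtain a recalibrated U' that is used in subsequent layers.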
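The feature map $U = [u_1, \ldots, u_C]$, where $u_k \in \mathbb{R}^{H \times W}$ is its k-th channel, is globally average pooled to obtain the statistic $Z \in \mathbb{R}^{1 \times 1 \times C}$. The k-th value of Z is given in Eq. (1).
$$z_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_k(i, j) \tag{1}$$
The vector Z passes through two fully connected layers with a ReLU between them, giving $Z' = W_1(\delta(W_2 Z))$, where $\delta$ denotes the ReLU, $W_1 \in \mathbb{R}^{C \times C/2}$ and $W_2 \in \mathbb{R}^{C/2 \times C}$. A sigmoid activation $\sigma$ then maps each value of Z' into the range (0, 1). In this way the feature map U is squeezed along the spatial dimensions and excited along the channel dimension to obtain the recalibrated $U_{cSE}$, see Eq. (2).
$$U_{cSE} = [\sigma(z'_1) u_1, \ldots, \sigma(z'_i) u_i, \ldots, \sigma(z'_C) u_C] \tag{2}$$
where $\sigma(z'_i)$ represents the importance of the i-th channel. This recalibration pays more attention to the more important channels while suppressing the unimportant ones.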
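The channel recalibration described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `cse`, the (H, W, C) array layout, and the random weights are assumptions made for the example, and bias terms are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cse(U, W1, W2):
    """Channel squeeze-and-excitation on a feature map U of shape (H, W, C)."""
    z = U.mean(axis=(0, 1))                 # squeeze: global average pooling -> (C,)
    z_hat = W1 @ np.maximum(W2 @ z, 0.0)    # two fully connected layers, ReLU in between
    gate = sigmoid(z_hat)                   # per-channel importance in (0, 1)
    return U * gate                         # broadcasting rescales every channel

rng = np.random.default_rng(0)
H, W, C = 4, 5, 8
U = rng.standard_normal((H, W, C))
W1 = rng.standard_normal((C, C // 2))       # hypothetical weights, shape C x C/2
W2 = rng.standard_normal((C // 2, C))       # hypothetical weights, shape C/2 x C
U_cse = cse(U, W1, W2)
```

Every spatial position of a given channel is multiplied by the same gate value, so the block changes how strongly each channel contributes without altering the spatial pattern inside it.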
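Next, write the feature map as $U = [u^{1,1}, \ldots, u^{i,j}, \ldots, u^{H,W}]$, where $u^{i,j} \in \mathbb{R}^{1 \times 1 \times C}$ is the feature vector at spatial position (i, j). The spatial projection is obtained by convolution, $q = W_{sq} \star U$, where $W_{sq} \in \mathbb{R}^{1 \times 1 \times C \times 1}$ and $q \in \mathbb{R}^{H \times W}$, and a sigmoid then maps each projected value to a weight for its original position. In this way the feature map U is squeezed along the channel dimension and excited along the spatial dimensions to obtain the recalibrated $U_{sSE}$, as shown in Eq. (3).
$$U_{sSE} = [\sigma(q_{1,1}) u^{1,1}, \ldots, \sigma(q_{i,j}) u^{i,j}, \ldots, \sigma(q_{H,W}) u^{H,W}] \tag{3}$$
where $\sigma(q_{i,j})$ represents the importance of the spatial position (i, j). This recalibration pays more attention to the more important spatial positions while suppressing the unimportant ones.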
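The spatial recalibration can be sketched in the same way. Again this is an illustrative NumPy version under stated assumptions: the name `sse`, the (H, W, C) layout, and the random weight vector `w_sq` (standing in for the 1 × 1 convolution kernel) are not taken from the paper.

```python
import numpy as np

def sse(U, w_sq):
    """Spatial squeeze-and-excitation on U of shape (H, W, C); w_sq has shape (C,)."""
    q = U @ w_sq                            # 1x1 convolution to one channel -> (H, W)
    gate = 1.0 / (1.0 + np.exp(-q))         # per-position importance in (0, 1)
    return U * gate[..., None]              # rescale the feature vector at every position

rng = np.random.default_rng(0)
U = rng.standard_normal((4, 5, 8))
w_sq = rng.standard_normal(8)               # hypothetical 1x1 convolution weights
U_sse = sse(U, w_sq)
```

A 1 × 1 convolution with a single output channel is a dot product between the kernel and the feature vector at each position, which is why a plain matrix product is enough here.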
The scSE block combines the parallel spatial and channel squeeze and excitation by adding $U_{sSE}$ and $U_{cSE}$, that is, $U_{scSE} = U_{sSE} + U_{cSE}$. In theory, this recalibration focuses on the important channels and the important spatial positions at the same time and should therefore achieve better results.
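A compact sketch of the combined block follows, assuming the same (H, W, C) layout and hypothetical random weights as before; it only illustrates the element-wise addition of the two recalibrated maps and is not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scse(U, W1, W2, w_sq):
    """Concurrent spatial and channel SE: element-wise sum of the two recalibrations."""
    channel_gate = sigmoid(W1 @ np.maximum(W2 @ U.mean(axis=(0, 1)), 0.0))  # (C,)
    spatial_gate = sigmoid(U @ w_sq)                                        # (H, W)
    U_cse = U * channel_gate                 # channel-wise recalibration
    U_sse = U * spatial_gate[..., None]      # position-wise recalibration
    return U_sse + U_cse

rng = np.random.default_rng(0)
C = 8
U = rng.standard_normal((4, 5, C))
U_scse = scse(U,
              rng.standard_normal((C, C // 2)),   # hypothetical W1
              rng.standard_normal((C // 2, C)),   # hypothetical W2
              rng.standard_normal(C))             # hypothetical 1x1 kernel
```

Because both gates lie in (0, 1), each element of U is scaled by a factor between 0 and 2, so an element is amplified only when its channel and its position together receive a combined weight above 1.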
For a more detailed introduction, please refer to the original SE and scSE papers discussed in Section 2.3. Of these three types of SE blocks, we use the scSE block and the cSE block for image forensics in this work.
3.3 Implementation Process
3.3.1 Preprocessing Module
The images of the dataset are the input of the preprocessing module. High-pass filtering enhances the traces left by the tampering operation, and scSE is then applied to obtain a 9-channel image residual that emphasizes spatial and channel information. The output of the preprocessing module is connected to the input of the feature extraction module. Here the high-pass filtering consists of three depthwise convolutions with a stride of 1; the filter kernels are 3 × 3 and learnable. More details can be found in the HPFCN paper.
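To make the 9-channel residual concrete, the sketch below filters each of the three colour channels with each of three 3 × 3 kernels at stride 1. The kernels shown (horizontal, vertical and diagonal first-order differences) are an assumption chosen for illustration, as are the function name and the edge padding; in the network the kernels are trainable rather than fixed.

```python
import numpy as np

# Hypothetical kernels: horizontal, vertical and diagonal first-order differences.
KERNELS = np.array([
    [[0, 0, 0], [0, -1, 1], [0, 0, 0]],
    [[0, 0, 0], [0, -1, 0], [0, 1, 0]],
    [[0, 0, 0], [0, -1, 0], [0, 0, 1]],
], dtype=np.float64)

def high_pass_residual(image, kernels=KERNELS):
    """Filter every colour channel with every 3x3 kernel (stride 1).

    image: (H, W, 3) array. Returns an (H, W, 9) residual.
    """
    h, w, c = image.shape
    padded = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros((h, w, c * len(kernels)))
    for ch in range(c):
        for k, kernel in enumerate(kernels):
            acc = np.zeros((h, w))
            for dy in range(3):
                for dx in range(3):
                    acc += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w, ch]
            out[:, :, ch * len(kernels) + k] = acc
    return out

flat = np.full((6, 6, 3), 0.5)                             # constant image
ramp = np.tile(np.arange(6.0)[None, :, None], (6, 1, 3))   # brightness grows left to right
flat_residual = high_pass_residual(flat)
ramp_residual = high_pass_residual(ramp)
```

The constant image produces an all-zero residual, while the ramp responds only to the kernels that look along the direction of change. This is the sense in which high-pass filtering removes image content and keeps local inconsistencies.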
3.3.2 Feature Extraction Module
ResNet v2 modifies ResNet v1 by applying batch normalization and ReLU before each convolution, and the feature extraction module is built on this structure. The module collects distinguishable features from the image residuals above to obtain feature maps. It consists of four identical ResNet blocks, each containing two bottleneck units, and each unit includes three convolution layers with 1 × 1, 3 × 3 and 1 × 1 kernels, respectively. The specific settings can be found in the corresponding reference. Because of the preceding scSE processing, feature extraction attends to both spatial and channel information, and 1024 feature maps are finally obtained. After being processed by the cSE block, these feature maps are used as the input of the subsequent upsampling module.
3.3.3 Upsampling Location Module
After the processing of the previous module, the upsampling location module pays more attention to the dependence between channels. Transposed convolutions increase the spatial resolution so that a category label can be obtained for each pixel. The first transposed convolution has a kernel size of 8 × 8, 64 output channels and a stride of 4; the second has a kernel size of 8 × 8, 4 output channels and a stride of 4. Both use a learnable bilinear kernel, so the feature map is upsampled by a factor of four twice while the number of elements in the feature map stays the same before and after each upsampling. Finally, a convolution with a kernel size of 5 × 5 and a stride of 1 further weakens checkerboard artifacts and produces the tampering detection and localization output.
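The bookkeeping in this module can be checked with a short sketch. The bilinear kernel formula below is the initialisation commonly used for transposed convolutions in fully convolutional networks, assumed here rather than taken from the paper; the 16 × 16 input size and the "same"-style padding (output size equals input size times stride) are likewise assumptions.

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation weights for initialising a transposed convolution."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - np.abs(rows - center) / factor) * (1 - np.abs(cols - center) / factor)

def upsampled_shape(shape, stride, out_channels):
    """Output shape of a transposed convolution, assuming output size = input size * stride."""
    h, w, _ = shape
    return (h * stride, w * stride, out_channels)

kernel = bilinear_kernel(8)                  # 8 x 8 kernel paired with stride 4
features = (16, 16, 1024)                    # hypothetical feature-map size
stage1 = upsampled_shape(features, 4, 64)    # first transposed convolution
stage2 = upsampled_shape(stage1, 4, 4)       # second transposed convolution
element_counts = [int(np.prod(s)) for s in (features, stage1, stage2)]
```

Each stage multiplies the spatial area by 16 and divides the channel count by 16 (1024 to 64 to 4), which is why the number of elements is unchanged across both upsampling steps.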
4 Experiments
The experiments in this paper are implemented in TensorFlow. The experimental environment is Ubuntu 16.04 with an NVIDIA GeForce RTX 2080Ti GPU and CUDA 10.0.
4.1 Dataset and Evaluation Criteria
The images in this paper are from ImageNet, and most of them have a JPEG quality factor of 75. An existing deep inpainting method is used to process the images. The dataset contains 60,000 images in total, split into training and validation data at a ratio of 5:1, that is, 50,000 and 10,000 images. The dataset configuration follows HPFCN. Because the model is not yet able to handle complex cases well, the mask for each image is a rectangle in the center of the image covering 10% of its area.
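A centered mask of this kind could be generated as in the sketch below. It assumes that "10%" refers to image area and that the rectangle keeps the aspect ratio of the image; neither detail is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def center_mask(height, width, area_ratio=0.10):
    """Binary mask whose centred rectangle covers about `area_ratio` of the image.

    Assumption: the rectangle has the same aspect ratio as the image.
    """
    scale = np.sqrt(area_ratio)
    mh = int(round(height * scale))
    mw = int(round(width * scale))
    top = (height - mh) // 2
    left = (width - mw) // 2
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[top:top + mh, left:left + mw] = 1   # 1 marks the region to be inpainted
    return mask

mask = center_mask(256, 256)
coverage = mask.mean()                       # fraction of pixels inside the rectangle
```

Rounding the side lengths to whole pixels means the covered fraction is close to, but not exactly, 10%.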
As shown in Tab. 1, we compare HPACN with DCC, SCO, MFCN and HPFCN. The settings and results of some of these methods are taken from previously published work. We use Recall, Precision, F1-score, and IoU to evaluate how well each model detects image inpainting.
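For reference, the four scores can be computed from a predicted mask and a ground-truth mask as sketched below. The sketch uses the standard pixel-level definitions and assumes that inpainted pixels are the positive class; the paper does not spell out its exact evaluation protocol, so this is an illustration only.

```python
import numpy as np

def localization_scores(pred, truth):
    """Pixel-level scores for binary masks (1 = inpainted, 0 = untouched)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.sum(pred & truth)       # inpainted pixels correctly flagged
    fp = np.sum(pred & ~truth)      # untouched pixels wrongly flagged
    fn = np.sum(~pred & truth)      # inpainted pixels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "iou": iou}

truth = np.zeros((4, 4), dtype=int)
truth[1:3, 1:3] = 1                 # 4 inpainted pixels
pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:4] = 1                  # 6 predicted pixels, 4 of them correct
scores = localization_scores(pred, truth)
```

In this toy case the prediction covers the whole inpainted region plus two extra pixels, so recall is 1 while precision and IoU are both 2/3 and the F1-score is 0.8.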
4.2 Ablation Study
We compare FCN, ACN-c, HPFCN, HPACN-scc, HPACN-sc and HPACN-c. FCN in this paper denotes a fully convolutional network. ACN-c denotes an attention convolutional network with channel attention added in upsampling. HPFCN denotes the existing high-pass filtered fully convolutional network. HPACN-scc, HPACN-sc, and HPACN-c denote high-pass filtering attention fully convolutional networks that add both scSE and cSE, only scSE, and only cSE, respectively. Their comparison results are shown in Tab. 2.
We found that high-pass filtering is one of the key factors that allows the subsequent channel attention to be effective. All three proposed models improve to different degrees. HPACN-c improves most, with the F1-score rising by 0.9% and IoU by 1.1%. The experimental results also show that, without high-pass filtering preprocessing, using channel attention directly degrades performance. A possible reason is that the high-pass residual makes the difference between the inpainted area and the original area more obvious, and the feature differences between the corresponding channels may also be amplified, so exciting the more important channels significantly improves performance.
Because the forensic output in this paper is a pixel-by-pixel localization map, spatial information is very important to the entire pixel-level forensics process. For the feature extraction module, both spatial location information and channel information are very important. For the upsampling module, where the class label of each pixel is about to be generated, one possible explanation is that the information at every spatial location is irreplaceable, whereas some channels contribute less, so suppressing the unimportant channels significantly improves model performance.
The proposed method can effectively detect and locate the inpainted region in our dataset, and examples of detection are shown in Fig. 3. In addition, we found that using channel attention at all stages, or placing it arbitrarily, degrades model performance. In future work, we will continue to modify the network structure to improve how well the model generalizes across datasets and to further improve its performance.
5 Conclusion
This paper proposes a high-pass filtered attention fully convolutional network to detect and locate image inpainting. The existing high-pass filtered fully convolutional network does not consider spatial location and channel dependence in the feature extraction and upsampling stages. We combine spatial and channel information to enhance feature extraction and obtain more information, and we use channel attention to improve the localization of image inpainting detection. The experimental results show that our method achieves satisfactory results. In future research, we will consider using more advanced inpainting methods to process the dataset and will improve the generalization of the model so that it can detect complex inpainted images.
Funding Statement: This project is supported by the National Natural Science Foundation of China under Grant 62172059, 61972057 and 62072055, Hunan Provincial Natural Science Foundations of China under Grant 2020JJ4626, Scientific Research Fund of Hunan Provincial Education Department of China under Grant 19B004, Postgraduate Scientific Research Innovation Project of Hunan Province under Grant CX20210811.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.