Robust Watermarking of Screen-Photography Based on JND

With the popularity of smartphones, important information displayed on a screen can easily be leaked maliciously simply by photographing it. Robust watermarking that can resist screen photography makes it possible to protect such information. Because the screen-photography process introduces irreversible distortion, existing screen-photography watermarks do not take the image content into account well, and their visual quality is limited. This paper therefore proposes a new screen-photography robust watermark. For embedding-region selection, the intensity-based scale-invariant feature transform (SIFT) algorithm is used to construct feature regions based on the density of feature points, so that the regions concentrate on the key content of the image. For embedding strength, the just noticeable difference (JND) model limits the watermark embedding intensity according to the luminance and texture of the picture, to balance robustness and invisibility. After the watermark is embedded, the coefficients in the neighborhood are adjusted again under optimization constraints to improve the accuracy of watermark extraction. Experiments show that the proposed method improves both the extraction accuracy of the watermark and the visual quality of the watermarked image.

The screen-photography process, i.e., capturing a screen with a mobile phone and generating an image, undergoes analog-to-digital and digital-to-analog conversion attacks [11] and introduces many distortions. In past years, many researchers have worked on print-scanning resilient watermarking schemes [12,13] as well as print-camera resilient watermarking schemes [14-16], but because of the special nature of screen photography, experiments have shown that most of these schemes fail for the screen-photography process, so new watermarking schemes are needed.
Screen-photography watermarking algorithms have been investigated by several researchers in recent years. Fang et al. [17] presented a screen-photography watermarking scheme that uses an intensity-based SIFT algorithm to determine the embedding region, repeatedly embeds a small watermark template into different areas, and extracts the watermark using cross-validation. The method is highly robust to watermarks obtained from screen photography, but it requires manually locating the four vertices of the image for correction; because of the limitations of the SIFT algorithm it does not apply to images with simple textures; and it causes large visual distortion on binary images. Chen et al. [18] proposed a screenshot-robust watermark for satellite images that embeds in the DFT domain to cope with the quality decline caused by screenshots and, during extraction, uses a synchronization response index to estimate the appropriate scale level and the location of the synchronization watermark; the scheme is robust to both common attacks and screenshot attacks. Chen et al. [19] presented a feature-synchronization-based watermarking scheme for screenshots that uses Gaussian functions, an improved Harris-Laplace detector, and speeded-up robust feature orientation descriptors to construct embedding regions for watermark synchronization, and modulates the DFT coefficients with a non-rotational embedding method and a preprocessing method; it is robust to screenshot attacks and remains effective when additional common desynchronization attacks follow the screenshot. Chen et al. [20] proposed a new scheme combining encryption and screen protection: a first watermark is embedded in the DFT domain, and a second watermark is then generated from two-dimensional coding and the inverse DFT.
A segmentation encryption algorithm based on chaotic mapping is proposed to enhance the robustness of the encrypted watermark, and a frame-position-based watermark synchronization algorithm is used during extraction. The scheme is secure, reliable, and highly robust.
Several of the above methods embed watermarks by modifying the frequency domain, such as the DCT and DFT domains. In recent years, deep learning methods have also been shown to handle watermark embedding and extraction. Fujihashi et al. [21] proposed embedding watermarks into images with a deep convolutional neural network encoder and decoder, which achieves higher throughput than DCT-based neural network watermarking. Zhong et al. [22] proposed a watermarking system based on deep neural networks (DNNs) that uses an unsupervised structure and a new loss computation method, and can also extract watermarks from camera-captured images, with good practicality and robustness. Jia et al. [23] proposed a new method for embedding hyperlinks into ordinary images so that the watermark can be detected by camera-equipped mobile devices; a distortion network with differentiable 3D rendering operations between the encoder and decoder emulates the distortion introduced by camera imaging, giving some robustness to camera photography. Fang et al. [24] suggested a deep-template-based watermarking algorithm that designs a message-embedding template and a localization template on the embedding side and uses a two-level DNN on the extraction side to achieve higher robustness against camera photography.
To increase the robustness of screen-photography watermarking and to strike a balance between robustness and invisibility, we propose a new screen-photography watermarking technique. Our contributions are as follows: (1) For embedding-area selection, the intensity-based SIFT algorithm is used to construct feature regions according to the density of feature points, so that the regions pay more attention to the key content of the image and can be located accurately during extraction. (2) For embedding strength, the JND model is used to set the watermark embedding intensity, which adapts to the image luminance and texture to balance robustness and invisibility. (3) An optimization scheme constraining the magnitudes of the neighborhood coefficients is proposed, which applies the same ordering constraint to the other coefficient pairs in the neighborhood of the modified DCT coefficients, so that the watermark can still be extracted accurately when the feature points are shifted. Experimental results show that these three methods improve the extraction accuracy of the watermark and the visual quality of the watermarked image.
The organization of this paper is as follows. Section 2 provides background, describing screen-photography distortion, the SIFT algorithm, and the JND model. Section 3 describes the proposed method. The experimental results and comparative analysis are given in Section 4. Finally, Section 5 concludes the paper.

Screen Photography Distortion
Fang et al. [17] and Chen et al. [19] studied and summarized the distortions that result from the screen-capture process. The process can be divided into three sub-processes: screen display, shooting, and camera imaging, each of which introduces distortions. Fang et al. divided the distortion of the screen-capture process into four categories: display distortion, lens distortion, sensor distortion, and processing distortion. Chen et al. divided it into five categories: linear distortion, gamma adjustment, geometric distortion, noise attack, and low-pass filter attack [25]. We believe three of these distortions deserve particular attention: lens distortion, light-source distortion, and moiré distortion.

SIFT Algorithm
The SIFT algorithm, or scale-invariant feature transform, is an image-processing algorithm used to detect key points in an image and provide feature descriptors for their orientations [26]. The algorithm adapts to changes in luminance with good stability and invariance. Schaber et al. [27] demonstrated that SIFT keypoints are screenshot-invariant, but the traditional SIFT algorithm does not allow blind extraction and has long localization times [28,29], so we use the same intensity-based SIFT algorithm as Fang et al. to find the regions for embedding the watermark. The keypoint intensity is defined as Eq. (1).
where p denotes a key point in the difference-of-Gaussians (DoG) domain, with p = (x, y, σ). The DoG image is described as Eq. (2):

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ), (2)

where (x, y) is a pixel location of the image, σ is the scale-space factor, and k is the constant ratio between adjacent scales. L(x, y, σ) is defined as the convolution of the original image I(x, y) with a variable-scale two-dimensional Gaussian G(x, y, σ):

L(x, y, σ) = G(x, y, σ) * I(x, y), (3)

where * denotes the convolution operation.
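As an illustrative sketch of the scale-space construction behind Eqs. (2) and (3) (the function and parameter names are ours, not from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_image(image, sigma, k=1.6):
    """Difference-of-Gaussians D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma),
    where L is the image convolved with a Gaussian of the given scale."""
    image = image.astype(np.float64)
    L_small = gaussian_filter(image, sigma)       # L(x, y, sigma)
    L_large = gaussian_filter(image, k * sigma)   # L(x, y, k*sigma)
    return L_large - L_small
```

A uniform image yields a near-zero DoG response, since both blurred versions are identical; keypoints are located at extrema of this response across space and scale.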

JND Model
JND is the minimum perceptible difference; in image processing, it measures the sensitivity of the human eye to distortion in different regions of an image [30]. The JND model mainly considers the luminance-masking and texture-masking properties of the human visual system (HVS) and superimposes the two effects through a nonlinear relationship. Qin et al. [31] suggested using the spatial JND model to determine the watermark embedding strength to obtain good visual performance. This paper uses JND to adjust the watermark embedding intensity, so that the intensity adapts to the image luminance and texture. The basic formula of JND is as follows.
Here T_l(x, y) and T_t^θ(x, y) are the visibility thresholds of the luminance-masking factor and the texture-masking factor, and C_lt^θ is the superposition constant, which reflects the degree of overlap between the two masks.
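One common form of this nonlinear superposition is JND = T_l + T_t − C_lt · min(T_l, T_t); we assume that form here for illustration only (the paper's exact thresholds and combination rule may differ):

```python
import numpy as np

def jnd_map(T_l, T_t, C_lt=0.3):
    """Combine luminance-masking and texture-masking thresholds with the
    nonlinear additivity rule JND = T_l + T_t - C_lt * min(T_l, T_t).
    C_lt in (0, 1] controls how much the two masking effects overlap."""
    T_l = np.asarray(T_l, dtype=np.float64)
    T_t = np.asarray(T_t, dtype=np.float64)
    return T_l + T_t - C_lt * np.minimum(T_l, T_t)
```

The subtraction term prevents double-counting when both masks are strong in the same region, which is why the two effects are not simply added.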

Proposed Method

Watermark Embedding Process
The proposed watermark embedding method is described below. First, select the host image for embedding: if the input is a color image, convert it to the YCbCr color space and use the Y channel; otherwise, use the input directly as the host image. Then select the feature regions for embedding: find the feature points using the intensity-based SIFT algorithm and construct the feature regions according to the density of the feature points. The watermark sequence is encoded with a Bose-Chaudhuri-Hocquenghem (BCH) code, and the encoded sequence is embedded in the DCT domain. Finally, the coefficients in the neighborhood are adjusted again under optimization constraints. The specific watermark embedding process is shown in Fig. 1.

Figure 1: Watermark embedding process
Convert the watermark sequence into a binary sequence and encode it with a BCH code. The resulting binary sequence is placed, column by column, into a matrix W of size a × b, where W is the smallest possible square matrix. If the length of the binary sequence is less than a × b, the remaining positions in W are padded with zeros to obtain the final watermark matrix. Each bit of information is embedded in a block of 8 × 8 pixels. With a candidate feature point as the center, the feature region is constructed from the watermark matrix, and its size is a × b × 8 × 8.
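The column-wise packing with zero padding described above can be sketched as follows (names are illustrative; the BCH encoding step is assumed to have happened already):

```python
import numpy as np

def pack_watermark(bits, a, b):
    """Place an encoded bit sequence into an a x b matrix column by column,
    zero-padding the unused trailing positions."""
    if len(bits) > a * b:
        raise ValueError("watermark longer than a*b")
    padded = np.zeros(a * b, dtype=np.uint8)
    padded[:len(bits)] = bits
    # order='F' fills the matrix column by column, as described in the text
    return padded.reshape((a, b), order='F')
```

Each entry of the resulting matrix then corresponds to one 8 × 8 pixel block of the feature region.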
The feature points are detected with the intensity-based SIFT algorithm and sorted in descending order of intensity, and the top n points are selected as candidate feature points. The feature regions are then filtered in three steps: first, a region must not exceed the borders of the image; second, based on feature-point density, only regions containing at least k candidate feature points are kept; finally, feature regions must not overlap, and when two regions overlap, the one whose feature point has lower intensity is removed. Among all feature points satisfying these conditions, the top m points, ordered by intensity, and their corresponding feature regions are the final choice. If fewer than m feature points remain, embedding proceeds with the actual feature points and regions; extensive experiments show, however, that texture-rich images have at least five such regions. We found that feature regions determined by feature-point density concentrate on key locations of the image, and such locations tend not to be cropped out. Taking the crown image as an example, Fig. 2a shows the feature regions found by Fang et al.'s method and Fig. 2b shows those found by the proposed method; the regions found by our method surround the key areas of the image.

The watermark is embedded in the DCT domain. Fang et al. found that the relative magnitude of a pair of mid-frequency DCT coefficients does not change after the screen-photography process, so embedding and blind extraction can both be achieved by comparing the magnitudes of this coefficient pair.
The image block is first transformed to the DCT domain, and the watermark is embedded by comparing the coefficient values at two positions, C1 and C2. If the bit to be embedded is 0, C1 ≥ C2 is enforced; if the bit is 1, C1 < C2 is enforced. The embedding strength is determined by the JND model, so that it adapts to the image luminance and texture; the specific embedding rule is given in Eq. (5).
where (x, y) is a pixel position of the image, T_l(x, y) and T_t(x, y) are the visibility thresholds of the luminance-masking and texture-masking factors, JND(x, y) is the JND value at that pixel, C1 and C2 are the DCT coefficient values at the two positions, and w is the watermark bit.
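A simplified sketch of this pair-comparison embedding, using the mid-frequency positions (4, 5) and (5, 4) discussed in this paper and the JND value as the strength (the exact modification rule of Eq. (5) is only paraphrased here):

```python
import numpy as np
from scipy.fft import dctn, idctn

def embed_bit(block, bit, jnd):
    """Embed one bit into an 8x8 block by enforcing an order between the
    mid-frequency DCT coefficients C1 = C(4, 5) and C2 = C(5, 4),
    separated by the JND-derived strength."""
    C = dctn(block.astype(np.float64), norm='ortho')
    mean = (C[4, 5] + C[5, 4]) / 2.0
    if bit == 0:   # guarantee C1 >= C2
        C[4, 5], C[5, 4] = mean + jnd / 2.0, mean - jnd / 2.0
    else:          # guarantee C1 < C2
        C[4, 5], C[5, 4] = mean - jnd / 2.0, mean + jnd / 2.0
    return idctn(C, norm='ortho')
```

Pushing the two coefficients symmetrically about their mean keeps the block's overall energy roughly unchanged while fixing the order relation.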
Since the shooting process may shift some feature points during extraction, Fang et al. use a neighborhood traversal method to compensate for the offset. When an offset occurs, however, the comparison may involve more than just the two coefficients C1 and C2; we take the coefficients at positions (4, 5) and (5, 4) as an example below. Fig. 3 plots the eight possible offsets, one coefficient pair per ellipse; the embedded bit is ultimately decided by comparing the magnitudes of the two coefficients in these pairs. When the magnitude relationship of another coefficient pair is exactly opposite to that of (4, 5) and (5, 4), and the opposite pairs outnumber the identical pairs, the extracted bit may be the exact opposite of the embedded bit, so the information cannot be extracted correctly. Therefore, after embedding the watermark, we propose to apply ordering constraints to the DCT coefficients in the 3 × 3 neighborhoods of positions (4, 5) and (5, 4), to make sure the watermark remains extractable when the feature points are offset. The desired ordering relationships are shown in Eqs. (6) and (7):

if w = 1:
C(3, 4) < C(4, 3)
C(3, 5) < C(4, 4) < C(5, 3)
C(3, 6) < C(4, 5) < C(5, 4) < C(6, 3)
C(4, 6) < C(5, 5) < C(6, 4)
C(5, 6) < C(6, 5) (6)

if w = 0, all of the above inequalities are reversed. (7)

When these relations are not satisfied, the offending coefficient values must be adjusted.
For the pairs C(3, 4), C(4, 3) and C(5, 6), C(6, 5), the two coefficients are directly exchanged if the condition is not satisfied. For the chains C(3, 5), C(4, 4), C(5, 3) and C(4, 6), C(5, 5), C(6, 4), if the condition is not satisfied, the coefficient values are sorted according to the required order and reassigned in turn. For the chain C(3, 6), C(4, 5), C(5, 4), C(6, 3), if the condition is not satisfied, the coefficients are modified according to Eq. (8).
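The "direct exchange" rule for the two-coefficient pairs can be sketched as follows (a minimal illustration; the sort-and-reassign rule for the longer chains is analogous, and Eq. (8) is not reproduced here):

```python
import numpy as np

def enforce_pair(C, pos_a, pos_b, bit):
    """Swap a neighbouring coefficient pair when its order contradicts the
    embedded bit: bit 0 requires C[pos_a] >= C[pos_b], bit 1 requires
    C[pos_a] < C[pos_b]. C is modified in place and returned."""
    a, b = C[pos_a], C[pos_b]
    wrong = (a < b) if bit == 0 else (a >= b)
    if wrong:
        C[pos_a], C[pos_b] = b, a
    return C
```

Because a swap changes neither coefficient's magnitude, only their positions, the visual impact of this correction step is small.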

Watermark Extraction Process
The watermark extraction process is described below; the detailed procedure is shown in Fig. 4. Since mobile-phone shooting causes some image distortion, the captured photo is first perspective-corrected, then cropped and scaled so that the extracted image has the same size as the original. Here the four vertices of the image must be known. The intensity-based SIFT algorithm is used to select the top n feature points with the greatest intensity, and the feature regions are constructed in the same way as during embedding. Each feature region is divided into a × b blocks, a DCT is applied to each block, and the embedded bit is decided as 0 or 1 by comparing the magnitudes of the DCT coefficient pair.
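Per block, the comparison mirrors the embedding rule; a minimal sketch (using the same illustrative positions (4, 5) and (5, 4) as above):

```python
import numpy as np
from scipy.fft import dctn

def extract_bit(block):
    """Recover one bit from an 8x8 block by comparing the mid-frequency
    DCT coefficient pair used at embedding: C1 >= C2 reads as 0, else 1."""
    C = dctn(np.asarray(block, dtype=np.float64), norm='ortho')
    return 0 if C[4, 5] >= C[5, 4] else 1
```

Since only the order of the two coefficients matters, extraction is blind: neither the original image nor the embedding strength is needed.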

Figure 4: Watermark extraction process
Since the intensity of a feature point may change and its position may shift after photography, we perform offset compensation. Centered on each feature point, the 9 points in its 3 × 3 neighborhood form an extraction point group, and the 9 watermarks extracted from this group form a watermark group. Watermarks from different groups are compared, and if the difference between two watermarks is less than the threshold th, the two are recorded as a watermark pair w_f; watermarks within the same group are not compared. The difference between two watermarks is computed as in Eq. (9), and the watermark pair w_f is defined in Eq. (10).
where w_iα is the αth watermark in the ith watermark group, w_jβ is the βth watermark in the jth watermark group, (x, y) are the coordinates within the watermark matrix, and ⊕ is the XOR operation.
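The XOR-and-sum difference of Eq. (9) can be sketched as (names are illustrative):

```python
import numpy as np

def watermark_difference(w_a, w_b):
    """Count the disagreeing bits between two extracted watermark matrices:
    XOR the matrices elementwise and sum, as in Eq. (9). A pair whose
    difference is below the threshold th is kept as a watermark pair."""
    w_a = np.asarray(w_a, dtype=np.uint8)
    w_b = np.asarray(w_b, dtype=np.uint8)
    return int(np.sum(np.bitwise_xor(w_a, w_b)))
```

A small difference indicates that the two extraction positions recovered essentially the same watermark, so keeping only such pairs filters out badly offset extractions.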
Finally, a watermark pair set w_f of length 2l is obtained. All watermarks in w_f are summed position by position according to their coordinates; if the value at a position is greater than or equal to l, i.e., half the number of watermarks in w_f, the watermark bit at that position is taken as 1, and otherwise as 0. See Eq. (11) for details.
Here w_f^i(x, y) is the value of the ith watermark in w_f at position (x, y), and 2l is the number of watermarks in w_f. By adjusting the embedding region, adapting the embedding intensity, and constraining the magnitude relationships of the coefficients in the neighborhood of the modified pair, the probability of successful watermark extraction from captured images increases and the visual quality of the watermarked images improves.
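The bitwise majority vote of Eq. (11) can be sketched as:

```python
import numpy as np

def vote_watermark(pairs):
    """Bitwise majority vote over the 2l watermarks collected in w_f:
    a position is read as 1 when at least l (half of 2l) of the
    watermarks agree on 1, as in Eq. (11)."""
    stack = np.asarray(pairs, dtype=np.uint8)  # shape: (2l, ...) watermarks
    l = stack.shape[0] // 2
    return (stack.sum(axis=0) >= l).astype(np.uint8)
```

For example, `vote_watermark([[1, 0], [1, 1], [0, 0], [1, 0]])` sums to [3, 1] with l = 2, so the voted watermark is [1, 0].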

Experimental Results
Next, we show and discuss the results of comparison experiments with the method of Fang et al. The experiments were carried out on the MATLAB platform using the USC-SIPI image database [32]. We choose a = 8 and b = 8, and the error-correcting code is BCH(64, 39), which corrects 4-bit errors. In the experiments, n is set to 50 and k to 3. The monitor is a 'LEN T27q-20' and the phone is a 'Huawei P40 Pro'. The following experiments compare the results in terms of PSNR values and of different shooting conditions, including shooting distance, shooting angle, and handheld shooting.
Tab. 1 compares Fang's method with the proposed method in terms of PSNR. Most images show some improvement in PSNR, while a small number show slightly lower values.

The experimental comparison shows that the proposed method improves on the method of Fang et al. in robustness and visual quality. However, the scheme shares some of its limitations. Due to the limitations of the SIFT algorithm, if the texture of the image is too simple, the positions of the selected feature points change significantly and the information cannot be extracted correctly. During extraction it is still necessary to manually locate the four vertices of the image in order to recover the extracted image. For some text images, the proposed embedding leads to very obvious visual distortion that can interfere with normal reading.

Conclusion
To protect images displayed on a screen, a robust, image-content-based watermarking solution for screen photography is proposed. For embedding-area selection, the intensity-based SIFT algorithm constructs feature regions based on the density of feature points, making the regions focus on the key content of the image, more accurate in extraction, and resistant to certain cropping attacks. The JND model adapts the embedding strength to the image luminance and texture, balancing robustness and invisibility. After embedding, the coefficients in the neighborhood are optimized again, which increases the number of correctly ordered coefficient pairs during statistics-based extraction, reduces opposite extraction results, and yields higher accuracy. The experimental results indicate that the solution offers some improvements in robustness and visual quality.