An Optimal Text Watermarking Method for Sensitive Detecting of Illegal Tampering Attacks

Abstract: Due to the rapid increase in the exchange of text information over the Internet, the security and authenticity of digital content have become a major research issue. The main challenge faced by researchers is how to hide information within a text so that it can be used later for authentication and tampering detection without affecting the meaning or size of the text. In this paper, an efficient text-based watermarking method is proposed for detecting illegal tampering attacks on Arabic text transmitted online via the Internet. The proposed method improves the accuracy of tampering detection and the robustness of the watermark compared with existing approaches. Both embedding and extraction of the watermark are implemented logically, so no change is made to the digital text. This is achieved by using the third level and alphanumeric strategy of the Markov model as a text analysis technique that analyzes the Arabic content to obtain its features, which serve as the digital watermark. This digital watermark is used later to detect any illegal tampering with the received Arabic text. An extensive set of experiments using four datasets of varying lengths demonstrates the effectiveness of our approach in terms of detection accuracy and robustness under multiple random locations of the common tampering attacks.

Digital texts exchanged through various applications such as eCommerce, eBanking, eLearning, and other Internet applications and communication technologies are growing rapidly. Most of these digital texts are very sensitive to changes in content, structure, syntax, and semantics. Malicious attackers may tamper with this digital content during transfer, which can lead to wrong decisions [1]. Extensive research is in progress to develop algorithms and techniques for information security tasks such as content authentication, integrity verification, tampering detection, owner identification, access control, and copyright protection. Steganography and digital watermarking are the main techniques used to solve these problems. In digital watermarking, information such as text, binary images, video, or audio is embedded as a watermark key in the digital content [2].
For information security, many algorithms and techniques are available for tasks such as content authentication, integrity verification, tampering detection, owner identification, access control, and copyright protection. To address these issues, steganography and automated watermarking methods are commonly used [3]. With a digital watermarking (DWM) technique, information such as text, binary images, audio, or video can be inserted into digital material. A fine-grained text watermarking procedure has been proposed based on replacing white spaces and Latin symbols with homoglyph characters [4].
Several conventional methods and solutions for text watermarking have been proposed [5,6] and categorized into classes such as linguistic, structure- and image-based, and format-based binary-image methods [7]. To insert the watermark information into the document, most of these solutions require certain modifications to the original digital text. Zero-watermarking is a newer technique that uses smart algorithms to embed the watermark information without any alteration to the original digital material. This technique can also be used to generate the watermark data from the contents of a given digital text [1,7-9].
Limited research has focused on appropriate solutions for verifying the credibility of critical digital media online [10-12]. The verification of digital text and the identification of fraud have received considerable research attention. In addition, text watermarking studies over the last decade have concentrated on copyright protection, whereas less attention has been paid to integrity verification, tampering detection, and content authentication because text content depends on natural language [13].
Proposing the most appropriate approaches and strategies for different formats and materials, especially in the Arabic and English languages, is the most common challenge in this area [14,15]. Therefore, content authentication, integrity verification, and tampering detection of sensitive text constitute a major problem in various applications and require suitable solutions.
Examples of such sensitive digital text content are the Arabic interactive Holy Qur'an, eChecks, tests, and marks. Characteristics of the Arabic alphabet such as diacritics, lengthened letters, and extra symbols make it easy to alter the meaning of a text through basic changes such as modifying the arrangement of diacritics [16]. The most popular soft computing and natural language processing (NLP) technique used to support text analysis is the hidden Markov model (HMM).
We propose an intelligent hybrid approach named "an efficient text-watermarking method based on the third level and alphanumeric strategy of Markov model (ETWMM)" for Arabic text authentication and tampering detection. For this purpose, text-watermarking and Markov model techniques are integrated. In this method, the third level of the alphanumeric strategy of the Markov model is used to analyze the given Arabic text and extract its features, which are used as a watermark key. The generated watermark is logically embedded into the original Arabic text without any alteration to the text or effect on its size. The embedded watermark is later used to identify any manipulation of the Arabic text received after transmission over the Internet and to decide whether it is authentic.
The primary objective of the ETWMM method is to provide high accuracy in detecting illegal tampering attacks on Arabic text transmitted via the Internet.
The remainder of the article is structured as follows: Section 2 reviews the existing work. Section 3 describes the proposed method (ETWMM). The simulation and implementation are presented in Section 4, the results are discussed in Section 5, and Section 6 concludes the article.

Related Work
According to the processing domain of NLP and text watermarking, the existing text watermarking methods and solutions reviewed in this paper are classified into linguistic, structural, and zero-watermark methods [1,7,13].
Natural language is the foundation of linguistic text watermarking approaches. These methods embed the watermark by applying changes to the semantic and syntactic nature of the plain text [1].
To enhance capacity and imperceptibility for Arabic text, a text watermarking method has been suggested based on the location of the accessible words [17]. In this method, any word space is used to hide the Boolean bit 0 or 1, which physically modifies the original text.
A text steganography technique was proposed to hide information in Arabic text [18]. This approach relies on the presence of Harakat, the Arabic diacritics such as Kasra, Fatha, and Damma, and uses a reversed Fatha to conceal the message.
An invisible watermarking method based on Kashida marks [19] was proposed for document security and authentication, exploiting characters that recur frequently. The method relies on a predetermined watermark key, with a Kashida inserted for a bit 1 and omitted for a bit 0.
A text steganography method has been proposed based on Kashida extensions applied to the 'moon' and 'sun' characters of Arabic digital content [20]. In this method, Kashida characters are placed alongside Arabic characters to determine which secret bits are hidden by specific characters. Four cases are distinguished: moon characters representing '00'; sun characters representing '01'; sun characters representing '10'; and moon characters representing '11'.
A text steganographic approach [21] based on multilingual Unicode characters has been suggested to hide information in English scripts by using the Unicode forms of English letters found in other languages. Thirteen letters of the English alphabet were chosen for this approach. Two bits are embedded at a time: the ASCII code is used to embed 00, whereas multilingual Unicode characters are used to embed 01, 10, and 11. A text watermarking algorithm based on Unicode extended characters has been used to secure textual content against malicious attacks [22]. The algorithm consists of three main steps: watermark generation, embedding, and extraction. Watermark embedding relies on predefined coding tables, while scrambling strategies are used in generating and extracting the watermark key to keep it secure.
A substitution attack method focused on preserving the position of words in the text document has been proposed [23]. This method depends on manipulating word transitions in the text document. Authentication of Chinese text documents based on combining sentence properties with text-based watermarking approaches has been suggested [24,25]. The method works as follows: a Chinese text is split into a group of sentences, and a semantic code is obtained for each word. The distribution of the semantic codes influences the sentence entropy.
A zero-watermarking method that relies on the Hurst exponent and the nullity of the frames has been proposed to preserve a person's privacy [26]. For watermark embedding, two steps are used to evaluate the unvoiced frames. The approach integrates an individual's identity without introducing any noticeable distortion in the medical speech signals.
A zero-watermarking method was proposed to resolve security issues of English text documents, such as content verification and copyright protection [27]. A zero-watermarking approach based on the Markov model has been suggested for content authentication of English text [28,29]. In this approach, the probability characteristics of the English text are extracted and stored as the secure watermark information, which is later used to confirm the validity of the attacked text document. The approach is considered secure against common text attacks if the watermark distortion rate is greater than one for all known attacks. For copyright protection of English text, a conventional watermark approach [30] based on the occurrence rate of ASCII non-vowel letters and words has been suggested.
Among the methods reviewed, content authentication and tampering detection of digital English content have been neglected by researchers for several reasons. English text is natural-language dependent. Moreover, hiding the watermark information is complicated because text offers no location in which to hide it, unlike pixels in an image, waves in audio, or frames in a video.

The Proposed Approach
In this paper, the authors propose an efficient method called ETWMM. It combines the zero-watermarking technique with the third level of the alphanumeric strategy of the Markov model, which serves as a soft computing and NLP tool. The third level of the alphanumeric strategy was selected to improve the accuracy of detecting tampering attacks at random locations and to reduce the gap in the interrelationships between the given numbers, special symbols, and Arabic letters as compared with other levels. The Markov model is used for text analysis to extract the features of the given Arabic text, which removes the need for external watermark information and for any modification of the original text to embed the watermark key. Unlike previous work, ETWMM can effectively detect tampering whether the tampering volume is very low or very high. In addition, ETWMM can be extended to determine the location where tampering occurred. This feature can be considered an advantage over hash function methods.
The following subsections explain in detail the two main processes performed in ETWMM: the watermark generation and embedding process, and the watermark extraction and detection process.

Watermark Generation and Embedding Process
This process comprises three sub-algorithms: the pre-processing algorithm, the text feature extraction and watermark generation algorithm, and the watermark embedding algorithm, as illustrated in Fig. 1.

Pre-Processing Algorithm
Pre-processing the original Arabic text to remove extra spaces and new-lines is one of the key steps in both the watermark generation and extraction processes, and it directly influences the tampering detection accuracy and watermark robustness. The original Arabic text (OATD) is the required input of the pre-processing step.
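The pre-processing step can be sketched as follows. This is only a minimal illustration in Python (the authors' PHP implementation is not reproduced in the text). The function name `preprocess` is hypothetical, and collapsing every run of whitespace into a single space is an assumption, since the paper states only that extra spaces and new-lines are removed.

```python
import re


def preprocess(oatd: str) -> str:
    """Return a pre-processed copy of the text (OATD_P in the paper's notation).

    Assumption: "removing extra spaces and new-lines" is read here as
    collapsing every run of whitespace into one space and trimming the ends.
    """
    return re.sub(r"\s+", " ", oatd).strip()


if __name__ == "__main__":
    sample = "بسم  الله \n الرحمن\n\nالرحيم "
    print(preprocess(sample))
```

Because the same function would be applied to both the original and the received text, any whitespace-only difference between them is normalised away before the watermark patterns are compared.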

Text Feature Extraction and Watermark Generation Algorithm
This algorithm includes two sub-processes: building the Markov matrix, and text feature extraction and watermark generation.
- Building the Markov matrix: this is the starting point of the Arabic text analysis and watermark generation process using the Markov model. A Markov matrix that represents the possible states and transitions available in the given text is constructed without repetitions. In this approach, each triple of alphanumerics within a given Arabic text represents a present state, and each unique alphanumeric represents a transition in the Markov matrix. While building the Markov matrix, the proposed algorithm initializes all transition values to zero so that these cells can later keep track of the number of times the i-th triple of alphanumerics is followed by the j-th alphanumeric within the given Arabic text document.
The pre-processing and Markov matrix building algorithm is presented below in Algorithm 1.

Algorithm 1: Algorithm of preprocessing and building Markov matrix of ETWMM
where OATD is the original Arabic text, OATD_P is the pre-processed Arabic text, L3_mm is the matrix of states and transitions with zero values in all cells, ps is the current state, and ns is the next state.
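Since the body of Algorithm 1 is not reproduced here, the matrix-building step can be sketched roughly as below. This is a hedged Python illustration, not the authors' code: the function name `build_markov_matrix`, the dict-of-dicts layout, and the decision to register only triples that have a following character are all assumptions.

```python
def build_markov_matrix(oatd_p: str, order: int = 3) -> dict:
    """Return a zero-initialised matrix as a dict of dicts: l3_mm[ps][ns] == 0.

    Assumption: a triple is registered as a present state only if at least
    one character follows it in the text.
    """
    # Present states: each unique run of `order` characters, in first-seen
    # order (dict.fromkeys drops repetitions while keeping order).
    states = list(dict.fromkeys(
        oatd_p[i:i + order] for i in range(len(oatd_p) - order)
    ))
    # Transitions: each unique single character of the text.
    transitions = list(dict.fromkeys(oatd_p))
    return {ps: {ns: 0 for ns in transitions} for ps in states}


if __name__ == "__main__":
    # A Latin placeholder string is used for readability; any Unicode
    # string, including Arabic text, is handled the same way.
    l3_mm = build_markov_matrix("abcabd")
    for ps, row in l3_mm.items():
        print(ps, row)
```

Note that Python iterates over Unicode code points, so an Arabic diacritic stored as a separate combining mark counts as its own alphanumeric in this sketch.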
According to the above, a method is presented to construct a two-dimensional matrix of Markov states and transitions named L3_mm.
- Text feature extraction and watermark generation: after the Markov matrix has been constructed, the Arabic text is analyzed to extract its features and generate the watermark patterns. In this algorithm, the number of occurrences of each possible next-state transition for each current state (a triple of alphanumerics) is calculated and constructed as transition probabilities by Eq. (1) below.

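$$\mathrm{L3\_mm}[ps][ns] = P[i][j], \quad \text{for } i, j = 1, 2, \ldots, n \tag{1}$$
where n is the number of states, i is the i-th current state (a triple of alphanumerics), j is the j-th next-state transition, and P[i][j] is the transition value recorded for that pair, i.e., the number of times state i is followed by transition j.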
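The counting behind Eq. (1) can be illustrated with a short Python sketch. It assumes, following the description of the Markov matrix, that each cell holds the number of times a triple is followed by a given character. The name `count_transitions` and the sparse dict-of-dicts layout (only non-zero cells are stored) are illustrative choices, not the authors' implementation.

```python
from collections import defaultdict


def count_transitions(oatd_p: str, order: int = 3) -> dict:
    """Count how often each triple (ps) is followed by each character (ns)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(oatd_p) - order):
        ps = oatd_p[i:i + order]   # present state: three consecutive characters
        ns = oatd_p[i + order]     # next state: the character that follows
        counts[ps][ns] += 1
    # Convert to plain dicts so the result compares and prints cleanly.
    return {ps: dict(row) for ps, row in counts.items()}


if __name__ == "__main__":
    print(count_transitions("abcabcabd"))
```

For the placeholder string "abcabcabd", the state "cab" is followed once by "c" and once by "d", so it has two available transitions, whereas "abc" and "bca" each have a single transition that occurs twice.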
The following example of the Arabic text sample describes the mechanism of the transition process of the present state to other next states.
When using the third-level order of the alphanumeric mechanism of the hidden Markov model, every unique triple of Arabic alphanumerics is a present state. The text is analyzed as it is read to obtain the interrelationships between the present state and the next states. Fig. 2 below illustrates the available transitions and analysis results for the above sample of Arabic text. As illustrated in Fig. 3 above, we assume " " is a present state of three consecutive alphanumerics and the available next transition is " ". We observe that only one transition is available in the given Arabic text sample.
The text analysis and watermark generation algorithm is presented formally in Algorithm 2.

Watermark Embedding Algorithm
In the proposed ETWMM method, the watermark key is generated as a result of the text analysis and feature extraction process by finding all non-zero values in the Markov matrix. These non-zero values are concatenated sequentially to generate the original watermark pattern L3_WM_O, as given in Eq. (2) and illustrated in Fig. 5.

$$\mathrm{L3\_WM}_O = \mathrm{L3\_mm}[ps][ns], \quad \text{for all } ps, ns \text{ with } \mathrm{L3\_mm}[ps][ns] \neq 0 \tag{2}$$
The watermark embedding algorithm based on the third level of the alphanumeric strategy of the Markov model is presented formally below in Algorithm 3.

Algorithm 3: Watermark embedding algorithm of ETWMM
where L3_WM_O is the original watermark pattern.
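As a rough illustration of Eq. (2), the sketch below collects the non-zero cells of a Markov matrix in order. The paper does not specify how the values are serialised, so the tuple layout, the hyphen separator, and the function names `generate_watermark` and `serialize` are assumptions made for this example only.

```python
def generate_watermark(l3_mm: dict) -> list:
    """Collect every non-zero cell as (state, transition, count), in matrix order."""
    return [
        (ps, ns, count)
        for ps, row in l3_mm.items()
        for ns, count in row.items()
        if count != 0
    ]


def serialize(pattern: list) -> str:
    """Concatenate the counts sequentially; the '-' separator is an assumption."""
    return "-".join(str(count) for _, _, count in pattern)


if __name__ == "__main__":
    # Hand-written matrix for the placeholder text "abcabcabd".
    l3_mm = {
        "abc": {"a": 2, "b": 0, "c": 0, "d": 0},
        "bca": {"a": 0, "b": 2, "c": 0, "d": 0},
        "cab": {"a": 0, "b": 0, "c": 1, "d": 1},
    }
    l3_wm_o = generate_watermark(l3_mm)
    print(l3_wm_o)
    print(serialize(l3_wm_o))
```

Keeping the state and transition alongside each count is a design choice of this sketch: it lets a later comparison work per state, which a bare concatenation of digits would not allow. Nothing is written into the text itself, which matches the logical (zero-watermark) embedding described above.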

Watermark Extraction and Detection Process
Before detection on the pre-processed attacked Arabic text (AATD_P), the attacked watermark patterns (L3_EWM_A) must be generated, and the pattern matching rate and watermark distortion rate must be calculated by ETWMM to detect any tampering and to authenticate the given Arabic text.
Two core algorithms are involved in this process: watermark extraction and watermark detection. L3_EWM_A is extracted from the received AATD_P and matched with L3_WM_O by the detection algorithm.
AATD_P is provided as the input to the proposed watermark extraction algorithm. The same process as in the watermark generation algorithm is performed to obtain the watermark pattern for AATD_P, as illustrated in Fig. 6.

Watermark Extraction Algorithm
AATD_P is the main input required to run this algorithm, and its output is L3_EWM_A. The watermark extraction algorithm is presented formally in Algorithm 4.

Algorithm 4: Watermark extraction algorithm of ETWMM
where L3_EWM_A is the attacked watermark.

Watermark Detection Algorithm
L3_EWM_A and L3_WM_O are the main inputs required to run this algorithm, while its output is a notification of whether the Arabic text document is authentic or tampered. The detection process of the extracted watermark is achieved in two main steps:
• Primary matching is performed between L3_WM_O and L3_EWM_A. If the two patterns are identical, the alert "Arabic text is authentic and no tampering occurred" is shown. Otherwise, the notification is "Arabic text is tampered," and the process continues to the next step.
• Secondary matching is performed by matching the transitions of each state in the generated pattern. That is, each state transition in L3_EWM_A is compared with the equivalent transition in L3_WM_O, as given by Eqs. (3) and (4) below, where L3_PMR_T represents the tampering detection accuracy rate at the transition level.
where L3_PMR_S is the value of the tampering detection accuracy rate at the state level.
After the pattern matching rate of every state has been produced, we have to find the weight of every state stored in the Markov matrix as presented in Eq. (5) below.
where L3_sw refers to total tampering detection accuracy and watermark robustness.
The final L3_PMR of AATD_P and OATD_P is calculated by Eq. (6).
The rate of watermark distortion represents the rate of tampering attacks occurring on the contents of the attacked Arabic text, which is denoted by L3_WDR and calculated by Eq. (7).
The steps involved in the watermark detection algorithm are illustrated in Algorithm 5. The results of the watermark extraction and detection process are illustrated in Fig. 7.
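The two matching steps can be sketched in Python as below. Eqs. (3) to (7) are not reproduced in this text, so every formula in the sketch is an illustrative stand-in rather than the authors' definition: the per-transition rate, the state-level average, the state weight (the state's share of all original transitions), and the distortion rate taken as one minus the matching rate are all assumptions, as are the function names.

```python
from collections import defaultdict


def extract_pattern(text: str, order: int = 3) -> dict:
    """Non-zero transition counts per triple, i.e. a sparse watermark pattern."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return {ps: dict(row) for ps, row in counts.items()}


def detect(l3_wm_o: dict, l3_ewm_a: dict):
    """Return (verdict, matching rate, distortion rate) using stand-in formulas."""
    # Primary matching: identical patterns mean the text is authentic.
    if l3_wm_o == l3_ewm_a:
        return "authentic", 1.0, 0.0
    # Secondary matching: compare the transitions of every original state.
    # States that appear only in the attacked pattern are not scored here.
    total = sum(sum(row.values()) for row in l3_wm_o.values())
    pmr = 0.0
    for ps, row in l3_wm_o.items():
        attacked_row = l3_ewm_a.get(ps, {})
        rates = []
        for ns, original in row.items():
            attacked = attacked_row.get(ns, 0)
            # Assumed transition-level rate: 1 when counts agree, 0 when lost.
            rates.append(1 - abs(original - attacked) / max(original, attacked))
        pmr_s = sum(rates) / len(rates)      # assumed state-level rate
        weight = sum(row.values()) / total   # assumed state weight
        pmr += pmr_s * weight
    return "tampered", pmr, 1 - pmr          # assumed: WDR = 1 - PMR


if __name__ == "__main__":
    original = extract_pattern("abcabcabd")
    print(detect(original, extract_pattern("abcabcabd")))
    print(detect(original, extract_pattern("abcabcab")))  # last character deleted
```

In the second call, deleting the final character removes the transition from "cab" to "d", so only that state's rate drops and the verdict becomes "tampered". Because the comparison is made per state, a sketch of this kind also hints at where in the text the change occurred.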

Implementation and Simulation
To evaluate the tampering detection accuracy and robustness of our ETWMM method, several simulation scenarios and experiments were performed. This section describes the implementation, simulation, and experimental environment, the experiment parameters, the experimental scenarios on standard Arabic datasets, and the discussion of results.

Simulation and Implementation Environment
A self-developed program was used to test and evaluate the tampering detection accuracy and robustness of ETWMM. The implementation environment of ETWMM is: CPU: Intel Core i7-4650U/2.3 GHz, RAM: 8.0 GB, Windows 10 (64-bit), PHP programming language with the VS Code IDE.

ETWMM Simulation and Experiment
This subsection presents the evaluation of the tampering detection accuracy and robustness of ETWMM. Many simulation and experiment scenarios were performed for all forms of attacks and their volumes, as shown in Tab. 1. Regarding the effect of attack volume across all dataset sizes, the results in Tab. 1 above and Fig. 8 below show a high effect under reorder and deletion attacks for low, mid, and high attack volumes. The results also show that reorder attacks are more sensitive than deletion attacks at medium attack volumes, whereas deletion attacks are more sensitive than reorder attacks at low and high attack volumes. This means that ETWMM gives its best detection accuracy under insertion attacks in all attack-volume scenarios.

Comparison and Result Discussion
The detection accuracy and robustness results were critically analyzed. This subsection presents an effect study and a comparison between ETWMM and two baseline approaches, the Hybrid of Natural Language Processing and Zero-Watermarking Approach (HNLPZWA) [5] and the Zero-Watermarking Approach based on Fourth level order of Arabic Word Mechanism of Markov Model (ZWAFWMMM) [6]. It also discusses their behavior under the major factors, namely dataset size, attack type, and attack volume.

Attacks Type-Based Comparison
Tab. 2 compares the effect of the different attack types on the detection accuracy and robustness of the ETWMM, ZWAFWMMM, and HNLPZWA approaches across all dataset sizes and all attack-volume scenarios. Tab. 2 and Fig. 9 show how the detection accuracy and robustness of the three approaches are influenced by the attack type. In all cases of deletion and reorder attacks, ETWMM outperforms ZWAFWMMM and HNLPZWA in terms of general detection accuracy and robustness in all scenarios. HNLPZWA and ZWAFWMMM give the best detection accuracy and robustness rates under insertion attacks. This means that ETWMM is strongly recommended and applicable for content authentication and tampering detection of Arabic text transmitted via the Internet, especially under deletion and reorder attacks.

Attacks Volume-Based Comparison
Tab. 3 compares the effect of the different attack volumes on detection accuracy and robustness across all dataset sizes and all attack-volume scenarios. The comparison is performed using the ETWMM, ZWAFWMMM, and HNLPZWA approaches. Tab. 3 and Fig. 10 show how the detection accuracy and robustness of the three approaches are influenced by the attack volume. Fig. 10 shows that as the attack volume increases, the detection accuracy and robustness decrease. It can also be seen that ETWMM outperforms both HNLPZWA and ZWAFWMMM in terms of general detection accuracy and robustness in all scenarios of low, mid, and high volumes of all attacks. This means that ETWMM is strongly recommended and applicable for tampering detection of Arabic text under all volumes of all attacks.

Dataset Size-Based Comparison
This section compares the effect of dataset size on detection accuracy and robustness against all forms of attacks and their multiple volumes, as shown in Tab. 4. Fig. 11 shows how the detection accuracy and robustness of the ETWMM, HNLPZWA, and ZWAFWMMM approaches are influenced by the dataset size. For the baseline HNLPZWA and ZWAFWMMM approaches, Fig. 11 shows that the detection accuracy and robustness increase as the dataset size decreases and decrease as it increases. In the proposed approach, however, the detection accuracy and robustness increase with increasing document size. Moreover, the results show that ETWMM outperforms both HNLPZWA and ZWAFWMMM in terms of general detection accuracy and robustness under all dataset sizes. This means that the proposed ETWMM approach is strongly recommended for tampering detection with small, mid, and large document sizes.
Figure 11: A comparison based on dataset effect of our ETWMM method and baseline approaches

Conclusion
In this paper, the ETWMM method has been proposed for detecting illegal tampering attacks on Arabic text by combining text-watermarking and natural language processing techniques. A text analysis process is performed to extract the features of the given Arabic text and generate a watermark key. The generated watermark is embedded logically in the original Arabic text without modifying it or affecting its size. After the text has been transmitted via the Internet, the embedded watermark is used to detect any tampering with the received Arabic text and to decide whether it is authentic. ETWMM was implemented as a self-developed PHP program in the VS Code IDE, and experiments were conducted on various standard datasets under different volumes of insertion, deletion, and reorder attacks. We compared ETWMM with HNLPZWA and ZWAFWMMM. The comparison results show that ETWMM outperforms HNLPZWA and ZWAFWMMM in terms of robustness and accuracy of tampering detection, because the third-level order of the alphanumeric mechanism of the HMM captures stronger interrelationships between alphanumerics than other levels of the alphanumeric or word mechanisms of the HMM. The results also show that ETWMM is applicable to all Arabic alphabetic letters, special characters, numbers, and spaces. Although ETWMM is an efficient approach, it is designed only for all scenarios of deletion and reorder attacks. For future work, we will consider detection accuracy under all scenarios of insertion attacks. Moreover, we intend to evaluate the performance using other information security and artificial intelligence techniques.