Securing Arabic Contents Algorithm for Smart Detecting of Illegal Tampering Attacks

Themost common digital media exchanged via the Internet is in text form.TheArabic language is considered one of themost sensitive languages of content modification due to the presence of diacritics that can cause a change in the meaning. In this paper, an intelligent scheme is proposed for improving the reliability and security of the text exchanged via the Internet. The core mechanism of the proposed scheme depends on integrating the hidden Markov model and zero text watermarking techniques. The watermark key will be generated by utilizing the extracted features of the text analysis process using the third order and word level of the Markov model. The Embedding and detection processes of the proposed scheme will be performed logically without the effect of the original text. The proposed scheme is implemented using PHP with VS code IDE. The simulation results, using varying sizes of standard datasets, show that the proposed scheme can obtain high reliability and provide better accuracy of the common illegal tampering attacks. Comparison results with other baseline techniques show the added value of the proposed scheme.


Introduction
In communication technologies, validation and authentication of digital text and detection of its reliability in various languages become the most challenging to secure the transmission of the contents over the Internet. Many applications, e.g., e-Banking, eLearning, eGovernment and e-Commerce, render information exchange via the Internet the most difficult. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Most of the digital media exchanged via the internet is in text form. The text data is very easy for illegal attacks to modify, and more sensitive than other media for changes whenever change volume was very low. The Arabic language is considered one of the most sensitive languages in terms of content modification due to the presence of diacritics that can cause a change in the meaning [1].
To overcome these issues, steganography and automated methods of watermarking are commonly used. A technique of Digital-Watermarking (DWM) can be inserted into digital material through various details such as text, binary pictures, audio, and video [2,3]. Another technique, A fine-grained text watermarking procedure, is proposed based on replacing the white spaces and Latin symbols with homoglyph characters [4].
Many solutions and methods are proposed by researchers to improve the security of digital media for several purposes such as copyright protection, identity verification, and authentication [5]. Solutions for text media are limited because text data is natural language-dependent and there are no locations to hide secret information within the text. In the case of images, secure data can be hidden in pixels. However, in the case of audio, secure data can be hidden in waves, and in the case of video, it can be hidden in frames [6,7].
Several information security techniques and algorithms can be used for various purposes of content authentication and tampering detection of digital text [8]. To hide the security data in the original document, some modifications and changes are required to be performed in the original document, and can affect in terms of contents, meaning, and document size [9,10]. Text watermarking is the most appropriate technique used for this purpose. Zero text watermarking is a smart algorithm that can be used, without any changes or modifications to the original text document, to embed the watermark data. Moreover, this technique can be used to generate data for a watermark in the contents of the given text [11][12][13].
The verification of digital text and the identification of fraud in research earned great attention [14]. Restricted appropriate techniques and solutions are proposed to improve the authenticity and reliability of sensitive digital multimedia such as Arabic interactive Holy Qur'an and digital money [15]. The most common technique of soft computation and Natural Language Processing (NLP) that supported the analysis of the text is HMM [16].
In this paper, we propose a combined smart scheme, Smart Arabic text Watermarking scheme based on third of Word-level of Markov Model (SAWMWMM) for detecting illegal tampering of Arabic text transmitted via the Internet. The proposed scheme depends on the integration of zero text watermarking and the hidden Markov model used as an NLP technique for text analysis and extraction of Arabic text features to use it as a watermark key.
The primary objective of the SAWMWMM scheme is to meet the high accuracy and reliability of sensitive detection of illegal tampering attacks on Arabic text which is transmitted through the Internet.
The remainder of this paper is structured as follows: In Section 2, the authors review the existing works done so far and explain the suggested SAWMWMM scheme in Section 3. The simulation and implementation are provided in Section 4, the discussion of results in Section 5, and finally, the authors conclude the paper in Section 6.

Related Work
According to content verification, authentication, and tampering detection domain, several solutions and techniques are reviewed in this paper.
In [17], the location of the words-based method is proposed to improve the visibility and capability of the Arabic contents. The mechanism of this method is based on word space to indicate the Boolean bit whether 1 or 0 which refers to the modifications of the plain text.
A steganography-based algorithm is designed in [18] to hide secure data in the Arabic text. In this algorithm, Harakat's of the Arabic diacritics such as Damma, Fatha, and Kasra are utilized to cover messages of secure data and use them later to validate the authenticity of the text.
In [19], A Kashida-based scheme is suggested for authentication of text documents. This scheme depends on the frequent recurrence of characters and utilizes its features as watermark data with a Kashida bit 0 and/or a bit omitted. In [20], a Kashida extension-based method is developed for Arabic content authentication. The mechanism of this method is based on using 'moon' and 'sun' characters of Arabic language to organize the given context. In addition, characters of the Kashida method are used to validate the authenticity of hidden bits.
In [21], a scheme for Quran text watermarking is proposed to improve the security of Arabic text. The proposed scheme uses vowel characters with kashida. This scheme is based on a reversing technique that depends on four phases, namely, pre-processing, embedding process, extraction process, and evaluate the performance of the proposed scheme. The results show good improvement of capacity.
A text watermarking scheme is proposed in [22] to improve the security of Arabic text in the Holy Quran by using vowels with kashida. The hybrid technique used in the proposed scheme are XOR and queuing techniques.
A position-based algorithm is designed in [23] for content authentication purposes. This algorithm depends on conserving the words positions of the given text, then manipulate their transitions and utilize them as a secret key. In [24,25], Chinese text-features methods are proposed for authenticity validation of Chines text. The mechanism of these methods depends on splitting the Chinese text into small groups of sentences, obtaining the semantic code of each word, and utilizing the sentence entropy and distribution of the semantic codes as secret data which will be used later to validate the given Chines text as authentic or not.
In [26], A Hurst exponent-based watermarking method is proposed to improve the user's privacy. This method depends on an individual's identity and utilizes it as a watermark key which will be used later to evaluate the unvoiced frames. An English text-based watermarking method is presented in [27] to improve the security issues of English text. Markov-based methods are proposed in [28,29] to validate the authenticity of the English text contents. These methods depend on utilizing the probability feature of the given text as a secret watermark key. The ASCII-based method is suggested in [30] for copyright protection of English text. This method uses ASCII of non-vowel letters of the English text.

The Proposed Scheme (SAWMWMM)
In this paper, we propose a smart scheme called Smart Arabic text Watermarking based on third of Word-level of Markov Model (SAWMWMM). The proposed scheme depends on combining text watermark and NLP techniques to get rid of external watermark data without altering the source text to hide the watermark data. The third of word-level of the Markov model is used as a NLP technique for text analysis of the given Arabic text, extract their interrelationships features and utilizes them as a watermark key.
The main contributions of the proposed SAWMWMM scheme can be summarized as follows: • The accuracy of tampering detection has improved as compared with other previous approaches. • Tampered locations will be determined in the given Arabic text.
• Unlike previous work where watermarking is done with language, contents, and scale effecting, the SAWMWMM scheme logically embeds watermarking with no effect on text, content, or size. • The SAWMWMM scheme is highly vulnerable to any basic alteration to the Arabic text and context defined as complex text. Somehow, the above three contributions are present only in pictures, though not in the text. That is the key argument for this paper's contribution.
Two core phases are conducted to run the SAWMWMM scheme, firstly the watermark generating and embedding, and secondly the watermark extraction and detection as shown in Figs. 1 and 2.

The Generating and Embedding of the Watermark
This phase consists of three sub-processes, namely, pre-processing, watermark generation, and embedding.

Pre-Processing and Construction of the Markov Matrix Process
The core aim of this process is to remove the extra new lines and spaces of the original given text and then create the initial matrix of the Markov model. The Original Arabic Text (OAT) is required to perform this process. The core mechanism of this process depends on states and transitions probability by dividing the given text into small series of unique triple words to create a two-dimensional matrix of the Markov model. States in the Markov matrix are represented by a unique triple of words, whereas transitions are represented by a unique of words of the given text. This process executes as presented in Algorithm 1.

Algorithm 1:
Pre-processing and creating the Markov matrix of the SAWMWMM scheme where OAT: refers to source Arabic text, PAT: refers to preprocessed text, w3_mm: refers to Markov matrix, ps: is current state, ns: is next state.
The matrix size of w3_mm [i] [j] of SAWMWMM is dynamic where it is based on the contents of the given text which is equal to the number of the triple of words without reputations.

Watermark Generation Process
This process represents the important task of the proposed SAWMWMM scheme. PAT is provided to this process as input to perform text analysis and extract the features of the given Arabic text by obtaining the interrelationships between them. The extracted features will be utilized as a watermark key. In this paper, we have used the following sample of Arabic text as an example to describe the various scenarios of the proposed SAWMWMM scheme. The interrelationships between the contents of the given sample of the Arabic text are shown in Fig. 4 as a result of the text analysis process using the third of word-level of the Markov model. We assume that " " represent the selected state, whereas the only one transition available is " ". The same cases in other available states. Depending on the available states and transitions of the given text, a two-dimensional matrix should be constructed  In this process, the probability appearances of the possible next transition for every present state of the unique triple of words will be computed and accumulated as given by Eq. (1).

w3_mm [ps] [ns]
where n refers to the account of states.

Figure 4:
Representation of the inter-relationships between the given arabic text using the SAWMWMM scheme Tab. 1 illustrates the results of the watermark generation process using the SAWMWMM scheme. Locations of the interrelationships between the Arabic contents will be saved in order to use them later in the detection process to determine the tampering locations.

Algorithm 2: Watermark generation using the SAWMWMM scheme
where pw is a previous word, and cw is a current word.

Watermark Embedding Process
The embedding process of the proposed SAWMWMM scheme will be performed logically by obtaining all non-zero values and their positions in the Markov matrix and concatenating them in sequential pattern W3_WMP O as given by Eq. (2), depicted in Fig. 5 and formally executed in Algorithm 3.

The Extraction and Detection of the Watermark
This phase consists of two sub-processes, namely, watermark extraction and watermark detection.

Watermark Extraction Process
The core aim of this process is to extract the embedded watermark within the preprocessed text PAT and provide it to the detection process through W 3_WMP A as shown in Algorithm 4.

Algorithm 4: Watermark extraction using the SAWMWMM scheme
where PAT P is a pre-processed attacked Arabic text and W3_WMP A is an attacked watermark pattern.

Watermark Detection Process
W3_WMP A and W3_WMP O are the core input needed to run the watermark detection process. However, the output of this process is to notify whether the Arabic text contents are tampered or not.
The detection process of the extracted watermark is performed at the transition level by computing the absolute value of the original and attacked watermark patterns as given by Eq. (3). However, detection at the state level is achieved by computing the total value of transition level divided by the total numbers of transitions as given by Eq. (4).
where W3_PMR T is the value of patterns matching in the transition degree.
where W3_PMR S is the value of patterns matching in the state degree. The matching weight of each state will be calculated by Eq. (5) W3_Sw = W3_PMR S (i) * Transitions frequency(i) total number of transitions (5) where W3_PMR S refers to the matching value of the i th state.
The final W 3_PMR of both AATD P and OATD P are computed by Eq. (6) The distortion value refers to the tampering rate of illegal attacks that occurred on the attacked Arabic text which is designated by W3_WDR and calculated by Eq. (7).
The steps involved in the watermark detection algorithm are as given in Algorithm 5.

Algorithm 5: Watermark detection using the SAWMWMM scheme
Tab. 2 illustrates the results of the watermark extraction and detection processes using the SAWMWMM scheme. Table 2: Watermark extraction and detection processes using the SAWMWMM scheme

Implementation and Simulation
To evaluate the reliability and accuracy of the SAWMWMM scheme, various scenarios of simulations are performed using a various size of standard datasets.

Implementation and Simulation Environment
The self-developed software is developed to evaluate the reliability and accuracy of the proposed SAWMWMM scheme. The implementation environment of SAWMWMM is: CPU: Intel Corei7 − 4650U/2.3 GHz, RAM : 8.0 GB, Windows 10-64 bit, PHP programming language with VS Code IDE.

SAWMWMM Simulation and Experiment
Many simulations and experiment scenarios are conducted to evaluate the reliability and accuracy of the SAWMWMM scheme using common types and volumes of tampering attacks (Insertion, Deletion and Reorder) as shown in Tab. 3. The results in Tab. 3 and Fig. 6 show that, in the case of low attack volume, the low effect is detected under insertion attack. However, in the case of medium and high attack volumes, a high effect is detected under deletion and reorder attacks. Therefore, in all scenarios of attack rates, the SAWMWMM scheme gives the best detection accuracy under deletion and reorder attacks.

Comparison and Results Discussion
This section presents a detection accuracy effect study and comparison between SAWMWMM and other baseline methods, namely, UZWAMW [1] and HAZWCTW [3]. The results of detection accuracy under the effects of attack type, attack volume, and dataset size are critically evaluated and analysed.

Detection Accuracy Comparison Under Attack Type Effect
Tab. 4 shows a comparison effect of the different types of attack on the accuracy of SAWMWMM, UZWAMW, and HAZWCTW methods against various sizes of datasets and attack volumes. As seen in Tab. 4 and Fig. 7, SAWMWMM outperforms UZWAMW and HAZWCTW in terms of general accuracy and reliability in all scenarios of all attacks. Hence, the SAWMWMM scheme is strongly recommended and applicable for detecting illegal tampering of all sensitive digital Arabic text.

Detection Accuracy Comparison Under Attack Volume Effect
Tab. 5 provides a critically comparison of the various attack volume effects on the detection accuracy against all sizes of the dataset and attack volumes.
As shown in Tab. 5 and Fig. 8, SAWMWMM outperforms both UZWAMW and HAZWCTW methods in terms of general performance in all volume scenarios of attack (low, mid, and high). Thus, the SAWMWMM scheme is applicable for detecting all illegal tampering of sensitive Arabic text exchanged via the internet whenever the attack volume is very high or very low.

Detection Accuracy Comparison Under Dataset Size Effect
Tab. 6 shows a comparison of different dataset size effects on the accuracy of tampering detection. The results of Tab. 6 and Fig. 9 show that in the SAWMWMM scheme, the highest effects detected of dataset size that reflect the best accuracy are ordered as ASST, ALST, AMST, and AHMST, respectively. Hence, the detection accuracy decreased with increasing document size and increased with decreasing document size. Nevertheless, the SAWMWMM scheme outperforms both UZWAMW and HAZWCTW approaches in terms of reliability and accuracy of tampering detection under all sizes of datasets.

Conclusion
In this paper, the SAWMWMM scheme is proposed to transfer and receive authentic Arabic content via the Internet. It depends mainly on combining zero watermarking and NLP techniques for text analysis to extract its features and utilize them as watermark data. The proposed scheme is implemented in PHP programming language using VS code IDE. It is simulated and evaluated using various sizes of standard datasets under different volumes of illegal tampering attacks, namely, insertion, deletion, and reorder. Evaluation results show that the SAWMWMM scheme is reliable, recommended, and applicable to use with e-Banking, eLearning, eGovernment and e-Commerce systems and applications to detect any tampering attacks of sensitive contents.
Compared with the UZWAMW and the HAZWCTW approaches, SAWMWMM outperforms them in terms of general performance which represents watermark capacity, reliability and detection accuracy under all scenarios of all attack types and volumes. For future work, the authors will intend to improve the reliability and accuracy using advanced techniques of soft computing.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.