|Intelligent Automation & Soft Computing |
Tampering Detection Approach of Arabic-Text Based on Contents Interrelationship
1Department of Computer Science, King Khalid University, Muhayel Aseer, KSA & Faculty of Computer and IT, Sana’a University, Yemen
2Department of Information Systems, King Khalid University, Mayahel Aseer, KSA
3Department of Electronic & Communication Engineering, Faculty of Engineering, University of Aden, Aden, Yemen
4Department of Mathematics and Computer, IBB University, IBB, Yemen & Department of Mathematics, KKU, KSA
5Faculty of Computer and IT, Sana’a University, Sana’a, Yemen
*Corresponding Author: Abdelzahir Abdelmaboud. Email: email@example.com
Received: 13 September 2020; Accepted: 28 October 2020
Abstract: Text information depends primarily on natural languages processing. Improving the security and usability of text information shared through the public internet has therefore been the most demanding problem facing researchers. In contact and knowledge sharing through the Internet, the authentication of content and the identification of digital content are becoming a key problem. Therefore, in this paper, a fragile approach of zero-watermarking based on natural language processing has been developed for authentication of content and prevention of misuse of Arabic texts distributed over the Internet. According to the proposed approach, watermark embedding, and identification was technically carried out such that the initial text document did not need to be changed to embed a watermark. The automated zero-watermarking approaches have been combined with a second-tier word order framework based on the Markov model to boost the efficiency, precision, ability, and fragility of the existing researches. This second-tier of the Markov-model has been used as a natural language processing technique to analyze Arabic-text and extract the features of the interrelationship between textual contexts. Moreover, the extracted features have been utilized as information of watermark and then validated to identify any possible tampering with the attacked Arabic-text. The recommended solution has been applied with VS code IDE using PHP. The experimental results using four datasets of varying size show that the proposed approach can obtain better detection accuracy of tampering attacks, effectiveness and high fragility for common random insertion, reorder and deletion attacks for common attacks, e.g., Comparison results with baseline approaches also show the advantages of the proposed approach.
Keywords: Text analysis; NLP; HMM; Arabic text processing; watermark fragility; tampering detection accuracy
The reliability and security of text information shared over the Internet is the most exciting and demanding area for the scientific community. In communication technologies, authentication of content, and honesty of automated text verification with different Languages formats are great significance. Numerous applications such as electronic banking, electronic commerce, etc. impose most challenges during contents transfer via the internet. In terms of content, structure, grammar, and semantics, much of the multimedia exchanged via the Internet is in form of textual which is very susceptible to online transmission. During the transfer process, malicious attackers can temper such digital content and thus the changed count .
For information security, many algorithms and techniques are available, such as content authentication, verification of integrity, detection of tampering, identification of owners, access control, and copyright protection.
To overcome these issues, digital watermarking (DWM) is a technique that can be used to hide the various data, for example; binary images, text, audio, and video with embed them in digital content as watermark information [2,3].
A fine-grain process for document watermarking is suggested based on the substitution of homoglyph characters of white spaces and Latin symbols .
Several conventional methods and solutions for text watermarking were proposed  and categorized into different classes, such as linguistic, structure, zero watermarks, and format-based methods . To insert the watermark information into the document, most of these methods need improvement to plain text material. Zero-watermarking without any alteration to the original digital material to embed the watermark information is a recent technology used with intelligent methods and algorithms. In addition, the contents of a given digital background can be used in this process to produce the watermark key [1,6–8].
Restricted research has centered on the appropriate solutions to verify the credibility of critical digital media online [9–11]. In the research community, digital text tampering detection and authentication have received great attention. Also, research in the field of text watermarking has concentrated on copyright protection in the last decade, but less interest and attention has been paid to integrity verification, identification of tampering, and authentication of content due to the existence of text content based on the natural language .
Proposing the most appropriate approaches for various formats and content, especially in English and Arabic languages, is the most common challenge in this area [13,14]. Therefore, authentication of content, verification of honesty, and detection of tampering of sensitive text is a major problem in different applications and needs required solutions.
Some instances of such sensitive digital text content are Arabic Holy-Qur’an, eChecks, and online marking of exams. Different Arabic alphabet characteristics such as diacritics, letters extraction, and symbols of Arabic supplementary that make it easy to alter the key text material meaning by creating basic changes such as modifying diacritic arrangements [11,15]. The most popular soft computing and Natural-Language Processing (NLP) technique which is involved for HMM text analysis.
This paper, present Fragile Arabic-Text Zero-Watermarking approach based on NLP (FATZWNLP). The FATZWNLP approach is focused on the word order mechanism of the second rank, focused on the Markov model for material authentication and identification of Arabic-text transmitted over the Internet. It includes of a model that functions as NLP and soft computing methods in cooperation between the zero-watermarking system and the Markov-model. The second-order term method was included in this approach to extract the interconnections between of the Arabic-text contents for text analysis and to produce a key watermark. Without any alteration or influence mostly on scale of a original text, the created watermark is logically incorporated with in original Arabic sense. The embedded watermark would later be used to identify any tampering that happens on the Arabic-text obtained and to determine whether it is genuine or not.
The main goal of the FATZWNLP strategy is to meet high precision of authentication of content and detection of Arabic-text sensitive attack tampering that is distributed over the Internet.
As follows, the remainder of the paper is structured. The current works completed so far are clarified in Section 2. FATZWNLP is discussed in Section 3. Implementation, study of experimental setup, findings, and discussion are defined in Section 4. Finally, an inference is reached in Section 5.
2 Related Works
According to the processing domain of NLP and text watermarking, these existing methods and solutions of text watermarking reviewed in this paper classified into linguistic, structural, and zero-watermark techniques [1,6,12].
2.1 Linguistic Techniques
The techniques to watermarking of linguistic text are based upon natural language to hide the watermark key by making some altering on the syntactic and semantic nature of the original text .
Watermarking text open-word spaces-based algorithm has been proposed to increase the capability and imperceptibility of Arabic text . In this method, every word-space is used to embed binary data 1 or 0 to obtain the physical altering that occurred on plain text.
A technique of text steganography  has been suggested to conceal details in the Arabic-language. The process of this algorithm considers the presence in Arabic of Harakat (diacritics, i.e., Kasra, Fatha, and Damma) and reverses the Fatha for the hiding of the message.
A Kashida-watermark based method has been presented in , which is frequency recurrence is utilized to get the document features. This method used a predefined watermark data whereby a Kashida is positioned for bit 1 and excluded for bit 0.
The method of text steganography  has been proposed to use Kashida extensions based on the characters ‘sun’ and ‘moon’ to write digital contents of the Arabic language. In this process, Kashida characters are used beside Arabic letters to decide which hidden secret bits are kept by specific characters. In this form, four situations are involved for kashida characters: moon characters representing ‘00’; sun characters representing ‘01’; sun characters representing ‘10’; and moon characters representing ‘11’.
A text steganographic approach  based on multilingual Unicode characters has been suggested to cover details in scripts of English letters using the English Unicode alphabet in other languages. Thirteen letters of the English alphabet have been chosen for this approach. Two bits would be hidden in a time frame. Used ASCII code for embedding00. However, for embedding 01, 10, and 11, Unicode involved multilingual ones.
2.2 Zero-Watermark Techniques
Several techniques and ideas based on watermarking text have been suggested in literature focused on text functionality and features [21–24]. For instance, the authors in Ref.  presents a zero-based watermarking method to maintain the Internet’s data security for items where a watermark is inserted in plain-text before being sent to base stations. The created watermark key is focused on certain content characteristics, such as data size, frequency of data presence, and data capture time. Ref.  offers a zero-watermarking solution to resolving the issues involved with the preservation of English text records, such as authentication of content and copyright protection. In addition, Ref.  provides a Markov model-based zero-watermarking for material authentication of English text, where the possibility characteristics of the English text have been used to extract and store safe watermark information to verify the validity of the attacked text message. Moreover, the authors in Ref.  suggest a conventional method of watermarking English text copyright rights, based on the frequency of occurrence of – anti-vowel ASCII words and letters.
2.3 Hybrid Techniques
Authors in Ref.  present a zero-watermarking method to protect the identification of persons with vocal fold defects. The method proposed is to create a picture of the identity of an individual and incorporate everything into watermark information to verify positions by measuring local binary patterns from the time-frequency continuum. In Ref. , the same strategy is suggested to use a separate method by which the identity of a patient is secured by producing the hidden shares through visual cryptography. Reference , presents an authentication scheme by using semi-fragile watermarking for content authentication of images from malicious attacks. The authors in Refs. [28,29] recommend zero-watermark image authentication methods by integrating scheme of zero-watermark with geometric invariants. In the proposed methods change in original images not necessary. Finally, a hybrid image and text watermarking approach for preserving an English text document is given in reference . In order to validate the ownership of content, the process of this approach relies on inserting a watermark key literally throughout the text that is later extracted.
3 Proposed Approach
This paper introduces a new intelligent approach named FATZWNLP by the combination of zero-watermark and NLP strategies in which no external details like the watermark key needs to be embedded or otherwise changed in the original text. The second tier ordering of the word method of the Markov-model was used to examine the Arabic-text contents and derive the interrelationship characteristics of such NLP technique and text contents. In FATZWNLP, two major processes should be conducted, which are the process of watermark creation and embedding as well as the process of extraction and identification of watermarks shown in Fig. 1.
The following paragraphs describe the generation and extraction processes of the watermark in detail.
3.1 Watermark Generation and Embedding Algorithms
The pre-processing process can be carried out to eliminate any additional spaces and newlines in the provided Arabic-text before the generation of watermark and inserting process. Input for the generation of watermarks and embedding algorithms includes a pre-processed original document of Arabic-text. The initial watermark pattern (W2 AWMO) is the performance of this algorithm. The created watermark would be recorded in the WM database alongside simple Arabic text document information, including the name of the author, the size and identity of the document and the last updated date.
In this step, the three major sub-algorithms included development and pre-processing of text analysis, Markov matrix, and generation and embedding of watermark.
3.1.1 Markov Chain Algorithm
For the pre-processing process to delete additional spaces and newlines, the original Arabic-text (OAT) is needed as an input. Building a Markov matrix is the starting point of Arabic text analysis and watermark generation process using the Markov model. Without reputation, a Markov chain is constructed that describes the potential states and transformations available in a given document. In this method, each specific set of words defines a present situation within a provided Arabic text, and each similar word expresses a transformation in the matrix of Markov. The presented algorithm identifies all transformation values by zero during the building step of the Markov matrix using these cells late to record the number of times the ith pair of words is preceded by the jth word inside the specified Arabic text document. As presented below in Alg. 1, pre-processing and constructing the algorithm of Markov matrix is performed.
Hence, OAT: represent Arabic-text originality, PAT: represent Arabic-text pre-processed, w2_mm: represent transitions and states matrix based zeros value for defining cells, ps: represent the current state, ns: represent the next state.
A procedure for constructing a matrix of two-dimensional of Markov-states and transformations called W2 mm[i][j], which describes the backbone of the Markov-model for analyzing Arabic-text, and providing conjunction with the above.
The W2 mm[i][j] matrix duration of the FATZWNLP is a complex matrix in which the majority of states differs according to the meaning of the defined Arabic-text, which is proportional to the number of non-reputable terms in the pair.
3.1.2 Algorithm of Watermark Generation
Just after Markov matrix has been created to identify the interrelationships between the meaning of the given Arabic text and generate watermark patterns, the NLP of the text analytical technique must be carried out.
The following example of an Arabic sample of texts explains the method of the phase of transformation from the current state to a new state.
When the hidden Markov-model uses the second-tier order of the word technique, every special pair of words is a current state technique. As the document is read to gain the interconnections between the current state and the next ones, document processing is processed. The transitions available from the above sample of the Arabic text are shown in Fig. 2 from below.
As a result of the study of the defined Arabic text using the Markov model’s second tier order of word mechanism, authors describe all current states and their possible transformations, as shown in Fig. 3.
Authors assume that “الثعلب البني” is a present state, and the available next transitions are “السريع”, “البطيء”, and “ الميت”. We observe that there are three transitions available in the defined sample of Arabic-text.
In this algorithm, to every present state of a pair of terms, the number of occurrences of potential next system variables with greater than zero values will be computed and built as transformation probabilities by Eq. (1).
- n: refer to total-number of states.
- i: refer to the ith of the current-state of the pair-words.
- j: refer to the jth of the next state-transition.
The watermark generation algorithm is related to the second-level order of the process of the word of the Markov model, as seen in Fig. 4
Let PATP represent text of pre-processed, w2 mm[ps][ns] defines the Markov-matrix to record values as many times as the ith of words-pair (present state) is preceded by the jth term in the defined Arabic text (next state transitions). As shown in Alg. 2 the algorithm of watermark generation is introduced formally and executed.
where, cw: refers to the current-word and pw: refer to the previous-word.
3.1.3 Algorithm of Watermark Embedding
The embedding process of FATZWNLP is performed logically without any improvement in the original Arabic text by finding the summation of nonzeros weight values of the content interrelationship of each state and store it in a watermark database as given in Eq. (2) and illustrated in Fig. 5 below.
3.2 Algorithms of Watermark-Extraction and Detection
Patterns of attacked watermark (W2-EWMA) must be created before the identification of the attacked documents of Arabic-Text (PAT) and the corresponding rate of patterns and watermark distortion must be determined by FATZWNLP to detect any tampering with the authentication of the provided material.
This method includes two main algorithms, which are watermark extraction and watermark identification. Nevertheless, the detection algorithm will remove W2-EWMA from the obtained (PAT) and balance W2-WMPO.
For the suggested algorithm of watermark extraction, PAT must be given as the input. In order to extract the watermark pattern for (PAT), this very of algorithm of watermark generation procedure should have been carried out.
3.2.1 Algorithm of Watermark Extraction
PAT should be provided as input to initial setup of this algorithm. Though, W2_WMPA is a core output of this algorithm as illustrated formally in Alg. 3.
where PATP refers to the pre-processed attacked Arabic text, W2_WMPA: attacked watermark.
3.2.2 Algorithm of Watermark Detection
W2_WVMA and W2_WVMO should be provided as inputs. However, the status of the given Arabic text is a core output of this algorithm which can be reliable or not. This process can perform in two steps as follows:
– Main matching for W2_WVMO and W2_WVMA is achieved. If those two WM patterns are similar in appearance, then there will be a warning “Given Arabic text is a reliable”. Otherwise, the note will be rendered “Given Arabic text is not reliable”, and then it will be going through next phase.
– Secondary matching is performed by matching of each state’s transition status in the entire produced pattern of watermarks. This means W2_WVMA of each state is contrasted with an analogous transition of W2_WVMO as given by Eqs. (3) and (4) below.
– W2_: represents matching of weight rate in transition of change, (0 < W2_PMRT <=1)T
– : refers to the initial transfer stage of watermark value.
– : refers to attacked transfer stage watermark value.
– i: is the cumulative pattern matching rate of the word state.
– : represents matching rate at the state of change, (0 < W2_WVMS <=100).
The following step is obtaining the weight of each state stored in the Markov chain matrix as presented in Eq. (5).
– : is the summation of matching weight of ith state for each pair of Arabic words.
W2_WVM of AATDP and OATDP are computed by Eq. (6).
where m: is summation of non-zeros in W2_MM.
The distortion rate of the watermark reflects the volume of tampering attacks that take place on the attacked contents of Arabic background, denoted by W2_WDR and calculated Eq. (7).
Algorithm of WM detection is implemented as illustrated in Alg. 4.
– W2_SW: is value of weight of matched states.
– : refer to the importance of WM distortion rate (0 < W2_WDRS <=100).
Fig. 6 shows the effects results of WM extraction and detection.
4 Implementation, Simulation, and Comparison
To validate the accuracy of FATZWNLP, a self-developed program has been implemented, several scenarios of experiments and simulation are performed as explained in detail in the following subsections.
4.1 Implementation and Setup Environment
FATZWNLP approach, is executed by self-developed program in object oriented and PHP using VS Code IDE on the environment having modern features.
4.2 Simulation and Experimental Parameters
The following an experimental, simulation metrics and their related values that used to perform the experiments are given in Tab. 1.
4.3 Watermark Fragility and Tampering Detection Accuracy Metrics
Watermark fragility and tampering detection accuracy of FATZWNLP is evaluated using the following metrics.
– Accuracy of watermark fragility (W2_WVM and W2_WDR) is evaluated under main four attack volumes which are: very low (5%), low (10%), mid (20%) and high (50%).
– Desired accuracy of watermark fragility values near to 100%.
– Comparison of text size, attack types, and attack volumes effects against accuracy of watermark fragility using the proposed FATZWNLP, MACATDW and UZWAMW baseline approaches.
4.4 Baseline Approaches
Detection accuracy of FATZWNLP are compared to MACATDW and UZWAMW approaches. A comparison is performed by using performance and accuracy parameters. Baseline approaches and their objectives are stated in Tab. 2.
4.5 Simulation and Experiment Discussion with FATZWNLP
In this subsection, various scenarios of simulation and experiments of FATZWNLP have been performed to evaluate its performance in terms of watermark fragility. The character set covers all Arabic characters, spaces, special symbols, and numbers. Simulations are performed on various datasets sizes and various kind of attacks and volumes as showed above in Tab. 1.
4.5.1 Accuracy Evaluation of Tampering Detection
Various simulation scenarios have been conducted to text and evaluate the watermark fragility accuracy of FATZWNLP using all types of attacks and their rates as show in Tab. 3. Results are illustrated in Fig. 7.
From Tab. 3 above and Fig. 7 below, it shows that FATZWNLP gives accurate results of WM fragility accuracy. In case of reorder attack, results show high effect of watermark fragility because insertion and deletion attacks are occurred automatically when reorder attack occurred.
4.6 Comparative Results and Discussion
This subsection presents a performance comparison in terms of watermark fragility accuracy of FATZWNLP with MACATDW and UZWAMW approaches, and study their effect under main comparison metrics, which are dataset size, attack types, and volumes.
4.6.1 Results Analysis of Dataset Size Metric
A comparison of dataset size effect on watermark fragility accuracy and a performance evaluation between FATZWNLP, MACATDW, and UZWAMW approaches have been presented in this subsection and graphically illustrated in Fig. 8.
As resulted in the summary of the comparative results shown by Fig. 8, in the case of FATZWNLP approach, the highest effects of dataset size that lead to the best accuracy of watermark fragility are ordered as ASST, AMST, AHMST and ALST, respectively. This means that tampering detection accuracy increases with decreasing Arabic document size and decreases with increasing document size. Furthermore, results show that FATZWNLP approach outperforms both MACATDW and UZWAMW approaches in terms of watermark fragility detection under all scenarios of dataset sizes.
4.6.2 Results Analysis of Attack Type Metric
Fig. 9 illustrates a comparison of the different attack types affecting tampering detection accuracy against all dataset sizes and all scenarios of attack volumes. The comparison was conducted using FATZWNLP with MACATDW and UZWAMW approaches.
As shown by Fig. 9, FATZWNLP outperforms MACATDW and UZWAMW in terms of watermark fragility detection in all scenarios of attack types. This means that FATZWNLP approach is strongly recommended for tampering detection of Arabic text exchanged via the Internet.
4.6.3 Results Analysis of Attack Volume Metric
Fig. 10 illustrates a comparison of the different attack volumes affecting watermark fragility accuracy against all scenarios of attack types and all dataset sizes. The comparison was performed using RDZWANLP with MACATDW and UZWAMW approaches.
As shown by Fig. 10, FATZWNLP outperforms MACATDW and UZWAMW in terms of watermark fragility accuracy in all scenarios of low, mid, and high volumes of all attack types. This means that FATZWNLP approach is highly recommended and applicable for tampering detection of text exchanged via the Internet under all volumes of all attacks.
Based on the second tire and word technique of HMM, a fragile digital zero-watermark approach abbreviated as FATZWNLP has been developed in this paper for tampering detection of Arabic text exchanged via the Internet. In the paper, zero-watermarking has been integrated with HMM, which is used as NLP technique for text analysis and finding interrelationships weight between given Arabic contents to extract a watermark data. FATZWNLP approach has been implemented using PHP self-developed program in VS code IDE, likewise, simulation and experiments were conducted using different Arabic datasets under various attack types and volumes. Simulation and experiment results show that RDZWANLP achieves a high accuracy of tampering detection under all scenarios of attack types with high accuracy of watermark fragility. For the future work, authors will intend to improve the performance using other techniques of Markov model.
Funding Statement: The authors express their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through Research Groups under grant number (R. G. P. 2/55/40 /2019)
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|