|Computers, Materials & Continua |
Arabic Feature-Based Text Watermarking Technique for Sensitive Detecting Tampering Attack
1Department of Computer Science, King Khalid University, Muhayel Aseer, Kingdom of Saudi Arabia
2Faculty of Computer and IT, Sana’a University, Sana’a, Yemen
3School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Malaysia
4Department of Information Systems, King Khalid University, Muhayel Aseer, Kingdom of Saudi Arabia
5Department of Computer Science, Faculty of Science & Arts of Baljurshi, Al-Baha University, KSA
*Corresponding Author: Fahd N. Al-Wesabi. Email: firstname.lastname@example.org
Received: 06 February 2021; Accepted: 15 March 2021
Abstract: This article proposes a highly sensitive approach for detecting tampering attacks on Arabic text transmitted over the Internet (HFDATAI), which integrates digital watermarking with a hidden Markov model as a soft computing strategy. HFDATAI logically embeds and detects the watermark without modifying the original text. A first-order alphanumeric mechanism of the Markov model is incorporated into an automated zero-watermarking approach to enhance the efficiency, accuracy, and intensity of the proposed approach. This first-order alphanumeric Markov model serves as the soft computing technique for analyzing the Arabic text: it extracts the interrelationships among the text contents as watermark information, which is later validated to detect any tampering with the attacked Arabic text. HFDATAI was implemented in PHP using the VS Code IDE. Experiments on four datasets of different lengths, with attacks applied at random locations, illustrate the fragility, efficacy, and applicability of HFDATAI under three common tampering attacks: insertion, reorder, and deletion. HFDATAI was found to be effective, applicable, and very sensitive in detecting any possible tampering with Arabic text.
Keywords: Watermarking; soft computing; text analysis; hidden Markov model; content authentication
1 Introduction
For the research community, the reliability and security of text data exchanged over the Internet is one of the most promising and challenging fields. In communication technologies, content authentication and automated integrity verification of text in different languages and formats are of great significance. Numerous applications, for instance e-banking and e-commerce, make information transfer over the Internet especially challenging. Much of the digital media transferred over the Internet is in text form, and its content, structure, grammar, and semantics are highly susceptible to attack during online transmission. During the transfer process, malicious attackers can tamper with such digital content.
Many algorithms and techniques are available for information security tasks such as content authentication, integrity verification, tampering detection, owner identification, access control, and copyright protection.
To overcome these issues, steganography and automated watermarking methods are commonly used. Digital watermarking (DWM) can embed information into various kinds of digital material such as text, binary images, audio, and video [2,3]. A fine-grained text watermarking procedure has been proposed based on replacing white spaces and Latin symbols with homoglyph characters.
Several conventional methods and solutions for text watermarking have been proposed [5,6] and categorized into classes such as linguistic, structure-based, image-based, and format-based methods. To insert the watermark information into the document, most of these solutions require certain modifications to the original digital text. Zero-watermarking is a newer technique with smart algorithms that embeds the watermark information without any alteration to the original digital material. Instead, it generates the watermark data from the contents of the given digital text [1,7–9].
Limited research has focused on appropriate solutions for verifying the credibility of critical digital media online [10–12]. The verification of digital text and the identification of forgery have earned great attention in research. In addition, text watermarking studies over the last decade have concentrated on copyright protection, while less attention has been paid to integrity verification, tampering detection, and content authentication, owing to the natural-language nature of text content.
Proposing the most appropriate approaches and strategies for different formats and materials, especially in the Arabic and English languages, is the most common challenge in this area [14,15]. Therefore, content authentication, integrity verification, and tampering detection for sensitive text is a major issue in different systems and needs critical solutions.
Some instances of such sensitive digital text content are the Arabic interactive Holy Qur’an, eChecks, tests, and marks. Arabic alphabet characteristics such as diacritics, lengthened letters, and extra Arabic symbols make it easy to change the key meaning of the text with basic changes such as modifying diacritic arrangements. The hidden Markov model (HMM) is the most popular soft computing and natural language processing (NLP) technique supporting text analysis.
We propose a highly fragile method for detecting tampering attacks on Arabic text transmitted over the Internet (HFDATAI) by combining the Markov model with zero-watermarking. The first-order alphanumeric mechanism of the Markov model serves as the soft computing and NLP tool that works together with the zero-watermarking technique. In this method, the first-order alphanumeric mechanism of the Markov model is used for text analysis to extract the interrelationships among the contents of the given Arabic text and to generate the main watermark. The generated watermark is logically embedded in the original Arabic text without altering it or affecting its size. The embedded watermark is later used to identify any manipulation of the Arabic text received after transmission over the Internet and to decide whether it is authentic or not.
The primary objective of HFDATAI is to achieve high accuracy in content authentication and sensitive detection of tampering attacks on Arabic text transmitted over the Internet.
The remainder of the article is structured as follows: Section 2 reviews existing work. Section 3 describes the proposed approach (HFDATAI). Section 4 presents the simulation and implementation, Section 5 discusses the results, and Section 6 concludes the article.
2 Related Work
According to the processing domain of NLP and text watermarking, the existing text watermarking methods and solutions reviewed in this paper are classified into linguistic, structural, and zero-watermarking methods [1,7,13].
Natural language is the foundation of linguistic text watermarking approaches. These methods embed the watermark through changes applied to the semantic and syntactic nature of the plain text.
To enhance the capacity and imperceptibility of Arabic text watermarking, a method has been suggested based on the location of the available words. In this method, each word space is used to hide a bit, 0 or 1, which physically modifies the original text.
A text steganography technique was proposed to hide information in Arabic text. This approach exploits the Arabic diacritics (Harakat) such as Kasra, Fatha, and Damma, and uses reversed Fatha to conceal the message.
An invisible Kashida-based watermarking method, based on the frequent recurrence of characters, was proposed for document security and authentication. The method uses a predetermined watermark key, with a Kashida placed for a bit 1 and omitted for a bit 0.
A text steganography method has been proposed based on Kashida extensions after the ‘moon’ and ‘sun’ letters in Arabic digital content. In this method, Kashida characters are placed alongside Arabic letters to determine which secret bits are hidden by specific characters. Four cases are encoded with the Kashida characters: moon characters representing ‘00’; sun characters representing ‘01’; sun characters representing ‘10’; and moon characters representing ‘11’.
A text steganographic approach based on multilingual Unicode characters has been suggested to hide information in English scripts by using the Unicode counterparts of English letters in other languages. Thirteen letters of the English alphabet were chosen for this approach. Two bits are embedded at a time: the ASCII code is used to embed 00, while multilingual Unicode characters are used to embed 01, 10, and 11. A text watermarking algorithm based on Unicode extended characters has been used to secure textual content from malicious attacks. The algorithm consists of three main steps: watermark generation, embedding, and extraction. Watermark embedding relies on predefined coding tables, while scrambling techniques are used in generating and extracting the watermark to keep the watermarking key secure.
A method addressing substitution attacks, focused on preserving the position of words in the text document, has been proposed. This method depends on manipulating word transitions in the text document. Authentication of Chinese text documents based on combining sentence properties with text watermarking has been suggested [24,25]. The method works as follows: a Chinese text is split into a set of sentences, and a semantic code is obtained for each word. The distribution of the semantic codes determines the sentence entropy.
A zero-watermarking method has been proposed to preserve the privacy of a person, relying on the Hurst exponent and the nullity of the frames. For watermark embedding, these two measures are determined to evaluate the unvoiced frames. The approach integrates an individual’s identity without introducing any distortion into the medical speech signals.
A zero-watermarking method was proposed to address security issues of English text documents, such as content verification and copyright protection. A zero-watermarking approach based on the Markov model has been suggested for content authentication of English text [28,29]. In this approach, the probability characteristics of the English text are extracted and stored as the secure watermark information, which is later used to confirm the validity of the attacked text document. The approach provides security against common text attacks with a watermark distortion rate if, for all known attacks, it is greater than one. For copyright protection of English text, a conventional watermark approach based on the occurrence rate of ASCII non-vowel letters and words has been suggested.
3 The Proposed Approach
This paper proposes an intelligent approach that integrates a soft computing technique with zero-watermarking, which neither needs additional information to be embedded as the watermark key nor requires any changes to the original text in order to embed the watermark. The first-order alphanumeric mechanism of the Markov model is used as the soft computing technique to analyze the Arabic text content and to extract the interrelationship characteristics of that content.
The main contributions of our approach HFDATAI can be summarized as follows:
◾ Unlike previous work, in which watermarking affects the language, content, and size of the text, the HFDATAI approach logically embeds the watermark with no effect on the text, its content, or its size.
◾ The watermarking mechanism in HFDATAI does not require any external information, since the watermark key is generated by analyzing the text and extracting the relationships among its contents.
◾ The HFDATAI approach is highly sensitive to any simple alteration of the Arabic text, which is considered complex text because Arabic diacritics and symbols can change the meaning of an Arabic word.
So far, the above three contributions have been available only for images, not for text. That is the key argument for this paper’s contribution. In addition, HFDATAI can effectively determine the location where tampering occurred, which can be considered an advantage over hash-function methods.
The following subsections describe in detail the two major processes in HFDATAI: watermark generation and embedding, and watermark extraction and detection.
3.1 Watermark Generation and Embedding Phase
The core sub-processes are pre-processing, text analysis and watermark generation, and watermark embedding, as illustrated in Fig. 1.
3.1.1 Algorithm of Pre-Processing
Pre-processing the original Arabic text, which removes extra spaces and new lines, is one of the main steps and can have a significant effect on the accuracy of tampering detection and on watermark robustness. The original Arabic text (OAT) is required as input to this process.
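The pre-processing step can be sketched in Python as follows. This is a minimal illustration, not the authors' PHP implementation: the function name `preprocess_arabic_text` is hypothetical, and it assumes that "removing extra spaces and new lines" means collapsing every run of whitespace into a single space and trimming the ends, while leaving letters and diacritics untouched.

```python
import re


def preprocess_arabic_text(oat: str) -> str:
    """Return a pre-processed text (PAT) from the original Arabic text (OAT).

    Assumption: every run of spaces, tabs, and new lines is collapsed into
    one space, and leading/trailing whitespace is dropped. Letters,
    diacritics, digits, and punctuation are kept exactly as they are.
    """
    return re.sub(r"\s+", " ", oat).strip()


# Illustrative input only; not a sample from the paper's datasets.
oat = "  بسم   الله\n\nالرحمن  الرحيم \n"
pat = preprocess_arabic_text(oat)
print(pat)
```

Keeping diacritics intact matters here, because a change in diacritics is exactly the kind of tampering the approach is meant to detect.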
3.1.2 Algorithm of Watermark Generation
This algorithm involves constructing the Markov matrix, analyzing the text, and generating the watermark.
◾ Building the Markov matrix is a core step of the HFDATAI approach. A Markov chain matrix must be constructed to set up the Markov model environment and represent all available states and transitions. In this approach, each unique single alphanumeric character in the given Arabic text represents a present state, and each unique single alphanumeric character also corresponds to a transition in the Markov chain matrix. When the matrix is constructed, all state and transition positions are initialized to zero. These positions are later used to record the number of times each alphanumeric character is followed by another alphanumeric character in the given Arabic text.
The construction of the Markov matrix is performed as shown in Algorithm 1 below.
where OAT is the original Arabic text, PAT is the preprocessed Arabic text, a1_mm is the matrix of states and transitions, ps is the present state, and ns is the next state.
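As a rough illustration of the matrix construction described above, the following Python sketch builds a zero-initialised state-by-state matrix and then counts how often each present state `ps` is followed by each next state `ns`. The function name `build_markov_matrix` and the nested-dictionary layout are assumptions made for this example, and each Unicode code point (letter, diacritic, digit, space, or symbol) is treated as one state.

```python
def build_markov_matrix(pat: str) -> dict:
    """Build a first-order transition-count matrix from pre-processed text.

    Each unique character of `pat` is a state. The matrix is a nested dict
    a1_mm[ps][ns] holding how many times state `ps` is immediately followed
    by state `ns`. All cells start at zero.
    """
    states = sorted(set(pat))
    a1_mm = {ps: {ns: 0 for ns in states} for ps in states}
    # Walk over every adjacent pair (present state, next state).
    for ps, ns in zip(pat, pat[1:]):
        a1_mm[ps][ns] += 1
    return a1_mm


matrix = build_markov_matrix("abab")
print(matrix["a"]["b"], matrix["b"]["a"], matrix["a"]["a"])
```

A dense nested dictionary mirrors the zero-initialised matrix in the description; for long texts a sparse counter of observed pairs would use less memory.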
Watermark generation based on text analysis: this algorithm is performed as the second step to analyze the Arabic text, extract the features of the given text, and produce the watermark information. In this algorithm, the number of occurrences of each possible transition from every present state (a single alphanumeric character) is computed as a transition probability by Eq. (1).
where n is the total number of states.
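One common way to turn transition counts into transition probabilities is to divide each count by the total number of transitions leaving the same present state. The sketch below uses that row-normalised estimate as an assumption; it may differ from the exact form of Eq. (1), and the function name `transition_probabilities` is hypothetical.

```python
def transition_probabilities(pat: str) -> dict:
    """Estimate first-order transition probabilities from a text.

    Assumption: P(ns | ps) = count(ps -> ns) / count(ps -> anything).
    Only transitions that actually occur are stored.
    """
    counts = {}
    for ps, ns in zip(pat, pat[1:]):
        row = counts.setdefault(ps, {})
        row[ns] = row.get(ns, 0) + 1

    probs = {}
    for ps, row in counts.items():
        total = sum(row.values())
        probs[ps] = {ns: c / total for ns, c in row.items()}
    return probs


# "a" is followed once by "b" and once by "c"; "b" is followed once by "a".
print(transition_probabilities("abac"))
```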
The following Arabic sample demonstrates how the transitions from the present state to the next state are obtained.
When the first-order alphanumeric mechanism of the hidden Markov model is used, each unique alphanumeric character is a present state. The text is analyzed as it is read, and the interrelationship between the present state and the next states is computed. The available transitions for the above Arabic sample are shown in Fig. 2 below.
Fig. 3 illustrates the analysis results for the given Arabic sample and represents each state and its transitions based on the first-order alphanumeric mechanism of the Markov model. As shown in Fig. 3, there are 22 unique present states with 78 possible transitions.
We assume that “” is a present state, and the available next transitions are “”. We observe that eleven transitions are available in the given Arabic sample and that “” transitions are repeated five times. The watermark generation and text analysis algorithm based on the first-order alphanumeric mechanism is shown in Fig. 4.
The watermark generation and text analysis algorithm is formally presented in Algorithm 2.
3.1.3 Algorithm of Watermark Embedding
In this method, the watermark is embedded logically, without changing the original text. After feature extraction from the given Arabic text, the watermark key is embedded logically by identifying all non-zero values in the Markov chain matrix. All these non-zero values are concatenated sequentially to form the original watermark key pattern a1_, as defined in Eq. (2) and Fig. 5.
The watermark embedding algorithm of the HFDATAI approach is formally presented in Algorithm 3.
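The logical embedding step, collecting every non-zero cell of the Markov matrix and concatenating the values in a fixed order, might look like the sketch below. The record layout `(ps, ns, count)`, the sort order, and the string format produced by `serialize_pattern` are illustrative assumptions, not the pattern format of Eq. (2).

```python
def generate_watermark_pattern(pat: str) -> list:
    """Collect all non-zero transition counts as (ps, ns, count) records.

    Only observed transitions have non-zero counts, so counting adjacent
    pairs directly yields exactly the non-zero cells of the matrix. Records
    are sorted so the same text always gives the same pattern.
    """
    counts = {}
    for ps, ns in zip(pat, pat[1:]):
        counts[(ps, ns)] = counts.get((ps, ns), 0) + 1
    return [(ps, ns, counts[(ps, ns)]) for ps, ns in sorted(counts)]


def serialize_pattern(pattern: list) -> str:
    """Concatenate the records into one string (hypothetical format)."""
    return "|".join(f"{ord(ps):04X}>{ord(ns):04X}:{c}" for ps, ns, c in pattern)


pattern = generate_watermark_pattern("abab")
print(serialize_pattern(pattern))
```

Because the pattern is derived from the text and stored separately, the text itself is never altered, which is what makes this a zero-watermarking scheme.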
3.2 Algorithms of Watermark Extracting and Detecting
This process consists of two key algorithms: watermark extraction and watermark detection. a1_ is extracted from the received pre-processed attacked Arabic text (PAAT) and matched by the detection algorithm against a1. PAAT is required as input. Hence, the watermark generation algorithm must be run on PAAT to obtain its watermark pattern, as presented in Fig. 6.
3.2.1 Algorithm of Watermark Extraction
PAAT should be provided as input to run this algorithm, and a1_WMPA is its core output, as presented in Algorithm 4.
where PAAT is the pre-processed attacked Arabic text and a1_EWMA is the attacked watermark key pattern.
3.2.2 Algorithm of Watermark Detecting
a1_ and a1_ should be provided as the inputs to this algorithm. Its core output is the status of the given Arabic text, which can be authentic or tampered. The watermark detection process is performed in two sub-steps:
◾ Primary matching of a1_ and a1_ is performed. If the two watermark patterns are identical, the notification “The Arabic text contents are authentic, and no tampering occurred.” is given. Otherwise, the notification “This Arabic text document is tampered and not authentic,” is given, and the process continues to the next step.
◾ Secondary matching is performed by matching each state’s transitions across the entire generated watermark pattern. That is, each transition a1_ of each state is compared with the corresponding transition of a1_, as given by Eqs. (3) and (4) below.
— represents tampering detection accuracy rate value in transition level, ( )
—: value of tampering detection accuracy rate in state level, ( ).
The weight of every state in the Markov matrix must be determined according to the matching rate of that state, as given in Eq. (5).
—: is the total matching value in the ith state level.
The final a1_PMR of PAAT and PAT is computed by Eq. (6).
The watermark distortion rate represents the amount of tampering attacks on the contents of the Arabic text; it is denoted by a1_WDR and calculated by Eq. (7).
The watermark detection algorithm is formally presented in Algorithm 5.
where a1_SW is the weight value of the correctly matched states and a1_WDR is the value of the watermark distortion rate.
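The two-stage detection described in this subsection can be sketched as follows. The primary check compares the two patterns for equality; the secondary check compares transitions state by state. The scoring used here (overlap of transition counts for the pattern matching rate, and one minus that value for the distortion rate) is an illustrative stand-in and is not claimed to reproduce Eqs. (3) to (7); the function names and the returned dictionary keys are hypothetical.

```python
from collections import Counter


def transition_counts(text: str) -> Counter:
    """Count every (present state, next state) pair in the text."""
    return Counter(zip(text, text[1:]))


def detect_tampering(pat: str, paat: str) -> dict:
    """Compare the original text pattern with the attacked text pattern."""
    original = transition_counts(pat)
    attacked = transition_counts(paat)

    # Primary matching: identical patterns mean the text is taken as authentic.
    if original == attacked:
        return {"authentic": True, "pmr": 1.0, "wdr": 0.0, "suspect_states": []}

    # Secondary matching (assumed scoring): overlap of counts per transition.
    keys = set(original) | set(attacked)
    matched = sum(min(original[k], attacked[k]) for k in keys)
    total = sum(max(original[k], attacked[k]) for k in keys)
    pmr = matched / total
    # States whose outgoing transitions changed hint at where tampering occurred.
    suspects = sorted({ps for ps, ns in keys if original[(ps, ns)] != attacked[(ps, ns)]})
    return {"authentic": False, "pmr": pmr, "wdr": 1.0 - pmr, "suspect_states": suspects}


print(detect_tampering("abab", "abab")["authentic"])
print(detect_tampering("abab", "abb"))
```

Reporting the states whose transitions changed is what allows a Markov-based scheme to point at the region of tampering, which a single hash value cannot do.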
The results of watermark extraction and detection are illustrated in Fig. 7.
As shown in Fig. 7, TP1 represents the first non-zero transition in the given text, TP2 represents the second transition, and so on. Some states have only one transition, which is shown in TP1. Other states, such as “” and “”, have more than one transition, represented in TP1, TP2, etc.
4 Implementation and Simulation
Several implementations and simulations are conducted to test the tampering detection accuracy of HFDATAI. This section outlines the implementation and experiment settings, the experimental conditions, the standard dataset experimental scenarios, and a discussion of the outcomes.
4.1 Simulation and Implementation Environment
Self-developed software was built to evaluate the efficiency of HFDATAI. The implementation environment is: CPU: Intel Core i7-4650U/2.3 GHz; RAM: 8.0 GB; OS: Windows 10 64-bit; programming language: PHP with the VS Code IDE.
4.2 HFDATAI Simulation and Experiment Findings
To evaluate the tampering detection accuracy of HFDATAI, many experimental scenarios are performed for all forms of attacks and their volumes, as shown in Tab. 1.
As the results in Tab. 1 and Fig. 8 show, the HFDATAI approach gives sensitive tampering detection results for all attacks that may have been carried out on the structure, semantics, and syntax of the Arabic text content. Comparing tampering detection by attack type, the results show that the insertion attack yields the most sensitive tampering detection in all attack volume scenarios.
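For readers who want to reproduce this kind of experiment, the three attack types can be simulated along the following lines. Everything in this sketch is an assumption for illustration: the function names, the character-level granularity, the choice to move one contiguous chunk for the reorder attack, and the example rates, which are not the attack volumes used in Tab. 1. The functions expect a non-empty text.

```python
import random


def insertion_attack(text: str, rate: float, rng: random.Random) -> str:
    """Insert characters (drawn from the text itself) at random positions."""
    n = max(1, int(len(text) * rate))
    chars = list(text)
    for _ in range(n):
        chars.insert(rng.randrange(len(chars) + 1), rng.choice(text))
    return "".join(chars)


def deletion_attack(text: str, rate: float, rng: random.Random) -> str:
    """Delete randomly chosen characters."""
    n = min(len(text), max(1, int(len(text) * rate)))
    chars = list(text)
    for _ in range(n):
        del chars[rng.randrange(len(chars))]
    return "".join(chars)


def reorder_attack(text: str, rate: float, rng: random.Random) -> str:
    """Cut one contiguous chunk and re-insert it at a random position.

    The chunk may land back where it started, leaving the text unchanged.
    """
    n = min(len(text), max(1, int(len(text) * rate)))
    start = rng.randrange(len(text) - n + 1)
    chunk = text[start:start + n]
    rest = text[:start] + text[start + n:]
    pos = rng.randrange(len(rest) + 1)
    return rest[:pos] + chunk + rest[pos:]


rng = random.Random(42)  # fixed seed so runs are repeatable
sample = "بسم الله الرحمن الرحيم"  # illustrative text, not a paper dataset
for rate in (0.05, 0.10, 0.20):  # hypothetical low, mid, and high volumes
    print(rate, len(insertion_attack(sample, rate, rng)),
          len(deletion_attack(sample, rate, rng)),
          len(reorder_attack(sample, rate, rng)))
```

A reorder attack of this form removes text at one place and adds it at another, which matches the later observation that reorder attacks combine deletion and insertion.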
5 Comparison and Result Discussion
The accuracy of tampering detection is carefully analysed and compared between HFDATAI and the baseline approaches ZWAFWMMM and RACAAT.
5.1 Results of Attack Type Impact
Tab. 2 compares the effect of the different attack types on the tampering detection accuracy of the HFDATAI, ZWAFWMMM, and RACAAT approaches across all dataset sizes and all attack volume scenarios.
Tab. 2 and Fig. 9 show how the tampering detection accuracy of HFDATAI, RACAAT, and ZWAFWMMM depends on the attack type. Under insertion attacks, only a small difference was observed between the detection accuracy of HFDATAI and that of the baseline approaches ZWAFWMMM and RACAAT. Under deletion and reorder attacks, however, a large difference was observed, and the findings show that HFDATAI outperforms ZWAFWMMM and RACAAT with better tampering detection accuracy. This indicates that the proposed HFDATAI approach is highly suitable and very sensitive for content authentication and tampering detection of Arabic text under all attack types, where a reorder attack combines deletion and insertion.
5.2 Results of Attack Rates Impact
Tab. 3 compares the effect of the different attack volumes on tampering detection performance across the dataset sizes and attack scenarios. The comparison is performed using HFDATAI with the RACAAT and ZWAFWMMM approaches.
Tab. 3 and Fig. 10 show how tampering detection accuracy is affected by low, medium, and high attack volumes. Fig. 10 shows in general that as the attack volume increases, the tampering detection accuracy also increases. HFDATAI outperforms RACAAT and ZWAFWMMM in efficiency and detection accuracy at low, medium, and high attack volumes alike. This makes HFDATAI highly recommended for content authentication and tampering detection of any Arabic text transmitted over the Internet.
5.3 Results of Dataset Impact
This section tests the impact of dataset size on watermark reliability against all attack types at their multiple volumes. Tab. 4 compares that effect for HFDATAI and the RACAAT and ZWAFWMMM approaches.
The comparative results in Tab. 4 and Fig. 11 reflect the tampering detection accuracy of the proposed HFDATAI approach. The findings show that, for HFDATAI, tampering detection accuracy by dataset size is ordered from best to worst as ASST, AMST, AHMST, and ALST. This means that tampering detection accuracy increases as the Arabic text size decreases and decreases as the text size increases. Moreover, the results show that HFDATAI outperforms both RACAAT and ZWAFWMMM in tampering detection accuracy for all Arabic dataset sizes.
6 Conclusion
Based on the first-order alphanumeric mechanism of the hidden Markov model, a highly sensitive approach for detecting tampering attacks on Arabic text transmitted over the Internet (HFDATAI) has been proposed in this paper by integrating soft computing and digital watermarking techniques. Soft computing and NLP are used in HFDATAI to analyze the text, find the interrelationships among the contents of the given Arabic text, and generate the main watermark. The generated watermark is logically embedded in the original Arabic text without modifying it or affecting its size. The embedded watermark can later be used to identify any manipulation of the Arabic text after it is transmitted over the Internet. HFDATAI was developed in the PHP programming language using the VS Code IDE. Experiments were conducted on various standard Arabic datasets under different attack volumes. The baseline approaches RACAAT and ZWAFWMMM were compared with HFDATAI. The findings reveal that HFDATAI outperforms RACAAT and ZWAFWMMM in terms of tampering detection accuracy and watermark fragility. Furthermore, the findings illustrate that HFDATAI applies to all Arabic letters, numbers, spaces, and special characters. For future research, the enhancement of detection accuracy and watermark fragility for all kinds of attacks should be considered.
Funding Statement: The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work under Grant Number (G.R.P./14/42), Received by Fahd N. Al-Wesabi. www.kku.edu.sa.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|