Language-independent text tokenization can aid the classification of languages with few resources. There is a global research effort to make text classification available for any language. Human text classification is a slow procedure; consequently, machine text classification for different languages has received considerable attention in recent years. There is no research on machine text classification for many languages, such as Czech, Romani, and Urdu. This research proposes a cross-language text tokenization model using a Transformer technique. The proposed Transformer employs an encoder that has ten layers with self-attention encoding and a feedforward sublayer. This model improves the efficiency of text classification by providing a draft classification for a number of documents. We also propose a novel Sub-Word tokenization model that exploits frequent vocabulary usage in the documents. The Sub-Word Byte-Pair Tokenization technique (SBPT) utilizes the sharing of the vocabulary of one sentence with other sentences. The Sub-Word tokenization model enhances the performance of other Sub-Word tokenization models, such as the Byte-Pair Encoding model, by 10% in terms of precision.
Recently, a great amount of data has become available in digital form, presenting a great opportunity for retrieval, analysis, and processing. However, manual analysis or processing of such huge content is costly and time-consuming. Hence, several computerized models have been proposed to process this data automatically and deliver classifications. Text classification models usually choose key points in texts to generate comprehensible classifications of the target documents.
In general, text classification models attempt to analyze a document by picking the main topics that constitute it and identifying the relevant ideas of those topics. Therefore, current models attempt to enhance classification performance in identifying the document's key points by covering all the themes that exist in it.
Text classification is an important topic, especially when machine or deep learning is used. Text classification models assign text documents to different classes based on their contents. This study presents a language-independent classification model utilizing deep learning. Deep learning techniques are employed extensively in many fields such as natural language processing [
The key problem that challenges the text classification task is the generalization process. There are two main families of computerized text classification models, namely abstractive feature extraction and tokenization models. Abstractive feature extraction models perform a deeper text analysis; they integrate semantic analysis and processing. The generated output contains new phrases not found in the original source text; thus, phrases may be reformulated with a dissimilar meaning that strays from the intent of the author. On the other hand, tokenization models apply superficial analysis and processing of the text documents and operate only at the syntactic level, where the output contains words and phrases from the original text only [
Our research presents a model for text tokenization classification using a Transformer method that substitutes each instance (yi, Zi) with |Zi| examples (yi, λj), one for each element λj in Zi. A weighting step, dubbed the weight algorithm, is an additional phase in the copy-Transformer model: it assigns a weight of 1/|Zi| to each new instance. Previous models substituted each Zi by one of its members, and the model extracted the token with the highest frequency.
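The following Python sketch illustrates this instance-expansion step under the notation above; the function name and data layout are illustrative only, not part of the model's actual implementation.

```python
# Minimal sketch of the instance-expansion step: each multi-label instance
# (y_i, Z_i) is replaced by |Z_i| single-label examples (y_i, lambda_j),
# each carrying a weight of 1/|Z_i|.

def expand_instances(dataset):
    """dataset: list of (features, label_set) pairs."""
    expanded = []
    for features, label_set in dataset:
        weight = 1.0 / len(label_set)          # 1/|Z_i| per new example
        for label in label_set:
            expanded.append((features, label, weight))
    return expanded

# Example: one instance with three labels becomes three weighted examples.
examples = expand_instances([("doc-1", {"sports", "news", "local"})])
# -> [("doc-1", "sports", 0.333...), ("doc-1", "news", 0.333...), ...]
```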
Developing such models is a significant task: it will help speakers of rare languages acquire cross-language classification in their own language with less time and cost.
This paper is structured as follows: Section 2 surveys text document classification techniques. Section 3 introduces the problem definition and the proposed methodology. Section 4 demonstrates the experimental results. Section 5 presents the conclusion and future work.
The related work is discussed in two main divisions. First, text classification techniques using deep learning are presented. Then, research that utilizes genetic algorithms and CNNs is presented.
Deep learning methods, which are characterized by their high classification accuracy, are applied in many areas. The authors in [
In [
The authors in [
In [
The research performed by the authors in [
The authors in [
In [
In [
The model in [
Genetic algorithms are among the most reliable optimization algorithms. In [
The authors in [
In this section, we propose a cross-language-independent classification model. The following subsections present the phases of the proposed technique.
Sub-Word tokenization is an important preprocessing phase that aims to represent a word by partitioning it into Sub-Words. In many cases, a single word is a collection of several meaningful Sub-Words. Usually, Byte-Pair Encoding [
One important issue in text classification is how well entities are represented. Text classification corpora show that proper names and place names occur in most sentences, and the data about entities carries the same importance wherever it appears. To account for these properties in text classification, data about entities is constrained in the Sub-Word tokenization phase. For the vocabulary, we propose training the Byte-Pair Encoding algorithm on the concatenated extracted vocabulary. This enhances the consistency of the Sub-Word partitions and yields better performance.
Therefore, this paper presents the Sub-Word Byte-Pair Tokenization technique (SBPT). SBPT shares the vocabulary of one sentence with other sentences. Sharing vocabulary decreases the insertion of characters when entity data is repeated. This technique is suitable for abstract classification, where entity data is very important.
The entities are extracted from the source documents; they can be recognized from the data structure and extracted from the source data automatically.
The specific phases of the Sub-Word Byte-Pair Tokenization technique (SBPT) are as follows (a minimal sketch of the merge loop is given after the list).
a) Prepare the source
b) Associate every two corpora to perform Sub-Word Tokenization.
c) Fix the Sub-Word token vocabulary size. In our proposed model, we set it to 35,000 Sub-Words based on statistical methods.
d) Divide the words into sequences of characters.
e) Merge adjacent pairs of characters with the highest frequencies.
f) Repeat step (e) until the predefined vocabulary size is reached.
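To make steps (d)-(f) concrete, the following minimal Python sketch runs the byte-pair merge loop over a word-frequency dictionary. It is an illustration under the assumption that words are already split into character tuples; the entity-aware vocabulary sharing of SBPT is omitted.

```python
from collections import Counter

def byte_pair_merges(corpus_words, vocab_size):
    """corpus_words: mapping from word (tuple of symbols) to frequency."""
    # Step (d): the initial vocabulary is the set of single characters.
    vocab = {sym for word in corpus_words for sym in word}
    while len(vocab) < vocab_size:                       # step (f): repeat
        # Step (e): count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in corpus_words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break                                        # nothing left to merge
        best = max(pairs, key=pairs.get)                 # most frequent pair
        merged = "".join(best)
        vocab.add(merged)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = {}
        for word, freq in corpus_words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus_words = new_corpus
    return vocab

# Toy usage: words pre-split into characters (step d), with frequencies.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
vocab = byte_pair_merges(corpus, vocab_size=20)
```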
After the training data is built using the Sub-Word Byte-Pair Tokenization technique, training starts, employing a sequence-to-sequence method. The proposed Transformer employs an encoder that has ten layers, each with a self-attention sublayer and a feedforward sublayer. The decoder is composed of eight layers, each with a masked self-attention sublayer, an intermediate encoder-decoder attention sublayer, and a feedforward sublayer.
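As a rough configuration sketch, the layer counts above can be expressed with PyTorch's nn.Transformer. The embedding size, head count, and feedforward width below are assumptions borrowed from the experimental settings reported in Section 4; they are not prescribed by the model description itself.

```python
import torch.nn as nn

# Illustrative configuration only: layer counts follow the text
# (10 encoder layers, 8 decoder layers); the remaining sizes are assumed.
model = nn.Transformer(
    d_model=256,            # embedding size (assumed from Section 4)
    nhead=8,                # attention heads per layer (assumed)
    num_encoder_layers=10,  # encoder: self-attention + feedforward sublayers
    num_decoder_layers=8,   # decoder: masked self-attention, cross-attention, feedforward
    dim_feedforward=1024,   # feedforward sublayer width (assumed from Section 4)
    batch_first=True,
)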
When a sentence in an unknown language is input to the system, it is vectorized using word embeddings. The encoder learning scheme starts by learning expressions from a sentence in one language; the decoder then finds their matches in the other language.
The basic attention process allows the decoder to predict the output word from the states of the input phrases in the encoder. Nevertheless, not every input phrase is attended to with the same weight; instead, the portion of the input phrase that is associated with the word to be predicted receives more focus. The self-attention process allows the system to learn the association between the current word and the preceding words using key vectors. This procedure is a scaled dot-product attention and is described as follows (a sketch in code is given after the list):
a) Create three vectors (query vector, key vector, and value vector).
b) Multiply the query and key vectors (dot product) to calculate the score.
c) Scale the score by dividing it by the square root of the key-vector size (for a key size of 64, √64 = 8) to create a stable gradient.
d) Apply the Softmax function to the scaled scores; the Softmax value expresses how much each word position contributes to the current position.
e) Multiply the Softmax weights from step (d) by the value vectors. Both extreme Softmax values (the highest and the lowest) are ignored.
f) Combine the resulting weighted value vectors to form the self-attention output vector.
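A compact NumPy sketch of steps (a)-(f) is shown below. It follows the standard scaled dot-product formulation; the extreme-value filtering mentioned in step (e) is omitted, and all names are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Steps (a)-(f) above; q, k, v have shape (seq_len, d_k)."""
    d_k = k.shape[-1]
    scores = q @ k.T                       # step (b): query-key dot products
    scores = scores / np.sqrt(d_k)         # step (c): scale by sqrt(d_k), e.g. sqrt(64)=8
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # step (d): softmax
    return weights @ v                     # steps (e)-(f): weight and combine values

# Toy usage with a 4-token sequence and key size 64.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 64))
k = rng.normal(size=(4, 64))
v = rng.normal(size=(4, 64))
out = scaled_dot_product_attention(q, k, v)   # shape (4, 64)
```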
The multi-head attention mechanism is a process that computes the attention outputs for eight weight matrices. The attention value of each head is computed as

$A(q, k, v) = \mathrm{softmax}\!\left(\frac{qk^{T}}{\sqrt{d_k}}\right)v$

where A indicates the attention output; q, k, and v indicate the query, key, and value vectors, respectively; and d_k is the key-vector dimension. The head outputs are then concatenated and projected:

$\mathrm{MultiHead}(q, k, v) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_8)\,W^{O}$

where $\mathrm{head}_i = A(qW_i^{q}, kW_i^{k}, vW_i^{v})$ and $W^{O}$ is the learned output projection matrix.
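As a rough illustration of the two formulas above, the following hypothetical wrapper splits the model dimension into eight sub-spaces and attends in each, reusing scaled_dot_product_attention from the previous sketch; the learned projections W_i^q, W_i^k, W_i^v, and W^O are omitted for brevity.

```python
import numpy as np

def multi_head_attention(q, k, v, n_heads=8):
    """Split d_model into n_heads sub-spaces, attend in each, concatenate.
    Learned projection matrices are omitted in this sketch."""
    d_model = q.shape[-1]
    d_head = d_model // n_heads                 # e.g., 512 / 8 = 64 per head
    heads = []
    for i in range(n_heads):
        s = slice(i * d_head, (i + 1) * d_head)
        heads.append(scaled_dot_product_attention(q[:, s], k[:, s], v[:, s]))
    return np.concatenate(heads, axis=-1)       # Concat(head_1, ..., head_8)
```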
The decoder completes the back-transformation using the information generated by the encoder. The attention layer in the decoder attends to the previous positions of the output phrase sequence, and the decoder's output is produced sequentially. The query matrix is extracted from the decoder state, while the key and value vectors are part of the encoder output. That is, the attention between the source sentence and the target sentence is processed at the decoder.
We compute the target-language attention matrix as follows:

$A(q_{dec}, k_{enc}, v_{enc}) = \mathrm{softmax}\!\left(\frac{q_{dec}\,k_{enc}^{T}}{\sqrt{d_k}}\right)v_{enc}$
Correspondences are built using the decoder, and Softmax produces the probabilities of the succeeding tokens, from which the optimal output is selected. In conclusion, the joint training process of the encoder and the decoder maximizes the conditional correspondence likelihood:

$\hat{\alpha} = \arg\max_{\alpha}\sum_{n=1}^{N}\log P(T_n \mid S_n; \alpha)$

where α is the model-parameter set and each (Sn, Tn) is a (source sentence, target sentence) pair in the training data.
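In practice, maximizing this conditional log-likelihood is typically implemented by minimizing the cross-entropy of the target tokens under teacher forcing. The following PyTorch sketch is illustrative only; model, src, and tgt are placeholders, not the paper's actual implementation.

```python
import torch.nn.functional as F

# Hypothetical training step: maximizing sum_n log P(T_n | S_n; alpha)
# is equivalent to minimizing cross-entropy over the target tokens.

def training_step(model, optimizer, src, tgt, pad_id=0):
    optimizer.zero_grad()
    # Teacher forcing: feed the target shifted right, predict the next token.
    logits = model(src, tgt[:, :-1])                  # (batch, tgt_len-1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt[:, 1:].reshape(-1),
        ignore_index=pad_id,                          # skip padding positions
    )
    loss.backward()                                   # gradient of -log-likelihood
    optimizer.step()
    return loss.item()
```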
In
In this section, we evaluate the accuracy of our cross-language classification model. The datasets are described in Subsection 4.1 and the evaluation metrics in Subsection 4.2. The results are reported and discussed in Subsection 4.3. Finally, we compare our model with related models in Subsection 4.4.
It should be noted that acquiring training data for rare-language classification is very difficult. Our experiments employed a crawling procedure to collect data. We also filtered the data, omitting phrases with fewer than 85 characters.
For a multi-token dataset D with token set L, the token cardinality (Card) and density (Dens) are used as metrics; they are defined as follows:

$\mathrm{Card}(D) = \frac{1}{m}\sum_{i=1}^{m}|Y_i|, \qquad \mathrm{Dens}(D) = \frac{\mathrm{Card}(D)}{|L|}$

where m is the total number of cases in the dataset and Y_i is the token set of case i. The density metric accounts for the total number of tokens, |L|, in its calculation.
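A small Python helper, shown below, computes both metrics as defined above; label_sets and num_tokens are illustrative names.

```python
def cardinality_and_density(label_sets, num_tokens):
    """label_sets: per-case token sets Y_i; num_tokens: |L|, total distinct tokens."""
    m = len(label_sets)                           # total number of cases
    card = sum(len(y) for y in label_sets) / m    # Card(D): average tokens per case
    dens = card / num_tokens                      # Dens(D): cardinality over |L|
    return card, dens

# Toy example: three cases over a 14-token domain.
card, dens = cardinality_and_density([{1, 2, 3}, {2}, {1, 4}], num_tokens=14)
# card = 2.0, dens ≈ 0.143
```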
The dataset Mawdoo3 [
Domain | Cases | Features | Tokens | Cardinality |
---|---|---|---|---|
Civil Engineering | 2767 | 105 | 14 | 4.2 |
Science | 8754 | 310 | 24 | 3.9 |
Cosmetics | 6996 | 205 | 20 | 1.9 |
Children studies | 1950 | 221 | 18 | 2.9 |
Literature | 6710 | 276 | 19 | 4.4 |
Films | 5300 | 156 | 8 | 2.7 |
We applied the Sub-Word Byte-Pair Tokenization technique (SBPT) to the Arabic language utilizing long- and short-term attention. SBPT is used within a sequence-to-sequence Transformer model. We also tested the performance using other Sub-Word tokenization models.
For the SBPT attention algorithm, a 2-layer SBPT network was employed with a hidden size of 600. The dropout proportion was 0.23, with a batch of 256 sentences. The Transformer utilized eight attention blocks in both the encoder and the decoder; the feedforward network had a dimension of 1,024, and the embeddings a dimension of 256. The model used twelve heads and the Noam decay optimization technique with a batch size of 2,046. The model used a vocabulary of 32,000 Sub-Words with a cross-entropy loss function. The model was executed on a GTX 1040 GPU system.
To assess the performance of the proposed SBPT methodology, a comparison was carried out between SBPT and other Sub-Word tokenization models.
Two metrics are usually used: the token cardinality, calculated as the average number of tokens per case in the test set, and the token density, computed as the cardinality divided by the total number of tokens, averaged over the test set.
We must account for partially correct predictions, as we cannot simply consider them incorrect. The metric utilized in single-token schemes was also applied as a multi-token metric: the exact-match metric (multi-token subset accuracy). The precision of the model was computed as the ratio between the exact matches and the true set of tokens.
Therefore, the concept of partial correctness is defined using the difference between the predicted tokens (P) and the actual true tokens (T), namely the token-based accuracy (TBA). TBA measures the closeness of P to T as the ratio between the number of correct tokens, |P ∩ T|, and the total number of tokens in P ∪ T.
TBA is a joint measure of both the precision and the recall metrics. It accounts for both false positives (members of P that should not have been included) and false negatives (members missing from P). TBA is defined as the Jaccard metric [
The Hamming Loss metric is similar to the TBA in that it considers both the false-positive and the false-negative predictions; it computes the symmetric difference (logical XOR) between P and T.
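Both measures can be sketched in a few lines of Python over token sets; the set names P and T follow the definitions above, and num_tokens denotes |L|.

```python
def token_based_accuracy(P, T):
    """TBA as the Jaccard similarity between predicted and true token sets."""
    union = P | T
    return len(P & T) / len(union) if union else 1.0

def hamming_loss(P, T, num_tokens):
    """Symmetric difference (logical XOR) between P and T, normalized by |L|."""
    return len(P ^ T) / num_tokens

# Toy example: two tokens correct, one false positive, one false negative.
P, T = {"a", "b", "x"}, {"a", "b", "c"}
tba = token_based_accuracy(P, T)           # 2 / 4 = 0.5
hloss = hamming_loss(P, T, num_tokens=10)  # (1 FP + 1 FN) / 10 = 0.2
```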
Therefore, the precision and recall are calculated in this scenario as follows:
Sub-Accuracy is the number of correct cases divided by the total number of cases.
Precision is the average, over cases, of the number of correctly predicted tokens divided by the number of predicted tokens.
Recall is the average, over cases, of the number of correctly predicted tokens divided by the number of true tokens.
F-measure, the harmonic mean of the recall and the precision, is calculated as follows:

$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
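A minimal sketch of the four measures over parallel lists of predicted and true token sets, assuming non-empty sets; all names are illustrative.

```python
def multi_token_scores(predicted, true):
    """predicted, true: parallel lists of per-case token sets (assumed non-empty)."""
    n = len(true)
    sub_acc = sum(p == t for p, t in zip(predicted, true)) / n        # exact matches
    precision = sum(len(p & t) / len(p) for p, t in zip(predicted, true)) / n
    recall = sum(len(p & t) / len(t) for p, t in zip(predicted, true)) / n
    f_measure = 2 * precision * recall / (precision + recall)         # harmonic mean
    return sub_acc, precision, recall, f_measure
```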
The experiments are executed using 12-fold cross-validation: each dataset is split into 12 subsets, and the procedure is iterated 12 times. Each time, a single subset is used for testing while the other 11 subsets are used for training, so each fold serves once as the test set. The final results are averaged over all executions.
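As a sketch of this protocol using scikit-learn; the model and scoring calls are placeholders, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(120).reshape(-1, 1)   # placeholder feature matrix (120 cases)
scores = []
for train_idx, test_idx in KFold(n_splits=12, shuffle=True, random_state=0).split(X):
    # model.fit(X[train_idx], y[train_idx])                    # train on 11 folds
    # scores.append(model.score(X[test_idx], y[test_idx]))     # test on 1 fold
    scores.append(len(test_idx))                               # placeholder score
mean_score = float(np.mean(scores))                            # average over 12 runs
```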
We compared our proposed Sub-Word Byte-Pair tokenization technique (SBPT) to the state-of-the-art models: ADAN [
Model | Hidden Layers | Att. Heads | Hidden size | Batch | Epochs | Layers | Activation function |
---|---|---|---|---|---|---|---|
Our proposed model | 24 | 24 | 1024 | Size: 20/32 | 10/20/50 | 10 for the encoder/8 for the decoder | Softmax |
ADAN [ | 24 | 20 | 512 | Size: 20/32 | 12 | 12 | ReLU |
ArabADAN [ | 20 | 12 | 512 | Size: 256/64 | 29 | 12 | Softmax |
The comparison between our proposed model and the other models is depicted in
Model | Civil Engineering | Science | Cosmetics | Children studies | Literature | Films |
---|---|---|---|---|---|---|
Our proposed model | 0.207 ± 0.021 | 0.201 ± 0.023 | 0.199 ± 0.021 | 0.204 ± 0.021 | 0.220 ± 0.019 | 0.221 ± 0.027 |
ADAN [ | 0.197 ± 0.018 | 0.205 ± 0.019 | 0.209 ± 0.019 | 0.211 ± 0.019 | 0.219 ± 0.023 | 0.224 ± 0.021 |
ArabADAN [ | 0.215 ± 0.021 | 0.215 ± 0.019 | 0.204 ± 0.023 | 0.209 ± 0.021 | 0.218 ± 0.015 | 0.223 ± 0.021 |
Model | Civil Engineering | Science | Cosmetics | Children studies | Literature | Films |
---|---|---|---|---|---|---|
Our proposed model | 0.909 ± 0.098 | 0.908 ± 0.093 | 0.966 ± 0.098 | 0.920 ± 0.098 | 0.986 ± 0.089 | 0.998 ± 0.099 |
ADAN [ | 0.868 ± 0.083 | 0.808 ± 0.086 | 0.806 ± 0.082 | 0.888 ± 0.086 | 0.888 ± 0.083 | 0.884 ± 0.078 |
ArabADAN [ | 0.858 ± 0.088 | 0.868 ± 0.086 | 0.804 ± 0.083 | 0.806 ± 0.088 | 0.888 ± 0.078 | 0.883 ± 0.068 |
Model | Civil Engineering | Science | Cosmetics | Children studies | Literature | Films |
---|---|---|---|---|---|---|
Our proposed model | 0.707 ± 0.076 | 0.706 ± 0.076 | 0.699 ± 0.076 | 0.699 ± 0.076 | 0.766 ± 0.067 | 0.776 ± 0.077 |
ADAN [ | 0.697 ± 0.068 | 0.707 ± 0.069 | 0.709 ± 0.067 | 0.766 ± 0.069 | 0.767 ± 0.076 | 0.774 ± 0.076 |
ArabADAN [ | 0.767 ± 0.076 | 0.767 ± 0.069 | 0.704 ± 0.076 | 0.709 ± 0.076 | 0.768 ± 0.067 | 0.776 ± 0.076 |
Sub-Accuracy is a strict measure because it reflects the extent of exact matching between the predicted labels and the true labels; it applies the same penalty to predictions that are almost correct as to those that are entirely wrong. In terms of Sub-Accuracy, our approach does not perform well, as depicted in
Model | Civil Engineering | Science | Cosmetics | Children studies | Literature | Films |
---|---|---|---|---|---|---|
Our proposed model | 0.948 ± 0.087 | 0.947 ± 0.083 | 0.966 ± 0.087 | 0.986 ± 0.077 | 0.876 ± 0.078 | 0.987 ± 0.088 |
ADAN [ | 0.914 ± 0.078 | 0.919 ± 0.076 | 0.926 ± 0.078 | 0.877 ± 0.076 | 0.878 ± 0.083 | 0.884 ± 0.087 |
ArabADAN [ | 0.927 ± 0.087 | 0.907 ± 0.076 | 0.904 ± 0.083 | 0.920 ± 0.087 | 0.878 ± 0.077 | 0.893 ± 0.087 |
In this set of experiments, we employed a k-fold validation algorithm to assess the performance. We distributed the text data into k equal-sized folds. The CNN was trained in k consecutive iterations, where one fold was used for testing and k − 1 folds for training. To obtain representative performance, the text data underwent stratification, i.e., the text instances were arranged so that each fold is a true representative of the data.
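A brief scikit-learn sketch of a stratified split that preserves the label distribution in each fold; the corpus and labels here are toy placeholders.

```python
from sklearn.model_selection import StratifiedKFold

texts = [f"document {i}" for i in range(12)]   # toy corpus
labels = [i % 3 for i in range(12)]            # three classes, 4 cases each
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(texts, labels):
    # Every test fold keeps the 1/3-per-class proportion of the full data,
    # so each fold is a true representative of the label distribution.
    test_labels = sorted(labels[i] for i in test_idx)   # always [0, 1, 2]
```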
The execution time needed for the proposed model's training phase, compared with the other models, is depicted in
Model | K | Civil Engineering | Science | Cosmetics | Children studies | Literature | Films |
---|---|---|---|---|---|---|---|
Our proposed model | K = 8 | 0.43 | 0.51 | 0.32 | 0.45 | 0.29 | 0.4 |
ADAN [ | K = 8 | 0.86 | 0.80 | 0.95 | 1.54 | 0.98 | 1.27 |
ArabADAN [ | K = 8 | 2.85 | 2.96 | 2.12 | 3.08 | 2.88 | 2.83 |
Our proposed model | K = 10 | 0.67 | 0.61 | 0.72 | 0.56 | 0.49 | 0.5 |
ADAN [ | K = 10 | 0.96 | 1.34 | 1.45 | 1.84 | 1.58 | 1.97 |
ArabADAN [ | K = 10 | 5.75 | 5.96 | 5.15 | 5.07 | 5.77 | 5.75 |
Our proposed model | K = 12 | 1.67 | 1.61 | 1.72 | 1.56 | 1.70 | 1.5 |
ADAN [ | K = 12 | 1.06 | 1.37 | 1.75 | 1.87 | 1.58 | 1.07 |
ArabADAN [ | K = 12 | 5.75 | 5.06 | 5.15 | 5.17 | 5.77 | 5.75 |
Our proposed model | K = 14 | 2.67 | 2.62 | 2.72 | 2.86 | 2.70 | 2.8 |
ADAN [ | K = 14 | 4.06 | 4.39 | 4.98 | 4.89 | 4.88 | 4.09 |
ArabADAN [ | K = 14 | 8.78 | 8.06 | 8.28 | 8.27 | 8.77 | 8.78 |
In this research, we proposed a cross-language text tokenization model using a novel Transformer. The proposed model improved the efficiency of text classification by providing a draft classification for a number of documents. We also proposed a novel tokenization model, the Sub-Word Byte-Pair Tokenization technique (SBPT), which shares frequent vocabulary across the documents: the vocabulary of one sentence is shared with other sentences, which decreases the insertion of characters when entity data is repeated. The Sub-Word tokenization model enhanced performance compared to the pair-encoding model by 10% in terms of precision. We compared our proposed Sub-Word Byte-Pair Tokenization technique (SBPT) to the state-of-the-art models: ADAN [