Spam mail classification is considered a complex and error-prone task in the distributed computing environment. Various spam mail classification approaches are available, such as the naive Bayesian classifier, logistic regression, support vector machine, decision tree, recurrent neural network, and long short-term memory algorithms. However, they do not consider the document when analyzing spam mail content. These approaches use the bag-of-words method, which analyzes a large amount of text data and classifies features with the help of term frequency-inverse document frequency. Because there are many words in a document, these approaches consume a massive amount of resources and become infeasible when classifying multiple associated mail documents together. Thus, spam mail is not fully classified, and these approaches leave loopholes. We therefore propose a term frequency topic inverse document frequency model that considers the meaning of text data in a larger semantic unit by applying weights based on the document’s topic. Moreover, the proposed approach reduces the sparsity problem through a term frequency topic-inverse document frequency in singular value decomposition model. Our proposed approach also reduces the dimensionality, which ultimately increases the strength of document classification. Experimental evaluations show that the proposed approach classifies spam mail documents with higher accuracy using individual document-independent processing computation. Comparative evaluations show that the proposed approach performs better than the logistic regression model in the distributed computing environment, with accuracies of 97.05%, 99.17%, and 96.59%.

Spam text messages and emails cause significant damage to message communication systems. “Among commercial emails intended for commercial purposes, spam emails that the recipient does not want are causing various harms, such as a decline in personal, domestic, and even national credibility [

There are several document classification approaches to efficiently classify spam mail, such as the naive Bayesian classifier, logistic regression, support vector machine (SVM), decision tree, recurrent neural network (RNN), and long short-term memory (LSTM). Logistic regression, SVMs, and decision trees are the ones mainly used. For example, [

However, emphasizing only the importance of individual words limits how well the meaning of complex text can be analyzed from a document and its words, and it causes the sparsity problem.

To resolve the issue of spam mail classification, we propose a term frequency-topic inverse document frequency (TFT-IDF) model and its extension with a singular value decomposition (SVD) model. First, the TFT-IDF model can effectively represent the frequency and weight of terms and the weight between each document and terms through topics by considering the weights of the group for classification. Second, in the TFT-IDF with SVD model, the dimensionality is reduced compared to that of the existing term frequency-IDF (TF-IDF) model. Accordingly, the error function uses L2 regularization and mean squared error (MSE) to solve the sparsity problem. Third, the TF-IDF with SVD model solves the abovementioned problem of conventional methods.

The main contributions of this study are as follows:

A novel approach that effectively identifies the meaning of important words.

A robust solution to address the sparsity problem of NLP classification.

An enhanced approach to improve classification accuracy.

A novel idea of bidirectional integration for classification joints.

A novel classification method that relies on cooperation among documents.

Several algorithms are used for classification. The following summarizes related researchers’ contributions.

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Reference [

Given the value of a specific input variable, binary logistic regression returns a value between 0 and 1. Classification problems include binary classification and multinomial classification.
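To make the mapping to a value between 0 and 1 concrete, the following minimal sketch passes a linear score through the logistic (sigmoid) function and thresholds it. The function names, the weights, the bias, and the 0.5 threshold are hypothetical choices for illustration and are not taken from the experiments in this paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function; maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spam_probability(features, weights, bias):
    """Linear score passed through the sigmoid (weights and bias are hypothetical)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, weights, bias, threshold=0.5):
    """Binary decision between the two labels used in this study."""
    if spam_probability(features, weights, bias) >= threshold:
        return "spam"
    return "nspm"


# Made-up two-feature example; these numbers are assumptions, not fitted values.
weights, bias = [2.0, -1.0], -0.5
print(spam_probability([1.5, 0.2], weights, bias))
print(classify([1.5, 0.2], weights, bias))
```

In practice the weights would be fitted to labeled data; the sketch only shows how a fitted model turns feature values into a binary decision.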

Reference [

When the number of samples was 100, it was poor, and when the number of samples was 500, the power of the sum-of-squares test was slightly worse than that of Stukel’s score test. Thus, the main classification problem in this study arises when the target variable has a discrete value. In this study, binomial classification for spam and nspm (Not Spam) was used, and the formula used follows [

In SVD, a matrix M is expressed as the product of the matrices U, D, and V^{T}. The U and V matrices are orthonormal matrices, D is a diagonal matrix, and each of its diagonal values is called a singular value. Reference [

Reference [

The TF-IDF model is calculated by weighting the text along with word frequency. The formulas are given [

Term frequency methods include Boolean frequency, log-scaled frequency, and augmented frequency.
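The term frequency variants and the resulting TF-IDF weight can be sketched as follows. This is a minimal illustration that assumes the common textbook definitions (log-scaled frequency log(1 + f), augmented frequency 0.5 + 0.5 f / max f, and IDF log(N / df)); the paper's exact formulas are not reproduced here, and the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_variants(tokens):
    """Return several term-frequency weightings for one non-empty tokenized document."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {
        term: {
            "raw": count,
            "boolean": 1.0,                              # term is present
            "log": math.log(1.0 + count),                # log-scaled frequency
            "augmented": 0.5 + 0.5 * count / max_count,  # scaled by the most frequent term
        }
        for term, count in counts.items()
    }


def idf(term, documents):
    """Inverse document frequency log(N / df); assumes the term occurs in some document."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df)


def tf_idf(term, doc, documents):
    """Relative frequency of the term in `doc` multiplied by its IDF."""
    return (doc.count(term) / len(doc)) * idf(term, documents)


# Toy corpus (made-up tokens for illustration only).
docs = [
    ["free", "prize", "free"],
    ["meeting", "at", "noon"],
    ["free", "meeting"],
]
print(tf_variants(docs[0])["free"])
print(tf_idf("prize", docs[0], docs))
```

A term that appears in few documents, such as "prize" here, receives a larger IDF than one that appears in many, which is the weighting behavior that TF-IDF relies on.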

Reference [

In this study, comparisons were performed using the TF-IDF model and the TFT-IDF with SVD model.

Latent Dirichlet allocation (LDA) is a generative probabilistic model that describes the topics of each document and analyzes the document under the assumption that it was generated according to those topics.

Reference [

In this study, LDA was used for topic extraction and as a parameter to calculate the proposed TFTIDF model.
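As a rough illustration of how an LDA model assigns topics to terms and documents, the following sketch uses collapsed Gibbs sampling in plain Python. The sampler choice, the function name `lda_gibbs`, the hyperparameter values, and the toy corpus are all assumptions made for illustration; they are not the inference procedure or settings used in this study.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # word w assigned to topic k
    topic_total = [0] * n_topics                              # all words assigned to topic k
    assignments = []

    # Random initial topic for every word position.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi


# Toy corpus: word ids 0-2 and 3-5 tend to co-occur (made-up data).
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3], [0, 2, 1, 2], [4, 5, 3, 5]]
theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(theta)
```

Each row of `theta` is a document's distribution over topics and each row of `phi` is a topic's distribution over terms; weights of this kind are what a topic-aware term weighting can draw on.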

In addition to determining how many terms exist in the document, we aimed to consider which topics the terms may be related to in a larger semantic unit. We therefore chose an LDA model, which gives a probabilistic account of which topic the term at each position corresponds to and which topics the document has. Because most term frequency values in a document are zero, the data are very sparse, and a sparsity problem occurs. Therefore, SVD, a dimensionality reduction technique that ranks components by importance, was used to discard unnecessary data. The experimental results that follow show that performance also improved significantly.

As mentioned in Section 2, various studies have been conducted to solve this problem.

In this study, we first built a wordbook using UCI’s messenger spam data and calculated the frequency of frequently used terms in the text data. Each message was then classified as spam or nspm (Not Spam) using supervised learning, and features were created.

The TFTIDF model considers the weights of terms against terms, documents against topics, and documents against terms. From this, it created a descriptive feature.

The TFT-IDF in the SVD model solves the sparsity problem and avoids unnecessary computational cost for classification by expressing the generated dimension as a matrix relationship and removing features of low importance.

We explain the TFTIDF model and then the TFT-IDF in SVD model.

After that, we will also explain the TF-IDF in the SVD model.

The TF-IDF is equivalent to expression

In _{p}. The data extracted from the bag-of-words (BOW) are referred to as X_{b}. Let T_{j} be the topic and X_{v} be the term of a dictionary. X_{v1} represents a dictionary of words from LDA, where Φ is the coefficient. Then, the formula of the TFTIDF model proposed in this paper is as follows:

In the i^{th} TFTIDF value, T stands for the topic, D stands for the document, and t stands for the term. The TFTIDF value is obtained by multiplying the frequency of the term in the i^{th} document by the logarithm of the i^{th} term of the topic in the term dictionary of the n^{th} topic, dividing by the frequency of all terms in the document, and then multiplying by the calculated document weight value.

The topic is determined by the word distribution on a particular topic and the word distribution contained in the document using the Bayesian network. For these, inference-based sampling was performed using the LDA algorithm. Let us call alpha and beta latent variables. Depending on alpha and beta, the conditional probability can be represented by the following joint probability distribution [

Reference [

First, the formula for obtaining the i-th TFTIDF value is repeated k times.

The formula for TFTIDF (t, d, T, D, n, r) is as follows:

In the TFTIDF (t, d, T, D, n, r) model, the parameter n represents the number of term weight groups, where n = 0, 1, …, 10, and r is the number of times the total matrix is decomposed, where r is a non-negative integer. Therefore, the second model proposed in this paper is as follows:

TFT-IDF in SVD consists of three logic functions. When r is between 0 and 1, the model follows the SVD (n, Z_{i}) function, and when r is greater than 1, it follows SVD (n, Z_{i}, R). Other values are used as classification values.

Let the original matrix be X. The decomposition yields the orthonormal matrices U and V and the diagonal matrix Σ. The matrix X^{T}X becomes (UΣV^{T})^{T}UΣV^{T}, that is, VΣ^{2}V^{T}. Hence, V_{a} is an eigenvector of X^{T}X, and Σ_{a} is the square root of the corresponding eigenvalue. Alternating least squares (ALS) was used to determine the error of the decomposed matrix. The formula is as follows:

In addition to the factors p_{d} and q_{t}, this is affected by the parameters n and r and the number of ranks.

This enables faster operation and shows robustness to sparse data. We fixed the lambda value to 0.001 using L2 regularization to prevent overfitting. The Frobenius norm formula was calculated as follows:

Depending on the number of ranks, the matrices p and q were alternately updated until the error became small. The formula was calculated as follows:

For this purpose, the L2-regularized loss function was used. Partial differentiation was performed to determine the optimal values that minimize the original error.
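The alternating updates can be sketched for the simplest case of a single rank. The sketch assumes the usual regularized squared-error objective, the sum over d and t of (x_{dt} − p_{d}q_{t})^{2} plus λ times the squared norms of p and q; setting its partial derivative with respect to p_{d} to zero gives p_{d} = Σ_{t} x_{dt}q_{t} / (Σ_{t} q_{t}^{2} + λ), and symmetrically for q_{t}. The restriction to rank 1, the function names, and the toy matrix are assumptions for illustration; only λ = 0.001 comes from the text.

```python
def als_rank1(X, lam=0.001, n_iter=20):
    """Rank-1 ALS for X ~ p q^T with L2 penalty lam * (|p|^2 + |q|^2)."""
    n_docs, n_terms = len(X), len(X[0])
    p = [0.0] * n_docs
    q = [1.0] * n_terms  # arbitrary non-zero starting point (assumption)
    history = []

    def loss():
        err = sum(
            (X[d][t] - p[d] * q[t]) ** 2
            for d in range(n_docs)
            for t in range(n_terms)
        )
        return err + lam * (sum(v * v for v in p) + sum(v * v for v in q))

    for _ in range(n_iter):
        # Fix q and solve dL/dp_d = 0 for every document factor.
        q_sq = sum(v * v for v in q)
        p = [
            sum(X[d][t] * q[t] for t in range(n_terms)) / (q_sq + lam)
            for d in range(n_docs)
        ]
        # Fix p and solve dL/dq_t = 0 for every term factor.
        p_sq = sum(v * v for v in p)
        q = [
            sum(X[d][t] * p[d] for d in range(n_docs)) / (p_sq + lam)
            for t in range(n_terms)
        ]
        history.append(loss())
    return p, q, history


def mse(X, p, q):
    """Mean squared reconstruction error of the rank-1 approximation."""
    n = len(X) * len(X[0])
    return sum(
        (X[d][t] - p[d] * q[t]) ** 2
        for d in range(len(X))
        for t in range(len(X[0]))
    ) / n


# Toy document-term matrix of rank 1 (made-up values).
X = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [3.0, 0.0, 6.0]]
p, q, history = als_rank1(X)
print(mse(X, p, q))
```

Because each half-step minimizes the objective exactly with the other factor held fixed, the recorded loss never increases from one iteration to the next. A higher-rank version would replace the scalar divisions with small regularized linear solves.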

The formula was expressed in matrix form using the value calculated in the previous step,

Based on

_{i}x^{T}. _{i}x^{T}.

The data were divided into training and testing sets, each value was computed, and learning was conducted.

As a result, the error between the X matrix and the data reconstructed from the eigenvectors with large eigenvalues was obtained, and the reconstruction yields a matrix close to X.
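The link between the eigenvalues of X^{T}X and the singular values of X, and the error left after keeping only the largest component, can be checked numerically. The sketch below is an illustration only: it uses power iteration on X^{T}X to find the leading component, which is a stand-in technique chosen to keep the example dependency-free, not the decomposition routine used in this study. The function names and the toy matrix are hypothetical.

```python
import math


def top_singular_triplet(X, n_iter=200):
    """Leading (u, sigma, v) of a non-zero X via power iteration on X^T X."""
    n_cols = len(X[0])
    gram = [
        [sum(row[i] * row[j] for row in X) for j in range(n_cols)]
        for i in range(n_cols)
    ]
    v = [1.0] * n_cols  # assumes the start is not orthogonal to the top eigenvector
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # The Rayleigh quotient of the unit vector v gives the largest eigenvalue of X^T X.
    eigenvalue = sum(
        v[i] * sum(gram[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)
    )
    sigma = math.sqrt(eigenvalue)  # singular value = square root of the eigenvalue
    u = [sum(row[j] * v[j] for j in range(n_cols)) / sigma for row in X]
    return u, sigma, v


def frobenius_error(X, u, sigma, v):
    """Frobenius norm of X minus its rank-1 approximation sigma * u v^T."""
    return math.sqrt(
        sum(
            (X[d][t] - sigma * u[d] * v[t]) ** 2
            for d in range(len(X))
            for t in range(len(X[0]))
        )
    )


# Toy matrix with singular values 3 and 1 (made-up values).
X = [[3.0, 0.0], [0.0, 1.0]]
u, sigma, v = top_singular_triplet(X)
print(sigma, frobenius_error(X, u, sigma, v))
```

For this toy matrix the leading singular value is 3 and the remaining error equals the discarded singular value 1, which shows why dropping components with small singular values loses little information.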

The following shows the algorithm of the TTIS (TFT-IDF in SVD) and TIS (TF-IDF in SVD) models. The parameter t refers to the term, d to the document, T to the topic vectors, D to the document vectors, n to the number of word weight groups from LDA, r to the number of dimension decompositions (an integer), Z to the matrix computed from the document-word matrix and the topic weights, and R to the decomposed matrix.

As in

The third proposed model follows the same method as the above proposed model but is decomposed using a different feature model up to r = 2. The formula is as follows:

In the third proposed model, the TF-IDF model is used, and the remaining formulae are the same as

Let the original matrix be X′. When the decomposition matrix is obtained, so are U′, V′, and Diag(D). V_{a}′ is an eigenvector, and d_{a}′ is a high eigenvalue. ALS was used to determine the error of the decomposed matrix, as in

To prevent overfitting, the lambda value was likewise set to 0.001 using L2 regularization, and the matrices p and p′ and the matrices q and q′ were alternately updated. After the data were divided into training and testing sets, each model was trained to obtain an optimal value, and a classification result was produced.

We used UCI’s messenger spam data from [

Dataset | Environment | Algorithms |
---|---|---|
Messenger spam from UCI | TensorFlow, Keras, scikit-learn, and Python 3.4 on Windows 10 Home 64-bit | TFTIDF |

The purpose of the experiment was to measure the classification score, accuracy, receiver operating characteristic (ROC), similarity, and learning error for the TF-IDF, TFTIDF, TF-IDF in SVD, and TFT-IDF in SVD models, and to compare influential words. In addition, the performance of the models was evaluated as the parameters n and r changed. Cosine similarity was used to evaluate the similarity distance, and the MSE function was used as the learning error. The results are presented in

Logistic regression | TF-IDF | TFTIDF |
---|---|---|
Score (Train) | 96.59% | 99.25% |
Score (Test) | 94.80% | 97.06% |
Confusion matrix | [[2419 2] [143 223]] | [[2418 3] [79 287]] |
Accuracy | 94.79% | 97.05% |
ROC curve | | |

Data | Logistic regression | TF-IDF in SVD | TFT-IDF in SVD |
---|---|---|---|
WTD | Score (Train) | 96.59% | 99.17% |
WTD | Score (Test) | 94.80% | 96.52% |
WTD | Confusion matrix | [[2404 2] [93 288]] | [[2405 1] [22 359]] |
WTD | Accuracy | 96.59% | 99.17% |
WTD | Frobenius | 0.000526094294398151 | 0.00011786358789691685 |
WTD | ROC curve | | |
WVD | Score (Train) | 96.52% | 99.14% |
WVD | Score (Test) | 94.65% | 97.09% |
WVD | Confusion matrix | [[2421 0] [97 269]] | [[2420 1] [23 343]] |
WVD | Accuracy | 96.51% | 99.13% |
WVD | Frobenius | 0.000526094294398151 | 0.0005254540272451593 |
WVD | ROC curve | | |

Logistic regression | TF-IDF | TFTIDF |
---|---|---|
Score (Train) | 96.59% | 99.25% |
Score (Test) | 94.80% | 97.06% |
Confusion matrix | [[2419 2] [143 223]] | [[2418 3] [79 287]] |
Accuracy | 94.79% | 97.05% |
ROC curve | | |

Data | Logistic regression | TF-IDF in SVD | TFT-IDF in SVD |
---|---|---|---|
WTD | Score (Train) | 96.59% | 99.39% |
WTD | Score (Test) | 94.80% | 97.24% |
WTD | Confusion matrix | [[2419 2] [143 223]] | [[2404 2] [15 366]] |
WTD | Accuracy | 94.79% | 99.39% |
WTD | Frobenius | 0.000526094294398151 | 0.00031077547080683065 |
WTD | ROC curve | | |
WVD | Score (Train) | 96.52% | 99.32% |
WVD | Score (Test) | 94.65% | 96.66% |
WVD | Confusion matrix | [[2421 0] [97 269]] | [[2420 1] [18 348]] |
WVD | Accuracy | 96.51% | 99.31% |
WVD | Frobenius | 0.000526094294398151 | 0.0004190117126713189 |
WVD | ROC curve | | |

When classifying with logistic regression using the existing TF-IDF features, the classification score calculated during training with the training data was 96.59%. After testing with the remaining 50% of the data, the classification score was 94.80%. To assess classification performance more closely, we computed the accuracy from the confusion matrix. The existing TF-IDF model classified spam correctly with 94.79% accuracy. The ROC curve of this model is shown in

When classifying using TFTIDF, the classification score calculated during training with the training data was 99.25%. After testing with the remaining 50% of the data, the classification score was 97.06%. To assess classification performance more closely, we computed the accuracy from the confusion matrix. When classifying using the TFTIDF model proposed in this paper, spam was classified correctly with an accuracy of 97.05%. The ROC curve of this model is shown in

Specifically, when comparing the classification results using the training data of the two models, the TFTIDF model performed approximately 2.66% better than the TF-IDF model. Moreover, the results of the classification performance experiment using the testing data showed that the TFTIDF model scored 2.26% higher than the existing TF-IDF model. Therefore, it can be seen that binary classification using the TFTIDF model is more effective than that using the existing TF-IDF model.
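The accuracy values can be recomputed directly from the confusion matrices reported in the results table. The helper below is a minimal sketch with a hypothetical function name; it only assumes the standard layout in which the diagonal entries count correct predictions.

```python
def accuracy(confusion):
    """Share of correct predictions: diagonal sum divided by the total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


tf_idf_cm = [[2419, 2], [143, 223]]  # TF-IDF confusion matrix from the results table
tftidf_cm = [[2418, 3], [79, 287]]   # TFTIDF confusion matrix from the results table

print(f"{accuracy(tf_idf_cm):.4f}")  # prints 0.9480
print(f"{accuracy(tftidf_cm):.4f}")  # prints 0.9706
```

The recomputed values agree with the reported test scores of 94.80% and 97.06%; the reported accuracies of 94.79% and 97.05% appear to be the same quantities truncated rather than rounded.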

When classifying with logistic regression using features of the existing TF-IDF, the classification score calculated during training with the training data was 96.59%. After testing with the remaining 50% of the data, the classification score was 94.80%. The TF-IDF model classified spam with an accuracy of 94.79%. The ROC value was 0.98. When classifying using TFT-IDF in SVD, the classification score for training was 99.17%. After testing with the remaining 50% of the data, the classification score was 96.52%.

Moreover, for the TF-IDF in SVD model trained with validation data, the training score was 96.52%, and the testing score was 94.65%. For the TFT-IDF in SVD model trained with validation data, the training score was 99.14%, and the test score was 97.09%. The models were trained using the MSE function. When the TFT-IDF in SVD model was trained and used for classification, the prediction accuracy was approximately 99.17%. When classification was performed using validation data and the TFT-IDF in SVD model, the prediction accuracy was 99.13%. The ROC curves of this model are shown in

Specifically, when comparing the classification results of the TF-IDF and TFT-IDF in SVD models using the training data and testing data, the accuracy of the latter was higher than that of the former by (train) 4.38% and (test) 4.35%. As a result, it was found that the binary classification using the TFT-IDF in the SVD model proposed in this paper has a more effective performance than binary classification using the existing TF-IDF model.

| Model type | TFTIDF | | Model type | TF-IDF in SVD | | | | TFT-IDF in SVD | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | Train | | Test | | Train | | Test | |
| | Train | Test | | Train | Test | Train | Test | Train | Test | Train | Test |
| Score 1 (n = 1, r = 0) | 0.9767 | 0.9473 | Score 1 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9946 | 0.9781 | 0.9925 | 0.9763 |
| Score 2 | 0.9587 | 0.9286 | Score 2 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9828 | 0.9652 | 0.9810 | 0.9580 |
| Score 3 | 0.9738 | 0.9451 | Score 3 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9846 | 0.9717 | 0.9806 | 0.9670 |
| Score 4 | 0.9813 | 0.9534 | Score 4 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9885 | 0.9695 | 0.9828 | 0.9634 |
| Score 5 | 0.9846 | 0.9566 | Score 5 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9853 | 0.9677 | 0.9864 | 0.9724 |
| Score 6 | 0.9867 | 0.9591 | Score 6 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9860 | 0.9656 | 0.9849 | 0.9638 |
| Score 7 | 0.9874 | 0.9688 | Score 7 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9900 | 0.9717 | 0.9885 | 0.9663 |
| Score 8 | 0.9892 | 0.9699 | Score 8 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9896 | 0.9684 | 0.9896 | 0.9659 |
| Score 9 | 0.9907 | 0.9702 | Score 9 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9910 | 0.9709 | 0.9900 | 0.9630 |
| Score 10 | 0.9925 | 0.9706 | Score 10 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9917 | 0.9652 | 0.9914 | 0.9714 |

| Model type | TFTIDF | | Model type | TF-IDF in SVD | | | | TFT-IDF in SVD | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | Train | | Test | | Train | | Test | |
| | Train | Test | | Train | Test | Train | Test | Train | Test | Train | Test |
| Score 1 (n = 1, r = 0) | 0.9767 | 0.9473 | Score 1 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9912 | 0.9711 | 0.9925 | 0.9763 |
| Score 2 | 0.9587 | 0.9286 | Score 2 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9835 | 0.9620 | 0.9824 | 0.9699 |
| Score 3 | 0.9738 | 0.9451 | Score 3 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9842 | 0.9709 | 0.9842 | 0.9724 |
| Score 4 | 0.9813 | 0.9534 | Score 4 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9828 | 0.9677 | 0.9842 | 0.9724 |
| Score 5 | 0.9846 | 0.9566 | Score 5 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9878 | 0.9656 | 0.9846 | 0.9648 |
| Score 6 | 0.9867 | 0.9591 | Score 6 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9885 | 0.9695 | 0.9889 | 0.9688 |
| Score 7 | 0.9874 | 0.9688 | Score 7 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9889 | 0.9724 | 0.9867 | 0.9634 |
| Score 8 | 0.9892 | 0.9699 | Score 8 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9932 | 0.9731 | 0.9889 | 0.9620 |
| Score 9 | 0.9907 | 0.9702 | Score 9 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9935 | 0.9709 | 0.9892 | 0.9616 |
| Score 10 | 0.9925 | 0.9706 | Score 10 | 0.9659 | 0.9480 | 0.9652 | 0.9465 | 0.9939 | 0.9724 | 0.9932 | 0.9666 |

When classification was conducted using the TF-IDF in SVD model, which projects the existing word frequency and IDF features into a reduced dimension, the classification score calculated during training with the training data was 96.59%. After testing with the remaining 50% of the data, the classification score was 94.80%. Moreover, the score calculated during training with the validation data was 96.52%, and the test score was 94.65%. All models were trained using the MSE function as the error function. When classification was performed using the training data and the TF-IDF in SVD model, the classification accuracy was approximately 96.59%. When classification was performed using validation data and the TF-IDF in SVD model, the prediction accuracy was 96.51%. The ROC curves of the model are shown in

Comparing the classification results of the TF-IDF and TF-IDF in SVD models on the training and testing data, there was no difference between the two models when TF-IDF in SVD was trained with the training data. On the other hand, when TF-IDF in SVD was trained with the validation data, its training score was approximately 0.07% lower and its test score approximately 0.15% lower than those of the existing TF-IDF model.

Performance was also evaluated as the parameters of the proposed models changed. The parameter r is the number of times the total matrix is decomposed. The experiment was performed with only the first and second decompositions. The parameter n affects the weight between documents and topics and between topics and words, and it represents the number of word weight groups. The results are shown in

Measure | F1-Score | Recall | Precision |
---|---|---|---|
TFTIDF | 0.97 | 0.99 | 0.96 |
TIS | 0.96 | 0.99 | 0.94 |
TTIS | 0.99 | 0.99 | 0.99 |
TF-IDF | 0.96 | 0.99 | 0.94 |

r | Model | Mean | Deviation | Max | Min |
---|---|---|---|---|---|
1 | TF-IDF | 0.021013 | 0.211265 | 4.000124 | −1.942070 |
1 | TFTIDF | −0.004366 | 0.180398 | 1.498211 | −3.480228 |
1 | When train data, TF-IDF in SVD | 0.021005 | 0.211251 | 4.000369 | −1.942319 |
1 | When train data, TFT-IDF in SVD | −0.027842 | 0.173702 | 2.372024 | −2.172323 |
1 | When validation data, TF-IDF in SVD | 0.019940 | 0.208464 | 3.850464 | −2.297395 |
1 | When validation data, TFT-IDF in SVD | −0.011387 | 0.165624 | 2.759352 | −1.884541 |
2 | TF-IDF | 0.021013 | 0.211265 | 4.000124 | −1.942070 |
2 | TFTIDF | −0.004366 | 0.180398 | 1.498211 | −3.480228 |
2 | When train data, TF-IDF in SVD | 0.021005 | 0.211251 | 4.000369 | −1.942319 |
2 | When train data, TFT-IDF in SVD | −0.029638 | 0.174278 | 1.463640 | −3.263103 |
2 | When validation data, TF-IDF in SVD | 0.019940 | 0.208464 | 3.850464 | −2.297395 |
2 | When validation data, TFT-IDF in SVD | −0.023235 | 0.175948 | 2.360387 | −2.569441 |

This paper proposed the TFTIDF and TFT-IDF in SVD models, which improve on the TF-IDF and TF-IDF in SVD models. We evaluated the results using a binomial classification model. Based on the score, accuracy, and ROC results of the classification experiments, the TFTIDF model at n = 10 showed an approximately 2.26% higher score than the existing TF-IDF feature model. The TFT-IDF in SVD model achieved a prediction accuracy of approximately 99.17% when r = 1 and n = 10, outperforming the TF-IDF model. In addition, the model with r = 2 and n = 10 showed the highest performance among the models, with an accuracy of 99.39%. However, the TF-IDF in SVD model showed an approximately 0.07% lower training score and 0.15% lower testing score than binary classification using the existing TF-IDF model. Overall, document predictability was significantly improved compared to using only TF-IDF. In addition, the TFT-IDF in SVD model presents a reasonable solution to the sparsity problem encountered in the existing TF-IDF model and effectively reduces the dimension to achieve superior performance. The approach is expected to offer theoretical and practical value when applied to real bot services and to research on many ML algorithms.

This work was supported by the BK21 FOUR Project etc.