ZU Scholars ZU Scholars

: System logs record detailed information about system operation and are important for analyzing the system's operational status and performance. Rapid and accurate detection of system anomalies is of great signi ﬁ cance to ensure system stability. However, large-scale distributed systems are becoming more and more complex, and the number of system logs gradually increases, which brings challenges to analyze system logs. Some recent studies show that logs can be unstable due to the evolution of log statements and noise introduced by log collection and parsing. Moreover, deep learning-based detection methods take a long time to train models. Therefore, to reduce the computational cost and avoid log instability we propose a new Word2Vec-based log unsupervised anomaly detection method (LogUAD). LogUAD does not require a log parsing step and takes original log messages as input to avoid the noise. LogUAD uses Word2Vec to generate word vectors and generates weighted log sequence feature vectors with TF-IDF to handle the evolution of log statements. At last, a computationally ef ﬁ - cient unsupervised clustering is exploited to detect the anomaly. We conducted extensive experiments on the public dataset from Blue Gene/L (BGL). Experimental results show that the F1-score of LogUAD can be improved by 67.25% compared to LogCluster.


Introduction
Nowadays, information technology and the economy are growing closer. In order to ensure the sustainable development of the economy, the requirements of robust information systems keep getting higher. Any failure, including service interruption, and transmission quality degradation, may lead to application error, collapse, and even significant economic losses. For example, the securities industry is sensitive to the real-time and continuity of information systems. To enhance the ability of network security emergency response, securities companies have established an automatic emergency switching process to reduce system paralysis time. Once the cause is clearly located, the emergency switching time is controllable. Therefore, identifying the cause or anomaly detection has become a research hotspot [1,2]. At present, information systems have installed a great number of devices and software to monitor the system states, which generate large-scale system logs and key performance indicators (KPI) data [3], such as CPU utilization, query times per second, etc. Abnormal KPI data (e.g., spikes, sudden drops, jitters) can show the abnormal state of the system. However, KPI only can contain limited information. System logs record the detailed operation information. Generally, the cause of fault is also recorded in the system log. Analysis and detection of logs can provide various dimensional information to locate faults. The system log is a rich data source of anomaly detection. Therefore, log anomaly detection has been widely adopted in practice due to its directness and effectiveness.
System logs are generally semi-structured statements and have many different types and formats. Therefore, it is not easy to understand the causes of log anomalies. Initially, developers manually detect anomalies by keyword searching (such as "fail" or "exception") or regular expression matching according to their relevant field knowledge. However, it depends on manual inspection heavily by developers, which is inefficient and high error rate. Therefore, many machine learning methods are applied into log anomaly detection [4,5], such as Logistic Regression (LR), Support Vector Machine (SVM), Principal Component Analysis (PCA), etc. Machine learning methods can save time, reduce the possibility of error, and further improve detection accuracy. However, they do not consider the log instability and still have the following problems: 1) An ideal supervised classification model needs to be trained with a large amount of label data.
Generally, the more the amount of label data, the better the classification model. In order to get a good classification model, a large number of label logs are needed, which is hard in practice.
2) The log is unstable. The update and evolution of log statements and the noise introduced by log parsing lead to log unstable, which has a great influence on anomaly detection results. 3) Since there are few label logs, unsupervised methods to detect anomalies in logs are the trend. The detection accuracy of the unsupervised model depends on the feature representation. The original log is often high-dimensional, including a lot of redundant information. Directly training original log data is often inefficient. A suitable feature representation is important for unsupervised model. Therefore, to overcome the above issues, we propose Log Unsupervised Anomaly Detection (LogUAD), which is an unsupervised anomaly detection method based on Word2Vec. LogUAD does not require a log parsing step and takes original log messages as input to avoid the noise from log parsing. LogUAD uses Word2Vec to generate word vectors and generates weighted log feature vectors to handle the evolution of log statements. At last, a computationally efficient unsupervised clustering is exploited to detect anomalies based on the feature vectors of the log sequence. The contributions can be summarized as follows.
1) We propose a LogUAD anomaly detection model without log parsing, which uses the original log directly for feature extraction and avoids the noise from log parsing on anomaly detection. 2) A word-sequence weight matrix based on the TF-IDF (Term Frequency-Inverse Document Frequency) is used to indicate the relationship between words and log sequences. The feature vector of the log sequence is obtained from word vector generated by Word2Vec and the wordsequence weight matrix, which can handle the evolution of log statements because of the semantic similarity among evaluation of log. It effectively extracts suitable feature for clusteringbased anomaly detection. 3) Compared with the LogCluster [6], the F1-score of LogUAD increases by 67.2%. LogUAD improves the accuracy of unsupervised log anomaly detection and can effectively carry out unsupervised anomaly detection.
The rest of this paper is organized as follows. The related work is introduced in Section 2. Section 3 introduces the necessary knowledge about log anomaly detection. Section 4 proposes the LogUAD method. Section 5 describes the experiment in detail. Finally, Section 6 summarizes the work of this paper and future work.

Related Work
In practice, system logs are widely used in anomaly detection by developers. There are many works on log anomaly detection. The current main technologies are machine learning and deep learning.
In order to pursue higher detection accuracy, log anomaly detection has applied many supervised learning methods. Liang et al. [7] trained Support Vector Machines (SVM) classifier to detect log events fault. Chen et al. [8] proposed a decision tree method for fault detection and diagnosis of website sites, by sorting and merging the correlation degree of faults to improve the detection accuracy. Farshchi et al. [9] proposed a novel regression-based analysis method. Their experiments on Amazon servers show that this method can solve the problem of operating cloud applications. Zhang et al. [10] proposed PreFix, which applied the random forest algorithm to the template sequence so that it could intervene and "fix" the potential failures. Bertero et al. [11] mapped the logfiles' words to a high-dimensional metric space and then used a standard classifier to check whether the log event is abnormal. However, a large number of label logs are necessary to train unsupervised models and detect anomalies.
Afterward, many unsupervised learning methods were proposed. The advantage of unsupervised learning is without labels. Lou et al. [12] presented Invariants Mining (IM). IM mines the invariants of sparse integer values from the message count vector. Experiments on Hadoop and CloudDB show that IM can detect more and more complete anomaly messages from the distributed systems. Xu et al. [13] used principal component analysis (PCA) to detect abnormal events and found that PCA with term weighting technology in information retrieval got good anomaly detection results without too much parameter adjustment. Yang et al. [14] proposed LogOHC, which had high extraction efficiency for multisource log datasets. It is an online log template extraction method with an online hierarchical clustering algorithm.
The deep learning method is favored by more and more researchers, which provides a new direction for log anomaly detection. Zhang et al. [15] firstly employed LSTM for log system failure prediction. They gathered logs of similar format and content, and dealt with the "rarity" of labeled data in the training process by LSTM to capture the long-range dependency across log sequences. Due to the great similarity between log analysis and natural language processing, some works use natural language processing (NLP) related technologies for log anomaly detection. Du et al. [16] proposed DeepLog. DeepLog regards anomaly detection as a multi-classification problem of log keys. LSTM is used to model the log key sequence, and only the normal log key sequence is trained for LSTM. LSTM model needs to be trained with a large amount of normal data, but the original dataset cannot be all normal which may contain some abnormal log.
These works can not cope with the evolution of log statements, especially in the case of new log templates [17], which greatly limits their applicability in practice. Log instability is common, but there are few works on it. Brown et al. [18] proposed attention-based RNN, which improved the interpretability of the model without sacrificing the effect of anomaly detection. However, it focuses on log events and ignores the relationship of context in the log sequence. Zhang et al. [19] proposed LogRobust, which could handle the instability in log events and log sequences. LogRobust uses Bi-LSTM neural network based on attention to capturing context information. By assigning weights to different events, the model can automatically analyze the importance of events to improve the accuracy of anomaly detection. It can solve log instability problems, but the computational cost is large, and training is slower when the amount of log data is huge. Lin et al. [6] introduced the LogCluster algorithm based on the clustering method to realize data clustering and linear pattern mining of logs, but it still needed log parsing. Therefore, we propose LogUAD to resolve the problem of log instability and increase the accuracy and efficiency of anomaly detection with low computational cost.

Preparatory Knowledge
In this session, we introduce the basic knowledge about different log types, general log anomaly detection framework, and Word2Vec.

Different Log Types
The system log records detailed information during system operation. Anomaly detection based on a system log can effectively maintain system security and reduce system failure.
Four different system logs are shown in Fig. 1. The log records the detailed information of the system operation, such as timestamp, message type, system operation status, dynamic changes of software and hardware, message level, date, and time of occurrence, etc. System logs are all semi-structured texts. The log specifications are not uniform, and different types of devices print different log formats. Fig. 2 shows the overall framework of log anomaly detection, which contains the following four steps: log collection and processing, log parsing, feature extraction, and anomaly detection.

General Log Anomaly Detection Framework
The system usually generates logs to record system status, operations, and changes. Therefore, logs are collected for preprocessing, such as log division and information extraction. Logs are unstructured, and the word count, and position in the text are not fixed. After preprocessing, this valuable information can be used for various purposes (for example, statistics, anomaly detection). The purpose of log parsing is to extract the log template from the original system log. Each log message can be seen as composed of a constant part and a variable part. Log parsing eliminates the variable part or replaces it with wildcards, and keeps only the constant part to form a log template.
After the logs are parsed into log templates, log templates are further encoded into digital feature vectors (such as log template count vectors, etc.). Then, a machine learning model can be used to detect the anomaly.
Finally, after obtaining the feature vector, the feature vector is input into the anomaly detection model. The trained model can identify whether the log is normal or abnormal.

Word2Vec
Word2Vec is a word vectorization tool proposed by Google in 2013, which calculates the relationship between words and explores the connections between words. According to the different types of input and output, it can be roughly divided into two completely different models, one is the continuous Skip-gram model, and the other is Continuous Bag of Words (CBOW). Word2Vec can convert words into word vectors, and vectorize texts while maintaining the semantic relationship between texts. Word2Vec can distinguish the semantic similarity of short texts, such as news headlines or similar bus and subway station analysis. Fig. 3 summarizes the network structure of Word2Vec. The input and output are the One-Hot vectors of words. The first layer is a fully connected layer, and then there is no activation function. The output layer is a Softmax layer, which outputs a probability distribution that represents the probability of each word in the dictionary. Skip gram is a model that predicts context by giving input words. On the contrary, CBOW inputs the context word vector of the target word and outputs the word vector of the target word.

Design of LogUAD
In this section, we first give the overview of LogUAD and the detailed steps of LogUAD, including log preprocessing, word vector matrix generation, word-sequence weight matrix generation, log sequence feature vector matrix generation, and unsupervised anomaly detection. Then, an example is given to illustrate our method.

The Overview of LogUAD
To unravel the issue of log instability, reduce computational overhead, and improve the accuracy and efficiency of log anomaly detection, we propose an unsupervised log anomaly detection method based on Word2Vec--LogUAD, as shown in Fig. 4. First of all, we download the relevant logs in the network system and divide and extract the collected system logs. Only the detailed contents are retained. There is no need to form structured text. By setting different window sizes, the detailed contents are divided into multiple log sequences. Then, the log contents are used as the input of Word2Vec to obtain the word vector. The feature vector of the log sequence can be obtained by combining TF-IDF of the log sequence. After obtaining the feature vector of the log sequence, log anomaly detection based on clustering can distinguish normal and abnormal logs. LogUAD mainly involves the following steps: 1) Data collection: We can download and collect a large number of logs from the network service system, such as distributed system logs, operating system logs, mobile system logs, etc. The collected logs include a great number of normal system logs and few anomaly system logs in the actual network environment. 2) Processing: After getting the original system log, delete the timestamp, node, information of the original system log, only extract the message content. Log messages are divided into multiple log sequences according to different window sizes (that is, the number of log entries). 3) Word vector matrix generation: Taking the words in the log sequence as the input of the Word2Vec, each word can be mapped to a vector space to generate a word vector. After training the vector of words, it is easy to be calculated in the next step. All words in the dictionary form a word vector matrix. 4) Word-sequence weight matrix generation: TF-IDF is a common weighting technique for information retrieval and data mining. The weight of a word increases proportionally with the number of occurrences in the log sequence, but if the frequency of occurrences in the file decreases, the weight of words also decreases. We use TF-IDF to calculate word-sequence weight matrix. 5) Log sequence feature vector generation: The word vector matrix is multiplied by the word-sequence weight matrix to get the log sequence feature vector. 6) Unsupervised clustering for anomaly detection: By the K-means clustering algorithm, the log sequence is divided into different clusters for anomaly detection according to the Euclidean distance from the feature vector of the log sequence to the cluster center.

Log Processing
Most of the log files are original text and unstructured files. We take a dataset from the Blue Gene/L (BGL) [20] supercomputer system of Lawrence Livermore National Laboratory (LLNL) as an example. Fig. 5 lists some fragments of the BGL dataset.
As shown in Tab. 1, BGL data contains a lot of irrelevant information such as time, node, message type, kernel, etc. For example: "1118765360 2005.06.14 R04-M0-ND-C:J15-U01 RAS KERNEL FATAL data storage interrupt". The timestamp is 1118765360, the record date is 2005.06.14, the node is R04-M0-ND-C:J15-U01, the type of the log message is RAS, the location where the message is generated is  Figure 4: The overall framework of LogUAD KERNEL, the corresponding level of the log message is FATAL, and the content is "data storage interrupt". We remove irrelevant information and keep useful log messages, that is, "data storage interrupt".
After extracting the log message contents, they are divided according to the fix window to generate log sequences. Assuming that the total number of log entries is N and the window size is t, the log contents are divided into m log sequences, where m = N/t. Each log sequence contains t consecutive logs. Seq i represents the i-th log sequence.

Word Vector Matrix Generation
Because Word2Vec takes words as input, it is necessary to segment the log content first. Here, we can use space and other ways to segment words. For example, "141 double-hummer alignment exceptions data TLB error interrupt" are divided into nine words: "141", "double", "hummer", "alignment", "exceptions", "data", "TLB", "error" and "interrupt".
CBOW and Skip-gram are two training modes of Word2vec. We use CBOW to predict the current word through the context. CBOW calculates the probability of the current word through the first n or last n words of the current word.
As shown in Fig. 6, we suppose that a word is only associated with the first n and last n words of the word. The one-hot vector of the context word of the word v (i−n) , …, v (i−1) , v (i+1) …, v (i+n) are entered, then the word vector of the v (i) through the Projection layer and Softmax layer is calculated. The word vector of word  i is w i ∈ R 1×j , where j is the dimension of the word. Assuming that the dictionary contains s words, the vectors of all words form a word vector matrix W 2 R sÂj according to the order of words appearing, as shown in Fig. 7.

Word-Sequence Weight Matrix Generation
TF-IDF calculates the weight of words in all log sequences. If a word appears less frequently in all log sequences, it should be given higher weight. If it appears more frequently, its weight should be lower. If it does not appear in the current log sequence, its weight is 0. TF calculates the frequency of words, and IDF measures the occurrence of log events, which are formulated by Eq. (1) and Eq. (2).
where Count(word i ) is the occurrence number of word i in all log sequences, Total(word) is the total number of words in all log sequences, Total(seq) is the total number of log sequences, and Total(seq i ) is the total number of log sequences contain word i . Finally, TF-IDF is the product of TF and IDF, according to Eq. (3): The weight vector of the i-th log sequence Seq i is denoted by T Seq i 2 R 1Âs , where s is the total number of words. T Seq iz represents the value of column z in the i-th log sequence. If word z belongs to log sequence ... ... Figure 6: CBOW model structure  Figure 7: Generating schema of word vector matrix W the value of the z-th column of T Seq iz is TF − IDF(word z ). Otherwise, it is zero, as shown in Eq. (4):

Projection layer SUM
There are m log sequences in total, and the weight vectors of all log sequences form a word-sequence weight matrix T ∈ R m×s .

Log Sequence Feature Matrix Generation
Obtaining the log sequence feature matrix is a key step in log anomaly detection. The log anomaly detection model takes the log sequence feature matrix as the input. The word-sequence weight matrix T is multiplied by the word vector matrix W to obtain the log sequence feature matrix vector E seq ∈ R m×j , as shown in Eq. (5).

Unsupervised Anomaly Detection
An unsupervised clustering method based on K-means is exploited to detect anomaly. The K-means method has the advantages of simple principle, fast convergence speed, and good interpretability. Algorithm 1 shows an unsupervised clustering method based on K-means. The input of this algorithm is the log sequence feature matrix, the number of cluster centers k, and the threshold of the abnormal fraction. The output of the algorithm is an anomaly label for each log sequence. The 1-3 row is initialization. The Euclidean distance between the log sequence vector and the center of each cluster is calculated in 4-8 rows and the log sequence is divided into the cluster with minimum distance. Line 9-17 updates the current cluster center until it is unchanged. Lines 18-24 mark the log that the distance between the log sequence vector and its cluster center is larger than the anomaly score threshold as an abnormal log, otherwise it is normal.

Example
An example is taken to briefly explain the main process of feature matrix generation. 6 lines of log messages are selected from the BGL dataset, which contains 19 (s = 19) unique words forming a dictionary.
As shown in Fig. 8a, the original system log is preprocessed to extract the log message contents. As shown in Fig. 8b, 6 log message is divided into 3 log sequences with a window size of 2. The 19 words in the dictionary are trained by Word2Vec to generate word feature vectors w (i) ∈ R 1×j , where j is the word vector dimension and is set in Word2Vec. Because there are 19 words in total, the word vector matrix W is R 19×j , as shown in Fig. 8c. TF-IDF is used to calculate the weight of each word in all log sequences. As shown in Fig. 8d, the TF-IDF of the word is used to represent the word-sequence weight vector T Seq i 2 R 1Â19 of the i-th log sequence. All log sequence's weight vectors form a word-sequence weight matrix T ∈ R 3×19 , where 3 is the number of log sequences.
Finally, as shown in Fig. 8e and 8f, the log sequence feature matrix E seq ∈ R 3×19 is obtained by multiplying the word-sequence weight matrix T and the word vector matrix W. Then, the feature matrix is used unsupervised to detect the anomaly of the log.

Simulation and Evaluation
In this section, we first introduce the details of the data set used for the experiment, the experimental setup, and the experimental platform. Then we compare and evaluate our log anomaly detection method with LogCluster. Finally, we analyze the effects of different experimental parameters.

Datasets
We use the BGL dataset, which is published by the Blue Gene/L supercomputer system of LLNL. The BGL data contains 1,048,576 log messages. Tab. 2 shows the detailed information of the BGL dataset, including data size, number of log information, etc. BGL logs contains 6% Out-of-Vocabulary (OOV) words because of log evolution. After log parsing, the percent of OOV words can exceed 80% [21].

Experimental Setup
We use 1,000,000 lines of log messages for the experiment from the BGL dataset. There are a total of 773 words (that is, s = 773). The word vector dimension of each word is 376 dimensions. The window size t of log sequences ranges from 5 to 100. Each line of log message has its label in the BGL dataset. Once there are one or more logs in the log sequence, this log sequence is abnormal. Otherwise, it is a normal log sequence. The default parameters of LogUAD are shown in Tab. 3.
We conduct an experimental comparison with LogCluster [6]. LogCluster processes and classifies the logs of large-scale online software systems, treats the log clustering problem as a pattern mining problem, and uses clustering methods to identify whether current failures have occurred before. LogCluster generates log templates by log parsing, then combines IDF-based and Contrast-based method to generate word vectors, and finally clusters them by similarity. Although the deep learning method has better performance, the computational overhead can be very large, when the data set is large. Moreover, the input log in the training stage must be correct so that a better model can be trained. However, most of the dataset contains anomaly logs. Therefore, we only compare with machine learning-based method.  We consider the impact of window sizes, the number of cluster and the dimension of word vectors on LogUAD. All experiments of LogUAD were run on AI Studio. AI Studio is an AI development platform based on Baidu deep learning platform, which equipped the server with a 4-core 32GB RAM and 100G memory CPU of Intel (R) Xeon (R) Gold 6148 and a Tesla V100 GPU with 32G memory.

Evaluation Metrics
Anomaly detection is usually a two-class classification problem. We use precision, recall, and F1-score, which are widely used in related studies, to evaluate the performance of our proposed method. The definition for precision, recall, and F1-score is shown in Eq. (6): A true positive (TP) is the number of the normal log sequences in practice, and the results predicted by the model are also normal. A false positive (FP) is the number of the abnormal log sequences in practice, and the results predicted by the model are normal. A false negative (FN) indicates the number of log sequences that the predictions are normal, but actually they are abnormal log sequences.

Comparison with Logcluster
In order to evaluate the performance of LogUAD, we compare it with the LogCluster proposed by Lin et al. [6]. The window size determines the division of the log sequences. To investigate the influence of window size, t is set to 5, 10, 20, and 32, and other parameters are set as default values.
The experimental results of LogUAD and LogCluster are shown in Fig. 9. As the window size increases, the F1-scores of LogUAD are 0.756, 0.757, 0.729, and 0.727, respectively, while the F1-scores of LogCluster are 0.457, 0.522, 0.625, and 0.545. Although the anomaly detection performance of LogCluster improves as the window size increases, the performance begins to decline when the window size increases to 20. LogCluster needs to parse the original log into a log template before clustering analysis, which introduces a lot of OOV words and results in a deviation in the log anomaly detection. As the window size gradually increases, LogUAD anomaly detection performance tends to decrease insignificantly and remains at a high level. Therefore, the overall performance of LogUAD is more accurate and more robust than LogCluster.

The Impact of Number of Clusters
The number of clusters plays an important role in LogUAD. Moreover, the number of clusters is set in advance according to experience. In order to observe the impact of the number of clusters on LogUAD, we set the number of cluster centers from 2 to 6. We use log sequences with four different window sizes t (5, 10, 20, and 32), and set other parameters, such as word vector dimensions, to the default value.
The F1-scores of LogUAD with the different number of clusters are shown in Fig. 10. The overall abnormality detection score tends to increase as the number of clusters decreases. For example, when the window size is 5 marked with the blue line, the F1-score of LogUAD gradually decreases from 0.756 to 0.637. When the number of clusters is less than 5, the red line with the window size of 32 maintains relatively stable anomaly detection performance. This shows that the LogUAD is stable and robust. When the number of clusters is 2 or 3, the F1-scores of LogUAD in different windows can reach 0.75, and can obtain better anomaly detection effects. Because the abnormal log vector is far from the cluster center, the fewer the number of cluster centers, the fewer the abnormal logs within the threshold range.

The Impact of Word Vector Dimensions
What word embedding does is map words into vector space, and the word vector dimension is a parameter that can be set. In order to evaluate the effect of word vector dimension on LogUAD, we select four types of word vector dimensions (200, 250, 300 and 376).
LogUAD with a high dimension of word vector can keep a stable anomaly detection effect when the window size changes. Fig. 11 shows the performance of 2 cluster centers in LogUAD and Fig. 12 demonstrate the performance of 2 cluster centers in LogUAD.
As shown in Fig. 11, when the word vector dimension is 200 and the window size is small, LogUAD can maintain a good anomaly detection effect. When the window size is larger than 32, the F1 scores begin to decline. The reason is that as the window size increases, a log sequence contains more and more log information. When the dimension of word vector is small, the cosine similarity of the word vectors of two log sequence is higher. It is difficult to distinguish and result in clustering errors. As shown in Fig. 12, as the window size increases, LogUAD with high-dimensional word vector has a higher F1 score than that with low-dimensional word vector. LogUAD with high-dimensional word vector can maintain a stable anomaly detection effect. Therefore, LogUAD can effectively perform unsupervised anomaly detection.
From the above experiments, compared with LogCluster, LogUAD has better performance. The weight feature vector of LogUAD characterizes the log message efficiently. With different windows sizes and the different number of clusters, LogUAD can achieve robust performance. At the same time, LogUAD abandons the log parsing step and reduces computational overhead.

Conclusion
In this paper, we propose LogUAD, a log anomaly detection method based on Word2Vec. It is an off-line feature extraction method, which avoids the steps of log parsing for the original log messages, saves the time of the whole log anomaly detection. It can reduce the impact of the low accuracy of log parsing and also overcome the instability of system log. It uses K-means unsupervised clustering algorithm to detect anomaly. Simulation results show that our method is effective and superior to LogCluster. It is our next step to combine deep learning with LogUAD framework.