Joint Event Extraction Based on Global Event-Type Guidance and Attention Enhancement

Event extraction is one of the most challenging tasks in information extraction. Multiple events commonly occur in the same sentence, and extracting them is more difficult than extracting a single event. Existing event extraction methods based on sequence models ignore the interrelated information between events because the sequence is too long. In addition, current argument extraction relies on the results of syntactic dependency analysis, which is complicated and prone to error propagation. To solve these problems, this work proposes a joint event extraction method based on global event-type guidance and attention enhancement. Specifically, for multiple event detection, we propose a global-type guidance method that detects the event types in the candidate sequence in advance to enhance the correlation information between events. For argument extraction, we convert the task into a table-filling problem and propose an attention-based table-filling method that is simple and enhances the correlation between trigger words and arguments. Experimental results on the ACE 2005 dataset show that the proposed method achieves a 1.6% improvement in event detection and obtains state-of-the-art results in argument extraction, which demonstrates the effectiveness of the method.

The Automatic Content Extraction (ACE) evaluation conference defined event extraction (EE) as two subtasks: event detection (identifying and classifying event triggers) and argument extraction (identifying arguments of event triggers and labeling their roles).
Traditional methods generally handle event extraction as a pipeline of two separate tasks: event detection and argument extraction. The pipeline method achieves good results, especially when deep learning techniques are used. The most successful pipelined method was proposed by Chen et al. [2], who used dynamic multi-pooling convolutional neural networks to automatically learn features from sentences and represented words with continuous representations [3][4][5]. However, because the pipeline method is divided into two subtasks, the interrelationship between the subtasks is ignored. Specifically, the result of event detection affects the subsequent argument extraction, and the result of argument extraction can in turn improve event detection [6]. Thus, researchers turned to joint extraction.
Li et al. [6] performed one of the most successful studies of the joint method, which is based on a structure-aware algorithm with sets of local and global features for EE. The interdependence between trigger words and arguments is captured by global features. This method alleviates the shortcomings of the pipeline method and achieves good results. However, its feature extraction relies on natural language processing tools (e.g., part-of-speech tagging) and generalizes poorly to new words and unseen features. Therefore, Nguyen et al. [7] proposed joint event extraction based on the recurrent neural network (RNN). They used RNNs to automatically learn rich contextual semantic representations. To capture the interdependence between trigger words and arguments, the method introduces memory vectors and matrices that store prediction information during sentence labeling. To a certain extent, this method remedies the deficiencies of the method of Li et al. [6], but it does not make full use of the syntactic dependencies between the components of the sentence. Sha et al. [8] used dependency bridges over a bi-directional RNN to learn the syntactic dependency between the components of a sentence and introduced a tensor to learn the interdependence between arguments. However, all of the above methods share a common disadvantage: they ignore the interdependence of multiple events in the sentence.
In actual event extraction scenarios, there will inevitably be multiple events in one sentence. Compared with single event extraction, it is more complicated to accurately extract multiple events. There is a strong correlation among events drawn from the same sentence. For example, as shown in Fig. 1, the Attack event helps us determine that the word died triggers the Die event rather than the Injure event. Multiple events in one sentence are ubiquitous in natural language. According to statistics, there are 3,978 event-related sentences in the ACE 2005 dataset, and 1,042 of them contain multiple events, accounting for 26.2% of all event-related sentences. Liu et al. [9] conducted an in-depth study on multiple event extraction. They used a graph convolutional network to learn the dependency syntax relationship between the components of the sentence and tried to capture the correlation between events. However, owing to the complexity of the dependency syntax tree and the reliance on NLP tools for preprocessing, this method inevitably encounters the error propagation problem, and the interdependence between events is not fully resolved.
To solve the above problems, we propose a joint event extraction method based on global event-type guidance and an attention enhancement mechanism. Recent studies on multi-task learning (MTL) in deep neural networks for NLP revealed that multi-task gains are more likely for target tasks that quickly plateau with non-plateauing auxiliary tasks [10]. Because of these benefits of MTL, we adopt a multi-task setup for identifying and classifying events and arguments. Specifically, we first use the pre-trained BERT model to encode the sentence and obtain the context information of each token. Next, the event guidance layer predicts the candidate event types of the input sentence. At the same time, a CRF layer identifies the candidate arguments. Then, we feed the candidate event types and context features into a softmax layer for trigger word recognition and event classification. Finally, we enumerate the combinations of two tokens in the sentence. The corresponding context features, candidate argument features, trigger words, and event-type features are attentively considered for argument role classification in a table-filling [11][12][13] manner (see Fig. 2). The event types predicted by the event guidance layer help to guide event classification. With the injected event types, the network is aware in advance of all the events that exist in the sentence, so the interdependencies of events are taken into account. Moreover, we use an attention mechanism that considers all tokens when classifying the role of any two tokens. Therefore, the correlation between trigger words and arguments is taken into account at the table-filling stage.
[Fig. 1: Example sentence containing multiple events: "In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel."]

In summary, the contributions of this work are as follows:
(1) We propose a novel event-type guidance layer to predict the event types of the input sentence. The candidate event types are used to guide trigger word recognition and event detection, which strengthens the modeling of the complex interdependencies of events.
(2) The method casts argument extraction as a table-filling problem. An attention mechanism is introduced to involve the representations of multiple tokens, which automatically discovers useful contextual information for argument role classification.
(3) We conducted extensive experiments on the ACE 2005 dataset. The experimental results indicate that the proposed method outperforms several strong baselines.

Related Work
Traditional event extraction methods usually extract events in a pipelined way, where arguments are identified by a classifier after event detection [14,15]. These methods have a serious flaw: they ignore the underlying interdependencies between event detection and argument extraction and suffer from error propagation.
To address the above problem, a joint event extraction method based on the Markov logic network [16][17][18] was proposed. Afterward, the structured perceptron [6,19] and the dual decomposition method [20] were successively proposed for event extraction.
Recently, with the widespread application of neural networks in machine translation, text classification, steganography analysis [21,22], and other fields, researchers have also tried to use neural networks for event extraction. For example, Chen et al. [2] employed dynamic multi-pooling convolutional neural networks to automatically learn features, in which the input words are represented by pre-trained word embeddings [3][4][5]. Although they achieved promising results, the method still follows the pipelined framework. Nguyen et al. [7] proposed a joint approach named JRNN, in which recurrent neural networks are used to automatically learn rich contextual semantic representations of sentences. The relations between event triggers with specific subtypes and their corresponding arguments are captured by the devised memory vectors and matrices. Similarly, Sha et al. [8] exploited dependency bridges to connect syntactically related words based on a bidirectional recurrent neural network. Moreover, they proposed a tensor layer over each pair of candidate arguments, which enables intensive argument-level information interaction. Liu et al. [9] conducted an in-depth study on multi-event extraction, introducing a syntactic dependency tree and using graph convolutional networks to learn the syntactic dependency of each component of the sentence.
The above-mentioned joint extraction methods have achieved good results. However, these existing methods have a common disadvantage: they do not adequately handle the situation where multiple events appear in one sentence at the same time. To solve this problem, we propose an event-guided and attention-enhanced joint approach for event extraction. The pre-predicted event-type information allows for better event detection, and the attention mechanism is exploited to leverage the sentential context.

Methodology
Our proposed joint model consists of six modules: (i) BERT, (ii) NER, (iii) Event-Types Proposal, (iv) Event Detection, (v) Token Pair Attention, and (vi) Table Filling, as illustrated in Fig. 3. Given a sentence as the model input, the model first generates a deep contextualized word representation for each token using BERT. Next, the event guidance layer predicts the candidate event types of the input sentence. At the same time, a CRF layer identifies the candidate arguments. Then, we feed the candidate event types and context features into a softmax layer for trigger word recognition and event classification. Finally, we enumerate the combinations of two tokens in the sentence and jointly consider the BERT output, NER label, predicted event types, and attention results to fill the trigger word-argument role table. We explain the model details in the following subsections.

BERT is built from a stack of Transformer encoder blocks. Each block is composed of a multi-head self-attention layer and a position-wise, fully connected feed-forward layer. Assuming the output sequence of the previous layer is packed together into a matrix $H$, the output matrix $Z$ of a multi-head self-attention layer is computed as

$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{(H W_i^{Q})(H W_i^{K})^{\top}}{\sqrt{d_k}}\right) H W_i^{V} \tag{1}$$

$$Z = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \tag{2}$$

where $h$ is the number of attention heads, $d_k$ is the dimension of queries and keys, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are the parameter matrices. Each layer in the encoder has a residual connection around it, followed by layer normalization.
For a given token, the input representation of BERT is the sum of the corresponding token, segment, and position embeddings. BERT uses WordPiece embeddings as token embeddings. In addition, BERT adds a special token ([CLS]) as the first token to obtain the aggregate sequence representation for a classification task and a special token ([SEP]) to distinguish between different sentences in the same input sequence. Because the input to joint event extraction is a single sentence, the special token ([SEP]) is not useful in the current task.
Given an input token sequence $X = (x_0, x_1, \ldots, x_{n-1}, x_n)$, we denote the BERT contextual representation of each token as $Z = (z_0, z_1, \ldots, z_{k-1}, z_k)$. Moreover, given that the WordPiece tokenizer might split a token into several sub-tokens, we use the hidden state corresponding to the first sub-token of a given token as its contextual representation.
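The first-sub-token rule described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_wordpiece` is a hypothetical stand-in for a real WordPiece tokenizer, and the function returns positions rather than actual BERT hidden states.

```python
from typing import Callable, List, Tuple


def toy_wordpiece(word: str) -> List[str]:
    """Hypothetical tokenizer: splits words longer than 4 characters in two."""
    if len(word) <= 4:
        return [word]
    return [word[:4], "##" + word[4:]]


def first_subtoken_indices(
    words: List[str], tokenize: Callable[[str], List[str]]
) -> Tuple[List[str], List[int]]:
    """Return the sub-token sequence (starting with [CLS]) and, for every
    word, the position of its first sub-token in that sequence."""
    pieces = ["[CLS]"]
    first_index = []
    for word in words:
        first_index.append(len(pieces))  # where this word's first piece lands
        pieces.extend(tokenize(word))
    return pieces, first_index


pieces, first_index = first_subtoken_indices(
    ["a", "cameraman", "died"], toy_wordpiece
)
# The hidden state at first_index[w] would represent word w;
# position 0 ([CLS]) plays the role of z_0.
```

With an actual encoder, the contextual representation of word w would be read from the hidden state at `first_index[w]`, and the remaining sub-token states would be ignored.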

Named Entity Recognition
We formulate the NER task as a sequence-labeling problem and use the BIEO (Beginning, Inside, Ending, Outside) encoding scheme. A linear-chain CRF is employed to calculate the most probable tag for each token. Formally, we first derive the emission potential from the sentence encoder output. The score of each token $x_i$ for each entity tag is calculated as follows:

$$s_i = V_1^{\top} f(W_1 z_i + b_h) + b_s \tag{3}$$

where $f(\cdot)$ is an element-wise activation function (e.g., ReLU), $s_i \in \mathbb{R}^{d}$, $d$ is the number of encoding scheme tags, $W_1 \in \mathbb{R}^{l \times m}$ and $V_1 \in \mathbb{R}^{l \times d}$ are the transformation matrices, $b_h \in \mathbb{R}^{l}$ and $b_s \in \mathbb{R}^{d}$ are bias vectors, $l$ is the hidden size, and $m$ is the output dimension of BERT. Given a sequence of tag predictions $Y = (y_1, y_2, \ldots, y_{n-1}, y_n)$, the linear-chain CRF score is defined as

$$\mathrm{score}(X, Y) = \sum_{i=1}^{n} s_{i, y_i} + \sum_{i=2}^{n} a_{y_{i-1}, y_i} \tag{4}$$

where $s_{i, y_i}$ is the score of the tag $y_i$ for the token $x_i$, which is obtained by Eq. (3), and $a_{y_{i-1}, y_i}$ is the score of the transition from tag $y_{i-1}$ to tag $y_i$. Using Eq. (4), we can get the score of a tag sequence $Y$, which is further converted to a probability by the following softmax function:

$$p(Y \mid X) = \frac{\exp(\mathrm{score}(X, Y))}{\sum_{\tilde{Y} \in \mathcal{Y}(X)} \exp(\mathrm{score}(X, \tilde{Y}))} \tag{5}$$

where $\mathcal{Y}(X)$ denotes the set of possible tag sequences for $X$. The loss function of the sequence labeling is:

$$L_{NER} = -\sum_{X \in T} \log p(Y^{*} \mid X) \tag{6}$$

where $T$ represents the training set and $Y^{*}$ is the gold-standard tag sequence for $X$. During training, we minimize the negative log-likelihood $L_{NER}$ of the gold standard. In the decoding process, the Viterbi algorithm is adopted to derive the optimal tag sequence. The tags are converted to embeddings by looking up an embedding layer. We then obtain the label-embedding sequence $e^{ner} = (e^{ner}_1, e^{ner}_2, \ldots, e^{ner}_n)$, $e^{ner}_i \in \mathbb{R}^{m}$, where $m$ is the dimension of the label embeddings.
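To make the linear-chain CRF scoring and Viterbi decoding described above concrete, the following pure-Python sketch scores a tag sequence as the sum of emission and transition scores and recovers the highest-scoring sequence. The emission and transition values are made-up toy numbers, and the function names are illustrative, not the authors' implementation.

```python
from typing import List


def sequence_score(
    emissions: List[List[float]], transitions: List[List[float]], tags: List[int]
) -> float:
    """Emission scores s_{i,y_i} plus transition scores a_{y_{i-1},y_i}."""
    score = sum(emissions[i][tag] for i, tag in enumerate(tags))
    score += sum(transitions[tags[i - 1]][tags[i]] for i in range(1, len(tags)))
    return score


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the tag sequence with the highest linear-chain score."""
    n_tags = len(emissions[0])
    best = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_best, pointers = [], []
        for cur in range(n_tags):
            prev = max(range(n_tags), key=lambda p: best[p] + transitions[p][cur])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][cur] + emit[cur])
        best = new_best
        backpointers.append(pointers)
    tag = max(range(n_tags), key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(backpointers):  # walk back to the first token
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Toy example with two tags (0 and 1) and three tokens.
emissions = [[1.0, 2.0], [2.0, 1.5], [0.5, 0.4]]
transitions = [[0.0, 0.5], [0.2, -5.0]]  # strongly discourages 1 -> 1
best_path = viterbi(emissions, transitions)
best_score = sequence_score(emissions, transitions, best_path)
```

In this toy setting the large negative transition score prevents two consecutive occurrences of tag 1, which is the kind of label-consistency constraint a CRF layer can learn.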

Event-Types Proposal
Event-types proposal (ETP) is an auxiliary task for event extraction. The task aims to predict the possible event types in the sentence regardless of which trigger word evokes them. The event-types proposal layer employs hard parameter sharing, the most common approach used in multi-task learning, to share the same sentence encoder with NER. We take the BERT output for the first token and apply a dense layer with a non-linear activation to predict the event types in the sentence:

$$E_{tp} = f(W_p z_0 + b_p) \tag{7}$$

where $z_0$ is the first BERT output vector, $W_p \in \mathbb{R}^{|tp| \times h}$ is the transformation matrix (with $h$ here denoting the dimension of $z_0$), $b_p \in \mathbb{R}^{|tp|}$ is the bias vector, $|tp|$ is the number of predefined event types, and $f(\cdot)$ stands for the sigmoid function, which allows multiple events to exist in the same sentence. We use a criterion that measures the binary cross-entropy between the target and the output. The loss function of the event-types proposal is:

$$L_{ETP} = -\sum_{W \in T} \sum_{i=1}^{|tp|} \left[ E^{*}_{tp_i} \log p_{tp}(E_{tp_i} \mid W) + (1 - E^{*}_{tp_i}) \log\left(1 - p_{tp}(E_{tp_i} \mid W)\right) \right] \tag{8}$$

where $T$ represents the training set, $E^{*}_{tp}$ is the gold-standard event-type indicator for sequence $W$, and $E^{*}_{tp_i} \in \{0, 1\}$ is its $i$-th entry. $p_{tp}(E_{tp_i} \mid W)$ is calculated by applying the sigmoid function to each event type, as in Eq. (7). The predicted event types have two uses. On the one hand, the event-types proposal is a simple auxiliary task that can cooperate with the event detection task. On the other hand, it captures complex dependencies among the event types in a sentence.
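The multi-label nature of the event-types proposal can be illustrated with a short sketch: each event type receives an independent sigmoid probability, and the binary cross-entropy is summed over types. The logits, the gold vector, and the 0.5 decision threshold are assumptions made for this example; the source does not state a threshold.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def propose_event_types(
    logits: List[float], threshold: float = 0.5
) -> Tuple[List[float], List[int]]:
    """Independent sigmoid per event type; several types may be active.

    The threshold is an assumed hyperparameter, not taken from the paper.
    """
    probs = [sigmoid(v) for v in logits]
    proposed = [i for i, p in enumerate(probs) if p > threshold]
    return probs, proposed


def binary_cross_entropy(probs: List[float], gold: List[int]) -> float:
    """Sum over event types of the binary cross-entropy terms."""
    return -sum(
        g * math.log(p) + (1 - g) * math.log(1.0 - p)
        for p, g in zip(probs, gold)
    )


# Toy logits for three hypothetical event types.
probs, proposed = propose_event_types([2.0, -1.0, 0.5])
loss = binary_cross_entropy(probs, gold=[1, 0, 1])
```

Because the probabilities are not normalized across types, two event types can both exceed the threshold for the same sentence, which a softmax over types would discourage.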

Event Detection
A complete trigger candidate is extracted whenever an O label follows an I-Type label or a B-Type label. The predicted event-type vector is expanded along the token dimension and concatenated with the BERT contextual representation of each token; a softmax layer then yields the label category of every token:

$$p(y_i \mid X) = \mathrm{softmax}\left(W_c\, [z_i ; \mathrm{expand}(E_{tp})] + b_c\right)$$

where $W_c \in \mathbb{R}^{m \times d}$ is the parameter matrix, $b_c \in \mathbb{R}^{m}$ is the bias vector, and expand is the dimensional extension function. According to the obtained label probability distribution, the event-type prediction label corresponding to each token can be obtained.
The loss function is the cross-entropy between the target and the output for tokens:

$$L_{ED} = -\sum_{X \in T} \sum_{i=1}^{n} \log p(y^{*}_i \mid X)$$

where $T$ represents the training set and $y^{*}_i$ is the gold label of token $x_i$. The tags are converted to embeddings by looking up an embedding layer. We obtain the label-embedding sequence $e^{ed} = (e^{ed}_1, \ldots, e^{ed}_n)$, $e^{ed}_i \in \mathbb{R}^{m}$, where $m$ is the dimension of the label embeddings.
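The rule that a trigger candidate ends when an O label follows a B-Type or I-Type label can be sketched as a small decoder over per-token labels. The label strings (`B-Die`, `I-Attack`, and so on) follow the usual BIO convention and are assumptions for illustration; the function name is hypothetical.

```python
from typing import List, Tuple


def extract_triggers(
    tokens: List[str], labels: List[str]
) -> List[Tuple[str, str, int]]:
    """Return (trigger text, event type, start index) for every BIO span."""
    triggers = []
    start, current = None, None
    # A sentinel "O" closes a span that runs to the end of the sentence.
    for i, label in enumerate(labels + ["O"]):
        continues = label.startswith("I-") and label[2:] == current
        if start is not None and not continues:
            triggers.append((" ".join(tokens[start:i]), current, start))
            start, current = None, None
        if label.startswith("B-"):
            start, current = i, label[2:]
    return triggers


tokens = ["a", "cameraman", "died", "when", "a", "tank", "fired"]
labels = ["O", "O", "B-Die", "O", "O", "O", "B-Attack"]
triggers = extract_triggers(tokens, labels)
```

An I-Type label that does not continue an open span of the same type is ignored in this sketch; other conventions for malformed label sequences are possible.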

Token Pair Attention
Vanilla table filling takes into account only the two candidate tokens when predicting the role of the argument. We use a token pair attention mechanism to capture information between trigger words and arguments. Specifically, the attention score of the token pair $\langle x_i; x_j \rangle$ for the $t$-th token in a given sentence is calculated by the following equation:

$$a^{t}_{ij} = \begin{cases} 0, & t \in \{i, j\} \\ \dfrac{\exp\left(V_{ij}^{\top} W_q V^{e}_{t}\right)}{\sum_{t' \notin \{i, j\}} \exp\left(V_{ij}^{\top} W_q V^{e}_{t'}\right)}, & \text{otherwise} \end{cases}$$

where $V_{ij}$ is the average of $V^{e}_i$ and $V^{e}_j$, $W_q$ is the attention parameter, and $a^{t}_{ij}$ is equal to 0 when $t$ is equal to $i$ or $j$. The reason for this masking is that the representations of the token pair $\langle x_i; x_j \rangle$ are already used directly in the table-filling stage (see Eqs. (15) and (16)). Thus, we mask the token pair itself when performing the attention calculation. The attentive result for the token pair $\langle x_i; x_j \rangle$ is computed by the following equation:

$$s_{ij} = \sum_{t=1}^{k} a^{t}_{ij} V^{e}_{t}$$

where $a_{ij} = (a^{1}_{ij}, a^{2}_{ij}, \ldots, a^{k}_{ij})$. As a weighted average of the sentence representation, $s_{ij}$ focuses on contextual information that is useful for table filling.
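The masking strategy of the token pair attention can be sketched in a few lines. The bilinear scoring form (pair average times a square matrix times the token vector) is an assumption made for this example, since the text only states that a single attention parameter is used; the feature values are toy numbers.

```python
import math
from typing import List, Tuple


def token_pair_attention(
    features: List[List[float]], w_q: List[List[float]], i: int, j: int
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, attentive result) for the token pair (i, j).

    Assumes a square w_q and at least one token other than i and j.
    """
    dim = len(features[0])
    # Pair query: average of the two token representations.
    pair = [(features[i][d] + features[j][d]) / 2.0 for d in range(dim)]
    # Assumed bilinear score: pair^T w_q token.
    query = [sum(pair[r] * w_q[r][c] for r in range(dim)) for c in range(dim)]
    scores = {
        t: sum(query[c] * vec[c] for c in range(dim))
        for t, vec in enumerate(features)
        if t not in (i, j)  # mask the pair itself
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    weights = [exps.get(t, 0.0) / total for t in range(len(features))]
    context = [
        sum(weights[t] * features[t][d] for t in range(len(features)))
        for d in range(dim)
    ]
    return weights, context


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = token_pair_attention(features, identity, i=0, j=2)
```

The masked positions receive exactly zero weight, so the attentive result summarizes only the context surrounding the pair.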

Table Filling
The event embedding ($e^{ed}$) and the NER embedding ($e^{ner}$) are then concatenated with the BERT contextual representation to form the final feature representation $V = (V^{e}_1, \ldots, V^{e}_k) = [e^{ed} ; e^{ner} ; Z]$. Let $x_i$ and $x_j$ be two words, $Y(x_i, x_j)$ be all possible role relations, and $s(x_i, x_j, r)$ be a scoring function that assesses $x_i$ and $x_j$ for the role type $r$:

$$s(x_i, x_j, r) = \left[ V\, \delta\left(W V^{e}_i + U V^{e}_j + M s_{ij}\right) \right]_r \tag{15}$$

We can further get the conditional probability of the role type $r$ given $x_i$ and $x_j$ through the softmax function:

$$p(r \mid x_i, x_j) = \frac{\exp(s(x_i, x_j, r))}{\sum_{r' \in Y(x_i, x_j)} \exp(s(x_i, x_j, r'))} \tag{16}$$

Here, $\delta(\cdot)$ is an element-wise non-linear activation function (e.g., tanh), and $\{W, U, V, M\}$ are the transformation matrices.
Based on the probability distribution of table filling, the prediction for each table cell is defined as

$$\hat{r}_{ij} = \arg\max_{r \in Y(x_i, x_j)} p(r \mid x_i, x_j) \tag{17}$$

The loss function is the cross-entropy between the target and the output for all the table cells:

$$L_{TF} = -\sum_{x \in T} \sum_{i=1}^{n} \sum_{j=1}^{n} \log p(r^{*}_{ij} \mid x_i, x_j) \tag{18}$$

where $x$ represents a sentence in the training set $T$, $x_k$ is the $k$-th word of $x$, $n$ is the sentence length, $R$ is the set of argument roles between $x_i$ and $x_j$, and $r^{*}_{ij} \in R$ is the gold role of the cell.
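The cell-wise prediction step of table filling can be sketched as follows: every cell holds one score per role, a softmax turns the scores into probabilities, and the cell is filled with the most probable role. The role names and scores are toy values chosen for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fill_table(
    cell_scores: List[List[List[float]]], roles: List[str]
) -> List[List[str]]:
    """Fill each (i, j) cell with the role of highest softmax probability."""
    table = []
    for row in cell_scores:
        filled_row = []
        for scores in row:
            probs = softmax(scores)
            best = max(range(len(roles)), key=lambda r: probs[r])
            filled_row.append(roles[best])
        table.append(filled_row)
    return table


roles = ["NONE", "Victim", "Place"]
cell_scores = [
    [[2.0, 0.1, 0.1], [0.2, 3.0, 0.5]],
    [[0.3, 0.2, 2.5], [1.5, 0.0, 0.0]],
]
table = fill_table(cell_scores, roles)
```

In the full model, the scores of cell (i, j) would come from the scoring function over the two token representations and the attentive result for that pair.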
In our framework, there are two main tasks, event detection (ED) and table filling (TF), and two auxiliary subtasks, NER and ETP. For joint event extraction, the loss function is the sum of the losses of these four tasks: $L = L_{ED} + L_{ETP} + L_{NER} + L_{TF}$. The loss is averaged over shuffled mini-batches, and the derivatives with respect to each parameter are computed via backpropagation.

Dataset, Resources and Evaluation Metric
We evaluated our framework on the ACE 2005 dataset. The ACE 2005 dataset annotates 33 event subtypes and 36 role classes. Together with the NONE class and the BIO annotation schema, each token is assigned to one of 67 categories in event detection and one of 37 categories in argument extraction. To be consistent with previous work, we used the same data split as in [2,7,8,9]. This data split includes 40 newswire articles (881 sentences) for the test set, 30 other documents (1,087 sentences) for the development set, and the 529 remaining documents (21,090 sentences) for the training set.
We then used precision, recall, and F1 score to evaluate the performance as done in previous work [2,8,9].
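For reference, precision, recall, and F1 over sets of predicted and gold items can be computed as in the sketch below. Representing each item as a (sentence id, token offset, event type) tuple is an assumption made for this example, as are the toy predictions.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[tuple], gold: Set[tuple]) -> Tuple[float, float, float]:
    """A prediction counts as correct only if it exactly matches a gold item."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy items: (sentence id, token offset, event type).
predicted = {(0, 2, "Die"), (0, 6, "Attack"), (1, 3, "Injure")}
gold = {(0, 2, "Die"), (0, 6, "Attack"), (1, 4, "Attack"), (2, 1, "Meet")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here two of three predictions are correct and two of four gold items are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.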

Hyperparameter Setting
For all the experiments below, we used 200 and 300 dimensions for the tag embeddings in the ETP layer and in event detection, respectively. We set the maximum sentence length to n = 120, padding shorter sentences and truncating longer ones. The batch size was 64, the dropout rate was 0.5, and the learning rate was 0.05. Adam was used to optimize the neural networks. The models were trained on an NVIDIA GTX 1080 Ti GPU.
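The fixed-length preprocessing described above can be sketched as a small helper. The padding id of 0 and the accompanying attention mask are assumptions for this example; the source only states the maximum length of 120.

```python
from typing import List, Tuple


def pad_or_truncate(
    token_ids: List[int], max_len: int = 120, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Cut sequences longer than max_len and pad shorter ones with pad_id.

    Also returns a mask with 1 for real tokens and 0 for padding.
    """
    clipped = token_ids[:max_len]
    n_pad = max_len - len(clipped)
    mask = [1] * len(clipped) + [0] * n_pad
    return clipped + [pad_id] * n_pad, mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_len=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 201)))
```

The mask lets downstream layers ignore the padded positions when computing attention or losses.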

Baselines and Main Results
To evaluate the performance of the proposed method, we compared our model with four competitive baselines: 1) DMCNN [2], which uses dynamic multi-pooling to retain information about multiple events; 2) JRNN [7], which uses a bi-directional RNN and manually designed features to jointly extract event triggers and arguments; 3) dbRNN [8], which adds dependency bridges over a bi-directional LSTM for event extraction; and 4) JMEE [9], which uses attention-based graph information aggregation for multiple event extraction.
Tab. 1 shows the results of comparing our model with the baseline methods. Our framework achieved the best F1 scores in trigger recognition and trigger classification, which were 1.6% higher than those of the best-reported models. However, it did not achieve the best F1 score for argument role classification. In summary, these results demonstrate the effectiveness of incorporating global event-type guidance and attention enhancement into our method.

Effect of ETP Layer for Extracting Multiple Events
To evaluate the effect of the ETP layer on the multiple event phenomenon, we divided the test data into two parts (1/1 and 1/N) following [2,9,10] and evaluated each part separately. Here, 1/1 means that a sentence has only one trigger, or that an argument plays only one role in the sentence; otherwise, the sentence belongs to 1/N. Tab. 2 reports the performance (F1 scores) of JMEE [9], JRNN [7], DMCNN [2], and our framework (with and without the ETP layer) in the trigger classification and argument role classification subtasks. The table shows that our framework with the ETP layer achieved the best F1 scores, which were 1.6% higher than those of the best-reported models. Without the ETP layer, the F1 score decreased from 75.3% to 73.1%. These results indicate that the proposed ETP layer is effective.
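For the trigger side, the 1/1 versus 1/N partition of the test data can be sketched as below. The dictionary layout with a `triggers` list is a hypothetical data format, and sentences without any trigger are simply left out of both parts in this sketch.

```python
from typing import Dict, List, Tuple


def split_single_multi(sentences: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition sentences into 1/1 (one trigger) and 1/N (several triggers)."""
    single, multi = [], []
    for sentence in sentences:
        n_triggers = len(sentence["triggers"])
        if n_triggers == 1:
            single.append(sentence)
        elif n_triggers > 1:
            multi.append(sentence)
    return single, multi


sentences = [
    {"id": 0, "triggers": ["died", "fired"]},
    {"id": 1, "triggers": ["elected"]},
    {"id": 2, "triggers": []},
]
single, multi = split_single_multi(sentences)
```

Each part would then be scored separately with the same precision, recall, and F1 procedure used for the full test set.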

Analysis of Attention Mechanism
As Tab. 3 shows, the results were not ideal when table filling alone was used for event argument extraction. The F1 score increased by 5.2% when the attention mechanism was added to table filling, indicating that the attention mechanism helps improve argument extraction.
We used the sentence "In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel" as an example to illustrate, through a heat map of attention scores (Fig. 2), the features captured by our attention mechanism. There are two events in the sentence: an Attack event triggered by fired and a Die event triggered by died. Additionally, the entities Baghdad, cameraman, tank, and Palestine Hotel play an important role in the Die and Attack events. As Fig. 4 shows, the trigger words "died" and "fired" have relatively strong connections with Baghdad, cameraman, tank, and Palestine Hotel in the Die and Attack events, possibly because the attention mechanism captures information between triggers and arguments.

Conclusion
In this work, we proposed global event-type guidance and attention enhancement to improve event detection and argument extraction. The approach exploits the pre-predicted event types to guide event detection, which strengthens the modeling of the interdependencies among multiple events in a sentence. Moreover, for argument extraction, we use the table-filling method with an attention mechanism to obtain the correlation information between triggers and arguments. The experimental results on the ACE 2005 dataset indicate that our proposed model is effective and superior to several strong baseline methods.
As the relationship between arguments among multiple events has not yet been considered, we will examine its influence on the extraction of arguments in the future.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.