Intelligent Automation & Soft Computing
Attention Weight is Indispensable in Joint Entity and Relation Extraction
1Key Laboratory of Intelligent Computing and Information Processing, Ministry of Education, College of Computer Science, Xiangtan University, Xiangtan, 411100, China
2Department of Computer Science, University of Georgia, Athens, USA
*Corresponding Author: Jianquan Ouyang. Email: firstname.lastname@example.org
Received: 08 February 2022; Accepted: 11 April 2022
Abstract: Joint entity and relation extraction (JERE) is an important foundation for unstructured knowledge extraction in natural language processing (NLP). Thus, designing efficient algorithms for it has become a vital task. Although existing methods can efficiently extract entities and relations, their performance should be improved. In this paper, we propose a novel model called Attention and Span-based Entity and Relation Transformer (ASpERT) for JERE. First, differing from the traditional approach that only considers the last hidden layer as the feature embedding, ASpERT concatenates the attention head information of each layer with the information of the last hidden layer by using an attentional contribution degree algorithm, so as to remain the key information of the original sentence in a deep transferring of the pre-trained model. Second, considering the unstable performance of the linear span classification and width embedding structure of the SpERT, ASpERT uses a multilayer perceptron (MLP) and softmax-based span classification structure. Ablation experiments on the feature embedding and span classification structures both show better performances than SpERT’s. Moreover, the proposed model achieved desired results on three widely-used domain datasets (SciERC, CoNLL04, and ADE) and outperforms the current state-of-the-art model on SciERC. Specifically, the F1 score on SciERC is 52.30%, that on CoNLL04 is 71.66%, and that on ADE is 82.76%.
Keywords: Attentional contribution degree; joint entity and relation extraction; BERT; span
1 Introduction
Entity and relation extraction (ERE) has received much attention as a fundamental task in NLP, especially in specific domains (e.g., science, journalism, and medicine). The purpose of ERE is to extract structured triplets automatically from unstructured or semistructured natural language texts. A triplet consists of two entities and the relationship between them, and a sentence may contain multiple triplets. Owing to nested entities and overlapping relations, the extracted triplets may have similar or identical entities, and a triplet itself may contain two identical entities (with different relationships).
ERE is divided into pipeline ERE and joint ERE (JERE) [2–5]. Their difference lies in the execution order of the two subtasks, named entity recognition (NER) [6–9] and relation extraction (RE). Specifically, pipeline ERE first extracts entities from the text and then extracts relations between every pair of entities. In this serial execution, the success of RE depends heavily on the results of NER, and the lack of information interaction between NER and RE causes errors to accumulate. Compared with the pipeline method, the joint method uses a parameter sharing or joint decoding mechanism between NER and RE. Such a mechanism enhances the information interaction between NER and RE, reduces the dependence of RE on NER results, and improves the accuracy of ERE. JERE includes three directions: tagging, table filling, and sequence to sequence (Seq2Seq). Most studies favor methods based on BIO/BILOU labels, but some of the resulting algorithms incur prohibitive computational costs. Unlike BIO/BILOU labels, span-based methods can efficiently identify nested entities, such as “phenytoin” within “phenytoin toxicity.”
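To make the nested-entity point concrete, the following is a minimal sketch (a hypothetical helper, not from the paper) of how span-based methods enumerate every contiguous token subsequence up to a maximum width, so that both a nested entity and its enclosing entity appear as candidates:

```python
def enumerate_spans(tokens, max_width=10):
    """Return all (start, end) spans (end exclusive) up to max_width tokens."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_width, len(tokens)) + 1):
            spans.append((start, end))
    return spans

tokens = ["phenytoin", "toxicity"]
spans = enumerate_spans(tokens)
# Both the nested span (0, 1) -> "phenytoin" and the enclosing
# span (0, 2) -> "phenytoin toxicity" are candidates.
```

A BIO/BILOU tagger must assign one label per token and therefore cannot mark both spans at once; span enumeration sidesteps this at the cost of a quadratic candidate set, which is why a maximum width is imposed.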
Known as state-of-the-art span-based JERE, Span-based Entity and Relation Transformer (SpERT)  uses a sufficient number of strong negative samples and localized context to construct lightweight inference of BERT  embeddings, but this model still has two main flaws. First, SpERT focuses on learning span representation and lacks clear boundary supervision of entities. That is, the model relies on a width embedding layer to train the span length and directly classifies the sampled span through a fully connected layer. Second, many BERT-based JERE models (including SpERT) do not fully exploit domain-specific information. The semantic learning of sentences by using these models mainly comes from the coding information of the last hidden layer obtained through fine-tuning the BERT model, which limits the model’s performance.
To solve the problems mentioned above, we propose the Attention and Span-based Entity and Relation Transformer (ASpERT), a JERE model based on the attentional contribution degree and an MLP-softmax span classification structure. In ASpERT, a more expressive MLP is added to enhance entity boundary detection. In addition, in JERE studies on the Transformer [17,18], multihead self-attention is used to capture interactions among tokens, but only the last hidden layer is considered as the feature embedding for downstream tasks. In this paper, we develop a novel attentional contribution degree algorithm, which concatenates the softmax score of the attention head with the hidden layer feature embedding. Viewed as a training strategy, this algorithm retains strong inter-word attention by backpropagating through the query and key vectors of the pre-trained model. Finally, weighted joint optimization of the multitask loss function is conducted during training.
ASpERT is compared with state-of-the-art methods on three datasets, SciERC, CoNLL04, and ADE (public dataset repository address: http://lavis.cs.hs-rm.de/storage/spert/public/datasets/). Specifically, our model shows a significant performance improvement, with a 1.39% increase in F1 score compared with the baseline model (SpERT). Our model outperforms the current state-of-the-art model on the SciERC dataset and achieves competitive results on CoNLL04 and ADE. In addition, we investigate how to set contribution thresholds and fusion methods more efficiently. In ablation experiments, we demonstrate the effectiveness of the novel span classification structure and the attentional contribution degree algorithm.
The contributions of our work can be summarized as follows:
a) We analyze the reasons for the inaccurate boundary recognition of SpERT and propose a simple and effective span classification structure to alleviate this problem.
b) We propose an attentional contribution degree algorithm to enhance the model with strong attention between words by backpropagation.
c) Experiments show that our model achieves outstanding performance on domain-specific datasets (SciERC, CoNLL04, and ADE) in science, news, and medicine. In particular, it surpasses the current state of the art on SciERC.
2 Related Work
Acting as an implementation of ERE, the pipeline method [19,20] executes NER and RE in series. Herein, NER methods can be categorized into rule-based, dictionary-based, and machine learning-based methods [22–24]. RE methods can be divided into handcrafted feature-based methods and neural network-based methods [26–28]. Although the pipeline method has been successfully applied in some fields, the sequential execution of NER and RE ignores the correlation between the two tasks, which limits the further development of these methods.
To alleviate the above limitations, researchers proposed JERE, including feature-based methods [29,30] and neural network-based methods [31–35]. Because feature-based models are limited in expressive capability, later studies are mainly based on neural networks. Research on JERE follows three main directions: tagging, table filling, and Seq2Seq. Zheng et al. proposed a novel tagging scheme, which assigns a tag to each word (including word position, relation type, and relation role) for classification. Table filling usually constructs a two-dimensional table; the solutions of NER and RE then become the problems of labeling diagonal and nondiagonal elements in the table, respectively. These methods allow a single model to execute NER and RE simultaneously but cannot fully exploit the table structure. Wang et al. proposed learning two separate encoders (a table encoder and a sequence encoder), which effectively alleviates this problem. The Seq2Seq method first retains sentence features and then extracts triplets in sequence. CopyRE, the most typical method, is based on the copy mechanism and Seq2Seq structure, but it can only extract individual words. In response to this problem, Zeng et al. proposed a multitask learning method based on BIO labeling.
The aforementioned methods are all based on the BIO/BILOU scheme, and they face a common problem: nested entities. To solve this problem, Takanobu et al. adopted a hierarchical reinforcement learning framework. In this framework, entities and relations are divided into different levels, and the semantic information detected by high-level relations is used in extracting low-level entities. The two levels alternate back and forth to achieve JERE. Dai et al. proposed a position-attention mechanism to solve this problem. It uses tag sequences of the same length as the sentence to annotate each word. Although these methods alleviate the nested entity problem, they incur an immense computational burden.
An alternative to the BIO/BILOU scheme is the span-based method , which performs a detailed search on all spans to prevent the interference of nested entities on JERE results. This method enhances the interaction among tasks by refining the span representation, allowing the model to learn useful information from a broader context. The methods include the bi-LSTM-based span-level model proposed by Dixit et al.  and the dynamic span graph approach through soft coreference and relation links proposed by Luan et al. . To improve the performance of the span method further, Wadden et al.  replaced the BiLSTM encoder with Transformers and combined it with BERT encodings and graph propagation to capture context relevance. Recently, Eberts and Ulges’ SpERT  found localized context representation and strong negative sampling to be of vital importance. Although SpERT is the state-of-the-art model for span-based JERE, it suffers from underutilization of BERT encoding information and inaccurate identification of span boundaries.
3 The SpERT Baseline
In this section, we introduce the baseline model, SpERT. It uses pretrained BERT as its core, tokenizes the input sentence, and applies span classification, span filtering, and relation classification. Specifically, it classifies each span into entity types, filters nonentities, and categorizes all candidate entity pairs. To train the classifier efficiently, SpERT uses negative samples at the model training stage.
3.1 Negative Sampling
Negative sampling is performed on each sentence in the corpus. A fixed number of negative samples are randomly sampled from each sentence and labeled with none; these are combined with the positive samples of existing labels in the corpus to form training samples (including candidate spans and candidate entity pairs). The training samples are then used to learn the span and relation classifiers. For the span classifier, SpERT selects subsequences that do not belong to the positive span samples and are shorter than 10 words as negative span samples. For the relation classifier, SpERT selects entity pairs without any relation labels from the positive span samples as negative relation samples (see the supplementary file for details).
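A minimal sketch of this sampling scheme follows; the function names and list-based annotation format are illustrative assumptions, not the authors' implementation.

```python
import random

def sample_negative_spans(tokens, gold_spans, max_width=10, k=100, rng=random):
    """Draw up to k spans (<= max_width tokens) that are not gold entities."""
    candidates = [(i, j) for i in range(len(tokens))
                  for j in range(i + 1, min(i + max_width, len(tokens)) + 1)
                  if (i, j) not in gold_spans]
    return rng.sample(candidates, min(k, len(candidates)))

def sample_negative_relations(gold_spans, gold_relations):
    """All ordered pairs of gold entities that carry no relation label."""
    return [(h, t) for h in gold_spans for t in gold_spans
            if h != t and (h, t) not in gold_relations]
```

Note that negative relation samples are drawn only from gold (positive) entity pairs, which SpERT found to be a strong negative signal, rather than from arbitrary span pairs.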
3.2 Span Classification
The span classifier of SpERT consists of a fully connected layer and a softmax layer, and takes any candidate span s := (e_i, e_{i+1}, ..., e_{i+k}) as input (where e_i represents the i-th token embedding). Its output is the entity class probability ŷ^s of this candidate span (where ∘ denotes concatenation):

e(s) := f(e_i, e_{i+1}, ..., e_{i+k}) ∘ w_{k+1}   (1)
x^s := e(s) ∘ c   (2)
ŷ^s := softmax(W^s · x^s + b^s)   (3)

where e_i is the last hidden layer embedding from the fine-tuned BERT. w_{k+1} is the width embedding, which learns the width of each candidate span from a dedicated embedding matrix. f is the maximum pooling function. c is the last hidden layer embedding of BERT's special [CLS] token. W^s is the trainable weight, and b^s is the bias. b_h is the dimension of BERT's last hidden layer, b_w is the dimension of w_{k+1}, and E is the number of entity classes (including none). softmax(·) is the softmax activation function.
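The forward pass of this classifier can be sketched in a few lines of numpy; all shapes and the random weights below are toy values for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def span_class_probs(token_embs, width_emb, cls_emb, W, b):
    """token_embs: (k+1, b_h) last-hidden-layer embeddings of the span's tokens."""
    e_s = np.concatenate([token_embs.max(axis=0), width_emb])  # max-pool + width
    x_s = np.concatenate([e_s, cls_emb])                       # append [CLS]
    return softmax(W @ x_s + b)                                # class probabilities

rng = np.random.default_rng(0)
probs = span_class_probs(rng.normal(size=(3, 8)),   # span of 3 tokens, b_h = 8
                         rng.normal(size=4),        # width embedding, b_w = 4
                         rng.normal(size=8),        # [CLS] embedding
                         rng.normal(size=(5, 20)),  # 5 entity classes incl. none
                         np.zeros(5))
# probs is a distribution over the 5 entity classes
```

The max-pooling over token embeddings is the step Section 3.5 later criticizes: a single dominant token embedding ends up representing the whole span.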
3.3 Span Filtering
The entity classes include the predefined entity types (Tab. 2) and the none label for spans that do not constitute entities. In accordance with the output of the span classifier (Eq. (3)), the entity class with the highest probability is selected as the predicted result. If the predicted probability of the none label is the largest, then the candidate span does not constitute an entity.
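The filtering rule is a plain argmax; the class list below is a hypothetical subset used only for illustration.

```python
ENTITY_CLASSES = ["none", "Task", "Method", "Metric"]  # hypothetical subset

def filter_span(probs):
    """Return the predicted entity type, or None if the 'none' class wins."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return None if ENTITY_CLASSES[best] == "none" else ENTITY_CLASSES[best]
```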
3.4 Relation Classification
The relation classifier consists of a fully connected layer and a sigmoid layer. The input of the classifier is any candidate entity pair (s1, s2), and the output is the relation class probability ŷ^r of this candidate entity pair:

x^r := e(s1) ∘ c(s1, s2) ∘ e(s2)   (4)
ŷ^r := σ(W^r · x^r + b^r)   (5)

where e(s1) and e(s2) are the BERT/width embeddings (Eq. (1)) of the head entity and the tail entity in the candidate entity pair (s1, s2). c(s1, s2) is the localized context representation. Specifically, SpERT places the span between the head entity and the tail entity into the fine-tuned BERT for encoding and obtains c(s1, s2). If this span is empty, then c(s1, s2) is set to zero in Eq. (4). W^r is the trainable weight, b^r is the bias, and R is the number of relation classes (including none). σ is the sigmoid activation function. Given a threshold α, any relation class probability greater than α is considered activated. If none is activated, then this entity pair has no known relation. For example, if the predicted probabilities of the entity pair (s1, s2) with respect to r1, r2, and none are 0.43, 0.47, and 0.1, respectively, then two types of relationships hold between s1 and s2. If the predicted probabilities are 0.59, 0.0, and 0.41, respectively, then no relationship exists for that entity pair (the threshold α is set at 0.4).
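The decision rule, mirroring the worked example in the text, can be sketched as follows; the relation class names are illustrative placeholders.

```python
RELATION_CLASSES = ["r1", "r2", "none"]  # hypothetical class order

def decode_relations(sigmoid_scores, alpha=0.4):
    """Return activated relation classes; an activated 'none' wins outright."""
    activated = [RELATION_CLASSES[i] for i, p in enumerate(sigmoid_scores)
                 if p > alpha]
    return [] if "none" in activated else activated

decode_relations([0.43, 0.47, 0.10])  # both r1 and r2 hold
decode_relations([0.59, 0.00, 0.41])  # 'none' activated -> no relation
```

Because the scores come from independent sigmoids rather than a softmax, multiple relation classes can be activated simultaneously for one entity pair.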
3.5 Problems of SpERT
As mentioned in the Introduction, we determine that SpERT has two problems. First, SpERT's classifier lacks clear boundary supervision on the span. Width embedding is the only mechanism constraining span width. Because spans vary in length, spans composed of different numbers of words have distinct characteristics. SpERT specifically learns a width embedding matrix through backpropagation; hence, it should play a key role in entity boundary supervision. To evaluate the effectiveness of width embedding, we test two different training models on three datasets:
• SpERT: It uses the default structure settings, which provide the width embeddings that need to be learned by backpropagation (Eq. (1)).
• SpERT': The variant model of SpERT that removes the width embedding in the span and relation classifiers, while keeping the other default structure settings of the model.
As shown in Tab. 1, the addition of width embedding is unreliable for improving the performance of the span classifier. On the SciERC dataset in particular, the NER F1 score of the SpERT model with width embedding decreases by 0.75%. Our analysis suggests three reasons. First, the model lacks boundary supervision when facing a complex dataset. The SciERC dataset is more complicated than the other two: it is larger than CoNLL04 in dataset size and has three times as many entity classes as ADE. Second, the width embedding of SpERT only learns the span width and cannot fundamentally solve the problem of inaccurate boundary recognition; consequently, performance degradation is expected. Third, because the span classifier of SpERT is only a fully connected layer, the model is overly dependent on BERT encoding. For example, when the extraction target is the “geometric estimation problem,” the model extracts the correct span while also extracting the semantically similar wrong span “selection of geometric estimation problems,” which degrades model performance.
In addition, many experiments have shown that the BERT model effectively extracts text information. If the text data are domain-specific (e.g., science, news, and medicine), we may need to consider creating a domain-specific language model. Relevant models have been created by training the BERT architecture on a domain-specific corpus rather than the general English text corpus used to train the original BERT model. However, because pretraining BERT requires a large corpus, we cannot use this method to improve the model's extraction of domain-specific text. We therefore need another way to mine the unexploited information in the pretrained Transformer model under the existing conditions. At present, the input for downstream tasks of many mainstream models (including SpERT) comes only from the last hidden layer embedding of BERT, ignoring the interactive information among words carried by the BERT attention heads themselves. To this end, we provide a novel attentional contribution degree algorithm, which combines the softmax attention head score with the hidden layer feature embedding to improve the model's extraction of entities and relations.
4 Our Method
In this section, we choose SpERT as the baseline model, analyze SpERT’s problems of inaccurate span recognition and insufficient information mining in specific fields, and propose a novel ASpERT model (Fig. 1). Then, we introduce a novel attentional contribution degree algorithm and a multitask training method that combines span and relation classifiers.
4.1 Novel Structure for the Span Classifier
We consider that the span classifier is different from the traditional classifier, such as fully connected layer and softmax layer. In addition to classifying the span, it also needs to predict which words belong to the entity boundary. Thus, we propose a span classification structure that considers these two functions.
The BERT embedding of the candidate span and the BERT embedding of the special [CLS] token are the main sources of textual semantic information (Eq. (1)). The special [CLS] token represents the complete sentence information in the classification task, whereas maximum pooling of the span's BERT embeddings serves only span classification. A candidate span includes one or more words, and BERT assigns an embedding to each word through fine-tuning. Maximum pooling (Eq. (1)) over the BERT embeddings of the candidate span is equivalent to selecting the largest single word embedding to represent the semantic information of the whole span. SpERT thus excessively strengthens key information at the expense of marginal information and the associations among words, resulting in unclear boundary supervision of entities. This explains well why “geometric estimation problems” and “selection of geometric estimation problems” have similar span class probabilities. For these reasons, we add the attentional contribution degree to the span representation as the boundary confidence of span classification. Specifically, we first concatenate all the attention heads with the residuals of the span classifier, then remove the lower attention scores, take the average (via the attentional contribution degree algorithm in Section 4.2), and combine the result with the feature embedding. Finally, the fine-tuning of the pre-trained model is constrained by backpropagation learning. The attentional contribution degree incorporates word-to-word attention as well as residual concatenation, allowing the model to retain the original information as depth increases. The specific improvements to Eqs. (1) and (2), respectively, are as follows:
e(s) := f(e_i, e_{i+1}, ..., e_{i+k}) ∘ w_{k+1} ∘ a(s)   (6)
x^s := e(s) ∘ c   (7)

where f, w_{k+1}, and c are obtained in the same way as in Eq. (1), and a(s) is the attentional contribution degree of the candidate span. The details of the calculation of a(s) are described in the Attentional Contribution Degree Algorithm section (Section 4.2).
JERE tasks are usually converted into one or more classification tasks at the end. Therefore, the classifier's quality depends on whether high-dimensional data can be accurately mapped to a given category. SpERT's span classifier is a linear fully connected layer. Few real data strictly follow a linear distribution once noise is introduced, so a simple linear structure cannot accurately predict the span class. Recently, the MLP has been revisited for visual classification. For transfer learning, we use an MLP for span classification, hoping that the added parameters improve the representational capability of the classifier. The improvement to Eq. (3) is as follows:
ŷ^s := softmax(W_2^s · ReLU(W_1^s · x^s + b_1^s) + b_2^s)   (8)

where ŷ^s is the entity class probability. W_1^s and W_2^s are the trainable weights, and b_1^s and b_2^s are the biases. H is the number of BERT's attention heads (the dimension of a(s)), and m is the number of hidden layer units of the MLP. ReLU(·) is the ReLU activation function, and softmax(·) is the softmax activation function.
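A minimal numpy sketch of this MLP-softmax span classifier, with the attentional contribution degree a(s) concatenated into the span representation, follows; all dimensions and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def aspert_span_probs(token_embs, width_emb, cls_emb, attn_contrib,
                      W1, b1, W2, b2):
    """MLP-softmax span classification with attentional contribution a(s)."""
    e_s = np.concatenate([token_embs.max(axis=0), width_emb, attn_contrib])
    x_s = np.concatenate([e_s, cls_emb])
    h = np.maximum(0.0, W1 @ x_s + b1)      # ReLU hidden layer
    return softmax(W2 @ h + b2)

rng = np.random.default_rng(1)
probs = aspert_span_probs(rng.normal(size=(3, 8)),    # span tokens, b_h = 8
                          rng.normal(size=4),         # width embedding
                          rng.normal(size=8),         # [CLS] embedding
                          rng.normal(size=12),        # a(s), one value per head
                          rng.normal(size=(16, 32)), np.zeros(16),
                          rng.normal(size=(5, 16)), np.zeros(5))
```

Compared with the single linear layer of SpERT, the extra hidden layer lets the classifier carve nonlinear decision boundaries between spans whose pooled embeddings are nearly identical.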
4.2 Attentional Contribution Degree Algorithm
In this subsection, we describe the attentional contribution degree algorithm in detail. The attentional contribution degree is a novel attention weight; concatenating it with the hidden layer features yields a weighted feature encoding. This encoding helps the model understand the contextual information of the span and strengthens the model's extraction of entities and relations.
The attentional contribution degree is derived from the attention paid to inter-word information by each attention head in each layer of the pretrained model. Here, the pretrained model comes from the BERT variants of the Transformers library. Pretraining a large model on a large corpus demands substantial GPU resources, so we only fine-tune a pretrained model (such as BERT base (cased), SciBERT (cased), or BioBERT (cased)) for a specific field. This does not mean we are bound by the pretrained model; on the contrary, we fully utilize the attention head information of the Transformer. We train and use an intermediate product of the model: the self-attention heads. For example, BERT base has 12 layers, and each layer has 12 attention heads, so we can make use of the information of these 144 attention heads.
Specifically, we first extract all the attention heads, which contain information about the relationships among words in a sentence. Second, we concatenate the multiple attention heads along the head dimension. As shown in Algorithm 1, we mask irrelevant words and retain only the relationship information between the candidate span and the words in the full text. Next, considering that each attention layer provides multiple “representation subspaces” and that the multihead attention mechanism expands the model's ability to attend to different positions, we introduce a contribution threshold to filter out attention head information with low attention to the candidate span. Finally, the attentional contribution degree is obtained by mean-pooling the attention head information over the token dimensions of both the context and the entity (Algorithm 2).
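The steps above can be sketched as follows; the tensor layout (heads, query tokens, key tokens) and the exact masking choice are our reading of Algorithms 1 and 2, not the authors' code.

```python
import numpy as np

def attentional_contribution(attn, span, threshold=0.5):
    """attn: (num_heads, seq_len, seq_len) softmax attention scores.
    span: (start, end) token indices of the candidate span (end exclusive).
    Returns one contribution score per attention head."""
    start, end = span
    masked = attn[:, :, start:end]                     # attention *to* span tokens
    kept = np.where(masked >= threshold, masked, 0.0)  # drop weak attention
    counts = np.maximum((kept > 0).sum(axis=(1, 2)), 1)
    return kept.sum(axis=(1, 2)) / counts              # mean over surviving entries
```

With the Transformers library, the per-layer attention tensors can be obtained by passing `output_attentions=True` to the model call; stacking them along the head dimension gives the 144-head input described above for BERT base.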
4.3 Training
Our training is supervised: the model is provided with labeled sentences (including candidate spans, entity classes, candidate entity pairs, and relation classes). We learn the width embedding and the span/relation classifiers' parameters (W_1^s, W_2^s, b_1^s, b_2^s, W^r, b^r) and fine-tune the domain-specific BERT. Different from the joint loss function defined by SpERT, Eq. (10) is used here for entity classification and relation classification:
L := λ · L^s + (1 − λ) · L^r   (10)

where λ is the weight of the joint loss function, L^s is the loss of the span classifier calculated using the cross-entropy loss function, and L^r is the loss of the relation classifier calculated using the binary cross-entropy loss function.
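A compact sketch of this weighted joint loss follows, assuming the form L = λ·L^s + (1 − λ)·L^r with a single weight λ; the exact combination in the original equation may differ.

```python
import numpy as np

def cross_entropy(probs, gold_idx):
    """Span loss: negative log-probability of the gold entity class."""
    return -np.log(probs[gold_idx])

def binary_cross_entropy(p, y):
    """Relation loss: mean BCE over the per-class sigmoid scores."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def joint_loss(span_probs, gold_entity, rel_scores, gold_rels, lam=0.5):
    return (lam * cross_entropy(span_probs, gold_entity)
            + (1 - lam) * binary_cross_entropy(rel_scores, gold_rels))
```

Cross-entropy fits the span side because entity classes are mutually exclusive (softmax output), while binary cross-entropy fits the relation side because multiple relations can be activated at once (independent sigmoids).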
5 Experiments
5.1 Datasets and Setting
We evaluate the model on three datasets from different domains, CoNLL04 , SciERC , and ADE . As shown in Tab. 2, the CoNLL04 dataset is derived from news articles and includes four entity types and five relationship types. The dataset is divided into a training set of 911 sentences, a validation set of 231 sentences, and a test set of 288 sentences. The SciERC (scientific information extractor) dataset is derived from abstracts of artificial intelligence papers and includes six scientific entity types and seven relationship types. This dataset is divided into a training set of 1861 sentences, a validation set of 275 sentences, and a test set of 551 sentences. The ADE (adverse drug effect) dataset is derived from medical reports describing the adverse effects of drug use and contains two entity types and one relationship type. The dataset is divided into a training set of 3843 sentences and a validation set of 429 sentences.
We evaluate ASpERT on entity extraction and RE. An entity prediction is considered correct if its span and entity type match the ground truth. A relation prediction is considered correct if the relation type and the two related entities (span and type) match the ground truth. In particular, to be consistent with the evaluation criteria of the comparison models, we only consider the prediction of the relation and the entity span (ignoring the entity type) on the SciERC dataset. The hyperparameters used for final training are listed in Tab. 3.
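This strict matching criterion can be stated precisely in a few lines; the tuple format for annotations is an illustrative assumption.

```python
def micro_prf(pred, gold):
    """pred, gold: sets of (start, end, type) tuples.
    An entity counts as correct only if span AND type both match."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For the SciERC setting described above, the type field would simply be dropped from the tuples before comparison, so a span with the wrong type still counts as correct.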
5.2 Comparison with the State of the Art
First, to evaluate the effectiveness of ASpERT's improvement over SpERT, we train both models on the same device with the same pretrained model and training parameters. We report the average over five runs for each dataset. In particular, the ADE dataset uses 10-fold cross validation. As shown in Tab. 4, the performance of ASpERT is significantly better than that of the baseline model (SpERT) on the different datasets. For entity extraction, the micro-F1 scores increase by 0.45% (CoNLL04), 0.20% (SciERC), and 0.52% (ADE), and the macro-F1 scores increase by 0.71% (CoNLL04), 0.33% (SciERC), and 0.50% (ADE). For RE, the micro-F1 scores increase by 1.25% (CoNLL04), 1.39% (SciERC), and 1.31% (ADE), and the macro-F1 scores increase by 1.25% (CoNLL04), 1.29% (SciERC), and 1.31% (ADE).
Subsequently, we compare the proposed model with the current most advanced models. As shown in Tab. 5, these models are the top four models (excluding SpERT) for the three datasets on the Papers With Code ranking list. We sort ASpERT and these models in descending order of RE F1 score. The experimental results show that ASpERT achieves higher extraction performance for both entities and relations. Even on the challenging and domain-specific SciERC dataset, ASpERT's RE F1 score is 0.30% higher than that of the top-ranked PL-Marker.
5.3 Effects of Attentional Contribution Degree
In Tab. 4, although the performance of ASpERT is better than that of SpERT, it is still not clear which part of ASpERT plays a key role. To demonstrate the advantage of the attentional contribution algorithm in JERE, we test two models:
• Full: We use the complete ASpERT model structure.
• -AC: We retain most of the ASpERT model structure but remove the attentional contribution degree algorithm.
We run these two models five times on the three datasets and average the results (the ADE dataset uses 10-fold cross validation). As shown in Tab. 6, the performance of the variant model without the attentional contribution degree algorithm decreases significantly: the F1 score drops by 0.48% for entity extraction and by 1.46% for RE. These results show that the attentional contribution degree algorithm adequately captures word-to-word relationships, which enables efficient relation classification and is the main contribution of the new model architecture.
Then, we investigated the effect of setting different contribution thresholds on the model’s ability to capture word-to-word relationships on SciERC and CoNLL04. Fig. 2 shows the F1 scores (RE) with different contribution thresholds. When the threshold is 0.5, the model performance is optimal.
Lastly, we also investigate the different fusion methods of each attention head information, namely, the maximum pooling, sum pooling, and mean pooling. Tab. 7 shows the F1 scores by using different fusion methods on three datasets. We determined that the mean pooling is more advantageous for JERE.
5.4 Effects of the Novel Span Classifier
To evaluate the effectiveness of the novel span classifier, we further test two models on the SciERC dataset:
• Full: We use the complete ASpERT model structure.
• -MLP: We retain most of the ASpERT model structure but replace the MLP structure with a fully connected layer in span classification.
As shown in Tab. 8, removing the MLP structure weakens the classifier's ability to learn span boundary information, leading to a decrease in the recall and accuracy of entity extraction and thus a drop of nearly 0.74% in F1 score.
6 Conclusion
In this paper, we have proposed a novel model termed ASpERT for JERE. This model fuses the attention head information overlooked in downstream tasks with the feature embedding of the hidden layer via a new attentional contribution degree algorithm. Specifically, the attentional contribution incorporates word-to-word attention and connects the residuals of the span classifier with each attention head. This allows the model to maintain the raw information as depth increases, enhancing its ability to capture contextual information and adapting it to domain-specific JERE. Moreover, the MLP-softmax structure of the span classifier, together with the attentional contributions, provides boundary supervision and improves span classification. Without such techniques, researchers limited by hardware conditions would have to rely solely on parameter fine-tuning for information extraction tasks; our results show that the use of pre-trained models need not be limited to the hidden layer encodings.
Considering that the attention head is the basic unit of Transformer pre-trained models, in future work we will further demonstrate the influence of the attentional contribution degree algorithm on other Transformer pre-trained models. Notably, Asian languages require more tokens than English to express the same meaning, which is unfriendly to the random sampling method; hence, we will focus on span-wise sampling for complex language structures.
Acknowledgement: We thank the open-source authors of the datasets. We also thank all members of the Xiangtan University 504 Lab for their strong support of our research.
Funding Statement: This work was supported by Key Projects of the Ministry of Science and Technology of the People’s Republic of China (2020YFC0832401) and National College Students Innovation and Entrepreneurship Training Program (No. 202110530001).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.