Span-based models for the joint entity and relation extraction task have been studied exhaustively. However, these models sample a large number of negative entities and negative relations during training, which are essential but result in grossly imbalanced data distributions and, in turn, suboptimal model performance. To address these issues, we propose a two-phase paradigm for span-based joint entity and relation extraction, which classifies the entities and relations in the first phase and predicts their types in the second phase. The two-phase paradigm enables our model to significantly reduce the data distribution gap, including the gap between negative entities and other entities, as well as the gap between negative relations and other relations. In addition, we make the first attempt at combining entity type and entity distance as global features, which proves effective, especially for relation extraction. Experimental results on several datasets demonstrate that the span-based joint extraction model augmented with the two-phase paradigm and the global features consistently outperforms previous state-of-the-art span-based models for the joint extraction task, establishing a new standard benchmark. Qualitative and quantitative analyses further validate the effectiveness of the proposed paradigm and the global features.
Span-based joint entity and relation extraction models simultaneously conduct NER (named entity recognition) and RE (relation extraction). Span-based models add a Not-Entity type for spans that are not entities and a Not-Relation type for relation tuples that do not hold relations.
Span-based joint extraction models [
| NER | Other | Org | Per | Loc | Not-Entity | |
|---|---|---|---|---|---|---|
| | 592 | 786 | 1,370 | 1,541 | 101,555 | |
| RE | Kill | LocIn | Work | OrgBI | Live | Not-Relation |
| | 229 | 312 | 325 | 317 | 421 | 12,915 |
Global features, such as those derived from entity information, can be critical in the joint extraction task. As illustrated in
| Relation tuple type | Entity distance [0–3] | [4–7] | [8–11] | [>11] |
|---|---|---|---|---|
| Live | 39.0% | 27.6% | 10.5% | 22.9% |
| LocIn | 76.6% | 12.8% | 3.5% | 7.1% |
| Work | 41.5% | 35.1% | 8.0% | 15.4% |
| OrgBI | 72.0% | 10.7% | 5.5% | 11.8% |
| Kill | 21.3% | 33.5% | 26.7% | 18.5% |
In this paper, we propose a two-phase span-based model for the joint extraction task, with the goal of addressing the issues of grossly imbalanced data distributions and the lack of effective global features. We are motivated by the fact that NER (RE) can be achieved in two steps: first classify all entities (relations), and then predict their types. Accordingly, we divide the joint extraction task into two phases, with the first phase obtaining entities and relations and the second phase predicting their types. Our model reduces the data distribution gap by dozens of times using the two-phase paradigm. Taking the data above as an example, the gaps in the first phase are 1:24 ≈ (592+786+1,370+1,541):101,555 for NER and 1:8 ≈ (229+312+325+347+421):12,915 for RE, while in the second phase the entity distribution is approximately 592:786:1,370:1,541 and the relation distribution is approximately 229:312:325:347:421, both of which are roughly even.
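The ratio arithmetic above can be reproduced in a few lines (counts follow this paragraph; note the table above lists 317 rather than 347 for one relation type):

```python
# Positive-class counts per type (CoNLL04-style statistics from this section).
ner_counts = {"Other": 592, "Org": 786, "Per": 1370, "Loc": 1541}
re_counts = {"Kill": 229, "LocIn": 312, "Work": 325, "OrgBI": 347, "Live": 421}

def imbalance_ratio(positive_counts, negative_count):
    """Return x such that (all positives : negatives) is roughly 1 : x."""
    return negative_count / sum(positive_counts.values())

ner_ratio = imbalance_ratio(ner_counts, 101555)  # roughly 1:24
re_ratio = imbalance_ratio(re_counts, 12915)     # roughly 1:8
```

The same helper, applied to per-type counts within the positive classes only, shows why the second-phase distributions are approximately even.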
Experimental results on the ACE05, CoNLL04 and SciERC datasets demonstrate that our model consistently outperforms the strongest span-based baselines in terms of F1-score, providing a new span-based benchmark for the joint extraction task. Extensive analyses further validate the effectiveness of our model.
In summary, our model differs from the previous span-based models in three ways: (1) As far as we know, our model makes the first attempt to balance the grossly imbalanced data distributions. (2) Our model combines entity type and entity distance as the global features, whereas previous span-based models use at most one of them. (3) Our model uses a gated mechanism to fuse various semantic representations, whereas previous span-based models use a simple concatenation manner.
Recently, span-based models have been extensively investigated for the joint entity and relation extraction task. Luan et al. [
The entity type and entity distance are two types of important global features that are frequently used in joint extraction models [
The neural architecture of our two-phase span-based model is illustrated in
We formulate the text spans (denoted as
Our model uses the BERT [ The [CLS] token is a special token that is added to the beginning of tokenized texts. The embedding of the [CLS] token is generally used for text classification.
Based on
As shown in
This module obtains coarse-grained entities by performing binary classification on span semantic representations. We begin by converting all entity types in the training set to the Entity type and setting the type of sampled negative entities to the Not-Entity type. Our model is trained to classify a span as the Entity type if it is an entity, and as the Not-Entity type otherwise.
In this paper, we obtain the span semantic representations using three different types of semantic representations: (1) span token representation, (2) contextual representation, and (3) span width embedding.
For the span
In this paper, we take the
Span width embedding allows the model to incorporate prior experience over span widths. In this paper, we train a fixed-size embedding for each span width (i.e., 1, 2, …) during training. We refer to the width embedding for the span
To obtain coarse-grained entities, we first pass the
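As a hypothetical sketch of the module described above, the three span-level representations (max-pooled span tokens, the [CLS] contextual representation, and a width embedding) can be fused by a gate and passed to a binary classifier. All module names, dimensions, and the exact gating form here are our assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class GatedSpanClassifier(nn.Module):
    """Sketch of the coarse-grained entity module: gated fusion of
    span token representation, [CLS] representation, and width embedding,
    followed by an Entity / Not-Entity binary classifier."""

    def __init__(self, hidden=768, max_width=10, width_dim=25):
        super().__init__()
        self.width_emb = nn.Embedding(max_width + 1, width_dim)
        fused_dim = hidden + hidden + width_dim
        # The gate decides, per dimension, how much of the fused vector to keep.
        self.gate = nn.Linear(fused_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, 2)  # Entity vs Not-Entity

    def forward(self, token_reps, cls_rep, width):
        # token_reps: (span_len, hidden); cls_rep: (hidden,); width: int
        span_rep = token_reps.max(dim=0).values          # max pooling over the span
        fused = torch.cat(
            [span_rep, cls_rep, self.width_emb(torch.tensor(width))])
        gated = torch.sigmoid(self.gate(fused)) * fused  # gated fusion
        return self.classifier(gated)                    # binary logits

logits = GatedSpanClassifier()(torch.randn(3, 768), torch.randn(768), 3)
```

The gated fusion replaces the simple concatenation used by previous span-based models, as discussed earlier in this paper.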
This module obtains coarse-grained relations by performing binary classification on semantic representations of relation tuples. We begin by converting all relation types in the training set to the Relation type and assigning the Not-Relation type to sampled negative relations. Our model is trained to classify a relation tuple as the Relation type if it holds a relation, and as the Not-Relation type otherwise.
Let
We obtain the semantic representation of
Relation context is the text between the two entities of a relation tuple [
We obtain the contextual representation of
In this paper, we propose to combine entity type and entity distance as global features. Because all entities here are of the Entity type, only the entity distance can be used to distinguish different feature entries. As shown in
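The entity-distance feature entry can be derived with a simple bucketing function; the interval boundaries follow the [0–3]/[4–7]/[8–11]/[>11] scheme used in the statistics table earlier, while the function name and return convention are ours:

```python
def distance_bucket(distance: int) -> int:
    """Map an entity distance (tokens between the two entities of a
    relation tuple) to one of four intervals: [0-3], [4-7], [8-11], [>11]."""
    if distance <= 3:
        return 0
    if distance <= 7:
        return 1
    if distance <= 11:
        return 2
    return 3

# Each bucket index would select a trainable global-feature embedding.
buckets = [distance_bucket(d) for d in (0, 5, 11, 20)]  # [0, 1, 2, 3]
```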
We obtain the semantic representation of
To obtain coarse-grained relations, we first pass the
For each of the above two binary classifications, the training objective is to minimize the following binary cross-entropy loss:
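In its standard form, with $N$ the number of sampled instances (spans or relation tuples), $y_i \in \{0,1\}$ the gold binary label, and $p_i$ the predicted probability of instance $i$, the binary cross-entropy is presumably:

```latex
\mathcal{L}_{b} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log p_i + (1-y_i)\log(1-p_i)\Big]
```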
In Phase Two, our model predicts the types of the coarse-grained entities and relations, obtaining fine-grained entities and relations. Phase Two, as illustrated in
In this module, we obtain entity types by conducting multi-class classifications on the semantic representations of coarse-grained entities. Specifically, for each coarse-grained entity
We obtain relation types by performing multi-class classifications on relation semantic representations. As shown in
For each coarse-grained relation
Then we obtain the relation semantic representation (denoted as
To obtain the type of
For each of the above two multi-class classification tasks, the training objective is to minimize the following cross-entropy loss:
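With $C$ the number of types, $y_{i,c}$ the one-hot gold label, and $p_{i,c}$ the predicted probability of type $c$ for instance $i$, the standard multi-class cross-entropy reads:

```latex
\mathcal{L}_{m} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}
```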
During the model training, we minimize the following joint training loss:
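A plausible form of this joint loss, assuming an unweighted sum of the two Phase-One binary losses and the two Phase-Two multi-class losses (the $ent$/$rel$ superscripts are our notation), is:

```latex
\mathcal{L} \;=\; \mathcal{L}_{b}^{\,ent} + \mathcal{L}_{b}^{\,rel} + \mathcal{L}_{m}^{\,ent} + \mathcal{L}_{m}^{\,rel}
```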
We evaluate our model on ACE05 [
ACE05 defines seven entity types (Per, Org, Loc, Gpe, Fac, Veh, and Wea) and six relation types (Phys, Part-whole, Per-soc, Org-aff, Art, and Gen-aff) between entities. We use the same data splits, pre-processing, and task settings proposed by Li and Ji [
CoNLL04 defines four entity types (Loc, Org, Per, and Other) and five relation types (Kill, Live, LocIn, OrgBI, and Work). We use the splits defined by Ji et al. [
SciERC is derived from 500 abstracts of AI papers. The dataset defines six scientific entities (Task, Method, Metric, Material, Other, and Generic) and seven relation types (Compare, Conjunction, Evaluate-for, Used-for, Feature-of, Part-of, and Hyponym-of) in a total of 2,687 sentences. We use the same training (1,861 sentences), development (275 sentences), and test (551 sentences) split following the previous work [
For a fair comparison with previous work, we use the bert-base-cased model on ACE05 and CoNLL04, and use the scibert-scivocab-cased model on SciERC. We optimize our model using the BertAdam for 120 epochs with a learning rate of 5e-5 and a weight decay of 1e-2. We set the span width threshold
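The reported training setup can be summarized as a configuration sketch; the dict layout and key names are ours, and the span width threshold value is truncated in the text, so it is omitted:

```python
# Training configuration as reported in this section (layout is ours).
config = {
    "encoder": {
        "ACE05": "bert-base-cased",
        "CoNLL04": "bert-base-cased",
        "SciERC": "scibert-scivocab-cased",
    },
    "optimizer": "BertAdam",
    "epochs": 120,
    "learning_rate": 5e-5,
    "weight_decay": 1e-2,
}
```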
For ACE05, an entity mention is considered correct if its head region and type match the ground truth, and a relation is correct if both its relation type and two entity mentions are correct. For CoNLL04, an entity mention is considered correct if its offsets and type match the ground truth, and a relation is correct if both its relation type and two entity mentions are correct. For SciERC, the entity type is not considered when evaluating relation extraction, which is in line with the previous work [
We compare our model with all the published span-based models for the joint extraction task that we are aware of. We report the comparison results in
To be more precise, on ACE05, our model achieves +0.4% and +3.2% absolute F1 gains on NER and RE, respectively, when compared to Ji et al. [
| Model | NER P | NER R | NER F1 | RE P | RE R | RE F1 |
|---|---|---|---|---|---|---|
| Dixit and Al-Onaizan [ | 85.9 | 86.1 | 86.0 | 68.0 | 58.4 | 62.8 |
| Luan et al. [ | - | - | 88.4 | - | - | 63.2 |
| Wadden et al. [ | - | - | 88.6 | - | - | 63.4 |
| Zhong and Chen [ | - | - | 88.7 | - | - | 66.7 |
| Wang et al. [ | - | - | 88.9 | - | - | 64.3 |
| Ji et al. [ | 89.3 | 89.9 | 89.6 | 71.2 | 60.2 | 65.2 |
| Our Model | | | | | | |
| Model | NER P | NER R | NER F1 | RE P | RE R | RE F1 |
|---|---|---|---|---|---|---|
| Eberts and Ulges [ | 88.3 | 89.6 | 89.0 | 73.0 | 70.0 | 71.5 |
| Zhang et al. [ | 88.1 | 90.6 | 89.3 | 75.4 | 71.1 | 73.2 |
| Tang et al. [ | - | - | 89.4 | - | - | 72.6 |
| Ji et al. [ | 90.4 | 90.2 | 77.0 | 71.9 | 74.3 | |
| Our Model | 89.6 | | | | | |
| Model | NER P | NER R | NER F1 | RE P | RE R | RE F1 |
|---|---|---|---|---|---|---|
| Luan et al. [ | 67.2 | 61.5 | 64.2 | 47.6 | 33.5 | 39.3 |
| Luan et al. [ | - | - | 65.2 | - | - | 41.6 |
| Wadden et al. [ | - | - | 67.5 | - | - | 48.4 |
| Zhong and Chen [ | - | - | 68.9 | - | - | 50.1 |
| Eberts and Ulges [ | 70.9 | 69.8 | 70.3 | 53.4 | 48.5 | 50.8 |
| Zhang et al. [ | 69.7 | 71.1 | 70.4 | 50.0 | 52.5 | |
| Santosh et al. [ | 71.3 | 70.5 | 51.9 | 50.6 | 51.3 | |
| Our Model * | 69.7 | 52.9 | | | | |
We attribute these performance improvements to our model's ability to balance the grossly imbalanced data distributions and to exploit effective global features.
We conduct extensive effectiveness investigations across the three datasets and use SpERT [
As illustrated in
| Data | Task | Baseline | Our model (Phase one) | Our model (Phase two) |
|---|---|---|---|---|
| ACE05 | NER | 1:773.3 | 1:19.7 | 1:21.3 |
| | RE | 1:150.0 | 1:13.8 | 1:3.4 |
| CoNLL04 | NER | 1:171.5 | 1:23.7 | 1:2.6 |
| | RE | 1:56.4 | 1:9.9 | 1:1.8 |
| SciERC | NER | 1:605.3 | 1:25.5 | 1:7.7 |
| | RE | 1:913.5 | 1:35.6 | 1:13.6 |
Based on the above observations, we conclude that the two-phase paradigm allows our model to avoid suffering from grossly imbalanced data distributions.
In general, as the entity lengths increase, it becomes increasingly difficult to recognize the entities. In this section, we conduct investigations on NER performance in relation to entity lengths. We divide all entity lengths, which are restricted by the span width threshold
In general, as the distance between the two entities of a relation increases, the relation becomes more difficult to extract. In this section, we conduct investigations on RE performance in relation to entity distances. We divide all entity distances into five intervals, namely [0], [1–3], [4–6], [7–9], and [>=10]. We conduct investigations on the dev sets of the three datasets and report the investigation results in
We conduct ablation studies on the dev sets of the three datasets to analyze the effects of various model components. We report the ablation results in
| Model | ACE05 NER (F1) | ACE05 RE (F1) | CoNLL04 NER (F1) | CoNLL04 RE (F1) | SciERC NER (F1) | SciERC RE (F1) |
|---|---|---|---|---|---|---|
| Our Model | 88.1 | 64.8 | 88.7 | 74.2 | 71.4 | 53.5 |
| w/o Two-phase | 86.5 (−0.6) | 61.7 (−3.1) | 86.4 (−2.3) | 71.5 (−2.7) | 68.2 (−3.2) | 50.9 (−2.6) |
| w/o Bi-features | 88.3 (+0.2) | 63.9 (−0.9) | 88.2 (−0.5) | 73.5 (−0.7) | 71.1 (−0.3) | 51.7 (−1.8) |
| w/o Multi-features | 87.8 (−0.3) | 63.4 (−1.4) | 88.4 (−0.3) | 72.4 (−1.8) | 71.5 (+0.1) | 52.0 (−1.5) |
| w/o Both-features | 87.5 (−0.6) | 61.5 (−3.3) | 88.5 (−0.2) | 72.0 (−2.2) | 70.0 (−0.4) | 51.5 (−2.0) |
| w/o gated | 87.9 (−0.2) | 64.3 (−0.5) | 88.1 (−0.6) | 73.3 (−0.9) | 69.9 (−1.5) | 52.7 (−0.8) |
| base | 86.3 (−1.8) | 61.5 (−3.3) | 85.9 (−2.8) | 70.2 (−4.0) | 67.8 (−3.6) | 49.0 (−4.5) |
We have the following observations: (1) The two-phase paradigm consistently improves the model performance across the three datasets, delivering +0.6% to +3.2% F1 gains on NER and +2.6% to +3.1% on RE, which can be attributed to the paradigm's ability to prevent our model from being harmed by grossly imbalanced data distributions. (2) Both binary and multi-class global features consistently benefit RE performance, and the multi-class features are generally more effective than the binary ones, as demonstrated on ACE05 and CoNLL04. A possible explanation is that the multi-class features take fine-grained entity types into account. Additionally, both types of global features have a negligible effect on NER; a plausible explanation is that these features are derived from entity information and are employed only in relation extraction. (3) Combining the two types of global features results in improved RE performance, suggesting that they complement each other. (4) The proposed gated mechanism consistently improves model performance, bringing +0.2% to +1.5% F1 gains on NER and +0.5% to +0.9% on RE, suggesting that it fuses the various semantic representations better than simple concatenation.
In this paper, we propose a two-phase span-based model for the joint entity and relation extraction task, aiming to tackle the grossly imbalanced data distributions caused by the essential negative sampling. We also augment the proposed model with global features obtained by combining entity types and entity distances. Moreover, we propose a gated mechanism for effectively fusing various semantic representations. Experimental results on several datasets demonstrate that our model consistently outperforms the strongest span-based models for the joint extraction task, establishing a new standard benchmark.