In recent years, text summarization models based on pre-training methods have achieved very good results. However, in these models, semantic deviations easily arise between the original input representation and the representation produced by the multi-layer encoder, which may result in inconsistencies between the generated summary and the source text. Bidirectional Encoder Representations from Transformers (BERT) improves the performance of many tasks in Natural Language Processing (NLP). Although BERT has a strong capability to encode context, it lacks fine-grained semantic representation. To solve these two problems, we propose a semantic supervision method based on the Capsule Network. First, we extract the fine-grained semantic representations of both the input and the encoded result of BERT with a Capsule Network. Second, we use the fine-grained semantic representation of the input to supervise that of the encoded result. We evaluated our model on a popular Chinese social media dataset (LCSTS), and the results show that our model achieves higher ROUGE scores (including R-1 and R-2) and outperforms the baseline systems. Finally, we conducted a comparative study of model stability, and the experimental results show that our model is more stable.
The goal of text summarization is to convey the important information of a source text in a small number of words. In the current era of information explosion, text information floods the Internet, so text summarization is needed to help us obtain useful information from source texts. With the rapid development of artificial intelligence, automatic text summarization was proposed, in which computers aid people in the complex task of summarization. By using machine learning, deep learning, and other methods, we can obtain a general model for automatic text summarization that can replace humans in extracting summaries from source texts.
Automatic text summarization is usually divided into two categories according to the implementation method: extractive summarization and abstractive summarization. Extractive summarization extracts sentences containing key information from the source text and combines them into a summary, while abstractive summarization compresses and refines the information in the source text to generate a new summary. Compared with extractive summarization, abstractive summarization is more innovative, because the machine can generate summary content that is more informative and attractive. Abstractive text summarization models are usually based on a sequence-to-sequence model [
BERT [
Nowadays, Neural Network has been applied to many fields [
The remainder of this paper is organized as follows. Related work is discussed in Section 2. The proposed model is presented in Section 3. Details of the experiments are explained in Section 4. Comparison and discussion of the experimental results are given in Section 5. Conclusions and future work are presented in Section 6.
The research on abstractive summarization mainly depends on the seq2seq model proposed by Cho et al. [
Pre-trained language models have become an important technology in the NLP field in recent years. The main idea is that a model's parameters are no longer randomly initialized but are trained in advance on some tasks (such as language modeling) over a large-scale text corpus. The model is then fine-tuned on a small dataset for a specific task, which makes it easy to train. An early pre-trained language model is Embeddings from Language Models (ELMo) [
Liu et al. [
Ma et al. [
In 2017, Sabour et al. [
Based on the methods mentioned above, we perform abstractive summarization by adopting the idea of seq2seqLM and adding the semantic supervision method to the model. We conducted relevant experiments on the Chinese dataset LCSTS [
Our model structure is shown in
BERT’s embedding layer contains Token Embedding, Segment Embedding and Position Embedding. Token Embedding is the vector representation of each token, obtained by looking up the embedding matrix with the token ID. Segment Embedding indicates whether the current token comes from the first or the second segment. Position Embedding is the position vector of the current token.
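For concreteness, a minimal PyTorch sketch of such an embedding layer is given below; the three embeddings are summed and normalized, the sizes follow BERT-base defaults, and the class and variable names are ours rather than the authors' code.

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Illustrative BERT-style embedding layer: Token, Segment and Position Embeddings are summed."""
    def __init__(self, vocab_size=21128, hidden_size=768, max_position=512, type_vocab_size=2):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)         # lookup by token ID
        self.segment_embeddings = nn.Embedding(type_vocab_size, hidden_size)  # first vs. second segment
        self.position_embeddings = nn.Embedding(max_position, hidden_size)    # position of each token
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        embeddings = (self.token_embeddings(token_ids)
                      + self.segment_embeddings(segment_ids)
                      + self.position_embeddings(positions))
        return self.dropout(self.layer_norm(embeddings))
```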
Transformer Layer consists of
The input of seq2seqLM is the same as that of BERT; the main difference is that seq2seqLM changes the mask matrix of the multi-head attention in the Transformer. As shown on the left of
The element of the mask matrix is 0, which means the
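As an illustration of the modified mask, the sketch below builds a UniLM-style seq2seq attention mask, assuming the convention that 1 marks a position that may be attended to and 0 a blocked one (the function name and convention are ours):

```python
import torch

def seq2seq_attention_mask(src_len, tgt_len):
    """Source tokens attend bidirectionally over the source; target tokens attend to the
    source and to the target tokens generated so far (left-to-right)."""
    total = src_len + tgt_len
    mask = torch.zeros(total, total)
    mask[:, :src_len] = 1.0                                               # every token may attend to the source part
    mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len))   # causal part for the target
    return mask

# Example: a 3-token source and a 2-token target yields a 5x5 mask
print(seq2seq_attention_mask(3, 2))
```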
The output of Embedding Layer is defined as
where
and
We took the output of the last Transformer block as the input of the Output Layer. The Output Layer consists of three parts: two fully connected layers and one Layer Normalization.
The first fully connected layer is used to add a nonlinear transformation to BERT's output; we use GELU, which is widely used in BERT, as the activation function. In
Different from Batch Normalization [
The second fully connected layer is used to parse the output, which contains
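A hedged PyTorch sketch of such an output head is given below: a fully connected layer with GELU, Layer Normalization, and a second fully connected layer that maps hidden states to vocabulary logits. The hidden size of 768 and the 7,655-character vocabulary follow the experiment settings; everything else is illustrative.

```python
import torch.nn as nn

class OutputLayer(nn.Module):
    """Illustrative output head: dense + GELU, Layer Normalization, dense projection to the vocabulary."""
    def __init__(self, hidden_size=768, vocab_size=7655):
        super().__init__()
        self.transform = nn.Linear(hidden_size, hidden_size)   # adds a nonlinear transformation
        self.activation = nn.GELU()
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)       # maps hidden states to token scores

    def forward(self, hidden_states):
        hidden_states = self.layer_norm(self.activation(self.transform(hidden_states)))
        return self.decoder(hidden_states)                      # logits over the vocabulary
```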
Because BERT lacks fine-grained semantic representation, it cannot produce high-quality summaries when applied to text summarization. In addition, there are semantic deviations between the original input and the encoded result produced by the multi-layer encoder. We hope to alleviate these problems by adding semantic supervision based on the Capsule Network. The implementation of semantic supervision is shown on the right side of
Ma et al. [
It can be seen from
We took the output of Embedding layer
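For reference, a minimal capsule layer with dynamic routing in the spirit of Sabour et al. is sketched below; it condenses a token sequence into a fixed set of semantic capsules. The hyperparameters (16 capsules of dimension 32, 3 routing iterations) are illustrative assumptions, not the settings used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Capsule nonlinearity: preserves orientation, bounds the vector length in [0, 1)."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return norm_sq / (1.0 + norm_sq) * s / torch.sqrt(norm_sq + eps)

class CapsuleLayer(nn.Module):
    """Dynamic routing from a sequence of token vectors to a fixed set of semantic capsules."""
    def __init__(self, in_dim=768, num_caps=16, cap_dim=32, routing_iters=3):
        super().__init__()
        self.num_caps, self.cap_dim, self.routing_iters = num_caps, cap_dim, routing_iters
        self.proj = nn.Linear(in_dim, num_caps * cap_dim)        # prediction vectors u_hat

    def forward(self, x):                                        # x: (batch, seq_len, in_dim)
        b, seq_len, _ = x.shape
        u_hat = self.proj(x).view(b, seq_len, self.num_caps, self.cap_dim)
        logits = torch.zeros(b, seq_len, self.num_caps, device=x.device)
        for _ in range(self.routing_iters):                      # routing-by-agreement
            c = F.softmax(logits, dim=-1).unsqueeze(-1)          # coupling coefficients
            v = squash((c * u_hat).sum(dim=1))                   # (batch, num_caps, cap_dim)
            logits = logits + (u_hat * v.unsqueeze(1)).sum(-1)   # agreement update
        return v                                                 # fine-grained semantic capsules
```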
We found that the longer the input sequence is, the larger the semantic deviations are. We therefore apply semantic supervision of different intensities to inputs of different lengths, controlling the intensity of supervision through the parameter
The loss function of Semantic Supervision can be written as follows:
There are two loss functions in our model that need to be optimized. The first one is the categorical cross-entropy loss in
During training, we used Adam optimizer [
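The exact form of the supervision loss is given by the paper's equation; purely as an illustration, the sketch below assumes a mean-squared-error term between the two capsule representations, weighted by a coefficient lam that may depend on the input length, combined with the cross-entropy loss before an Adam update.

```python
import torch
import torch.nn.functional as F

# logits:       (batch, tgt_len, vocab) decoder output
# target_ids:   (batch, tgt_len) gold summary token IDs
# caps_input:   capsules extracted from the embedding-layer output
# caps_encoded: capsules extracted from the encoder output
def joint_loss(logits, target_ids, caps_input, caps_encoded, lam=0.1):
    ce = F.cross_entropy(logits.transpose(1, 2), target_ids)   # categorical cross-entropy
    sup = F.mse_loss(caps_encoded, caps_input.detach())        # semantic supervision (assumed MSE form)
    return ce + lam * sup

# optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)    # learning rate is illustrative
# loss = joint_loss(...); optimizer.zero_grad(); loss.backward(); optimizer.step()
```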
In this section, we will introduce our experiments in detail, including dataset, evaluation metric, experiment setting and baseline systems.
Dataset | Pairs | Pairs (human score ≥ 3)
---|---|---
PART I | 2,400,591 | –
PART II | 10,666 | 8,685
PART III | 1,106 | 725
We conducted experiments on LCSTS dataset [
We used the ROUGE scores [
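As a reminder of what ROUGE-N measures, the function below computes a minimal character-level ROUGE-N recall; it is illustrative only, since the reported scores come from the standard ROUGE toolkit, which also provides precision, F1, and ROUGE-L.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's character n-grams that also appear in the candidate."""
    def ngrams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())            # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)
```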
We used the Chinese vocabulary of BERT-base, which contains 21,128 characters, whereas the number of distinct characters we counted in PART I of LCSTS is 10,728. To reduce computation, we used only the characters in the intersection of the two, i.e., 7,655 characters. In our model, we used BERT-base's default embedding size of 768, the number of heads
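A small sketch of how such a reduced vocabulary could be built by intersecting the BERT-base vocabulary file with the characters observed in PART I; the file paths are placeholders, and in practice special tokens such as [CLS], [SEP], [PAD], [MASK], and [UNK] would also be retained.

```python
def build_reduced_vocab(bert_vocab_path, corpus_path):
    """Keep only the characters present in both the BERT-base vocabulary and the training corpus."""
    with open(bert_vocab_path, encoding="utf-8") as f:
        bert_vocab = {line.strip() for line in f}                  # 21,128 entries for BERT-base
    with open(corpus_path, encoding="utf-8") as f:
        corpus_chars = {ch for line in f for ch in line.strip()}   # distinct characters in the corpus
    return sorted(bert_vocab & corpus_chars)                       # intersection (~7,655 characters)
```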
We compared the proposed model with the ROUGE scores of the following models, which we briefly introduce next.
Models | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
RNN(W) [ | 17.7 | 8.5 | 15.8
RNN(C) [ | 21.5 | 8.9 | 18.6
RNN-context(W) [ | 26.8 | 16.1 | 24.1
RNN-context(C) [ | 29.9 | 17.4 | 27.2
CopyNet(W) [ | 35.0 | 22.3 | 32.0
CopyNet(C) [ | 34.4 | 21.6 | 31.3
DRGD(C) [ | 37.0 | 24.2 | 34.2
WEAN(C) [ | 37.8 | 25.0 | 35.2
– | 39.2 | 26.0 | 36.2
BERT-seq2seqLM(C) (our impl.) | 39.84 | 25.47 | 34.62
+Capsule (ours) | 40.63 | 26.4 | 35.75
For clarity, we refer to BERT with the modified mask matrix as BERT-seq2seqLM, and denote our model with semantic supervision based on the Capsule Network as
We compared our model with the baseline systems; the experimental results of these models on the LCSTS dataset are shown in
In addition, we also compared the ROUGE scores of models under different epochs, as shown in
As for semantic supervision, in addition to Capsule Network, we also tried to use LSTM and GRU. However, after comparative experiments, we found that Capsule Network was more suitable. As shown in
Models | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
BERT-seq2seqLM | 39.84 | 25.47 | 34.62 |
+LSTM | 40.19 | 25.79 | 35.14 |
+GRU | 40.34 | 26.0 | 35.22 |
+Capsule | 40.63 | 26.4 | 35.75 |
As shown in
Following the idea of UNILM, we transformed the mask matrix of BERT-base to accomplish abstractive summarization. At the same time, we introduced the semantic supervision method based on the Capsule Network into our model and improved the performance of the text summarization model on the LCSTS dataset. Experimental results showed that our model outperformed the baseline systems. In this paper, the semantic supervision method was only applied to the pre-trained language model; we have not yet conducted experiments to verify it on other neural network models. In addition, we only used the Chinese dataset and did not verify the method on other datasets. In the future, we will improve the semantic supervision method and conduct further experiments to address these issues.
We would like to thank all the researchers of this project for their effort.