|Computers, Materials & Continua |
An Intelligent Tree Extractive Text Summarization Deep Learning
Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
*Corresponding Author: Hanan Ahmed Hosni Mahmoud. Email: firstname.lastname@example.org
Received: 18 March 2022; Accepted: 17 May 2022
Abstract: In recent research, deep learning algorithms have presented effective representation learning models for natural languages. Deep learning-based models create better data representations than classical models and are capable of automatically extracting distributed representations of texts. In this research, we introduce a new tree-structured extractive text summarization model that fits the text structure representation in the knowledge base training module and also addresses memory issues that were not addressed before. The proposed model employs a tree-structured mechanism to generate the phrase and text embeddings. The proposed architecture mimics the tree configuration of the texts and provides better feature representation. It also incorporates an attention mechanism that offers an additional information source to conduct better summary extraction. The model treats text summarization as a classification process, in which it calculates the probabilities of phrase and text-summary association. The classification is divided into the recognition of multiple features such as information entropy, significance, redundancy and position. The model was assessed on two datasets, the Multi-Doc Composition Query (MCQ) dataset and the Dual Attention Composition (DAC) dataset. The experimental results show that our proposed model has better summarization precision than other models by a considerable margin.
Keywords: Neural network architecture; text structure; abstractive summarization
1 Introduction
Text summarization is an important research topic in language processing. It is an effective way to address information overload by reducing lengthy texts to fewer phrases or paragraphs. The prevalence of mobile devices, such as smart tablets, makes text summarization a crucial tool for small screens and limited bandwidth [1–3]. It can also be used as a comprehension test for computers. To produce an acceptable summary, a deep learning method must comprehend the text and condense its significant information. These tasks become highly problematic for computers as the text size increases. Although search engines can use advanced retrieval methods, they lack the capacity to extract data from several sources and return a brief, helpful response. There is also a need for timely tools that summarize several sources. These concerns have sparked interest in automatic text summarization models. Current text summarization techniques depend on refined feature engineering that uses statistical features of the text. These systems are complicated and need engineered models. They also fail to generate an understandable text summary. End-to-end training models demonstrate better results in other areas, such as face recognition, machine translation and object recognition. Recently, neural summarization models have gained attention; several techniques have been proposed and their application to text corpora has been demonstrated [4–7].
There are two models of neural summarization: extractive summarization and abstractive summarization. Extractive models select and join relevant phrases from a document to generate the summary while keeping its original content. Extractive models are usually utilized for real-world applications [7–10]. A central issue in an extractive model is to determine the salient phrases that define the key information. Abstractive summarization models, in contrast, build a semantic model of the original text and produce a summary that resembles a human one. The current abstractive models are very weak [12–15].
Still, neural models have some problems when used in text summarization. These models lack the underlying aspect structure of document contents. Therefore, the text summary uses only a vector space representation that does not capture multi-aspect content. Another problem is that neural architectures are modifications of recurrent networks such as gated recurrent units and long short-term memory. These networks can theoretically remember previous selections in the computed state vector, but in practice this is not the case. Also, remembering the document semantics is relatively difficult and not essential. A weighted representation vector of prior states is utilized as an additional input to the step that determines the next state. Thus, the model can access a state computed in past steps, so the last state preserves the information of previous states.
The key contribution of this research is a neural summarization model that extracts relevant phrases from a text by treating summarization as a classification process. The model calculates a summary-membership score for each phrase by extracting features such as content, significance, redundancy and position. Our proposed model improves efficiency and accuracy in two ways: (i) it uses a tree text representation; (ii) while constructing the text tree, two self-attention techniques are used, one at the word level and one at the phrase level. This allows the model to respond strongly to important content.
This research addresses two problems: (1) reflecting the text tree to enhance the embedding structure of phrases and learn the text semantics; (2) extracting the most significant phrases from the text to produce a preferred summary [18–22].
The main difference between our research and others is that our proposed model utilizes a tree-structured self-attention technique to produce phrase embeddings. The augmented self-attention technique can enhance performance and provides insight into the high-scoring phrases during summary phrase selection. To compare our model's performance with state-of-the-art models, two datasets are utilized: the MCQ news dataset and the DAC set. Our model outperforms the compared models by a significant margin.
The article is organized as follows. Section 2 describes our summarization model in detail. Section 3 presents the simulation results. The related models for comparison are described in Section 4. Discussion and conclusions are presented in Section 5.
2 The Proposed Model
Recurrent convolutional neural (RCN) variants have been utilized in the text summarization process. Text tokens are extracted and fed to these neural networks as inputs, and phrase embeddings are utilized to transform the text tokens into vector space representations. Also, attention techniques make these extractive models scalable, letting them focus on previous portions of the input while deciding the final output.
Assume a text T that consists of a chain of phrases, each of which is a chain of words. The phrase extraction process generates a text summary from T by choosing a subset of K phrases. We classify each phrase with a binary label (0 or 1), where a label of 1 marks the phrase as a member of the text summary set.
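The setup can be written compactly as below. Only T and K appear in the text; the symbols s_j (a phrase), y_j (its label) and N (the phrase count of T) are assumed notation introduced here for illustration.

```latex
\begin{align}
T &= (s_1, s_2, \ldots, s_N), \qquad y_j \in \{0, 1\} \ \text{for } j = 1, \ldots, N, \\
\operatorname{summary}(T) &= \{\, s_j : y_j = 1 \,\}, \qquad \sum_{j=1}^{N} y_j = K \le N .
\end{align}
```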
To detail our work, we start with a brief description of the self-attention algorithm. Assume a query Q and a chain of input tokens, where each token is represented by its embedding vector. We define a function that computes the attention score between Q and each token. Self-attention is an attention technique in which the query Q stems from the input chain itself. Consequently, self-attention represents the token dependencies within the same chain: the function computes the dependency of one token on another, with the query Q replaced by a token of the chain.
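The idea can be illustrated with a minimal sketch. The scoring function used here is scaled dot-product attention, which is an assumption chosen for brevity and may differ from the scoring function of the proposed model; the function names and the toy dimensions are likewise hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_weights(query, tokens):
    """Score every token against the query and normalize the scores.

    Assumption: scaled dot-product scoring (not confirmed by the paper).
    """
    dim = tokens.shape[1]
    scores = tokens @ query / np.sqrt(dim)
    return softmax(scores)

def self_attention(tokens):
    """Self-attention: each token of the chain takes a turn as the query."""
    return np.stack([attention_weights(t, tokens) for t in tokens])

rng = np.random.default_rng(0)
chain = rng.normal(size=(5, 8))   # 5 tokens with 8-dimensional embeddings
weights = self_attention(chain)   # weights[i, j]: dependency of token i on token j
```

Each row of `weights` is a probability distribution over the tokens of the same chain, which is what distinguishes self-attention from attention driven by an external query.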
Previous work has shown that the self-attention process yields a better text representation for text classification. In our work, we propose a new tree-structured self-attention recurrent neural model for extractive summarization [23–27]. Our text summarization platform is applicable to most phrase extraction models that use distributed input representations.
Our model proposes a tree self-attention method that represents the tree structure of phrases and texts, where phrases constitute a text. The proposed model has two planes of attention: one plane operates on words and the other on phrases. Fig. 1 depicts the model architecture.
2.1 Word Attention Process
Given a text document with n phrases, where each phrase contains m words, every word is represented by a d-dimensional embedding vector. In our research, we employ a bidirectional RCN (Bi-RCN) for word encoding. The Bi-RCN reads each phrase in both directions. It encompasses a forward RCN, which reads the phrase from its first word to its last, and a backward RCN, which reads it from its last word to its first:
To compute the hidden representative state that summarizes the information of a phrase centered around a given word, we concatenate the forward and backward hidden states of that word, as depicted in the following equation:
where the count of the hidden representative states for each directional RCN is c. The whole set of RCN hidden representative states of the phrase is collected into a single matrix, as follows:
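A common way to write such a bidirectional encoder is sketched below. The symbols x_t (embedding of the t-th word), h_t (its hidden state) and H (the stacked states) are assumed notation for illustration; only m, d and c come from the text.

```latex
\begin{align}
\overrightarrow{h}_t &= \overrightarrow{\mathrm{RCN}}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad
\overleftarrow{h}_t = \overleftarrow{\mathrm{RCN}}\left(x_t, \overleftarrow{h}_{t+1}\right), \\
h_t &= \left[\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,\right] \in \mathbb{R}^{2c}, \\
H &= (h_1, h_2, \ldots, h_m) \in \mathbb{R}^{m \times 2c} .
\end{align}
```

With c hidden units per direction, the concatenation doubles the state size to 2c, which matches the encoder dimension of 460 reported for a hidden size of 230 in the experimental settings.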
The attention algorithm pays attention to a word depending on its impact on the phrase meaning. Our goal is to transform a phrase with a variable word count into a fixed-size representation using a self-attention algorithm [28–34] that reads the whole matrix of hidden states as input and produces a vector of weights, which are normalized by a softmax applied to a hyperbolic tangent (tanh) layer with learnable parameters.
We use the attention weight vector to compute the phrase representation as a weighted average of the RCN hidden representative states, as depicted in Fig. 2 and in the following equation.
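The pooling step described above can be sketched as follows, assuming the weights take the form softmax(w2 · tanh(W1 · Hᵀ)). The parameter names `W1` and `w2`, the attention hidden size `k`, and the random inputs standing in for Bi-RCN states are all hypothetical.

```python
import numpy as np

def attentive_pool(H, W1, w2):
    """Collapse a (length, 2c) matrix of hidden states into one (2c,) vector.

    Assumed form: a = softmax(w2 . tanh(W1 . H^T)), pooled = a . H
    """
    scores = w2 @ np.tanh(W1 @ H.T)   # one score per row of H, shape (length,)
    scores = scores - scores.max()    # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    return a @ H, a                   # weighted average of the rows, and the weights

rng = np.random.default_rng(1)
c = 4                                 # hidden units per direction (toy value)
k = 6                                 # attention hidden size (hypothetical)
W1 = rng.normal(size=(k, 2 * c))
w2 = rng.normal(size=k)

short = rng.normal(size=(3, 2 * c))   # stand-in states for a 3-word phrase
long = rng.normal(size=(11, 2 * c))   # stand-in states for an 11-word phrase
vec_short, a_short = attentive_pool(short, W1, w2)
vec_long, a_long = attentive_pool(long, W1, w2)
```

Both phrases end up as vectors of the same size 2c regardless of their word count, which is the property the phrase encoder relies on.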
2.2 Phrase Encoding
After obtaining the phrase representation vectors, we can develop the text representation with the same technique. A Bi-RCN is utilized to encode the phrases:
Comparable to the word encoder, the Bi-RCN computes the hidden representative state that summarizes the information of the adjacent phrases centered around the ith phrase, with a focus on that phrase, as depicted in the following equation:
where c denotes the count of the hidden representative states for each directional RCN, N denotes the number of phrases in the text document, and the whole set of RCN states is collected into a matrix computed by the following equation (its dimension is N × 2c).
2.3 Phrase Attention
Each phrase in a text document contributes to the meaning of the document in a different way and carries a different importance. The self-attention algorithm utilized in this research takes the RCN hidden representative states as input and outputs a vector of weights, as depicted by the following equation:
where the two weight terms are learnable parameters. The softmax function is utilized to normalize the weights so that they add up to 1.
From the attention weight vector, we obtain the document representation vector as a weighted sum of the RCN hidden representative states, as depicted in Fig. 2 and in the following equation.
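One formulation consistent with this description is sketched below. The symbols H_s (the N × 2c matrix of phrase-level states), W_{s1} and w_{s2} (the learnable terms, with k a hypothetical attention hidden size), a_s (the weights) and d (the document vector) are assumed notation for illustration.

```latex
\begin{align}
a_s &= \operatorname{softmax}\!\left( w_{s2} \tanh\!\left( W_{s1} H_s^{\top} \right) \right),
\qquad W_{s1} \in \mathbb{R}^{k \times 2c},\; w_{s2} \in \mathbb{R}^{k},\; a_s \in \mathbb{R}^{N}, \\
d &= a_s H_s \in \mathbb{R}^{2c} .
\end{align}
```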
We utilize a classifier that performs a binary classification to predict whether a phrase belongs in the output summary or not. The prediction for each phrase is determined by its feature representation, which comprises the richness of the phrase, its prominence in the document, its originality with respect to the accumulated summary, and its location feature. The probability that a phrase should be in the output summary is depicted by the following equation:
The richness of the phrase in the text is depicted as follows:
The following equation computes the prominence of a phrase in the document:
The originality of the phrase with respect to the accumulated summary is computed as follows:
where the summary state representation is computed as follows:
where the binary indicator (1 or 0) identifies whether phrase j is contained in the output summary or not.
The location feature of the phrase in the text document is computed as follows:
where the locational embedding of the phrase is computed by concatenating the embeddings of the forward and backward position indices of the phrase in the text.
The feature weights are learned by the learning module of the model and represent the relative significance of the features.
From Eqs. (13)–(16), we can deduce the final probability of the label for each phrase, as in the following equation:
where σ is the sigmoid function, and b is the bias weight. The summary state representation in the score lets the model consider its previous decisions when computing the phrase affiliation. At the training stage, the likelihood of the predicted labels is optimized.
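The scoring procedure can be sketched as follows. The linear and bilinear forms of the four features, the negative sign on the originality term, the 0.5 threshold used at inference, and all parameter names (`Wc`, `Ws`, `Wr`, `Wp`) are assumptions made for illustration; only the four features, the sigmoid, the bias b and the binary summary-state update come from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_phrases(S, d, P, params, labels=None):
    """Return the inclusion probability of each phrase, in document order.

    S: (N, D) phrase vectors, d: (D,) document vector,
    P: (N, E) positional embeddings. All parameter names are hypothetical.
    """
    Wc, Ws, Wr, Wp, b = (params[k] for k in ("Wc", "Ws", "Wr", "Wp", "b"))
    summary_state = np.zeros(S.shape[1])   # sum of the phrases selected so far
    probs = []
    for j, (s, p) in enumerate(zip(S, P)):
        richness = Wc @ s                  # content of the phrase itself
        prominence = s @ Ws @ d            # agreement with the whole document
        # Assumed sign: overlap with the accumulated summary lowers the score.
        originality = -(s @ Wr @ np.tanh(summary_state))
        location = Wp @ p                  # position of the phrase in the text
        prob = sigmoid(richness + prominence + originality + location + b)
        probs.append(prob)
        # Binary indicator: gold label when given (training), otherwise
        # a 0.5 threshold on the predicted probability (an assumption).
        included = labels[j] if labels is not None else int(prob > 0.5)
        summary_state = summary_state + included * s
    return np.array(probs)

rng = np.random.default_rng(2)
N, D, E = 6, 8, 4                          # toy sizes
params = {
    "Wc": 0.1 * rng.normal(size=D),
    "Ws": 0.1 * rng.normal(size=(D, D)),
    "Wr": 0.1 * rng.normal(size=(D, D)),
    "Wp": 0.1 * rng.normal(size=E),
    "b": 0.0,
}
S = rng.normal(size=(N, D))
d = S.mean(axis=0)                         # stand-in for the attended document vector
P = rng.normal(size=(N, E))
probs = score_phrases(S, d, P, params)
```

Because the summary state starts at zero, the originality term vanishes for the first phrase and grows in influence only as phrases are added to the summary.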
In the following subsections, we test the effectiveness of our proposed model, starting with a description of the datasets used and the experimental settings.
The proposed model was tested on the MCQ and DAC datasets. The MCQ dataset was originally constructed for question answering and was then utilized for extractive text summarization [36–39]. From the MCQ dataset, we utilized 198,240 text documents for training and 23,137 text documents for validation and testing. In the combined MCQ and DAC dataset, we used 296,322 documents for training and 23,313 for validation and testing. The mean number of phrases per document is 34. One of the main contributions of the authors in [40–42] is that they prepared the combined MCQ/DAC dataset for extractive summarization. They provided phrase-level labels for each text document, defining the membership score of the phrases.
The DAC dataset is utilized as an out-of-domain test dataset. It has 513 articles in 79 clusters of different topics, with a corresponding 130-word human-made summary produced for each text (single-text), or 130-word human-made multi-text summaries produced for 63 document clusters. In our model, we employ the single-text summarization process.
There are several methodologies for text summarization; for comparison, we select those that are analogous to our model and employ the selected datasets:
• Leading phrases (LP4): generates the first four phrases of the text as a summary. This model is considered our base model on the MCQ and DAC datasets.
• M1: The recurrent CNN (R-CNN) is employed as a base model on the two datasets.
• An extractive algorithm from the literature is employed as a base model on the two datasets.
• On the MCQ dataset, we employed a combination of an abstractive model and a pointer network as an abstractive base model (M2).
• We also used the topology graph phrase model (M3), a graph model, as a base model for the DAC dataset, as it reaches better accuracy on DAC.
We started the word embedding process by pre-training on the MCQ dataset. The validation subset was utilized to tune the parameters. The embedding dimension was fixed to 130 and the hidden state dimension was set to 230. The concatenation in the Bi-RCN yields a dimension of 460 for the word and phrase encoders. Both of them have attention context vectors of dimension 460. The vocabulary was bounded to 153 k words. We fixed the maximum phrase length to 53 words and the maximum count of phrases per text to 130. During training, each batch included 64 documents, and CNN was utilized for training. At testing, we ordered the output prediction probabilities of phrase membership and then selected the phrases with the highest probabilities until the final compression rate was reached.
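The test-time selection step can be sketched as below. Expressing the compression rate as a word budget (`max_words`), stopping at the first phrase that would exceed it, and returning the phrases in document order are assumptions made for illustration; the function name is hypothetical.

```python
def select_summary(phrases, probs, max_words):
    """Pick phrases by descending probability until the word budget is reached."""
    # Rank phrase indices from most to least probable.
    order = sorted(range(len(phrases)), key=lambda i: probs[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        n_words = len(phrases[i].split())
        if used + n_words > max_words:
            break                      # budget reached: stop selecting
        chosen.append(i)
        used += n_words
    # Assumption: the summary presents phrases in their original order.
    return [phrases[i] for i in sorted(chosen)]

phrases = ["a b c", "d e", "f g h i", "j"]
probs = [0.9, 0.2, 0.8, 0.5]
summary = select_summary(phrases, probs, max_words=8)
```

The ranking sort dominates the cost of this step, at O(N log N) for a document with N phrases.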
3.3 Model Evaluation
In this research, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics (R) are employed for the evaluation of the produced summaries. ROUGE metrics compare the n-grams of the summary under test with those of a human-written reference summary. ROUGE is calculated as in Eq. (20).
• Note 1: To guarantee that the recall metric is not biased by the summary length, we utilize the −l73 parameter in ROUGE to trim long text summaries in the DAC dataset.
• Note 2: It is observed that all the baselines use the F1 score as a metric on the whole MCQ dataset, as abstractive models learn when to stop producing words for the text summary. To guarantee an unbiased comparison, we use the same score.
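The n-gram comparison behind ROUGE-N, together with the recall and F1 variants mentioned in the notes, can be sketched as follows. This is a simplified illustration and not the official ROUGE toolkit: it assumes a single reference, lowercased whitespace tokenization, and no stemming or length trimming.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall, precision and F1 against a single reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped matches: an n-gram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
```

In this example five of the six reference unigrams are matched, so recall, precision and F1 all equal 5/6; with `n=2`, three of the five reference bigrams are matched.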
3.4 Experimental Results
The ROUGE software kit and the software in [41–43] are utilized for testing. This metric is applied with the parameters stated in Note 1 and Note 2. We compared the proposed model with other extractive baseline models using two datasets, MCQ and DAC. The output of the testing phase is compared to the human-produced summaries. The test results, depicted in Tabs. 1 and 2 and Figs. 3 and 4, affirm that the proposed model achieves better results. From the results, we can note the following:
• As depicted in Tab. 1 and Fig. 3, the results indicate that our model outperforms other models on the DAC dataset. This affirms that the tree self-attention structure can lead to better phrase and text representations and improves the extracted features, which can be utilized to produce high performance.
• In the case of the MCQ dataset, Tab. 2 and Fig. 4 show the high performance of our model vs. the baseline models in terms of ROUGE measures.
• In many articles, the most significant information is at the beginning. This explains the high ROUGE metric value of the LP4 baseline on the DAC dataset; however, our proposed model has achieved higher performance.
• While ROUGE counts the n-gram overlap between the produced summary and the compared reference, summaries with better ROUGE values are not necessarily the most understandable summaries. One issue of automatic summarization techniques is that optimizing for a specific metric value like ROUGE does not guarantee an enhancement in the readability of the produced summary [23–27]. This explains the high ROUGE value of the abstractive summarization baselines compared in this work.
• Another matter linked to the ROUGE value is that the consistency of the ROUGE score improves as the count of reference text summaries for each text document increases. This rigidity of ROUGE causes the ROUGE values on datasets with few reference text summaries per document to be lower than those on datasets that have several reference text summaries [25–29].
• It should be noted that our model attained better results than the state-of-the-art extractive summarization methods.
We conducted a comparative study of our model vs. similar summarization models in terms of recall, precision and the harmonic average of precision and recall (F-measure).
3.5 Computational Time
Computational time provides a good basis for comparing summary construction. We compared our model with other extractive text summarization models: LP4, M1, M2 and M3. We compared the time cost of the models on the MCQ and DAC datasets individually (Figs. 5 and 6). As illustrated in the figures, our proposed model exhibits an advantage in the time cost needed to extract a phrase for the output summary.
Experiments showed that selecting phrases for the output summary has a time complexity in the order of N log(N), where N is the number of phrases in the text document before summarization. Training has a time complexity in the order of M (N log(N)), where M is the number of texts used in the summarization training process. We computed the time cost that our model needed in training to form the final summarization text, as depicted in Fig. 7.
3.6 Summarization Time Cost
The time cost required to answer a summarization request is defined as the number of phrases that the model has to visit. For 2500 requests, we counted the phrases selected per request. Fig. 8 illustrates the results for phrases with different numbers of words. The results indicate that, when answering a request, our model navigates a smaller portion of the dataset than other models.
Our proposed model uses the attention technique to produce phrase embeddings. The experiments show high performance results for our model. Our results emphasize that the embeddings resulting from self-attention provide a better state representation and improve the summarization quality. Our model achieves higher performance than similar models on the MCQ and DAC datasets. This work differs from other models in three aspects. First, it utilizes the tree attention model, which better reflects the document structure. Second, it utilizes the self-attention module, which generates effective embeddings. Third, the extracted features are weighted in the learning phase, taking into account the previously classified phrases. We believe that incorporating reinforcement learning with a phrase-to-phrase training goal is a promising path for future research. Further effort should also focus on proposing new evaluation metrics other than the ROUGE score to improve the summarization task, particularly for lengthy phrases.
Acknowledgement: We would like to thank Princess Nourah bint Abdulrahman University for funding our project through the Researchers Supporting Project Number (PNURSP2022R113), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Funding Statement: This research was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project Number (PNURSP2022R113), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|