Given the increasing volume of text documents, automatic summarization is an important tool for quick and effective use of such sources. Automatic summarization is a text compression process that produces a shorter document giving quick access to the main goals and features of the input document. In this study, a novel method is introduced for extractive text summarization using the genetic algorithm and the generation of repetitive patterns. An important feature of the proposed summarizer, compared with previous methods, is that it identifies and extracts the relationships between the main features of the input text and creates repetitive patterns in order to produce and optimize the feature vector of the main document used in generating the summary. The method attempts to satisfy all the main requirements of a summary text, including lack of ambiguity, high precision, continuity, and consistency. To investigate the efficiency of the proposed algorithm, the results of the study were evaluated with respect to the precision and recall criteria. The evaluation showed that the method optimizes the dimensions of the feature vector and generates a sequence of summary sentences that is most consistent with the main goals and features of the input document.
Given the increasing spread of electronic text documents, acquiring their important features and information and utilizing such resources accurately in the shortest possible time has been one of the main challenges for researchers in digital text information retrieval. A fundamental solution to this challenge is the production and use of automatic summarization tools that generate a summary document containing the main features and information of the input document.
Text summarization is a process that receives the original text from the input source and produces its most important content in a compact format based on user needs [
The history of automatic summarization dates back to the 1950s [
In general, a summary document is produced in one of two ways: extractive summarization or abstractive summarization. In extractive summarization, important sentences are selected from the original text without alteration and written exactly as they are into the summary text. The main advantage of this method is the simplicity of the selection system, whereas its primary disadvantage is that the coherence of the summary text is not guaranteed. In abstractive summarization, concepts and interpretations extracted from the original text are written into the summary document, and the sentences of the summary are not necessarily sentences of the input document.
The techniques used in extractive summarization fall into three categories: statistical, linguistic, and hybrid methods [
The structure of this paper is as follows: Section 2 reviews the literature on summarizing Persian texts and texts in other languages. Section 3 describes the proposed method. Section 4 addresses the implementation and evaluation of the proposed method, and Section 5 presents the conclusion and suggestions for future work.
Research in the field of summarization began with Luhn (1958), who used the distribution of words in sentences to identify the keywords of a document. Since then, a variety of approaches have been proposed, with most new methods aiming to improve on previous ones. The method proposed by [
The FarsiSum Persian Automatic Summarization System, presented in [
Automatic summarization of online debatable texts using the vector space model [
The proposed model is a text summarization extraction method [
The system provided by [
In this research, a new hybrid method is proposed for extractive summarization of single-document texts, based on reducing and optimizing the dimensions of the input document's feature vector by identifying and extracting repetitive patterns. In the proposed method, a sequence of input document sentences is selected using the optimized feature vector and the genetic algorithm, and a summary text is generated that covers all the main goals and characteristics of the input document. The proposed summarization process consists of three steps: preprocessing, optimization of the feature vector by creating repetitive patterns, and summarization, as presented in
The main advantages of the proposed idea are the optimization of the feature vector and the high precision in choosing sentences of the original document for the summary. The method also reduces inconsistency and ambiguity in the summary document to a desirable level.
Before the summarization operation, in order to produce a more precise summary document, the input text must be transformed into a uniform, normalized form. Preprocessing includes operations such as normalization, segmentation, part-of-speech tagging, stemming, and stop-word removal. For Persian texts, one tool used at this stage is ParsProcessor, a product developed and presented by the National Center for Cyberspace (Iran Telecommunication Research Center). Evaluation of this tool shows a precision of 98% for tagging and 100% for normalization.
The preprocessing process begins by standardizing and homogenizing the original text. One of the main problems in processing texts, including Persian texts, is the existence of multiple written forms of the same word, which makes it difficult to recognize identical words. To solve this problem, in the first step we normalize the input text corpus.
After the normalization operation of the previous step, the part of speech of each term in a sentence (noun, verb, adjective, conjunction, etc.) is tagged for use in the subsequent steps.
In the segmentation step, the boundaries of terms and sentences in the input document are identified using separating marks. Segmentation is performed to identify the words and sentences of the original text; the boundary of a sentence is determined by verbs, conjunctions, or other attributes.
Stop words are words that occur frequently in text documents but are not descriptive and are not related to a specific subject. They carry little meaning and include conjunctions, linking verbs, pronouns, prepositions, and various adverbs.
In the stemming stage, we identify and extract the root of each word in every sentence of the input text.
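The preprocessing steps above (segmentation, stop-word removal, and stemming) can be sketched as follows. This is a minimal illustration in Python with a toy English stop-word list and a naive suffix stripper; the paper itself uses the ParsProcessor tool for Persian, whose interface is not shown here.

```python
import re

# Hypothetical stop-word list and suffix set, for illustration only;
# the paper uses the ParsProcessor tool for Persian text.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "are"}
SUFFIXES = ("ing", "ed", "s")

def segment_sentences(text):
    """Split the input text into sentences at end-of-sentence marks."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def stem(word):
    """Very naive suffix-stripping stemmer (illustrative only)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    """Return, per sentence, the stemmed tokens with stop words removed."""
    return [
        [stem(w) for w in tokenize(s) if w not in STOP_WORDS]
        for s in segment_sentences(text)
    ]

sentences = preprocess("The players are playing. A goal is scored!")
```

The output of this stage, a list of stemmed token lists (one per sentence), is the input to the keyword-extraction step that follows.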
In the proposed method, the main operation of selecting sentences and producing the summary text is carried out in this phase. By identifying the relationships between the keywords of texts from different subject classes (sports, economics, politics, science, etc.), we create a dataset of repetitive patterns. We then use the genetic algorithm, exploiting the repetitive patterns extracted from the input document, to generate a summary text with the greatest similarity to the original text, see
The TF-IDF (Term Frequency-Inverse Document Frequency) method is widely used for weighting and was originally introduced for information retrieval. It weights a term according to its frequency within the document and its inverse frequency across the document collection [
In the above formula,
The ease of use of this method is why it is more widely accepted than other methods. Therefore, given the existence of classified political, sports, economic, and other texts, keywords are obtained using the
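As a concrete illustration, one common TF-IDF variant can be computed as follows. The tokenized example documents and the exact normalization (relative term frequency, natural-log IDF) are assumptions, since the paper's formula is not reproduced here.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for each term of each document.

    docs: list of documents, each a list of tokens.
    Returns a list of {term: weight} dicts, one per document.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

# Terms occurring in every document get weight 0; rarer terms score higher,
# so the top-weighted terms can serve as a document's keywords.
docs = [["goal", "team", "match"], ["vote", "team", "law"], ["goal", "goal", "win"]]
w = tfidf(docs)
top = max(w[2], key=w[2].get)  # highest-weighted term of document 2
```

In practice the top-k terms per document (or per subject class) would be kept as keywords for the pattern-mining step.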
Apriori is a classical algorithm introduced for extracting association rules. Extracting association rules involves two steps: (1) recognition of frequent (repetitive) itemsets; (2) creation of association rules from those itemsets. Frequent itemsets are extracted in two stages: first, candidate itemsets are generated, and then the frequent itemsets are extracted from the candidates. A frequent itemset is simply one whose support is at least the user-specified minimum support [
Now, to apply the Apriori algorithm and extract repetitive patterns from the texts, the keywords obtained in the previous step are treated as items and each text as a transaction; each text is then examined to determine which words are keywords. After identifying the keywords, we put a one in the columns corresponding to keywords present in the text and a zero in the rest, as in the table in
Then, in order to find patterns that are repeated frequently across all texts, they must be converted to the Apriori input format according to
As shown above, in Phase 1 the keyword table is built in the format required as input to the Apriori algorithm. In Phase 2, we create the collection of repetitive patterns. Here, we set the threshold (minimum support) value to 0.5; candidate sets whose frequency falls below the threshold are discarded, and the process continues until the final repetitive patterns are detected. For example, if k keywords have been extracted from all sports texts, n repetitive patterns among those k keywords are identified by the repetitive-pattern mining (
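The two phases described above can be sketched with a minimal Apriori implementation. The example transactions and the support threshold of 0.5 follow the text, while the join step is a standard textbook formulation rather than the paper's exact code.

```python
def apriori(transactions, min_support=0.5):
    """Return all itemsets whose support meets min_support.

    transactions: list of sets of keywords (one set per text).
    Support of an itemset = fraction of transactions containing it.
    """
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Candidate 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    current = {s for s in items if support(s) >= min_support}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        # Join step: merge frequent k-itemsets into candidate (k+1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        current = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

# Four toy "sports" texts reduced to their keyword sets.
texts = [{"goal", "team"}, {"goal", "team", "win"}, {"goal", "match"}, {"team", "win"}]
patterns = apriori(texts, min_support=0.5)
```

Here `{"goal", "team"}` survives as a repetitive pattern (support 0.5), while `{"match"}` is pruned (support 0.25).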
Note that if a repetitive pattern occurs in two different subject classes, we eliminate that set entirely: a pattern that is both political and sports-related is not discriminative and carries no information.
In the end, we collect the selected repetitive patterns from all classes into one set: for example, 20 patterns for sports, 25 for politics, 14 for economics, and so on, for a total of 100 patterns.
Now, every text in any domain is considered a record, and the columns are the repetitive patterns.
The term-document matrix is an incidence matrix describing the occurrence of terms in a set of documents. In this matrix, the rows represent documents and the columns represent terms. There are various ways to determine the value of each entry; here we use a binary vector. In the columns of the matrix, instead of all terms, we use the repetitive patterns obtained in the preceding steps, so that if document i contains repetitive pattern j, then
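A minimal sketch of this binary pattern-document matrix, assuming each document is represented as a set of stemmed terms and each repetitive pattern as a set of keywords:

```python
def pattern_document_matrix(docs, patterns):
    """Binary incidence matrix: rows are documents, columns are
    repetitive patterns; entry [i][j] = 1 if document i contains
    every keyword of pattern j, else 0."""
    return [
        [1 if pattern <= doc_terms else 0 for pattern in patterns]
        for doc_terms in docs
    ]

docs = [{"goal", "team", "win"}, {"vote", "law"}]
patterns = [{"goal", "team"}, {"vote"}]
m = pattern_document_matrix(docs, patterns)  # [[1, 0], [0, 1]]
```

Each row of this matrix is the pattern vector of one document; these vectors are what the genetic algorithm compares in the next section.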
The genetic algorithm is a special type of evolutionary algorithm that uses techniques inspired by evolutionary biology, such as inheritance, mutation, and Darwinian selection, to find an optimal formula for prediction or pattern matching. In modeling, the genetic algorithm is a programming technique that uses genetic evolution as a problem-solving model: the inputs to the problem are converted into candidate solutions through a process of genetic evolution, the candidates are evaluated by a fitness function, and the algorithm terminates when the stopping condition is met. We use this algorithm in this paper: by generating different generations of candidate summaries of a document, we obtain the best summary text based on a cost function.
The genetic algorithm is by nature an iterative algorithm in which most choices are made randomly; it is composed of fitness, representation, selection, and variation operators. Before a genetic algorithm is run on a problem, a method for encoding chromosomes in a computer-processable form must be chosen. One of the usual encodings is binary strings, i.e., strings of 0s and 1s; we use this encoding in this paper.
For any problem, before the genetic algorithm can be used to find an answer, two elements are needed: first, a method for representing an answer in a form on which the genetic algorithm can operate; second, as the fundamental component of the genetic algorithm, a method for computing the quality of each proposed solution using a fitness function. Both are discussed below.
In this paper, we use the genetic algorithm to randomly generate several temporary summary documents for the problem, called the initial population, and consider each document a chromosome; the length of a chromosome equals the number of sentences in the original document. During each generation, the fitness of each chromosome is evaluated by the fitness function, chosen in this study to be the cosine similarity. After selecting the best chromosomes, we cross the chromosomes with one another and apply mutation to them. Finally, we combine the current population with the new population resulting from crossover and mutation, as in
This process creates a new generation of chromosomes, different from the previous one, from which the most appropriate summary document can be extracted. The entire process is repeated for the next generation: pairs of documents are selected for crossover, the third-generation population is created, and so on, until one of the termination conditions of genetic algorithms occurs, such as reaching a fixed number of generations or exhausting the dedicated computation time.
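The generational loop described above can be sketched as follows. The population size, mutation rate, selection scheme (elitist truncation), and one-point crossover are illustrative assumptions; the paper fixes only the binary encoding and the cosine-based fitness, which is passed in here as a parameter.

```python
import random

def genetic_summarizer(n_sentences, fitness, pop_size=20, generations=50,
                       mutation_rate=0.1, seed=0):
    """Evolve binary chromosomes (gene = 1 means the sentence is kept).

    fitness: function mapping a chromosome (list of 0/1) to a score,
    e.g. cosine similarity between the candidate summary's pattern
    vector and the original document's pattern vector.
    """
    rng = random.Random(seed)
    # Initial population: random temporary summaries.
    pop = [[rng.randint(0, 1) for _ in range(n_sentences)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # selection: keep the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_sentences)  # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_sentences):         # mutation: flip random bits
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children                  # combine old and new populations
    return max(pop, key=fitness)

# Toy fitness: reward chromosomes close to a known target selection.
target = [1, 0, 1, 0, 1, 0, 0, 0]
best = genetic_summarizer(8, lambda c: -sum(abs(x - y) for x, y in zip(c, target)))
```

With a real fitness function, the returned chromosome indicates which sentences of the original document form the summary.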
For example, suppose the text sport 1 contains 10 sentences whose term-document matrix, in terms of repetitive patterns, was obtained in the previous section; now, using the above algorithm, we extract the best summary for this single text by the genetic method.
Phase 1: In accordance with phase one of the above algorithm, a chromosome with 8 genes is produced using a random binary generator, and the initial population of the algorithm for text 1 is created. Each chromosome represents a summary document composed of randomly selected sentences of the text. For every chromosome produced, a Temp Vector is created based on the repetitive patterns (
Phase 2: The second phase of the above algorithm, the most important one, performs the evaluation. In this phase, we use the cosine similarity
The angle between the two vectors according to
In this paper, we first obtained a Main Vector based on the term-document matrix for each original document
Using
After obtaining the angle between the vectors by the cosine similarity criterion according to
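The cosine fitness itself reduces to a few lines; the example Main Vector and Temp Vector values below are illustrative, not taken from the paper's tables.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two pattern vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Main vector of the original document vs. the vector of a candidate summary,
# both over the same repetitive patterns (values are illustrative).
main_vec = [1, 1, 0, 1]
temp_vec = [1, 1, 0, 0]
score = cosine_similarity(main_vec, temp_vec)  # 2 / sqrt(6) ≈ 0.816
```

A score closer to 1 means the candidate summary covers the original document's repetitive patterns more completely.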
Phase 3: In order for the summary texts to maximize semantic and structural similarity to the original document, we create new generations by changing the selection of sentences, so that the similarity of the summarized texts becomes more precise. For example, in the crossover operation we take the two chromosomes with the highest similarity rank to the input document and randomly combine their sentences, obtaining two new chromosomes that may achieve a higher similarity score than their parents. A mutation can also be applied to a chromosome; for example, two genes can be swapped to produce a new chromosome.
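The crossover and gene-swap mutation described in this phase might be sketched as follows; the one-point crossover form is an assumption, while the two-gene swap follows the text.

```python
import random

def crossover(parent_a, parent_b, rng):
    """One-point crossover: exchange the tails of two parent chromosomes,
    producing two children that may score higher than their parents."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def swap_mutation(chrom, rng):
    """Mutation by swapping two gene positions, as described above."""
    i, j = rng.sample(range(len(chrom)), 2)
    child = chrom[:]
    child[i], child[j] = child[j], child[i]
    return child

rng = random.Random(1)
c1, c2 = crossover([1, 1, 1, 1], [0, 0, 0, 0], rng)
mutant = swap_mutation([1, 0, 0, 0], rng)
```

Both operators preserve the chromosome length, so every offspring still encodes a valid sentence selection for the same document.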
In this research, the final summary document contains a chain of sentences that should read continuously and without ambiguity. Notably, some sentences of the summary document may contain pronouns whose referent is unclear, which may be ambiguous for readers.
Examination of this problem shows that, in most cases, the referent of an ambiguous pronoun lies in the preceding sentence. Therefore, to eliminate the ambiguity of an unresolved pronoun, the preceding sentence from the original text is also written into the summary.
In this research, the proposed method was evaluated on a corpus of texts from five different subject categories (
| Subject | Number of documents |
|---|---|
| Sport | 1000 |
| Economics | 1000 |
| Politics | 1000 |
| Science | 1000 |
| Others | 1000 |
| Total number of documents | 5000 |
One of the main advantages of this method is the optimization of the feature vector using the dataset of repetitive patterns. After this dataset is generated, the summary document is produced using the genetic algorithm.
In this method, using the genetic algorithm, we first randomly generate several temporary summary documents, called the initial population, and consider each document a chromosome whose length equals the number of sentences in the original document. During each generation, each chromosome's fitness value is measured and evaluated using the cosine fitness function.
After choosing the better chromosomes, we cross them with one another and apply mutation. Finally, the current population is combined with the new population produced by crossover and mutation. In each generation, the most appropriate document, i.e., the one least different from the original document, is selected.
To evaluate the proposed method, we measured the results and efficiency of the proposed summarizer using the Precision and Recall criteria. These criteria are defined on the basis of the following
where
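Assuming the standard set-based definitions of these criteria over selected sentences (the paper's exact formulas are not reproduced here), they can be computed as:

```python
def precision_recall_f(system_sents, reference_sents):
    """Precision, recall, and F-measure over selected sentence sets.

    system_sents: sentence numbers chosen by the summarizer.
    reference_sents: sentence numbers in the reference (human) summary.
    """
    common = len(set(system_sents) & set(reference_sents))
    p = common / len(system_sents) if system_sents else 0.0
    r = common / len(reference_sents) if reference_sents else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Illustrative sentence numbers, not taken from the paper's data.
p, r, f = precision_recall_f({1, 3, 4, 6}, {1, 3, 7})
# p = 2/4 = 0.5, r = 2/3 ≈ 0.667, f ≈ 0.571
```

Precision measures how many selected sentences belong in the summary; recall measures how many reference sentences were recovered; F combines the two.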
The results of the evaluation of the proposed method are presented in the following
| Method | P | R | F |
|---|---|---|---|
| SweSum | 0.66 | 0.69 | 0.67 |
| Proposed method | 0.73 | 0.75 | 0.74 |
According to the results in the above table, the proposed summarization method achieves higher precision than the SweSum summarizer on a similar dataset, see
In order to evaluate the effectiveness of the proposed method in generating a desirable summary document, readers' satisfaction with the summaries produced by the proposed method on various topics was examined, and the summaries were also reviewed by 5 experts. The average satisfaction score from readers is given below (
Another evaluation criterion used is the average number of sentences shared between human summaries and those of the proposed method. To further illustrate the evaluation, the best and weakest human-made summaries (by 5 experts) are compared with the proposed summarizer below (
(A)

| Presence in the proposed method | Percentage of presence in the human summary | Selected sentence number |
|---|---|---|
| | 100 | 1 |
| | 100 | 3 |
| | 90 | 4 |
| | 50 | 6 |
| | 80 | 7 |
| | 70 | 9 |
| | 70 | 11 |
| | 50 | 12 |
| | 40 | 15 |
| | 40 | 18 |
| | 40 | 21 |
| | 30 | 25 |
| | 50 | 26 |
| | 80 | 29 |
| | 80 | 32 |

(B)

| Presence in the proposed method | Percentage of presence in the human summary | Selected sentence number |
|---|---|---|
| | 100 | 1 |
| | 100 | 2 |
| | 100 | 3 |
| | 80 | 5 |
| | 40 | 8 |
| | 60 | 11 |
| | 50 | 12 |
| | 50 | 14 |
| | 80 | 17 |
| | 80 | 20 |
| | 60 | 21 |
| | 90 | 23 |
| | 80 | 25 |
As the results show, when producing a summary document, the proposed method selects the important sentences of the original document, as identified by the experts, with high precision.
In this paper, we proposed a novel method for extractive text summarization using the genetic algorithm and the generation of repetitive patterns. An important feature of the proposed method, compared with previous methods, is the optimization of the main document's feature vector in producing the summary, achieved by identifying and extracting relationships between the main features of the input text and creating repetitive patterns.
In the proposed method, the summary document is produced using the genetic algorithm: several temporary summary documents are randomly generated for the problem and called the initial population, and each document is considered a chromosome whose length equals the number of sentences in the original document.
During each generation, the fitness of each chromosome is evaluated using the cosine fitness function. After choosing the best chromosomes, we cross them with one another and apply mutation. Finally, the current population is combined with the new population produced by crossover and mutation. In each generation, the algorithm selects the most appropriate summary document, i.e., the one least different from the original.
In evaluating the proposed method on a corpus of 5 subject categories, using the Precision and Recall criteria, we showed that the proposed method achieves higher performance (approximately 0.74) than the SweSum summarizer on similar data. Furthermore, the average reader satisfaction with the summary documents generated by the proposed method across various subjects is estimated at 0.71, and the method shows high precision in choosing the important sentences of the original document as selected by experts. The proposed idea also overcomes problems such as inconsistency and ambiguity in the summary text. Combining the proposed summarization process with semantic-relation techniques on the original document is suggested for future research.