Key information extraction can reduce dimensionality effects while correctly evaluating user preferences during semantic data analysis. Currently, classifiers are used to maximize the performance of web-page recommendation in terms of precision and satisfaction. A recent method robustly disambiguates contextual sentiment using conceptual prediction; however, it cannot yield the optimal solution. Context-dependent terms are primarily evaluated by constructing a linear space of context features, on the presumption that terms occurring together in consumer-related reviews are semantically dependent and that the more frequently they co-occur, the greater the semantic dependency. However, the influence that co-occurring terms have on each other can account for part of the frequency attributed to their semantic dependence, because the terms are non-integrative and their individual meanings cannot be derived. In this work, we treat the strength of a term and the influence of a term as a combinatorial optimization problem, called Combinatorial Optimized Linear Space Knapsack for Information Retrieval (COLSK-IR). COLSK-IR is formulated as a knapsack problem in which the total weight is the term influence and the total value is the term frequency for semantic data analysis. Considering the term influence and the term frequency jointly to identify optimal solutions is what we refer to as combinatorial optimization. We therefore solve the knapsack as an integer programming problem and perform multiple experiments using the linear space through combinatorial optimization to identify the possible optimum solutions. Our experimental results show that COLSK-IR performs better than previous methods at detecting strongly dependent snippets, with minimum ambiguity, that are related to inter-sentential context during semantic data analysis.
Due to the wide popularity of user reviews in online media, a vast amount of content has been generated over the past several years. An approach that treats the disambiguation of the context-based sentiment polarity of words as an information recovery problem was presented in [
Semantic data analysis is a field of study in which specific data in a particular domain are analyzed by inputting a query from the search engine. Existing applications have shown that there is vast market potential for semantic data analysis [
Multiple stages of semantic composition for context-sensitive scalar objectives using the time window model are presented in [
The main goal of this work is to build a combinatorial optimization method that considers inter-sentential context at the lowest level of granularity, using a linear space with a knapsack, called Combinatorial Optimized Linear Space Knapsack for Information Retrieval (COLSK-IR). Instead of relying on snippets and manually labeled datasets to capture diverse kinds of non-integrative terms, the proposed method derives an individual snippet influence term and a query influence through combinatorial factor determination.
Key information extraction is a fundamental technique in information retrieval evaluation and has attracted attention for decades. Based on news corpora, multi-word expression extraction using context analysis and model-based analysis is provided in [
In [
Query facets provide us with essential knowledge related to a query and hence are used to enhance the search experience in several ways. An automatic mining model through extraction and grouping of frequent lists is presented in [
Another graph-based approach that automatically builds a taxonomy, maximizing the overall associative strength, is presented in [
To enhance the efficiency of latent semantic models in web search, meta-features are created in [
Our study covers both the detection of strongly dependent snippets and the reduction in ambiguity related to inter-sentential context, to test whether the sarcastic use of a word is an influential factor in the COLSK-IR method. The work also covers knapsack-based combinatorial optimization for semantic data analysis as a possible way to obtain evidence for an effective semantic linear space representation.
The contextual polarity of a word [
The basic idea behind the COLSK-IR method is illustrated with a set of items, each of which has a known weight and value. The combinatorial optimization model determines the number of each item to include in the set so that the total weight is less than or equal to the given limit and the total value is as large as possible. The block diagram of COLSK-IR is shown in
The basic COLSK-IR method consists of substituting a keyword ‘
Let ‘
The iterative procedure ‘
Then the function ‘
If the sequence ‘
With the perturbed sets obtained from (7), linear space is generated for semantic data analysis. Let ‘
From (8), ‘
As shown in
From (9), ‘
The linear spaces of the snippets on non-integrative queries that will commonly occur in non-identical contexts will have entries with low absolute values. However, for integrative queries, substituting a snippet with its synonym yields constructions that are likely to occur in a number of contexts that are different from the original. They have dissimilar contextual statistics and thus greater distance ‘
1: |
2: |
3: |
4: |
5: Measure the non-integrative key using (3) |
6: Measure the integrative key using (4) |
7: Obtain the perturbation sets for query ‘ |
8: Measure the individual snippet influence using (9) |
9: Measure the query influence using (12) |
10: |
11: |
12: |
13: |
The algorithm for strongly identifying the dependent query terms with the aid of non-integrative nature is analyzed and shown in
1: |
2: |
3: |
4: |
5: Obtain the maximization formulation using (13) |
6: Design the constraints using (14) |
7: |
8: |
9: |
10: |
With the combinatorial optimized factors, although ambiguity related to inter-sentential context is reduced, the time required to evaluate a query increases. To address this, a knapsack-based combinatorial optimization for semantic data analysis is constructed. Selecting the strongly dependent snippets and inter-sentential context into the cache is a ‘
Given a knapsack with capacity ‘
From (13), ‘
The objective behind the design of the proposed work is the consideration of optimal solutions. From (14), the proposed work states that the total snippets cannot exceed the query size or capacity ‘
For example, consider a Tripadvisor dataset consisting of reviews randomly selected from several accommodations. To obtain the maximization, the formulation in (13) is applied subject to the design constraints in (14), considering two kinds of snippets: room-file snippets and value-file snippets. Under these design constraints, the optimal solutions are identified, thereby meeting the objectives.
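The selection step described above can be sketched as a 0/1 knapsack solved by dynamic programming. This is a minimal illustration rather than the authors' implementation: the function name `knapsack_select`, the restriction to integer-valued influences, and the sample numbers are assumptions made for the example. The source specifies only that term influence plays the role of weight, term frequency the role of value, and the query size the role of capacity.

```python
from typing import List, Tuple


def knapsack_select(
    frequencies: List[int], influences: List[int], capacity: int
) -> Tuple[int, List[int]]:
    """Pick snippets maximizing total frequency under an influence budget.

    frequencies: value of each snippet (term frequency).
    influences:  weight of each snippet (term influence); assumed here to be
                 non-negative integers so a table-based solution applies.
    capacity:    upper bound on the total weight (the query size).
    Returns (best total frequency, indices of the chosen snippets).
    """
    n = len(frequencies)
    # best[i][c] = max total frequency using the first i snippets with budget c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        weight, value = influences[i - 1], frequencies[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip snippet i-1
            if weight <= c:  # option 2: take it if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - weight] + value)

    # Walk the table backwards to recover which snippets were taken.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= influences[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


# Hypothetical snippet data, not taken from the Tripadvisor experiments.
total, picked = knapsack_select([60, 100, 120], [10, 20, 30], capacity=50)
print(total, picked)  # 220 [1, 2]
```

The table has `(n + 1) * (capacity + 1)` entries, so this approach is practical only when the capacity is a moderately sized integer; real-valued influences would need scaling or a general integer programming solver.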
The queries were simulated and the performance was measured. The COLSK-IR method was evaluated [
The dataset of approximately 200 reviews was taken from Tripadvisor.com through a random selection. It covered all five satisfaction levels (40 reviews in each level) consisting of 1,382 criticisms, 211 non-criticisms and 97 criticisms with errors. The information was collected from Tripadvisor and Edmunds. Tripadvisor had 259,000 reviews.
The experiment was conducted based on factors such as number of reviews, non-integrative key extraction time, recall rate, precision and semantic data analysis efficiency. To evaluate the performance of the COLSK-IR method, its semantic data analysis was compared with two existing methods: Polarity Similarity (PolaritySim) and Domain Ontology of Web Pages (DomainOntoWP).
The performance of COLSK-IR for semantic data analysis was compared with that of Polarity Similarity (PolaritySim) and Domain Ontology of Web Pages (DomainOntoWP). The experiments measured non-integrative key extraction time, precision rate and recall rate for up to 150 reviews, using the method described in Section 3.
The non-integrative key extraction time is the time required to extract the non-integrative key (i.e., the extracted keys) with respect to the total number of reviews in web pages. It is measured as given below.
From (15), ‘
No. of reviews | COLSK-IR key extraction time (ms) | PolaritySim key extraction time (ms) | DomainOntoWP key extraction time (ms) |
---|---|---|---|
15 | 4.15 | 7.45 | 8.3 |
30 | 7.13 | 10.13 | 12.54 |
45 | 11.17 | 13.14 | 17.43 |
60 | 15.32 | 17.21 | 24.24 |
75 | 20.13 | 22.14 | 30.16 |
90 | 28.32 | 31.32 | 36.25 |
105 | 33.14 | 35.79 | 42.39 |
120 | 36.14 | 39.32 | 43.21 |
135 | 38.25 | 41.15 | 45.61 |
150 | 41.43 | 44.23 | 49.12 |
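As a rough aid to reading the table, the per-row percentage reduction in extraction time can be computed directly from the tabulated values. The sketch below is illustrative only: the helper name `mean_reduction_pct` and the choice of a mean of per-row reductions as the summary statistic are assumptions, not measures defined in the source.

```python
# Values copied from the extraction-time table above (milliseconds).
reviews = [15, 30, 45, 60, 75, 90, 105, 120, 135, 150]
times_ms = {
    "COLSK-IR": [4.15, 7.13, 11.17, 15.32, 20.13, 28.32, 33.14, 36.14, 38.25, 41.43],
    "PolaritySim": [7.45, 10.13, 13.14, 17.21, 22.14, 31.32, 35.79, 39.32, 41.15, 44.23],
    "DomainOntoWP": [8.3, 12.54, 17.43, 24.24, 30.16, 36.25, 42.39, 43.21, 45.61, 49.12],
}


def mean_reduction_pct(baseline, proposed):
    """Mean of the per-row percentage reductions of `proposed` vs `baseline`."""
    per_row = [100.0 * (b - p) / b for b, p in zip(baseline, proposed)]
    return sum(per_row) / len(per_row)


for name in ("PolaritySim", "DomainOntoWP"):
    reduction = mean_reduction_pct(times_ms[name], times_ms["COLSK-IR"])
    print(f"Mean reduction vs {name}: {reduction:.2f}%")
```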
Results are presented for 10 different review counts. The non-integrative key extraction time for these 10 review counts measures the time taken for convergence on different reviews as in (1). The reported results confirm that the non-integrative key extraction time increases with the number of reviews. The experiments were repeated for up to 150 reviews, as illustrated in
Precision rate refers to the number of relevant snippets extracted with respect to the number of returned snippets, i.e.,
From (16), the precision rate ‘
No. of reviews | COLSK-IR precision rate (%) | PolaritySim precision rate (%) | DomainOntoWP precision rate (%) |
---|---|---|---|
15 | 77.51 | 67.94 | 61.28 |
30 | 74.31 | 64.30 | 58.22 |
45 | 72.54 | 62.52 | 56.44 |
60 | 70.38 | 60.35 | 54.27 |
75 | 68.25 | 58.22 | 52.14 |
90 | 82.99 | 72.96 | 66.87 |
105 | 89.95 | 80.92 | 74.83 |
120 | 90.14 | 82.14 | 78.32 |
135 | 92.23 | 87.13 | 82.14 |
150 | 94.14 | 89.13 | 86.27 |
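The precision rate reported in the table follows the verbal definition given earlier: relevant snippets extracted, relative to all returned snippets. A minimal sketch of that ratio is shown below. The function name `precision_rate`, the use of sets of snippet identifiers, and the sample identifiers are assumptions for illustration; the percentage scaling is inferred from the table's units.

```python
def precision_rate(returned, relevant):
    """Percentage of returned snippets that are also relevant."""
    returned_set, relevant_set = set(returned), set(relevant)
    if not returned_set:
        return 0.0  # nothing returned: define precision as 0 to avoid dividing by zero
    return 100.0 * len(returned_set & relevant_set) / len(returned_set)


# Hypothetical snippet identifiers, not drawn from the experimental data.
returned = ["s1", "s2", "s3", "s4"]
relevant = ["s1", "s2", "s5"]
print(precision_rate(returned, relevant))  # 50.0
```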
To increase the precision of semantic data analysis for web pages, first approximation, second approximation, and ‘
In the experimental setup, the number of reviews ranged from 15 to 150. The results for 10 different types of reviews collected from Tripadvisor and Edmunds are shown in
Recall rate measures the number of relevant snippets extracted with respect to the number of relevant snippets, i.e., the number of extracted relevant snippets returned by the web page ‘
From (17), the recall rate ‘
No. of reviews | COLSK-IR recall rate (%) | PolaritySim recall rate (%) | DomainOntoWP recall rate (%) |
---|---|---|---|
15 | 94.36 | 83.51 | 76.29 |
30 | 90.16 | 79.29 | 71.23 |
45 | 87.29 | 76.42 | 69.36 |
60 | 79.33 | 68.46 | 61.40 |
75 | 83.29 | 72.42 | 65.36 |
90 | 87.90 | 76.25 | 69.19 |
105 | 90.43 | 84.56 | 77.50 |
120 | 91.18 | 84.13 | 79.13 |
135 | 93.14 | 86.78 | 82.45 |
150 | 94.18 | 89.13 | 85.10 |
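The recall rate in the table can be read against its verbal definition: relevant snippets extracted, relative to all relevant snippets. The sketch below shows that ratio under stated assumptions: the function name `recall_rate`, the set-based representation, and the sample identifiers are hypothetical, and the percentage scaling is inferred from the table's units.

```python
def recall_rate(returned, relevant):
    """Percentage of relevant snippets that were actually returned."""
    returned_set, relevant_set = set(returned), set(relevant)
    if not relevant_set:
        return 0.0  # no relevant snippets exist: define recall as 0
    return 100.0 * len(returned_set & relevant_set) / len(relevant_set)


# Hypothetical snippet identifiers, not drawn from the experimental data.
returned = ["s1", "s2", "s3"]
relevant = ["s1", "s2", "s4", "s5"]
print(recall_rate(returned, relevant))  # 50.0
```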
As shown in
This paper proposes a Combinatorial Optimized Linear Space Knapsack for Information Retrieval (COLSK-IR) to overcome the difficulty of detecting strongly dependent snippets and reducing the ambiguity related to inter-sentential context. It shows how the method can be extended, based on the knapsack problem, to account for the time required to evaluate a query for efficient semantic data analysis. Two algorithms are provided: Linear Space Context Dependent and Knapsack Combinatorial Optimization. The Linear Space Context Dependent algorithm identifies and manages strongly dependent snippets based on the influence and frequency of snippets. The Knapsack Combinatorial Optimization algorithm reduces the ambiguity related to inter-sentential context by formulating an integer programming problem to determine the optimal solutions. The experimental results show that COLSK-IR performs better than the state-of-the-art methods in terms of non-integrative key extraction time, precision and recall rate.