|Computers, Materials & Continua |
Deep Neural Network and Pseudo Relevance Feedback Based Query Expansion
Department of Mathematics, Bio-Informatics and Computer Applications, Maulana Azad National Institute of Technology Bhopal, Bhopal, Madhya Pradesh, 462003, India
*Corresponding Author: Abhishek Kumar Shukla. Email: email@example.com
Received: 06 August 2021; Accepted: 13 September 2021
Abstract: The neural network has attracted researchers immensely in the last couple of years due to its wide applications in various areas such as Data mining, Natural language processing, Image processing, and Information retrieval etc. Word embedding has been applied by many researchers for Information retrieval tasks. In this paper word embedding-based skip-gram model has been developed for the query expansion task. Vocabulary terms are obtained from the top “k” initially retrieved documents using the Pseudo relevance feedback model and then they are trained using the skip-gram model to find the expansion terms for the user query. The performance of the model based on mean average precision is 0.3176. The proposed model compares with other existing models. An improvement of 6.61%, 6.93%, and 9.07% on MAP value is observed compare to the Original query, BM25 model, and query expansion with the Chi-Square model respectively. The proposed model also retrieves 84, 25, and 81 additional relevant documents compare to the original query, query expansion with Chi-Square model, and BM25 model respectively and thus improves the recall value also. The per query analysis reveals that the proposed model performs well in 30, 36, and 30 queries compare to the original query, query expansion with Chi-square model, and BM25 model respectively.
Keywords: Information retrieval; query expansion; word embedding; neural network; deep neural network
Over the years the web has growing exponentially and it has become difficult to retrieve the relevant documents as per the user query. The information retrieval system tries to minimize the gap between the user query and relevant documents. Various phases of the retrieval process are affected by the vagueness of the user query. For example novice user during the formulation of the query, might be uncertain in selecting the keyword to express his/her information need. The user has only a fuzzy idea about what he/she is looking for. Due to this retrieval system retrieves irrelevant documents along with relevant documents. Query expansion appends additional terms to the original query and helps in retrieving those additional relevant documents that were left out. Query expansion technique tries to minimize the word mismatch problem. Generally, queries are categorized into the following three main categories 
(1) Navigational queries
(2) Informational queries
(3) Transactional queries
Navigational queries are those queries that are searching a particular URL or website. Informational queries are those which search a broad area of the given query and may contain thousands of documents. Transactional queries are those which search user intention to execute some task like downloading or buying some items. In information retrieval, one method of query expansion could be the use of semantically similar terms to the original query. WordNet  based methods are one of the oldest methods for query expansion. It is a semantic-based approach that finds semantically similar terms of the original query terms by using synonyms, hyponyms and, meronyms of the query terms. Word embedding is a technique to find similar terms to the original query. Word2vev  and Glove  are the two well-known word embedding techniques to find the semantically similar terms to the original query terms for query expansion. Word2vec and Glove learns the word embedding vector in an unsupervised way using a deep neural network. Word2vec and Glove find the semantically similar term of original query terms using global document collection or external resources such as Wikipedia  or similarity thesaurus . The local method of query expansion searches the similar term of the original query using the Pseudo relevance feedback method. The pseudo relevance feedback method assumes that top “k” retrieved documents are relevant to the original query. It is observed that the local method of query expansion performs better than the global method of query expansion .
The proposed method uses a deep neural network-based query expansion method using the skip-gram model. In the proposed method of query expansion semantically similar terms of the original query are retrieved from top “k” initial retrieved documents using the Pseudo relevance feedback method. Semantically similar terms are retrieved by training the terms in top “k” initially retrieved documents using the skip-gram model. In the skip-gram model, we predict the context word of the given center word. The Skip-gram model uses an un- supervised deep neural network-based training method that successively updates the weight between two successive layers. The Skip-gram model assigns each term to a lower-dimensional vector compare to the vocabulary size, in a semantic vector space. The proposed method predicts the context word of each query term and then finds the union of these context words. The combined context words are treated as expansion terms for the given query terms. Fig. 1 shows the architecture of the proposed model.
2 Related Work
Query expansion plays important role in improving the performance of the retrieval system. The most common method of query expansion is to extract the expansion terms from an external data collection such as Anchor text, Query log, and external corpus. References [8,9] used anchor text as a data source. References [10,11] used query log for query expansion. They applied correlation between query terms and documents term. They collected data source from click-through of documents on URL. Reference  used query log as a bipartite graph where query nodes are connected to URL nodes by click edges and they showed an improvement of 10%. Reference  proposed co-occurrence-based document-centric probabilistic model for query expansion. A continuous word embedding-based technique for the document was proposed by . They reported that their model performs better than LSI based model but does not outperform TF-IDF and divergence from the randomness model. Reference  proposed supervised embedding-based term weighting technique for language modeling. Reference  proposed semantic similarities between vocabulary terms to improve the performance of the retrieval system. Reference  proposed word embedding technique in a supervised manner for query expansion. Reference  proposed word embedding-based word2vec based model for expanding the query terms. Using this model they extracted similar terms of the query terms using the K-nearest neighbor approach. They reported considerable improvement on TREC ad-hoc data. Reference  used Word2Vec and Glove for query expansion for ad hoc retrieval. Reference  used fuzzy method to reformulate and expand the user query using pseudo relevance feedback method that uses top ‘k’ ranked document as a data source. Reference  proposed a hybrid method that uses both local and global methods as a data source. The proposed method used a combination of external corpus and top ‘k’ ranked documents as a data source. Reference  used a combination of top retrieved documents and anchor text as a data source for query expansion. Reference  used query log and web search result as a data source for query reformulation and expansion. Reference  used Wikipedia and Freebase to expand the initial query. Reference  used a fuzzy-based machine learning technique to classify the liver disease patients. Reference  proposed a machine learning technique that diagnoses breast cancer patients using different classifiers.
3 Query Expansion Using Deep Learning
Deep learning is a technique that is used in almost every area of computer science to learn something. In information retrieval, continuous word embedding is widely used to improve the mean average precision (MAP). There are following two deep learning approaches of word embedding technique
(1) The Continuous Bag of Words model (CBOW) 
(2) The Skip-gram model
Continuous bag of word model and Skip-gram model is widely used in query expansion method [28,29]. A continuous bag of word model is used to predict the center word of given context words. The Skip-gram model is just the opposite of the CBOW model. The Skip-gram model predicts the context word of a given center word. In this paper Skip-gram model is used to expand the query. The proposed method predicts context words for each query term and then they are combined and treated as expansion terms.
The proposed model is having three-layer of architectures, Input layer, hidden layer, and output layer. The proposed model used both the feed-forward network and the back-propagation method to predict the context word of a given center word. In the skip-gram model architecture, each query word is represented as one-hot encoding at the input layer. In one hot encoding representation if vocabulary size is 7000 words then in a 7000X1 vector is created and 0 is put at each index except at the index containing the center word. “1” is put at the index of the center word. The architecture of the skip-gram model is shown in Fig. 2. In the following diagram, the weight matrix is initialized randomly. Hidden layer is used to represent the one hot encoding into a dense representation. This is achieved through the dot product of the hot vector and weight matrix. At the next layer, we initialize another weight matrix with random weights. Then the dot product of a hidden vector and newly weighted matrix is obtained. At the next layer, activation function softmax is applied to the output value of the product of the hidden vector and newly assigned weight matrix. In the mid of the training, we have to change the weight of both the matrix so that the words surrounding the context words have a higher probability at the softmax layer. Let N represents the number of unique terms in our corpus of text, X represents the one hot encoding of our query word at input layer, N’ number of neurons in the hidden layer, W(N'XN) weight matrix between the input layer and hidden layer, W’(NXN’) weight matrix between the hidden layer and output layer, and Y a softmax layer having probabilities of every word in vocabulary, then using feed-forward propagation we have
Let uj be the jth neuron of layer u, wj be the word in our vocabulary where j is any index, and be the jth column of matrix W’ then we have
Yj denote the probability that wj is a context word
P(wj|wi) is the probability that wj is a context word wi is the input word. The goal is to maximize where j* represents the indices of context words. We have to maximize
where are the vocabulary indices of context words. Context words are range from c = 1,2,3,….,C. The loss function E is defined as negative log of Eq. (1) as
Using Back Propagation we have
The loss function is propagated from output layer to hidden layer and hidden layer to input layer from Eqs. (2) and (3). The weight W and W’ is updated as
where and are the updated weights between input layer and hidden layer and hidden layer and output layer respectively. The algorithm of the proposed method is
4 Experimental Results and Discussion
Precision and recall are the two metrics to check the performance of the retrieval system. A retrieval system with high precision and recall gives an implication to the evaluators that the proposed system is highly significant. Precision is defined as
and recall as
Mean average precision (MAP) defined as
Qj : number of relevant documents for query j
N : number of queries
P(doci) : precision at ith relevant document
We have performed our experiment on FIRE 2011 English test collection . The dataset is of size 1.1 GB containing 392577 documents. We have used terrier3.5  search engine as retrieval engine. The documents are pre-processed through the following steps.
o Text Segmentation: To split the text into sentences and then to split the sentence into tokens.
o Stop Word Removal: This step remove all the stop words containing the documents.
o Stemming: This step stem the root words of all the terms containing the documents.
We have performed pre-processing on the underlying dataset by applying Porter stemmer and Stopword to stem the root word and to remove the stop words respectively. In the proposed method documents are retrieved using InL2c1.0 model. We have performed our experiment on 50 queries. The mean average precision value of the proposed method, query expansion with the Chi-Square model, BM25 model, and the original query are 0.3176, 0.2912, 0.2970, and 0.2979 respectively. An improvement of 6.61% to the original query is observed. The performance of the retrieval system on original query, Chi-squared based query expansion, query expansion using the proposed model and BM25 model are shown in Tabs. 1–4 respectively. The performance of original query vs. query expansion using Chi-Square, original query vs. proposed model, Chi-Square vs. proposed model, and original query vs. Chi-Square vs. proposed model is shown in Figs. 3–6 respectively. From Tabs. 1–4 it is observed that the proposed model outperforms to original query model, query expansion with the Chi-Square model, and BM25 model respectively. The MAP improvement of the proposed model to the original query and query expansion with Chi-square is 6.61% and 9.07% respectively. Figs. 3–6 query by query analyses reveal that the proposed model retrieves 84 and 35 more relevant documents in comparison to the original query and query expansion with the Chi-Square model. The proposed model also performs well on 30 queries in comparison to and in 36 queries in comparison to the original query and query expansion with the Chi-Square model respectively. The sample query and their expansion terms using the proposed model are shown in Tab. 6. In the Following Figures x-axis represents query number and y-axis represents MAP value respectively.
From Tabs. 1–5 it is clear that the proposed model performs well over the other models. The proposed model improves the MAP result 6.61%, 6.93%, and 9.07% concerning original query, query BM25 model, and query expansion with Chi-Square model respectively. The proposed model also improves the result on R precision parameter 8.47%, 7.02%, and 12.13% concerning original query, BM25 model, and query expansion with Chi-Square model respectively. The proposed model improves recall value by retrieving 84, 25, and 81 additional documents compare to the original query, query expansion with the Chi-Square model, and BM25 model respectively. From Figs. 4, 5, and 7 it is clear that per query analysis reveals that out of 50 queries proposed model performs well in 30, 36, and 30 queries compare to the original query, query expansion with the Chi-Square model, and BM25 model respectively. Per query, analysis shows that more than 60% of queries proposed model performs well compare to other models. The proposed model performs well compare to the original query in query numbers Q128, Q129, Q130, Q131, Q133, Q139, Q140, Q141, Q143, Q144, Q146, Q147, Q148, Q150, Q154, Q155, Q156, Q157, Q159, Q160, Q162, Q163, Q165, Q166, Q169, Q170, Q171, Q172, Q173, and Q174. The proposed model also performs well compare to query expansion with the Chi-Square model in query numbers Q127, Q128, Q129, Q130, Q137, Q138, Q139, Q140, Q142, Q143, Q144, Q145, Q146, Q147, Q150, Q151, Q152, Q155, Q156, Q157, Q158, Q159, Q160, Q162, Q163, Q164, Q165, Q166, Q167, Q169, Q170, Q171, Q172, Q173, Q174, and Q1175. The proposed model also performs well compare to the BM25 model in query numbers Q127, Q128, Q129, Q130, Q132, Q133, Q139, Q140, Q141, Q143, Q144, Q146, Q147, Q148, Q150, Q154, Q155, Q156, Q157, Q159, Q160, Q162, Q163, Q165, Q166, Q169, Q171, Q172, Q173, and Q174.
In this paper, the word mismatch problem is minimized by applying combination of pseudo relevance feedback and deep neural network based method. In the proposed method, we have applied the skip-gram-based neural method for selecting the expansion terms. The mean average precision of the proposed method is 0.3176. An improvement of 6.61% and 9.07% is observed on MAP parameter in comparison to the original query and query expansion with Chi-square model respectively. The proposed model also retrieves 84 and 35 more documents in comparison to original query and query expansion with Chi-square model respectively. In near future, we will try to further improve the performance of the proposed method by tuning the parameters.
Acknowledgement: One of the authors is pursuing a full-time Ph.D. from the Department of Mathematics, Bio-informatics, and Computer Applications, Maulana Azad National Institute of Technology (MANIT) Bhopal (MP), India. He expresses sincere thanks to the Institute for providing an opportunity for him to pursue his Ph.D. work. The author also thanks the Forum of Information Retrieval and Evaluation (FIRE) to provide a dataset to perform his experimental work.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|