Deep Neural Network and Pseudo Relevance Feedback Based Query Expansion

Neural networks have attracted researchers immensely in the last couple of years due to their wide applications in areas such as data mining, natural language processing, image processing, and information retrieval. Word embedding has been applied by many researchers to information retrieval tasks. In this paper, a word-embedding-based skip-gram model is developed for the query expansion task. Vocabulary terms are obtained from the top "k" initially retrieved documents using the pseudo-relevance feedback model and are then trained with the skip-gram model to find expansion terms for the user query. The mean average precision (MAP) of the proposed model is 0.3176. The proposed model is compared with other existing models: an improvement of 6.61%, 6.93%, and 9.07% in MAP is observed compared to the original query, the BM25 model, and query expansion with the Chi-square model, respectively. The proposed model also retrieves 84, 25, and 81 additional relevant documents compared to the original query, query expansion with the Chi-square model, and the BM25 model, respectively, and thus also improves recall. Per-query analysis reveals that the proposed model performs better on 30, 36, and 30 queries compared to the original query, query expansion with the Chi-square model, and the BM25 model, respectively.


Introduction
Over the years, the web has grown exponentially, and it has become difficult to retrieve the documents relevant to a user query. An information retrieval system tries to minimize the gap between the user query and the relevant documents. Various phases of the retrieval process are affected by the vagueness of the user query. For example, a novice user, while formulating a query, might be uncertain in selecting the keywords to express his/her information need; the user has only a fuzzy idea of what he/she is looking for. As a result, the retrieval system retrieves irrelevant documents along with relevant ones. Query expansion appends additional terms to the original query and helps in retrieving relevant documents that were left out; the technique tries to minimize the word mismatch problem. Generally, queries are categorized into the following three main categories [1]: (1) navigational queries, (2) informational queries, and (3) transactional queries. Navigational queries search for a particular URL or website. Informational queries cover a broad area of the given topic and may match thousands of documents. Transactional queries express the user's intention to execute some task, such as downloading or buying an item. In information retrieval, one method of query expansion is the use of terms semantically similar to the original query. WordNet [2] based methods are among the oldest methods for query expansion. WordNet is a semantic resource that provides semantically similar terms for the original query terms through their synonyms, hyponyms, and meronyms. Word embedding is another technique for finding terms similar to the original query. Word2vec [3] and GloVe [4] are two well-known word embedding techniques for finding terms semantically similar to the original query terms for query expansion.
Word2vec and GloVe learn word embedding vectors in an unsupervised way using a neural network. They find terms semantically similar to the original query terms using a global document collection or external resources such as Wikipedia [5] or a similarity thesaurus [6]. The local method of query expansion, in contrast, searches for terms similar to the original query using pseudo-relevance feedback, which assumes that the top "k" retrieved documents are relevant to the original query. It has been observed that the local method of query expansion performs better than the global method [7]. The proposed method is a deep neural network-based query expansion method using the skip-gram model. In the proposed method, terms semantically similar to the original query are retrieved from the top "k" initially retrieved documents obtained through pseudo-relevance feedback: the terms in these documents are used to train the skip-gram model, which predicts the context words of a given center word. The skip-gram model uses an unsupervised neural training procedure that successively updates the weights between successive layers, and it assigns each term a vector in a semantic vector space whose dimension is much lower than the vocabulary size. The proposed method predicts the context words of each query term and then takes the union of these context words; the combined context words are treated as expansion terms for the given query. Fig. 1 shows the architecture of the proposed model.
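The expansion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hand-picked 3-dimensional vectors and the tiny vocabulary stand in for embeddings that would actually be trained with the skip-gram model on the top-k feedback documents, and nearest neighbours are found by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(query_terms, embeddings, top_n=2):
    """For each query term, take its top_n most similar vocabulary
    terms and append the union of these neighbours to the query."""
    expansion = set()
    for q in query_terms:
        if q not in embeddings:
            continue
        scored = sorted(
            ((cosine(embeddings[q], v), w)
             for w, v in embeddings.items() if w != q),
            reverse=True,
        )
        expansion.update(w for _, w in scored[:top_n])
    return list(query_terms) + sorted(expansion - set(query_terms))

# Toy vectors standing in for embeddings trained on the feedback documents.
emb = {
    "virus":   [0.9, 0.1, 0.0],
    "disease": [0.8, 0.2, 0.1],
    "vaccine": [0.7, 0.3, 0.0],
    "car":     [0.0, 0.1, 0.9],
}
print(expand_query(["virus"], emb, top_n=2))  # ['virus', 'disease', 'vaccine']
```

With real embeddings the same union-of-neighbours step yields the expanded query that is re-submitted to the retrieval engine.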

Related Work
Query expansion plays an important role in improving the performance of a retrieval system. The most common method of query expansion is to extract the expansion terms from an external data collection such as anchor text, a query log, or an external corpus. References [8,9] used anchor text as a data source. References [10,11] used a query log for query expansion; they applied the correlation between query terms and document terms, collecting the data source from click-throughs of documents on URLs. Reference [12] modeled the query log as a bipartite graph where query nodes are connected to URL nodes by click edges, and showed an improvement of 10%. Reference [13] proposed a co-occurrence-based, document-centric probabilistic model for query expansion. A continuous word embedding-based technique for documents was proposed by [14]; they reported that their model performs better than an LSI-based model but does not outperform TF-IDF or the divergence-from-randomness model. Reference [15] proposed a supervised embedding-based term weighting technique for language modeling. Reference [16] proposed using semantic similarities between vocabulary terms to improve the performance of the retrieval system. Reference [17] proposed a word embedding technique used in a supervised manner for query expansion. Reference [18] proposed a word2vec-based model for expanding the query terms; using this model, they extracted terms similar to the query terms with the k-nearest neighbor approach and reported considerable improvement on TREC ad hoc data. Reference [19] used Word2vec and GloVe for query expansion in ad hoc retrieval. Reference [20] used a fuzzy method to reformulate and expand the user query using the pseudo-relevance feedback method, which uses the top "k" ranked documents as a data source. Reference [21] proposed a hybrid method that uses both local and global data sources, combining an external corpus with the top "k" ranked documents.
Reference [22] used a combination of top retrieved documents and anchor text as a data source for query expansion. Reference [23] used a query log and web search results as data sources for query reformulation and expansion. Reference [24] used Wikipedia and Freebase to expand the initial query. Reference [25] used a fuzzy-based machine learning technique to classify liver disease patients. Reference [26] proposed a machine learning technique that diagnoses breast cancer patients using different classifiers.

Query Expansion Using Deep Learning
Deep learning is used in almost every area of computer science. In information retrieval, continuous word embedding is widely used to improve the mean average precision (MAP). There are two deep learning approaches to word embedding: (1) the continuous bag-of-words (CBOW) model [27] and (2) the skip-gram model. Both are widely used in query expansion methods [28,29]. The continuous bag-of-words model predicts the center word from given context words.
The skip-gram model is the opposite of the CBOW model: it predicts the context words of a given center word. In this paper, the skip-gram model is used to expand the query. The proposed method predicts context words for each query term; these are then combined and treated as expansion terms.
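The training data for a skip-gram model consists of (center, context) word pairs drawn from a sliding window over the text. A short sketch of how such pairs are generated (the sentence and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for the skip-gram model:
    each word is paired with every neighbour within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sent = ["query", "expansion", "improves", "retrieval"]
print(skipgram_pairs(sent, window=1))
```

CBOW would invert each pair, using the surrounding words as input and the center word as the prediction target.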
The proposed model has a three-layer architecture: an input layer, a hidden layer, and an output layer. It uses both feed-forward propagation and back-propagation to predict the context words of a given center word. In the skip-gram architecture, each query word is represented as a one-hot encoding at the input layer: if the vocabulary size is 7000 words, a 7000×1 vector is created with 0 at every index except the index of the center word, which is set to 1. The architecture of the skip-gram model is shown in Fig. 2; the weight matrices are initialized randomly. The hidden layer turns the one-hot encoding into a dense representation, obtained as the dot product of the one-hot vector and the first weight matrix. At the next layer, another randomly initialized weight matrix is applied, and the dot product of the hidden vector and this weight matrix is computed. The softmax activation function is then applied to the result. During training, the weights of both matrices are adjusted so that the words surrounding the center word receive higher probabilities at the softmax layer.
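Because the input is one-hot, the "dot product with the first weight matrix" amounts to selecting one column of that matrix. A tiny illustration (the 5-word vocabulary and 3×5 weight matrix are made up for the example):

```python
def one_hot(index, size):
    """One-hot column vector as a plain list."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

vocab = ["deep", "neural", "network", "query", "expansion"]
x = one_hot(vocab.index("query"), len(vocab))  # 5x1 one-hot vector

# Toy weight matrix W between input and hidden layer (N' = 3, N = 5).
W = [[0.1, 0.2, 0.3, 0.4, 0.5],
     [0.5, 0.4, 0.3, 0.2, 0.1],
     [0.2, 0.2, 0.2, 0.2, 0.2]]

# h = W.x: since x is one-hot, this simply picks the column of W
# at the index of "query".
h = [sum(W[r][c] * x[c] for c in range(len(x))) for r in range(len(W))]
print(h)  # [0.4, 0.2, 0.2]
```

The selected column is the dense hidden-layer representation (the word's embedding).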
Let N denote the number of unique terms in our corpus, x the one-hot encoding of the query word at the input layer, N' the number of neurons in the hidden layer, W (an N'×N matrix) the weight matrix between the input layer and the hidden layer, W' (an N×N' matrix) the weight matrix between the hidden layer and the output layer, and y a softmax layer holding a probability for every word in the vocabulary. Using feed-forward propagation we have

h = W·x (1)
u = W'·h (2)

Let u_j be the j-th neuron of layer u and w_j the j-th word of the vocabulary. Then y_j denotes the probability that w_j is a context word of the input word w_i:

P(w_j | w_i) = y_j = exp(u_j) / Σ_{j'=1}^{N} exp(u_{j'}) (3)

The goal is to maximize P(w_{j*} | w_i), where j* ranges over the indices of the context words. The loss is propagated from the output layer to the hidden layer and from the hidden layer to the input layer using Eqs. (2) and (3). The weights W and W' are updated by gradient descent,

W_{ij}^{new} = W_{ij}^{old} − η · ∂E/∂W_{ij}
W'_{ij}^{new} = W'_{ij}^{old} − η · ∂E/∂W'_{ij}

where η is the learning rate, E is the loss, and W_{ij}^{new} and W'_{ij}^{new} are the updated weights between the input and hidden layers and between the hidden and output layers, respectively. The concluding steps of the algorithm of the proposed method are:

6. leq ← l/len
7. For each term t in query Q, retrieve the indices of y that have the top "leq" values.
8. For each term t in query Q, retrieve the words corresponding to those indices of y and merge them into Q_m.
9. Append these words to the query: exp ← Q + Q_m.
10. Return exp.
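The feed-forward pass, softmax, and weight updates can be sketched end-to-end in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the vocabulary size, hidden size, learning rate, and the single training pair are made up, and the gradient of the softmax cross-entropy loss is used for the updates of W and W'.

```python
import math
import random

random.seed(0)
N, Nh = 5, 3    # vocabulary size N, hidden layer size N'
eta = 0.1       # learning rate

# Randomly initialized weight matrices: W is N'xN, W' is NxN'.
W  = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(Nh)]
Wp = [[random.uniform(-0.5, 0.5) for _ in range(Nh)] for _ in range(N)]

def forward(i):
    """Forward pass for center word index i: h = W.x, u = W'.h, y = softmax(u)."""
    h = [W[r][i] for r in range(Nh)]  # x is one-hot at i, so W.x is column i of W
    u = [sum(Wp[j][r] * h[r] for r in range(Nh)) for j in range(N)]
    m = max(u)                        # subtract max for numerical stability
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    y = [v / s for v in e]            # y_j = P(w_j | w_i)
    return h, y

def train_step(i, context):
    """One gradient-descent update for center word i and context word indices."""
    h, y = forward(i)
    C = len(context)
    # Output-layer error: dE/du_j = C*y_j - (number of times j is a context word)
    e = [C * y[j] - sum(1 for c in context if c == j) for j in range(N)]
    # Back-propagate to the hidden layer before changing W'.
    eh = [sum(e[j] * Wp[j][r] for j in range(N)) for r in range(Nh)]
    for j in range(N):                # update W' (hidden -> output)
        for r in range(Nh):
            Wp[j][r] -= eta * e[j] * h[r]
    for r in range(Nh):               # update W (input -> hidden, column i only)
        W[r][i] -= eta * eh[r]

before = forward(0)[1][1]             # P(word 1 | word 0) before training
for _ in range(50):
    train_step(0, [1])                # repeatedly present the pair (0 -> 1)
after = forward(0)[1][1]
print(before, after)                  # the context word's probability rises
```

After a few updates the softmax probability of the observed context word increases, which is exactly the objective the update rules above express.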

Experimental Results and Discussion
Precision and recall are the two metrics used to check the performance of a retrieval system; a system with high precision and recall indicates to evaluators that it is highly effective. Precision is defined as the fraction of retrieved documents that are relevant to the query, and recall as the fraction of relevant documents that are retrieved [30]. The dataset is 1.1 GB in size and contains 392,577 documents. We used the Terrier 3.5 [31] search engine as the retrieval engine, and the documents were pre-processed before indexing. In this paper, the word mismatch problem is minimized by applying a combination of pseudo-relevance feedback and a deep neural network-based method. In the proposed method, we apply the skip-gram-based neural method for selecting the expansion terms. The mean average precision of the proposed method is 0.3176. An improvement of 6.61% and 9.07% in MAP is observed in comparison to the original query and query expansion with the Chi-square model, respectively. The proposed model also retrieves 84 and 35 more documents in comparison to the original query and query expansion with the Chi-square model, respectively. In the near future, we will try to further improve the performance of the proposed method by tuning its parameters.
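The evaluation measures used here can be computed as follows. The ranked list and relevance judgments in the example are made up for illustration; average precision is the per-query quantity whose mean over all queries gives MAP.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall over a retrieved list and a relevant set."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def average_precision(retrieved, relevant):
    """Mean of the precision values at each rank where a relevant document
    appears; averaging this over all queries yields MAP."""
    relevant_set = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant_set:
            hits += 1
            total += hits / rank
    return total / len(relevant_set) if relevant_set else 0.0

ranked = ["d3", "d1", "d7", "d2"]     # hypothetical ranked retrieval result
rel = ["d1", "d2"]                    # hypothetical relevance judgments
print(precision_recall(ranked, rel))  # (0.5, 1.0)
print(average_precision(ranked, rel)) # (1/2 + 2/4) / 2 = 0.5
```

In practice these values are produced per query by the evaluation tooling and then averaged to obtain the MAP figures reported above.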