The Impact of Semi-Supervised Learning on the Performance of Intelligent Chatbot System

Abstract: Artificial intelligence based dialog systems are attracting attention from both business and academic communities. The key parts of such intelligent chatbot systems are domain classification, intent detection, and named entity recognition. Various supervised, unsupervised, and hybrid approaches are used to detect each field. Such intelligent systems, also called natural language understanding systems, analyze user requests in sequential order: domain classification, then intent detection, then entity recognition based on the semantic rules of the classified domain. This sequential approach propagates errors downstream; i.e., if the domain classification model fails to classify the domain, intent detection and entity recognition also fail. Furthermore, training such intelligent systems necessitates a large user-annotated dataset for each domain. To address these issues, this study proposes a single joint predictive deep neural network framework based on long short-term memory that uses only a small user-annotated dataset. It investigates the value added by incorporating unlabeled data from user chatting logs into multi-domain spoken language understanding systems. Systematic experimental analysis of the proposed joint framework, along with the semi-supervised multi-domain model, using open-source annotated and unannotated utterances shows robust improvement in the predictive performance of the proposed multi-domain intelligent chatbot over a base joint model and a joint model based on adversarial learning.


Introduction
Natural language understanding (NLU) and speech understanding (SU) play an important role in human-computer interaction (HCI) applications. Intelligent NLU systems, including chatbots, robots, voice control interfaces, and virtual assistants, are well-known HCI applications developed to communicate with humans via natural language. HCI is now a global trend and has drawn attention from different communities with the rapid development of machine learning (ML), deep neural networks (DNNs), and reinforcement learning. ELIZA [1] was the first machine able to exhibit human-like behavior, understanding human language and communicating with humans by using pattern matching to respond to the user. The modeling process of a single-domain conversational system or intelligent chatbot consists of detecting intent and recognizing entities from the user query. Virtual customer assistants, or chatbots, reduce information overload and call center effort, enabling a better customer experience (CX) on HCI applications or company websites. Some institutions also deploy role-based assistants that can significantly improve interactions with their customers, business partners, and employees. By reducing the complexity of data and rules, organizations can focus on repetitive and simple interactions where customer needs are well understood and satisfied. Organizations are struggling to manage the growth of such user query data. To address this, they have been implementing intelligent chatbots that provide service to customers 24/7 with or without call center help. Such intelligent systems have three main parts: domain classification, intent detection, and entity recognition. For a multi-tasking chatbot, the domain classification model first classifies the domain, and then the intent and entities are recognized based on the frames of the classified domain, as shown in Fig. 1.
A large amount of user-annotated data is needed to train a multi-domain dialog system. Major intelligent chatbot systems, such as Amazon Alexa, Apple Siri, Google Dialogflow, IBM Watson, Microsoft Cortana, and Samsung Bixby, support multi-domain conversation [2]. A typical multi-tasking or multi-domain chatbot system (as shown in Fig. 1) mainly has domain classification, intent prediction, entity recognition, and response generation or dialog management parts. Most intelligent chatbots process user queries in sequential order: domain classification, intent prediction, and slot prediction. Each step has its own machine learning (ML) model, and the predictions are made in that order. A large number of user-annotated example utterances in each domain is essential before training the models. In addition, separate models are generated for the domain, intent, and entity, making it difficult to manage the large set of models. Furthermore, with this approach, an error in the domain prediction step may lead to errors in intent prediction and entity recognition, ultimately reducing the predictive performance of the chatbot. Typical supervised ML algorithms, such as the Bayesian algorithm, Support Vector Machine (SVM), Logistic Regression, and Neural Networks (NNs), can extract domain and intent from user queries with separate models. However, advanced deep learning (DL) approaches, increased computing power, and the growing number of open-source datasets enable training a single joint model for domain classification, intent prediction, and entity recognition using a single set of utterances [3] containing multiple domain, intent, and slot or entity information, reducing the number of trained ML models [4].
This study reduces human efforts for manual annotation of utterances by incorporating unannotated datasets from various data sources, such as user query logs into a DNN algorithm, i.e., a single jointly trained long short-term memory (LSTM) based NLU model of a multi-domain intelligent chatbot. The single jointly trained LSTM based NLU model reduces the number of classification and recognition models used in sequential approaches and attempts to mitigate downstream error propagation. LSTM was proposed in 1997 by Hochreiter and Schmidhuber for sequential modeling [5], which is a concept of adding an extra memory cell to a recurrent neural network (RNN), achieving better performance in representing and storing historical information. In the standard LSTM network, information transmission is one-way, and each memory cell can use historical information but cannot use the future one. Bidirectional LSTM (Bi-LSTM, shown in Fig. 2) was introduced to transmit and store past and future information in each memory cell.
The principle of Bi-LSTM is to connect two LSTM layers that process the sequence in opposite time directions to the same output. The forward LSTM carries historical information to the next step, and the backward LSTM obtains future contextual information. Furthermore, extra unlabeled data [6] contributes to an increase in the information gain for a DL model trained with the LSTM algorithm.
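The bidirectional idea can be illustrated with a minimal sketch. It is not the model used in this study: a toy scalar recurrence stands in for the LSTM cell, and the function names `run_rnn` and `bidirectional_states` and the weights `w` and `u` are hypothetical. The sketch only shows that the forward state at a position depends on past inputs, whereas the backward state at the same position depends on future inputs.

```python
import math

def run_rnn(xs, w=0.5, u=0.8):
    """Toy scalar recurrence h_t = tanh(w*x_t + u*h_{t-1}); a stand-in for an LSTM cell."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def bidirectional_states(xs):
    """Pair each position's forward (past) state with its backward (future) state."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])[::-1]  # run right-to-left, then realign to positions
    return list(zip(forward, backward))

seq_a = [1.0, 0.2, -0.5, 0.7]
seq_b = [1.0, 0.2, -0.5, -0.9]  # differs from seq_a only in the last element
states_a = bidirectional_states(seq_a)
states_b = bidirectional_states(seq_b)

# The forward state at position 0 is identical for both sequences,
# while the backward state at position 0 reflects the changed final input.
print(states_a[0], states_b[0])
```

Changing only the last input leaves the forward state of the first position untouched but alters its backward state, which is the extra future context a Bi-LSTM provides.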
A single semi-supervised multi-domain joint model (SEMI-MDJM) based on LSTM outperforms a joint base model and an adversarial multi-domain joint model in each task, i.e., domain classification, intent prediction, and entity recognition. The remainder of this work is structured as follows. Section 2 presents related prior work on intelligent dialog systems. Section 3 presents the proposed LSTM-based semi-supervised joint framework. Section 4 presents the experimental results, a detailed analysis of predictive accuracies, and additional analyses of the importance of unannotated data in the context of a general chatbot with multiple domains. Finally, Section 5 concludes with a discussion and interesting areas for future study.

Related Work
The first idea of an HCI application comes from the Turing test, or "imitation game", created by Alan Turing in 1950. ELIZA, developed in 1966, was the first conversational system; it is based on pattern matching and responds to a user using keywords from the user query [1]. In 1995, another HCI application called ALICE was developed using artificial intelligence markup language (AIML) [7] to mitigate the drawbacks of ELIZA. The performance of AIML was further improved [8] by applying a multiple-parameter design pattern to the decomposition rules. With the rapid development of ML algorithms, the emergence of DL techniques, and advances in natural language processing (NLP), these intelligent chatbot systems are gaining popularity in various fields. Conversational systems help reduce costs by automating the workflow of a customer or call center, resulting in rapid responses to customer queries [9]. Almansor and Hussain classified conversational systems into non-task-oriented and task-oriented categories [10]. Non-task-oriented systems are generally retrieval-based chatbots, which provide a similar or highly ranked list of information related to user input.
In contrast, task-oriented conversational systems are supervised or unsupervised models that perform users' specific tasks based on ML algorithms rather than decomposition rules or keyword filtering. Recently, commercial chatbot systems such as Microsoft Cortana, IBM Watson, Amazon Alexa, Google Dialogflow, Apple Siri, and Samsung Bixby have been gaining interest from organizations [11]. These systems are mainly implemented in medical education, health assistance, educational systems, e-learning, e-commerce [12], sports, games, privacy, infrastructure, and other fields [2]. Recently, public administrators have begun implementing chatbot systems for real-time customer services [13]. Autonomous vehicles and smart home systems also embed natural language interaction applications [14]. The implementation of these dialog systems requires technical knowledge of NLP and NLU [15][16][17]. Recent studies also report new NLP and NLU techniques, such as bag-of-concepts and bag-of-narratives [18].
Although several technical and logical parts are involved in implementing intelligent chatbot systems, NLU is the core part of a chatbot. In an intelligent chatbot, the role of NLU is to parse the user query and learn what the user means. NLU systems contain three main subsystems: domain classifier, intent detector, and entity recognition models [19]. Generally, a multi-domain chatbot has three unsupervised or supervised ML models, one for recognizing each field. Different supervised and unsupervised learning techniques, including term frequency-inverse document frequency (TF-IDF), bag of words, word2vec, SVM, the Bayes algorithm, NNs, boosting, maximum entropy, and deep belief networks [20], are widely applied to extract intent and slots in sequential NLU models. These separate pipelined ML models are created using a large number of utterances or examples [3]. Creating and annotating these utterances demands huge human effort. Recently, many open research efforts have shared large annotated datasets from diverse domains in multiple languages. In addition, unannotated user-query data can be collected and analyzed later. Vedula et al. [21] curated and released an annotated dataset of 25 k utterances for developing an intent model. Schuster et al. [22] curated 57 k annotated examples in English, Thai, and Spanish for three different domains (weather, alarm, and reminder) to develop a cross-lingual dialog system. Larson et al. [23] evaluated out-of-scope prediction with a dataset containing 150 intent classes from 10 different domains. Furthermore, these sequential frameworks are at high risk of introducing downstream errors into the intent detection and entity recognition phases. Since each predictive model is trained on a text corpus of sequences, contextual information from the previous step is significantly important for traditional ML algorithms and recent DL approaches.
These text data i.e., utterances or examples are time-series in nature, for which an LSTM-based DL framework demonstrates state-of-the-art performance [24].

Domain Prediction
Domain prediction is the process of filtering user input to a specific category in a multi-tasking dialog system. Many previous works on domain prediction exist. Hakkani-Tur et al. [25] proposed a semi-supervised domain prediction model using AdaBoost with user click logs on Bing web search engine. Zheng et al. [26] proposed an out-of-domain detection mechanism to avoid unnecessary responses to user input. Xu et al. [27] proposed a contextual domain classification mechanism to reduce consecutive queries by a user to different domains. Gupta et al. [28] proposed an RNN-based context encoding method to improve the predictive accuracy and computational efficiency of an NLU model using two different domains.

Intent Detection
Intent prediction is the main part of NLU system. Intent means what a user means or wants to obtain from the system. Although traditional intent predictor models are based on SVM and ANN, with the advancement in DL and sequence modeling, RNN and LSTM algorithms have demonstrated state-of-the-art performance in text classification tasks. Liu et al. [29] proposed attention-based RNN to predict intent and slot. In addition, a hybrid approach that combines LSTM and a convolutional neural network (CNN) shows performance improvement in intent prediction using the ATIS dataset [30]. Goo et al. [31] proposed Slot-Gated Bi-LSTM model with an attention mechanism to predict intent. Systems can make errors for similar words that appear in different contexts. Confusion2vec [32] can reduce confusing errors and predict the intent of user input. For multi-task and multi-turn dialog systems, previous domain information can be used as contextual information for new turns to improve the performance of dialog systems [33]. In addition, incorporating previous contextual session information [34] into intent and slot prediction models can improve predictive performance.

Entity Extraction or Slot Filling
Entity extraction, also called named entity recognition (NER), extracts attributes of interest such as location, place, date, and time from user query text. As important information in user input can appear at any position, entity extraction is a challenging process [24]. Early NER prediction systems relied on rules or dictionaries created by humans. After that, supervised learning based on SVM, decision trees, hidden Markov models, conditional random fields, and dynamic vector representations [35] was used to extract entities from text. Recently, ANNs and DL techniques such as LSTM and CNNs [36] have been used to extract entities from user text. Liu and Lane introduced slot filling based on RNN algorithms [29]. Dernoncourt et al. [37] proposed the NeuroNER tool based on ANNs for non-expert users of ANNs. Generally, models trained over previously built NER algorithms, such as a distantly supervised slot-filling system [38] proposed at Stanford and a Twitter-based NER system [39], can improve the performance of entity extraction systems. The main challenges and misconceptions in NER system development were investigated in detail by Ratinov et al. [40] to improve prediction accuracy on the CoNLL dataset. An entity extraction model based on sequence modeling [41] can further improve predictive performance.
Although these individual training approaches improve the performance of an individual model, contextual information is not shared between the models, and the total number of models increases with the number of domains. The total number of models for a typical traditional dialog system is calculated as Eq. (1):
$$\text{Total Predictive Models} = (2 \times N) + 1 \tag{1}$$
where N represents the total number of domains. The total number of predictive ML models in a typical traditional multi-domain chatbot system is the sum of one domain predictive model, N intent models, and N slot models. If the number of domains increases, the number of predictive models also increases. Thus, various joint training approaches that incorporate the high correlation between intent and entity show better performance with a single joint predictive model.
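As a quick worked example of Eq. (1), the following sketch compares the model count of the sequential approach with that of a single joint model. The function name `sequential_model_count` and the constant `JOINT_MODEL_COUNT` are hypothetical names introduced here for illustration.

```python
def sequential_model_count(num_domains):
    """Eq. (1): one domain classifier plus one intent and one slot model per domain."""
    if num_domains < 1:
        raise ValueError("num_domains must be at least 1")
    return 2 * num_domains + 1

JOINT_MODEL_COUNT = 1  # a single joint model covers domain, intent, and entity

for n in (1, 3, 21):
    print(f"N={n}: sequential={sequential_model_count(n)}, joint={JOINT_MODEL_COUNT}")
```

For the three domains used later in this study, the sequential approach would need seven models, whereas the joint approach needs one.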

Joint Training for Multi-Domain Intelligent Chatbot System
Joint training based on LSTM in a conversational system involves sharing cost or loss functions among domain, intent, and entity predictors. There are some prior works on joint modeling for intent detection and entity recognition. Liu et al. [29] proposed a joint model based on Attention Bi-RNN to recognize intent and entity with higher predictive performance. Ma et al. [30] introduced sparse attention patterns to a jointly trained model based on LSTM for intent detection and slot extraction. Bekoulis et al. [42] applied adversarial learning to a joint model for various datasets, such as biomedical data, real estate data, and news data, achieving state-of-the-art performance for entity and relation extraction. Goo et al. [31] added related information for joint training between the intent detector and slot extractor models. Zhang et al. [43] applied the hierarchical relationship between slots and intent to a joint model based on capsule neural networks. Recently, transfer learning, i.e., pre-trained models such as DialogGLUE (BERT), has shown state-of-the-art performance for joint models [44]. For multi-task-oriented conversational systems, a predictive domain model is trained separately, which can cause downstream error propagation; i.e., if an intelligent chatbot system fails to classify the domain, then the intent predictor and entity extractor also fail [3].
There are some prior works on multi-task-oriented joint models based on LSTM with a single cell. Hakkani-Tur et al. [4] introduced the RNN-LSTM framework for a multi-task-oriented chatbot. Kim and Lee used real user chatting logs from Microsoft Cortana and jointly trained the model with the Bi-LSTM algorithm to enhance classification accuracy by mitigating downstream error propagation [3]. We refer readers to the studies of Abdul-Kader et al. [12] and Ahmad et al. [15], which provide comprehensive literature reviews of various ML and rule-based techniques used in chatbot systems or NLU studies.

Adversarial Learning
Adversarial learning regularizes neural networks and improves the classification accuracy of DNN algorithms by adding small noise or perturbations, chosen to increase the loss of the DL model, to the annotated data during training [45]. Many DNN algorithms have recently been used in NLU and SU systems. Miyato et al. [6] observed that DNNs make incorrect decisions when intentional random noise is fed to them along with the input examples. Furthermore, they proposed an object detection algorithm based on DNN that uses adversarial learning to improve the classification accuracy of an ML model [6].
Semi-supervised learning with adversarial perturbations shows classification improvement for intelligent chatbot systems with multiple domains [46]. Applying adversarial learning to DNNs (as shown in Fig. 3) adds small perturbations at the embedding layer to the input examples, producing variations of the input that the learning model can easily misclassify.
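A minimal sketch of an embedding-level perturbation is given below. It is an illustration under stated assumptions, not the procedure of [45] or [46]: a toy linear classifier with logistic loss replaces the LSTM, the perturbation is the loss gradient rescaled to a fixed L2 norm, and the names `logistic_loss`, `loss_gradient`, and `adversarial_perturbation`, the example vectors, and the value of `epsilon` are all hypothetical.

```python
import math

def logistic_loss(embedding, weights, label):
    """Loss of a toy linear classifier; label is +1 or -1."""
    score = sum(w * e for w, e in zip(weights, embedding))
    return math.log1p(math.exp(-label * score))

def loss_gradient(embedding, weights, label):
    """Analytic gradient of logistic_loss with respect to the embedding."""
    score = sum(w * e for w, e in zip(weights, embedding))
    coeff = -label / (1.0 + math.exp(label * score))
    return [coeff * w for w in weights]

def adversarial_perturbation(gradient, epsilon=0.1):
    """Rescale the gradient to L2 norm epsilon: r = epsilon * g / ||g||."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm == 0.0:
        return [0.0] * len(gradient)
    return [epsilon * g / norm for g in gradient]

embedding = [0.4, -0.2, 0.7]  # hypothetical 3-dimensional word embedding
weights = [1.0, -1.5, 0.5]    # hypothetical classifier weights
label = 1

r_adv = adversarial_perturbation(loss_gradient(embedding, weights, label))
perturbed = [e + r for e, r in zip(embedding, r_adv)]
clean_loss = logistic_loss(embedding, weights, label)
adv_loss = logistic_loss(perturbed, weights, label)
print(clean_loss, adv_loss)
```

Because the perturbation follows the direction of the loss gradient, the loss on the perturbed embedding is higher than on the clean one; training on both is what gives the regularizing effect.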

Semi-Supervised Learning for NLU
Semi-supervised learning is the process of training an ML model with both annotated and unannotated utterances. First, a supervised model is developed with the annotated or labeled dataset and is then used to predict labels for the unannotated samples. Afterward, retraining on the originally annotated dataset together with the machine-annotated dataset creates a new predictive supervised model. This entire process of training, predicting, and retraining with the predicted dataset along with the originally labeled utterances constitutes the semi-supervised learning shown in Fig. 4. The semi-supervised technique helps reduce human effort in the manual annotation of utterances and helps create a self-learning model with robust information gain, ultimately improving predictive performance or accuracy. A semi-supervised learning approach can help annotators annotate new user inputs starting from a small user-annotated dataset.
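The train, predict, and retrain cycle described above can be sketched as follows. This is a toy illustration only: a word-count voting classifier stands in for the joint model, the example utterances are invented, and the confidence `threshold` for accepting a machine-annotated utterance is an assumption added here that is not part of the description above.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count how often each word appears under each label (toy stand-in for the joint model)."""
    counts = defaultdict(Counter)
    for text, label in examples:
        for word in text.lower().split():
            counts[word][label] += 1
    return counts

def predict(model, text):
    """Return (label, confidence); confidence is the winning label's share of word votes."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    if not votes:
        return None, 0.0
    label, top = votes.most_common(1)[0]
    return label, top / sum(votes.values())

def self_train(labeled, unlabeled, threshold=0.6):
    """Train, pseudo-label the unlabeled utterances, and retrain on the union."""
    model = train(labeled)
    pseudo = []
    for text in unlabeled:
        label, confidence = predict(model, text)
        if label is not None and confidence >= threshold:
            pseudo.append((text, label))
    return train(labeled + pseudo), pseudo

labeled = [("set an alarm for six", "alarm"), ("will it rain today", "weather")]
unlabeled = ["wake me with an alarm", "rain tomorrow morning", "play some jazz"]
model, pseudo = self_train(labeled, unlabeled)
print(pseudo)
print(predict(model, "tomorrow morning"))
```

After retraining, the toy model can label "tomorrow morning" as a weather query, although neither word appeared in the original annotated examples; this is the information gain the unannotated logs contribute.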
There are extensive prior studies on semi-supervised learning approaches for developing SLU and NLU models for intent prediction and entity extraction. Diverse techniques have been used to predict intent for a single domain dialog system using a semi-supervised learning approach [47]. A semi-supervised joint model for intent prediction and slot fillings [45,48] reduces human efforts in annotating examples, improving the model's performance with robust information gain. For further investigation, we refer readers to Singh et al. [11], which provide comprehensive literature reviews of data extraction, data processing, various data sources, and reinforcement and ensemble learning methods used in NLU studies.
Although semi-supervised learning has recently been used in multi-domain dialog systems, this study is, to the best of our knowledge, the first to apply semi-supervised learning to a single joint model. Our proposed SEMI-MDJM (shown in Fig. 4) focuses on automating the annotation process with user chatting logs, which can be an important data source for an intelligent chatbot. Each component of SEMI-MDJM is discussed in the following subsections.

Data Preprocessing
User chatting logs are unstructured text data and should be converted into structured examples that a DNN algorithm can use to train the model. Bag of words, term-frequency matrices, and vector space [49] methods are widely applied to transform unstructured data into a structured dataset. TF-IDF uses term frequency matrices to extract information from text data. Creating these matrices involves various data cleansing and wrangling steps, including tokenization, stemming, and POS tagging, as shown in Fig. 5. Then a word embedding set is created from the preprocessed, cleaned corpus. Furthermore, the previously developed joint model is used to predict labels for the unlabeled user chatting logs, and the resulting annotated utterances are added to the previous training dataset; retraining the model on this set increases the information gain for the LSTM cell. The utterances are preprocessed and fed into a Bi-LSTM cell to extract previous and future information. Finally, the single LSTM model predicts the domain and intent and extracts the entities.
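To make the TF-IDF step concrete, the sketch below computes term weights for a few invented utterances. It assumes the plain weighting tf × log(N / df) and whitespace tokenization; the actual preprocessing in this study also involves stemming and POS tagging, which are omitted here, and the function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; stemming and POS tagging are omitted."""
    return text.lower().split()

def tf_idf(corpus):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    docs = [tokenize(text) for text in corpus]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

corpus = [
    "set an alarm for seven",
    "cancel the alarm",
    "what is the weather for today",
]
weights = tf_idf(corpus)
for doc_weights in weights:
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1])[:3])
```

Terms that occur in fewer utterances, such as "seven", receive a higher weight than terms shared across utterances, such as "alarm".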

Embedding and Bi-LSTM Layer
The embedding layer feeds the sequential data to an LSTM cell by creating an embedding vector for each word. Word embedding for a word sequence $w_1 \ldots w_n \in W$ is given as Eq. (2):
$$e_w \in \mathbb{R}^{64} \quad \text{for each } w \in W \tag{2}$$
Fig. 2 shows a Bi-LSTM model with forward and backward propagation of information. Due to the bidirectional information propagation, previous and future contextual information can be memorized for each LSTM cell.
The final training objective of MDJM is to minimize the shared loss among domain, intent, and entity. The total cumulative loss is calculated using Eq. (3):
$$L(\theta) = l_d + l_i + l_t \tag{3}$$
The losses $l_d$, $l_i$, and $l_t$ of the respective output layers are calculated for each annotated utterance. Then, the shared loss among domain, intent, and entity is calculated as $l_d + l_i + l_t$ in each gradient step. Finally, the model parameters $\theta$ are optimized by minimizing the shared loss. The algorithm of the proposed semi-supervised multi-domain chatbot system is designed as follows:
7: Optimize the model using Adam optimizer
8: Predict unlabeled data using the model created from Step 1 to Step 7
9: Add predicted dataset to the original training dataset
10: Retrain the model by following Step 1 to Step 7
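The shared loss can be sketched for a single utterance as follows. This is an illustration under assumptions: each task loss is taken to be a softmax cross-entropy, the entity loss is averaged over the tokens of the utterance, and the function names and example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])

def shared_loss(domain_logits, domain_y, intent_logits, intent_y, tag_logits, tag_ys):
    """l_d + l_i + l_t, with the entity loss averaged over tokens (an assumption)."""
    l_d = cross_entropy(domain_logits, domain_y)
    l_i = cross_entropy(intent_logits, intent_y)
    l_t = sum(cross_entropy(z, y) for z, y in zip(tag_logits, tag_ys)) / len(tag_ys)
    return l_d + l_i + l_t, (l_d, l_i, l_t)

# Untrained model: uniform scores over 3 domains, 2 intents, and 2 tags per token.
total, parts = shared_loss(
    domain_logits=[0.0, 0.0, 0.0], domain_y=0,
    intent_logits=[0.0, 0.0], intent_y=1,
    tag_logits=[[0.0, 0.0], [0.0, 0.0]], tag_ys=[0, 1],
)
print(total, parts)
```

Because all three terms enter one sum, a single gradient step on the shared loss updates the layers common to the domain, intent, and entity outputs at the same time.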

Evaluation and Optimization

Evaluation Criteria
There are many standard performance metrics and criteria for comparing predictive performance between classifiers [50]. The most widely used measures in text classification are predictive accuracy (ACC) and F-score. These criteria can be described in detail using the confusion matrix in Tab. 1. The classification or predictive accuracy of a predictive model is defined as in Eq. (4):
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$
In the above equation, TP denotes the number of true positives of a predictive model over all classes, whereas TN denotes the number of true negatives. FP denotes the number of false positives, and FN the number of false negatives. The precision or positive predictive value of a given classification model is calculated as in Eq. (5):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{5}$$
Recall, which is also called sensitivity or true positive rate of a classifier, is calculated using Eq. (6):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}$$
Specificity, also called selectivity or true negative rate, is calculated as in Eq. (7):
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{7}$$
Another criterion is the F1-score, the harmonic mean of the precision and recall of an ML model, which is calculated using Eq. (8):
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$
Area under the curve (AUC), shown in Eq. (9), is another well-known criterion for measuring the accuracy of an ML algorithm:
$$\mathrm{AUC} = \int_0^1 \mathrm{Sensitivity} \; d(1 - \mathrm{Specificity}) \tag{9}$$
In the above equation, the curve plots sensitivity against 1 − specificity; specificity is the percentage of negative instances correctly predicted as negative. In this study, classification accuracy is used for performance comparison.
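The criteria of Eqs. (4) to (8) can be computed directly from confusion-matrix counts, as in the minimal sketch below. It uses the standard definitions of these measures; the function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for one class of a 100-utterance test set.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The guards against zero denominators return 0.0 when a class is never predicted or never present, which is one common convention among several.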

Optimization
Adam, stochastic gradient descent (SGD), and RMSProp are the three most widely used optimizers for ANNs and DL models. This study used the Adam optimizer for training the proposed model, the adversarial model, and the joint base model. The Adam optimizer helps control the sparse gradient problems of a model. It extends stochastic gradient descent and is widely used as an optimization mechanism for DL applications such as NLU and SU models.
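For readers unfamiliar with Adam, a single-parameter update can be sketched as below. The decay rates `beta1` and `beta2` and the constant `eps` are commonly used default values assumed here, not settings reported in this study, and the quadratic objective is only a toy example.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize the toy objective f(x) = (x - 3)^2 starting from x = 0.
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t)
print(x)
```

Dividing by the running root mean square of the gradient gives each parameter its own effective step size, which is why Adam copes well with sparse gradients.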

Experiment
This study used a user-annotated dataset of 43 k utterances from the weather, alarm, and reminder domains (shown in Fig. 6) of a multi-domain intelligent chatbot system [22,51]. The dataset contains three different domains with 12 intent labels and 11 unique entities. Furthermore, a large publicly available unannotated user chatting log dataset [52] of 25 k user utterances from 21 domains was collected, and only about 2 k (2,510) of its unannotated user queries, those from the alarm, reminder, and weather domains, were used for semi-supervised learning.
The utterances are restructured (as shown in Fig. 7) into annotated sets of user queries, entities, and intent labels, in that order. User queries are enclosed with the BOS and EOS symbols. The dataset is then divided into training, evaluation, and testing sets in a 70:20:10 ratio, as shown in Tab. 2. Annotated and unannotated utterances are then preprocessed using a Python NLU tokenization library. The size of each input example is fixed to 50 characters, and word embeddings of size 64 are created. Then the LSTM model from the TensorFlow library is used to train on and predict user queries. The experiments were conducted using the TensorFlow 1.10.0 library on Python 3.6. The experimental platform runs Windows 10 with an Intel Core CPU at a clock speed of 1.60 GHz and 8 GB of RAM.
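The encoding and splitting steps can be sketched in plain Python as follows. Several details are assumptions made only for illustration: the fixed length of 50 is interpreted as a number of tokens, the padding and unknown-word symbols `<pad>` and `<unk>` are hypothetical, whitespace tokenization replaces the tokenization library, and the split is a seeded random shuffle.

```python
import random

MAX_LEN = 50  # fixed input length from the text, interpreted here as tokens (assumption)
PAD, UNK = "<pad>", "<unk>"  # hypothetical special symbols

def build_vocab(utterances):
    """Assign an integer id to every token, including the BOS and EOS markers."""
    vocab = {PAD: 0, UNK: 1}
    for text in utterances:
        for token in ["BOS"] + text.lower().split() + ["EOS"]:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab, max_len=MAX_LEN):
    """Wrap the query in BOS/EOS, map tokens to ids, then truncate or pad to max_len."""
    tokens = ["BOS"] + text.lower().split() + ["EOS"]
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

def split_70_20_10(items, seed=0):
    """Shuffle and split into training, evaluation, and testing sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_eval = len(items) * 2 // 10
    return items[:n_train], items[n_train:n_train + n_eval], items[n_train + n_eval:]

utterances = ["set an alarm for seven", "what is the weather today"]
vocab = build_vocab(utterances)
encoded = encode("set an alarm for noon", vocab)
train_set, eval_set, test_set = split_70_20_10(range(100))
print(encoded[:8], len(train_set), len(eval_set), len(test_set))
```

The integer ids produced this way are what an embedding layer of size 64 would map to vectors before the Bi-LSTM layer.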
To evaluate SEMI-MDJM, we conducted experimental analysis and compared with a prior MDJM and "multi-domain joint model with adversarial learning" (MDJM-ADV) [51]. SEMI-MDJM is created by annotating publicly available user chatting logs using MDJM and retraining the proposed model by adding this predicted dataset to the original training sets. LSTM cell of each model is created with 100 hidden neurons. Then the model is trained for 20 epochs and optimized with Adam optimizer. The learning rate is set to 0.01 and batch size of training dataset is set to 16. The MDJM shares the loss function among domain, intent, entities predictors, whereas MDJM-ADV further adds the adversarial loss to the original MDJM model. Incorporating user chatting logs into the base MDJM provides information gain for each output layer. Fig. 8 shows the training and test loss for MDJM, MDJM-ADV and SEMI-MDJM.

Conclusion
In this study, we proposed a semi-supervised joint model, SEMI-MDJM, for an intelligent chatbot system that extracts the domain, intent, and entities of user queries using a single LSTM-based ML model. The single model mitigates the downstream error propagation that limits the typical sequential approach, and it reduces the effort required for manual data annotation and for managing a large number of NLU predictive models. Experimental results showed a significant improvement in the predictive performance of each task, i.e., domain, intent, and entity prediction, with semi-supervised learning compared to the joint base model and the joint model with adversarial learning. The proposed SEMI-MDJM reduces the number of trained models to one and adds a self-annotation process, which reduces the human effort necessary to annotate examples and to manage multiple intent detectors and entity extractors. In addition, it provides a self-learning approach to the conversational dialog system by continuously incorporating domain-related utterances from user chatting logs into the initially developed MDJM. Furthermore, it reduces the human effort required to annotate a large number of domain, intent, and entity examples. For future study, we encourage testing the proposed SEMI-MDJM on other domains, such as education and health, in various languages and with large datasets. In addition, incremental prediction and annotation of the entire unannotated dataset could further improve the proposed model and reduce its overfitting.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.