Adversarial Training for Multi Domain Dialog System

Natural Language Understanding and Speech Understanding systems are now a global trend, and with the advancement of artificial intelligence and machine learning techniques, have drawn attention from both the academic and business communities. Domain prediction, intent detection and entity extraction or slot fillings are the most important parts for such intelligent systems. Various traditional machine learning algorithms such as Bayesian algorithm, Support Vector Machine, and Artificial Neural Network, along with recent Deep Neural Network techniques, are used to predict domain, intent, and entity. Most language understanding systems process user input in a sequential order: domain is first predicted, then intent and slots are filled according to the semantic frames of the predicted domain. This pipeline approach, however, has many disadvantages including downstream error; i.e., if the system fails to predict the domain, then the system also fails to predict intent and slot. The main purpose of this paper is to mitigate the risk of downstream error propagation in traditional pipelined models and improve the predictive performance of domain, intent, and slotall of which are critical steps for speech understanding and dialog systemswith a deep learning-based single joint model trained with an adversarial approach and long shortterm memory (LSTM) algorithm. The systematic experimental analysis shows significant improvements in predictive performance for domain, intent, and entity with the proposed adversarial joint model, compared to the base joint model.


Introduction
Speech understanding, natural language understanding (NLU) and natural language processing (NLP) are important parts of Human-computer interaction (HCI). Conversational systems or dialog systems, including virtual assistants, voice control interfaces, chatbots, and robots, are some examples of HCI application deployed to interact with users using human language. ELIZA [1] was the first machine to understand natural language for conversations with humans, based on decomposition rules that respond to humans via by keywords appearing in the user input text. The modelling process of a typical dialog system consists of predicting intent, i.e., what the customer really means, and extracting entities from the user input. In the case of a multi-task-oriented dialog system, the domain predictive model first predicts the domain; intent and entity are then extracted in a stepwise manner based on the predicted domain, as shown in Fig. 1. Major dialog systems, such as Microsoft Cortana, Apple Siri, Google Dialogflow, Amazon Alexa, Samsung Bixby, and IBM Watson, support multi task-oriented dialog systems for different domains [2]. A multi-domain dialog system mainly has four key parts: domain prediction, intent detection, entity extraction, and dialog generation. Most language understanding systems process user query in a stepwise order; domain, intent, and slots each have their own trained model and are predicted in the respective order. However, this approach may result in prediction error for intent and entity. An error in the first step, i.e., domain prediction, leads to error in downstream tasks, i.e., intent detection and entity extraction, which ultimately reduces prediction accuracy for intent and entity. Typical machine learning algorithms such as Support Vector Machine (SVM), Bayesian algorithm, and Artificial Neural Network (ANN) have only been able to train individual predictive models for domain, intent, and entity. However, the emergence of deep learning technology and increased computing powers enable training of large joint single models for domain, intent, and entity with a single large dataset [3] which shares information between domain, intent, and entity and reduces the number of predictive models [4].
The main purpose of this study is to jointly train domain prediction, intent detection, and slot filling-all of which are critical steps for dialog systems or chatbots-with a single cell of bidirectional LSTM (Long Short Term Memory) algorithm and adversarial learning, to mitigate downstream error propagation and reduce the number of predictive models used in traditional pipelined approaches, ultimately helping to reduce the amount of human effort necessary to manage predictive models. LSTM is the first memorybased recurrent neural network (RNN) proposed by Hochreiter et al. [5] for time series or sequential data, and adversarial learning [6] helps to increase the loss function of the deep learning model with small inline perturbations to the LSTM cell along with user input. Jointly trained single models with adversarial learning (MDJM-ADV) based on a single cell of LSTM algorithm for multi-domain dialog systems show significant improvements in each task over a base joint multi-domain dataset.

Structure
The remainder of this study is structured as follows. Section 3 presents related prior work. Section 4 provides a proposed joint framework based on adversarial LSTM for a multi-domain dialog system. Section 5 presents the systematic analysis and experimental results for the respective predictive performances, and additional analyses regarding the importance of adversarial learning and joint models in the context of a multi-domain dialog system. Finally, Section 6 concludes with a discussion and interesting areas for further study.

Literature Review
Since the first human computer program, ELIZA, was developed at MIT in 1966, there has been a tremendous amount of research in the field of natural language understanding and speech understanding. ELIZA is based on decomposition rules [1] that respond to humans through keywords appearing in the user input text. To mitigate some of the drawbacks of ELIZA, another chatbot called ALICE (Artificial Linguistic Internet Computer Entity) was developed in the 1980s based on AIML (Artificial Intelligent Markup Language) [7]; performance of AIML is further improved in pattern recognition with the help of multiple parameter design [8]. These kinds of applications are mainly enabled with the maturity of Human Language Understanding & Processing by computers and are called conversational agents, also called dialog systems or simply chatbot systems. With the rapid development of technology and advancement of machine learning algorithms, the HCI applications such as ELIZA and ALICE are gaining popularity in academia, as well as in organizations. Chatbot systems help to reduce operating costs by reducing and automating the workload to the customer center and higher management levels, resulting in increased availability for customer services [9].
Although there are various parts and systems involved in implementation of the dialog system, natural language understanding is at the heart of the conversational system. In a dialog or chatbot system, the role of natural language understanding is to receive user input, and to parse, preprocess, and understand what the user really means. Natural language understanding systems further contain three different parts: domain predictor, intent classifier, and entity or slot extractor models [10]. Generally, for task-oriented dialog systems, a natural language understanding system has three supervised or unsupervised learning models for domain prediction, intent classification, and entity extraction. Different machine learning techniques such as TF-IDF, word-to-vector [11], Bayes algorithm, SVM, ANN, Deep Belief Networks, boosting, and maximum entropy [12] are widely used to predict intent and entity in a traditional pipelined NLU system. However, these pipelined models are at increased risk of propagating downstream error to predict intent and entity. Given that all of these individually-trained predictive models (i.e., domain, intent, and entity predictive models) are based on text data-where contextual information has significant importance for any type of traditional machine learning algorithm or recent deep learning approaches, this text data can be treated as sequence or time series datasets, for which LSTM-based deep learning algorithm shows state-of-the-art performance [13].

Joint Training for Dialog System
Joint training in a dialog system is the process of sharing loss functions between each predictive model i.e., domain, intent and entity extractor. There is extensive prior research on joint modelling for intent and entity extraction. Liu et al. [14] proposed the Attention BiRNN joint model to predict intent and slot with higher accuracy. Ma et al. [15] introduced the sparse attention mechanism for a jointly-trained model for intent classification and slot filling using LSTM. Bekoulis et al. [16] proposed a model for joint entity and relation extraction with adversarial training for diverse data sets, such as news, real estate, and biomedical data that achieved state-of-the-art effectiveness. Goo et al. [17] added related information between intent and slot for joint training. Zhang et al. [18] considered the hierarchical relationship among each layer-words, slots, and intent-in joint models using Capsule Neural Networks. Recently pretrained model or transfer learning on large unlabeled corpora or utterances such as BERT (Bidirectional Encoder Representations from Transformers), DialogGLUE shows the state-of-art-performance on joint intent prediction and slot filling models [19]. In the case of multi-domain dialog systems, the domain model is generally separated from the intent and entity models, which could directly cause downstream errors, i.e., if the dialog system fails to predict the domain then it fails to predict intent and entity based on predicted domain frames [3]. Joint domain, intent, entity approaches have been proposed in prior studies as a way to reduce the risk of downstream error.
There are also several studies on multi domain joint models that leverage existing domains by predicting domain, intent and slot with a single cell of LSTM for multitask/multi-domain dialog systems. Hakkani-Tür et al. [4] proposed RNN-LSTM joint architecture for a multi-domain NLU model. Kim et al. [3] used real user data and jointly trained with bidirectional LSTM to improve predictive performance by reducing downstream error.

Adversarial Training
Adversarial learning is the process of regularizing the neural networks that help to enhance the predictive performance of neural network or deep learning approaches by combining small perturbations or noises along with the training data which increases the loss of a deep learning model [20]. Many neural network and deep learning approaches have recently been exploited in various natural language understanding and speech understanding systems. However, Miyato et al. [6] observed that adversarial examples, i.e., intentional small scale perturbations to the input of deep learning models may lead to incorrect decisions with high confidence. Further, they proposed an image recognition algorithm using an adversarial regularization approach, where a mixture of clean and small perturbations are applied along with an input dataset to enhance the predictive performance of the learning model [6]. If x and u are the respective input and various parameters of a predictive model, the adversarial training to the predictive model adds the following term to its cost function: log pðy j x þ r adv ; hÞ where r adv ¼ arg min log pðy j x þ r;ĥ Þ where r is an adversarial example on the input data andĥ denotes a set of constant parameters of a predictive model. At each training step, machine learning algorithms identify the worst case perturbations r adv against the current trained model pðy j x;ĥ Þ in the above equation with respect to h.
Adversarial training with semi-supervised learning approaches shows performance improvement for multi-domain dialog systems [21]. Although adversarial training has recently been applied in NLP [6] task, this paper is, to the best of our knowledge, the first to apply adversarial training in a multi-domain joint dialog system. Unlike other regularization methods such as dropout or word dropout that introduce

Adversarial Multi Domain Joint Model
Our proposed multi domain joint model (shown in Fig. 3) mainly focuses on the embedding layer with inline perturbations to the input, which is the main source of knowledge for sequence modeling for domain, intent, and entity in the LSTM cell. Each layer and data preprocessing step is curated in the following subsections.

Data Preprocessing
The textual data, which is normally unstructured in nature, should be transformed into a structured dataset that can be employed by a typical machine learning algorithm. Approaches such as frequencybased and vector space [22] methods can be used to extract information from unstructured textual data and transform it into structured data. TF-IDF is one such frequency-based algorithm for unstructured text processing based on the term frequency matrices. Generally, vector space methodology is widely used for text data embedding. It converts the original text documents into a vector feature space. As presented in Fig. 4, the unstructured text documents need to be converted into a structured dataset via various preprocessing techniques, since unstructured datasets cannot be incorporated into a machine learning model directly. The text processing process includes text cleansing, such as tokenizing, and removing special characters. Part-of-speech tagging annotates words with a syntactic category, e.g., nouns, verbs, etc. The stemming process replaces each term with its relative root stem. For example, the term "be" is the stem for words such as "is," "am," and "are." Stemming helps to reduce the number of features derived from examples. The number of features in the corpus can further be reduced by filtering irrelevant words. From this preprocessed cleaned corpus, a word-embedding vector space model is created.
As shown in Fig. 2, in case of the joint approach, there is a BiLSTM layer with three output layers performing domain, intent, and entity prediction, which are final outputs of the BiLSTM layer. We cumulate the losses of these three outputs, which will be minimized and optimized jointly. Crucially, the optimization of the joint model updates shared loss simultaneously, which is suitable for the domain, intent, and entity prediction process, allowing the joint model to not only avoid any downstream error but also helps to improve the predictive performance of each individual task by sharing loss function. Further, the small worst perturbation in the embedding layer maximizes shred loss. The adversarial loss is then added to the joint loss of domain, intent, and slot to enhance the predictive performance of the multidomain NLU model shown in Fig. 3. This architecture mainly focuses on the embedding layer, which is the main source of knowledge for sequence modeling in LSTM. Utterances are embedded along with the generated small perturbation i.e., worst noise data. The embedded examples are fed to a bidirectional LSTM to extract sequential and contextual information. The shared single LSTM cell then routes to three different output layers: domain prediction, intent detection, and slot prediction.

Embedding Layer
The first part of the LSTM cell is the input layer. We construct a word embedding vector layer in order to feed the input to the LSTM. Let w 1 . . . w n 2 W denote a word sequence, and then word embedding is given as follows: Word embedding : e w 2 R 64 for each w 2 W (2)

Adversarial Training Layer
For the adversarial training, small perturbations are fed to neurons along with input x. If h is the parameter of a predictive model, adversarial cost function is calculated as in the following equation: log pðy j x þ r adv ; hÞ where r adv ¼ arg min log pðy j x þ r;ĥ Þ where, r is an adversarial example on the input data. Andĥ denotes a set of constant parameters of a predictive model. At each training step, machine learning algorithms identify the worst case perturbations r adv against the current trained model pðy j x;ĥ Þ in above equation with respect to h.

BiLSTM Layer
A BiLSTM layer containing forward and backward propagation is shown in Fig. 2, where two LSTM cells-forward and backward-are created. Due to the forward and backward LSTM cells, past and future information can be obtained for each memory cell.

Joint Predictive Layer
The final training objective of multi domain joint model is to minimize total cumulative loss, i.e., the sum of the domain, intent, entity, and adversarial loss: The loss for domain, intent, entity, and adversarial learning for domain, intent, and entity l d ; l i ; l t ; l da ; l ia ; l ta is calculated separately for each annotated example. Then the gradient step is applied and shared loss is calculated as l d þ l i þ l t þ l da þ l ia þ l ta . Using the shared loss parameters Â, domain, intent, entity, and adversarial training are optimized jointly for all three predictive tasks using Adam optimizer. The detailed steps of a multi-domain dialog system are as follows:

Experiment
This study used a publicly available annotated dataset (shown in Appendix A, Fig. A1) with 43k examples from three different domains-Alarm, Reminder, and Weather-of multi-domain personal assistants [23]. The dataset contains 11 unique entities and 12 intent labels as shown in Tab. 1, with some overlapping entities between domains. A sample of the processed dataset is shown in Appendix A, Fig. A2.
The dataset is converted into utterance sets of input, slot-tags, intent label, and domain label in a sequential order, as shown in Fig. A2. Input data is enclosed with BOS (Beginning of Sentence) and EOS (End of Sentence). The examples or utterances are then divided into train, dev, and test datasets at a 70 to 20 to10 ratio respectively, as shown in Tab. 2. Then the annotated datasets are preprocessed and tokenized using a widely-used NLU tokenization python library to create word embedding of size 64 that has an input with length 50. Finally, the word-embedding vector is created based on collected dataset and utilized in a TensorFlow-based LSTM python library for learning and predicting processes with the inline addition of perturbation along with input.
To evaluate the Multi-Domain Joint Model with adversarial learning (MDJM-ADV), we conducted a comparative experiment with a prior Multi-Domain Joint Model (MDJM) using the same dataset. Each

Algorithm 1: Multi domain joint model
Step 1: Prepare input data containing sequence of input, domain intent, and slot in the respective order Step 2: Construct word embedding, e w 2 R 64 for each w 2 W Step 3: Create forward and backward LSTM cells Step 4: Create encoder & decoder for input, domain, intent, and slot Step 5: Predict and calculate loss function a. Calculate slot loss function using seq2seq loss b. Calculate intent, and domain loss using cross entropy Step 6: Calculate loss function ð Þ, loss for each fields -Add random perturbation, log pðy j x þ r adv ; uÞ where r adv ¼ arg min log pðy j x þ r;ĥ Þ -Calculate loss function by adding adversarial loss Step 7: Optimize the model based on Adam optimizer IASC, 2022, vol.31, no.1 model is trained at the learning rate of 0.01 with 16 batch size and 100 hidden layers, and optimized with the Adam optimizer. The LSTM-based MDJM-ADV shares the loss function between the domain, intent, entities, and adversarial layers. Addition of perturbations along with input increases the loss function of the LSTM cell, and minimizing the loss function using the Adam optimizer equally shares information between domain, intent, and entity and also reduces the number of predictive models to one. We trained the MDJM and MDJM-ADV models for 20 epochs with 16 batch sizes at the learning rate of 0.001 using 100 hidden neurons. Figs. 5 and 6 show the training loss for MDJM and MDJM-ADV, respectively.

Conclusion
In this paper, we proposed an adversarial joint model, MDJM-ADV, for dialog systems to predict domain, intent, and entity using a single LSTM cell to reduce the risk of downstream error propagation that is present in the typical pipelined approach. Experimental results showed a significant improvement in the predictive performance of each model-domain, intent, and entity-based on adversarial learning in comparison to the base joint model. The proposed model reduces the number of predictive models to one, which ultimately reduces the level of human effort necessary to manage multiple predictive models. Further research would benefit from testing our proposed MDJM-ADV model with various languages and datasets of large volume. Conflict of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.