Intelligent Automation & Soft Computing DOI: 10.32604/iasc.2022.017559
Article
CVAE-GAN Emotional AI Music System for Car Driving Safety
1Dept. of Health and Marketing, Kainan University, Taoyuan City, 330, Taiwan
2Master Program of Sound and Music Innovative Technologies, National Chiao Tung University, Hsinchu City, 300, Taiwan
*Corresponding Author: Chih-Fang Huang. Email: jeffh.me83g@gmail.com
Received: 03 February 2021; Accepted: 04 March 2021
Abstract: Musical emotion is important for the listener’s cognition. A smooth emotional expression generated through listening to music makes driving a car safer. Music has become more diverse and prolific with rapid technological developments. However, the cost of music production remains very high. At present, because the cost of music creation and the playing copyright are still very expensive, the music that needs to be listened to while driving can be executed by the way of automated composition of AI to achieve the purpose of driving safety and convenience. To address this problem, automated AI music composition has gradually gained attention in recent years. This study aims to establish an automated composition system that integrates music, emotion, and machine learning. The proposed system takes a music database with emotional tags as input, and deep learning trains the conditional variational autoencode generative adversarial network model as a framework to produce musical segments corresponding to the specified emotions. The system takes the music database with emotional tags as input, and deep learning trains the CVAE-GAN model as the framework to produce the music segments corresponding to the specified emotions. Participants listen to the results of the system and judge whether the music corresponds to their original emotion.
Keywords: Car driving safety; musical emotion; AI music composition; automated composition; deep learning; CVAE-GAN model
1 Introduction
In present-day transportation, most car drivers face heavy traffic daily. To reduce the probability of car accidents, various smart sensors and methods have been developed [1,2]. At present, self-driving cars remain an immature technology that can neither replace a human driver nor ensure safe driving [3,4]. Listening to music can refresh the driver and thus reduce the probability of traffic accidents [5,6]. The present study applies the conditional variational autoencoder-generative adversarial network (CVAE-GAN) method proposed in [7,8] to develop an emotionally intelligent system that automatically composes music to support safe driving. The proposed system automatically generates music according to the driver's emotional state, reducing the probability of car accidents. However, pop music is musically rich and varied, its production requires considerable labor and talent, and buying the intellectual property rights to pop music is highly expensive.
To address these problems, increasingly popular automated composition systems have been developed, which rely on precomputed models or machine learning. These systems generate music from principles of music theory, such as pitch, rhythm, and harmony, through algorithmic standardization. A well-known example is the hidden Markov model (HMM)-produced soundtrack [9]. Owing to rapid advances in science and technology, artificial neural networks (ANNs), which originally relied on expensive hardware computation, have now been substantially improved. In addition to the statistical basis of the HMM, ANNs now offer additional modeling capabilities that can greatly reduce the preparatory work required to generate music. This study aims to use a simple, neural network-based automated system to compose music that matches the listener's current emotion.
Human emotions are extremely complex, and one's emotions change depending on the music one is listening to [10,11]. Researchers have offered differing definitions of the basic emotions. For example, in 1972, Ekman defined six basic emotions by analyzing facial expressions [12]. In 1980, Russell developed a circumplex model of emotion, with valence on the horizontal axis (positive vs. negative feeling) and arousal on the vertical axis (activation level), distributing common emotions over a two-dimensional plane to show how they correlate with one another [13]. Since then, this model has been extensively applied in different fields to explore the relationships between emotions. In 2007, Gomez et al. explored the relationship between emotion, organized on this two-dimensional plane, and musical characteristics [14]; after conducting a series of experiments, they proposed formulas relating various musical characteristics to emotions. Since then, several scholars have analyzed the relationship between emotion and music. With the development of neural networks, various models have been proposed, such as DNN, CNN, RNN, and generative adversarial network (GAN) models, and scholars have applied machine learning to music (e.g., MidiNet [15] and MuseGAN [16]). Many repetitive tasks, including music theory analysis and music information retrieval, which are necessary when using the HMM, have been simplified, and the efficiency of automated composition has improved; these advances have allowed nonmusicians to conduct research on music composition.
At present, few scholars have discussed emotion, music, and machine learning simultaneously; the present study aims to do so, specifically by using emotion as the conditioning information for neural network-based automated music composition. This contribution gradually simplifies the steps involved in song conversion and serves as a prototype for multidisciplinary research.
Emotion-related research is based on the emotional circle model proposed by Russell and has been extended to other domains. For example, in 2009, Laurier et al. used the two-dimensional emotional plane as their basis and, through self-organizing maps, established a new emotional distribution plane [17]. In this novel plane, similar emotional words are placed closer together; this distribution reveals the similarity between emotional words and the tendency toward group classification. In addition, some researchers have used various calculation models to further analyze the two main emotion classification methods in music: semantic classification and dimensional classification [18].
3 Proposed AI Emotional Music System for Safe Car Driving
Many transportation accidents are partly attributable to the driver's emotional state [19,20]. This study's emotionally intelligent system for automated music composition (Fig. 1) uses the driver's emotional state as the input and generates a corresponding music composition to stabilize the driver's emotions; in doing so, the likelihood of a traffic accident is reduced.
Figure 1: Safe car driving through the emotionally intelligent system for automated music composition
4 Proposed Music Generation System
The proposed system comprises three main parts: creating a music library, establishing the system model, and obtaining the system output. The system is detailed as follows (Fig. 2).
Figure 2: Proposed music generation system
The architecture of the proposed system is based on the CVAE-GAN model (Fig. 3). The encoder and decoder (the latter also serving as the generator) are connected in series in a sequence-to-sequence (Seq2Seq) fashion, while the generator (i.e., the decoder), discriminator, and classifier are connected as in a general CGAN; each component is implemented as a multilayer GRU model. Several preliminary steps must be followed when using music as the input vector in the model, as shown in Fig. 3.
Figure 3: Flowchart of the proposed system
When raw music data obtained from the database are entered into the model, they are initially expressed as one-hot vectors; subsequently, their dimensionality is reduced through embedding. In addition to yielding computational savings, this process avoids generating large numbers of sparse one-hot vectors and the waste caused by their many zero values [21]. The model is optimized with the Adam algorithm [22], a first-order gradient-based method for stochastic objective functions that uses adaptive estimates of lower-order moments. The original one-hot vectors have 99 dimensions; after embedding, this is reduced to 24 dimensions, of which pitch occupies 8 dimensions and note duration occupies 16 dimensions. In the model code, the input data are represented by a shape of (number of songs, maximum number of notes in a song, number of pitch dimensions). For example, a data point with the shape (4, 6, 8) indicates four songs, each containing up to six notes represented as eight-dimensional vectors. After the data are encoded, the tile function is used to attach the emotion condition (called attribute in the experimental code); its length is expanded to correspond with each note, and the concat function is used to join the emotion and the input data. All the system model parameters used in this article are listed in Tab. 1.
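The snippet below is a minimal sketch (not the authors' actual code) of the embedding, tile, and concat steps just described. The embedding is modeled here as a trainable linear projection of the one-hot note vectors, and the four-dimensional emotion condition is an assumption based on the four emotion groups used later in the experiment.

```python
# Illustrative sketch: reduce one-hot note data to a 24-dim embedding and
# attach the per-song emotion condition to every note via tile + concat.
import tensorflow as tf

num_songs, max_notes = 4, 6            # matches the shape example above
onehot_dim, embed_dim = 99, 24          # 99-dim one-hot reduced to 24 dims
emotion_dim = 4                         # assumed: one slot per emotion group

notes_onehot = tf.random.uniform((num_songs, max_notes, onehot_dim))  # placeholder data
emotion = tf.one_hot([0, 1, 2, 3], emotion_dim)                       # one label per song

# Embedding as a trainable projection of the one-hot vectors.
embed = tf.keras.layers.Dense(embed_dim, use_bias=False)
notes_embedded = embed(notes_onehot)                                  # (songs, notes, 24)

# Tile the emotion vector so every note carries the condition, then concat.
emotion_tiled = tf.tile(emotion[:, tf.newaxis, :], [1, max_notes, 1]) # (songs, notes, 4)
conditioned = tf.concat([notes_embedded, emotion_tiled], axis=-1)     # (songs, notes, 28)
print(conditioned.shape)
```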
The deep learning framework of the proposed system is similar to those built on standard deep learning libraries [23,24], in which the developer can freely add models, classifiers, algorithms, and other required components, substantially lowering the barrier to writing machine learning code. Machine learning frameworks are usually open source, and most provide interfaces in multiple languages, so users can choose a framework suited to their requirements. Most deep learning frameworks comprise several parts, including tensors, operations on those tensors, computation graphs, automatic differentiation tools, and each framework's own extension packages. A tensor represents data and forms the core of a deep learning framework. A comparison of the frameworks Caffe, Neon, TensorFlow, Theano, and Torch as of 2 August 2016 [25] shows that all of them support languages such as Python and C++. The mainstream is currently dominated by TensorFlow, although PyTorch is increasingly popular.
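As a small illustration of the framework features listed above (tensors, operations on them, and automatic differentiation), the following TensorFlow fragment is purely a sketch and is not part of the proposed system.

```python
# Tensors, an operation on them, and automatic differentiation in TensorFlow.
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a tensor holding the data
w = tf.Variable(tf.ones((2, 1)))            # trainable parameters

with tf.GradientTape() as tape:             # records the computation graph
    y = tf.matmul(x, w)                     # an operation on tensors
    loss = tf.reduce_mean(tf.square(y))     # scalar objective

grad = tape.gradient(loss, w)               # automatic differentiation
print(grad.numpy())
```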
Music Extensible Markup Language (MusicXML) is an open, XML-based file format for encoding Western sheet music; it can be used freely under the W3C community license agreement [26,27]. The most common file format for music is MIDI, which can represent complex compositions and is readily playable, but reading musical information from it is more difficult. By contrast, MusicXML precisely defines the notation displayed in the score, such as pitch and duration, so different programs can open the same file and display the same score and musical information [28,29], as shown in Tab. 2. To ensure that the training data are unified and complete, this paper uses MusicXML as the file format for the training data and integrates the data into a standardized database under a given set of specifications.
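One possible way to extract pitch and duration from a MusicXML file is the music21 library, as sketched below; this is an illustrative approach, not the authors' actual preprocessing pipeline, and the file name is a placeholder.

```python
# Read (MIDI pitch, duration in quarter lengths) pairs from a MusicXML file.
from music21 import converter

score = converter.parse("example.musicxml")        # parse the MusicXML score
notes = []
for n in score.flatten().notes:                    # iterate over note events
    if n.isNote:                                   # skip chords for simplicity
        notes.append((n.pitch.midi, float(n.duration.quarterLength)))

print(notes[:10])   # e.g., [(60, 1.0), (62, 0.5), ...]
```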
Seq2Seq is composed of two RNNs: an encoder and a decoder. The input sequence is digested by the encoder and compressed into a single context vector; the decoder then generates the output according to this context vector. The encoder is responsible for compressing a sequence of length M into one vector, whereas the decoder generates N outputs from that vector. Through the combination of M→1 and 1→N, an M-to-N model is constructed. Thus, Seq2Seq can handle input and output sequences of any variable length; one of its common applications is machine translation, as shown in Fig. 4 [30,31]. In addition, the models used for the encoder and decoder in Seq2Seq can be replaced by any other model, making the approach widely applicable.
Figure 4: Schematic of sequence-to-sequence learning
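The fragment below is a minimal Keras sketch of the Seq2Seq idea described above: a GRU encoder compresses an M-step input into one context vector, and a GRU decoder expands that vector into N output steps. The dimensions are illustrative assumptions, not the study's settings.

```python
# GRU encoder-decoder in the Seq2Seq style (illustrative sizes).
import tensorflow as tf
from tensorflow.keras import layers, Model

feat_dim, hidden = 28, 64
M, N = 6, 6

enc_in = layers.Input(shape=(M, feat_dim))
context = layers.GRU(hidden)(enc_in)                  # M steps -> 1 context vector

dec_in = layers.RepeatVector(N)(context)              # 1 vector -> N time steps
dec_out = layers.GRU(hidden, return_sequences=True)(dec_in)
out = layers.TimeDistributed(layers.Dense(feat_dim))(dec_out)

seq2seq = Model(enc_in, out)
seq2seq.compile(optimizer="adam", loss="mse")
seq2seq.summary()
```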
A variational autoencoder (VAE) is a generative model in which a distribution is constructed to approximate the unknown data distribution so that generated samples resemble the actual samples. A VAE uses two sets of parameters, the mean and the variance, to map arbitrarily distributed data into a more tractable normal distribution. As indicated by the VAE structure shown in Fig. 5, each sample is passed through a normal distribution and sampled within a specified range to avoid generating discretely distributed information.
Figure 5: Variational autoencoder structure
The sampled value of z in Fig. 6 is the coordinate value in the latent space. The KL divergence is used to compute the loss between the distribution of the sampled z values and the expected latent distribution; the closer the KL loss is to 0, the more closely the data follow a normal distribution [32,33]. A conditional variational autoencoder (CVAE) is a generative model that incorporates the vector of a specific label into the encoder and decoder of the VAE to generate samples that meet specific requirements [34].
Figure 6: Schematic of variational autoencoder
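The following sketch illustrates the VAE mechanics just described: the encoder outputs a mean and log-variance, z is sampled with the reparameterization trick, and the KL divergence to a standard normal is driven toward zero. The batch and latent sizes are assumed for illustration only.

```python
# Reparameterization trick and KL loss for a VAE (illustrative shapes).
import tensorflow as tf

def sample_z(mean, log_var):
    """z = mean + sigma * epsilon, with epsilon drawn from N(0, 1)."""
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

def kl_loss(mean, log_var):
    """KL divergence between N(mean, var) and N(0, 1), averaged over the batch."""
    return tf.reduce_mean(
        -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mean) - tf.exp(log_var), axis=-1)
    )

mean = tf.zeros((8, 16))       # encoder outputs for a batch of 8, latent size 16
log_var = tf.zeros((8, 16))
z = sample_z(mean, log_var)    # latent code fed to the decoder/generator
print(z.shape, kl_loss(mean, log_var).numpy())
```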
9 Generative Adversarial Network
A GAN is an unsupervised learning method that learns by letting two neural networks compete with each other. The network comprises a generative network and a discriminative network. The generative network samples a randomized input from a predefined latent space, and its output must imitate the real samples in the training set as closely as possible. The input of the discriminative network is either a real sample or the output of the generative network, and its task is to distinguish the generated samples from the real samples as accurately as possible, whereas the generative network must deceive the discriminative network as much as possible. Through the confrontation between the two networks and the constant adjustment of parameters, the discriminative network is eventually expected to be unable to judge whether the output of the generative network is real [35–37], as shown in Eq. (1).
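Eq. (1) is not reproduced in this excerpt; the objective it refers to is the standard GAN minimax game [35], which can be written as

```latex
\min_{G}\max_{D} V(D,G) =
\mathbb{E}_{x\sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z\sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

where D(x) is the discriminator's estimate that x is real and G(z) is the generator's output for noise z.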
Fig. 7 shows the GAN training process [22], where the black dots in the middle indicate the real data distribution, the zigzag dashed line indicates the discriminator's output, the solid line indicates the generated data distribution, the lower horizontal z axis represents the noise, and the upper horizontal x axis represents the data space; the mapping between them is expressed as x = G(z). In the figure, (a) is the initial state, (b) and (c) are intermediate training stages, and (d) indicates that training has converged and the generated and real data distributions overlap, so that the discriminative network can no longer distinguish the real data from the generated data.
Figure 7: Training process of generative adversarial network
As shown in Fig. 8, the CVAE-GAN structure comprises four neural networks (the encoder, generator, classifier, and discriminator) whose functions complement one another. The encoder (E in the figure) takes the raw data x and category c and produces a latent vector z that is expected to follow a Gaussian distribution; the generator (G) takes the latent vector z and category c and produces the generated data x′; the classifier (C) takes x or x′ as input and outputs the category to which it belongs; and the discriminator (D) takes x or x′ as input and judges whether it is real data or generator-generated data. The discriminator is the adversarial counterpart of G, as in a standard GAN [35]. As mentioned, the VAE forms the front part of the CVAE-GAN structure and the GAN forms its back part; the generated data must also be assigned by C to the intended category, and G simultaneously serves as the decoder of the VAE structure. In CVAE-GAN, the generator is trained with three types of losses.
Figure 8: Structural diagram of the CVAE-GAN system
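The sketch below shows how the four components described above fit together in one forward pass: E encodes (x, c) into z, G decodes (z, c) into x′, C classifies both x and x′, and D judges real versus generated. The networks E, G, C, and D are stand-ins (e.g., Keras models built elsewhere), not the authors' actual GRU-based implementations.

```python
# One CVAE-GAN forward pass over stand-in networks E, G, C, D (a sketch).
import tensorflow as tf

def forward_pass(E, G, C, D, x, c):
    mean, log_var = E([x, c])                     # encoder -> latent Gaussian parameters
    eps = tf.random.normal(tf.shape(mean))
    z = mean + tf.exp(0.5 * log_var) * eps        # sampled latent vector
    x_gen = G([z, c])                             # generator/decoder produces x'
    class_real, class_gen = C(x), C(x_gen)        # classifier checks the category
    real_score, gen_score = D(x), D(x_gen)        # discriminator: real vs. generated
    return x_gen, (class_real, class_gen), (real_score, gen_score)
```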
11 Musical Elements that Affect Emotions
Music has many elements, each of which has its own effect on the listener's mood. In 2007, Gomez listed 11 musical characteristics that are pertinent to emotion and explored their relationships with emotional direction (positive vs. negative valence) and emotional arousal [36]. Juslin and Timmers also explored the emotional circle model, specifying emotions and their corresponding musical characteristics. As indicated in Fig. 9, volume, timbre, speed, and the player's technique influence emotional arousal, whereas musical tonality and timbre noise influence emotional valence [37].
Figure 9: Juslin and Timmers' emotional circle model and the relationship between musical elements and emotion (arranged from those authors' work)
Among these musical elements, tempo, rhythm, tonality, and pitch exert the greatest influence on emotion. For example, music with a fast tempo, clear rhythm, and a major key induces joy and excitement, whereas music with a slow tempo and minor key induces melancholia. Through this inductive correspondence, the composer can create music that is more consonant with the listener’s mood. In this study, tempo and tonality were used as the major elements when selecting music for CVAE-GAN system training.
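The mapping below is an illustrative sketch of the tendencies described above, relating tempo and tonality to an emotion label; the thresholds and labels are assumptions for demonstration, not the study's actual selection rules.

```python
# Illustrative tempo/tonality -> emotion mapping (assumed thresholds and labels).
def emotion_from_music(tempo_bpm: int, mode: str) -> str:
    if mode == "major" and tempo_bpm >= 120:
        return "happy/excited"
    if mode == "major":
        return "calm/relaxed"
    if tempo_bpm >= 120:
        return "angry/tense"
    return "sad/melancholic"

print(emotion_from_music(140, "major"))   # -> happy/excited
print(emotion_from_music(70, "minor"))    # -> sad/melancholic
```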
12 Musical Structure that Affects Emotions and Tensions
In addition to the aforementioned musical elements, song structure is another important factor affecting emotion. In pop music, the typical song structure is intro–verse–prechorus–chorus–bridge–outro, in which the verse and chorus are indispensable. The verse is the main storytelling passage of a pop song [38,39]: its melody varies little and its music is simple, and it is indispensable for emotionally priming the listener for the climax of the chorus, where lyrics and melody are repeated to intensify the emotions built up by the preceding verse. According to narratology (the theory of storytelling), a work of art ought to tell a story through varying tensions that arouse emotion [40–42], such as the contrast between verse and chorus in a musical structure. Fig. 10 illustrates the structure of a pop song and the corresponding tension of each of its parts. Using these tensions, the proposed CVAE-GAN system can be trained on the music dataset to generate a typical pop song that induces emotion through the tension flow of the song structure.
Figure 10: Flow of musical tension in the structure of a typical pop song
13 Model Structure of the Proposed CVAE-GAN System
This paper uses the CVAE-GAN model as the main architecture, as shown in Fig. 11. The encoder and the decoder (the latter also serving as the generator) are connected in series using the Seq2Seq method, while the generator (i.e., the decoder), discriminator, and classifier are connected as in a general CGAN, and each component is implemented as a multilayer GRU model. Several preliminary steps must be executed when music is used as the input vector in the model (Fig. 8). When the raw data from the database enter the model, they are initially expressed as one-hot vectors; subsequently, their dimensionality is reduced through embedding. In addition to yielding computational savings, this method avoids generating a large number of sparse one-hot vectors and the waste caused by their many zero values [30]. As indicated in Tab. 3, the original one-hot vectors have 99 dimensions; after embedding, the data are reduced to 24 dimensions, of which pitch occupies 8 dimensions and note duration occupies 16 dimensions. In the model code, the input data are represented by a shape of (number of songs, maximum number of notes in a song, number of pitch dimensions). For example, a data point with the shape (4, 6, 8) indicates four songs, each containing up to six notes represented as eight-dimensional vectors. After the data are encoded, the tile function is used to attach the emotion condition (called attribute in the experimental code); its length is expanded to correspond with each note, and the concat function is used to join the emotion and the input data.
Figure 11: Model structure of the proposed CVAE-GAN system
In its experiment, this study administered a questionnaire survey to an ethnically diverse sample of young adults (in their 20s and 30s). The questionnaire contained four question groups, each of which covered two pieces of music and asked participants to judge the emotional content of each piece. The questions were scored on a five-point scale. The questionnaire covered two broad aspects: the relationship between melody and emotion, and whether this relationship is affected by phrase length. The following four groups of emotions were considered:
A (happy, excited, surprised), B (angry, discouraged), C (sorrowful, melancholic), and D (calm, relaxed, comfortable). Tab. 4 shows the scoring statistics for two generated pieces of music.
As shown in Fig. 12, over the 500 training epochs, the loss value of each component changed little after epochs 280 to 320, and convergence was reached in this interval. In the training process, multiple steps were run in each epoch, and the number of steps was determined by the batch size and the size of the dataset. Each step output its current loss value, plotted in Fig. 12; thus, different epochs may contain different numbers of steps.
Figure 12: Error convergence curve
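The fragment below is a small sketch of the relationship stated above, in which the number of steps per epoch follows from the dataset size and batch size; the values are placeholders rather than the study's actual settings.

```python
# Steps per epoch derived from dataset size and batch size (placeholder values).
import math

dataset_size = 1000
batch_size = 32
epochs = 500

steps_per_epoch = math.ceil(dataset_size / batch_size)   # 32 steps in this example
for epoch in range(epochs):
    for step in range(steps_per_epoch):
        pass  # train on one batch and record the current loss value here
```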
15 Conclusions and Future Work
Establishing the music database for this experiment required a long period of manual collection and review; building a music database is therefore expensive in both money and labor. Careful evaluation is required when selecting the model, the musical characteristics that affect the listener's mood, the number of tracks, and the file format of the input data. The experimental results indicated, first, that the music clips produced after model learning received higher emotional-similarity scores for their preset emotional category and, second, that their scores for the other three categories differed significantly; thus, most of the intended emotional characteristics were learned. In addition, participants were found to be highly satisfied with the generated music. In the future, the CVAE-GAN emotionally intelligent system for automated music composition, which aims to improve driving safety, can be combined with biofeedback sensors, such as EEG brainwave sensors [43] or heart rate trackers [44], to detect the physical and mental state of the driver in real time. The driver's response to the system and its generated music can then be used to select musical elements more accurately and automatically, further reducing the probability of traffic accidents.
Funding Statement: The authors appreciate the support from Taiwan’s Ministry of Science and Technology (MOST 108-2511-H-424-001-MY3).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.