|Intelligent Automation & Soft Computing |
Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network
Department of Computer Science and Engineering, School of Computing, SRM Institute of Science and Engineering, Kattankulathur, Tamil Nadu, India
*Corresponding Author: A. Pandian. Email: email@example.com
Received: 02 February 2022; Accepted: 01 April 2022
Abstract: Speech enhancement is the task of taking a noisy speech input and producing an enhanced speech output. In recent years, the need for speech enhancement has increased due to challenges in applications such as hearing aids, Automatic Speech Recognition (ASR), and mobile speech communication systems. Most speech enhancement research has been carried out for English, Chinese, and European languages; only a few works address speech enhancement in Indian regional languages. In this paper, we propose a two-fold architecture based on a convolutional recurrent neural network (CRN) that performs real-time, single-channel speech enhancement for Tamil speech signals. In the first stage, a mask-based long short-term memory (LSTM) network is used for noise suppression, and in the second stage, a Convolutional Encoder-Decoder (CED) is used for speech restoration. The proposed model is evaluated on various speakers and noisy environments such as babble noise, car noise, and white Gaussian noise. The proposed CRN model improves speech quality by 0.1 points compared with the LSTM base model and also requires fewer parameters for training. The proposed model performs well even at low Signal to Noise Ratio (SNR).
Keywords: Speech enhancement; convolutional encoder-decoder; long short-term memory; noise suppression; speech restoration
Speech enhancement plays an important role in speech processing applications and voice communication by separating speech from non-speech noise. In recent years, research interest in speech enhancement has increased steadily to address challenges in robust speech recognition, mobile speech communication, and hearing aids. Most speech enhancement algorithms work in the short-time Fourier transform (STFT) domain and apply a weighting rule. The weighting rule is computed from the signal-to-noise ratio (SNR) and the noise power. Various algorithms are available to estimate the SNR and the noise power, and a frequently used technique assumes that noise is more stationary than speech. This technique does not perform well, since the assumption cannot handle highly non-stationary noise types like babble noise or restaurant noise. Mask-learning and feature-learning are the two major approaches for enhancing single-channel speech using stereo data. Mask-learning outperforms feature-learning owing to its speed and dynamic range. The masking process removes the additive noise by assuming that the scale of the masked signal is the same as that of the clean speech signal. Mask learning is carried out through two approximations: mask approximation (MA) directly minimizes the distance between the target mask and the learned mask, whereas signal approximation (SA) minimizes the distance between the target signal and the distorted signal.
An earlier study proposed a Smart Larynx (SL) device for patients with affected vocal cords to increase the clarity of their Tamil speech. The SL device contains a smartphone with an inbuilt vibrator application operating at frequencies between 250 and 450 Hz. The Radial Dilation Wavelet Transformation (RDWT) algorithm is used for processing the Tamil speech signal. That work achieves nearly 85% accuracy in obtaining clean speech signals from noisy environments.
Deep learning algorithms outperform machine learning and other conventional methods in most fields, including automatic speech recognition, image processing, computer vision, and speech enhancement. Most of the models used for speech enhancement are feedforward Deep Neural Networks (DNNs). Deep learning improved the performance of speech enhancement because no assumptions about the stationarity of noise, the loss function, or the topology of the neural network are made in the training phase. Such models usually predict the label for each frame separately, which does not retain long-term context. In [10,11], the researchers suggested sequence-to-sequence mapping for speech separation to strengthen the long-term context. Another study obtained better performance through multi-condition training with a large number of different noise types, directly mapping the noisy features to their corresponding clean speech features. A further study concluded that speech separation is done more efficiently by estimating a time-frequency (T-F) mask based on the ideal ratio mask (IRM) than by estimating the clean spectrogram directly. A mask spectrum approximation (MSA) loss in the domain of the speech spectrum has also been used, and researchers used a phase spectrum approximation (PSA) loss to differentiate the noisy and clean speech signals; PSA outperformed MSA. Other researchers used Long Short-Term Memory (LSTM), which models temporal dynamics and shows better performance in speech separation tasks. With LSTM, temporal dependencies are taken into account and the model focuses more on the target speaker, which provides better speaker generalization. Similarly, the Convolutional Neural Network (CNN) model is widely used in speech enhancement tasks with an increase in performance.
Owing to its encoder-decoder architecture, CNN is a popular model in the fields of image processing and computer vision. A plain CNN, however, uses pooling layers only for compressing the feature dimensions. In the encoder-decoder form, the encoder compresses the features and the decoder decompresses them using up-sampling layers. With the help of skip connections, and by giving the decoder the same number of layers as the encoder, high-resolution information is preserved in tasks such as speech enhancement. The network thus learns the clean speech spectrum by mapping it from the noisy speech spectrum.
Each network topology has its own advantages for speech enhancement. One way to leverage the advantages of different topologies is to combine them in a multi-stage model. With such a formulation, CNN and Recurrent Neural Network (RNN) have been combined for noise- and speaker-independent speech enhancement. In [16,17], the researchers proposed a Recurrent Convolutional Encoder-Decoder (R-CED) that incorporates repeated convolution layers with ReLU as the activation function to reduce the noise signal. In another work, a multi-stage feedforward DNN model is used for separating music sources and enhancing the separated signals. A further study proposed a speaker-independent model that contains an LSTM with 4 hidden layers to handle the noise. That model shows better performance for untrained speakers than a DNN and also performs well in terms of short-time objective intelligibility. To distinguish between noise and clean speech signals, a model that can handle long-term temporal dependencies is required. LSTM is widely used in image restoration since it can handle long-term temporal dependencies. Following its success in image restoration, LSTM is also used in speech enhancement. In related work, researchers estimated the clean speech along with the noise linear prediction coefficients using a deep neural network-based LSTM model, and reduced the residual background noise by post-processing. LSTM performs well in speech enhancement tasks, but it is difficult to implement because of its complex network structure. To overcome this drawback, two models were recently proposed for speech enhancement, namely the Gated Recurrent Unit (GRU) and the Single Gated Unit (SGU) [21,22]. The GRU and SGU are easier to implement than LSTM, but their performance falls short in speech enhancement applications.
Another study introduces a methodology to reduce the risk of over-pruning by evaluating the pruning and fine-tuning in each iteration using the short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) metrics.
Motivated by existing work, we designed a new CRN architecture for real-time speech enhancement. Compared with speech enhancement using the LSTM model alone, the proposed model, which integrates a CNN encoder-decoder, gives superior performance. To estimate the noise power spectral density, researchers used a model that combines deep minimum mean-square error estimation with a ResNet temporal convolutional network. Using a Temporal Convolutional Neural Network (TCNN), a noisy speech input was directly mapped to a clean speech signal. Other researchers presented a methodology for compressing the RNN using pruning and integer quantization, which decreases the size of the RNN by 38% while decreasing the SNR.
The rest of the paper is organized as follows. Section 2 describes the baseline model and the proposed model in detail. Section 3 presents the experimental setup and evaluation results, and Section 4 gives the conclusion and future work.
2 Traditional and Deep Neural Network Based Architecture for Speech Enhancement
The signal model for estimating the clean speech signal from the noisy signal is given in Eq. (1).
$$y(n) = s(n) + d(n) \tag{1}$$
where $s(n)$ denotes the clean speech signal, $d(n)$ denotes the noise signal, $n$ is the discrete-time sample index, and $y(n)$ denotes the noisy microphone signal. A window function is applied frame-wise, and the STFT representation is computed using a K-point Discrete Fourier Transform (DFT), as shown in Eq. (2).
$$Y(l, m) = S(l, m) + D(l, m) \tag{2}$$
where $l$ denotes the frame index and $m$ denotes the frequency bin index. Traditional and DNN approaches mostly use frame-wise and frequency-bin-wise gain functions to compute the clean speech estimate, as shown in Eq. (3).
$$\hat{S}(l, m) = G(l, m)\, Y(l, m) \tag{3}$$
where the gain function is denoted by $G(l, m)$.
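The framing, windowing, DFT, and gain steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the square-root Hann window, and the random test signals are assumptions, while the frame length of 256 and frameshift of 128 follow the values reported later in the paper.

```python
import numpy as np

def stft(x, frame_len=256, frame_shift=128):
    """Frame-wise windowed DFT.

    Returns a complex array of shape (frames, frame_len // 2 + 1),
    indexed by frame l and frequency bin m.
    """
    # Periodic square-root Hann window (an assumed choice of window).
    window = np.sqrt(np.hanning(frame_len + 1)[:-1])
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    frames = np.stack([
        x[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

def apply_gain(Y, G):
    """Clean-speech estimate: bin-wise product of gain and noisy spectrum."""
    return G * Y

# Toy example: the noisy signal is the sum of "speech" and "noise".
rng = np.random.default_rng(0)
s = rng.standard_normal(2048)          # stand-in for clean speech
d = 0.5 * rng.standard_normal(2048)    # stand-in for noise
Y = stft(s + d)
S_hat = apply_gain(Y, np.full(Y.shape, 0.8))  # constant gain, illustration only
```

Because the STFT is linear, the spectrum of the noisy signal equals the sum of the speech and noise spectra, which is what makes a multiplicative gain per frame and bin a natural way to suppress the noise component.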
2.1 Traditional Approaches
In the traditional approach, the gain function depends on the prior and posterior signal-to-noise ratio (SNR), as shown in Eq. (4). In [27,28], the researchers used the minimum mean-square error log-spectral amplitude estimator with the decision-directed approach for estimating the prior SNR, while the posterior SNR is obtained using minimum statistics (MS) for estimating the noise power.
$$G(l, m) = g\big(\xi(l, m), \gamma(l, m)\big) \tag{4}$$
where $\xi(l, m)$ denotes the prior and $\gamma(l, m)$ denotes the posterior SNR.
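The decision-directed estimate of the prior SNR can be illustrated with a short sketch. Several simplifications are assumed here: the noise power per bin is taken as known (the minimum-statistics tracker is not implemented), and the simpler Wiener gain is swapped in for the MMSE log-spectral amplitude gain. The function name, the smoothing constant `alpha`, and the floor `xi_min` are illustrative choices, not values from the paper.

```python
import numpy as np

def decision_directed_gain(Y_pow, noise_pow, alpha=0.98, xi_min=1e-3):
    """Frame-recursive gain from prior and posterior SNR.

    Y_pow:     (frames, bins) noisy power spectrum |Y|^2
    noise_pow: (bins,) noise power estimate (assumed given)
    """
    n_frames, n_bins = Y_pow.shape
    gains = np.empty_like(Y_pow)
    prev_clean_pow = np.zeros(n_bins)   # |S_hat|^2 of the previous frame
    for l in range(n_frames):
        gamma = Y_pow[l] / noise_pow    # posterior SNR
        # Decision-directed prior SNR: weighted mix of the previous
        # clean-speech estimate and the current posterior SNR minus one.
        xi = (alpha * prev_clean_pow / noise_pow
              + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
        xi = np.maximum(xi, xi_min)
        G = xi / (1.0 + xi)             # Wiener gain (stand-in for MMSE-LSA)
        gains[l] = G
        prev_clean_pow = (G ** 2) * Y_pow[l]
    return gains
```

With a stationary, correctly estimated noise power this recursion behaves sensibly, but it inherits the stationarity assumption discussed in the introduction, which is the motivation for the learned masks in the following subsection.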
2.2 Deep Learning Approaches
In deep learning-based approaches, a neural network is trained to map the input feature vector to the output feature vector in order to perform the speech enhancement task. The mapping is determined by the activation function in each neuron and by the trainable parameters of the network topology. In a recurrent neural network, a hidden state is given as an additional input so that the temporal context is taken into account in each frame.
The gain function is often used as a T-F mask for separating the clean speech from the noisy signal. The mask can be learned by minimizing the MA loss function shown in Eq. (6). However, the mask approximation loss does not directly optimize the actual goal of reducing the difference between the enhanced and the clean speech spectra, since the loss in each frequency bin should depend on the clean speech signal and the noisy signal. With MA, the mask value can also be estimated only within a certain range. To overcome this, the mask can be estimated using the masked spectrum approximation loss function shown in Eq. (7).
In Eqs. (6) and (7), the loss function is computed from spectral magnitudes using the noisy speech. Similarly, a loss function can be computed based on the clean speech using the ideal complex mask. These two loss functions have been combined to form the complex MSA (cMSA) loss function shown in Eq. (8), which provides better speech enhancement performance.
Using Eq. (9), the enhanced signal is calculated by a neural network trained with the cMSA loss function.
Using Eq. (10), the real and imaginary parts of the clean speech spectrum can be estimated directly by applying cSA.
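To make the difference between these loss functions concrete, the sketch below writes each one as a mean squared error over all frames and frequency bins. The forms follow common usage in the mask-learning literature and are an assumption here, not a reproduction of this paper's numbered equations; in particular, the cMSA variant is assumed to apply separate real-valued masks to the real and imaginary parts of the noisy spectrum, as described in Section 3. All function names are hypothetical.

```python
import numpy as np

def ma_loss(mask_est, mask_target):
    """Mask approximation: error measured between masks."""
    return np.mean((mask_est - mask_target) ** 2)

def msa_loss(mask_est, Y, S):
    """Masked spectrum approximation: error between the masked noisy
    magnitude and the clean magnitude."""
    return np.mean((mask_est * np.abs(Y) - np.abs(S)) ** 2)

def cmsa_enhance(mask_re, mask_im, Y):
    """Apply separate real-valued masks to real and imaginary parts."""
    return mask_re * Y.real + 1j * mask_im * Y.imag

def cmsa_loss(mask_re, mask_im, Y, S):
    """Complex MSA: error between the masked complex spectrum and the
    clean complex spectrum."""
    return np.mean(np.abs(cmsa_enhance(mask_re, mask_im, Y) - S) ** 2)

def csa_loss(S_est, S):
    """Complex spectrum loss for direct mapping (no mask involved)."""
    return np.mean(np.abs(S_est - S) ** 2)
```

The MA loss never sees the spectra themselves, whereas MSA, cMSA, and cSA all measure the error in the signal domain; cMSA and cSA additionally penalize phase errors because they compare complex values rather than magnitudes.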
2.3 Convolutional Recurrent Neural Network
In the proposed model, speech denoising and restoration are carried out in two different stages. In the first stage, speech denoising is performed by an LSTM trained with the cMSA loss function. In the second stage, a convolutional encoder-decoder is used for restoring the clean speech from the output of the denoising stage. A CED performs well in restoring slightly corrupted structured signals. In stage two, direct spectral mapping is performed, since the cSA loss function is used to train the CED network. The cSA loss function is used because missing T-F regions are restored more efficiently by direct mapping than by a mask-based approach. In addition, the CED network can map its input to an output in the same domain.
3 System Descriptions
The proposed model is based on a two-fold architecture. In the first stage, raw noisy speech is given as the input for noise suppression. Normalized feature vectors are obtained during feature extraction using mean and variance normalization (MVN). To provide temporal context for the current frame, the features of previous and next frames are concatenated during feature extraction. Separate real-valued masks for the real and imaginary parts of the noisy speech spectrum are estimated from the input features using the parameters obtained during the training phase of the noise suppression network. These mask values are applied as in Eq. (8) to obtain the clean speech spectrum. Fig. 1 shows the architecture of the proposed model.
Between the noise suppression and speech restoration stages, the frequency resolution of the speech spectrum is increased by interpolation, so that the CED network operates on a high-resolution spectrum to obtain the clean speech signal. Interpolation helps control spectral leakage from negative frequencies, which causes significant errors when a frame contains only a few cycles of the signal. The interpolation works by applying the Inverse Discrete Fourier Transform (IDFT), zero-padding the result in the time domain, and transforming back to the frequency domain.
After interpolation, the de-noised speech signal is passed to the speech restoration stage. This stage also performs feature extraction with MVN, using the mean and standard deviation. The extracted features are then mapped directly to the enhanced speech spectrum using the trained parameters. Finally, the enhanced time-domain signals are reconstructed using the IDFT and windowing.
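The interpolation step (IDFT, zero padding in the time domain, DFT back to the frequency domain) can be sketched for a single frame as follows. The function name and the interpolation factor of 2 are assumptions for illustration; the paper does not state the factor.

```python
import numpy as np

def interpolate_spectrum(S_frame, factor=2):
    """Increase the frequency resolution of one spectral frame.

    S_frame: one-sided spectrum of a K-point DFT (K // 2 + 1 bins).
    Returns the one-sided spectrum of a (factor * K)-point DFT.
    """
    K = 2 * (len(S_frame) - 1)
    x = np.fft.irfft(S_frame, n=K)                        # back to time domain
    x_padded = np.concatenate([x, np.zeros((factor - 1) * K)])
    return np.fft.rfft(x_padded)                          # finer frequency grid

# Example: a 256-point frame (129 bins) becomes a 512-point frame (257 bins).
frame = np.random.default_rng(2).standard_normal(256)
S_hi = interpolate_spectrum(np.fft.rfft(frame), factor=2)
```

Zero padding adds no new information: every `factor`-th bin of the result coincides with an original bin, and the bins in between are interpolated values on a denser frequency grid.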
3.1 LSTM Based Noise Suppression
In the first stage, noise suppression is carried out using an LSTM, a type of recurrent neural network, as shown in Fig. 2. LSTM handles long-term context well, which helps in tracking the target speaker. LSTM is a type of RNN with memory cells that has shown successful performance in temporal modeling for acoustic models in automatic speech recognition. All RNNs have the form of a chain of repeating neural network modules. LSTM has the same chain structure, but the repeating module has a different structure: instead of a single neural network layer, there are four layers interacting in a special way, as shown in Fig. 3.
Steps involved in LSTM:
Step 1: The sigmoid layer in the forget gate decides what data the cell state has to discard.
Step 2: The combined input (previous output and current input) is also fed into the input gate layer. This layer decides what data from the candidate should be added to the new cell state as shown in Eq. (9).
Here, $A_{t-1}$ represents the output of the previous cell (LSTM step), $B_t$ the input at the current time step, $C_{t-1}$, $C_t$ and $\tilde{C}_t$ the old cell state, the new cell state and the new candidate value, $f_t$ the forget gate, $O_t$ the output gate, $i_t$ the input gate, $\sigma$ the sigmoid function, $x$ the bias for the respective gate, and $W$ the weight for the respective gate.
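The steps above correspond to the standard LSTM update, which can be sketched for a single time step as follows. The variable names mirror the symbols used in the text (A for the cell output, B for the input, C for the cell state, W and x for the weights and biases); the dictionary layout and the function names are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(B_t, A_prev, C_prev, W, x):
    """One LSTM time step.

    B_t:    input at time t, shape (input_dim,)
    A_prev: previous output, shape (hidden,)
    C_prev: old cell state, shape (hidden,)
    W:      dict of weight matrices, each (hidden, hidden + input_dim),
            with keys "f", "i", "c", "o"
    x:      dict of bias vectors, each (hidden,), same keys
    """
    z = np.concatenate([A_prev, B_t])        # combined input
    f_t = sigmoid(W["f"] @ z + x["f"])       # forget gate: what to discard
    i_t = sigmoid(W["i"] @ z + x["i"])       # input gate: what to add
    C_cand = np.tanh(W["c"] @ z + x["c"])    # new candidate value
    C_t = f_t * C_prev + i_t * C_cand        # new cell state
    O_t = sigmoid(W["o"] @ z + x["o"])       # output gate
    A_t = O_t * np.tanh(C_t)                 # new output
    return A_t, C_t
```

The cell state is only scaled by the forget gate and incremented by the gated candidate, which is what lets information persist over many frames and supports tracking of the target speaker.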
In stage 1, the feature vectors extracted after MVN are given as the input for noise suppression. The size of the feature vector is determined by the numbers of previous and next frames in the context, plus one for the current frame.
Here, the two context sizes denote the numbers of previous and next frames whose features are concatenated with those of the current frame. In some situations, the number of next frames can be set to zero.
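The feature preparation described here, normalization followed by concatenation of context frames, can be sketched as below. The function names, the per-utterance statistics, the zero padding at the utterance edges, and the default context sizes are assumptions; in practice the normalization statistics would more likely be computed on the training set.

```python
import numpy as np

def mvn(features, eps=1e-8):
    """Mean and variance normalization per feature dimension."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

def add_context(features, n_prev=2, n_next=0):
    """Concatenate each frame with its previous and next frames.

    features: (frames, dim)
    Returns:  (frames, (n_prev + n_next + 1) * dim)
    Frames beyond the utterance edges are filled with zeros.
    """
    n_frames = features.shape[0]
    padded = np.pad(features, ((n_prev, n_next), (0, 0)))
    return np.concatenate(
        [padded[i:i + n_frames] for i in range(n_prev + n_next + 1)],
        axis=1,
    )
```

Setting `n_next=0` means no future frames are used, so the context adds no look-ahead delay, which matters for real-time operation.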
The noise suppression network starts with one fully connected feed-forward layer that helps to identify efficient features and forwards them to the LSTM layers. Finally, three feed-forward layers follow the LSTM to estimate the T-F mask. Each feed-forward layer consists of 425 nodes with the rectified linear unit (ReLU) as the activation function.
3.2 Speech Restoration Using CED
In the proposed work, a CED is used for speech restoration, with the architecture shown in Fig. 4. Each layer is described by its frequency axis size, frame axis size, and number of feature maps. The output of the speech denoising stage is given as the input to the speech restoration stage. Normalized features are arranged as separate feature maps for the real and imaginary parts.
The feature maps for the real and imaginary parts must match the kernel of the first CED network layer. The frequency axis is zero-padded so that its size is a multiple of 4, since the encoder reduces the frequency dimension by a total factor of 4. Similarly, the decoder features are reconstructed to match the factor of the encoder [32–35]. After replacing the noise that is obtained from the speech de-noising stage, clean speech is estimated. Speech restoration is performed without additional algorithmic delay and without using information from future frames. In the speech restoration network, a convolutional layer is denoted Conv and is described by the number of filter kernels F and the kernel size N along the frequency axis, with size 1 along the frame axis. Transposed convolutional layers are denoted by ConvT. Both the convolutional and the transposed convolutional layers use ReLU as the activation function, with zero padding to map the input feature vector. In the encoder, convolutional and max-pooling layers reduce the feature size along the frequency axis. Similarly, in the decoder, transposed convolutional and up-sampling layers increase the size of the frequency axis [36,37].
3.3 Data Preprocessing and Training the Models
3.3.1 Data Set
Audacity 3.0.2 is used to record the Tamil speech signal from various native Tamil speakers. Audacity is open-source software. It is a basic audio editor which can trim, copy, record, and manipulate sounds. It can be used to adjust the speed and pitch of the audio, or add an equaliser to it. For experimental purposes, a total of 83 speakers (42 male and 41 female) from different age groups were selected. For recording purposes, the WO Mic client interface was used to connect with Audacity.
In total, 7138 utterances of speech signals were collected. Each speaker recorded 100 samples of Tamil sentences. From the entire data set, 60% of the recorded samples are used for training, 20% for development, and 20% for testing. Speaker overlap between the subsets is avoided. The parameters considered for recording the speech signals are listed in Tab. 1.
The samples collected from the various speakers are mixed with common model-independent noises downloaded from https://www.sound-ideas.com. For testing purposes, babble and cafeteria noises downloaded from https://www.auditec.com were used. The entire dataset contains 500 hrs of mixed speech signal. Random utterances were selected from the clean speech and mixed with random noise at a given signal-to-noise ratio (SNR). Figs. 5 and 6 show the speech signal waveforms in Audacity. When mixing the noise with the recorded samples, the SNR is set between 0 and 15 dB in steps of 5 dB. The signal-to-noise ratio is set using effects in Audacity. Figs. 7 and 8 show the various noises used in the experiments. A separate window is used to obtain the noisy data by adding the noise data to the collected speech samples. The frame length is set to 256 samples and the frameshift to 128 samples when computing the input features and targets of the proposed model from the time-domain signals. The size of the DFT is set to 256 for training and evaluation.
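The mixing in this work was done with Audacity effects. For readers who prefer a scripted equivalent, the sketch below shows one common way to scale a noise signal so that the mixture has a chosen SNR. The function name and the random stand-in signals are assumptions, and the noise recording is assumed to be at least as long as the speech.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db.

    Returns the noisy mixture and the scaled noise.
    """
    noise = noise[:len(speech)]                 # noise assumed long enough
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2)
    # Required noise power: speech_pow / 10^(snr_db / 10).
    scale = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise

# Mix one utterance at the SNR levels used in this work (0 to 15 dB, 5 dB steps).
rng = np.random.default_rng(3)
speech = rng.standard_normal(8000)   # stand-in for a recorded utterance
noise = rng.standard_normal(16000)   # stand-in for a noise recording
mixtures = {snr: mix_at_snr(speech, noise, snr)[0] for snr in (0, 5, 10, 15)}
```

Note that this sketch measures power over the whole utterance, including pauses; a speech-active level measurement would give slightly different scaling.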
3.3.2 Training LSTM and CNN Based CED for Noise Suppression and Restoration
Back-propagation through time (BPTT) is used for training the LSTM model for noise suppression, along with the cMSA loss function and the Adam optimizer. The batch size and learning rate are set to 25 and 0.001, respectively, for the Adam optimizer. Overfitting is avoided by setting the weight decay to 0.0002. For training with BPTT, speech utterances are set to a fixed sequence length of 100. Utterances shorter than 100 are zero-padded to match this length. The loss is calculated in each epoch; if the loss does not change for three consecutive epochs, that loss is considered the minimum at the current learning rate, and the learning rate is updated based on the development loss. During the experiments, the learning rate was reduced down to 0.0001, which is taken as the minimum.
In speech restoration, back-propagation is used for training the CED network with cSA as the loss function. The batch size and learning rate for the Adam optimizer are set to the same values as in LSTM training. The loss is calculated in each epoch; if the loss does not change for two consecutive epochs, that loss is considered the minimum at the current learning rate, and the learning rate is updated based on the development loss. In the proposed work, 88 filter kernels and a frequency axis size of 24 are used for training the CED network.
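The learning-rate rule described for both stages can be sketched as a small plateau-based scheduler. The initial rate of 0.001, the minimum of 0.0001, and the patience of three epochs (two for the CED) come from the text; the reduction factor of 0.5, the tolerance, and the class name are assumptions, since the paper does not state how the rate is updated.

```python
class PlateauScheduler:
    """Reduce the learning rate when the development loss stops improving."""

    def __init__(self, lr=0.001, min_lr=0.0001, factor=0.5,
                 patience=3, tol=1e-4):
        self.lr = lr
        self.min_lr = min_lr
        self.factor = factor      # assumed reduction factor
        self.patience = patience  # 3 for the LSTM stage, 2 for the CED stage
        self.tol = tol            # assumed threshold for "no change"
        self.best = float("inf")
        self.stale = 0

    def step(self, dev_loss):
        """Call once per epoch with the development loss; returns the new rate."""
        if dev_loss < self.best - self.tol:
            self.best = dev_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale = 0
        return self.lr

# Usage: one scheduler per stage.
lstm_sched = PlateauScheduler(patience=3)
ced_sched = PlateauScheduler(patience=2)
```

The returned rate would be written into the optimizer at the end of every epoch; training could stop once the rate has reached the minimum and the loss stalls again.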
4 Experimental Results
In the proposed work, we used PESQ and STOI as evaluation metrics and compared the proposed model with traditional models. Based on the experimental results, the proposed deep learning model attains an STOI value of 0.88 for the unprocessed speech signal, whereas traditional models attain an average STOI value of 0.75. Similarly, the PESQ value also increases considerably with deep learning models. Tabs. 2 and 3 show the STOI percentage for processed and unprocessed noisy speech signals for trained and untrained speakers. Tabs. 4 and 5 show the PESQ for processed and unprocessed noisy speech signals for trained and untrained speakers. Fig. 9 compares the models based on the evaluation metrics. In low signal-to-noise ratio conditions, the deep learning model performs well in stage 1, noise suppression. The LSTM trained with the cMSA loss function shows outstanding performance compared with other approaches such as LSTM-MSA and LSTM-IRM, since LSTM-cMSA processes the clean speech spectrum as two separate parts, real and imaginary. In stage 2, speech restoration, the CED network is used along with cSA and performs well in high SNR conditions as well. In the proposed two-fold architecture, noise suppression is implemented using LSTM-cMSA and speech restoration is carried out using CED-cSA. The proposed model shows improvement even in low SNR conditions. At −5 dB, the proposed model improves STOI by 2% and PESQ by 0.1. Similarly, when the speech signal of an untrained speaker is mixed with unknown noise, the proposed model shows an increase of nearly 18.12% in STOI and 0.54 in PESQ at −5 dB.
To analyze the computational complexity of the proposed model, the time taken to process a single frame and the number of parameters are considered. A machine with an Intel Core i7 9th Gen. hexa-core processor at a 2.6 GHz clock speed is used for measuring the processing time per frame. With a 16-millisecond frameshift, the average processing time per frame of the proposed model is 10.2 milliseconds. In stage 1 for noise suppression, LSTM along with up-sampling and pooling layers were used to increase the real-time factor up to 1.91, as shown in Tab. 6.
In this study, we have proposed a two-fold architecture for speech enhancement using recurrent and convolutional neural networks. In the first stage, an LSTM is used for speech denoising with cMSA as the loss function. In the second stage, the speech signal obtained from stage 1 is processed for speech restoration using a CED with cSA as the loss function. The proposed model performs well in the speech denoising stage up to a signal-to-noise ratio of 5 dB. On the other hand, speech restoration using the CED has shown only a small improvement. Combining these two stages, the proposed model shows nearly 2% improvement in STOI and 0.1 Mean Opinion Score (MOS) points in PESQ. In addition, we found that the proposed model is able to reduce the computational complexity, since the proposed Convolutional Recurrent Network performs well with fewer parameters. Automatic Speech Recognition has been widely used in various real-world applications in recent years, and we believe that the proposed model can be used in the preprocessing stage to increase the accuracy of ASR systems.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|