Neural networks (NN) and clustering are two commonly used methods for speech separation based on supervised learning. Recently, deep clustering methods have shown promising performance. Considering that the spectrum of a sound source has time correlation and that the spatial position of a sound source has short-term stability, we combine spectral and spatial features for deep clustering. In this work, the logarithmic amplitude spectrum (LPS) and the interaural phase difference (IPD) function of each time-frequency (TF) unit of the binaural speech signal are extracted as features. The features of consecutive frames are assembled into feature maps, which serve as the input to a bi-directional long short-term memory (BiLSTM) network. The BiLSTM converts the feature maps into high-dimensional vectors, which are used to classify the TF units by K-means clustering. The clustering indices are combined with the mixed speech signal to reconstruct the target speech signal. Simulation results show that the proposed algorithm significantly improves speech separation and speech quality, since both spectral and spatial information are utilized for clustering. The method also generalizes better to untrained conditions than traditional NN methods, e.g., deep neural network (DNN)- and convolutional neural network (CNN)-based methods.
As a fundamental algorithm in signal processing, speech separation can improve the performance of an entire speech processing system, such as speech recognition and speech interface systems, and has a wide range of applications [
Speech separation aims to separate the target speech from background interference, including noise, reverberation and interfering speech. The human auditory system can extract the target sound source in complex acoustic environments, which has inspired research on the perception mechanism of human hearing. Researchers proposed the concept of computational auditory scene analysis (CASA) [
A recent approach treats speech separation as a supervised learning problem, which involves three components: learning machines, training targets and acoustic features. Related research has accordingly been carried out in these three aspects.
Rickard [
Deep learning has been introduced into speech separation due to its excellent learning ability. Narayanan et al. [
In previous studies, only the information of the current TF unit is utilized. Considering that the speech spectrum of the same source has time correlation, and that the spatial position of the source has short-term stability, this paper utilizes a BiLSTM as the encoder to map the logarithmic amplitude spectrum and the interaural phase difference (IPD) function to high-dimensional vectors, which are used to classify the time-frequency units by K-means clustering. This approach combines the spectral and spatial features of consecutive frames to perform speech separation. Compared with CNN-based networks, it improves SAR, SIR and SDR in both trained and untrained environments.
The remainder of the paper is organized as follows. Section 2 presents an overview of the binaural speech separation system based on deep clustering and the extraction of the spectral and spatial features. Section 3 describes the structure and training of the deep clustering network. The simulation results and analysis are provided in Section 4. The conclusion is drawn in Section 5.
In the training stage, the logarithmic amplitude spectrum and phase difference function of the binaural speech signal [
The model for binaural speech signals in reverberant and noisy environments can be formulated as follows:
where
Short-time Fourier transform (STFT) [
where
Logarithmic spectrum of
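As an illustration of this feature, the per-frame log power spectrum can be computed as below. This is a minimal numpy sketch; the frame length, hop size and Hann window are assumed values for the example, not necessarily the paper's settings.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Frame a signal with a Hann window and take the FFT of each frame.
    Returns an array of shape (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def log_power_spectrum(x, frame_len=512, hop=256, eps=1e-10):
    """Logarithmic amplitude (power) spectrum of each TF unit."""
    spec = stft(x, frame_len, hop)
    return np.log(np.abs(spec) ** 2 + eps)

# toy usage: a 1 kHz tone sampled at 16 kHz peaks in FFT bin 32
fs = 16000
t = np.arange(fs) / fs
lps = log_power_spectrum(np.sin(2 * np.pi * 1000 * t))
```

The `eps` floor keeps the logarithm finite on silent TF units.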
The logarithmic spectrum relies only on the amplitude and ignores the spatial information carried by the binaural signal, so the interaural phase difference (IPD) is calculated as the spatial feature.
IPD,
where
The cosine and sine functions of the IPD are formulated as:
The logarithmic spectrum and IPD function are concatenated into a new feature vector for each TF unit:
The feature vectors of T consecutive frames are combined to obtain the feature map shown in
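The feature construction above can be sketched as follows: the cos/sin encoding avoids the 2π wrap-around of the raw phase difference, and the three feature planes are concatenated along the frequency axis. Shapes and names here are illustrative assumptions.

```python
import numpy as np

def ipd_features(spec_left, spec_right):
    """Per-TF-unit IPD, encoded as cos/sin to avoid phase wrapping."""
    ipd = np.angle(spec_left) - np.angle(spec_right)
    return np.cos(ipd), np.sin(ipd)

def feature_map(spec_left, spec_right, eps=1e-10):
    """Concatenate the log spectrum and cos/sin IPD along the frequency
    axis: two (T, F) complex spectra -> one (T, 3F) feature map."""
    lps = np.log(np.abs(spec_left) ** 2 + eps)
    cos_ipd, sin_ipd = ipd_features(spec_left, spec_right)
    return np.concatenate([lps, cos_ipd, sin_ipd], axis=1)
```

For T = 100 frames and F = 257 frequency bins this yields a (100, 771) map per utterance segment.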
Deep clustering is mainly composed of an encoding layer and a clustering layer. Only the encoding layer is used during training. In testing, the high-dimensional vectors are obtained through the encoding layer, and each TF unit is classified in the clustering layer. The structure is shown in
The encoding layer is composed of a BiLSTM [
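In testing, the clustering layer groups the per-TF-unit embeddings produced by the encoder into sources via K-means; for k = 2 the labels directly form a binary mask. Below is a minimal numpy sketch of that clustering step, illustrative only (toy embedding dimension, not the paper's implementation):

```python
import numpy as np

def kmeans(emb, k=2, iters=50, seed=0):
    """Plain K-means over TF-unit embeddings emb of shape (N, D);
    returns one cluster label per TF unit."""
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest center
        labels = np.argmin(((emb[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # recompute centers, keeping a center in place if its cluster empties
        for j in range(k):
            members = emb[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels
```

Reshaping the N labels back to the (T, F) grid gives the binary mask applied to the mixture spectrogram.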
In the training stage, the speech signal is preprocessed to obtain the logarithmic spectrum,
where
The BP algorithm [
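The training objective is not reproduced in the text above; a loss commonly used for deep clustering (assumed here) compares the embedding affinity matrix VVᵀ with the ideal-assignment affinity YYᵀ. Expanding the Frobenius norm lets it be evaluated from small Gram matrices without ever forming the N×N affinities:

```python
import numpy as np

def dc_loss(V, Y):
    """Deep-clustering objective |V V^T - Y Y^T|_F^2 for embeddings
    V (N, D) and one-hot source assignments Y (N, C), computed from
    D x D, D x C and C x C Grams only."""
    VtV = V.T @ V
    VtY = V.T @ Y
    YtY = Y.T @ Y
    return (VtV ** 2).sum() - 2 * (VtY ** 2).sum() + (YtY ** 2).sum()
```

This expanded form is what makes training tractable when N (the number of TF units per segment) is large.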
HRIR is convolved with mono speech signals to obtain a directional binaural signal [
Also, Gaussian white noise is added to the binaural mixed speech as ambient noise. Four SNR conditions are used for training: 0 dB, 10 dB, 20 dB and no noise. For SNR generalization, the testing SNRs also include 5 dB and 15 dB. The training set totals about 80 hours of speech. The testing speech differs from the training speech, so this simulation can be regarded as speaker-independent speech separation.
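The data synthesis just described can be sketched as below: each mono source is convolved with a left/right HRIR pair to place it in space, and white noise is scaled to hit a target SNR. The HRIR arrays here are hypothetical placeholders; the actual responses come from a measured HRIR database.

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Place a mono source at the HRIR's direction by convolving it
    with the left/right head-related impulse responses."""
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

def add_noise(x, snr_db, rng):
    """Add white Gaussian noise at a target SNR in dB."""
    noise = rng.standard_normal(len(x))
    scale = np.sqrt((x ** 2).mean() / (noise ** 2).mean() / 10 ** (snr_db / 10))
    return x + scale * noise

# toy usage with a 2-tap placeholder HRIR pair
rng = np.random.default_rng(0)
src = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
left, right = spatialize(src, np.array([1.0, 0.5]), np.array([0.5, 1.0]))
noisy = add_noise(src, 10.0, rng)
```

Summing the spatialized signals of several sources per ear yields the binaural mixture.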
The reverberation times (RT60) for training are 0.2 s and 0.6 s. The RT60 for testing is 0.3 s, which verifies the generalization of the proposed algorithm to reverberation.
At the same time, in order to distinguish noise segments from silent segments during training, voice activity detection (VAD) [
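As one common realization of this step (an assumption for illustration, not necessarily the cited VAD method), frames can be labeled by comparing their energy against the loudest frame:

```python
import numpy as np

def energy_vad(x, frame_len=512, hop=256, threshold_db=-40.0):
    """Frame-level VAD: a frame counts as active if its energy is
    within threshold_db of the loudest frame's energy."""
    n = 1 + (len(x) - frame_len) // hop
    energy = np.array([(x[i * hop:i * hop + frame_len] ** 2).mean()
                       for i in range(n)])
    energy_db = 10 * np.log10(energy + 1e-12)
    return energy_db > energy_db.max() + threshold_db

# toy usage: silence with a voiced burst in the middle
x = np.zeros(16000)
x[6000:10000] = np.sin(2 * np.pi * 300 * np.arange(4000) / 16000)
vad = energy_vad(x)
```

Only TF units inside active frames then contribute to the training targets.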
Sources to Artifacts Ratio (SAR), Source to Distortion Ratio (SDR), Source to Interferences Ratio (SIR) [
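The full BSS Eval decomposition behind SAR/SIR/SDR is more involved than can be shown here; as a simplified stand-in, a scale-invariant SDR projects the estimate onto the reference and compares target energy with residual energy:

```python
import numpy as np

def si_sdr(est, ref):
    """Scale-invariant SDR in dB between an estimated and a reference
    signal of equal length (simplified, not the full BSS Eval SDR)."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal scaling
    target = alpha * ref
    return 10 * np.log10((target ** 2).sum() / ((est - target) ** 2).sum())
```

Higher values indicate an estimate closer to the clean reference.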
We compare the proposed method, binaural speech separation based on deep clustering, with several related binaural speech separation methods. The two baseline algorithms are the DNN-based method with the IBM and the CNN-based method.
First, we evaluate the performance of the above algorithms in the noisy environment. SAR, SIR, SDR and PESQ of different algorithms are shown in
SAR (dB):

SNR (dB) | IBM-DNN | CNN | Deep Clustering |
---|---|---|---|
0 | 0.07 | 2.02 | 1.57 |
5 | 2.71 | 4.54 | 4.02 |
10 | 6.02 | 6.95 | 7.15 |
15 | 7.81 | 8.01 | 8.54 |
20 | 8.34 | 8.77 | 9.12 |
Noiseless | 8.85 | 9.03 | 9.44 |
SIR (dB):

SNR (dB) | IBM-DNN | CNN | Deep Clustering |
---|---|---|---|
0 | 14.42 | 15.19 | 14.79 |
5 | 15.14 | 16.01 | 16.18 |
10 | 15.98 | 16.45 | 16.92 |
15 | 16.41 | 16.70 | 17.01 |
20 | 16.71 | 16.87 | 17.35 |
Noiseless | 17.14 | 17.02 | 17.58 |
SDR (dB):

SNR (dB) | IBM-DNN | CNN | Deep Clustering |
---|---|---|---|
0 | −0.77 | 1.54 | 0.79 |
5 | 3.02 | 4.41 | 4.16 |
10 | 5.31 | 6.02 | 7.14 |
15 | 6.95 | 7.21 | 8.15 |
20 | 7.52 | 7.85 | 9.02 |
Noiseless | 7.96 | 8.31 | 9.79 |
PESQ:

SNR (dB) | IBM-DNN | CNN | Deep Clustering |
---|---|---|---|
0 | 1.42 | 1.85 | 1.67 |
5 | 1.7 | 2.07 | 1.94 |
10 | 1.79 | 2.17 | 2.11 |
15 | 1.95 | 2.24 | 2.25 |
20 | 2.21 | 2.45 | 2.39 |
Noiseless | 2.41 | 2.57 | 2.52 |
According to the performance comparison, at low SNR the performance of the proposed algorithm is close to that of the CNN, while at high SNR deep clustering significantly improves the separation performance compared with IBM-DNN and CNN. For the unmatched SNRs of 5 dB and 15 dB, the proposed algorithm also maintains its separation performance and speech quality.
We also analyze the reverberation generalization of the proposed method and the CNN method. The RT60 of the testing data is 0.3 s, which differs from that of the training data. The comparison results are shown in
SAR (dB):

SNR (dB) | CNN | Deep Clustering |
---|---|---|
0 | 1.89 | 1.32 |
5 | 4.07 | 3.95 |
10 | 6.61 | 6.70 |
15 | 7.45 | 7.79 |
20 | 8.26 | 8.71 |
SIR (dB):

SNR (dB) | CNN | Deep Clustering |
---|---|---|
0 | 14.77 | 14.51 |
5 | 15.82 | 15.94 |
10 | 15.91 | 16.41 |
15 | 16.54 | 16.63 |
20 | 16.68 | 16.72 |
SDR (dB):

SNR (dB) | CNN | Deep Clustering |
---|---|---|
0 | 1.02 | 0.34 |
5 | 3.57 | 3.46 |
10 | 5.21 | 6.71 |
15 | 6.57 | 7.35 |
20 | 7.25 | 8.07 |
Under non-matching reverberation, the separation performance of the proposed algorithm is much better than that of the CNN method, indicating the reverberation generalization ability of the deep-clustering-based separation method.
In this paper, we presented a speech separation algorithm based on deep clustering. Considering that the frequency spectrum of a speech signal has time correlation, and that the spatial position of the speech source has short-term stability, the proposed algorithm combines spectral and spatial features into feature maps, which are fed to a BiLSTM. The encoding layer maps each feature map to high-dimensional vectors, which are used to classify the TF units by K-means clustering. Speech separation based on deep clustering has shown its ability to improve both separation performance and speech quality. The proposed algorithm also maintains its performance under unmatched SNR and reverberation conditions, demonstrating noise and reverberation generalization.