Audio signal separation is an open and challenging issue in the classical “Cocktail Party Problem”. In a reverberant environment in particular, mixed signals are more difficult to separate because of reverberation and echo. To solve this problem, we propose a determined reverberant blind source separation algorithm. The main innovation of the algorithm lies in the estimation of the mixing matrix. A new cost function, which measures the gap between the prediction and the actual data, is built to obtain an accurate demixing matrix. The update rule of the demixing matrix is then derived using the Newton gradient descent method. The identity matrix is used as the initial demixing matrix to avoid the local optima problem. Through real-time iterative updates of the demixing matrix, the frequency-domain sources are obtained, and the time-domain sources are then recovered using an inverse short-time Fourier transform. Experimental results on a series of speech and music mixtures demonstrate that the proposed algorithm achieves better separation performance than state-of-the-art methods, with a particularly clear advantage in highly reverberant environments.
In the classical “Cocktail Party Problem”, the collected sound signals are the mixtures of multiple sounds [
When the number of sources equals the number of sensors, the mixture is determined. Taking the effects of echo and reverberation into consideration, the mixing signals can be described by a convolutive model. Blind Source Separation (BSS) is an effective way to solve the source separation problem for convolutive mixtures, because it can separate the unknown source signals from the mixing signals without any channel information [
Determined BSS based on time-frequency masking is a very popular speech separation algorithm [
In this paper, we propose a novel Determined Reverberant Blind Source Separation (DR-BSS) algorithm to separate speech and music mixing signals in the convolutive mixture case. First, the time-domain reverberant convolutive mixing signals are transformed into frequency-domain linear mixing signals via the Short Time Fourier Transform (STFT). To obtain an accurate demixing matrix, a new cost function is built and the update rule of the demixing matrix is derived using the Newton gradient descent method. To avoid the local optima problem, the identity matrix is used as the initial demixing matrix for the iterative updating process. The frequency-domain sources are reconstructed from the demixing matrix, and the time-domain sources are then obtained using the inverse STFT.
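To make the notion of a time-domain convolutive mixture concrete, the following is a minimal sketch in which each microphone signal is the sum of every source convolved with its own impulse response. The function names `synthetic_rir` and `convolutive_mix`, the white-noise stand-in sources, and the exponentially decaying noise model for the impulse responses are illustrative assumptions, not the simulation setup used in this paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def synthetic_rir(rt60, fs, rng):
    """Toy impulse response (assumption): noise whose envelope decays 60 dB in rt60 s."""
    n = int(rt60 * fs)
    t = np.arange(n) / fs
    envelope = 10.0 ** (-3.0 * t / rt60)  # amplitude is 1e-3 (-60 dB) at t = rt60
    h = rng.standard_normal(n) * envelope
    return h / np.linalg.norm(h)

def convolutive_mix(sources, rirs):
    """sources: (N, T), rirs: (M, N, L). Returns the M mixtures, shape (M, T)."""
    n_mics, n_src, _ = rirs.shape
    n_samples = sources.shape[1]
    mixtures = np.zeros((n_mics, n_samples))
    for i in range(n_mics):
        for j in range(n_src):
            # Microphone i receives source j filtered by the path j -> i.
            mixtures[i] += fftconvolve(sources[j], rirs[i, j])[:n_samples]
    return mixtures

rng = np.random.default_rng(0)
fs = 16000
sources = rng.standard_normal((2, fs))  # stand-ins for two 1 s source signals
rirs = np.array([[synthetic_rir(0.3, fs, rng) for _ in range(2)] for _ in range(2)])
mixtures = convolutive_mix(sources, rirs)  # determined case: 2 sources, 2 sensors
```

Truncating each convolution to the source length keeps all channels aligned; a real simulation would normally use measured or image-method impulse responses instead of shaped noise.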
The main novelty of this paper can be summarized as follows. A DR-BSS algorithm is designed in which the update rules of the demixing matrix are obtained through a strict mathematical derivation. The frequency-domain sources are obtained through real-time iterative updates of the demixing matrix. Experimental results show that the source separation performance of the proposed DR-BSS algorithm is better than that of state-of-the-art methods, especially under much stronger reverberation.
The remainder of this article is organized as follows. Section 2 describes the reverberant convolutive system model. Section 3 proposes the DR-BSS algorithm to separate speech and music mixing signals. Experimental results on the source separation performance for speech and music convolutive mixtures are presented in Section 4. Finally, conclusions are drawn in Section 5.
The reverberant mixing model can be represented as the convolution of each source
Using the STFT, the source signals and mixing signals in each time-frequency slot are defined as
where
where
where
The overall structure of the study is summarized as follows. First, the mixing signals in the time domain are transformed into the frequency domain using the STFT. Then, a new cost function is built to obtain an accurate demixing matrix, and the update rule of the demixing matrix is derived using the Newton gradient descent method. Through real-time iterative updates of the demixing matrix, the frequency-domain sources are obtained. Finally, the time-domain sources are obtained using the inverse STFT.
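The processing chain summarized above can be sketched as follows. This is only a skeleton under stated assumptions: the function name `separate`, the `update_fn` hook, and the STFT window length are hypothetical, and the actual DR-BSS update rule is not reproduced here. With no update applied, the identity initialization simply passes the mixtures through the STFT and inverse STFT.

```python
import numpy as np
from scipy.signal import stft, istft

def separate(mixtures, fs, update_fn=None, n_iter=0, nperseg=1024):
    """mixtures: (M, T) real array. Returns the estimated sources, shape (M, T)."""
    # X has shape (M, F, K): channels, frequency bins, time frames.
    _, _, X = stft(mixtures, fs=fs, nperseg=nperseg)
    n_ch, n_freq, _ = X.shape
    # One demixing matrix per frequency bin, initialized to the identity.
    W = np.tile(np.eye(n_ch, dtype=complex), (n_freq, 1, 1))
    for _ in range(n_iter):
        # Hypothetical hook: an update rule would refine W here.
        W = update_fn(W, X)
    # Frequency-domain sources: Y[:, f, k] = W[f] @ X[:, f, k].
    Y = np.einsum('fij,jfk->ifk', W, X)
    # Back to the time domain, trimmed to the input length.
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    return y[:, :mixtures.shape[1]]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16000))  # stand-in for a two-channel mixture
y = separate(x, fs=16000)            # identity demixing: y reproduces x
```

Keeping the update rule behind a callable separates the fixed transform steps from the part of the algorithm that is specific to a given cost function.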
In order to obtain the accurate demixing matrix
where
where the point of Taylor expansion changes from
where
so that
where
Substitute
Then, using the normalization:
Additionally, by using Newton gradient descent method and Taylor expansion
In the experimental section, the proposed algorithm is applied to speech and music signal separation problems. The convolutive mixing signals used in the experiments are generated in a virtual room with artificial room impulse responses (RIRs) [
The dataset comes from the public development dataset of the 2011 Signal Separation Evaluation Campaign (SISEC 2011) [
Signal | Data name | Source | Duration | Sampling frequency |
---|---|---|---|---|
Speech 1 | dev1-female3 | src-1 | 10 s | 16 kHz |
Speech 2 | dev1-female3 | src-2 | 10 s | 16 kHz |
Speech 3 | dev1-female3 | src-3 | 10 s | 16 kHz |

Signal | Data name | Source | Duration | Sampling frequency |
---|---|---|---|---|
Music 1 | dev1-wdrums | src-1 | 11 s | 16 kHz |
Music 2 | dev1-wdrums | src-2 | 11 s | 16 kHz |
Music 3 | dev1-wdrums | src-3 | 11 s | 16 kHz |
To evaluate the separation performance, the signal-to-interference ratio (SIR) is selected as the evaluation criterion, which is defined as [
The average SIR determines the amount of cross-talk and is an established evaluation technique. The higher the value is, the better the separation performance is. Therefore, the discussion of the separation performance is mainly based on the average SIR in the following experiments.
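As an illustration of how an SIR value measures cross-talk, the sketch below splits an estimated source into a target part and an interference part by least-squares projection onto the true sources, then takes the energy ratio in dB. This is a simplified, instantaneous-gain variant chosen for illustration; the function name `sir_db` is hypothetical, and the exact definition used in this paper may differ (for example, by allowing filtered versions of the sources).

```python
import numpy as np

def sir_db(estimate, sources, target):
    """SIR in dB of one estimate against the true sources (rows of `sources`)."""
    # Least-squares gains a such that estimate ~ a @ sources.
    gains, *_ = np.linalg.lstsq(sources.T, estimate, rcond=None)
    target_part = gains[target] * sources[target]
    interference = gains @ sources - target_part
    return 10.0 * np.log10(np.sum(target_part**2) / np.sum(interference**2))

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 16000))  # two stand-in sources of similar power
estimate = s[0] + 0.1 * s[1]         # target plus leakage at one tenth amplitude
sir = sir_db(estimate, s, target=0)  # close to 20 dB for this leakage level
```

Leakage at one tenth of the target amplitude corresponds to an energy ratio of about 100, which is why the value lands near 20 dB; less leakage gives a higher SIR.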
To show the superiority of the proposed DR-BSS algorithm, the FastICA algorithm [
Because the experimental environment is affected by multiple interacting factors, including the reverberation time, the distance between microphones, the distance between sound sources, the distance between microphones and sound sources, and the number of sound sources and microphones, the separation results do not show a clear regularity. In the following experiments, the distance between microphones is fixed at 0.5 m and the distance from the sound source to the microphone is fixed at 1 m.
First, we consider the effect of the reverberation time RT60 on source separation performance, with 2 sources and 2 sensors. RT60 varies from 100 to 900 ms. The RT60 of a room is defined as the time it takes for sound to decay by 60 dB, and it reflects the complexity of the convolution. The locations of sources and sensors in the room are shown in
To visualize the separation results, we compare the separated speech sources with the original speech source signals. Separation results are shown in
Then, we test the convolutive speech mixtures where the number of sources is 3 and the number of sensors is also 3. The reverberation time RT60 varies from 100 to 500 ms, and the locations of sources and sensors in the room are shown in
Next, we consider the source separation performance for music mixtures. The convolutive environments are the same as those used for speech. First, we test the two-source music mixtures; experimental results are shown in
Additionally, we compare the separated music sources with the original music source signals. Separation results are shown in
Secondly, we test the convolutive music mixtures where the number of sources is 3 and the number of sensors is also 3. The source separation performance SIR
According to the above experimental results, the proposed algorithm can separate convolutive speech and music mixtures in different reverberant environments. For the two-channel convolutive mixture in particular, the algorithm retains its advantage in much more reverberant environments. Its shortcoming is that separation performance decreases as the number of channels and the reverberation time increase: more sensors and sources make the convolutive mixture more complex, so the SIR value gradually decreases. In addition, because real-life environments are complex, the model cannot fully describe the actual problem, and this modeling inaccuracy limits the designed algorithm. To improve the accuracy of the model, it is necessary to establish an adaptive mathematical model tailored to the specific practical problem.
The computational cost of the algorithms is also considered in order to compare the proposed method with existing methods. All experiments are conducted on a computer with an Intel(R) Core(TM) i9-10900 CPU @ 2.80 GHz and 16.00 GB of memory running Windows 10, and the programs are coded in Matlab R2019a. A two-channel convolutive mixed speech signal is tested with a reverberation time RT60 of 300 ms. The mean run times over 400 trials of the proposed algorithm, FastICA, PARAFAC-SD, Pro-SPA, and Low-Rank NMF are 7.69, 7.75, 7.78, 0.35, and 10.10 s, respectively. In terms of run time, the proposed algorithm is therefore faster than Low-Rank NMF, comparable to FastICA and PARAFAC-SD, and slower than Pro-SPA. However, the proposed algorithm achieves better separation performance than the compared methods.
To test the effect of Gaussian white noise on the source separation performance of the algorithm, Gaussian white noise is added to the two-channel convolutive mixed speech signals. The reverberation time RT60 is set to 300 ms, and the signal-to-noise ratio (SNR) varies from 5 to 30 dB. Experiments are performed for 400 trials, and the average value is used to analyze the effect of noise on source separation performance. The effect of noise on the source separation performance of the different algorithms is shown in
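One common way to add Gaussian white noise at a prescribed SNR is to scale the noise so that its power equals the signal power divided by the linear SNR. The sketch below does this per channel; the function name `add_white_noise` and the per-channel power convention are assumptions made for illustration, and the noise generation used in the experiments may differ.

```python
import numpy as np

def add_white_noise(mixtures, snr_db, rng):
    """Add white Gaussian noise to each channel at the requested SNR (in dB)."""
    signal_power = np.mean(mixtures**2, axis=-1, keepdims=True)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    # Unit-variance noise scaled to the target power of each channel.
    noise = rng.standard_normal(mixtures.shape) * np.sqrt(noise_power)
    return mixtures + noise

rng = np.random.default_rng(0)
clean = rng.standard_normal((2, 16000))  # stand-in for a two-channel mixture
noisy = add_white_noise(clean, snr_db=10.0, rng=rng)
```

Because the noise is random, the SNR realized on a finite signal only approximates the requested value, which is one reason to average results over many trials.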
In this paper, we proposed a DR-BSS algorithm to separate speech and music mixing signals. By building a new cost function, a novel update rule for the demixing matrix was derived using the Newton gradient descent method. The frequency-domain source signals were then obtained using the updated demixing matrix. Tests on speech and music mixtures verify the effectiveness of the DR-BSS algorithm, and comparisons with state-of-the-art algorithms show that it achieves better separation performance and robustness. The DR-BSS algorithm is therefore well suited to source separation in determined reverberant environments. It can be applied not only to audio signal separation but also to communication signal processing and biological signal processing.
It is worth noting that the DR-BSS algorithm requires the mixing matrix to be invertible. When the number of sources is less than or equal to the number of sensors, the DR-BSS algorithm is effective. However, when the number of sources is greater than the number of sensors, the mixing matrix is not invertible, and the algorithm is not applicable to the underdetermined mixture case. Therefore, the underdetermined convolutive BSS problem needs to be studied further in future work.
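The invertibility condition can be illustrated for a single frequency bin with a small numerical sketch. It uses real-valued random matrices as stand-ins for a mixing matrix, and the helper name `has_left_inverse` is hypothetical. With as many sensors as sources, a linear demixing matrix recovers the sources exactly; with more sources than sensors, no linear demixing matrix does, and even the pseudo-inverse returns only a projection of the sources.

```python
import numpy as np

def has_left_inverse(mixing):
    """True if the (sensors x sources) mixing matrix has full column rank."""
    n_sensors, n_sources = mixing.shape
    return bool(n_sources <= n_sensors
                and np.linalg.matrix_rank(mixing) == n_sources)

rng = np.random.default_rng(0)
s = rng.standard_normal((3, 5))        # three sources, five samples
A_det = rng.standard_normal((2, 2))    # 2 sensors, 2 sources (determined)
A_under = rng.standard_normal((2, 3))  # 2 sensors, 3 sources (underdetermined)

# Determined case: the inverse of the mixing matrix recovers the sources.
recovered = np.linalg.inv(A_det) @ (A_det @ s[:2])

# Underdetermined case: the pseudo-inverse cannot undo the mixing.
not_recovered = np.linalg.pinv(A_under) @ (A_under @ s)
```

Here `recovered` matches the first two sources up to rounding, while `not_recovered` differs from `s` because a rank-2 map cannot preserve three independent signals.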
This research was partially supported by the
The authors declare that they have no conflicts of interest to report regarding the present study.