Microphone array-based sound source localization (SSL) is widely used in applications such as video conferencing, robotic hearing, speech enhancement and speech recognition. Traditional SSL methods cannot achieve satisfactory performance in adverse noisy and reverberant environments. To improve localization performance, a novel SSL algorithm using a convolutional residual network (CRN) is proposed in this paper. The spatial features, including the time differences of arrival (TDOAs) between microphone pairs and the steered response power-phase transform (SRP-PHAT) spatial spectrum, are extracted in each Gammatone sub-band. The spatial features of the different sub-bands within a frame are combined into a feature matrix that serves as the input of the CRN. The proposed algorithm employs the CRN to fuse the spatial features. Because the CRN adds a residual structure to the convolutional network, it reduces the difficulty of training and accelerates the convergence of the model. A CRN model is learned from training data in various reverberation and noise environments to establish the mapping between the input features and the sound azimuth. Simulations show that, compared with methods using a traditional deep neural network, the proposed algorithm achieves better localization performance in the SSL task and generalizes better to untrained noise and reverberation conditions.
With the development of artificial intelligence, sound source localization (SSL) based on speech processing systems has become a new research hotspot. The task of SSL is to obtain the position of a sound source relative to an array, when that position is unknown, by processing the sound signals collected by the sensors. Typical applications of sound source localization technology include video conferencing, robot hearing, speech enhancement, speech recognition, etc. [
After years of development, many theories and methods have been proposed for sound source localization based on microphone arrays. The traditional methods can be classified into three categories: time difference of arrival (TDOA) estimation methods, steered response power beamforming methods [
All of the above studies rely on traditional algorithms to achieve SSL. Recently, SSL approaches based on supervised learning have been proposed, and the majority of them utilize deep neural networks (DNNs). In [
Traditional sound source localization techniques fail to achieve satisfactory performance in adverse noisy and reverberant environments. Introducing the ResNet structure into the convolutional residual network (CRN) model reduces feature loss and decreases the training difficulty. Research shows that when the CRN model has a similar number of layers to the CNN model, the CRN model not only shows a rapid drop in the loss function during training and good convergence, but also achieves better performance. Therefore, we propose a method using a CRN to improve localization performance. The spatial features, including the TDOAs between microphone pairs and the SRP-PHAT spatial spectrum, are extracted in each Gammatone sub-band. The spatial features of the different sub-bands within a frame are combined into a feature matrix that serves as the input of the CRN. Simulations verify that, compared with methods using a traditional deep neural network, the proposed algorithm achieves better localization performance in the SSL task and generalizes better to untrained noise and reverberation conditions.
The remainder of the paper is organized as follows. Section 2 describes the CRN-based SSL algorithm in four subsections: the system overview, the extraction of feature parameters, the CRN architecture and the training of the CRN. Section 3 presents the experimental results and analyses, and conclusions are given in Section 4.
As illustrated in
The model of microphone array received signals can be expressed as:
The Gammatone filter is used to decompose the signals into sub-bands, and its expression can be written as:
The generalized cross-correlation (GCC) function within each Gammatone sub-band is calculated as:
The number of microphones is
The SRP-PHAT within the Gammatone sub-band is expressed as:
τ(
In a given frame, the TDOAs and SRP-PHAT values of all sub-bands form a spatial feature matrix, which can be calculated as follows:
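To make the TDOA part of the spatial feature concrete, the following is a minimal sketch of a PHAT-weighted GCC between two microphone signals and of the resulting pairwise TDOA vector for one frame. It is an illustration under stated assumptions, not the paper's implementation: it works on the full-band signal rather than on individual Gammatone sub-bands, and the function names `gcc_phat` and `pairwise_tdoas`, the small regularization constant and the speed of sound of 343 m/s used in the usage note are all hypothetical choices.

```python
from itertools import combinations

import numpy as np


def gcc_phat(x1, x2, fs, max_tau):
    """Return (tdoa_seconds, gcc) for one frame of two microphone signals.

    A positive TDOA means that x1 is a delayed copy of x2.
    """
    n = len(x1) + len(x2)                 # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cross, n=n)
    # Restrict the search to physically possible lags (at least one sample).
    max_shift = min(max(1, int(round(max_tau * fs))), n // 2)
    # Reorder so that the lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs, cc


def pairwise_tdoas(frame, fs, max_tau):
    """frame has shape (num_mics, num_samples); returns one TDOA per mic pair."""
    pairs = combinations(range(frame.shape[0]), 2)
    return np.array([gcc_phat(frame[i], frame[j], fs, max_tau)[0]
                     for i, j in pairs])
```

With a six-microphone array this yields 15 pairwise TDOAs per frame. A plausible value for `max_tau` is the array diameter divided by the speed of sound, for example `0.2 / 343.0` seconds, which is an assumption of this sketch rather than a value stated in the paper.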
The structure of the CRN model used in this paper is depicted in
In this paper, the Adam optimizer is used for training. During training, information is propagated forward and errors are propagated backward, and the model parameters are updated accordingly. The initial learning rate is 0.001, and the batch size is 200. Moreover, the value of ε in the BN layer is set to 0.001, and the decay coefficient is set to 0.999. Good parameter initialization makes network training easier. We use Xavier initialization, which automatically adapts the distribution to the number of input and output nodes of each network layer, thus keeping the parameters moderate in size. For cross-validation, the training data are randomly divided into a training set (70%) and a validation set (30%).
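The training settings above can be summarized in a short NumPy sketch. It shows one common form of Xavier (Glorot) uniform initialization, a single Adam update and a random 70/30 split. The function names, the `config` dictionary keys and the Adam constants `beta1`, `beta2` and `eps` are assumptions (the usual defaults); the paper only states the learning rate, batch size, BN parameters and split ratio.

```python
import numpy as np

# Values stated in the text; the key names are hypothetical.
config = {
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "batch_size": 200,
    "bn_epsilon": 1e-3,
    "bn_decay": 0.999,
    "val_fraction": 0.3,
}


def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform init: the range shrinks as the layer gets wider."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; beta1, beta2 and eps are assumed default values."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


def split_train_val(num_examples, val_fraction, rng):
    """Randomly split example indices into training and validation sets."""
    idx = rng.permutation(num_examples)
    n_val = int(round(num_examples * val_fraction))
    return idx[n_val:], idx[:n_val]
```

For example, `split_train_val(1000, config["val_fraction"], np.random.default_rng(0))` returns 700 training indices and 300 validation indices.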
The dimensions of the simulated room are 7 m × 7 m × 3 m. A uniform circular array of six omnidirectional microphones, with a diameter of 0.2 m, is placed at (3.5, 3.5, 1.6) m. The clean speech signals, sampled at 16 kHz and used as sound source signals, are taken at random from the TIMIT database. The image method is used to generate the room impulse response between any two positions. The microphone signal is generated by convolving the clean sound source signal with the room impulse response and adding uncorrelated Gaussian white noise. The microphone signals are then divided into 32-ms frames and windowed with a Hamming window. Windowing the framed signals reduces the truncation effect between signal frames and reduces spectral leakage.
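The signal generation and framing steps can be sketched as follows. This is an illustrative outline only: the room impulse response is taken as a given input (the image method itself is not implemented here), the function names are hypothetical, and the 16-ms hop (50% overlap) is an assumption because the frame shift is not stated in the text.

```python
import numpy as np


def simulate_microphone(clean, rir, snr_db, rng):
    """Convolve clean speech with a room impulse response and add white noise."""
    reverberant = np.convolve(clean, rir)[:len(clean)]
    signal_power = np.mean(reverberant ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=reverberant.shape)
    return reverberant + noise


def frame_and_window(signal, fs=16000, frame_ms=32, hop_ms=16):
    """Split a signal into Hamming-windowed frames (hop_ms is an assumption)."""
    frame_len = int(fs * frame_ms / 1000)      # 512 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Index matrix of shape (num_frames, frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)
```

A one-second signal at 16 kHz produces frames of 512 samples each, which matches the 32-ms frame length stated above.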
The source is placed in the far field, and the distance between the array and the training positions is set to 1.5 m, with a training azimuth range of 0° to 360° in 10° steps. During the training phase, the SNR takes five values: 0, 5, 10, 15 and 20 dB, and T60 takes two values: 0.5 and 0.8 s. For robustness, the training data are formed by combining microphone signals from these various reverberation and noise environments.
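The training grid described above can be enumerated with a few lines of Python. This sketch makes several assumptions that the text does not state: 360° is treated as identical to 0° (giving 36 azimuth classes), the azimuth is measured counterclockwise from the x-axis, and the source is at the same height as the array. All names are hypothetical.

```python
import itertools
import math

ARRAY_CENTER = (3.5, 3.5, 1.6)            # metres, from the simulation setup
SOURCE_DISTANCE = 1.5                     # metres
AZIMUTHS_DEG = list(range(0, 360, 10))    # assumes 360 degrees wraps to 0
SNRS_DB = [0, 5, 10, 15, 20]
T60S_S = [0.5, 0.8]


def source_position(azimuth_deg, distance=SOURCE_DISTANCE, center=ARRAY_CENTER):
    """Source coordinates for a given azimuth, at the height of the array."""
    a = math.radians(azimuth_deg)
    return (center[0] + distance * math.cos(a),
            center[1] + distance * math.sin(a),
            center[2])


def azimuth_to_class(azimuth_deg, step=10):
    """Map an azimuth in degrees to a class index for the network output."""
    return (azimuth_deg % 360) // step


# Every (azimuth, SNR, T60) combination used for training.
conditions = list(itertools.product(AZIMUTHS_DEG, SNRS_DB, T60S_S))
```

Under these assumptions there are 36 × 5 × 2 = 360 training conditions, and every source position lies inside the 7 m × 7 m room.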
The proposed algorithm’s performance is compared to three baseline methods, SRP-PHAT [
In this section, we compare and analyse the performance of the different algorithms when the training and testing settings are the same.
As shown in
As shown in
In this section, we compare and analyse the performance of the different algorithms when the training and testing settings differ. The untrained SNR takes five values: 0, 5, 10, 15 and 20 dB, and the untrained reverberation time T60 takes two values: 0.6 s and 0.9 s.
As shown in
As illustrated in
In this paper, a novel CRN-based SSL algorithm is proposed. In the proposed algorithm, the TDOAs between microphone pairs and the SRP-PHAT spatial spectrum in each Gammatone sub-band are extracted as spatial features. Because the CRN adds residual structures to the convolutional network, it reduces the difficulty of training and accelerates the convergence of the model. Experimental results show that the proposed algorithm achieves improved localization performance and is more robust against noise and reverberation.
The authors would like to express their deepest gratitude to the anonymous reviewers for their constructive comments, which helped improve the quality of the paper.