Automatic Speaker Recognition Using Mel-Frequency Cepstral Coefficients Through Machine Learning

Abstract: Automatic speaker recognition (ASR) systems belong to the field of human-machine interaction, and researchers have been using feature extraction and feature matching methods to analyze and synthesize voice signals. One of the most commonly used methods for feature extraction is Mel-Frequency Cepstral Coefficients (MFCCs). Recent research shows that MFCCs process the voice signal with high accuracy. MFCCs represent a sequence of voice signal-specific features. This experimental analysis is proposed to distinguish Turkish speakers by extracting the MFCCs from speech recordings. Since the human perception of sound is not linear, after the filterbank step in the MFCC method we converted the obtained log filterbanks into decibel (dB) feature-based spectrograms without applying the Discrete Cosine Transform (DCT). A new dataset was created by converting each spectrogram into a 2-D array. Several learning algorithms were implemented with a 10-fold cross-validation method to detect the speaker. The highest accuracy of 90.2% was achieved using a Multi-layer Perceptron (MLP) with the tanh activation function. The most important output of this study is the inclusion of the human voice as a new feature set.


Introduction
The voice signal carries a wealth of information, and voice instances can be used to extract information about spoken words, expression, style of speech, accent, emotion, speaker identity, gender, age, health state of the speaker, etc. Advances in biometrics and computer science have made it possible to identify some of the characteristics of individuals. ASR systems are widely used in security and forensic science, for instance to create voice signatures and to identify suspects. The main motivation behind ASR is to convert the acoustic voice signal into a computer-readable format and to identify speakers by their vocal characteristics [1].
Analysing and synthesizing the voice signal is a complex process. To simplify it, the task is split into two stages: feature extraction and feature matching. Traditional ASR systems were built on Gaussian mixture models (GMMs) and hidden Markov models (HMMs) to perform the feature matching process. Herein, HMMs are used to deal with the temporal variability of speech, and GMMs are used to determine how well each state of an HMM fits a frame or brief window of coefficients representing the acoustic input [2]. As examples of feature extraction methods, Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs) were used to extract feature vectors from acoustic signal data, especially with HMMs. Davis and Mermelstein introduced the MFCC features in the 1980s [3]. These features have been widely used and regarded as the state of the art ever since.
MFCCs are coefficients that represent audio based on human perception [4]. They are derived from the Fourier transform of the audio clip; the difference is that in the MFCC method the frequency bands are positioned logarithmically. Because human perception of the frequency content of speech does not follow a linear scale, this logarithmic positioning brings MFCCs closer to human perception [5].
Korkmaz et al. [11] proposed a novel MFCC extraction system that is faster and more energy-efficient than the conventional MFCC realization. They used a low-pass filter instead of the high-pass pre-emphasis filter. Since pre-emphasis is also required to enhance the energy of the signal at high frequencies, they implemented a bandpass filter that takes over the role of the high-pass filter. They stated that the most time-consuming part of the conventional method is the FFT, which accounts for 72.67% of the cost, and they discarded this phase.
Lalitha et al. [12] changed the conventional MFCC structure and offered a new model for voice activity detection. In contrast to the triangular filterbanks employed in the MFCC process, they proposed a new, smoother method that involves the DCT.
Sangeetha et al. [13] investigated an alternative to the conventional DCT method. They stated that the traditional DCT is not as efficient as their proposed method at de-correlating filterbank features. They offered a new distributed DCT method for MFCC extraction, which reduces both the correlation and the feature count.
Upadhya et al. [14] tried a new method to recognize hand-written digits using MFCC features and an HMM. They used the MNIST and Fashion MNIST datasets and converted the 2D image arrays into 1D sound arrays. Then, they extracted MFCCs from each 1D array. They fed 39 MFCC feature vectors into the HMM and obtained an accuracy of 86.4%.
Since the MFCC feature extraction process already has a phase in which image patterns called spectrograms are produced, we applied spatial pattern recognition techniques to these mel spectrograms in this study. After applying the pre-processing and MFCC processing steps to the speech signals, we obtained mel-scale power spectra, converted them into spectral energy decibel (dB) features, and saved each spectrum as a power spectrogram image. Each spectrogram has a characteristic pattern, and each pixel of a spectrogram is a feature for the classification model. In the signal processing phase, we produced these spectrograms by applying the MFCC steps and created our dataset instances. Each instance includes a 1D array of the pixel values of the spectrogram and a label indicating the speaker. In the classification phase, we trained machine learning models on the training dataset and chose the model giving the best accuracy. Detailed information about the methodology is given in Section 2.

Proposed Methodology
In this study, we investigated the use of mel-scale spectrograms as input to a deep neural network to recognize Turkish speakers. A new voice dataset was created and used to test the real-time performance of the ASR system. The participants were informed about the details of the experiment before the data collection process to minimize artifacts and noise in the voice signal. We also applied spectral subtraction [15] to obtain a clean voice signal. The ASR system proposed in this article is intended for people who use voice-controlled systems in daily life. In such systems security comes first, so the identity of the person giving the command is important. That is why we focused on improving speaker recognition performance rather than speech recognition.
The first step in designing an ASR system is to determine the appropriate dataset. Although many English voice datasets are available on the Internet, Turkish voice datasets are limited. Moreover, each instance in a dataset has to be labeled carefully with the corresponding individual; whenever we needed a precise command from a particular person, we would have had to search for it, which is difficult and time-consuming in a real-time system. We therefore collected our own voice dataset from undergraduate and graduate students, which gives us full control over the dataset for the system we will develop. More details on the data collection process are given in Section 2.1. Finally, the real-time performance of the ASR system in voice-controlled systems such as voice command phone unlocking is investigated. The system will unlock a phone only if the command is given by the owner.
Signal processing is one of the most sensitive parts of ASR systems. Although we recorded the voice data in a quiet laboratory environment, noise may occur due to both external factors and the sound recording device. In the first step of signal processing, the noise removal and speech enhancement technique called spectral subtraction is applied to each voice signal in MATLAB.
Recordings are trimmed to a length of 5 s so that the extracted features have the same size. Lyons' Python Speech Features library [16] is used to extract the speech features. This library supports the following voice features: MFCCs, Filterbank Energies, Log Filterbank Energies, and Spectral Subband Centroids. Log Filterbank Energies were used to obtain the power spectrogram and pixel features. To detect the speaker, we applied several machine learning algorithms in Orange 3, a Python-based visual data mining tool. These processes are illustrated in Fig. 1. Detailed information about the dataset and the speaker recognition processes is given in Sections 2.1 and 2.2.
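As a minimal sketch of the fixed-length step, the function below cuts a list of samples to exactly 5 s at a given sample rate. The function name is hypothetical, and zero-padding of shorter clips is an assumption; the text only states that recordings are trimmed.

```python
def trim_to_duration(samples, sample_rate, seconds=5.0):
    """Return exactly `seconds` worth of samples from the start of a clip."""
    target = int(round(sample_rate * seconds))
    clipped = list(samples[:target])
    # Assumption: clips shorter than the target are zero-padded so that
    # every instance yields a feature array of the same size.
    clipped.extend([0.0] * (target - len(clipped)))
    return clipped
```

At the 48,000 Hz sample rate reported for the recordings, every clip then holds 240,000 samples.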

Turkish Speakers' Voice Dataset
The voice dataset was collected from 15 people (7 men and 8 women) in a noiseless laboratory. In the data collection phase, all participants read 40 specific sentences, selected by the Free Software Foundation [16], that capture the characteristics of Turkish speech. The sentences were recorded using a smartphone. A sample rate of 48,000 Hz and a bit rate of 1411 kbps were used for each recording. Each recording lasted 5 s, and the speakers read a single sentence in each recording. These sentences are available in Google Docs [17]. The data acquisition process is represented in Fig. 2. Sections 2.2 and 2.3 describe our dataset in depth.

Implementation Steps of MFCC
MFCC is based on a concept called the cepstrum, a name formed by reversing the first four letters of "spectrum"; its independent variable is known as quefrency [18]. Oppenheim and Schafer [19] defined the cepstrum transform as the composition of the following operations: the Fourier transform, followed by the complex logarithm, followed by the inverse Fourier transform. Davis and Mermelstein built on this theory and applied a non-linear filterbank in the frequency domain. The implementation steps of their algorithm are given in Fig. 3.
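The three operations of the cepstrum can be sketched in a few lines. Note the substitution: this sketch computes the simpler real cepstrum (logarithm of the spectral magnitude) rather than the complex cepstrum with the complex logarithm, and it uses a naive DFT for clarity. The function names and the `floor` guard against log(0) are illustrative assumptions.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform."""
    n_total = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * n / n_total)
                for n, v in enumerate(values))
            for k in range(n_total)]


def inverse_dft(values):
    """Naive inverse discrete Fourier transform."""
    n_total = len(values)
    return [sum(v * cmath.exp(2j * math.pi * k * n / n_total)
                for k, v in enumerate(values)) / n_total
            for n in range(n_total)]


def real_cepstrum(signal, floor=1e-12):
    """Fourier transform -> log magnitude -> inverse Fourier transform."""
    log_magnitude = [math.log(max(abs(s), floor)) for s in dft(signal)]
    return [c.real for c in inverse_dft(log_magnitude)]
```

For a scaled unit impulse the spectrum is flat, so all cepstral energy falls in the zeroth coefficient.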

Figure 3: Obtaining the mel-filterbank features from the MFCCs process
A standard MFCC extraction includes a DCT phase. The filterbank features extracted during the MFCC process are highly correlated, and this high correlation may be problematic for conventional machine learning algorithms; the DCT decorrelates these features. On the other hand, with the development of deep neural networks, which are less sensitive to correlated data and able to handle it, this is no longer a major problem [20]. In our ASR design, we discarded the DCT phase and applied spatial pattern recognition to mel-scale spectrograms.
The mel scale relates the perceived frequency of a pure tone to its actual measured frequency. The actual frequency is converted to the mel-scale frequency by Eq. (1).
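A minimal sketch of the mel conversion and of triangular filters spaced equally on the mel scale follows. It assumes the commonly used form m = 2595 log10(1 + f / 700) for the mel mapping; the exact form of Eq. (1) is not reproduced in this text. The default of 26 filters comes from the paper, while the FFT size of 512 and the bin-placement rule are assumptions that follow a common recipe for this kind of filterbank.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (assumed form of Eq. (1))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters=26, n_fft=512, sample_rate=48000):
    """Triangular filters whose edges are equally spaced in mels."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + (high - low) * i / (n_filters + 1)
                  for i in range(n_filters + 2)]
    # Map each edge frequency to the nearest lower FFT bin.
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for j in range(n_filters):
        left, centre, right = bins[j], bins[j + 1], bins[j + 2]
        for k in range(left, centre):      # rising edge
            bank[j][k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge, peak at centre
            bank[j][k] = (right - k) / (right - centre)
    return bank
```

Because the edges are equally spaced in mels, the filters are narrow at low frequencies and progressively wider at high frequencies, which mirrors the non-linear resolution of human hearing.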
In the first step, pre-emphasis is applied to the speech signal to amplify the high frequencies, as given in Eq. (2). Pre-emphasis is crucial for (1) balancing the frequency spectrum, since high frequencies usually have smaller magnitudes than lower frequencies, (2) avoiding numerical problems during the Fourier transform operation, and (3) improving the signal-to-noise ratio (SNR) [20].
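Pre-emphasis is usually implemented as a first-order high-pass filter, y(n) = x(n) - alpha * x(n - 1). The sketch below assumes this form and a coefficient of 0.97, a common default; neither the exact form of Eq. (2) nor the coefficient value is stated in this text.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y(n) = x(n) - alpha * x(n - 1); the first sample is kept as is."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]
```

A constant (zero-frequency) input is almost cancelled, while a rapidly alternating input is amplified, which is the high-frequency boost described above.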
In the second step, framing and windowing were applied. After the speech signals were pre-emphasised and divided into frames, the well-known Hamming window [21,22] was applied. Then, the Discrete Fourier Transform (DFT) was calculated for each windowed frame as given in Eq. (3), and the periodogram estimate of the power spectrum was calculated for each speech frame as given in Eq. (4),
where s(n) denotes the time-domain signal and s_i(n) is the i-th framed signal. S_i(k) represents the DFT of frame i, and P_i(k) denotes the power spectrum of frame i. h(n) is an N-sample-long analysis window (e.g., a Hamming window), while K is the length of the DFT [23].
Finally, the number of triangular filters was set to 26, the default, and the log filterbank energy features were computed. This step is what distinguishes MFCC from the plain FFT: the filterbanks are spaced non-linearly, whereas the Fourier transform uses a linear frequency axis. Normally, in the MFCC method, the DCT is applied after the filterbanks. The DCT is a linear transformation, and it discards some important non-linear information in the speech signal [20]. Therefore, we chose not to use the DCT; the features in our dataset originate from filterbank energies, as shown in Fig. 4.
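The framing, windowing, DFT, and periodogram steps described above, followed by the log filterbank energies, can be sketched as follows. This assumes the usual forms S_i(k) = sum over n of s_i(n) h(n) exp(-j 2 pi k n / K) and P_i(k) = |S_i(k)|^2 / N; Eqs. (3) and (4) are not reproduced in this text, so the normalisation is an assumption. The function names are illustrative, a naive DFT is used for clarity, and `bank` is any list of filter weight rows supplied by the caller.

```python
import cmath
import math


def frame_signal(signal, frame_len, frame_step):
    """Split a signal into overlapping frames, dropping any incomplete tail."""
    last_start = len(signal) - frame_len
    return [signal[start:start + frame_len]
            for start in range(0, last_start + 1, frame_step)]


def hamming(n_samples):
    """Hamming analysis window h(n) of length N = n_samples."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def periodogram(frame, n_fft):
    """One-sided power spectrum |S_i(k)|^2 / N for k = 0 .. n_fft // 2."""
    n_samples = len(frame)
    windowed = [s * h for s, h in zip(frame, hamming(n_samples))]
    power = []
    for k in range(n_fft // 2 + 1):
        # Naive DFT for readability; an FFT would be used in practice.
        s_k = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(windowed))
        power.append(abs(s_k) ** 2 / n_samples)
    return power


def log_filterbank_energies(power, bank, floor=1e-10):
    """Natural log of the energy each filter collects from one frame."""
    return [math.log(max(sum(w * p for w, p in zip(weights, power)), floor))
            for weights in bank]
```

Stacking the log filterbank energies of consecutive frames as columns gives the two-dimensional time-frequency array from which the spectrogram image is produced.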

Creating Spectrogram Feature
After the extraction of the logarithmically positioned mel-scale filterbanks, "librosa" [24], a Python library for audio and music signal analysis, was used to convert the power spectra (amplitude squared) to decibels (dB). Herein, librosa's power_to_db method was applied, and the results were saved as mel-scale spectrograms with a size of 800 × 600 pixels representing the MFCC features. Each spectrogram contained the characteristic information of a five-second speech signal for one individual. These mel spectrograms were subjected to certain image processing operations before the classification stage. Each image instance in the dataset contained 480,000 features (800 × 600 pixels). To cope with the training time and complexity of the model, each image was reduced to 80 × 60 pixels, i.e., 4800 features, as seen in Figs. 5a and 5b.
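The decibel conversion and the size reduction can be sketched in plain Python. The first function mirrors the documented behaviour of librosa's power_to_db (10 log10 of the power relative to a reference, with a floor and an 80 dB dynamic-range clip); it is a re-implementation for illustration, not the library call. The resizing method is not stated in the text, so block averaging is an assumption, as are the function names.

```python
import math


def power_to_db(power, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert a 2-D list of power values to decibels."""
    offset = 10.0 * math.log10(max(amin, ref))
    db = [[10.0 * math.log10(max(amin, v)) - offset for v in row]
          for row in power]
    peak = max(max(row) for row in db)
    # Clip everything more than top_db below the loudest value.
    return [[max(v, peak - top_db) for v in row] for row in db]


def downscale(image, factor):
    """Shrink a 2-D list by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    area = float(factor * factor)
    return [[sum(image[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / area
             for c in range(cols)]
            for r in range(rows)]


def flatten(image):
    """Turn a 2-D image into the 1-D feature array used for classification."""
    return [value for row in image for value in row]
```

With a factor of 10, an 800 × 600 image (600 rows of 800 pixels) becomes 80 × 60, and flattening it yields the 4800 features per instance mentioned in the classification step.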

Classification
The Orange3 [25] machine learning tool was used to evaluate the accuracy of the models. In this study, several ML algorithms were trained on the dataset. Since the human voice is nonlinear in nature, linear models are not suitable for ASR systems. Nonlinear ML algorithms such as deep neural networks (DNNs) are the more dominant pattern recognition techniques [26]. In this study we prioritized three nonlinear algorithms in terms of ASR performance: SMO [27], Random Forest (RF) [28], and a 3-layer NN called the Multilayer Perceptron (MLP) [29]. SMO is an SVM-based classification algorithm that implements John Platt's sequential minimal optimization algorithm for training a support vector classifier. RF was introduced by Breiman and constructs random trees for classification. The DNN classifier used in our model consists of 3 hidden layers with 64 neurons in each layer. The extracted 4800 pixel features are the inputs and the 15 speakers are the outputs, as seen in Fig. 7. Finally, 10-fold cross-validation was used to evaluate each algorithm. This study was carried out in Python on a laptop with an NVIDIA GeForce GTX 860M GPU. The features were extracted from the collected Turkish speakers' voice instances using the MFCC method and Lyons' Python Speech Features library, which resulted in a new dataset.
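The shape of the network described above (4800 inputs, three hidden layers of 64 neurons, 15 outputs) can be illustrated with a forward pass in plain Python. This is only a structural sketch with random, untrained weights: the tanh hidden activation comes from the reported best model, while the softmax output, the weight initialisation, and all names are assumptions, and the actual models were trained in Orange3.

```python
import math
import random

# 4800 pixel features -> 3 hidden layers of 64 neurons -> 15 speakers
LAYER_SIZES = [4800, 64, 64, 64, 15]


def count_parameters(sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_layers(sizes, seed=0):
    """Random weights and zero biases (illustrative initialisation)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, features):
    """tanh on hidden layers, softmax over the speakers at the output."""
    activation = features
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activation)) + b
               for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activation = [math.tanh(z) for z in pre]
        else:
            peak = max(pre)  # subtract the maximum for numerical stability
            exps = [math.exp(z - peak) for z in pre]
            total = sum(exps)
            activation = [e / total for e in exps]
    return activation
```

Almost all of the parameters sit in the first layer (4800 × 64 weights), which is one reason the spectrograms were reduced from 800 × 600 to 80 × 60 pixels before training.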
In this study, we tried a novel approach and used more features than MFCCs. As the complexity of a dataset increases, DNNs such as the one shown in Fig. 8 become a good choice for training on it. Accordingly, one of the most satisfying and promising results of this study was that the DNN model achieved the highest evaluation score. Before choosing the optimum model, several classifiers were tried, and the evaluation results in Tab. 1 were obtained.
The best model for our dataset used tanh activation functions. The confusion matrix shows that misclassification is more frequent for women's voices. This may indicate that the women's voices in the dataset are more similar to one another in terms of dB and mel-scale energy.
This was a preliminary study for a Turkish speaker recognition system. We introduced a new approach to speaker recognition using MFCCs: mel spectrogram pixels are used as the feature set instead of traditional MFCCs. Although the feature size is larger and the correlation is higher than with MFCCs, our proposed model operates on a DNN, which can handle complex and correlated datasets. In the near future, we plan to develop a more robust model for use with real-time speech. Since we are working with spectrograms, which carry the voice information, a CNN model may be applicable in future work. The Turkish speakers dataset produced in this study is a novel dataset. During the pandemic, we were unable to collect new data and conduct experiments on them; however, we aim to improve our dataset in the near future. The most important output of this study is the investigation of the picture of the human voice as a new feature set. Therefore, we believe that mel spectrograms may be used as voice fingerprints in the near future.