Computers, Materials & Continua
Hybrid In-Vehicle Background Noise Reduction for Robust Speech Recognition: The Possibilities of Next Generation 5G Data Networks
1VSB–Technical University of Ostrava, Faculty of Electrical Engineering and Computer Science, Department of Cybernetics and Biomedical Engineering, 708 00, Ostrava-Poruba, Czechia
2VSB–Technical University of Ostrava, Faculty of Electrical Engineering and Computer Science, Department of Telecommunications, 708 00, Ostrava-Poruba, Czechia
*Corresponding Author: Lukas Danys. Email: firstname.lastname@example.org
Received: 30 April 2021; Accepted: 01 July 2021
Abstract: This pilot study focuses on the use of a hybrid LMS-ICA system for in-vehicle background noise reduction. Modern vehicles increasingly support voice commands, which are one of the pillars of autonomous and SMART vehicles. Robust speaker recognition for context-aware in-vehicle applications is limited to a certain extent by in-vehicle background noise. This article presents a new concept of a hybrid system, which is implemented as a virtual instrument. The highly modular concept of the virtual car, used in combination with real recordings of various driving scenarios, enables effective testing of the investigated methods of in-vehicle background noise reduction. The study also presents a unique concept of an adaptive system using intelligent clusters of distributed next generation 5G data networks, which allows the exchange of interference information and/or optimal hybrid algorithm settings between individual vehicles. On average, the unfiltered voice commands were successfully recognized in 29.34% of all scenarios, while the LMS reached up to 71.81%, and the LMS-ICA hybrid improved the performance further to 73.03%.
Keywords: 5G noise reduction; hybrid algorithms; speech recognition; 5G data networks; in-vehicle background noise
1 Introduction
Speech recognition systems are one of the fundamental parts of future smart vehicles. Voice-activated technology is slowly being introduced in almost every model across car brands. It is often connected to the infotainment system of the vehicle and can be used to control various features, ranging from sat-nav to radio, media or phone. While the technology is still maturing, the reliability of different systems can vary greatly. The simpler systems rely only on a predefined set of commands, while the more advanced ones are capable of learning the driver's voice over time and of understanding phrases and words easily. The technology is used mainly to improve the safety and convenience of the driver, who can focus on the road instead of interacting with various physical buttons and knobs.
The car, however, is a very specific and ever-changing environment. While higher-end models tend to be very well sound-insulated, lower-tier cars are built to a target manufacturing price, which means cutting costs wherever possible. Outside sounds can therefore penetrate into the driver's cabin and influence speech recognition systems. These cheaper vehicles also tend to have much slower infotainment hardware, simpler on-board infotainment systems or limited microphone arrays. In addition, sounds caused by varying road quality (mainly by the interaction between tires/wheels and potholes) also reduce the precision of speech recognition systems. While the cabin might at first seem like an ideal place for a voice recognition system, it is one of the toughest places for its implementation. Extracting speech from a noisy environment is possible but difficult, especially in lower-tier vehicles, which are the most susceptible to higher ambient noise levels.
Voice recognition and fluent understanding of human speech and voice commands are computationally demanding. Vehicular electronics are designed to withstand harsh environmental conditions, and automotive-grade processors are often outdated and built for specific tasks, offering only limited performance. That is why systems with limited vocabularies were introduced: the system only has to partially recognize the command, picking from a set of predefined words. These systems are often designed for single words, so the driver must go through multiple steps to achieve the desired outcome.
Everything is slowly changing with the introduction of modern digital voice assistants, which are well known from mobile devices. Google Assistant, Apple's Siri and Amazon Alexa rely on powerful cloud solutions for the analysis and recognition of complex voice commands. While these systems are certainly useful, they depend on an internet connection and are often used via Android Auto or Apple CarPlay. They are also influenced by ambient noise, which has to be filtered out for proper command recognition. Mobile devices and on-board wireless modems are nowadays connected via LTE and will gradually move into the 5G era.
The quality and performance of individual car brands are slowly converging, making it difficult for individual manufacturers to offer something new and interesting. The user experience and quality of the on-board system is one of the few remaining ways to differentiate between brands. With the ongoing development of smart and even autonomous vehicles, on-board voice assistants will certainly become an inseparable part of modern cars.
As mentioned above, the conditions in the driver's cabin vary greatly. Ideally, the voice recognition system would receive the best possible input signal. While such conditions are difficult to achieve, powerful adaptive systems can be used to filter out unwanted noise, effectively extracting the most important information for evaluation.
There are multiple scenarios that can be improved by the deployment of adaptive systems. We can introduce a concept of vehicle 4.0, which would employ an advanced array of onboard microphones in combination with either powerful infotainment hardware or a reliable 5G link. Suppose there is a set of potholes on a road and multiple vehicles pass over them. Drivers on board these vehicles are either calling or using voice commands, so they need to filter out any unnecessary noise from their voice signal. When the first car passes over the potholes, its on-board adaptive system would react straight away, filtering at least part of the noise. However, it is likely that the installed system is not capable of real-time denoising. As adaptive algorithms often need some time for training, the first vehicle in the row would send a small dataset with filter parameters either to the cloud or directly to the other passing vehicles via a low-latency link, such as a 5G network. The next vehicle can start to process the problem straight away, preparing for the encounter with the potholes. When another vehicle passes over the potholes, it could already be prepared for real-time filtration, or at least provide better filter parameters for the next passing vehicle. This system would rely heavily on the high-speed, low-latency link offered by 5G, as the speed and distance of individual vehicles can vary greatly. The precision of the processing algorithm can be refined even further by introducing other parameters, such as real-time telemetry data, tire size, speed, etc. In addition, certain older vehicles could leverage the power of newer models by receiving their optimized on-road data and input information from more complex microphone arrays. As the newer 5G standards are capable of falling back to earlier releases or even 4G, the newer vehicles can act as an important source of information for older, partially outdated models. The whole system can be seen in Fig. 1.
As presented by Bisio et al., audio processing technologies are a key feature of modern vehicles. They can be employed by a vast array of commercial applications. Speech is nowadays not limited to simple commands but can also be used for security services (such as speaker verification and authentication) or accessibility solutions (speech-to-text, text-to-speech, hands-free). Moreover, modern vehicles often rely on touchscreen controls. While this is certainly useful, some basic functions should still keep robust physical controls, or the system should at least offer an alternative way of controlling these important subsystems. One example is the recently released Skoda Octavia gen 4. Some systems, such as air conditioning, are operated via touchscreen controls. Some users have reported that the infotainment system can sometimes freeze and must be restarted. While the manufacturer will offer bugfixes to overcome this issue, it can cause discomfort and problems for the driver. The onboard voice assistant Laura offers an alternative way of controlling the previously mentioned system. However, it relies on cloud processing of advanced voice commands, so the vehicle must be connected to a mobile network. The offline functionality offers only basic commands, which are present in pretrained vocabularies. According to Bisio et al., the next generation of human-vehicle interfaces will incorporate biometric person recognition for customized on-board entertainment or driver monitoring and profiling applications. Speaker identity, the mood of users and the number of users are important pieces of information, which can only be extracted when the voice is properly separated from the noise.
2 Speech Signal Processing
Automatic speech recognition systems are very sensitive to different types of noise. For example, ambient noise makes speech recognition very difficult. This is why recorded signals are processed by an advanced processing method before speech recognition is performed. Advanced signal processing methods are of great importance for the elimination of unwanted signal components. There are two fundamental types of methods: adaptive and non-adaptive.
Adaptive methods are characterized by the ability to adapt to a given system. These methods are based on a learning system, which can adapt its own properties to a changing working environment. This means that adaptive methods can automatically set their coefficients according to the changing values of the system. During speech recognition, these methods use the primary signal, which contains the speech signal with noise, and the reference signal, which contains only noise. While linear filtering can be used for narrowband interference, it is unsuitable for broadband interference. Adaptive methods can be divided into nonlinear and linear adaptive techniques. Nonlinear adaptive techniques include, for example, artificial neural networks (ANN), methods using hybrid neural networks (HNN), adaptive neuro-fuzzy inference systems (ANFIS) or genetic algorithms (GA). Linear adaptive methods include algorithms based on the principles of Kalman filtering (KF), the least mean squares filter (LMS), the recursive least squares filter (RLS) or methods based on the principle of the adaptive linear neuron (ADALINE) [9–12].
Non-adaptive methods do not use any learning system and work with selected parameters and coefficients. These methods can be divided into single channel and multichannel methods. Single channel non-adaptive methods include Fourier transform (FT), wavelet transform (WT) and empirical mode decomposition (EMD). Multichannel non-adaptive methods include mainly blind source separation methods (BSS), which include independent component analysis (ICA), principal component analysis (PCA) and singular value decomposition (SVD) [13–16].
In this article, the LMS algorithm and the ICA method were used to create an automatic speech recognition system. These methods were chosen as a compromise between accuracy, computational cost and calculation speed. The subsections below deal with the mathematical background and the limitations of the methods used.
2.1 Least Mean Squares Filter
The LMS algorithm uses gradient-based optimization to determine the filter coefficients. It is based on Wiener filtering theory, stochastic averaging, and the least squares method. This method (like other adaptive algorithms) essentially attempts to minimize the output error
During elimination of the noisy part of the speech signal, the primary signal and the reference signal are the inputs of the LMS algorithm. The algorithm adjusts the reference signal with respect to the primary signal and prepares it for subtraction. The adjusted reference signal is then subtracted from the primary signal. After this procedure, a clean speech signal and a separated error signal are obtained.
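The procedure described above can be illustrated with a minimal sketch of the classical LMS noise-cancelling arrangement, in which the difference between the primary signal and the adjusted reference serves as the speech estimate. This is not the authors' LabVIEW implementation: the function name `lms_cancel`, the step size `mu`, the filter order, the synthetic signals and the three-tap acoustic path are all illustrative assumptions.

```python
import math
import random


def lms_cancel(primary, reference, order=8, mu=0.005):
    """Adaptive noise cancellation with the LMS update rule.

    primary   -- samples of speech plus noise
    reference -- samples of noise only (correlated with the noise in primary)
    Returns (error, weights); the error signal is the speech estimate.
    """
    w = [0.0] * order      # filter coefficients, initialized to zero
    buf = [0.0] * order    # most recent reference samples, newest first
    error = []
    for d, x in zip(primary, reference):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # adjusted reference
        e = d - y                                    # subtraction from primary
        w = [wi + 2.0 * mu * e * bi for wi, bi in zip(w, buf)]
        error.append(e)
    return error, w


# Synthetic demonstration (all values are assumptions, not measured data).
rng = random.Random(0)
fs = 8000
n = 4000
speech = [0.5 * math.sin(2.0 * math.pi * 440.0 * k / fs) for k in range(n)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
path = [0.8, -0.3, 0.1]   # assumed path from the noise source to the primary microphone
cabin = [sum(path[j] * noise[k - j] for j in range(len(path)) if k - j >= 0)
         for k in range(n)]
primary = [s + c for s, c in zip(speech, cabin)]
cleaned, weights = lms_cancel(primary, noise, order=8, mu=0.005)
```

A smaller step size gives a cleaner steady-state output at the cost of slower convergence, which is the trade-off behind the training time of adaptive filters.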
2.2 Independent Component Analysis
This method belongs to the group of BSS methods and is based on higher-order statistics. Its aim is to find a linear representation of non-Gaussian data. These data need to contain statistically independent components. For separation of a speech signal, the ICA method requires at least two microphones. Each microphone
There is a significant number of ICA-based algorithms, among them FastICA, JADE, SOBI, Infomax, FlexiICA, kICA, RADICAL ICA, AMUSE, etc. [22,24–26]. All these algorithms require ICA preprocessing in the form of centering (creation of a zero-mean vector) and whitening (creation of a vector with unit variance). The most widely used and very promising algorithm is FastICA, which is also used in this study. First, maximum number of iterations k and criterion of convergence
1) Random normalized vector
3) Normalization of recalculated vector
4) Checking whether the scalar product between
5) If condition in previous step is false, then repeat steps 2)–4).
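The iteration outlined in the steps above can be sketched for the two-channel case as follows. Because the update in step 2 is not preserved in the text, the sketch assumes the standard one-unit fixed-point rule with a tanh nonlinearity; the function names, the synthetic sources and the mixing coefficients are likewise hypothetical and are not taken from the study.

```python
import math
import random


def whiten(x1, x2):
    """Center two channels, then decorrelate and scale them to unit variance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    x1 = [v - m1 for v in x1]
    x2 = [v - m2 for v in x2]
    a = sum(v * v for v in x1) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    d = sum(v * v for v in x2) / n
    theta = 0.5 * math.atan2(2.0 * b, a - d)   # principal-axis angle of the covariance
    c, s = math.cos(theta), math.sin(theta)
    y1 = [c * u + s * v for u, v in zip(x1, x2)]
    y2 = [-s * u + c * v for u, v in zip(x1, x2)]
    s1 = math.sqrt(sum(v * v for v in y1) / n)
    s2 = math.sqrt(sum(v * v for v in y2) / n)
    return [v / s1 for v in y1], [v / s2 for v in y2]


def fastica_one_unit(z1, z2, max_iter=200, eps=1e-6, seed=0):
    """Estimate one unmixing vector from whitened two-channel data."""
    angle = random.Random(seed).uniform(0.0, 2.0 * math.pi)
    w = [math.cos(angle), math.sin(angle)]          # step 1: random unit vector
    n = len(z1)
    for _ in range(max_iter):
        g1 = g2 = gp = 0.0
        for u, v in zip(z1, z2):
            t = math.tanh(w[0] * u + w[1] * v)
            g1 += u * t
            g2 += v * t
            gp += 1.0 - t * t
        # step 2 (assumed): w <- E[z g(w.z)] - E[g'(w.z)] w
        new = [g1 / n - gp / n * w[0], g2 / n - gp / n * w[1]]
        norm = math.hypot(new[0], new[1])
        new = [new[0] / norm, new[1] / norm]        # step 3: normalization
        converged = abs(new[0] * w[0] + new[1] * w[1]) > 1.0 - eps   # step 4
        w = new
        if converged:                               # step 5: otherwise repeat
            break
    return w


# Synthetic two-microphone mixture (assumed sources and mixing weights).
rng = random.Random(1)
n = 3000
src_a = [math.sin(2.0 * math.pi * 5.0 * k / n) for k in range(n)]
src_b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
mic1 = [0.7 * a + 0.3 * b for a, b in zip(src_a, src_b)]
mic2 = [0.4 * a + 0.6 * b for a, b in zip(src_a, src_b)]
z1, z2 = whiten(mic1, mic2)
w = fastica_one_unit(z1, z2)
component = [w[0] * u + w[1] * v for u, v in zip(z1, z2)]
```

The recovered component matches one of the sources only up to sign and scale, which is an inherent ambiguity of ICA.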
2.3 Hybrid Speech Recognition System
In this article, a hybrid system based on the LMS algorithm and the ICA method was used for automatic speech recognition. First, the primary signal, which contains the speech signal with noise, and the reference signal, which contains only noise, are preprocessed by a bandpass finite impulse response (FIR) filter with a lower cutoff frequency of 300 Hz and an upper cutoff frequency of 3400 Hz. Then, primary signal
The consecutive estimation of ideal LMS parameters can be seen in Fig. 3. The trajectory of the LMS parameter estimation is highly dependent on the performance of the onboard system-on-chip (SoC). Low-end vehicles can rely either on cloud computing or on other vehicles located in the vicinity, which offer untapped higher performance. When the vehicles are calculating the ideal parameters, they could rely on each other to refine and pinpoint the ideal algorithm parameters.
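The band-pass preprocessing stage of the hybrid system (300 Hz to 3400 Hz) can be illustrated with a windowed-sinc FIR design. The text does not state the design method, the filter length or the sampling rate used for this stage, so the Hamming window, the 401 taps and the 16 kHz sampling rate below are assumptions chosen only for illustration.

```python
import math


def bandpass_fir(low_hz, high_hz, fs, taps=401):
    """Windowed-sinc band-pass FIR coefficients (Hamming window, odd length)."""
    mid = (taps - 1) / 2.0

    def lowpass(cut_hz, k):
        # Ideal low-pass impulse response sampled at tap k.
        fc = cut_hz / fs          # cutoff in cycles per sample
        t = k - mid
        if t == 0:
            return 2.0 * fc
        return math.sin(2.0 * math.pi * fc * t) / (math.pi * t)

    coeffs = []
    for k in range(taps):
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * k / (taps - 1))
        # Band-pass = low-pass at the upper edge minus low-pass at the lower edge.
        coeffs.append((lowpass(high_hz, k) - lowpass(low_hz, k)) * window)
    return coeffs


def gain(coeffs, freq_hz, fs):
    """Magnitude of the filter's frequency response at freq_hz."""
    re = sum(c * math.cos(2.0 * math.pi * freq_hz * k / fs)
             for k, c in enumerate(coeffs))
    im = sum(c * math.sin(2.0 * math.pi * freq_hz * k / fs)
             for k, c in enumerate(coeffs))
    return math.hypot(re, im)


fs = 16000.0                      # assumed sampling rate
h = bandpass_fir(300.0, 3400.0, fs)
passband = gain(h, 1000.0, fs)    # inside the 300-3400 Hz band
rumble = gain(h, 50.0, fs)        # low-frequency road and engine noise
hiss = gain(h, 6000.0, fs)        # above the speech band
```

Applying the same filter to both the primary and the reference channel keeps their relative delay identical, which matters for the subsequent adaptive stage.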
3 Conducted Experiments
Speech signal filtering methods were verified by a set of experiments conducted in two separate vehicles. The first vehicle was chosen to represent the worst-case scenario. A Skoda Felicia (1994–2001) was selected as a suitable candidate. Its combustion engine delivers only 50 kW and the car can reach up to 152 kmph. This older vehicle has limited sound insulation, and its interior is strongly affected by background and environmental noise. The second vehicle was much more recent. A first-generation Nissan Leaf, a battery electric vehicle (BEV) with an 80 kW motor and a top speed of 144 kmph, was selected to represent newer models. Since this vehicle is powered by electricity, the background noise caused by the engine is minimal. This vehicle can therefore be used for scenarios in which only the environmental noise is important, representing future all-electric vehicles.
Four measuring microphones were used in each scenario. The primary microphone (index #0), situated near the rearview mirror, was used for both speech and interference recording. The remaining reference microphones (indexes #1, #2, #3) were mounted in the individual window compartments and recorded acoustic interference caused by the vehicle itself. The precise diagram with microphone locations can be seen in Fig. 4.
Samples were gathered at different driving speeds (20 kmph, 50 kmph, 100 kmph and 130 kmph) and with different window configurations. First all windows were closed, then all of them were opened and, finally, only one of them was opened while the rest were closed.
The measuring system consisted of a professional Steinberg UR44 sound card and four Rode NT5 microphones. The system was managed through a PC with software based on virtual instrumentation. The UR44 sound card has a total of 4 analog inputs, which can be used to connect either a microphone array or a musical instrument. It supports various standardized communication protocols such as ASIO, WDM or Core Audio. The resolution of the AD conversion is up to 24 bits with different standardized sampling frequency values (from 44.1 to 192 kHz). The sound card also provides phantom power for connected microphones (from +24 VDC to +48 VDC).
The Rode NT5 is a small, compact microphone with an XLR connector. It has a 1/2” diaphragm and is built as an externally polarized condenser. The membrane is gold-plated, which improves its properties. The microphone is directional with a cardioid characteristic, and its frequency range is between 20 Hz and 20 kHz (which corresponds to the range of human hearing). In order to use the microphone, it must be connected to the input of a sound card supporting phantom power.
LabVIEW was chosen as a suitable programming environment, since it offers an extensive library of signal processing functions and enables fast development of multi-threaded applications. The available ASIO API libraries provide another undeniable advantage, since they offer a complete WaveIO library.
The application was designed to be highly modular to make any future modifications as fast as possible. The QMH (Queued Message Handler) was chosen as the core application architecture. Each microphone can therefore be considered a separate unit or input source.
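The queued message handler is a LabVIEW design pattern in which producers place messages on a queue and a consumer loop handles them one at a time. A rough Python analogue is sketched below purely to illustrate the idea; the message names (`audio_block`, `exit`) and the block size are hypothetical and do not come from the authors' application.

```python
import queue
import threading


def message_handler(inbox, log):
    """Consumer loop: handle (message, payload) pairs until 'exit' arrives."""
    while True:
        message, payload = inbox.get()
        if message == "exit":
            break
        if message == "audio_block":
            mic_index, samples = payload
            # A real handler would pass the samples on to the filtering stage.
            log.append((mic_index, len(samples)))
        # Unknown messages are ignored in this sketch.


inbox = queue.Queue()
log = []
worker = threading.Thread(target=message_handler, args=(inbox, log))
worker.start()

# Each microphone acts as an independent producer of messages.
for mic_index in range(4):
    inbox.put(("audio_block", (mic_index, [0.0] * 256)))
inbox.put(("exit", None))
worker.join()
```

Because every input source only posts messages, a further microphone or interference source can be added without changing the handler loop, which is the kind of modularity the pattern is meant to provide.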
A commercially available recognizer integrated into the Windows OS was used as a speech recognizer. The Speech SDK 5.1 must be installed to maintain a reliable connection to LabVIEW. The recognizer converts the speech into text, which is then analyzed to estimate the correct command.
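How the recognized text is mapped to a command is not detailed here. One possible approach, shown only as an assumption, is fuzzy matching of the recognizer output against the vocabulary with Python's standard `difflib` module. The vocabulary below contains only commands quoted in this article (the full list is in Tab. 1), and the function name and similarity cutoff are hypothetical.

```python
import difflib

# Commands quoted in the article; not the complete vocabulary.
VOCABULARY = [
    "start engine", "stop engine", "accept call", "decline call",
    "radio on", "radio off", "winker left", "winker right",
]


def match_command(recognized_text, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest vocabulary entry, or None if nothing is similar enough."""
    text = " ".join(recognized_text.lower().split())   # normalize case and spacing
    hits = difflib.get_close_matches(text, vocabulary, n=1, cutoff=cutoff)
    return hits[0] if hits else None


exact = match_command("  Start   Engine ")
near_miss = match_command("radio of")
unknown = match_command("xyz")
```

A stricter cutoff reduces false activations at the cost of rejecting more slightly misrecognized commands.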
In order for the signal to be modified or filtered by any adaptive filtering method, it is necessary to adjust the signal path. As the speech recognizer runs in the background of the OS as a service, it is not possible to select any input other than the default audio inputs, i.e., it is not possible to select the LabVIEW output. To solve this issue, the signal path was adjusted by the VB-Cable software, which emulates both the inputs and outputs of the sound card.
The front panel of the application can be seen in Fig. 6; it faithfully replicates the standard dashboard of the Nissan Leaf. There are 4 alarm indicators on the front panel: revs, speed, temperature and fuel level. After the initial start of the application, it is necessary to say the “Start engine” command, which starts the vehicle and the simulation itself. The recording of the car's idle state is then played and the system is ready for input commands. Subsequently, it is possible to control the application according to a predefined vocabulary set. To switch the simulated vehicle off, it is first necessary to stop the vehicle by manually setting the speed value to 0 kmph and then to say the “Stop engine” command. This turns off all indicators and shuts down the simulated engine as well. The application then waits for a restart (“Start engine” command). A simplified diagram of the whole application can be seen in Fig. 5.
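The start and stop logic described above amounts to a small state machine. The following sketch captures only the rules stated in the text (the engine starts on “Start engine” and stops on “Stop engine” only at 0 kmph); the class and method names are hypothetical and the real application is implemented in LabVIEW.

```python
class VirtualCar:
    """Minimal model of the simulated vehicle's engine state."""

    def __init__(self):
        self.engine_on = False
        self.speed_kmph = 0

    def set_speed(self, kmph):
        """Speed is set manually and only has an effect while the engine runs."""
        if self.engine_on:
            self.speed_kmph = kmph

    def handle(self, command):
        """Apply a recognized command; return True if it was accepted."""
        if command == "start engine" and not self.engine_on:
            self.engine_on = True
            return True
        if command == "stop engine" and self.engine_on and self.speed_kmph == 0:
            self.engine_on = False
            return True
        return False


car = VirtualCar()
started = car.handle("start engine")
car.set_speed(50)
refused = car.handle("stop engine")     # rejected while the car is moving
car.set_speed(0)
stopped = car.handle("stop engine")     # accepted at 0 kmph
```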
The application-supported vocabulary can be seen in Tab. 1. The vocabulary consists of two parts – first part is focused on the front panel (i.e., the vehicle) while the second one can be used to activate various interference sources.
4 Results of Experiments
The recognition results were evaluated based on the recognized/unrecognized status. To verify the whole experiment, 100 repetitions were performed. Tabs. 2–4 represent various scenarios measured with the experimental vehicle and their individual recognition rates. A significant improvement in successful recognition can be seen in Fig. 7. When the driver's front window was opened, the original success rate was only 39% on average. After applying the LMS algorithm, the average success rate improved to 95%. The “Accept call” command had the lowest recognition rate of all the analyzed commands in the 80 kmph scenario: 57%. A combination of LMS and ICA offered an average recognition rate of 98%, and the “Accept call” command even reached 100%. It is important to mention that the LMS and ICA combination can have a negative effect on some specific commands such as “Radio Off”. While the standalone LMS offered a 100% recognition rate, the LMS+ICA combination reached only 78%. On the other hand, when the worst-case scenario was measured (all windows opened), the LMS+ICA combination offered significantly better results than the standalone LMS. Exact results of the whole vocabulary measured at 80 kmph can be seen in Tab. 2.
When the speed was increased to 100 kmph, the results deteriorated even further due to the higher acoustic pressure changes, which caused background hum. On average, the ICA method again offered better results (by approx. 5%). There are, however, two specific cases in which the LMS + ICA combination reached unsatisfactory results: the “Winker left” and “Winker right” commands. While the LMS managed to recognize the driver in about 80% of all cases, the LMS + ICA achieved only 9% and 3%, respectively. Similar results were obtained when the windows were opened. The results were probably caused by the nature of the interference (pressure waves caused by changing gusts of wind). A bar graph presenting the results for 100 kmph can be seen in Fig. 8, while the exact results can be seen in Tab. 3.
For the last measurements, the maximum permitted speed in the Czech Republic was chosen: 130 kmph. Compared to the previous results, the table was expanded and also offers values with closed windows, as the noise penetrating from the surroundings into the car was significant. Prior to filtering, the recognition success rate with closed windows was only 58% on average; for example, the “Radio On” command was not recognized even once. After applying the adaptive LMS algorithm, the recognition rate was 89%, while the hybrid LMS + ICA reached 93%. When the driver's window was opened, the average pre-filter recognition rate dropped to 27%. A total of 7 commands were not recognized at all. After the adaptive LMS algorithm was introduced, the recognition rate improved to an average of 66%. The LMS + ICA hybrid improved the rate by a further 6%. After opening all windows, there was a very significant drop in recognition rate for all scenarios. Before filtering, the recognition rate was only 7%. After the LMS was used, the recognition rate improved to 29%, and ICA managed to improve it further to 30%. Conditions in the interior were already quite extreme and the whole platform was borderline unusable. The speech was essentially overshadowed by huge pressure waves caused by wind. A bar graph presenting the results for 130 kmph can be seen in Fig. 9 and the exact results are shown in Tab. 4.
Fig. 10 shows the time course of the “decline call” command before and after the application of the LMS algorithm. The algorithm effectively removes noise and interference, and the words “decline” and “call” remain isolated. With gradually increasing speed, and thus more noise pollution, the LMS algorithm still manages to isolate speech. However, the amplitude of the useful signal decreases, since it is partially filtered as well. The filter order N was set to 530, while the
5 Discussion and Conclusion
Based on the presented testing scenarios, both the LMS and the LMS+ICA combination managed to significantly improve the system reliability. Speech processing is particularly important in the worst-case scenarios. While the non-filtered speech was successfully recognized in only 7% of all cases, the LMS offered up to 29% and the LMS-ICA combination up to 30%. In this specific scenario, the difference between LMS and LMS-ICA might be insignificant, and the additional computational complexity is probably unjustified. The employment of advanced algorithms or their combinations will depend on the hardware equipment of specific vehicles. The signal can be further enhanced by machine learning and neural networks; while these techniques are certainly powerful, they also tend to be much more demanding than conventional methods. The future deployment of AI is currently planned.
Our future research will be focused on testing different types of hybrid systems for automatic speech recognition. While the LMS+ICA combination offered satisfactory results, other algorithms can be used instead. There are different types of ICA-based algorithms, each with different advantages and disadvantages. For example, fastICA was used during the presented initial tests. In the future, JADE, flexICA, SOBI, InfoMax, RADICAL, robustICA, etc., can be used in place of fastICA. LMS was chosen based on its low computational complexity, simplicity and accuracy. Choosing the ideal adaptive algorithm is difficult, and this area will be explored further as well. The recursive least squares (RLS) algorithm can offer higher accuracy in certain areas but has a higher computational complexity. There is also an RLS variant with lower complexity called the fast transversal filter (FTF), which seems like an ideal candidate for further testing.
The Technical University of Ostrava (VSB-TUO) has recently acquired two fully customizable Skoda Superb testing vehicles. These vehicles offer the latest Volkswagen hardware, which is partially unlocked for development at the university. The conducted tests can now be repeated in these modern vehicles, and the speech recognition system can be deployed together with Skoda's proprietary Laura voice assistant, comparing the performance of the already integrated system with a modified scenario using the presented algorithms.
The presented systems can also be deployed in different areas. Based on previously conducted tests, the system is also capable of speech recognition in production plants, operating even in harsh environments. With minor adjustments, the system filtered voice commands and adjusted parameters on the fly while working next to a press machine. An article covering this topic is currently being prepared and will be published shortly. Testing of other scenarios (voice recognition in trains or planes) is currently scheduled, and the results will be compared to current research.
Both research branches will be further explored in the newly built VSB-TUO testbed CPIT TL3. This specialized building is focused on three main development areas (smart factory, home care and automotive) and offers sophisticated building management systems, energy flow monitoring, an extensive integrated network of various advanced sensor systems and high-speed data transmission. CPIT TL3 will be opened in June 2021.
The presented article offered the first insight into our adaptive speech recognition system. The platform was built around professional hardware components (Steinberg and Rode), which were integrated into a real vehicle (Skoda). While the platform had its limiting factors, it still managed to significantly improve the measured values. When comparing the unfiltered voice commands to the LMS and LMS+ICA combinations, the system reached up to 7 times better performance. The best results were achieved in the worst-case scenarios, when the car was driving at higher speeds with opened windows. When the car was driving at lower speeds (i.e., 100 kmph), the LMS+ICA combination improved the system reliability by up to 50%.
Acknowledgement: This work was supported by the European Regional Development Fund in the Research Centre of Advanced Mechatronic Systems project, project number CZ.02.1.01/0.0/0.0/16_019/0000867 within the Operational Programme Research, Development and Education, and in part by the Ministry of Education of the Czech Republic under Project SP2021/32.
Funding Statement: This research was funded by the European Regional Development Fund in the Research Centre of Advanced Mechatronic Systems project, project number CZ.02.1.01/0.0/0.0/16_019/0000867 and by the Ministry of Education of the Czech Republic, Project No. SP2021/32.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.