Contents
Acknowledgments
Abstract
1 Introduction
1.1 Motivation
1.2 Overview
2 Digital Filters
2.1 FIR-Filter
2.2 IIR-Filter
2.3 Wiener Filter
2.3.1 Solution in the Time Domain
2.3.2 Solution in the Frequency Domain
2.4 Adaptive Filters
2.4.1 LMS Algorithm
2.4.2 NLMS Algorithm
3 Acoustic Echo Cancellation
3.1 Problem Definition
3.2 Adaptive Filter
3.3 Voice Activity Detection VAD
3.4 Pre-Emphasis/De-Emphasis
3.5 Residual Echo Suppression
3.6 Matlab Results
4 Speech Separation
4.1 Beamforming
4.1.1 Diffuse Noise Field and Directivity
4.1.2 Delay-and-Sum Beamformer
4.1.3 MVDR Beamformer
4.1.4 Superdirective Beamformer
4.1.5 Zelinski Postfilter
4.2 Echo Suppression Postfilter ES
5 Experiments
5.1 Corpus
5.2 Automatic Speech Recognition System ASR
5.3 Experiments and Results
5.3.1 Superdirective Beamformer
5.3.2 Zelinski Postfilter
5.3.3 Echo Suppression Filter
5.3.4 Row of Echo Suppression Systems
5.3.5 Zelinski Postfilter and Echo Suppression System
6 Summary, Conclusions and Future Work
List of Tables
5.1 WER for SDB
5.2 WER for Zelinski Postfilter
5.3 WER for Echo Suppression Filter
5.4 WER for Zelinski Postfilter and Echo Suppression Filter
List of Figures
2.1 FIR Filter Structure [Tho11]
2.2 IIR Filter Structure [Tho11]
2.3 Structure of a Wiener Filter
2.4 Structure of an Adaptive Filter
3.1 Audio Conference System
3.2 Acoustic Echo Cancellation System
3.3 Speech Signal
3.4 Signal Power of a Signal
3.5 Voice Activity of a Signal
3.6 Example for Pre-Emphasis
3.7 Far-End Speaker Signal
3.8 Desired Signal
3.9 Estimated Impulse Response
3.10 Result without Near-End Speaker
3.11 Desired Signal with two Speakers
3.12 Result with Near-End Speaker
3.13 Echo Suppressed Signal with Postfilter
3.14 Echo Cancellation System
4.1 Task Speech Separation
4.2 Beamforming
4.3 Plane Wave on Linear Microphone Array
4.4 Delay-And-Sum Beamformer [VT02]
4.5 MVDR Beamformer
4.6 Zelinski Postfilter
5.1 SDB
5.2 Bar Plot of the WER for Zelinski Postfilter
5.3 SDB with Zelinski Postfilter
5.4 SDB with Echo Suppression Filter
5.5 SDB with up to 4 ES Filters
5.6 WER for different β
5.7 SDB with Zelinski and ES
6.1 Speech Separation System.
Acknowledgments
First of all, I would like to thank Prof. Dr. Dietrich Klakow, who gave me the opportunity to write my thesis at the Chair of Speech Signal Processing (Lehrstuhl für Sprachsignalverarbeitung) at Saarland University.
Furthermore, I thank all members of the chair, who supported me generously and most kindly with technical, organizational, and subject-related problems. Special mention goes to my supervisor Friedrich Faubel, who supervised me excellently throughout the entire course of my thesis.
A very special thank-you also goes to my family and my friends. In particular, my mother's active support throughout my school and academic career forms the foundation for the success of this thesis.
Abstract
This bachelor thesis deals with acoustic echo cancellation and speech separation. An acoustic echo cancellation system is implemented, and parts of this system are then used for the speech separation process. The speech separation process is evaluated with an automatic speech recognition system, and the filter that is used leads to a significant improvement of the speech separation. The separation system achieves a word error rate of 44.20 %. This is an improvement of 24 % in comparison to a superdirective beamformer.
Summary
This bachelor thesis covers the topics of acoustic echo cancellation and speech separation. First, an acoustic echo cancellation system is implemented in Matlab, and then parts of this system are used for speech separation. The speech separation experiments are evaluated with an automatic speech recognition system, and the filter that is used yields a clear improvement of the speech separation. The system achieves a word error rate of 44.20 %. This corresponds to an improvement of 24 % compared to the superdirective beamformer.
1 Introduction
1.1 Motivation
In recent years, the way people communicate has changed. Hands-free communication is becoming increasingly important. It is used, for example, in teleconferencing systems, mobile phones, home entertainment, and car information systems. These systems are very comfortable to use, but the coupling between microphones and loudspeakers introduces echoes that can disturb the conversation. The solution to this problem is an acoustic echo cancellation system that suppresses the disturbing echo.
The presence of more than one person in a room, like in teleconferencing systems, leads to new problems. Now, it is necessary to separate the speech of both speakers and this is a very challenging task in speech recognition.
In this thesis an acoustic echo cancellation system is explained and implemented and then parts of this echo cancellation system are used to solve the speech separation problem.
1.2 Overview
This thesis explores the speech separation problem using parts of an acoustic echo cancellation system. It is organized as follows:
Section 2
This section gives an introduction to digital filters. It is very important to have knowledge about these filters in order to understand and implement an acoustic echo cancellation system.
Section 3
This section first describes the basic concept of an acoustic echo cancellation system. Then, a voice activity detection system and a pre-emphasis filter are introduced. In order to improve the results a postfilter for the residual echo is also presented in this part. Furthermore, the system is implemented in Matlab and Section 3.6 shows the results of the implementation.
Section 4
This section deals with the actual speech separation problem. Here, we first point a beamformer (spatial filter) at each of the acoustic sources in order to make use of the spatial diversity of the signal. We therefore explain the beamforming process using the example of the delay-and-sum beamformer in Section 4.1.2 and the superdirective beamformer in Section 4.1.4. Different postfilters that are applied after the beamforming process are also part of this section.
Section 5
This section presents the experiments and in particular the results of the different experiments.
Section 6
The last section summarizes the main results of this thesis and gives an outlook on future work.
2 Digital Filters
Digital filters are an important topic in signal processing and they are used in many applications in communication technology. In this section, we will introduce the filters that are used in this thesis.
2.1 FIR-Filter
As the name suggests, a finite impulse response (FIR) filter has an impulse response of finite length. The output of an FIR filter is bounded because it is a finite weighted sum of bounded inputs. Since there is no feedback, an FIR filter can never oscillate and is always stable. Figure 2.1 shows the structure of an FIR filter.
illustration not visible in this excerpt
Figure 2.1: FIR Filter Structure [Tho11]
The weighted sum of the inputs leads to the impulse response of an FIR filter in the z-domain:
illustration not visible in this excerpt
The z elements in equation 2.1 are also called taps; they are delays of the input signal. H(z) is the z-transform of the impulse response h(n) and is useful for checking the stability of the system. An FIR filter has no poles apart from those at the origin, which implies stability of the system. We use an FIR filter later in the pre-emphasis part (Section 3.4) of the acoustic echo cancellation system.
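As a rough illustration of the weighted sum of delayed inputs described above, the following Python sketch filters an impulse with a three-tap FIR filter. The function name `fir_filter` and the tap values are hypothetical and chosen only for this example; they are not taken from the thesis.

```python
def fir_filter(b, x):
    """Compute y[n] = sum_i b[i] * x[n - i], treating x[n] as 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, b_i in enumerate(b):
            if n - i >= 0:  # samples before the start of the signal count as zero
                acc += b_i * x[n - i]
        y.append(acc)
    return y


# Feeding a unit impulse returns the taps themselves: the impulse response is finite.
taps = [0.5, 0.3, 0.2]  # hypothetical coefficients
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = fir_filter(taps, impulse)
print(response)
```

After the last tap has been reached, the output is exactly zero, which is the finite impulse response property in its simplest form.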
2.2 IIR-Filter
An infinite impulse response (IIR) filter has an impulse response h(n) of infinite length, because previous output samples are fed back in order to compute the output y(n). Figure 2.2 shows the scheme of an IIR filter.
illustration not visible in this excerpt
Figure 2.2: IIR Filter Structure [Tho11]
The z-transform of the impulse response h(n) is
illustration not visible in this excerpt
Equation 2.2 has poles at the roots of the denominator. Because of these poles, an IIR filter can be unstable. To design such a filter we have to determine the coefficients b_i. We will use this filter type in the de-emphasis part in Section 3.4.
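The effect of feedback on stability can be sketched with a first-order recursion. The form y[n] = b0·x[n] + a1·y[n−1] and the names `iir_first_order`, `b0`, and `a1` are assumptions made for this example (the sign convention of the feedback coefficient in equation 2.2 is not visible in this excerpt). With this convention the single pole lies at z = a1.

```python
def iir_first_order(b0, a1, x):
    """First-order recursion y[n] = b0 * x[n] + a1 * y[n - 1] (assumed convention)."""
    y = []
    prev = 0.0  # y[-1] = 0
    for sample in x:
        prev = b0 * sample + a1 * prev
        y.append(prev)
    return y


impulse = [1.0] + [0.0] * 9

# Pole inside the unit circle (|a1| < 1): the impulse response decays.
stable = iir_first_order(1.0, 0.5, impulse)

# Pole outside the unit circle (|a1| > 1): the impulse response grows without bound.
unstable = iir_first_order(1.0, 2.0, impulse)

print(stable[:4])
print(unstable[-1])
```

In both cases the impulse response never becomes exactly zero, in contrast to the FIR case.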
2.3 Wiener Filter
The Wiener filter theory was developed by Norbert Wiener (1949) and Andrei Kolmogorov (1941) [Vas96]. Wiener's solution was based on time-domain analysis, whereas Kolmogorov's solution was based on frequency-domain analysis.
The goal of a Wiener filter is to remove noise from a noisy signal. If noise n[k] is added to a clean signal s[k], a Wiener filter with the filter coefficients h[k] tries to reconstruct the clean signal from the noisy signal. Figure 2.3 shows the scheme of a Wiener filter. For the acoustic echo cancellation task, a Wiener filter is used in the echo suppression system that is explained in Section 3.5.
illustration not visible in this excerpt
Figure 2.3: Structure of a Wiener Filter
As we can see in Figure 2.3, we can express the noisy signal x[k] as
$$x[k] = s[k] + n[k] \qquad (2.3)$$
The filtered signal ŝ[k] is a reconstruction of the clean signal s[k].
In the following part the solutions in the time and frequency domain are explained.
2.3.1 Solution in the Time Domain
The signal ŝ[k] can be written with a convolution as
illustration not visible in this excerpt
Then, the error between the clean and filtered signal is
illustration not visible in this excerpt
The idea of a Wiener filter is to minimize the expectation value of the squared error
illustration not visible in this excerpt
If we use the definition of the discrete convolution, we get
illustration not visible in this excerpt
In order to minimize equation 2.7 we have to calculate the derivative with respect to h[i] and set it to zero.
illustration not visible in this excerpt
Simplifying and substituting equation 2.3 into equation 2.9 leads to
illustration not visible in this excerpt
Now, we assume that noise and signal are uncorrelated
illustration not visible in this excerpt
and we define
illustration not visible in this excerpt
Then, we get the following equation
illustration not visible in this excerpt
If we use the convolution, we get the Wiener-Hopf-equation:
illustration not visible in this excerpt
Solving this system of equations gives the optimum impulse response at a complexity of O(n²).
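The following Python sketch shows how such a system of equations could be set up and solved numerically. It assumes that, for uncorrelated signal and noise, the system takes the common textbook form Σ_j h[j]·r_xx[i−j] = r_ss[i] with r_xx = r_ss + r_nn; the equations of the thesis are not visible in this excerpt, so this form, the function names, and the example autocorrelation values are assumptions. For brevity the sketch uses plain Gaussian elimination, which costs O(n³); the O(n²) complexity mentioned above requires a solver that exploits the Toeplitz structure, such as the Levinson recursion.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def wiener_hopf(r_ss, r_nn):
    """Filter taps from signal and noise autocorrelations (lags 0..n-1)."""
    n = len(r_ss)
    r_xx = [s + v for s, v in zip(r_ss, r_nn)]  # uncorrelated signal and noise
    toeplitz = [[r_xx[abs(i - j)] for j in range(n)] for i in range(n)]
    return solve_linear_system(toeplitz, r_ss)


# Hypothetical example: correlated signal, white noise of unit variance.
h = wiener_hopf(r_ss=[1.0, 0.5], r_nn=[1.0, 0.0])
print(h)
```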
2.3.2 Solution in the Frequency Domain
In the remaining part of the thesis, uppercase letters denote signals in the frequency domain. In the frequency domain, a convolution becomes a multiplication, and therefore the error J(ω) is
illustration not visible in this excerpt
The next step is again to minimize the expectation value of the squared error E{J(ω)²} by taking the derivative with respect to H(ω):
illustration not visible in this excerpt
With
illustration not visible in this excerpt
and the assumption of uncorrelated noise and signal
illustration not visible in this excerpt
we can simplify equation 2.18 to
illustration not visible in this excerpt
With the power spectral densities
E{S(ω)² + N(ω)²}.
illustration not visible in this excerpt
we can formulate the solution of the impulse response in the frequency domain as
illustration not visible in this excerpt
This method in the frequency domain has a complexity of [illustration not visible in this excerpt]
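A frequency-domain Wiener filter can be sketched as a per-bin gain. The snippet below assumes the common textbook form H(ω) = P_ss(ω) / (P_ss(ω) + P_nn(ω)) for uncorrelated signal and noise; the exact equation of the thesis is not visible in this excerpt, and the function name `wiener_gain` and the example power values are hypothetical.

```python
def wiener_gain(p_signal, p_noise):
    """Per-bin gain P_ss / (P_ss + P_nn), assuming uncorrelated signal and noise."""
    gains = []
    for p_s, p_n in zip(p_signal, p_noise):
        total = p_s + p_n
        # A bin without any power carries nothing to reconstruct.
        gains.append(p_s / total if total > 0.0 else 0.0)
    return gains


# Hypothetical power spectral densities for three frequency bins.
p_ss = [4.0, 1.0, 0.0]
p_nn = [1.0, 1.0, 1.0]
gains = wiener_gain(p_ss, p_nn)
print(gains)
```

Bins dominated by the signal pass almost unchanged (gain near 1), while bins that contain only noise are suppressed (gain 0).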
2.4 Adaptive Filters
An adaptive filter is a filter whose filter coefficients can be changed so that the filter can be adapted to different environments. In order to realize this we need, in addition to the filter coefficients, an algorithm to update these coefficients. Figure 2.4 shows the scheme of such an adaptive filter.
illustration not visible in this excerpt
Figure 2.4: Structure of an Adaptive Filter
The input of the filter consists of N samples of the signal x(n)
illustration not visible in this excerpt
and with the filter coefficients w(n)
illustration not visible in this excerpt
we can calculate the output signal
illustration not visible in this excerpt
This signal is compared with the desired signal d(n), and the error
illustration not visible in this excerpt
is used to calculate the new filter coefficients for the next N samples of the signal. The goal is to minimize the error, because the echo is then estimated as well as possible and can thus be removed from the signal.
2.4.1 LMS Algorithm
The term LMS algorithm stands for Least-Mean-Squares algorithm. It denotes a certain way of updating the coefficients of the adaptive filter. It uses the method of steepest descent and was invented by Bernard Widrow and Marcian Edward Hoff in 1960. The mean square error
illustration not visible in this excerpt
is calculated and minimized. Since the expectation value of the error is unknown, we approximate it by the instantaneous value of the error
illustration not visible in this excerpt
Substituting equation 2.30 into equation 2.29 leads to
illustration not visible in this excerpt
To minimize this function we use the method of steepest descent. Therefore, we have to take the gradient with respect to the filter coefficients w(n).
illustration not visible in this excerpt
Substituting
in equation 2.32 leads to
illustration not visible in this excerpt
Applying the chain rule and simplifying results in
illustration not visible in this excerpt
The equation for updating the filter coefficients w(n + 1) at any given time n is given by the method of steepest descent by subtracting the gradient from the previous filter coefficients w(n):
illustration not visible in this excerpt
where μ is the stepsize, a small constant that controls the speed of adaptation. If we now substitute equation 2.35 into equation 2.36, we obtain the update formula of the filter coefficients for the LMS algorithm:
illustration not visible in this excerpt
In summary, we get the following scheme for the procedure of the LMS algorithm.
Algorithm 1 LMS Algorithm [Kha07]
illustration not visible in this excerpt
Comment: The filter coefficients are initialized with w(0) = 0. The stepsize must satisfy 0 < μ < μ_max, where the upper bound μ_max is inversely proportional to λ_max, the largest eigenvalue of the autocorrelation matrix R = E{x(n)x(n)^T} [Hay02]. Each step of the algorithm requires 2N+1 multiplications and 2N additions. Consequently, the complexity is of order O(N). Because of its simplicity and low complexity, the LMS algorithm is often used in adaptive filters.
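The LMS procedure can be sketched in Python as a system identification task. The sketch assumes the update w(n+1) = w(n) + 2μ·e(n)·x(n), where the factor 2 comes from the gradient of e²(n); the exact update formula of the thesis is not visible in this excerpt. The function name `lms_identify`, the three-tap "room" response, and the stepsize value are hypothetical.

```python
import random


def lms_identify(x, d, num_taps, mu):
    """Adapt an FIR filter w so that its output tracks the desired signal d."""
    w = [0.0] * num_taps  # initialization w(0) = 0
    errors = []
    for n in range(len(x)):
        # Most recent num_taps input samples, newest first, zero-padded at the start.
        frame = [x[n - i] if n - i >= 0 else 0.0 for i in range(num_taps)]
        y = sum(w_i * x_i for w_i, x_i in zip(w, frame))  # filter output
        e = d[n] - y                                      # instantaneous error
        # Assumed update rule: w(n+1) = w(n) + 2 * mu * e(n) * x(n)
        w = [w_i + 2.0 * mu * e * x_i for w_i, x_i in zip(w, frame)]
        errors.append(e)
    return w, errors


random.seed(0)
room = [0.6, -0.3, 0.1]  # hypothetical impulse response to be identified
x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
d = [sum(room[i] * x[n - i] for i in range(len(room)) if n - i >= 0)
     for n in range(len(x))]

w, errors = lms_identify(x, d, num_taps=3, mu=0.05)
print([round(w_i, 4) for w_i in w])
```

Because the example is noise-free and the input is white, the coefficients approach the hypothetical room response and the error shrinks towards zero.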
2.4.2 NLMS Algorithm
One problem with the LMS algorithm is the slow convergence of the filter coefficients, especially for fast changes of the input. To improve this, the normalized least-mean-squares (NLMS) algorithm is used. It extends the LMS algorithm by adapting the stepsize in each iteration, normalizing it with the signal power of x(n). The stepsize μ is given by
illustration not visible in this excerpt
Substituting equation 2.38 in the update formula of the LMS algorithm leads to
illustration not visible in this excerpt
This equation is often modified to
illustration not visible in this excerpt
where μ¹ and ψ are small positive constants: μ influences the speed of adaptation and ψ prevents a division by zero.
Algorithm 2 NLMS Algorithm [FB98]
illustration not visible in this excerpt
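A possible Python sketch of the NLMS update is given below. It assumes the commonly used form w(n+1) = w(n) + μ / (ψ + x(n)ᵀx(n)) · e(n)·x(n); the equations of the thesis are not visible in this excerpt, and the function names, the default values of `mu` and `psi`, and the test signals are hypothetical. The example runs the same identification task at two very different input levels to illustrate the normalization by the signal power.

```python
import random


def nlms_identify(x, d, num_taps, mu=0.5, psi=1e-6):
    """NLMS adaptation: the stepsize is normalized by the power of the input frame."""
    w = [0.0] * num_taps
    for n in range(len(x)):
        frame = [x[n - i] if n - i >= 0 else 0.0 for i in range(num_taps)]
        y = sum(w_i * x_i for w_i, x_i in zip(w, frame))
        e = d[n] - y
        power = sum(x_i * x_i for x_i in frame)
        step = mu / (psi + power)  # psi avoids a division by zero
        w = [w_i + step * e * x_i for w_i, x_i in zip(w, frame)]
    return w


def simulate_echo(room, x):
    """Convolve the input with a (hypothetical) room impulse response."""
    return [sum(room[i] * x[n - i] for i in range(len(room)) if n - i >= 0)
            for n in range(len(x))]


random.seed(1)
room = [0.6, -0.3, 0.1]
quiet = [random.uniform(-0.01, 0.01) for _ in range(2000)]
loud = [100.0 * s for s in quiet]

w_quiet = nlms_identify(quiet, simulate_echo(room, quiet), num_taps=3)
w_loud = nlms_identify(loud, simulate_echo(room, loud), num_taps=3)
print([round(v, 4) for v in w_quiet], [round(v, 4) for v in w_loud])
```

Both runs use the same μ and reach the same coefficients, although the input power differs by several orders of magnitude.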
3 Acoustic Echo Cancellation
Acoustic echo cancellation (AEC) systems are used in a wide range of communication systems. For example, they can improve the speech quality of speakerphones or Voice over IP telephones. These devices are becoming more and more common, and therefore acoustic echo cancellation is a very important and useful feature in signal processing. In this section we explain the details of an AEC system, and at the end we look at some results of the Matlab implementation.
3.1 Problem Definition
Audio feedback is often a problem of speakerphones or audio conference systems. The microphone receives the speaker's voice in addition to the loudspeaker signal that is reflected by the walls or the ceiling. The result is a superposition of speech and disturbing echo, which makes the utterance hard to understand for the receiver. Figure 3.1 shows the scheme of such an audio conference system.
illustration not visible in this excerpt
Figure 3.1: Audio Conference System
In order to avoid the superposition, an acoustic echo cancellation system estimates the echo with certain filters and then subtracts the estimated echo from the microphone signal. As a result, the echo is reduced and the receiver hears a clear speech signal.
illustration not visible in this excerpt
Figure 3.2 illustrates the scheme of an acoustic echo cancellation system. Two speakers with microphones and loudspeakers are placed in two different rooms. The microphone signal x(t) of the far-end room is sent to the loudspeaker in the near-end room. There the signal, which is reflected by the walls and the ceiling, is received at the microphone in addition to the voice of the second speaker v(t). This signal is sent to the far-end room, and the speaker there would hear a disturbing echo. To solve this problem, we have to estimate the echo y(t) and then subtract it from the microphone signal. For this estimation an adaptive filter is used to estimate the impulse response of the room.
illustration not visible in this excerpt
Figure 3.2: Acoustic Echo Cancellation System
The error is
illustration not visible in this excerpt
with
illustration not visible in this excerpt
where w(t) describes the filter coefficients of the adaptive filter and h(t) describes the impulse response of the room.
3.2 Adaptive Filter
For the estimation of the room impulse response h(t), we use an adaptive filter as described in Section 2.4. The input signal x(t) is the signal of the far-end speaker, and the desired signal d(n) is the near-end speaker signal plus the disturbing echo. With a good estimate of the impulse response, we can reduce the echo by subtracting the filter output y(t) from the desired signal.
3.3 Voice Activity Detection VAD
Voice activity detection is a technique with which the presence or absence of human speech can be detected. It is an important feature of an acoustic echo cancellation system because we have to prevent the updating process of the adaptive filter when there is no far-end speech activity so that the impulse response is estimated correctly. Therefore, we implement a VAD algorithm in this work.
illustration not visible in this excerpt
Figure 3.3: Speech Signal
Figure 3.3 shows the amplitude of a speaker file. The first step is to calculate the power of the signal. To do so, we divide the signal into frames and calculate the power of each frame (Figure 3.4(a)).
illustration not visible in this excerpt
(a) Signal Power
illustration not visible in this excerpt
(b) Signal Power with Threshold
Figure 3.4: Signal Power of a Signal
The last step is to compare the power to a threshold. In this case we choose a threshold of one third of the maximum power. The speaker is considered active if the frame power exceeds this threshold (Figure 3.4(b)), and it is assumed that there is only noise if the value is below it. Figure 3.5 shows the result. If there is no voice activity, the value of the frame is 0; otherwise it is 1.
illustration not visible in this excerpt
Figure 3.5: Voice Activity of a Signal
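The two steps described above (frame-wise power, then comparison with one third of the maximum power) can be sketched in Python as follows. The function name `voice_activity`, the use of non-overlapping frames, the frame length, and the toy signal are assumptions made for this example; the frame layout used in the thesis is not stated in this excerpt.

```python
def voice_activity(signal, frame_length):
    """Return one 0/1 decision per frame based on the frame power."""
    powers = []
    # Non-overlapping frames (assumption); an incomplete last frame is dropped.
    for start in range(0, len(signal) - frame_length + 1, frame_length):
        frame = signal[start:start + frame_length]
        powers.append(sum(s * s for s in frame) / frame_length)
    if not powers:
        return []
    threshold = max(powers) / 3.0  # one third of the maximum frame power
    return [1 if p > threshold else 0 for p in powers]


# Toy signal: background noise, two frames of "speech", then silence.
signal = ([0.01] * 4
          + [1.0, -1.0, 1.0, -1.0]
          + [0.8, -0.8, 0.8, -0.8]
          + [0.0] * 4)
activity = voice_activity(signal, frame_length=4)
print(activity)
```

In an echo cancellation system, such a decision on the far-end signal could be used to pause the coefficient update of the adaptive filter while the far-end speaker is silent.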
[...]
¹ μ usually satisfies 0 < μ < 2 [Göt]
- Cite this work
- Christian Siegwart (Author), 2012, Improving Speech Separation by Acoustic Echo Cancellation, Munich, GRIN Verlag, https://www.grin.com/document/207359