speech enhancement

The problem: How to remove unwanted near-uniform background noise (e.g. a car engine, factory floor noise) from a speech signal.

The solution (overview): One can assume that in normal speech, the speaker pauses fairly often. Therefore, the quietest component encountered at each frequency over a 10-second period is probably part of the noise spectrum during that interval. A DSP program can be written to sample the signal, estimate the noise spectrum, and subtract it.

The results: A program written to do this on a Motorola 56002 DSP removed all the noise from a range of input signals, and was especially good on low-level car and helicopter noise. It did produce a "musical" gurgling artifact, but this was at a much lower level than the original noise.

The solution (detail): The program uses the basic method known as spectral subtraction, along with additional low-pass filtering and noise overestimation at low frequencies, to estimate the noise present in the incoming signal and subtract it. A skeleton program was used to repeatedly generate a set of frequency-domain coefficients from the incoming signal, and to transform them back to give the output signal. The real and imaginary parts of the D.C. level of the signal are stored first, to protect them from the routines that follow.
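The frame loop with its D.C. save and restore can be sketched in Python as follows. This is only an illustration, not the original DSP code: the names dft, idft and process_frame are hypothetical, a naive DFT stands in for the FFT, and windowing and frame overlap are left out.

```python
import cmath


def dft(samples):
    """Naive discrete Fourier transform (stand-in for the DSP's FFT routine)."""
    n = len(samples)
    return [
        sum(s * cmath.exp(-2j * cmath.pi * k * t / n) for t, s in enumerate(samples))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, including the 1/n scaling."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(spectrum)) / n
        for t in range(n)
    ]


def process_frame(samples, enhance):
    """Transform one frame, run `enhance` on the spectrum, and transform back.

    `enhance` is a hypothetical callback standing in for the noise-removal
    routines; bin 0 (the D.C. level) is saved before it runs and restored after.
    """
    spectrum = dft(samples)
    dc = spectrum[0]                  # real and imaginary parts of the D.C. bin
    processed = list(enhance(spectrum))
    processed[0] = dc                 # restore D.C. before the inverse transform
    return [y.real for y in idft(processed)]
```

Even if the enhancement step were to zero every bin, the restored D.C. bin means the output frame still carries the mean level of the input.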

The subroutine Mag converts the complex frequency-domain signal X(ω) to its magnitude |X(ω)|. The subroutine LPF_X then low-pass filters this with the magnitudes from previous frames to give |PX(ω)|.
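A minimal sketch of these two steps, in Python rather than DSP assembly. The writeup does not say what kind of low-pass filter LPF_X uses, so the one-pole recursive smoother and its coefficient beta here are assumptions, as are the function names.

```python
def magnitude(spectrum):
    """Return |X(w)| for each complex frequency bin."""
    return [abs(x) for x in spectrum]


def lpf_update(prev_smoothed, mags, beta=0.5):
    """Smooth magnitudes across frames: PX = beta * PX_prev + (1 - beta) * |X|.

    Assumption: a one-pole filter with a made-up coefficient `beta`; the
    original filter design is not given in the text.
    """
    if prev_smoothed is None:         # first frame: nothing to smooth against
        return list(mags)
    return [beta * p + (1.0 - beta) * m for p, m in zip(prev_smoothed, mags)]
```

Calling lpf_update once per frame and feeding its result back in as prev_smoothed gives the running |PX(ω)|.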

The noise spectrum is determined by creating a frequency-domain signal comprising the quietest spectral components observed over a ten-second period. A set of signals |M₁(ω)|, |M₂(ω)|, |M₃(ω)|, |M₄(ω)| is built up by the following technique: For every frame, each frequency in |M₁(ω)| is set to the corresponding frequency in |PX(ω)| if the latter is lower than the former. Every 2.5 seconds, all the M signals are moved along one index (i.e. |Mᵢ₊₁(ω)| is set to |Mᵢ(ω)|, and the oldest, |M₄(ω)|, is discarded) and |M₁(ω)| is set to |PX(ω)|. This is handled by the subroutines counter_250ms and reach_250ms. Now, by choosing the lowest of |M₁(ω)| to |M₄(ω)| for each ω, |N(ω)| can be estimated. Note that, since the lowest amplitude is always chosen, the actual average noise amplitude will be considerably higher than this estimate. Therefore the values in |N(ω)| are multiplied by a factor, alpha (found to be 4). All these tasks are performed by the subroutine getN.
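The minimum-tracking scheme above can be sketched like this. The class name, its parameters, and the choice to count segment length in frames are all hypothetical; the original does this work in getN and the counter subroutines.

```python
class MinimumNoiseTracker:
    """Track the quietest value per bin over several consecutive segments."""

    def __init__(self, num_bins, frames_per_segment, num_segments=4, alpha=4.0):
        self.frames_per_segment = frames_per_segment  # frames in 2.5 s (assumed)
        self.alpha = alpha                            # overestimation factor
        self.frame_count = 0
        # segments[0] plays the role of M1 (being filled); segments[-1] is the oldest.
        self.segments = [[float("inf")] * num_bins for _ in range(num_segments)]

    def update(self, px):
        """Feed one frame of smoothed magnitudes |PX| and return alpha * |N|."""
        # Keep the per-bin minimum seen so far in the current segment.
        self.segments[0] = [min(m, p) for m, p in zip(self.segments[0], px)]
        self.frame_count += 1
        if self.frame_count >= self.frames_per_segment:
            # Shift every segment along one index, dropping the oldest,
            # and restart M1 from the current frame.
            self.segments = [list(px)] + self.segments[:-1]
            self.frame_count = 0
        # The noise estimate is the lowest value across all segments, scaled up.
        return [self.alpha * min(column) for column in zip(*self.segments)]
```

With four segments of 2.5 seconds each, the estimate always reflects between 7.5 and 10 seconds of history, so a loud stretch of speech ages out of the estimate without a long memory buffer.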

The lowest 8 frequency bins of the noise spectrum are then multiplied by a further factor of 8, to reduce somewhat the effect described as “Musical Noise”. This works because most of this effect occurs in the lower frequencies. Next the noise estimate is subtracted from the complex signal X(ω). Since nothing is known about the phase of the noise, yet the phase information of the input signal must be preserved, the noise estimate is in fact used to calculate a real factor, g(ω), by which X(ω) is multiplied to give the output.

g(ω) = max(λ, 1 - |N(ω)|/|X(ω)|)

The value of λ used was 0.1; it is required to prevent g(ω) from taking a negative or very small value. Before running the inverse FFT routine, the D.C. levels are restored.
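The low-frequency boost and the gain formula can be sketched as below. The function names are hypothetical, the lists are assumed to hold only the non-negative-frequency bins in ascending order, and the handling of a silent bin (|X(ω)| = 0) is an assumption, since the writeup does not cover it.

```python
def boost_low_bins(noise, num_low_bins=8, boost=8.0):
    """Multiply the lowest `num_low_bins` noise bins by `boost`."""
    return [n * boost if k < num_low_bins else n for k, n in enumerate(noise)]


def apply_spectral_gain(spectrum, noise, floor=0.1):
    """Scale each complex bin by g = max(floor, 1 - |N| / |X|).

    The gain is real, so the phase of X is left untouched.
    """
    output = []
    for x, n in zip(spectrum, noise):
        mag = abs(x)
        if mag == 0.0:
            g = floor                 # assumption: avoid dividing by zero
        else:
            g = max(floor, 1.0 - n / mag)
        output.append(g * x)
    return output
```

For example, a bin with |X| = 4 and |N| = 1 is scaled by 0.75, while a bin where the noise estimate exceeds the signal is scaled by the floor λ = 0.1 instead of being inverted or zeroed.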
