OK, I have a song stored as 2-channel, 16-bit linear PCM on my reasonably fast computer. I want to slow down the tempo because I'm trying to remix with another song.

"Re-perform it!" No, I don't have the source score or samples, and I don't have the vocal training; all I have is this wav file I extracted from a CD.

"Resample it!" No, resampling digital audio has an effect analogous to that of slowing down the turntable: it transposes the song to a lower key makes the singer sound like an ogre (no, not Shrek).

I guess it's time for my old friends Fourier and Wigner to come help. We'll build a phase vocoder after Flanagan, Golden, and Portnoff. Basic steps: compute the frequency/time relationship of the signal by taking the FFT of each windowed block of 2,048 samples (assuming 44.1 kHz input), do some processing of the frequencies' amplitudes and phases, and perform the inverse FFT. A good algorithm will give good results at compression/expansion ratios of up to ±25%; beyond that, the pre-echo and other smearing artifacts of frequency domain interpolation on transient ("beat") waveforms, which are not localized at all in the frequency domain, begin to take a toll on perceived audio quality.
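For the curious, here's roughly what that loop looks like in Python with NumPy. This is a mono sketch of my own (run it once per channel), not code from Flanagan, Golden, or Portnoff; the hop size, magnitude interpolation, and overlap-add normalization are my choices.

    import numpy as np

    def phase_vocoder_stretch(x, rate, n_fft=2048, hop=512):
        """Time-stretch a mono float signal by 1/rate (rate=0.8 -> 25% slower)."""
        win = np.hanning(n_fft)
        # Analysis: STFT of windowed blocks at a fixed hop.
        S = np.array([np.fft.rfft(win * x[i:i + n_fft])
                      for i in range(0, len(x) - n_fft, hop)]).T
        # Expected phase advance per hop at each bin's center frequency.
        omega = 2 * np.pi * hop * np.arange(S.shape[0]) / n_fft
        phase = np.angle(S[:, 0])
        frames = []
        t = 0.0
        while t < S.shape[1] - 1:
            lo = int(t)
            frac = t - lo
            # Interpolate magnitude between neighboring analysis frames.
            mag = (1 - frac) * np.abs(S[:, lo]) + frac * np.abs(S[:, lo + 1])
            frames.append(mag * np.exp(1j * phase))
            # Phase deviation between adjacent frames, wrapped to [-pi, pi],
            # gives each bin's true frequency; accumulate it for synthesis.
            dphi = np.angle(S[:, lo + 1]) - np.angle(S[:, lo]) - omega
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
            phase += omega + dphi
            t += rate
        # Synthesis: inverse FFT and overlap-add at the same hop.
        y = np.zeros(len(frames) * hop + n_fft)
        norm = np.zeros_like(y)
        for k, F in enumerate(frames):
            y[k * hop:k * hop + n_fft] += win * np.fft.irfft(F, n_fft)
            norm[k * hop:k * hop + n_fft] += win * win
        return y / np.maximum(norm, 1e-8)

Stepping through the analysis frames at a fractional rate while always synthesizing at a fixed hop is what changes the duration without changing the frequencies.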

Rabiner and Schafer in 1978 put forth an alternate solution: work in the time domain, attempt to find the period of a given section of the fundamental wave with the autocorrelation function, and crossfade one period into another. This is called time domain harmonic scaling, or the synchronized overlap-add (SOLA) method; it runs somewhat faster than the phase vocoder on slower machines but fails when the autocorrelation misestimates the period of a signal with complicated harmonics (such as orchestral pieces). Cool Edit Pro seems to solve this by looking for the period closest to a center period that the user specifies, which should be an integer multiple of the tempo and between 30 Hz and the lowest bass frequency. For a 120 bpm tune, use 48 Hz, because 48 Hz = 2,880 cycles/minute = 24 cycles/beat * 120 bpm. A sketch of that search and splice appears below.
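Here's that idea in the same NumPy style: find the best period near the user's center frequency, then expand a block by splicing in a crossfaded pitch period. The function names and the ±25% search window are my own assumptions, not anything out of Rabiner and Schafer or Cool Edit Pro.

    import numpy as np

    def estimate_period(frame, sr, f_center=48.0, search=0.25):
        """Lag in samples, near sr/f_center, with the strongest autocorrelation."""
        center = int(sr / f_center)
        lo, hi = int(center * (1 - search)), int(center * (1 + search)) + 1
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        return lo + int(np.argmax(ac[lo:hi]))

    def insert_period(frame, period):
        """Expand a block by one pitch period: crossfade from the second
        period back into the first, then resume at the seam.
        Assumes len(frame) >= 2 * period."""
        fade = np.linspace(0.0, 1.0, period)
        a, b = frame[:period], frame[period:2 * period]
        blend = (1.0 - fade) * b + fade * a
        return np.concatenate([a, blend, frame[period:]])

On a perfectly periodic signal the blend collapses to an exact copy of one period, so the splice is inaudible; the crossfade is there to paper over the places where the signal is only quasi-periodic.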

High-end commercial audio processing packages combine the two techniques, using wavelet techniques to separate the signal into sinusoid and transient waveforms, applying the phase vocoder to the sinusoids, and processing transients in the time domain, producing the highest quality time stretching.
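I haven't seen any of their source, but the plumbing presumably looks something like this skeleton, where separate() stands in for whatever wavelet decomposition a given package uses:

    def hybrid_stretch(x, rate, separate, stretch_tonal, stretch_transient):
        """Skeleton of the hybrid scheme: split the signal, stretch each part
        with the method that suits it, trim to a common length, and mix."""
        tonal, transient = separate(x)          # hypothetical wavelet-based split
        a = stretch_tonal(tonal, rate)          # e.g. the phase vocoder above
        b = stretch_transient(transient, rate)  # e.g. the SOLA splice above
        n = min(len(a), len(b))
        return a[:n] + b[:n]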

These techniques can also be used to scale the pitch of an audio sample while holding time constant. (Note that I said pitch scaling, not "shifting," as pitch shifting by amplitude modulation with a complex exponential does not preserve the ratios of the partial frequencies that determine the sound's timbre.) Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts the formants into a sort of Alvin and the Chipmunks-like effect, which may be desirable or undesirable. To preserve the formants and character of the voice, you can use a "regular" channel vocoder keyed to the signal's fundamental frequency. (Fundamental following is straightforward; send me a /msg if you want me to write about it.)
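In code, pitch scaling falls out of time scaling almost for free: stretch by the ratio, then resample back to the original length. A sketch under the same assumptions as above (the linear-interpolation resampler here is crude; a real one would use a windowed-sinc filter):

    import numpy as np

    def pitch_scale(x, ratio, stretch):
        """Scale all frequencies by `ratio` (2.0 = up an octave) at constant
        duration. `stretch(x, rate)` is any time stretcher whose output is
        about len(x)/rate samples, such as phase_vocoder_stretch above."""
        y = stretch(x, 1.0 / ratio)                  # ratio times longer, same pitch
        idx = np.linspace(0, len(y) - 1, len(x))     # read it back `ratio` times faster
        return np.interp(idx, np.arange(len(y)), y)  # crude linear-interp resample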

Sources: http://www.dspdimension.com/html/timepitch.html
Further reading: comp.dsp FAQ
Application of this technique can be found in Eminenya 2.0.


Copyright © 2002 Damian Yerrick.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the writeup entitled "GNU Free Documentation License".