OK, I have a song stored as 2-channel, 16-bit linear PCM on my reasonably fast computer. I want to slow down the tempo because I'm trying to remix with another song.

"Re-perform it!" No, I don't have the source score or samples, and I don't have the vocal training; all I have is this wav file I extracted from a CD.

"Resample it!" No, resampling digital audio has an effect analogous to that of slowing down the turntable: it transposes the song to a lower key makes the singer sound like an ogre (no, not Shrek).

I guess it's time for my old friends Fourier and Wigner to come help. We'll build a phase vocoder after Flanagan, Golden, and Portnoff. Basic steps: compute the frequency/time relationship of the signal by taking the FFT of each windowed block of 2,048 samples (assuming 44.1 kHz input), do some processing of the frequencies' amplitudes and phases, and perform the inverse FFT. A good algorithm will give good results at compression/expansion ratios of up to ±25%; beyond that, the pre-echo and other smearing artifacts of frequency domain interpolation on transient ("beat") waveforms, which are not localized at all in the frequency domain, begin to take a toll on perceived audio quality.
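The basic steps can be sketched in a few lines of Python with NumPy. This is a minimal textbook-style phase vocoder under my own assumptions (mono input, a Hann window, a hop of 512 samples, linear interpolation of magnitudes), not the exact method of Flanagan, Golden, or Portnoff, and it does nothing clever about transients. The function and parameter names are made up for illustration.

```python
import numpy as np

def phase_vocoder_stretch(x, stretch, n_fft=2048, hop=512):
    """Time-stretch mono signal x by `stretch` (>1 = longer and slower), keeping pitch."""
    window = np.hanning(n_fft)
    # Analysis: FFT of each windowed block, taken every `hop` samples.
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([np.fft.rfft(window * x[i * hop:i * hop + n_fft])
                       for i in range(n_frames)])
    # Phase advance per hop expected at each bin's center frequency.
    omega = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft
    # Fractional read positions in the analysis frames, one per output frame.
    positions = np.arange(0, n_frames - 1, 1.0 / stretch)
    out = np.zeros(len(positions) * hop + n_fft)
    phase = np.angle(frames[0])
    for k, pos in enumerate(positions):
        i = min(int(pos), n_frames - 2)
        frac = pos - i
        # Amplitudes: interpolate between the two neighbouring analysis frames.
        mag = (1 - frac) * np.abs(frames[i]) + frac * np.abs(frames[i + 1])
        # Inverse FFT with the accumulated phase, then overlap-add.
        block = np.fft.irfft(mag * np.exp(1j * phase), n_fft)
        out[k * hop:k * hop + n_fft] += window * block
        # Phases: measure each bin's deviation from its expected advance,
        # wrap it to [-pi, pi], and accumulate the true advance.
        dphi = np.angle(frames[i + 1]) - np.angle(frames[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += omega + dphi
    # Overlapping squared Hann windows sum to a near-constant gain; undo it.
    # The first and last block still fade in and out.
    return out * hop / np.sum(window ** 2)
```

For a stereo file you would run each channel through it separately, for example `phase_vocoder_stretch(left, 1.25)` to slow the tempo by about a fifth; keeping the two channels phase-coherent with each other is a refinement this sketch ignores.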

Rabiner and Schafer in 1978 put forth an alternate solution: work in the time domain, attempt to find the period of a given section of the fundamental wave with the autocorrelation function, and crossfade one period into another. This is called time domain harmonic scaling or the synchronized overlap-add method. It performs somewhat faster than the phase vocoder on slower machines but fails when the autocorrelation misunderestimates the period of a signal with complicated harmonics (such as orchestral pieces). Cool Edit Pro seems to solve this by looking for the period closest to a center period that the user specifies as a frequency, which should be an integer multiple of the tempo and lie between 30 Hz and the lowest bass frequency. For a 120 bpm tune, use 48 Hz because 48 Hz = 2,880 cycles/minute = 24 cycles/beat * 120 bpm.
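Here is a rough sketch of just the period-finding step, again in Python with NumPy. It picks the best-correlated lag inside a window around a center frequency, which is my guess at the general idea and not a description of what Cool Edit Pro actually does. The function names and the 25% search range are assumptions, and the crossfade itself is left out.

```python
import numpy as np

def center_frequency(bpm, cycles_per_beat):
    """Center frequency in Hz with a whole number of cycles per beat."""
    return bpm * cycles_per_beat / 60.0

def estimate_period(block, sample_rate, center_hz, search=0.25):
    """Lag in samples with the highest normalized autocorrelation
    within +/- `search` of the period implied by center_hz."""
    center_lag = sample_rate / center_hz
    lo = max(1, int(center_lag * (1 - search)))
    hi = min(len(block) - 1, int(center_lag * (1 + search)))
    block = block - np.mean(block)
    best_lag, best_score = lo, -np.inf
    for lag in range(lo, hi + 1):
        a, b = block[:-lag], block[lag:]
        # Normalize so that lags with more overlapping samples are not favoured.
        score = np.dot(a, b) / (np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

print(center_frequency(120, 24))  # 48.0, matching the 120 bpm example
```

Restricting the search to lags near the center is what keeps a busy signal from locking onto some unrelated harmonic, at the cost of asking the user for a sensible starting value.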

High-end commercial audio processing packages combine the two techniques, using wavelet techniques to separate the signal into sinusoid and transient waveforms, applying the phase vocoder to the sinusoids, and processing transients in the time domain, producing the highest quality time stretching.

These techniques can also be used to scale the pitch of an audio sample while holding time constant. (Note that I said pitch scaling, not "shifting," as pitch shifting by amplitude modulation with a complex exponential does not preserve the ratios of the partial frequencies that determine the sound's timbre.) Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts the formants into a sort of Alvin and the Chipmunks-like effect, which may be desirable or undesirable. To preserve the formants and character of the voice, you can use a "regular" channel vocoder keyed to the signal's fundamental frequency. (Fundamental following is straightforward; send me a /msg if you want me to write about it.)
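The scaling-versus-shifting distinction is easy to see with three partials. This toy example uses a made-up harmonic tone (220, 440, 660 Hz) and compares multiplying every partial by a ratio with adding a fixed offset, which is what modulation by a complex exponential does to a spectrum.

```python
def scale_partials(partials_hz, ratio):
    """Pitch scaling: every partial is multiplied by the same ratio."""
    return [f * ratio for f in partials_hz]

def shift_partials(partials_hz, offset_hz):
    """Frequency shifting: every partial moves by the same number of Hz."""
    return [f + offset_hz for f in partials_hz]

def ratios_to_fundamental(partials_hz):
    """Each partial expressed as a multiple of the lowest one."""
    return [round(f / partials_hz[0], 3) for f in partials_hz]

partials = [220.0, 440.0, 660.0]           # hypothetical tone with partials 1:2:3
scaled = scale_partials(partials, 1.5)     # [330.0, 660.0, 990.0]
shifted = shift_partials(partials, 110.0)  # [330.0, 550.0, 770.0]

print(ratios_to_fundamental(scaled))   # [1.0, 2.0, 3.0]
print(ratios_to_fundamental(shifted))  # [1.0, 1.667, 2.333]
```

Both versions move the fundamental to 330 Hz, but only the scaled one is still a harmonic series; the shifted one comes out inharmonic and sounds metallic.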

Sources: http://www.dspdimension.com/html/timepitch.html
Further reading: comp.dsp FAQ
Application of this technique can be found in Eminenya 2.0.


Copyright © 2002 Damian Yerrick.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the writeup entitled "GNU Free Documentation License".
Note that while this may sound like a lot of complicated mathematical mumbo jumbo to you, the aspiring Britney Spears remix artist, the reality is that you just import your sample into your multitrack audio application of choice (Pro Tools, Sonic Foundry's Vegas Audio, Cubase, Logic Audio, Cakewalk...), route the channel with your sample to an effects plugin, select the time stretch/pitch shift plugin of your choice, tell it how much to stretch, and you're good to go. It's remarkably easy.

In fact, one application, Sonic Foundry's ACID, does automatic real time pitch- and time-scaling for every sample you use, according to a tempo and key you set, with remarkable results. It's not the slickest algorithm I've heard, but it is computationally lightweight, and sounds good enough that for moderate amounts of shifting most people will never notice a difference. Plus I believe they use a better algorithm when you render your track to an audio file.

Time- and pitch-scaling software and hardware are incredibly useful, and the market today spans a very clear spectrum from inexpensive consumer products to high-end professional products with high-end professional prices. The current king of software plugins is Serato's Pitch N Time 2.0.1, which will run you USD$800 and comes only in AudioSuite format, but produces breathtakingly clear results. Roland also has a newish sample playback synth, the VP-9000 Variphrase Processor, which does DSP-based realtime tempo- and pitch-matching, as well as bends and other, more synth-y, manipulations. To take one home, however, will set you back about USD$2500...
