Introduction
AAC is short for Advanced Audio Coding (not Codec or Compression). It is an international standard defined by the ISO and IEC, standard number 13818, part 7. The first edition was published on 1997-12-01. ISO/IEC 13818 is more commonly known as MPEG-2, and is titled "Generic coding of moving pictures and associated audio information".
MPEG-2 actually defines 2 standards for audio coding. The first version of MPEG-2 defined an audio codec based on the audio coding in MPEG-1. MPEG-1 audio, which includes MP3 (more properly known as MPEG-1 Layer III), was included as part of the MPEG-1 digital video standard (ISO/IEC 11172). MPEG-1 audio is divided into 3 levels of increasing complexity, known as layers. Each layer adds new features to the coding, but they share a common basic structure.
The update of MPEG-1 audio coding is commonly called MPEG-2 BC audio, short for backwards compatible. MPEG-2 BC can be played back on an MPEG-1 decoder in a limited fashion, and MPEG-1 audio can be played on an MPEG-2 BC decoder.
AAC was a later addition to MPEG-2. It uses a different compression method to any of the MPEG-1 codecs, and is not compatible with MPEG-1 encoders or decoders. MP3 is unnecessarily complex to preserve compatibility with MPEG-1 Layers I and II, and sacrificing backwards compatibility allows AAC to offer better compression.
AAC also offers facilities to encode up to 48 channels, instead of the 2 channels in MPEG-1, and this includes up to 15 LFE (Low Frequency Enhancement) channels, which are used in home cinema systems to drive a sub-woofer for a bigger bass sound. It also supports more sampling rates, from 8 kHz to 96 kHz.
Technical details
In the standard, AAC is defined not by describing an encoder, but by describing the operation of a decoder. The logic of this is that it may be possible to improve the efficiency of encoding data by modifying the encoder, but it is essential for all decoders to be able to decode the same bit stream.
However, it is more useful to consider the elements involved in AAC compression if we consider the encoder operation. The first stage in encoding is to divide the PCM audio data into blocks. In AAC, each block corresponds to 1024 audio samples (or ADC values) per channel.
Like MPEG-1 and most other audio codecs - ATRAC, Dolby AC3, WMA - AAC is a perceptual coder. I will not attempt to explain the full operation of a perceptual coder here, but the basic principle is to divide sounds into those which the listener can perceive clearly and those which the listener cannot perceive, and to remove the latter.
The encoder has the following components:
The first component, the filter bank, is an MDCT - Modified Discrete Cosine Transform - filter, which converts data sampled in the time domain into values corresponding to individual frequencies. Unlike any of the MPEG-1 audio standards, AAC is a pure transform coder. The MPEG-1 codecs all use a 32 sub-band polyphase filter, which splits up the time domain data into frequency bands, and Layer III adds an 18-point MDCT after it.
Instead of the hybrid filter in MP3, AAC uses a simple 2048-input MDCT, which produces 1024 spectral (frequency domain) coefficients from the 1024 samples for the current block combined with the 1024 from the previous block (this overlapping smooths out transitions between blocks). Already, this increase in the number of spectral coefficients above the 576 in MP3 offers an improvement in frequency resolution and hence sound quality.
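The overlapping and the 2-to-1 reduction in values described above can be sketched in a few lines of Python. This is only an illustration of the textbook MDCT formula, not the code of any real encoder: the function names are made up for this example, the transform is the slow direct form, and the window that a real AAC encoder applies to each frame before the transform is left out.

```python
import math

def mdct(frame):
    """Direct O(N^2) MDCT: 2N time samples in, N spectral coefficients out."""
    two_n = len(frame)
    n_half = two_n // 2
    return [
        sum(
            frame[n] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(two_n)
        )
        for k in range(n_half)
    ]

def overlapped_frames(samples, block):
    """Yield frames of 2*block samples: the previous block followed by the current one.

    The very first frame uses a block of zeros as its "previous" block
    (an assumption made here to keep the sketch short).
    """
    prev = [0.0] * block
    for start in range(0, len(samples) - block + 1, block):
        cur = samples[start:start + block]
        yield prev + cur
        prev = cur
```

With block set to 1024, each frame holds 2048 samples and mdct returns 1024 coefficients, matching the figures in the text. A practical implementation would use a fast algorithm rather than this double loop.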
Next is Temporal Noise Shaping or TNS. This is designed to reduce noise introduced by the quantizer (see below), accomplished by applying a simple filter across blocks of spectral coefficients. This allows for fine adjustments to the spectral coefficients.
Intensity stereo is a method of effectively encoding stereo information. Rather than encode 2 channels separately, they are combined to give a mono audio stream and a stereo position; the spectral coefficients are divided up into contiguous blocks and each block is assigned a stereo position. This is effective because the two channels in a stereo broadcast typically share information (having a common source).
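The idea of "one mono stream plus a position per block" can be sketched as follows. This is a simplified illustration and not the bit stream format of the standard: the function names, the use of a plain sum for the mono signal, and the representation of the position as a number between 0 (fully right) and 1 (fully left) are all assumptions made for this example.

```python
import math

def intensity_encode(left, right, band_size):
    """Return (mono coefficients, one stereo position per band)."""
    mono, positions = [], []
    for start in range(0, len(left), band_size):
        l_band = left[start:start + band_size]
        r_band = right[start:start + band_size]
        # Overall amplitude of each channel within this band.
        amp_l = math.sqrt(sum(v * v for v in l_band))
        amp_r = math.sqrt(sum(v * v for v in r_band))
        total = amp_l + amp_r
        # Share of the amplitude that belongs to the left channel.
        positions.append(amp_l / total if total > 0 else 0.5)
        mono.extend(a + b for a, b in zip(l_band, r_band))
    return mono, positions

def intensity_decode(mono, positions, band_size):
    """Split the mono signal back into left and right using the positions."""
    left, right = [], []
    for i, value in enumerate(mono):
        pos = positions[i // band_size]
        left.append(value * pos)
        right.append(value * (1.0 - pos))
    return left, right
```

In this sketch the reconstruction is exact only when the two channels in a band are scaled copies of each other with the same sign. Any other difference between them is lost, which is the price paid for storing one channel instead of two.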
Coupling allows data from one channel to be combined with that from another channel. This is used for sophisticated effects, such as transmitting a single sound effects channel and multiple dialog channels for different languages. The appropriate dialog can be mixed onto the sound effects channel depending on the language selected by the user. Two modes of coupling are supported, depending on when the combination is performed.
Prediction uses the method of backward prediction to efficiently encode signals which do not change much from block to block. A predictor within the encoder estimates the expected value for a spectral coefficient based on the previous two values for that channel. Rather than encoding the new value, the difference from the expected value is encoded.
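A minimal sketch of the principle is given below. It follows one spectral coefficient from block to block and assumes a fixed predictor that simply extrapolates a straight line through the previous two values. That predictor, and the function names, are assumptions chosen for illustration; the predictor in the standard is more elaborate. Quantization is also ignored here, so the round trip is exact.

```python
def predict(prev1, prev2):
    """Guess the next value by extending the line through the last two values."""
    return 2 * prev1 - prev2

def encode_residuals(values):
    """Replace each value by its difference from the predicted value."""
    prev2, prev1 = 0, 0
    residuals = []
    for value in values:
        residuals.append(value - predict(prev1, prev2))
        prev2, prev1 = prev1, value
    return residuals

def decode_residuals(residuals):
    """Rebuild the values by running the same predictor over the decoded history."""
    prev2, prev1 = 0, 0
    values = []
    for residual in residuals:
        value = predict(prev1, prev2) + residual
        values.append(value)
        prev2, prev1 = prev1, value
    return values
```

Because the decoder forms its prediction from values it has already decoded, no predictor settings need to be transmitted, which is what makes the prediction "backward". For a slowly changing series such as 10, 11, 12, 13 the residuals shrink to zero once the predictor has two values of history.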
Mid/side stereo, or M/S stereo, is another method of encoding stereo. All the stereo modes are optional, and any one can be selected, or stereo can be encoded as 2 separate channels (this may be decided by the user encoding the data, or by algorithms in the encoder). M/S stereo encodes stereo not as left and right channels (L and R), but as a centre channel M=(L+R)/2 and a difference S=(L-R)/2, from which the decoder recovers L=M+S and R=M-S. This offers similar advantages to intensity stereo, but generally slightly higher quality.
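The M/S transform is easy to show in code. The sketch below assumes the symmetric convention M=(L+R)/2 and S=(L-R)/2, so that L=M+S and R=M-S; the function names are invented for this example, and the values would in practice be spectral coefficients rather than raw samples.

```python
def ms_encode(left, right):
    """Convert left/right values into mid (average) and side (half difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Recover left and right from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

When the two channels are nearly identical, the side values are close to zero and cost very few bits to store. Unlike intensity stereo, nothing is discarded by the transform itself.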
Scale factors and the quantizer perform the main perceptual coding. The goal here is to divide down blocks of spectral coefficients so they can be stored as smaller values, with a loss of precision (e.g. 21, 22, 12, 5 divided by 5 and rounded to the nearest integer to give 4, 4, 2, 1). The amount they are divided down by is given by a scale factor stored for a group of spectral coefficients (in the case given, the scale factor is 5). After scaling, values are quantized by a non-linear transform, so smaller values are encoded with a greater resolution than larger values.
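The two steps can be sketched as follows, reusing the numbers 21, 22, 12, 5 and the scale factor 5 from the example above. The function names are made up for this illustration, rounding to the nearest integer is assumed, and the non-linear step is modelled as a simple power law with an exponent of 3/4. Treat that exponent, and the absence of the exact rounding offsets of a real encoder, as assumptions.

```python
def scale_and_round(coeffs, scale):
    """Divide each coefficient by the scale factor and round to the nearest integer."""
    return [round(c / scale) for c in coeffs]

def rescale(quantized, scale):
    """Decoder side: multiply back up; the lost precision cannot be recovered."""
    return [q * scale for q in quantized]

def compand(value, exponent=0.75):
    """Non-linear quantization: the step between levels grows with magnitude."""
    sign = -1 if value < 0 else 1
    return sign * round(abs(value) ** exponent)

quantized = scale_and_round([21, 22, 12, 5], 5)
restored = rescale(quantized, 5)
```

Comparing restored with the original values shows the precision lost for that group of coefficients. A larger scale factor gives smaller numbers to store but a larger error, and the encoder's perceptual model decides how much error each group can tolerate.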
Noiseless coding is the final stage. Huffman coding is used to efficiently encode pairs or quadruples of quantized spectral coefficients. Finally these Huffman-coded values are written into the output bit stream along with various parameters used to specify exactly which methods of encoding are used (prediction on or off, etc.), and parameters such as the scale factors.
The difficulty in writing a good encoder is to find a perceptual model which allows you to remove parts of the audio signal that the listener won't notice, but to preserve the quality of those parts of the signal which the listener will notice, all with the lowest possible bit rate.
In contrast to the complex decision-making in the encoder, decoding data is largely a mechanical process, involving using the parameters in the bit stream to reconstruct audio data according to the standard. Decoders are therefore considerably simpler.
AAC is divided up into 3 profiles which specify the complexity of the decoder and the features involved. The three profiles are:
- Main profile, which uses all the tools described above.
- Low Complexity profile, which omits the prediction tool, and restricts some other parameters.
- Scalable Sampling Rate, which allows the receiver to output data at a fraction of the transmitted sampling rate, although the additional data needed for this leads to a reduction in quality compared to Main profile or Low Complexity profile at the same bit rate. SSR adds a gain control tool to the codec, and removes prediction.
Low Complexity is the most commonly implemented profile, with a complexity similar to that of MP3, but an increased quality - LC AAC at 96 kbps is similar to MP3 at 128 kbps.
Applications
MP3 is still the most common format for digital audio compression. AAC is designed to compete in the market for high-quality audio compression, such as for storing audio files on one's own computer, rather than for streaming audio over the internet. It is particularly suited to home cinema and digital television, because it offers high quality multi-channel audio at data rates far lower than Dolby Digital (AC3), and the large number of channels supported allows for applications such as multi-lingual broadcasting, descriptions for the hard of hearing, and commentary tracks.
AAC was created by a number of companies, principally Dolby, the Fraunhofer Institute, Sony and AT&T. Because they hold software patents on the technology involved, it is necessary to obtain a license from Dolby to distribute an encoder or decoder. This means that although it is an international standard, you still have to pay commercial organizations to distribute software to decode AAC files.
Reference code and the standard are obtainable from ISO or your local standards body (e.g. BSI). They will however cost several hundred USD.
Few AAC codecs are currently available for use on PCs. A number of implementations are available for use in portable digital audio players, and it is increasingly becoming offered in these devices.
It is also being used by a number of companies such as Nokia as a basis for encrypted proprietary formats. Because of the threatening behavior of organizations representing copyright holders in the USA and elsewhere, hardware companies feel they must include copyright protection technology in their devices, even if this means (as with the Nokia 5510) that files must be re-encoded for use on the device. Coupled with the ongoing collapse of SDMI, which promised to offer a global standard for IP rights protection, this is leading to a worrying market fragmentation. That fragmentation serves only to make it harder for any technology other than MP3, which offers no copyright protection whatsoever, to become widespread.
AAC has also been chosen by Japanese authorities for digital television and digital radio broadcasts, and is being used for digital radio in the USA, although elsewhere digital radio is tending towards the technically inferior Eureka 147 standard, based on MPEG-1 Layer II (which was in turn based on MUSICAM).
Listening tests have repeatedly shown the merits of AAC over both MP3 and Microsoft's Windows Media Audio, although it is probably less widely used than either. One such test was: David Meares, Kaoru Watanabe, Eric Scheirer, “Report on the MPEG-2 AAC Stereo Verification Tests”, Feb 1998, http://www.tnt.uni-hannover.de/project/mpeg/audio/public/w2006.pdf (formal designation: ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio N2006).
Although AAC is technically superior to other codecs at high-fidelity sound reproduction, and still much better than MP3 at much lower bit rates, that is no guarantee that it will achieve the success it deserves.
(See also: How audio compression works)