Wave File Format Implementation Errors

I've read through and written many wave reader/writer implementations over the years, and most of them are wrong. I tend to blame this on the vast number of websites which incorrectly "document" the wave format rather than pointing to the original IBM/Microsoft specifications (which, to be fair, are also pretty poor).

This post describes some of the main points of contention which exist in the wave format and points out some of the common issues which I've seen in implementations. It is a long post but I hope it will be useful to people writing software which uses wave files.

I will refer to the following three documents in this post:

  • IBM and Microsoft's "Multimedia Programming Interface and Data Specifications" version 1.0 dated August 1991.
  • Microsoft's "New Multimedia Data Types and Data Techniques" version 3.0 dated April 1994.
  • The MSDN article "Multiple Channel Audio Data and WAVE Files" dated 7th March 2007.

All of these documents (including the MSDN page) are available as PDF downloads here.

Wave chunks are two-byte aligned

All chunks (even the main RIFF chunk) must be two-byte aligned ("Multimedia Programming Interface and Data Specifications", pp. 11). An implementation which fails to do this will still read or write wave files correctly most of the time, unless it is:

  • Writing an 8-bit mono wave file with an odd number of samples that has chunks following the data chunk,
  • Reading or writing certain compressed wave formats, or
  • Writing more complicated chunks such as "associated data lists", which may contain strings with odd numbers of characters.

The format chunk is not a fixed sized record

The format chunk contains the structure given below followed by an optional set of "format-specific-fields" ("Multimedia Programming Interface and Data Specifications", pp. 56):

struct {
    WORD wFormatTag;        // Format category
    WORD wChannels;         // Number of channels
    DWORD dwSamplesPerSec;  // Sampling rate
    DWORD dwAvgBytesPerSec; // For buffer estimation
    WORD wBlockAlign;       // Data block size
};

For WAVE_FORMAT_PCM types, the "format-specific-fields" is simply a single WORD named wBitsPerSample which:

... specifies the number of bits of data used to represent each sample of each channel
("Multimedia Programming Interface and Data Specifications", pp. 58)

Be aware that this definition is vague and could mean either:

  • The number of valid resolution bits or
  • The bits used by the sample container

The WAVE_FORMAT_EXTENSIBLE format tag solves this confusion by defining wBitsPerSample to be the container size and providing an additional wValidBitsPerSample field ("Multiple Channel Audio Data and WAVE Files", pp. 3-4).

The original specification did not specify what "format-specific-fields" contained for any other format. However, in the 1994 update, the use of WAVEFORMATEX was mandated for any wFormatTag which is not WAVE_FORMAT_PCM ("New Multimedia Data Types and Data Techniques", pp. 19). This structure is defined as follows:

/* general extended waveform format structure */
/* Use this for all NON PCM formats */
/* (information common to all formats) */
typedef struct waveformat_extended_tag {
    WORD wFormatTag;       /* format type */
    WORD nChannels;        /* number of channels (i.e. mono, stereo...) */
    DWORD nSamplesPerSec;  /* sample rate */
    DWORD nAvgBytesPerSec; /* for buffer estimation */
    WORD nBlockAlign;      /* block size of data */
    WORD wBitsPerSample;   /* Number of bits per sample of mono data */
    WORD cbSize;           /* The count in bytes of the extra size */
} WAVEFORMATEX;

What this means is that all wave formats contain the members up to and including wBitsPerSample. Only wave files with formats which are not WAVE_FORMAT_PCM are required to have the cbSize member.

When WAVE_FORMAT_EXTENSIBLE should be used

WAVE_FORMAT_EXTENSIBLE should be used when:

  • The channel configuration of the wave file is not mono or left/right. This is because other channel configurations are ambiguous unless WAVE_FORMAT_EXTENSIBLE is used.
  • The valid data bits per sample is not a multiple of 8. This is because the meaning of wBitsPerSample in the wave format is ambiguous unless WAVE_FORMAT_EXTENSIBLE is used.

WAVE_FORMAT_EXTENSIBLE should not be used when:

  • Compatibility with ancient, incorrect and/or broken wave reading implementations is required.

FACT chunks are required for any wave format which is not WAVE_FORMAT_PCM

The following dot points are taken from "Multimedia Programming Interface and Data Specifications" (pp. 61) and "New Multimedia Data Types and Data Techniques" (pp. 12) in relation to the 'fact' chunk:

  • 1991) "The 'fact' chunk is required if the waveform data is contained in a 'wavl' LIST chunk and for all compressed audio formats. The chunk is not required for PCM files using the 'data' chunk format."
  • 1994) "The fact chunk is required for all new WAVE formats. The chunk is not required for the standard WAVE_FORMAT_PCM files."
  • 1991 and 1994) "The 'fact' chunk will be expanded to include any other information required by future WAVE formats. Added fields will appear following the dwSampleLength field. Applications can use the chunk size field to determine which fields are present."

From this information, a fact chunk is always necessary when the format chunk has wFormatTag set to anything other than WAVE_FORMAT_PCM (this includes WAVE_FORMAT_IEEE_FLOAT). From recollection, Windows Media Player will not play floating point wave files which do not have a fact chunk - however I am not prepared to install Windows to verify whether this is still true.

On the ordering of wave chunks

I saved the worst point for the end...

Almost every website I have seen which attempts to define the wave format states: wave chunks can be in any order but the format chunk must precede the data chunk. One website in particular even stated (although it did not recommend) that the format chunk could come after the data chunk - this is plain wrong ("Multimedia Programming Interface and Data Specifications", pp. 56). Unfortunately, in relation to all other chunks, the specification appears to be inconsistent. The original specification states, while defining the RIFF format:

... Following the form-type code is a series of subchunks. Which subchunks are present depends on the form type. The definition of a particular RIFF form typically includes the following:

  • A unique four-character code identifying the form type
  • A list of mandatory chunks
  • A list of optional chunks
  • Possibly, a required order for the chunks

Multimedia Programming Interface and Data Specifications, page 12

The presence of the fourth point hints that maybe the ordering of the chunks does not matter. However, on reading the "Extended Notation for Representing RIFF Form Definitions" (page 17 onwards), many examples are given which seem to suggest that mandatory ordering is implied through the grammar. See the sections on <name:type>, [elements], element... and [element]....

The form of a RIFF/WAVE file is defined on page 56 of "Multimedia Programming Interface and Data Specifications" and again on page 12 of "New Multimedia Data Types and Data Techniques" as:

<fmt-ck>            // Format
[<fact-ck>]         // Fact chunk
[<cue-ck>]          // Cue points
[<playlist-ck>]     // Playlist
[<assoc-data-list>] // Associated data list
<wave-data>         // Wave data

Given the notation examples, this grammar would suggest that the chunks listed should be supplied in that order. Chunks which are not listed in the grammar, but still defined for the form type (for example, smpl or inst chunks), presumably can be located anywhere.

That being said, if we look at the grammar for the associated data list chunk (pp. 63 of the wave specification):

<labl-ck> // Label
<note-ck> // Note
<ltxt-ck> // Text with data length
<file-ck> // Media file

It would appear that this list must contain exactly one label, one note, one labelled-text and one media-file chunk. Given the purpose of the chunk, this is clearly an error in the specification. The definition given in "New Multimedia Data Types and Data Techniques" (pp. 14) is even worse - it is completely mangled on the page. There is a stray close-brace floating towards the end of the definition (which hints that there may have been an attempt to fix the issue) but it is still incorrect. All this tends to suggest that the grammar used to define RIFF/WAVE chunks is unreliable.

In writing an implementation, I would presume that the most correct solution would be to write the chunks in the order supplied by the grammar with any additional chunks being written anywhere. *sigh* If anyone can provide more information from a reliable source to clarify the chunk ordering issue, I would appreciate it.

That's enough. I'm going to make dinner.