# A Decimation in Frequency Real FFT

## Problem Background

Most resources I've found for computing the FFT of a real sequence use the two-for-one trick (such as this good one: FFT of Pure Real Sequences). The "trick" resembles a real FFT using a decimation-in-time formulation. Here is a rough summary of how it works:

• The input sequence of length $N$ is split into two length $N/2$ sequences: one of the even-indexed elements, one of the odd-indexed elements.
• Treat the even element sequence and odd element sequence as the real and imaginary components of a length $N/2$ sequence.
• Find the length $N/2$ DFT of this sequence.
• Using the symmetry properties of the DFT, extract the spectrum of the real sequence and imaginary sequence.
• Perform a decimation in time twiddle to get the results.

This is fine if the output needs to be in a particular order, but certain applications such as fast convolution could not care less about the ordering of the output bins (as has been done before by Dan Bernstein here - I'll elaborate more on this after the example), and relaxing the ordering requirement may enable a faster implementation.

## Something Interesting

If you've never written a recursive DIF/DIT FFT before, you should do so before reading on, as some interesting things pop out of the implementation (which I will get to after the example - it's also a valuable learning exercise). Below is the code for a radix-2 decimation-in-frequency FFT:

```c
void fft_cplx(float *inout, float *scratch, unsigned len)
{
    unsigned i;
    if (len == 1)
        return;
    for (i = 0; i < len / 2; i++) {
        float re0 = inout[2*i+0];
        float im0 = inout[2*i+1];
        float re1 = inout[2*i+len+0];
        float im1 = inout[2*i+len+1];
        float twr = cosf(i * -2.0f * M_PI / len);
        float twi = sinf(i * -2.0f * M_PI / len);
        float sr  = re0 - re1;
        float si  = im0 - im1;
        scratch[2*i+0]     = re0 + re1;
        scratch[2*i+1]     = im0 + im1;
        scratch[2*i+len+0] = sr * twr - si * twi;
        scratch[2*i+len+1] = sr * twi + si * twr;
    }
    fft_cplx(scratch,     inout, len/2);
    fft_cplx(scratch+len, inout, len/2);
    for (i = 0; i < len / 2; i++) {
        inout[4*i+0] = scratch[2*i+0];
        inout[4*i+1] = scratch[2*i+1];
        inout[4*i+2] = scratch[2*i+len+0];
        inout[4*i+3] = scratch[2*i+len+1];
    }
}
```

The interesting things to note about the above code are:

• The first loop can operate in-place (but doesn't).
• The re-ordering operation could just be a copy if we do not care about the order of the output bins.
• If we change the re-order operation into a copy and modify the first loop to operate in place, we can call the sub-FFTs in-place and no scratch buffer is needed at all - i.e. we can compute the FFT in-place, but our outputs will be completely out-of-order.

It is also straightforward to write a function that "undoes" this FFT by performing the steps in reverse. It turns out that this inverse function is exactly the implementation of a decimation-in-time structure, i.e. a decimation-in-frequency FFT is pretty much a decimation-in-time FFT with the steps performed in reverse. If we remove all the post data-reordering in our DIF FFT and all the pre data-reordering in our DIT IFFT, we can perform a transform that gives bins in a strange order ("bit-reversed" only applies to radix-2 transforms, and this algorithm can run mixed radix!) and then transform those strange-order bins back into the original sequence. This is interesting for FFT-based convolution because we can multiply the results of two forward transforms whose outputs are in a completely crazy order and run the inverse transform to get back the convolved sequence - and this can all be done in-place! This is usually a serious performance win on modern processors.

Another big take-away point here is that if we can write a recursive transform where the code after the recursion only re-orders or conjugates output values, we can remove that entire step when the algorithm is used in applications where the ordering of the bins does not matter. This is why the usual real, two-for-one, DIT-style FFT algorithm is not particularly good: the twiddles occur after the recursion and each output bin depends on two bins (as we are relying on DFT symmetry properties to extract the spectra of two real sequences, where one is rammed into the real component and the other into the imaginary component - read the article I linked to at the start of this post for more comprehensive information).

All of the above comments were related to the complex input implementation... so what about a real transform?

## Creating a Real Input FFT

We know that a DIF FFT algorithm always ends up with the reorder occurring after the recursion, so let's see if we can formulate a DIF-style FFT that will operate on real data which we can solve recursively:

$$X_{2k} = \sum_{n=0}^{\frac{N}{2}-1} \left(x_n + x_{n+\frac{N}{2}}\right) e^{-\frac{2\pi i n k}{N/2}}$$

And:

$$X_{2k+1} = \sum_{n=0}^{\frac{N}{2}-1} \left(x_n - x_{n+\frac{N}{2}}\right) e^{-\frac{2\pi i n}{N}} e^{-\frac{2\pi i n k}{N/2}}$$

Nothing special there, we've just arrived at the text-book DIF transform. What can we see here? First, the $X_{2 k}$ terms can be found by recursing into the real FFT that we are building, as the terms $x_n + x_{n+\frac{N}{2}}$ are real. The question is what we can do about the odd output terms - we don't have a transform that does this... or do we? Let's define a new real transform as:

$$Y_k = \sum_{n=0}^{N-1} y_n e^{-\frac{2\pi i n \left(k+\frac{1}{2}\right)}{N}}$$

This is an FFT with a half-bin shift, which gives a conjugate-symmetric response for a real input $y_n$ - but unlike a normal real FFT, which has conjugate symmetry about DC, this has conjugate symmetry about $-\frac{1}{2}$. i.e.

$$Y_{N-1-k} = Y_k^*$$

The above tells us that if we compute the $Y_{2n}$ terms (or the $Y_{N-1-2n}$ terms, or the first half, or the second half of the output terms - it doesn't actually matter), we actually have the entire spectrum for a real input. Here is an example for $N=4$ to illustrate: $Y_0 = Y_3^*$ and $Y_1 = Y_2^*$, so the two even-indexed outputs $Y_0$ and $Y_2$ determine all four bins.

Define a function for computing the $Y_{2k}$ values:

$$Y_{2k} = \sum_{n=0}^{N-1} y_n e^{-\frac{2\pi i n \left(2k+\frac{1}{2}\right)}{N}}$$

We need to bring this back into a DFT form by making the summation over half the elements:

$$Y_{2k} = \sum_{n=0}^{\frac{N}{2}-1} \left(y_n - i\, y_{n+\frac{N}{2}}\right) e^{-\frac{\pi i n}{N}}\, e^{-\frac{2\pi i n k}{N/2}}$$

This transforms $N$ real elements into an $N/2$ element complex spectrum and can be used to find the $X_{2k+1}$ terms we needed for the real DFT. It can be seen from the above that this algorithm only performs a data combination step and a complex multiply before a normal DFT. If we make this DFT a DIF-style implementation, we satisfy the requirement that the only operations after the recursion are moves and conjugates. Here is a link to a boring implementation: real_fft_example.c.

Something I find particularly cool about the algorithm is that it naturally packs the DC and Nyquist bins into the real and imaginary components of the first output as part of the recursion (regardless of whether we re-order the output or not - meaning that for convolution we still know which outputs are DC and Nyquist!). This is what most other libraries do, but there it ends up looking like a "hack" (at least in the two-for-one implementation).

## Why you probably shouldn't bother

... because there are loads of great libraries out there that do fast convolution of real sequences (FFTW) and this is probably not a particularly good design. The implementation ends up recursing into two different functions: itself and a complex FFT - which isn't the worst thing in the world, but it's also not really that good if you are trying to build a general-purpose FFT library or want to execute the FFT pass-at-a-time. If you have a fast FFT pass, it's not going to be that useful for this algorithm, which would need its own optimised implementation.

This was just a bit of fun.

# Release Alignment in Sampled Pipe Organs - Part 1

At the most basic level, a sample from a digital pipe organ contains:

• an attack transient leading into
• a looped sustain block and
• a release which will be cross-faded into when the note is released.

The release cross-fade must be fast (otherwise it will not sound natural or transient details may be lost) and it must also be phase-aligned to the point where the cross-fade begins.

## The necessity for phase alignment

Without phase aligning the release, disturbing artefacts will likely be introduced. The effects are different with short and long cross-fades but are always unpleasant.

The following image shows an ideal cross-fade into a release sample. The crossfade begins at 0.1 seconds and lasts for 0.05 seconds. The release is aligned properly and the signal looks continuous.

The following image shows a bad release where the cross-fade lags the ideal release offset by half a period. Some cancellation occurs during the cross-fade and the result will sound something like a "pluck" for long cross-fades or a "click" for short cross-fades.

(The cross-fade used in generating the above data sets was a raised cosine - linear cross-fades can be used but will result in worse distortions).

The problem of aligning release cross-fades in virtual pipe organs is an interesting one. As an example: at the time of writing this article, release alignment in the GrandOrgue project is not particularly good; it uses a lookup-table taking the value and first-order estimated derivative (both quantised heavily) of the last sample of the last played block as keys. This is not optimal as a single sample says nothing about phase and the first-order derivative estimate could be completely incorrect in the presence of noise.

## Another approach for handling release alignment

If the pitch of a pipe were completely stable and known ($f=\frac{1}{T}$), and we knew one point where the release was perfectly aligned ($t_r$), we know that we could cross-fade into the start of the release at:

$$t = t_r + k T \quad \text{for any integer } k$$

Hence, for any sample offset we could compute an offset into the release to cross-fade into.

In reality, pipe pitch wobbles around a bit and so the above would not strictly hold all the time - that being said, it is true for much of the time. If we could take a pipe sample and find all of the points where the release is aligned we could always find the best way to align the release.

It turns out that a simple way to do this is to find the cross-correlation of the attack and sustain segment with a short portion of the release. Taking the whole release would be problematic because as it decays it becomes less similar to the sustaining segment (which leads to an unhelpful correlation signal).

The above image shows the attack and some sustain of bottom-C of the St. Augustine's Closed Horn. This shows visually why single sample amplitude and derivative matching is a poor way to align releases. During one period of the closed horn, there are 14 zero crossings and 16 obvious zero crossings in the derivative. One sample gives hardly enough information.

The above image shows a 1024 sample segment taken from the release marker of the same Closed Horn sample. It contains just over a single period of the horn.

The next image shows the cross-correlation of this release segment with the sample itself. My analysis program does correlation of the left and right channels and sums them to provide an overall correlation. Positive maximums correspond to points where the release will phase-align well. Minimums correspond to points where the signal has the least correlation to the release.

Using the correlation and a pitch guesstimate, we could construct a function which given any sample offset in the attack/sustain could produce an offset into the release which we should cross-fade into. This is for next time.

# Understanding the Modified Discrete Cosine Transform (MDCT)

After playing around with discrete cosine transforms, I thought I would implement an MDCT and document my understanding of how everything works. I use some similar techniques to those used on the Wikipedia page as they are helpful for understanding but will add some scanned drawings which I think help (I'm not even close to being clever enough to get a computer to draw these for me).

## Prerequisites

The only real background knowledge which I think is relevant to understanding the MDCT is the data extensions which the DCT-4 transform assumes.

I'll refer to the above image in the Forward Transform overview, but for the time being, only pay attention to the solid quarter wave. This is the first basis function (i.e. $k=0$) for an $N$ length DCT-4. If the basis is continued past $N$, it has a repeating symmetrical pattern (the dashed line in the image) which repeats every $4N$. The symmetry is even around $-0.5$ and odd around $N-0.5$, and this holds for every basis function of the DCT-4, i.e. the DCT-4 assumes that the input data continues on forever, repeating itself in the following manner: $x_n$, $-x_{N-n-1}$, $-x_n$, $x_{N-n-1}$.

## Forward Transform

The MDCT takes in $2N$ real data points and produces $N$ real outputs. These inputs are designed to overlap, so the first half of the input data should be the second half of the input data of the previous call. The definition is:

$$X_k = \sum_{n=0}^{2N-1} x_n \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right]$$

It should be trivial to see from the above that the MDCT can be computed using a DCT-4 with an extended number of input data points, all of which have been shifted by half a basis. Go back to the crappy drawing and notice the concatenated $N/2$ length sequences $a$, $b$, $c$ and $d$. The total length of this sequence is $2N$ and it begins at $N/2$ (or half the length of a basis function). We need to get $b$, $c$ and $d$ back into the $N$ point region if we want to compute the MDCT using a DCT-4. This can be achieved with the following concatenated sequence (I will subscript these sequences with $r$ to denote a reversal of the sequence):

$$\left(-c_r - d,\; a - b_r\right)$$

If we take the DCT-4 of this concatenated sequence, we have found the MDCT of the input sequence.

## Inverse Transform

The inverse MDCT or IMDCT takes in $N$ real data points and produces $2N$ real outputs. In this transform, the outputs should overlap such that the first half of the output should be added to the second half of the output data in the previous call. The definition is:

$$y_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right]$$

Because we know how the DCT-4 assumes the input and output data repeat in a symmetric pattern, we can get this data trivially in exactly the same fashion as we did in the forward transform. In the following illustration, we take the output from the forward transform and extend it along the basis:

In output row zero, we can see how to extend the input sequence to obtain the 2N points required. We then see in rows two and three how summing the overlapping blocks causes the aliased sequences to cancel in subsequent calls to the IMDCT.

## World's Dumbest C MDCT Implementation

I validated that all this actually works with a small C program. Following are the MDCT/IMDCT implementations I came up with... ignore the "twid" input; I cache the modulation factors for the FFT which gets called in the dct4 routine:

```c
/* state should contain double the number of elements as the input buffer (N)
 * and should have all elements initialized to zero prior to calling. The
 * output buffer is actually the first N elements of state after calling. */
void mdct(double *state, const double *input, double *twid, unsigned lenbits)
{
    unsigned rl = 1u << lenbits;
    unsigned i;
    /* Alias the input data with the previous block. */
    for (i = 0; i < rl / 2; i++) {
        state[i]      = - input[rl/2+i] - input[rl/2-i-1];
        state[rl/2+i] = state[rl+i] - state[rl+rl-i-1];
    }
    /* Save the input block */
    for (i = 0; i < rl; i++)
        state[rl+i] = input[i];
    /* DCT-4 */
    dct4(state, lenbits, twid);
}

/* state should contain double the number of elements as the input buffer (N)
 * and should have all elements initialized to zero prior to calling. The
 * output buffer is actually the first N elements of state after calling. */
void imdct(double *state, const double *input, double *twid, unsigned lenbits)
{
    unsigned rl = 1u << lenbits;
    unsigned i;
    /* Collect contributions from the previous frame to the output buffer */
    for (i = 0; i < rl / 2; i++) {
        state[i]      = - state[rl+rl/2-i-1];
        state[rl/2+i] = - state[rl+i];
    }
    /* Load the input and run the DCT-4 */
    for (i = 0; i < rl; i++)
        state[rl+i] = input[i];
    dct4(state + rl, lenbits, twid);
    /* Sum contributions from this frame to the output buffer and perform the
     * required scaling. */
    for (i = 0; i < rl / 2; i++) {
        state[i]      = (state[i] + state[rl+rl/2+i]) / rl;
        state[rl/2+i] = (state[rl/2+i] - state[rl+rl-i-1]) / rl;
    }
}
```

## Windowed MDCT Implementation

Typical MDCT implementations will window the input and output data (this can also be thought of as windowing the basis functions - which I think is a more helpful way to understand what is happening). It is really important to note that the window function must be carefully chosen to ensure that the basis functions remain orthogonal! The window makes the basis functions always begin and end near zero. The process has the side effect of de-normalising the basis functions (unless the window is rectangular) and means there will be a window-dependent scaling factor which will need to be applied at the output to achieve perfect reconstruction. The following images show the second basis function of the MDCT both un-windowed and windowed with a half-sine window (given at the end of the post).

In a lossy codec this windowing process is somewhat necessary because if the start and end points are not close to zero, the output is likely to periodically glitch for even the slightest errors in the reconstructed MDCT data. This glitching will occur at the boundaries of the transform (i.e. every N points).

We can work out the necessary conditions for the window to obtain perfect reconstruction using the previous drawings (I'd steer away from equations for this one - it's easier to validate the results visually) by applying a window function split into 4 segments to each of the input blocks. I'll do the generic case for a symmetrical window which is applied to both the input and the output. We split the window (which has a length $2N$) into four segments which will be applied to our original input segments $a$, $b$, $c$ and $d$. Because we are defining this window to be symmetric, we can call the pieces:

$$\left(u,\; v,\; v_r,\; u_r\right)$$

The above illustration shows how our window segments are applied to the input data and the impact that has on the DCT-4 analysed data blob. Following that are the output segments from two sequential IMDCT calls with the windows applied to the output here as well.

We need to make the overlapping terms equal the required output segment, i.e. the windowed tail of one frame plus the windowed head of the next frame must sum to the original input segment.

It is clear from the above that the necessary condition to achieve reconstruction is for $v_r^2 + u^2 = 1$ (which implies in this case that $v^2 + u_r^2 = 1$ must also be true).

A simple solution to this is the half-sine window:

$$w_n = \sin\left(\frac{\pi}{2N}\left(n+\frac{1}{2}\right)\right) \quad 0 \le n < 2N$$

The output requires a scaling factor of 2 for this window.

# Three Fast Methods to Compute the DCT-2

After doing my "proof thing" of the fast DFT based implementation of the DCT-4 (which does actually come in useful in my line of work), I thought I would have a fiddle with some of the other DCT variants. The DCT-2 is a transform which I doubt I will ever use, but I managed to find three ways (excluding the slow and obvious one) to build it. Here is the definition:

As in the DCT-4 post, I hate trig functions, so the first thing to do is to transform the $\cos$ into a complex exponential:

$$X_k = \Re\left\{\sum_{n=0}^{N-1} x_n e^{-\frac{\pi i}{N}\left(n+\frac{1}{2}\right)k}\right\}$$

I'll use this as the starting point for the methods.

## Method 1: Recursion and the DCT-4

All of the basis functions of the DCT-2 have a symmetry about their centre, $n=\frac{N-1}{2}$; the symmetry is even for even $k$ and odd for odd $k$. This is a big hint that we can benefit from splitting the input sequence into two half-length sequences where one sequence is reversed:

$$X_k = \Re\left\{\sum_{n=0}^{\frac{N}{2}-1} x_n e^{-\frac{\pi i}{N}\left(n+\frac{1}{2}\right)k} + (-1)^k \sum_{n=0}^{\frac{N}{2}-1} x_{N-1-n} e^{\frac{\pi i}{N}\left(n+\frac{1}{2}\right)k}\right\}$$

As on the DCT-4 page, we can conjugate any terms we like in the summation as we are only interested in the real part of the result. This allows us to rewrite the above as:

$$X_k = \Re\left\{\sum_{n=0}^{\frac{N}{2}-1} \left(x_n + (-1)^k x_{N-1-n}\right) e^{-\frac{\pi i}{N}\left(n+\frac{1}{2}\right)k}\right\}$$

Split $X_k$ into even and odd terms:

$$X_{2k} = \Re\left\{\sum_{n=0}^{\frac{N}{2}-1} \left(x_n + x_{N-1-n}\right) e^{-\frac{\pi i}{N/2}\left(n+\frac{1}{2}\right)k}\right\}$$

$$X_{2k+1} = \Re\left\{\sum_{n=0}^{\frac{N}{2}-1} \left(x_n - x_{N-1-n}\right) e^{-\frac{\pi i}{N/2}\left(n+\frac{1}{2}\right)\left(k+\frac{1}{2}\right)}\right\}$$

The end. It can be seen that the $X_{2k}$ terms can be obtained by recursively calling the function again. The $X_{2k+1}$ terms can be obtained via a DCT-4. Writing out the steps explicitly:

• Create a sequence $y_n = x_n + x_{N-n-1}$ of length $\frac{N}{2}$
• Create a sequence $z_n = x_n - x_{N-n-1}$ of length $\frac{N}{2}$
• The resulting $X_{2k+1}$ terms are given by the DCT-4 of the sequence $z_n$
• The resulting $X_{2k}$ terms are given by performing this same process on the sequence $y_n$

## Method 2: A real DFT and a DCT-4

This method is probably not particularly useful given how the next method works - but I thought it was interesting so I'm listing it. Starting from the beginning again:

$$X_k = \Re\left\{\sum_{n=0}^{N-1} x_n e^{-\frac{\pi i}{N}\left(n+\frac{1}{2}\right)k}\right\}$$

Rearrange the above into two halves by interleaving the input:

$$X_k = \Re\left\{\sum_{n=0}^{\frac{N}{2}-1} x_{2n} e^{-\frac{\pi i}{N}\left(2n+\frac{1}{2}\right)k} + \sum_{n=0}^{\frac{N}{2}-1} x_{2n+1} e^{-\frac{\pi i}{N}\left(2n+\frac{3}{2}\right)k}\right\}$$

Again, we conjugate terms where we know it won't impact the result. Placing the even-indexed samples in the first half of a new sequence $v_n$ and the odd-indexed samples reversed in the second half ($v_n = x_{2n}$ and $v_{N-1-n} = x_{2n+1}$ for $0 \le n < \frac{N}{2}$) gives:

$$X_k = \Re\left\{e^{-\frac{\pi i k}{2N}} \sum_{n=0}^{N-1} v_n e^{-\frac{2\pi i n k}{N}}\right\}$$

For the $X_{2k}$ case, we can write this as:

$$X_{2k} = \Re\left\{e^{-\frac{\pi i k}{N}} \sum_{n=0}^{\frac{N}{2}-1} \left(v_n + v_{n+\frac{N}{2}}\right) e^{-\frac{2\pi i n k}{N/2}}\right\}$$

So for the even output indices, we can use a DFT on some rearranged data followed by a twiddle. For the $X_{2k+1}$ outputs, this particular form gets ugly - but recalling that in the first method, the $X_{2k+1}$ outputs were not recursive and simply based on a DCT-4, we just steal that result:

$$X_{2k+1} = \Re\left\{\sum_{n=0}^{\frac{N}{2}-1} \left(x_n - x_{N-1-n}\right) e^{-\frac{\pi i}{N/2}\left(n+\frac{1}{2}\right)\left(k+\frac{1}{2}\right)}\right\}$$

I won't bother explicitly listing the steps for this one. They are simple to figure out.

## Method 3: A single real DFT

This is almost certainly the method you should use to compute the DCT-2. Recall the third equation in the second method:

I didn't spot this immediately, but the summation term also happens to be the first stage of a decimation in frequency transform. Recall the DIF FFT stage:

$$V_{2k} = \sum_{n=0}^{\frac{N}{2}-1} \left(v_n + v_{n+\frac{N}{2}}\right) e^{-\frac{2\pi i n k}{N/2}} \qquad V_{2k+1} = \sum_{n=0}^{\frac{N}{2}-1} \left(v_n - v_{n+\frac{N}{2}}\right) e^{-\frac{2\pi i n}{N}} e^{-\frac{2\pi i n k}{N/2}}$$

So we create a new vector $y_n$:

Then substitute back into $X_k$:

This actually becomes an $N/2$ length DFT if you use a real transform.