A Decimation in Frequency Real FFT

Problem Background

Most resources I've found for computing the FFT of a real sequence use the two-for-one trick (such as this good one: FFT of Pure Real Sequences). The "trick" resembles a real FFT using a decimation in time formation. Here is a rough summary of how it works:

  • The input sequence of length  N is split into two length  N/2 sequences: one of the even indexed elements, one of odd indexed elements.
  • Treat the even element sequence and odd element sequence as the real and imaginary components of a length  N/2 sequence.
  • Find the length  N/2 DFT of this sequence.
  • Using the symmetry properties of the DFT, extract the spectrum of the real sequence and imaginary sequence.
  • Perform a decimation in time twiddle to get the results.

This is fine if the output needs to be in a particular order, but certain applications such as fast convolution could not care less about the ordering of the output bins (as has been done before by Dan Bernstein here - I'll elaborate more on this after the example) and this may enable a faster implementation.

Something Interesting

If you've never written a recursive DIF/DIT FFT before, you should do it before reading on as some interesting things pop out of the implementation (which I will get to after the example - it's also a valuable learning exercise). Below is the code for a radix-2 Decimation in Frequency FFT:

void fft_cplx(float *inout, float *scratch, unsigned len)
{
  unsigned i;
  if (len == 1)
    return;
  for (i = 0; i < len / 2; i++) {
    float re0 = inout[2*i+0];
    float im0 = inout[2*i+1];
    float re1 = inout[2*i+len+0];
    float im1 = inout[2*i+len+1];
    float twr = cosf(i * -2.0f * M_PI / len);
    float twi = sinf(i * -2.0f * M_PI / len);
    float sr = re0 - re1;
    float si = im0 - im1;
    scratch[2*i+0] = re0 + re1;
    scratch[2*i+1] = im0 + im1;
    scratch[2*i+len+0] = sr * twr - si * twi;
    scratch[2*i+len+1] = sr * twi + si * twr;
  }
  fft_cplx(scratch, inout, len/2);
  fft_cplx(scratch+len, inout, len/2);
  for (i = 0; i < len / 2; i++) {
    inout[4*i+0] = scratch[2*i+0];
    inout[4*i+1] = scratch[2*i+1];
    inout[4*i+2] = scratch[2*i+len+0];
    inout[4*i+3] = scratch[2*i+len+1];
  }
}

The interesting things to note about the above code are:

  • The first loop can operate in-place (but doesn't).
  • The re-ordering operation could just be a copy if we do not care about the order of the output bins.
  • If we change the re-order operation into a copy, modify the first loop to operate in place, we could call the sub-FFTs in-place and would not need a scratch buffer at all. i.e. we could compute the FFT in-place but our outputs would be completely out-of-order.

It is also straight forward to write a function that "undoes" this FFT by performing the steps in reverse. It turns out that this inverse function ends up being exactly the implementation of a decimation in time structure. i.e. a decimation in frequency FFT is pretty much a decimation in time FFT done with the steps in reverse. If we remove all the post data-reordering in our DIF FFT and remove all the pre data-reordering in our DIT IFFT, we can perform a transform that gives bins in a strange order ("bit-reversed" only applies to radix-2 transforms and this algorithm can run mixed radix!) and transform these strange order bins to produce the original sequence. This is interesting for FFT based convolution because we can multiply the results of two forward transforms where the outputs are in a completely crazy order and run the inverse transform to get back the convolved sequence - and this can all be done in-place! This is usually a serious performance win on modern processors.

Another big take-away point here is that: if we can write a recursive transform where the code after the recursion only re-orders or conjugates output values, we can remove that entire step if the algorithm is to be used for applications where the ordering of the bins does not matter. This is why the usual real, two-for-one, DIT-style FFT algorithm is not particularly good: the twiddles occur after the recursion and each output bin depends on two bins (as we are relying on DFT symmetry properties to extract the spectrum of two real sequences where one is rammed into the real component and the other into the imaginary component - read the article I linked to at the start of this blog for more comprehensive information).

All of the above comments were related to the complex input implementation... so what about a real transform?

Creating a Real Input FFT

We know that a DIF FFT algorithm always ends up with the reorder occurring after the recursion, so let's see if we can formulate a DIF-style FFT that will operate on real-data which we can solve recursively:

 \begin{align*} X_k &= \sum\limits_{n=0}^{N-1} x_n e^{-\frac{j 2 \pi n k }{N}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} x_n e^{-\frac{j 2 \pi n k }{N}} + x_{n+\frac{N}{2}} e^{-\frac{j 2 \pi (n+\frac{N}{2}) k }{N}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n + e^{-j \pi k} x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n k }{N}} \end{align*}

And:

 \begin{align*} X_{2 k} &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n + x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n k }{N/2}} \\ X_{2 k + 1} &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n - x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n (k + \frac{1}{2}) }{N/2}} \end{align*}

Nothing special there, we've just arrived at the text-book DIF transform. What can we see here? First, the  X_{2 k} terms can be found by recursing into the real FFT that we are building as the terms  x_n + x_{n+\frac{N}{2}} are real. The question is what can we do about the odd output terms - we don't have a transform that does this... or do we? Let's define a new real transform as:

 Y_k = \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{2}) }{N}}

This is an FFT with a half-bin shift which gives a conjugate symmetric response for a real input  y_n - but unlike a normal real FFT which has conjugate symmetry about DC, this has conjugate symmetry about  \frac{-1}{2} . i.e.

 \begin{align*} Y_{N-1-k} &= \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (N - k - \frac{1}{2})}{N}} \\ &= \sum\limits_{n=0}^{N-1} y_n e^{\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \\ &= \left( \sum\limits_{n=0}^{N-1} y_n^* e^{-\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \right)^* \\ &= \left( \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \right)^* = Y_k^* \end{align*}

The above tells us that if we compute the  Y_{2n} terms (or  Y_{N-1-2n} terms, or the first half or the second half of the output terms - it doesn't actually matter), we actually have the entire spectrum for a real input. Here is an example for  N=4 to illustrate:

 \begin{align*} Y_0 &= Y_3^* \\ Y_1 &= Y_2^* \\ Y_2 &= Y_1^* \\ Y_3 &= Y_0^* \end{align*}

Define a function for computing the Y_{2k} values:

 Y_{2k} = \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{4}) }{N/2}}

We need to bring this back into a DFT form by making the summation over half the elements:

 \begin{align*} Y_{2k} &= \sum\limits_{n=0}^{\frac{N}{2}-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{4})}{N/2}} + y_{n+\frac{N}{2}} e^{-\frac{j 2 \pi (n + \frac{N}{2}) (k + \frac{1}{4})}{N/2}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} (y_n - j y_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n (k + \frac{1}{4})}{N/2}} \end{align*}

This sequence transforms N real elements into an N/2 complex component spectrum and can be used to find the  X_{2k+1} terms we needed for the real DFT. It can be seen from the above that this algorithm only performs a data combination step, a complex multiply and then a normal DFT. If we make this DFT a DIF style implementation, we satisfy the requirement that the only operations after the recursion are moves and conjugates. Here is a link to a boring implementation: real_fft_example.c.

Something I find particularly cool about the algorithm is that it naturally packs the DC and Nyquist bins into the real and imaginary components of the first output as part of the recursion (regardless of if we re-order the output or not - meaning that for convolution we still know which outputs are DC and Nyquist!). This is what most other libraries do but it ends up looking like a "hack" (at least in the two-for-one implementation).

Why you probably shouldn't bother

... because there are loads of great libraries out there that do fast convolution of real sequences (FFTW) and this is probably not a particularly good design. The implementation ends up recursing into two different functions: itself and a complex FFT - which isn't the worst thing in the world, but it's also not really that good either if you are trying to build a general-purpose FFT library or want to execute the FFT pass-at-a-time. If you have a fast FFT pass, it's not going to be that useful for this algorithm which would need it's own optimised implementation.

This was just a bit of fun.