Generating multiple low-quality random numbers… fast.


A while ago I was doing some profiling to find out ways to improve the load performance of my Open Diapason project. The load process is roughly as follows:

  • Load wave file from disk into memory.
  • Apply a linear phase filter to all loaded samples (this is to compensate for the response of the resampler which occurs during playback).
  • Normalise and quantise the filtered audio using TPDF dither. The normalisation ensures that we use the full 16-bits of precision for each sample. Obviously, the scaling coefficient is stored along with the sample to ensure we play the content back at the intended level.

All of these steps are done across multiple threads to maximise performance (there is one thread that is reading audio files from disk ready to spawn threads which apply the filter and requantise the data)… this is one of those times where the speedup was roughly linear with the first couple of threads which I added.

Anyway, back to the profiling: I was expecting that the vast majority of the cycles would be spent in the application of the filter, but this actually turned out not to be the case. The dithering and re-quantisation of the audio – which I assumed would be cheap – turned out to be chewing cycles on the order of 30% of the load time (all of my profiling was done on a mid 2014 MacBook Pro with an i7 processor).

There were multiple things which I did to speed it up, but I think what I pushed in this change list was the most interesting one – so what follows in this blog is a bit of a write up.

The idea

Uniform random numbers are useful and often we need several of them. In my use-case (performing TPDF dither on a sample pair) I needed four random variables. Generating them needs to be fast but the “quality” of the randomness is not that important. The obvious way to do this is to use a Linear Congruential Generator which ends up forming a code pattern like this:

r0 = (state * A + C) % M;
r1 = (r0 * A + C) % M;
r2 = (r1 * A + C) % M;
r3 = (r2 * A + C) % M;
state = r3;

This forms a dependency chain where each random number requires knowledge of the previous one and could end up leading to pipeline issues (this is really dependent on how the compiler re-orders the code and the architecture of the platform). In the code generated on my machine, this turned out to be a bottleneck. We could get around the pipeline issues by creating four different LCGs:

r0 = (state0 * A0 + C0) % M;
r1 = (state1 * A1 + C1) % M;
r2 = (state2 * A2 + C2) % M;
r3 = (state3 * A3 + C3) % M;
state0 = r0;
state1 = r1;
state2 = r2;
state3 = r3;

But this will require more registers to store the different LCG states and possibly more registers or code memory to hold all the new coefficients. It would be nice to have something a bit simpler…

LCGs are really cool toys. We can choose our A_n and C_n values in such a way that the pseudo-random sequence has period M. Given the algorithm generates the next value from the previous, this must mean that the random sequence contains every integer between 0 and M-1. The first part of the idea, is to modify the code as:

r0 = (state0 * A0) % M;
r1 = (state1 * A1) % M;
r2 = (state2 * A2) % M;
r3 = (state3 * A3) % M;
state0 = (r0 + C0 % M) % M;
state1 = (r1 + C1 % M) % M;
state2 = (r2 + C2 % M) % M;
state3 = (r3 + C3 % M) % M;

This produces a different sequence in r_0 to r_3 because the accumulation now happens at the state assignment, but still has a period M. Now for the dodgy part, we change the algorithm to this:

r0 = (state0 * A0) % M;
r1 = (state0 * A1) % M;
r2 = (state0 * A2) % M;
r3 = (state0 * A3) % M;
state0 = (r0 + C0 % M) % M;

Where C_0 = 1, M = 2^{32} and A_0 to A_3 are all different but satisfy the required conditions for an LCG to work. It should not be difficult to infer that r_0 to r_3 still all have period M – but produce different output sequences. I wrote a test program to look at the correlation of these sequences over their period which can be found on GitHub here – they are mostly independent over their period. I doubt they make great random numbers, but for performing dither, they are completely fine. The generated code seems pretty good also:

sampletest[0x10000a99f] <+3503>: imull  $0xd688014d, %r8d, %edi   ; imm = 0xD688014D 
sampletest[0x10000a9a6] <+3510>: imull  $0xdb71f7bd, %r8d, %r15d  ; imm = 0xDB71F7BD 
sampletest[0x10000a9ad] <+3517>: imull  $0xe05f354d, %r8d, %esi   ; imm = 0xE05F354D 
sampletest[0x10000a9b4] <+3524>: imull  $0xe54c7f35, %r8d, %ecx   ; imm = 0xE54C7F35 
sampletest[0x10000a9bb] <+3531>: movss  (%r11,%rdx,2), %xmm2      ; xmm2 = mem[0],zero,zero,zero 
sampletest[0x10000a9c1] <+3537>: mulss  %xmm1, %xmm2
sampletest[0x10000a9c5] <+3541>: movss  (%r10,%rdx,2), %xmm3      ; xmm3 = mem[0],zero,zero,zero 
sampletest[0x10000a9cb] <+3547>: mulss  %xmm1, %xmm3
sampletest[0x10000a9cf] <+3551>: leal   0x1(%rdi), %r8d
sampletest[0x10000a9d3] <+3555>: cvttss2si %xmm2, %r12
sampletest[0x10000a9d8] <+3560>: cvttss2si %xmm3, %r13
sampletest[0x10000a9dd] <+3565>: addq   %rdi, %rsi
sampletest[0x10000a9e0] <+3568>: addq   %r12, %rsi
sampletest[0x10000a9e3] <+3571>: addq   %r15, %rcx
sampletest[0x10000a9e6] <+3574>: addq   %r13, %rcx
sampletest[0x10000a9e9] <+3577>: shrq   $0x21, %rsi
sampletest[0x10000a9ed] <+3581>: shrq   $0x21, %rcx
sampletest[0x10000a9f1] <+3585>: movl   %edx, %edi
sampletest[0x10000a9f3] <+3587>: movw   %si, (%rax,%rdi,2)
sampletest[0x10000a9f7] <+3591>: leal   0x1(%rdx), %esi
sampletest[0x10000a9fa] <+3594>: movw   %cx, (%rax,%rsi,2)
sampletest[0x10000a9fe] <+3598>: addq   $0x2, %rdx
sampletest[0x10000aa02] <+3602>: decl   %ebx
sampletest[0x10000aa04] <+3604>: jne    0x10000a99f

See the code in the original commit for a better understanding of what’s going on. There is a hefty comment explaining the fixed-point dithering process.

Given a vector: find N orthogonal vectors.

When writing eigen solvers, it’s useful to be able to generate an orthonormal basis containing a particular vector. Another way of saying this is: given some fixed vector, find other vectors that are perpendicular to it and each other. I am under the impression from some websites that this is also useful in computer graphics (but I know nothing about that – let me know if you do!).

I’ve seen people ask how to do this on the internet (for example: here and here) and in general the recommended process is:

  • If only one perpendicular vector is required, use a heuristic to come up with something perpendicular (see previous links).
  • If two perpendicular vectors are required, create one perpendicular vector using the above heuristic then use the cross-product to come up with the final vector.
  • For more than two perpendicular vectors, generate random vectors and use Gram Schmidt to (hopefully) orthogonalise them.
  • For 3D graphics, see the “Update” heading below…

The first two usually require branching in the heuristic. The second may not work (if you’re really really unlucky) and requires multiple normalisation operations. Follows is an approach which I haven’t seen documented elsewhere which requires one square root for an arbitrary dimension vector set.

What we want is some technique which creates an orthogonal matrix from a vector. This is exactly what the Householder transformation matrix does:

    \[ \mathbf{P} = \mathbf{I} - 2 \mathbf{v} \mathbf{v}^* \]

Where \mathbf{v} is some unit vector. The matrix \mathbf{P} is hermitian and unitary and is hence orthonormal. If we can find a simple expression for \mathbf{v} which will force the first column of \mathbf{P} to be equal to some unit vector \mathbf{q}, we have all the information we need to create a basis containing \mathbf{q}. This turns out to be trivial:

    \[ \mathbf{q} = \left( \mathbf{I} - 2 \mathbf{v} \mathbf{v}^* \right) \mathbf{e}_1 \]

    \[ \mathbf{q} = \mathbf{e}_1 - 2 \mathbf{v} \mathbf{v}^* \mathbf{e}_1 \]

    \[ \frac{\mathbf{e}_1 - \mathbf{q}}{2} = \mathbf{v} v^*_1 \]

Explicitly, this looks like:

    \[ \begin{pmatrix} \frac{1 - q_1}{2} \\ -\frac{q_2}{2} \\ -\frac{q_3}{2} \\ \vdots \\ -\frac{q_N}{2} \end{pmatrix} = v^*_1 \begin{pmatrix} v_1 \\ v_2 \\ v_3 \\ \vdots \\ v_N \end{pmatrix} \]

If \mathbf{q} is real (i.e. v^*_1 = v_1 ), we set:

    \[ v_n = \left\{   \begin{array}{@{}ll@{}}     \sqrt{\frac{1 - q_1}{2}}, & \text{if}\ n = 1 \\     -\frac{q_n}{2 v_1}, & \text{if}\ n > 1   \end{array}\right. \]

Here is a very inefficient C implementation (the square root is not actually necessary – this can be seen in the fact that two v_n entries are always multiplied together and the term in the square root is constant – also the most inner loop over j should start from 1 because when j==0, the value is simply q_{i+1} ):

/* The first n values in m is a normalised vector which is desired to be
 * part of the basis set. Following this vector, n-1 new vectors of length n
 * will be added which are orthogonal to it. */
void house_ortho(double *m, unsigned n)
    const double x = sqrt(1.0 - m[0]);
    /* As m[0] approaches 1, x approaches 0. This detects that and will end
     * up forcing the result to be the identity matrix. */
    const double s = (x < 1e-20) ? 0.0 : (-1.0 / x);
    unsigned i, j;
    /* We first convert the m vector into the v vector. Note that we have
     * incorporated the multiply by 2 present in the Householder operation
     * into v by multiplying every element by sqrt(2). */
    for (m[0] = x, i = 1; i < n; i++)
        m[i] *= s;
    /* Create all of the orthogonal vectors using the definition of the
     * Householder transformation matrix: I - 2 v v*. */
    for (i = 1; i < n; i++) {
        for (j = 0; j < n; j++) {
            m[i*n+j] = ((i == j) ? 1.0 : 0.0) - m[i] * m[j];
    /* Convert the v vector back into m. Probably could have written this
     * code in a way where this didn't need to happen. */
    for (m[0] = 1.0 - x * x, j = 1; j < n; j++)
        m[j] = -m[j] * x;

The termination condition for the for loop (over i) in the middle of the code sample could be modified to generate a smaller number of vectors.

This process could be modified to work with a complex valued \mathbf{q} but is slightly more complicated because \mathbf{P} is hermitian (which implies that all values on the diagonal are real). The modification amounts to finding a unit vector which forces q_1 to be real, multiplying \mathbf{q} by this unit vector then dividing all resultant vectors by the factor. I haven’t tested this, but it should work.


I posted a link to this blog on Twitter and reached out to see if anyone else had used this method. Warren Moore helpfully responded and pointed me at this paper “Building an Orthonormal Basis, Revisited” (Pixar) which describes numerical and implementation improvements to an algorithm proposed in this paper “Building an Orthonormal Basis from a 3D Unit Vector Without Normalization” (Frisvad, J. R.).

The algorithm which I’ve described in terms of finding the vector v which creates the Householder matrix P with the first column equal to some unit vector q, ends up producing exactly the same equations (given a vector of dimension 3) given in Frisvad’s paper up to a few sign changes – albeit I’ve gone about defining the problem in a slightly different way. We can demonstrate the equivalence by expanding out the resulting householder matrix given the v vector we derived earlier:

    \[ v_n = \left\{   \begin{array}{@{}ll@{}}     \sqrt{\frac{1 - q_1}{2}}, & \text{if}\ n = 1 \\     -\frac{q_n}{2 v_1}, & \text{if}\ n > 1   \end{array}\right. \]

    \[ \begin{pmatrix}  q_1 & q_2 & q_3 \\  q_2 & 1 - 2 v_2 v_2 & -2 v_2 v_3 \\  q_3 & -2 v_2 v_3 & 1 - 2 v_3 v_3 \\ \end{pmatrix} \]

Note: we don’t actually need v_1 because the householder matrix is symmetric. The matrix can be expressed purely in terms of q as:

    \[ \begin{pmatrix}  q_1 & q_2 & q_3 \\  q_2 & 1 - \frac{q^2_2}{1-q_1} & -\frac{q_2 q_3}{1-q_1} \\  q_3 & -\frac{q_3 q_2}{1-q_1} & 1 - \frac{q^2_3}{1-q_1} \\ \end{pmatrix} \]

The Pixar paper addresses the singularity which occurs as q_1 tends towards 1. We can do the same thing here by negating the definition of the Householder matrix which still produces a matrix with the same properties. i.e.

    \[ \mathbf{P} = 2 \mathbf{v} \mathbf{v}^* - \mathbf{I} \]

In this case, the vector v is now defined as:

    \[ v_n = \left\{   \begin{array}{@{}ll@{}}     \sqrt{\frac{1 + q_1}{2}}, & \text{if}\ n = 1 \\     \frac{q_n}{2 v_1}, & \text{if}\ n > 1   \end{array}\right. \]

Which moves the singularity to occur as q_1 tends towards -1. The matrix still looks very similar:

    \[ \begin{pmatrix}  q_1 & q_2 & q_3 \\  q_2 & \frac{q^2_2}{1+q_1} - 1 & \frac{q_2 q_3}{1+q_1} \\  q_3 & \frac{q_3 q_2}{1+q_1} & \frac{q^2_3}{1+q_1} - 1 \\ \end{pmatrix} \]

Efficient implementation of 3-dimension basis finder

As shown in the Pixar paper, there is no need to test the sign of q_1 and pick a code path to follow (which would create a branch prediction performance hit) – we can simply extract the sign and use it as a multiplier to flip signs where they need to be flipped (if they need to be flipped).

Modifying some signs in the explicit matrices given previously, we can re-write the matrices as:

    \[ \begin{pmatrix}  q_1 & q_2 & q_3 \\  q_2 & \frac{q^2_2}{q_1-1} + 1 & \frac{q_2 q_3}{q_1-1} \\  q_3 & \frac{q_3 q_2}{q_1-1} & \frac{q^2_3}{q_1-1} + 1 \\ \end{pmatrix} \]

    \[ \begin{pmatrix}  q_1 & q_2 & q_3 \\  q_2 & \frac{q^2_2}{q_1+1} - 1 & \frac{q_2 q_3}{q_1+1} \\  q_3 & \frac{q_3 q_2}{q_1+1} & \frac{q^2_3}{q_1+1} - 1 \\ \end{pmatrix} \]

Or more simply:

    \[ \begin{pmatrix}  q_1 & q_2 & q_3 \\  q_2 & \frac{q^2_2}{q_1+s} - s & \frac{q_2 q_3}{q_1+s} \\  q_3 & \frac{q_3 q_2}{q_1+s} & \frac{q^2_3}{q_1+s} - s \\ \end{pmatrix} \]


    \[ s = \left\{   \begin{array}{@{}ll@{}}     1, & \text{if}\ q_1 > 0 \\     -1, & \text{if}\ q_1 < 0 \\     \pm 1, & \text{if}\ q_1 = 0   \end{array}\right. \]

And here’s the C implementation:

/* The first three values in v are a three point unit vector. The
 * function produces two more three point vectors of the same coordinate
 * space which are orthogonal. */
void ortho3(float *v)
    float q1        = v[0];
    float q2        = v[1];
    float q3        = v[2];
    float sign      = copysignf(1.0f, q1);
    float scale     = 1.0f / (q1 + sign);
    float q2scale   = q2 * scale;
    float q3scale   = q3 * scale;
    float q3q2scale = q2scale * q3;
    v[3]            = q2;
    v[4]            = q2 * q2scale - sign;
    v[5]            = q3q2scale;
    v[6]            = q3;
    v[7]            = q3q2scale;
    v[8]            = q3 * q3scale - sign;

Three step digital Butterworth design.

This guide omits a fair amount of detail in order to provide a fairly quick guide.

1) Design in the s-domain

Butterworth filters have their poles evenly spaced on a circlular arc in the left hand plane of the [s/Laplace/analogue]-domain.

For a low-pass filter of order N, the complex valued pole locations are:

    \[ p_n = \omega_0 e^\frac{j \pi (2 n + N + 1)}{2 N} \]

Where 0 <= n < N and \omega_0 is the cutoff frequency in radians.

The transfer function in the s domain is:

    \[ H(s) = \frac{w_0}{\prod\limits_{n=0}^{N-1}(s - p_n)} \]

Because the poles always occur in conjugate pairs for this kind of filter, it can always be expanded out into a polynomial with real-valued coefficients a_n:

    \[ H(s) = \frac{w_0}{\sum\limits_{n=0}^{N-1} a_n s^n} \]

2) Convert the s-domain transfer function to the z-domain.

There are a few methods to convert an s-domain transfer function this into a [z/digital/discrete]-domain transfer function. The most common is the bilinear transform which makes the substitution:

    \[ s = \alpha \frac{z - 1}{z + 1} \]


    \[ \alpha = \frac{2}{T} \]

Or if you’re performing “frequency warping”:

    \[ \alpha = \frac{\omega_0}{\tan(\frac{\omega_0 T}{2})} \]

I won’t go into verbose details here, but you probably want frequency warping. The bilinear transform is a linear approximation to the function s=\frac{1}{T}\ln(z) which maps the s-domain to the digital z-domain. This linear mapping is analogous to the first term of a series expansion of s=\frac{1}{T}\ln(z) which is most accurate around z=1 (i.e. DC). The frequency warping operation moves the point where most accuracy is preserved to \omega_0. For higher frequency cutoffs, this becomes important. \omega_0 can be set to the cutoff frequency of the filter to improve the frequency response near the transition.

For a 3-rd order system, the conversion yields (see notes at the end of this page about risks):

    \[ \frac{b_3 s^3 + b_2 s^2 + b_1 s^1 + b_0}{a_3 s^3 + a_2 s^2 + a_1 s^1 + a_0} \leftrightarrow \frac{b_3 \alpha^3 (z + 1)^3 + b_2 \alpha^2 (z + 1)^2 (z - 1) + b_1 \alpha (z + 1) (z - 1)^2 + b_0 (z - 1)^3}{a_3 \alpha^3 (z + 1)^3 + a_2 \alpha^2 (z + 1)^2 (z - 1) + a_1 \alpha (z + 1) (z - 1)^2 + a_0 (z - 1)^3} \]

Which can be expanded into a rational polynomial transfer function in z.

3) Convert from the z-domain to a difference equation.

First, you must make your z-domain expression causal. Positive powers of z must be removed otherwise your filter would need to know the future! This is simple, just modify the numerator and denominator of the transfer function by the inverse of the highest power of z. For example:

    \[ \frac{Y(z)}{X(z)} = H(z) = \frac{z + 4}{2z^2 - 3z + 2} \]

Multiply the numerator and denominator by z^{-2}.

    \[ \frac{Y(z)}{X(z)} = \frac{z^{-1} + 4z^{-2}}{2 - 3z^{-1} + 2z^{-2}} \]

Get Y(z) as a function of X(z).

    \[ Y(z)(2 - 3z^{-1} + 2z^{-2}) = X(z)(z^{-1} + 4z^{-2}) \]

The difference equation can be found by substituting z-powers as offsets to the data they are using. i.e. for the above case:

    \[ 2y[n] = x[n-1] + 4x[n-2] + 3y[n-1] - 2y[n-2] \]

That is what you want to implement in your code.

Things to keep in mind

High-order systems can start behaving very poorly when designed in this way. Even on floating point, once a system reaches 4-th order or higher, if there are poles close to the real axis in the z-domain, the output can become severely degraded. There are ways to mitigate this: the simplest of which is to break the implementation into separate 2nd order filters (biquads).

A Decimation in Frequency Real FFT

Problem Background

Most resources I’ve found for computing the FFT of a real sequence use the two-for-one trick (such as this good one: FFT of Pure Real Sequences). The “trick” resembles a real FFT using a decimation in time formation. Here is a rough summary of how it works:

  • The input sequence of length N is split into two length N/2 sequences: one of the even indexed elements, one of odd indexed elements.
  • Treat the even element sequence and odd element sequence as the real and imaginary components of a length N/2 sequence.
  • Find the length N/2 DFT of this sequence.
  • Using the symmetry properties of the DFT, extract the spectrum of the real sequence and imaginary sequence.
  • Perform a decimation in time twiddle to get the results.

This is fine if the output needs to be in a particular order, but certain applications such as fast convolution could not care less about the ordering of the output bins (as has been done before by Dan Bernstein here – I’ll elaborate more on this after the example) and this may enable a faster implementation.

Something Interesting

If you’ve never written a recursive DIF/DIT FFT before, you should do it before reading on as some interesting things pop out of the implementation (which I will get to after the example – it’s also a valuable learning exercise). Below is the code for a radix-2 Decimation in Frequency FFT:

void fft_cplx(float *inout, float *scratch, unsigned len)
  unsigned i;
  if (len == 1)
  for (i = 0; i &lt; len / 2; i++) {
    float re0 = inout[2*i+0];
    float im0 = inout[2*i+1];
    float re1 = inout[2*i+len+0];
    float im1 = inout[2*i+len+1];
    float twr = cosf(i * -2.0f * M_PI / len);
    float twi = sinf(i * -2.0f * M_PI / len);
    float sr = re0 - re1;
    float si = im0 - im1;
    scratch[2*i+0] = re0 + re1;
    scratch[2*i+1] = im0 + im1;
    scratch[2*i+len+0] = sr * twr - si * twi;
    scratch[2*i+len+1] = sr * twi + si * twr;
  fft_cplx(scratch, inout, len/2);
  fft_cplx(scratch+len, inout, len/2);
  for (i = 0; i &lt; len / 2; i++) {
    inout[4*i+0] = scratch[2*i+0];
    inout[4*i+1] = scratch[2*i+1];
    inout[4*i+2] = scratch[2*i+len+0];
    inout[4*i+3] = scratch[2*i+len+1];

The interesting things to note about the above code are:

  • The first loop can operate in-place (but doesn’t).
  • The re-ordering operation could just be a copy if we do not care about the order of the output bins.
  • If we change the re-order operation into a copy, modify the first loop to operate in place, we could call the sub-FFTs in-place and would not need a scratch buffer at all. i.e. we could compute the FFT in-place but our outputs would be completely out-of-order.

It is also straight forward to write a function that “undoes” this FFT by performing the steps in reverse. It turns out that this inverse function ends up being exactly the implementation of a decimation in time structure. i.e. a decimation in frequency FFT is pretty much a decimation in time FFT done with the steps in reverse. If we remove all the post data-reordering in our DIF FFT and remove all the pre data-reordering in our DIT IFFT, we can perform a transform that gives bins in a strange order (“bit-reversed” only applies to radix-2 transforms and this algorithm can run mixed radix!) and transform these strange order bins to produce the original sequence. This is interesting for FFT based convolution because we can multiply the results of two forward transforms where the outputs are in a completely crazy order and run the inverse transform to get back the convolved sequence – and this can all be done in-place! This is usually a serious performance win on modern processors.

Another big take-away point here is that: if we can write a recursive transform where the code after the recursion only re-orders or conjugates output values, we can remove that entire step if the algorithm is to be used for applications where the ordering of the bins does not matter. This is why the usual real, two-for-one, DIT-style FFT algorithm is not particularly good: the twiddles occur after the recursion and each output bin depends on two bins (as we are relying on DFT symmetry properties to extract the spectrum of two real sequences where one is rammed into the real component and the other into the imaginary component – read the article I linked to at the start of this blog for more comprehensive information).

All of the above comments were related to the complex input implementation… so what about a real transform?

Creating a Real Input FFT

We know that a DIF FFT algorithm always ends up with the reorder occurring after the recursion, so let’s see if we can formulate a DIF-style FFT that will operate on real-data which we can solve recursively:

     \[ \begin{align*} X_k &= \sum\limits_{n=0}^{N-1} x_n e^{-\frac{j 2 \pi n k }{N}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} x_n e^{-\frac{j 2 \pi n k }{N}} + x_{n+\frac{N}{2}} e^{-\frac{j 2 \pi (n+\frac{N}{2}) k }{N}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n + e^{-j \pi k} x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n k }{N}} \end{align*} \]


     \[ \begin{align*} X_{2 k} &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n + x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n k }{N/2}} \\ X_{2 k + 1} &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n - x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n (k + \frac{1}{2}) }{N/2}} \end{align*} \]

Nothing special there, we’ve just arrived at the text-book DIF transform. What can we see here? First, the X_{2 k} terms can be found by recursing into the real FFT that we are building as the terms x_n + x_{n+\frac{N}{2}} are real. The question is what can we do about the odd output terms – we don’t have a transform that does this… or do we? Let’s define a new real transform as:

    \[ Y_k = \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{2}) }{N}} \]

This is an FFT with a half-bin shift which gives a conjugate symmetric response for a real input y_n – but unlike a normal real FFT which has conjugate symmetry about DC, this has conjugate symmetry about \frac{-1}{2}. i.e.

     \[ \begin{align*} Y_{N-1-k} &= \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (N - k - \frac{1}{2})}{N}} \\ &= \sum\limits_{n=0}^{N-1} y_n e^{\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \\ &= \left( \sum\limits_{n=0}^{N-1} y_n^* e^{-\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \right)^* \\ &= \left( \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \right)^* = Y_k^* \end{align*} \]

The above tells us that if we compute the Y_{2n} terms (or Y_{N-1-2n} terms, or the first half or the second half of the output terms – it doesn’t actually matter), we actually have the entire spectrum for a real input. Here is an example for N=4 to illustrate:

     \[ \begin{align*} Y_0 &= Y_3^* \\ Y_1 &= Y_2^* \\ Y_2 &= Y_1^* \\ Y_3 &= Y_0^* \end{align*} \]

Define a function for computing the Y_{2k} values:

    \[ Y_{2k} = \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{4}) }{N/2}} \]

We need to bring this back into a DFT form by making the summation over half the elements:

     \[ \begin{align*} Y_{2k} &= \sum\limits_{n=0}^{\frac{N}{2}-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{4})}{N/2}} + y_{n+\frac{N}{2}} e^{-\frac{j 2 \pi (n + \frac{N}{2}) (k + \frac{1}{4})}{N/2}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} (y_n - j y_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n (k + \frac{1}{4})}{N/2}} \end{align*} \]

This sequence transforms N real elements into an N/2 complex component spectrum and can be used to find the X_{2k+1} terms we needed for the real DFT. It can be seen from the above that this algorithm only performs a data combination step, a complex multiply and then a normal DFT. If we make this DFT a DIF style implementation, we satisfy the requirement that the only operations after the recursion are moves and conjugates. Here is a link to a boring implementation: real_fft_example.c.

Something I find particularly cool about the algorithm is that it naturally packs the DC and Nyquist bins into the real and imaginary components of the first output as part of the recursion (regardless of if we re-order the output or not – meaning that for convolution we still know which outputs are DC and Nyquist!). This is what most other libraries do but it ends up looking like a “hack” (at least in the two-for-one implementation).

Why you probably shouldn’t bother

… because there are loads of great libraries out there that do fast convolution of real sequences (FFTW) and this is probably not a particularly good design. The implementation ends up recursing into two different functions: itself and a complex FFT – which isn’t the worst thing in the world, but it’s also not really that good either if you are trying to build a general-purpose FFT library or want to execute the FFT pass-at-a-time. If you have a fast FFT pass, it’s not going to be that useful for this algorithm which would need it’s own optimised implementation.

This was just a bit of fun.