A Decimation in Frequency Real FFT

Problem Background

Most resources I've found for computing the FFT of a real sequence use the two-for-one trick (such as this good one: FFT of Pure Real Sequences). The "trick" resembles a real FFT built on a decimation in time formulation. Here is a rough summary of how it works:

  • The input sequence of length  N is split into two length  N/2 sequences: one of the even indexed elements, one of odd indexed elements.
  • Treat the even element sequence and odd element sequence as the real and imaginary components of a length  N/2 sequence.
  • Find the length  N/2 DFT of this sequence.
  • Using the symmetry properties of the DFT, extract the spectrum of the real sequence and imaginary sequence.
  • Perform a decimation in time twiddle to get the results.

This is fine if the output needs to be in a particular order, but certain applications such as fast convolution do not care about the ordering of the output bins, and this may enable a faster implementation (this has been done before by Dan Bernstein here - I'll elaborate more on this after the example).

Something Interesting

If you've never written a recursive DIF/DIT FFT before, you should do it before reading on as some interesting things pop out of the implementation (which I will get to after the example - it's also a valuable learning exercise). Below is the code for a radix-2 Decimation in Frequency FFT:

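#include <math.h> /* cosf, sinf; M_PI is a common extension, not ISO C */

/* Radix-2 decimation in frequency FFT. inout holds len complex values
 * stored as interleaved (re, im) floats; len must be a power of two.
 * scratch must have room for 2*len floats. */
void fft_cplx(float *inout, float *scratch, unsigned len)
{
  unsigned i;
  if (len == 1)
    return;
  /* Butterflies: sums go to the first half of scratch, twiddled
   * differences go to the second half. */
  for (i = 0; i < len / 2; i++) {
    float re0 = inout[2*i+0];
    float im0 = inout[2*i+1];
    float re1 = inout[2*i+len+0];
    float im1 = inout[2*i+len+1];
    float twr = cosf(i * -2.0f * M_PI / len);
    float twi = sinf(i * -2.0f * M_PI / len);
    float sr = re0 - re1;
    float si = im0 - im1;
    scratch[2*i+0] = re0 + re1;
    scratch[2*i+1] = im0 + im1;
    scratch[2*i+len+0] = sr * twr - si * twi;
    scratch[2*i+len+1] = sr * twi + si * twr;
  }
  /* Recurse on each half, using inout as the scratch space. */
  fft_cplx(scratch, inout, len/2);
  fft_cplx(scratch+len, inout, len/2);
  /* Re-order: interleave the even bins (first half) with the odd bins
   * (second half) back into inout. */
  for (i = 0; i < len / 2; i++) {
    inout[4*i+0] = scratch[2*i+0];
    inout[4*i+1] = scratch[2*i+1];
    inout[4*i+2] = scratch[2*i+len+0];
    inout[4*i+3] = scratch[2*i+len+1];
  }
}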

The interesting things to note about the above code are:

  • The first loop can operate in-place (but doesn't).
  • The re-ordering operation could just be a copy if we do not care about the order of the output bins.
  • If we change the re-order operation into a copy and modify the first loop to operate in place, we could call the sub-FFTs in-place and would not need a scratch buffer at all, i.e. we could compute the FFT in-place but our outputs would be completely out-of-order.

It is also straightforward to write a function that "undoes" this FFT by performing the steps in reverse. It turns out that this inverse function ends up being exactly the implementation of a decimation in time structure, i.e. a decimation in frequency FFT is pretty much a decimation in time FFT done with the steps in reverse. If we remove all the post data-reordering in our DIF FFT and remove all the pre data-reordering in our DIT IFFT, we can perform a transform that gives bins in a strange order ("bit-reversed" only applies to radix-2 transforms and this algorithm can run mixed radix!) and then transform these strange order bins back to produce the original sequence. This is interesting for FFT based convolution because we can multiply the results of two forward transforms where the outputs are in a completely crazy order and run the inverse transform to get back the convolved sequence - and this can all be done in-place! This is usually a serious performance win on modern processors.

Another big take-away here is that if we can write a recursive transform where the code after the recursion only re-orders or conjugates output values, we can remove that entire step if the algorithm is to be used for applications where the ordering of the bins does not matter. This is why the usual real, two-for-one, DIT-style FFT algorithm is not particularly good: the twiddles occur after the recursion and each output bin depends on two bins (as we are relying on DFT symmetry properties to extract the spectrum of two real sequences where one is rammed into the real component and the other into the imaginary component - read the article I linked to at the start of this post for more comprehensive information).

All of the above comments were related to the complex input implementation... so what about a real transform?

Creating a Real Input FFT

We know that a DIF FFT algorithm always ends up with the reorder occurring after the recursion, so let's see if we can formulate a DIF-style FFT that will operate on real-data which we can solve recursively:

 \begin{align*} X_k &= \sum\limits_{n=0}^{N-1} x_n e^{-\frac{j 2 \pi n k }{N}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} x_n e^{-\frac{j 2 \pi n k }{N}} + x_{n+\frac{N}{2}} e^{-\frac{j 2 \pi (n+\frac{N}{2}) k }{N}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n + e^{-j \pi k} x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n k }{N}} \end{align*}

And:

 \begin{align*} X_{2 k} &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n + x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n k }{N/2}} \\ X_{2 k + 1} &= \sum\limits_{n=0}^{\frac{N}{2}-1} (x_n - x_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n (k + \frac{1}{2}) }{N/2}} \end{align*}

Nothing special there, we've just arrived at the text-book DIF transform. What can we see here? First, the  X_{2 k} terms can be found by recursing into the real FFT that we are building as the terms  x_n + x_{n+\frac{N}{2}} are real. The question is what can we do about the odd output terms - we don't have a transform that does this... or do we? Let's define a new real transform as:

 Y_k = \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{2}) }{N}}

This is an FFT with a half-bin shift which gives a conjugate symmetric response for a real input  y_n - but unlike a normal real FFT which has conjugate symmetry about DC, this has conjugate symmetry about  \frac{-1}{2} . i.e.

 \begin{align*} Y_{N-1-k} &= \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (N - k - \frac{1}{2})}{N}} \\ &= \sum\limits_{n=0}^{N-1} y_n e^{\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \\ &= \left( \sum\limits_{n=0}^{N-1} y_n^* e^{-\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \right)^* \\ &= \left( \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{2})}{N}} \right)^* = Y_k^* \end{align*}

The above tells us that if we compute the Y_{2k} terms (or the Y_{N-1-2k} terms, or the first half or the second half of the output terms - it doesn't actually matter), we have the entire spectrum for a real input. Here is an example for N=4 to illustrate:

 \begin{align*} Y_0 &= Y_3^* \\ Y_1 &= Y_2^* \\ Y_2 &= Y_1^* \\ Y_3 &= Y_0^* \end{align*}

Define a function for computing the Y_{2k} values:

 Y_{2k} = \sum\limits_{n=0}^{N-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{4}) }{N/2}}

We need to bring this back into a DFT form by making the summation over half the elements:

 \begin{align*} Y_{2k} &= \sum\limits_{n=0}^{\frac{N}{2}-1} y_n e^{-\frac{j 2 \pi n (k + \frac{1}{4})}{N/2}} + y_{n+\frac{N}{2}} e^{-\frac{j 2 \pi (n + \frac{N}{2}) (k + \frac{1}{4})}{N/2}} \\ &= \sum\limits_{n=0}^{\frac{N}{2}-1} (y_n - j y_{n+\frac{N}{2}}) e^{-\frac{j 2 \pi n (k + \frac{1}{4})}{N/2}} \end{align*}

This sequence transforms N real elements into an N/2 complex component spectrum and can be used to find the  X_{2k+1} terms we needed for the real DFT. It can be seen from the above that this algorithm only performs a data combination step, a complex multiply and then a normal DFT. If we make this DFT a DIF style implementation, we satisfy the requirement that the only operations after the recursion are moves and conjugates. Here is a link to a boring implementation: real_fft_example.c.

Something I find particularly cool about the algorithm is that it naturally packs the DC and Nyquist bins into the real and imaginary components of the first output as part of the recursion (regardless of whether we re-order the output or not - meaning that for convolution we still know which outputs are DC and Nyquist!). This is what most other libraries do but it ends up looking like a "hack" (at least in the two-for-one implementation).

Why you probably shouldn't bother

... because there are loads of great libraries out there that do fast convolution of real sequences (FFTW) and this is probably not a particularly good design. The implementation ends up recursing into two different functions: itself and a complex FFT - which isn't the worst thing in the world, but it's also not really that good either if you are trying to build a general-purpose FFT library or want to execute the FFT pass-at-a-time. If you have a fast FFT pass, it's not going to be that useful for this algorithm which would need its own optimised implementation.

This was just a bit of fun.

Real-time re-sampling and linear interpolation?

Disclaimer: I've intentionally tried to keep this post "non-mathy" - I want it to provide a high level overview of what linear interpolation does spectrally and provide some evidence as to why it's probably not suited in audio processing... unless distortion is desirable.

In the context of constant pitch shifting (the input and output signals have a fixed sampling rate), linear interpolation treats the discrete input signal as continuous by drawing straight lines between the discrete samples. The output signal is constructed by picking values at regular times.

linear-interpolator-intuition

In the above, the green arrows are the input samples and the red arrows are where we want to pick the values off. It's an intuitive answer to "find the missing values", but what does it actually do to an audio signal? To find this out, it helps to look at the problem in a different way; we can change the intuitive definition described previously to: the linear interpolator interpolates (meaning: inserts a fixed number of zeroes between each input sample) the input signal by some factor, convolves the response with a triangular shaped filter kernel then decimates by some other factor. This is not quite as trivial as the previous definition, but is identical and we can draw the behaviour and system as:

drawit-diagram1

If you are not familiar with signal processing and the block diagram in the above picture scares you, what you need to know is:

  • Audio data is real valued, and real valued signals have a magnitude spectrum that is symmetric about DC. In an audio editor you will only ever see one side, so you'll just need to imagine a symmetric reflection going from 0 to -pi (pi can be thought of as the Nyquist frequency of the audio, i.e. 24 kHz for a 48 kHz input).
  • Interpolators insert U-1 zeroes between each sample. This is analogous to shrinking the spectrum of the input by a factor U and concatenating U-1 copies of it. The copies of the spectrum are called "images". i.e.

    drawit-diagram

  • Decimators keep one sample out of every D and drop the other D-1. This is analogous to expanding the spectrum by a factor D and wrapping the result on top of itself (there is also an attenuation by D, but I will not draw that). The parts of the spectrum which have been wrapped back onto itself are called "aliases". i.e.

    Decimation-in-action
    In audio, aliasing represents a distortion component which usually sounds dreadful. The only way to avoid the aliasing distortions is to ensure that the input signal is band-limited prior to decimation.

  • The H(z) block is a filter: it convolves the samples that it sees with some other signal. This is analogous to multiplying the spectrum by some other shape.
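
To make the equivalence of the two definitions concrete, here is a small sketch that computes one output sample both ways: by drawing a straight line between neighbouring input samples, and by zero-stuffing, filtering with a triangular kernel and decimating. The function names are made up for this example, and integer U and D and zero-valued samples past the end of the input are assumptions of the sketch.

```c
/* Linear interpolation read at position m*D/U (in input samples).
 * Samples past the end of x are treated as zero. */
float lerp_resample_at(const float *x, int n, int m, int U, int D)
{
  int pos = m * D, idx = pos / U, r = pos % U;
  float frac = (float)r / U;
  float a = idx < n ? x[idx] : 0.0f;
  float b = idx + 1 < n ? x[idx + 1] : 0.0f;
  return a * (1.0f - frac) + b * frac;
}

/* The same output sample via zero-stuffing by U, convolving with the
 * triangular kernel h[j] = 1 - |j|/U and keeping every D-th sample. */
float triangle_resample_at(const float *x, int n, int m, int U, int D)
{
  float acc = 0.0f;
  int j;
  for (j = -(U - 1); j <= U - 1; j++) {
    int i = m * D - j; /* index into the zero-stuffed signal */
    if (i < 0 || i % U != 0 || i / U >= n)
      continue;        /* a stuffed zero, or outside the input */
    acc += (1.0f - (float)(j < 0 ? -j : j) / U) * x[i / U];
  }
  return acc;
}
```

At most two kernel taps ever land on a non-zero sample, which is where the two multiplies per output sample for linear interpolation come from.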

So, the interpolation operation introduces images which H(z) needs to remove, and the decimation operation will introduce aliases if we try to decimate too much. Typically, in our real-time re-sampler use, we like to fix the interpolation factor U and permit D to vary (this allows us to use an efficient implementation structure). For H(z) to block the images, we know that it must preserve as much as possible the first 1/U component of the spectrum and must attenuate heavily everything from that point up. Here is the response of H(z) for a linear interpolator based re-sampler with an up sampling factor of 4:

Linear interpolator response for up-sample factor of 4.

This is not good - remember, we wanted the spectrum to preserve as much signal as possible for the first quarter of the spectrum and attenuate everything everywhere else. We can see that the worst case level of an imaging component will be about 6 dB below the signal level. It's worth mentioning here that the problem does not get any better for higher values of U.

There is nothing stopping us from using a proper low-pass filter for H(z) instead of the triangular shape. Here are a few other options for use as a comparison:

Other filter options
The linear interpolator with two additional FIR filters.

The blue and green responses correspond to 8*U and 12*U length FIR filters respectively. These are both considerably longer than the linear interpolator, which has a filter of length 2*U. The way these filters were designed is outside the scope of this article. The red linear interpolator response costs two multiplies per sample to run, the blue costs eight, the green costs twelve - so these filters show a tradeoff between filter quality and implementation complexity. Note that both the blue and green filters achieve at least 50 dB of aliasing rejection - but we pay for this in the passband performance. If the input were an audio signal sampled at 48 kHz, the frequencies between 0 and 24 kHz would map to the frequency range on the graph between 0 and 0.25 (as we are interpolating by a factor of 4). At 18 kHz, we are attenuating the signal by about 11 dB; at 21 kHz, we are attenuating by about 31 dB. There is an interesting question here as to whether this matters, as the frequency is so high. We can get around this to a certain extent by pre-equalising our samples to give a subtle high-frequency boost - but that is an extra complexity in the sampling software. Really the only way to make the cutoff sharper is to use longer filters - and that's not really an option if performance is important.

Here are some examples comparing the output of the above 8-tap per output sample filter to the linear interpolation filter:

Downsampling-by-2 of a white noise signal
A white noise input being resampled using 8-tap polyphase filters (left) and linear interpolation (right).

The input signal to the above output was white noise. Given that the input signal only had content from 0-24 kHz, we would expect that the output signal would only contain information from 0-12 kHz after halving the playback rate. We can see the linear interpolator has "created" a large amount of data in the high frequency region (all from badly attenuated images and aliasing). The "designed" filter attenuates the aliasing heavily but also attenuates some of the high frequency components of the input noise signal.

Upsampling by 1487/1024 of a complex tonal signal
A complex tonal input being resampled using 8-tap polyphase filters (left) and linear interpolation (right).

The input signal to the above output was a set of tones separated by octaves in frequency. The aliasing components of the spectrum have introduced inharmonic audible distortions in the linear interpolation case. The "designed" filter almost eliminates the distortion. Magic.

I suppose this all comes down to complexity: two multiplies per output sample for linear interpolation vs. more-than-two for a different filter. I chose 8 and 12 for the taps per polyphase component in the examples on this page as I was able to get implementations of the re-sampler where the two sets of filter states (for stereo samples) were able to be stored completely in SSE registers on x64 - this greatly improves the performance of the FIR delay line operations.

Windows Phone, ownCloud and CardDAV and CalDAV

This is a combination of information from the following two locations:

What is this?

This page gives an outline of how to get your Windows Phone to sync calendars and contacts with an ownCloud instance running on a server using a self-signed certificate. This whole process is a hack and I'm incredibly disappointed that Windows Phone does not support this natively - especially since they have a CalDAV and CardDAV implementation already available (used by iCloud and Google accounts). If this process stops working at some point, that should be expected.

I successfully got this working on a Lumia 735.

Process

I'm assuming that you've got ownCloud installed on a server using SSL. When the certificate was set up, the FQDN must really be the FQDN, i.e. if your ownCloud instance is hosted at "a.b.com", the certificate must be for "a.b.com" - not "b.com". I had this wrong on my home server and was getting the "80072F0D" error on the phone. If you have not yet set up ownCloud or have not set up SSL for your server yet, there are sites already documenting this process: How to Create and Install an Apache Self Signed Certificate.

You then need to get the certificate installed on your phone. If you do not do this, you will get errors when the phone tries to start syncing. You can do this by opening the certificate in Internet Explorer on the phone (an easy way is to copy the certificate to your ownCloud files, log into your cloud in IE on the phone and open the file).

You should then be able to follow the instructions here: Setting up CardDAV and CalDAV on Windows Phone 8.1.

Which roughly reads:

  1. Create a fake iCloud account. Put garbage information in the fields and create it. The phone won't check anything.
  2. Modify the account, edit the advanced settings then change the servers for CalDAV and CardDAV (respectively) as:
    [domain]/remote.php/caldav/principals/[username]
    [domain]/remote.php/carddav/principals/[username]

Release Alignment in Sampled Pipe Organs - Part 1

At the most basic level, a sample from a digital pipe organ contains:

  • an attack transient leading into
  • a looped sustain block and
  • a release which will be cross-faded into when the note is released.

The release cross-fade must be fast (otherwise it will not sound natural or transient details may be lost) and it must also be phase-aligned to the point where the cross-fade begins.

The necessity for phase alignment

Without phase aligning the release, disturbing artefacts will likely be introduced. The effects are different with short and long cross-fades but are always unpleasant.

The following image shows an ideal cross-fade into a release sample. The crossfade begins at 0.1 seconds and lasts for 0.05 seconds. The release is aligned properly and the signal looks continuous.

A good crossfade into a release.

The following image shows a bad release where the cross-fade is lagging an ideal release offset by half-a-period. Some cancellation occurs during the cross-fade and the result will either sound something like a "pluck" for long cross-fades or a "click" for short cross-fades.

A worst-case crossfade into a release.

(The cross-fade used in generating the above data sets was a raised cosine - linear cross-fades can be used but will result in worse distortions).

The problem of aligning release cross-fades in virtual pipe organs is an interesting one. As an example: at the time of writing this article, release alignment in the GrandOrgue project is not particularly good; it uses a lookup-table taking the value and first-order estimated derivative (both quantised heavily) of the last sample of the last played block as keys. This is not optimal as a single sample says nothing about phase and the first-order derivative estimate could be completely incorrect in the presence of noise.

Another approach for handling release alignment

If the pitch of a pipe were completely stable and known (f=\frac{1}{T}), and we knew one point where the release was perfectly aligned (t_r), we know that we could cross-fade into the start of the release at:

 \forall n \in \mathbb Z, t = t_r + T n

Hence, for any sample offset we could compute an offset into the release to cross-fade into.

In reality, pipe pitch wobbles around a bit and so the above would not strictly hold all the time - that being said, it is true for much of the time. If we could take a pipe sample and find all of the points where the release is aligned we could always find the best way to align the release.

It turns out that a simple way to do this is to find the cross-correlation of the attack and sustain segment with a short portion of the release. Taking the whole release would be problematic because as it decays it becomes less similar to the sustaining segment (which leads to an unhelpful correlation signal).

The first 25000 samples of the signal used for the cross-correlation.

The above image shows the attack and some sustain of bottom-C of the St. Augustine's Closed Horn. This shows visually why single sample amplitude and derivative matching is a poor way to align releases. During one period of the closed horn, there are 14 zero crossings and 16 obvious zero crossings in the derivative. One sample gives hardly enough information.

A 1024 sample cut from the start of the release.

The above image shows a 1024 sample segment taken from the release marker of the same Closed Horn sample. It contains just over a single period of the horn.

The next image shows the cross-correlation of this release segment with the sample itself. My analysis program does correlation of the left and right channels and sums them to provide an overall correlation. Positive maximums correspond to points where the release will phase-align well. Minimums correspond to points where the signal has the least correlation to the release.

Normalised cross correlation of the signal with the release segment.

Using the correlation and a pitch guesstimate, we could construct a function which given any sample offset in the attack/sustain could produce an offset into the release which we should cross-fade into. This is for next time.

Nick Appleton's audio, electronics and software projects… and blog thing