
Circular Convolution, Discrete Fourier Transforms and Toeplitz Matrix Multiplication

Circular Convolution

It’s fairly well known that multiplying two discrete Fourier transforms element-wise and taking the inverse transform results in a circular convolution. This is fairly straightforward to prove given the definitions of the DFT and its inverse.

Given the DFT:

    \[ X \left[ k \right] = \sum_{n=0}^{N-1} e^{-\frac{j 2 \pi n k}{N}} x \left[ n \right] \]

And the inverse DFT:

    \[ x \left[ n \right] = \frac{1}{N} \sum_{k=0}^{N-1} e^{\frac{j 2 \pi n k}{N}} X \left[ k \right] \]

We can derive convolution as:

    \[ \begin{array}{@{}ll@{}} y \left[ n \right] &= \frac{1}{N} \sum_{k=0}^{N-1} e^{\frac{j 2 \pi n k}{N}} \left( \sum_{m=0}^{N-1} e^{-\frac{j 2 \pi m k}{N}} a \left[ m \right] \right) \left( \sum_{l=0}^{N-1} e^{-\frac{j 2 \pi l k}{N}} b \left[ l \right] \right) \\ &= \sum_{m=0}^{N-1} \sum_{l=0}^{N-1} a \left[ m \right] b \left[ l \right] \frac{1}{N} \sum_{k=0}^{N-1} e^{\frac{j 2 \pi \left( n - l - m \right) k}{N}} \\ &= \sum_{l=0}^{N-1} a \left[ \left( n - l \right) \bmod N \right] b \left[ l \right] \end{array} \]

The last step in the above recognises that the summation over k equals N when (n - l - m) is a multiple of N and is zero otherwise; since m lies between 0 and N-1, this forces m = (n - l) mod N and collapses the sum over m. We can write the above in matrix form as:

    \[ \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{N-1} \end{pmatrix} = \begin{pmatrix} a_0 & a_{N-1} & a_{N-2} & \dots & a_1 \\ a_1 & a_0 & a_{N-1} & & \\ a_2 & a_1 & a_0 & \ddots & \\ \vdots & & \ddots & \ddots & a_{N-1} \\ a_{N-1} & & & a_1 & a_0 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{N-1} \end{pmatrix} \]

The matrix of a coefficients is a circulant matrix: each row is a shifted copy of the preceding row. Given that there exist \mathcal{O} \left( N \log N \right) algorithms for computing the DFT, we have shown that multiplying a vector by a circulant matrix has an efficient algorithm (note – this only pays off computationally for large N).
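
As a concrete illustration (my own sketch, not part of the original post), the following Python/numpy snippet multiplies a vector by a circulant matrix via the FFT and checks the result against the direct matrix product:

    import numpy as np

    def circulant_matvec(a, b):
        # y[n] = sum_l a[(n - l) mod N] * b[l], computed as IDFT(DFT(a) * DFT(b)).
        # The FFT makes this O(N log N) instead of O(N^2).
        return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))

    N = 8
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(N), rng.standard_normal(N)

    # Build the circulant matrix explicitly (first column is a) for comparison.
    C = np.array([[a[(i - j) % N] for j in range(N)] for i in range(N)])
    assert np.allclose(circulant_matvec(a, b).real, C @ b)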

Circular Convolution with a Generalised DFT

Let’s redefine our DFT as:

    \[ X \left[ k \right] = \sum_{n=0}^{N-1} e^{-\frac{j 2 \pi n \left( k + \alpha \right) }{N}} x \left[ n \right] \]

Which has an inverse of:

    \[ x \left[ n \right] = \frac{1}{N} \sum_{k=0}^{N-1} e^{\frac{j 2 \pi n \left( k + \alpha \right)}{N}} X \left[ k \right] \]

This generalisation gives us some control over the boundary conditions of the DFT and hence the assumed periodicity of the data: the ordinary DFT assumes the transformed data repeats verbatim forever, and \alpha lets us change that assumption. Let’s derive the convolution again:

    \[ \begin{array}{@{}ll@{}} y \left[ n \right] &= \frac{1}{N} \sum_{k=0}^{N-1} e^{\frac{j 2 \pi n \left( k + \alpha \right)}{N}} \left( \sum_{m=0}^{N-1} e^{-\frac{j 2 \pi m \left( k + \alpha \right)}{N}} a \left[ m \right] \right) \left( \sum_{l=0}^{N-1} e^{-\frac{j 2 \pi l \left( k + \alpha \right)}{N}} b \left[ l \right] \right) \\ &= \sum_{m=0}^{N-1} \sum_{l=0}^{N-1} a \left[ m \right] b \left[ l \right] \frac{1}{N} \sum_{k=0}^{N-1} e^{\frac{j 2 \pi \left( n - l - m \right) \left( k + \alpha \right)}{N}} \\ \end{array} \]

Given this, we can now draw up some matrices for various values of \alpha. Some interesting (and perhaps useful) values are \frac{1}{4}, \frac{3}{4} and \frac{1}{2}. We will restrict our attention to \frac{1}{2}, which has a slightly different form once we obliterate the exponential (i.e. evaluate the sum over k):

    \[ y \left[ n \right] = \sum_{l=0}^{N-1} \left( -1 \right)^{\left\lfloor \frac{n - l}{N} \right\rfloor} a \left[ \left( n - l \right) \bmod N \right] b \left[ l \right] \]

We find that the matrix operation for this looks like:

    \[ \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{N-1} \end{pmatrix} = \begin{pmatrix} a_0 & -a_{N-1} & -a_{N-2} & \dots & -a_1 \\ a_1 & a_0 & -a_{N-1} & & \\ a_2 & a_1 & a_0 & \ddots & \\ \vdots & & \ddots & \ddots & -a_{N-1} \\ a_{N-1} & & & a_1 & a_0 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{N-1} \end{pmatrix} \]

This matrix is no longer circulant; all the entries to the right of the main diagonal have been negated (such a matrix is sometimes called a skew-circulant matrix). This convolution might not have practical value by itself, but the symmetry suggests that it might have value when combined with another.
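
To check this form numerically, here is a small sketch of mine (not from the post): the \alpha = 1/2 transform is just an ordinary FFT applied after multiplying the input by the twiddle e^{-j \pi n / N}, and the resulting convolution matches the negated-upper-triangle matrix above:

    import numpy as np

    def gdft_half(x):
        # Generalised DFT with alpha = 1/2:
        # X[k] = sum_n exp(-j 2 pi n (k + 1/2) / N) x[n] = FFT(exp(-j pi n / N) * x[n]).
        N = len(x)
        return np.fft.fft(np.exp(-1j * np.pi * np.arange(N) / N) * x)

    def igdft_half(X):
        # Inverse transform: ordinary inverse FFT followed by the opposite twiddle.
        N = len(X)
        return np.exp(1j * np.pi * np.arange(N) / N) * np.fft.ifft(X)

    N = 8
    rng = np.random.default_rng(1)
    a, b = rng.standard_normal(N), rng.standard_normal(N)

    y = igdft_half(gdft_half(a) * gdft_half(b))

    # The equivalent matrix: circulant in a, but with the upper triangle negated.
    S = np.array([[a[(i - j) % N] * (1.0 if i >= j else -1.0) for j in range(N)]
                  for i in range(N)])
    assert np.allclose(y.real, S @ b)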

Toeplitz Matrix Vector Multiplication

A Toeplitz matrix has the form:

    \[ \begin{pmatrix} t_0 & t_{-1} & t_{-2} & \dots & t_{-(N-1)} \\ t_1 & t_0 & t_{-1} & & \\ t_2 & t_1 & t_0 & \ddots & \\ \vdots & & \ddots & \ddots & t_{-1} \\ t_{N-1} & & & t_1 & t_0 \end{pmatrix} \]

There are efficient algorithms for performing Toeplitz matrix-vector multiplication that use circular convolution; the usual approach embeds the Toeplitz matrix in a larger circulant matrix, pads the input vector with zeroes and discards the unwanted part of the result, so much of the computation is wasted. We can avoid this by using the symmetry defined in the previous section.

If we take the sum of a regular DFT convolution of a and x and the previously defined convolution of b and x, we are effectively computing the following matrix operation:

    \[ \begin{small} \begin{pmatrix} a_0+b_0 & a_{N-1}-b_{N-1} & a_{N-2}-b_{N-2} & \dots & a_1-b_1 \\ a_1+b_1 & a_0+b_0 & a_{N-1}-b_{N-1} & & \\ a_2+b_2 & a_1+b_1 & a_0+b_0 & \ddots & \\ \vdots & & \ddots & \ddots & a_{N-1}-b_{N-1} \\ a_{N-1}+b_{N-1} & & & a_1+b_1 & a_0+b_0 \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{N-1} \end{pmatrix} \end{small} \]

We can select values for a and b to create a multiplication by any matrix with equal elements along each diagonal – such as our Toeplitz matrix – using only DFTs of the length of the input data sequence. In particular, choosing a_0 = t_0, b_0 = 0 and, for k from 1 to N-1, a_k = \frac{t_k + t_{k-N}}{2} and b_k = \frac{t_k - t_{k-N}}{2} reproduces the Toeplitz matrix above. There are some interesting optimisations that can be made when the matrix is Hermitian to eliminate some of the DFTs entirely.
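
Putting the pieces together, here is a hedged numpy sketch of the whole procedure (my own illustration of the idea, not the author’s implementation). It splits the Toeplitz coefficients into the a and b sequences described above, runs one ordinary circular convolution and one \alpha = 1/2 convolution, and sums the results:

    import numpy as np

    def circ_conv(a, x):
        # Ordinary circular convolution (multiplication by a circulant matrix).
        return np.fft.ifft(np.fft.fft(a) * np.fft.fft(x))

    def skew_conv(a, x):
        # alpha = 1/2 convolution (multiplication by the negated-upper-triangle matrix).
        N = len(a)
        w = np.exp(-1j * np.pi * np.arange(N) / N)
        return np.conj(w) * np.fft.ifft(np.fft.fft(w * a) * np.fft.fft(w * x))

    def toeplitz_matvec(t, x):
        # t holds the diagonal values t_{-(N-1)} ... t_0 ... t_{N-1} (length 2N - 1),
        # so t[N - 1 + k] is t_k. Returns T @ x using only length-N transforms.
        N = len(x)
        tk = t[N - 1:]                              # t_0 ... t_{N-1}  (first column)
        tkm = np.concatenate(([0.0], t[:N - 1]))    # placeholder, then t_{1-N} ... t_{-1}
        a = np.empty(N)
        b = np.empty(N)
        a[0], b[0] = tk[0], 0.0
        a[1:] = (tk[1:] + tkm[1:]) / 2              # a_k + b_k = t_k (lower triangle)
        b[1:] = (tk[1:] - tkm[1:]) / 2              # a_k - b_k = t_{k-N} (upper triangle)
        return (circ_conv(a, x) + skew_conv(b, x)).real

    N = 6
    rng = np.random.default_rng(2)
    t, x = rng.standard_normal(2 * N - 1), rng.standard_normal(N)
    T = np.array([[t[i - j + N - 1] for j in range(N)] for i in range(N)])
    assert np.allclose(toeplitz_matvec(t, x), T @ x)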

On the perils of cross-fading loops in organ samples

One common strategy when looping problematic organ samples is to employ a cross-fade. This is an irreversible audio modification that gradually transitions the samples leading up to the end of a loop into the samples that were leading into the start of the loop. The goal is to completely eliminate any impulsive glitch and hopefully also create a “good spectral match”. While eliminating glitches this way is certainly possible, creating a good spectral match might not be so simple.

We can think of an organ sample as being a linear combination of two signals:

  • A voiced/correlated/predictable component (choose your word) which represents the tonal part of the sample. Strictly speaking, the tonal component is not entirely predictable – there are continuous subtle variations in the speech of a pipe… but they are typically small and we will assume predictability.
  • An unvoiced/uncorrelated/unpredictable component which represents the pipe-noise of the sample.

Both of these components are necessary for realism in a sample.

The following image shows a cross-fade of two entirely correlated signals. The top picture contains the input signals and the cross-fade transition that will be used; the bottom contains the input signals with the cross-faded signal overlaid on top. There is nothing interesting about this picture because the two input signals were the same: the cross-fade transitions between two identical signals, so the output signal is equal to both of the inputs.

Crossfade of correlated signal

The next image shows a cross-fade of the same correlated signal with some uniform white noise overlaid on top. What is going on in the middle of the output signal? It looks like it has been attenuated a little bit and no longer appears to have a uniform distribution.

Crossfade of correlated signal with uniform noise

This final image shows a cross-fade consisting purely of two uniform random signals, to add some more detail.

Crossfade of uniform noise

It turns out that summing two uniformly distributed random variables yields a new random variable with a triangular distribution (this is how noise for triangular dither gets generated). During the cross-fade, the distribution of the uncorrelated components of the signal is actually changed. Not only that, but the power of the signal is reduced in the middle of the transition. Here is an audio sample of two white noise streams being cross-faded over 2 seconds:

Listen for the level drop in the middle.
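
To put a rough number on the level drop (my own back-of-the-envelope illustration, assuming a linear cross-fade): at the midpoint both streams are scaled by one half, and because the noise streams are uncorrelated their powers add rather than their amplitudes, so the noise power halves – about a 3 dB dip. A short numpy sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1_000_000
    x = rng.uniform(-1.0, 1.0, N)         # noise leading into the loop start
    y = rng.uniform(-1.0, 1.0, N)         # noise leading up to the loop end
    g = np.linspace(0.0, 1.0, N)          # linear cross-fade gain

    out = (1.0 - g) * y + g * x           # the cross-faded output

    rms = lambda s: np.sqrt(np.mean(s * s))
    start = out[: N // 50]                                # near the start of the fade
    middle = out[N // 2 - N // 100 : N // 2 + N // 100]   # around the midpoint

    # Prints roughly -3 dB: the uncorrelated powers add, not the amplitudes, so the
    # noise power at the midpoint is (1/2)^2 + (1/2)^2 = 1/2 of its original value.
    print("midpoint level: %.2f dB" % (20 * np.log10(rms(middle) / rms(start))))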

If the cross-fade is too long when looping a sample, it could make the pipe wind-noise duck over the duration of the cross-fade. While the effect is subtle, I have heard it in some samples and it’s not particularly natural. I suppose the TLDR advice I can give is:

  • If you can avoid cross-fading – avoid it.
  • If you cannot avoid cross-fading – make the cross-fade as short as possible (milliseconds) to avoid an obvious level-drop of noise during the transition.

Real-time re-sampling and linear interpolation?

Disclaimer: I’ve intentionally tried to keep this post “non-mathy” – I want it to provide a high-level overview of what linear interpolation does spectrally and some evidence as to why it’s probably not suited to audio processing… unless distortion is desirable.

In the context of constant pitch shifting (the input and output signals have a fixed sampling rate), linear interpolation treats the discrete input signal as continuous by drawing straight lines between the discrete samples. The output signal is constructed by picking values at regular times.

Linear interpolator intuition

In the above, the green arrows are the input samples and the red arrows are where we want to pick the values off. It’s an intuitive answer to “find the missing values”, but what does it actually do to an audio signal? To find out, it helps to look at the problem in a different way. We can restate the intuitive definition as follows: the linear interpolator interpolates the input signal by some factor (meaning: inserts a fixed number of zeroes between each input sample), convolves the result with a triangular-shaped filter kernel, then decimates by some other factor. This is not quite as obvious as the previous definition, but it is identical, and we can draw the behaviour and system as:

Block diagram of the equivalent system: interpolate by U, filter with H(z), decimate by D.

If you are not familiar with signal processing and the block diagram in the above picture scares you, what you need to know is:

  • Audio data is real-valued, and real-valued signals have a magnitude spectrum that is symmetric about DC. In an audio editor you will only ever see one side, so you’ll just need to imagine a symmetric reflection going from 0 to -pi (pi can be thought of as the Nyquist frequency of the audio, i.e. 24 kHz for a 48 kHz input).
  • Interpolators insert U-1 zeroes between each sample. This is analogous to shrinking the spectrum of the input by a factor U and concatenating U-1 copies of it. The copies of the spectrum are called “images”. i.e.

    Spectrum after interpolation, showing the shrunk original and its images.

  • Decimators keep one sample out of every D and drop the other D-1. This is analogous to expanding the spectrum by a factor D and wrapping the result on top of itself (there is also an attenuation by D, but I will not draw that). The parts of the spectrum which have been wrapped back onto the original are called “aliases” (a short numerical sketch of both images and aliases follows this list). i.e.

    Decimation in action: the spectrum expands and wraps onto itself, creating aliases.
    In audio, aliasing represents a distortion component which usually sounds dreadful. The only way to avoid the aliasing distortions is to ensure that the input signal is band-limited prior to decimation.

  • The H(z) block is a filter: it convolves the samples passing through it with some other signal (its impulse response). This is analogous to multiplying the spectrum by some other shape.
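
Both effects are easy to see numerically. The sketch below (my own, with arbitrarily chosen tone frequencies) up-samples a tone by zero-insertion and lists the DFT bins that light up (the images), then decimates a tone that sits above the new Nyquist frequency and shows where it folds to (the alias):

    import numpy as np

    N, U, D = 64, 4, 4
    n = np.arange(N)

    # Interpolation: insert U-1 zeroes between samples. A single tone in bin 5 of a
    # 64-point DFT shows up U times across the longer spectrum -- the extra copies
    # are the images.
    x = np.sin(2 * np.pi * 5 * n / N)
    up = np.zeros(N * U)
    up[::U] = x
    print(np.flatnonzero(np.abs(np.fft.rfft(up)) > 1.0))       # -> [  5  59  69 123]

    # Decimation: keep one sample in every D. A tone well above the new Nyquist folds
    # back into the band below it -- this folded copy is an alias.
    hi = np.sin(2 * np.pi * 30 * n / N)
    print(np.flatnonzero(np.abs(np.fft.rfft(hi[::D])) > 1.0))   # -> [2]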

So, the interpolation operation introduces images which H(z) needs to remove, and the decimation operation will introduce aliases if we try to decimate too much. Typically, in our real-time re-sampler use, we like to fix the interpolation factor U and permit D to vary (this allows us to use an efficient implementation structure). For H(z) to block the images, it must preserve as much as possible of the first 1/U of the spectrum and heavily attenuate everything from that point up. Here is the response of H(z) for a linear-interpolator-based re-sampler with an up-sampling factor of 4:

Linear interpolator response for up-sample factor of 4.

This is not good – remember, we wanted to preserve as much of the signal as possible over the first quarter of the spectrum and attenuate everything else. We can see that the worst case level of an imaging component will be about 6 dB below the signal level. It’s worth mentioning here that the problem does not get any better for higher values of U.
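
For anyone wanting to reproduce the curve, here is a short sketch (mine, not from the post) that evaluates the magnitude response of the length-2U triangular kernel used by the linear interpolator. The numbers it prints are in the same ballpark as the figure: the band we want to keep is already drooping at its edge, and the image region directly above it is only a handful of dB down.

    import numpy as np

    U = 4
    h = np.concatenate((np.arange(U), np.arange(U, 0, -1))) / U   # [0 1 2 3 4 3 2 1] / 4
    H = np.abs(np.fft.rfft(h, 4096))                    # dense magnitude response, 0 to pi
    HdB = 20 * np.log10(np.maximum(H, 1e-12) / H[0])    # normalise DC to 0 dB, avoid log(0)

    edge = 4096 // (2 * U)                              # bin at the edge of the wanted 1/U band
    print("attenuation at the band edge   : %.1f dB" % HdB[edge])
    print("worst level above the band edge: %.1f dB" % HdB[edge:].max())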

There is nothing stopping us from using a proper low-pass filter for H(z) instead of the triangular shape. Here are a few other options for use as a comparison:

Other filter options

The linear interpolator with two additional FIR filters.

The blue and green responses correspond to 8*U and 12*U length FIR filters respectively. These are both considerably longer than the linear interpolator, which has a filter of length 2*U. The way these filters were designed is outside the scope of this article. The red linear interpolator response costs two multiplies per output sample to run, the blue costs eight and the green costs twelve – so these filters show a trade-off between filter quality and implementation complexity. Note that both the blue and green filters achieve at least 50 dB of aliasing rejection – but we pay for this in the passband performance. If the input were an audio signal sampled at 48 kHz, the frequencies between 0 and 24 kHz would map to the range on the graph between 0 and 0.25 (as we are interpolating by a factor of 4). At 18 kHz we are attenuating the signal by about 11 dB; at 21 kHz we are attenuating by about 31 dB. There is an interesting question here as to whether this matters given how high these frequencies are. We can get around it to a certain extent by pre-equalising our samples to give a subtle high-frequency boost – but that is extra complexity in the sampling software. Really, the only way to make the cutoff sharper is to use longer filters – and that’s not really an option if performance is important.

Here are some examples comparing the output of the above 8-tap per output sample filter to the linear interpolation filter:

Downsampling-by-2 of a white noise signal

A white noise input being resampled using 8-tap polyphase filters (left) and linear interpolation (right).

The input signal to the above output was white noise. Given that the input signal only had content from 0-24 kHz, we would expect that the output signal would only contain information from 0-12 kHz after halving the playback rate. We can see the linear interpolator has “created” a large amount of data in the high frequency region (all from badly attenuated images and aliasing). The “designed” filter attenuates the aliasing heavily but also attenuates some of the high frequency components of the input noise signal.

Upsampling by 1487/1024 of a complex tonal signal

A complex tonal input being resampled using 8-tap polyphase filters (left) and linear interpolation (right).

The input signal to the above output was a set of tones separated by octaves in frequency. The aliasing components of the spectrum have introduced inharmonic audible distortions in the linear interpolation case. The “designed” filter almost eliminates the distortion. Magic.

I suppose this all comes down to complexity: two multiplies per output sample for linear interpolation vs. more than two for a different filter. I chose 8 and 12 taps per polyphase component in the examples on this page because I was able to get implementations of the re-sampler where the two sets of filter states (for stereo samples) could be stored completely in SSE registers on x64 – this greatly improves the performance of the FIR delay line operations.
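
To round the post off, here is a rough Python sketch of the fixed-U, fractional-step polyphase structure being described (my own illustrative code – it bears no resemblance to the SSE implementation above, and the kernel shown is just the linear-interpolation triangle, which you would replace with a properly designed filter):

    import numpy as np

    def resample(x, h, U, step):
        # Fixed-U polyphase fractional re-sampler sketch. h is the low-pass prototype
        # designed at U times the input rate (DC gain of U, length a multiple of U);
        # step is the output sample spacing measured in input samples.
        phases = h.reshape(-1, U).T          # phases[p] = h[p], h[p + U], h[p + 2U], ...
        taps = phases.shape[1]
        hist = np.concatenate((np.zeros(taps - 1), x))
        out, t = [], 0.0
        while t < len(x):
            n = int(t)                       # integer part: which input samples to use
            p = int((t - n) * U)             # fractional part, quantised to one of U phases
            seg = hist[n:n + taps][::-1]     # the most recent taps samples, newest first
            out.append(np.dot(phases[p], seg))
            t += step
        return np.array(out)

    # The linear interpolator is the special case of a length-2U triangular kernel.
    U = 4
    h_linear = np.concatenate((np.arange(U), np.arange(U, 0, -1))) / U
    x = np.sin(2 * np.pi * 440.0 * np.arange(4800) / 48000.0)
    y = resample(x, h_linear, U, 1024.0 / 1487.0)    # the ratio from the tonal example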