Building a digital filter for use in synthesisers

This is a tutorial on how to build a digital implementation of a 2nd-order, continuously-variable filter (i.e. one where you can change the parameters runtime) that has dynamic behaviour that mimics an analogue filter. The digital model is linear (i.e. we are not going to model any transistor/diode behavior – we’ll do that in another blog). The resulting implementation is suitable for use in a digital synthesiser or equaliser. It is not attempting to be a perfect model – but it should be quite close.

The tutorial follows this rough guide:

  • Difference equations and a recap on state-space.
  • Description of the circuit we will be modelling (this can be skipped – it is included for completeness).
  • Expressing the desired system digitally using a partially-transformed state-space model.
  • Making the filter more useful
  • Summary of implementation and demo video

Difference equations and state-space recap

Most EE grads know how to get a digital implementation of an analogue filter:

  1. Pick a transfer function and select values for the natural frequency (\omega) [rads/second] and the dimensionless Q-factor (Q) – these two controls are like the “frequency” and “resonance” knobs on a synth filter.

        \[ \large \begin{array}{cl} \text{Lowpass} & \frac{\omega^2}{s^2 + \frac{\omega}{Q}s + \omega^2} \\ \text{Highpass} & \frac{s^2}{s^2 + \frac{\omega}{Q}s + \omega^2} \\ \text{Bandpass} & \frac{\frac{\omega}{Q}s}{s^2 + \frac{\omega}{Q}s + \omega^2} \\ \text{Bandstop} & \frac{s^2 + \omega^2}{s^2 + \frac{\omega}{Q}s + \omega^2} \end{array} \]

  2. Discretise this using your favourite method (e.g. the Bilinear Transformation) to obtain a transfer function in z.
  3. Implement the difference equation which has coefficients trivially taken from the previous step.

More detail on those last two steps can be found in this older blog Three step digital Butterworth design.

This process works fairly well and difference equations are relatively-inexpensive to implement. But what if we want to attach \omega and Q to knobs a user could twiddle (as is the case if we were building a synth)? Can we simply update the difference equation coefficients as the system processes data?

The answer is: it depends. Assuming you are using the bilinear transform, the discretisation process will always result in coefficients that are stable i.e. if the input is zero, the output will eventually decay to zero and changing the difference equation coefficients on the fly will never cause the system to end up in some state which it cannot recover from. However, the transient that the coefficient change introduces may be inappropriate for the use case. For example: if changing the coefficients can cause a large burst of energy in the output that takes thousands of samples to decay – well, that might not be okay in a synthesiser.

In the State-Space Representation – An Engineering Primer post, we went into a decent amount of detail of the benefits of working in state-space representation. The main relevant takeaway from that post is that while the transfer function representation is unique for a particular system, the state-space representation is not. A change of coordinates can be applied to the states which does not impact the steady-state response of the filter – but this change will impact the transient response to system parameter changes!

The state vector (\mathbf{Q}[n] using terminology from the old blog) of a system derived from a difference equation consists of previous output samples from an all-pole filter. Every sample processed through the system shifts the states of the vector (discarding one element) and creates a new value. Think about this for a while as there is some intuition to be had. Consider a low-pass, very-low-frequency and under-damped (i.e. exhibits an exponentially-decaying sinusoidal impulse response) system. When an impulse is pushed in, the magnitude of all elements in the state vector slowly rise, eventually all have similar values (near the peak), then all start falling until they all have similar values again in the trough. A small error introduced in the state when all values are the same could lead the filter onto a completely different trajectory. This was demonstrated again in the state-space primer blog with the 6th order system’s inability to run with a difference equation with single-precision floats.

So how do we pick the 2nd order system to use? For a synth, the answer is probably: whatever sounds “best” as perceived by the designer. All systems will exhibit different behavior as their coefficients change. In this blog, we will chose an analogue circuit to approximately model – but that is just that: a choice.

The analogue circuit

The circuit below is what we are going to model. It is a Voltage-Controlled Voltage Source (VCVS) filter topology. The VCVS design permits the Q-factor and natural frequency to be controlled independently. I have explicitly labeled the voltages across the two capacitors because these are the states we are going to try and preserve in our digital implementation i.e. we’re not going to use the two previous all-pole output values as with a typical difference equation. By doing this, when we vary parameters, it will be like changing component values while leaving the instantaneous voltages across those capacitors the same – this is realistic, we can vary resistors by replacing them with potentiometers.

Circuit diagram of the VCVS filter topology we will be modelling.

VCVS filter topology

Explaining how to apply circuit analysis is way out of scope for this blog. But one set of equations that describes the circuit in the Laplace domain is given below:

    \[ \large \begin{array}{rl} sV_{c1}(s) &= \frac{-2V_{c1}(s) - \frac{R_b+2R_q}{R_b}V_{c2}(s)}{RC} + \frac{V_i(s)}{RC} \\ sV_{c2}(s) &= \frac{V_{c1}(s) + \frac{R_q}{R_b}V_{c2}(s)}{RC} \\ V_o(s) &= \frac{R_b+R_q}{R_b} V_{c2}(s) \end{array} \]

Before we go further, we will make some substitutions in the above equations to simplify further working. Define k=\frac{R_q}{R_b} and \omega=\frac{1}{RC} to get:

    \[ \large \begin{array}{rl} sV_{c1}(s) &= -2 \omega V_{c1}(s) - \left(2k+1\right)\omega V_{c2}(s) + \omega V_i(s) \\ sV_{c2}(s) &= \omega V_{c1}(s) + k \omega V_{c2}(s) \\ V_o(s) &= \left(k+1\right) V_{c2}(s) \end{array} \]

At this point, we can do some substitution of the equations above to find \frac{V_o(s)}{V_i(s)} i.e. the unique transfer function of the circuit:

    \[ \large \frac{V_o(s)}{V_i(s)} = \frac{\omega^2}{s^2 + s\omega(2-k) + \omega^2}(k+1) \]

This looks very similar to the resonant lowpass function given previosuly with Q=\frac{1}{2-k} and an additional gain of k+1 tacked on.

Given that there are no common components in determining \omega and k, the analogue VCVS filter topology has the ability to provide independent control of Q and \omega (although, the overall gain of the system is affected by Q).

We’ve stated what we want: the states to approximate the voltages across the capacitors. We know that we cannot do that if we use a standard difference equation obtained via a transfer function. So what next?

Straight to state-space

We derived the equations for the analogue model carefully in the previous section. We ensured that the s terms only existed on the left hand side and only multiplied each of the states we said we were interested in. This enables us to write out the state-space model directly as follows:

    \[ \large \begin{array}{rll} s \left[ \begin{array}{c} V_{c1}(s) \\ V_{c2}(s) \end{array} \right] &= \left[ \begin{array}{cc} -2\omega & -\left(2k+1\right)\omega \\ \omega & k\omega \end{array} \right] \left[ \begin{array}{c} V_{c1}(s) \\ V_{c2}(s) \end{array} \right] & + \left[ \begin{array}{c} \omega \\ 0 \end{array} \right] V_i(s) \\ V_o(s) &= \left[ \begin{array}{c} 0 & k+1 \end{array} \right] \left[ \begin{array}{c} V_{c1}(s) \\ V_{c2}(s) \end{array} \right] &+ 0 V_i(s) \end{array} \]

If we define the state vector \mathbf{Q}(s) = \left[ \begin{array}{cc} V_{c1}(s) & V_{c2}(s) \end{array} \right]^T, the input vector \mathbf{X}(s)=V_i(s) and the output vector \mathbf{Y}(s)=V_o(s) we can write the above in the usual state variable representation:

    \[ \large \begin{array}{rll} s \mathbf{Q}(s) &= \mathbf{A} \mathbf{Q}(s) + \mathbf{B} \mathbf{X}(s) \\ \mathbf{Y}(s) &= \mathbf{C} \mathbf{Q}(s) + \mathbf{D} \mathbf{X}(s) \end{array} \]

The top equation is the state equation. The bottom is the output equation. Recall that the state-space mapping is not unique. We can change the coordinates of the \mathbf{Q}(s) vector using any invertible matrix as demonstrated in the state-space primer post. But here, we have intentionally chosen the states we want to use so that our system behaves similarly to the circuit. What we will now do is apply the bilinear transform to this system in such a way that our output equation remains the same as in the continuous system – which it must if we are actually modelling the circuit. Defining the bilinear transform as:

    \[ \large s=\alpha\frac{z-1}{z+1} \]

Where \alpha is related to the sampling rate of the system we are building and can be used to control the specific frequency in the frequency response that maps exactly to the s-domain transfer function (see the wikipedia page for the bilinear transform for more information). Make the substitution to the state update equation and attempt to make the left hand side z\mathbf{Q}(z):

    \[ \normalsize \begin{array}{rl} \alpha\frac{z-1}{z+1} \mathbf{Q}(z) &= \mathbf{A} \mathbf{Q}(z) + \mathbf{B} \mathbf{X}(z) \\ \alpha\left(z-1\right) \mathbf{Q}(z) &= \left(z+1\right) \mathbf{A}\mathbf{Q}(z) + \left(z+1\right) \mathbf{B} \mathbf{X}(z) \\ z\left(\alpha\mathbf{I}-\mathbf{A}\right)\mathbf{Q}(z) &= \left(\mathbf{\alpha \mathbf{I} + A}\right) \mathbf{Q}(z) + \left(z+1\right) \mathbf{B} \mathbf{X}(z) \\ z\mathbf{Q}(z) &= \left(\alpha\mathbf{I}-\mathbf{A}\right)^{-1}\left(\mathbf{\alpha \mathbf{I} + A}\right) \mathbf{Q}(z) + \left(z+1\right) \left(\alpha\mathbf{I}-\mathbf{A}\right)^{-1}\mathbf{B} \mathbf{X}(z) \end{array} \]

Note that the input mixing term includes (z+1). Normally, this term would be moved into the output equation (so that it can be eliminated) and would cause the \mathbf{C} and \mathbf{D} matrices to change indicating that we have changed the states – but we don’t want this! Instead, we simply implement that operation on the input before mixing it.

The only thing that remains is to make the system causal i.e. the system has the following form:

    \[ \large \begin{array}{rll} z \mathbf{Q}(z) &= \mathbf{A_d} \mathbf{Q}(z) + \left(z+1\right)\mathbf{B_d} \mathbf{X}(z) \\ \mathbf{Y}(z) &= \mathbf{C} \mathbf{Q}(z) + \mathbf{D} \mathbf{X}(z) \end{array} \]

This depends on future samples of \mathbf{X}(z). We multiply the state update equation by z^{-1} to get:

    \[ \large \begin{array}{rll} \mathbf{Q}(z) &= z^{-1} \mathbf{A_d} \mathbf{Q}(z) + \left(1+z^{-1}\right)\mathbf{B_d} \mathbf{X}(z) \\ \mathbf{Y}(z) &= \mathbf{C} \mathbf{Q}(z) + \mathbf{D} \mathbf{X}(z) \end{array} \]

Converting this into a state-space difference-equation we get:

    \[ \large \begin{array}{rll} \mathbf{Q}[n] &= \mathbf{A_d} \mathbf{Q}[n-1] + \mathbf{B_d}\left(\mathbf{X}[n] + \mathbf{X}[n-1]\right) \\ \mathbf{Y}[n] &= \mathbf{C} \mathbf{Q}[n] + \mathbf{D} \mathbf{X}[n] \end{array} \]

The above is what we implement. \mathbf{C} and \mathbf{D} are the same as what we derived for our continuous time system (\mathbf{C}=\left[\begin{array}{cc}0 & k+1 \end{array}\right] and \mathbf{D}=0). \mathbf{A_d} and \mathbf{B_d} are defined as:

    \[ \large \begin{array}{rl} \mathbf{A_d} &= \left(\alpha\mathbf{I}-\mathbf{A}\right)^{-1}\left(\mathbf{\alpha \mathbf{I} + A}\right) \\ \mathbf{B_d} &= \left(\alpha\mathbf{I}-\mathbf{A}\right)^{-1}\mathbf{B} \end{array} \]

Making it more useful

If we stopped here, we would have the lowpass filter the circuit represented. But who just wants a lowpass filter?

Based on how we designed the system earlier, all the properties of the circuit are modeled by the \mathbf{A_d} and \mathbf{B_d} matrices. The \mathbf{C} and \mathbf{D} matrices can be thought of as buffering values available at various points in the circuit and mixing them with various gains to create the output i.e. the \mathbf{C} and \mathbf{D} matrices can be thought to not really impact the circuit – only how the output is tapped from it. If we let \mathbf{C}=\left[\begin{array}{cc}c0&c1\end{array}\right] and \mathbf{D}=d_0 and expand the analogue transfer function, we get:

    \[ \large \begin{array}{rll} \mathbf{A} &= \left[ \begin{array}{cc} -2\omega & -\left(2k+1\right)\omega \\ \omega & k\omega \end{array} \right] \\ \mathbf{B} &= \left[ \begin{array}{c} \omega \\ 0 \end{array} \right] \\ \frac{V_o(s)}{V_i(s)} &= \mathbf{C}(s\mathbf{I} - \mathbf{A})^{-1}\mathbf{B} + d_0 \\ &= \frac{\mathbf{C} ( \text{adj } s\mathbf{I}-\mathbf{A})\mathbf{B}}{\det s\mathbf{I} - \mathbf{A}} + d_0 \\ &= \frac{\left[\begin{array}{cc}c_0 & c_1\end{array}\right]\left[\begin{array}{cc} s - k\omega & -(2k+1)\omega \\ \omega & s+2\omega \end{array} \right] \left[\begin{array}{c} \omega \\ 0\end{array} \right]}{s^2 +s\omega(2-k) +\omega^2} + d_0 \\ &= \frac{d_0 s^2+s\omega(c_0 +d_0 (2-k))+\omega^2(d_0+c_1-c_0k)}{s^2 +s\omega(2-k) +\omega^2} \end{array} \]

We can use d_0, c_0 and c_1 to create any numerator of our choosing. The denominator is completely determined by the \mathbf{A} matrix, but recall the list of resonant filters at the start of this post also share the same denominator. We can use this information to build multiple outputs from the same states and input sample! We can even build some smooth control to transition between them. Let’s try obtain these quantities from something a bit more readable:

    \[ \large d_0 s^2+s\omega(c_0 +d_0 (2-k))+\omega^2(d_0+c_1-c_0k) = b_2 s^2 + s\omega b_1 + b_0\omega^2 \]

Solving for d_0, c_0 and c_1:

    \[ \large \begin{array}{rl} d_0 &= b_2 \\ c_0 &= b_1 - (2-k)b_2 \\ c_1 &= b_0-d_0+c_0k = b_0 + k b_1 - (k(2-k)+1)b_2\end{array} \]

  • Setting b_2=0, b_1=0 and b_0=1 makes the resonant lowpass filter.
  • Setting b_2=1, b_1=0 and b_0=0 makes the resonant highspass filter.
  • Setting b_2=1, b_1=g(2-k) and b_0=1 makes the resonant bandpass/bandstop filter dependent on the value of g.

We could introduce a parameter p to transition between lowpass and highpass as follows:

    \[ \large \begin{array}{rl} b_0 &= 1-p \\ b_1 &= 0 \\ b_2 &= p\end{array} \]

When p=0, the system is lowpass. When p=1, the system is highpass. When p=0.5, we have a bandstop filter with a wideband attenuation of 6 dB. That’s kinda funky, but it would be cool if we could hack in some controllable bandpass/bandstop behaviour too. We can fudge this by setting:

    \[ \large \begin{array}{rl} b_0 &= 1-p \\ b_1 &= 2(1-p)p(2-k)g \\ b_2 &= p\end{array} \]

Now when p=0.5 we have a controllable bandpass/bandstop filter dependent on the value of g.

Screenshot highlighting the filter controls as they appear in my prototype synthesiser.

The controls in the red outline control the filter.

The “Sync” checkbox controls whether \omega should be:

  • If unchecked, a constant. The slider next to it will be in Hz.
  • If checked, a multiple of the main oscillator frequency.

The “LP/BP/HP” slider directly controls p. g gets the linear gain corresponding to the value of “BP Gain”. The value of the “Q” slider is converted to k as described earlier.

Summary of implementation and demo video

The implementation has 4 input parameters:

  • p is the continuous mode control. It is in the range [0, 1]. At 0, the system is lowpass. At 1, the system is highpass. At 0.5, the system is bandpass/bandstop.
  • g is the linear bandpass/bandstop gain. When p=0 and g=0, the system is bandstop. When p=0 and g=1, the output is the input. When p=0.5 and g\gg1, the system is bandboost.
  • \omega the natural frequency of the filter – this will be close to where resonance occurs.
  • k the derived from the Q-factor.

For every sample:

  1. Smooth the control parameters p, g, \omega, k with 1-pole smoothers to avoid instantaneous jumps which will cause glitches.
  2. Form the \mathbf{A} and \mathbf{B} matrices and compute their discretised counterparts as described (yes, this is an entire bilinear transform per sample).
  3. Evaluate the state update equation to form the new states.
  4. Derive \mathbf{C} and \mathbf{D} using:

        \[\begin{array}{rl} b_0 &= 1-p \\ b_1 &= 2(1-p)p(2-k)g \\ b_2 &= p\end{array}\]

  5. Evaluate the output equation to get the next output sample.

Here’s a video of it in action:

Even More Householder

In several previous blogs (here and here), I’ve written about the orthonormal Householder matrix and how it can constructed such that it contains a particular unit-vector. A recent tweet from Eric Arnebäck posted a link to some code that was originally put together in 1999 by Tomas Akenine-Möller and John Hughes to find an orthogonal matrix which maps one unit-vector to another. This is a slightly more general problem which can be still solved in the same way using the Householder matrix. The vectors and matrices in this post are assumed to be real but the method can be modified to work with complex numbers.

In the previously linked posts, we described the method to find the Householder matrix \mathbf{P} given a unit-vector \mathbf{x} such that \mathbf{P} \mathbf{x}=\mathbf{e}_1. Another way of wording this is: find the orthogonal matrix \mathbf{P} that contains \mathbf{x} as the first row. Now, we will generalise this to find an orthogonal matrix \mathbf{P} given a unit-vector \mathbf{x} such that \mathbf{P} \mathbf{x}=\mathbf{y} where \mathbf{y} is another unit-vector. We start by defining \mathbf{P} to be the Householder reflector \mathbf{P}=\mathbf{I}-2\mathbf{v}\mathbf{v}^H and attempt to solve for \mathbf{v}:

    \[ \begin{array}{@{}ll@{}} \mathbf{y} &= \left(\mathbf{I} - 2 \mathbf{v} \mathbf{v}^H\right) \mathbf{x} \\ \mathbf{x} - \mathbf{y} &= 2 \mathbf{v} \mathbf{v}^H \mathbf{x} \end{array} \]

As the quantity \mathbf{v}^H \mathbf{x} is a scalar, the vector \mathbf{v} must be a scaled version of \mathbf{x}-\mathbf{y}. Define \mathbf{v}=\alpha\left(\mathbf{x}-\mathbf{y}\right) and make the substitution into the previous equation:

    \[ \begin{array}{@{}ll@{}} \mathbf{x} - \mathbf{y} &= 2 \alpha^2 \left(\mathbf{x}-\mathbf{y}\right) \left(\mathbf{x}-\mathbf{y}\right)^H \mathbf{x} \\ &= 2 \alpha^2 \left(\mathbf{x}-\mathbf{y}\right) \left(1-\mathbf{y}^H \mathbf{x}\right) \\ \frac{1}{\sqrt{2 \left(1-\mathbf{y}^H \mathbf{x}\right)}} &= \alpha \end{array} \]

Observe that \alpha has a singularity as \mathbf{y}^H \mathbf{x} approaches one. This can be addressed (as has been done in the previous posts) by negating the definition of the reflector (i.e. set \mathbf{P}=2\mathbf{v}\mathbf{v}^H-\mathbf{I}) and solving for \mathbf{v} again:

    \[ \begin{array}{@{}ll@{}} \mathbf{y} &= \left(2 \mathbf{v} \mathbf{v}^H - \mathbf{I}\right) \mathbf{x} \\ \mathbf{x} + \mathbf{y} &= 2 \mathbf{v} \mathbf{v}^H \mathbf{x} \end{array} \]

This time, \mathbf{v} will be defined as \mathbf{v} = \alpha\left(\mathbf{x}+\mathbf{y}\right) and the solution for \alpha becomes:

    \[ \frac{1}{\sqrt{2 \left(1+\mathbf{y}^H \mathbf{x}\right)}}\]

This shifts the singularity to \mathbf{y}^H \mathbf{x}=-1. If the quantity \mathbf{y}^H \mathbf{x} is positive, this latter solution will be more numerically robust; if it is negative, the former will be more numerically robust. We will build an algorithm that performs the flip based on the sign of \mathbf{y}^H \mathbf{x} which results in a discontinuity of the resulting matrix when the sign changes. This might be important to consider in some use-cases.

We can avoid computing the square root because \alpha always ends up being used squared. We define 2 \mathbf{v}\mathbf{v}^H=\beta\mathbf{w}\mathbf{w}^H where \mathbf{w} is defined to be either \mathbf{x}-\mathbf{y} or \mathbf{x}+\mathbf{y} leading to \beta being defined as either \frac{1}{1-\mathbf{y}^H \mathbf{x}} or \frac{1}{1+\mathbf{y}^H \mathbf{x}} respectively. We choose which quantity to use based on the sign of \mathbf{y}^H \mathbf{x}. The following C listing is not efficient but provides a reference in very little code for arbitrary dimension vectors.

/* Find the orthonormal symmetric reflection matrix which
 * maps the given unit-vector x to the given unit-vector y.
 * As the matrix is symmetric, it can be interpreted as
 * either row or column major.
 *
 * This isn't efficient. */
void make_reflector(const float *y, const float *x, unsigned N, float *mat) {
	unsigned i, j;
	float proj, sgn, beta;
	for (i = 0, proj = 0.0f; i < N; i++)
		proj += y[i] * x[i];
	sgn  = (proj >= 0.0f) ? 1.0 : -1.0;
	beta = 1.0f / (proj + sgn);
	for (i = 0; i < N; i++) {
		for (j = 0; j < N; j++) {
			float w_i = x[i] + y[i] * sgn;
			float w_j = x[j] + y[j] * sgn;
			mat[i*N+j] = w_i * beta * w_j;
		}
		mat[i*N+i] -= sgn;
	}
}

Because the resulting matrix is symmetric only n(n+1)/2 elements need to be computed. The w vector could also be precomputed to avoid excessive repeated computation in the inner loop. The completely unrolled and branch-less 3-dimension case can be written as:

void make_reflector(float x[3], float y[3], float mat[3][3]) {
	float proj  = x[0]*y[0] + x[1]*y[1] + x[2]*y[2];
	float sgn   = copysignf(1.0f, proj);
	float beta  = 1.0f / (proj + sgn);
	float w0    = x[0] + y[0] * sgn;
	float w1    = x[1] + y[1] * sgn;
	float w2    = x[2] + y[2] * sgn;
	float w0a   = w0 * beta;
	float w1a   = w1 * beta;
	float w2a   = w2 * beta;
	mat[0][0]   = w0a * w0 - sgn;
	mat[0][1]   = w0a * w1;
	mat[0][2]   = w0a * w2;
	mat[1][0]   = mat[0][1];
	mat[1][1]   = w1a * w1 - sgn;
	mat[1][2]   = w1a * w2;
	mat[2][0]   = mat[0][2];
	mat[2][1]   = mat[1][2];
	mat[2][2]   = w2a * w2 - sgn;
}

This looks very similar to the listings used to reflect \mathbf{x} to \mathbf{e}_1. None of this is new. The Householder matrix has been used for a very long time to solve this exact problem in the context of finding orthogonal matrices for performing the QR decomposition (google “QR decomposition using reflectors” to find less-verbose descriptions of what I’ve written on this page).

State-Space Representation – An Engineering Primer.

This is the first blog in what will probably be a series on the benefits of using state-space representation to implement discrete infinite impulse response (IIR) filters. Before I write about state-space and some of the things you can do with it, I want to give an overview of the difference equation and explain (hopefully intuitively) why they can be a poor choice for implementing a filter.

The discrete difference equation

The discrete difference equation takes an input value, a list of previous input values and a list of previous output values and combines them linearly to produce the next output value. This is codified mathematically in the following equation which can be recognised as the sum of two convolutions:

    \[ y[n] = \sum_{k=0}^{N_b} b[k] x[n-k] - \sum_{k=1}^{N_a} a[k] y[n-k] \]

Instead of the above, we could write the difference equation as two distinct difference equations chained together:

    \[ \begin{array}{@{}ll@{}} w[n] &= \sum_{k=0}^{N_b} b[k] x[n-k] \\ y[n] &= w[n] - \sum_{k=1}^{N_a} a[k] y[n-k] \end{array} \]

w[n] is the output of an all-zero filter which is used as input into an all-pole filter to produce y[n]. If we were to take the z-transforms of the two filters, we would see quite plainly that we can also swap the order of the filters as:

    \[ \begin{array}{@{}ll@{}} q[n] &= x[n] - \sum_{k=1}^{N_a} a[k] q[n-k] \\ y[n] &= \sum_{k=0}^{N_b} b[k] q[n-k] \end{array} \]

This time, q[n] is the output of an all-pole filter which is used as input into an all-zero filter to produce y[n]. If we set N=N_a=N_b, the benefit of this representation is that the two filters share exactly the same state and so an implementation needs not have delay lines for both output and input. This formulation has the name “Direct Form II” and is a popular implementation choice.

Regardless of whether we implement the difference equation directly or use Direct Form II, the implementation is easy to realise in a C program given a set of filter coefficients… so why would we want to do anything differently?

When difference equations go bad…

Let’s build a 6-th order elliptic filter and run the difference equation explicitly using the following GNU Octave code. The filter is designed with a passband edge frequency of 240 Hz at a 48000 Hz sample rate (which isn’t particularly low) with 6 dB of ripple in the passband and 80 dB of stopband attenuation.

N = 6;
IRLen = 8000;
[b, a] = ellip(N, 6, 80.0, 240.0/24000);
dp_y = zeros(IRLen, 1);
q = zeros(N, 1);
x = 1;
for i = 1:IRLen
  qv      = x - a(2:end) * q;
  dp_y(i) = b * [qv; q];
  q       = [qv; q(1:(N-1))];
  x       = 0;
end

And here is a plot of the impulse response and frequency response (which both look pretty good!):

Frequency Response of Double Precision Difference Equation - Everything looks good

Frequency Response of Double Precision Difference Equation

Oh yeah, Octave by default is using double-precision for everything. Because we like fast things, let’s specify that the states should always be stored using single precision by replacing the initialisation of q with:

q = single(zeros(N, 1));

And here is the new impulse response:

Frequency Response of Single Precision Difference Equation - it explodes into instability

Frequency Response of Single Precision Difference Equation

Bugger! Storing the states in single precision has lead to an instability in the filter. Small errors in the state are somehow being amplified by the coefficients of the difference equation and the output is growing without bound.

I’m going to try and give some intuition here: imagine that you are trying to synthesise a low-frequency discretely-sampled sinusoid that follows a slowly decaying exponential envelope. You’ve found that there is a z-transform that does exactly what you want from a z-transform table:

    \[ a^n \sin(\omega_0 n) u[n] \Leftrightarrow \frac{a z \sin(\omega_0)}{z^2 - 2az\cos(\omega_0) + a^2} \]

You decide to implement the difference equation directly using Direct Form II leading to states that are the previous outputs of the impulse response of the all-pole component of the system.

The intuition is this: the values of the states close to t=0 are all very close to zero – but after sufficient time, will eventually reach a very high-peak relative to their starting point without any external influence. Tiny errors in these small state values could grow by multiple orders of magnitude as the state evolves. This effectively happens at every zero crossing of the signal – and that is just illustrative component of the problem.

Ideally, the energy in the states of the system would only ever decrease when x[n]=0 regardless of the values of y[n]. Unfortunately, this almost never happens with a resonant system implemented using a difference equation. We should ask the question: can we find better states than previous outputs of the all-pole filter?

State-space

The Direct Form II difference equations presented earlier utilising q[n] and y[n] can be mutated into some fairly gross matrix equations:

    \[ \begin{tiny} \begin{array}{@{}ll@{}} \begin{pmatrix} q[n-N+1] \\ \vdots \\ q[n-2] \\ q[n-1] \\ q[n] \end{pmatrix} &= \begin{pmatrix} 0 & 1 & 0 & \dots & 0 \\ 0 & \ddots & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & 1 & 0 \\ 0 & \dots & 0 & 0 & 1 \\ -a[N] & \dots & -a[3] & -a[2] & -a[1] \end{pmatrix} \begin{pmatrix} q[n-N] \\ \vdots \\ q[n-3] \\ q[n-2] \\ q[n-1] \end{pmatrix} + \begin{pmatrix} 0 \\ \vdots \\ \vdots \\ 0 \\ 1 \end{pmatrix} x[n] \\ y[n] &= \begin{pmatrix} b[N]-a[N]b[0] & \dots & b[2]-a[2]b[0] & b[1]-a[1]b[0] \end{pmatrix} \begin{pmatrix} q[n-N] \\ \vdots \\ q[n-2] \\ q[n-1] \end{pmatrix} + b[0] x[n] \end{array} \end{tiny} \]

Work through these and convince yourself they are equivalent. We can rewrite this entire system giving names to the matrices as:

    \[ \begin{array}{@{}ll@{}} \mathbf{q}[n+1] &= \mathbf{A}\mathbf{q}[n] + \mathbf{B}\mathbf{x}[n] \\ \mathbf{y}[n] &= \mathbf{C} \mathbf{q}[n] + \mathbf{D}\mathbf{x}[n] \end{array} \]

These matrix equations are a state-space representation. The explicitly listed matrices before it are known as a “controllable canonical form” representation (for the case when b[0]=0). \mathbf{q} is the state vector and in controllable canonical form, it holds previous values of q[n] as described in the difference equations earlier.

So why would we want to do this? Who cares that we mapped a difference equation into a set of matrix equations? I’m going to list out some reasons which probably deserve their own blogs:

  1. Ability to analyse system using a vast array of matrix mathematical tools.
  2. Improve the numerical stability of the system.
  3. Preserving physical meanings of states during discretisation.
  4. Ability to design for efficient implementations.

In this blog, we’re going to look at points one and two briefly to attempt to address the problems shown earlier.

Firstly, let’s write the state equations in terms of their z-transforms so that we can build an expression that maps input to output:

    \[ \begin{array}{@{}ll@{}} z\mathbf{Q}(z) &= \mathbf{A}\mathbf{Q}(z) + \mathbf{B}\mathbf{X}(z) \\ \mathbf{Y}(z) &= \mathbf{C} \mathbf{Q}(z) + \mathbf{D}\mathbf{X}(z) \end{array} \]

Doing some rearranging and substitution enables us to eliminate \mathbf{Q}(z):

    \[\begin{array}{@{}ll@{}} z\mathbf{Q}(z) &= \mathbf{A}\mathbf{Q}(z)+\mathbf{B}\mathbf{X}(z) \\ \left(\mathbf{I}z-\mathbf{A}\right)\mathbf{Q}(z) &= \mathbf{B}\mathbf{X}(z) \\ \mathbf{Q}(z) &= \left(\mathbf{I}z-\mathbf{A}\right)^{-1} \mathbf{B}\mathbf{X}(z) \\ \mathbf{Y}(z) &= \left( \mathbf{C}\left(\mathbf{I}z-\mathbf{A}\right)^{-1} \mathbf{B}+\mathbf{D} \right) \mathbf{X}(z) \\ &= \frac{\mathbf{C}\text{ adj}\left(\mathbf{I}z-\mathbf{A}\right) \mathbf{B}+\mathbf{D}\det\left(\mathbf{I}z-\mathbf{A}\right) }{\det\left(\mathbf{I}z-\mathbf{A}\right)} \mathbf{X}(z) \end{array}\]

The \det\left(\mathbf{I}z-\mathbf{A}\right) term in the denominator expands to the characteristic polynomial of \mathbf{A} in z and if we are trying to map a z-domain transfer function to a state-space system, it needs to equal the denominator of the transfer function. Just as the poles of a transfer function are roots of the denominator, the poles of a state-space system are the eigenvalues of the \mathbf{A} matrix. The numerator of the above expression is a matrix of polynomials in z showing how state-space is naturally able to handle multiple-input multiple-output (MIMO) systems.

Earlier, we asked can we pick better values for the states? Before getting to that, let’s answer a different question: can we pick different values for the states? The answer is: yes – and we can prove it trivially by substituting \mathbf{q}[n] with \mathbf{W}^{-1}\hat{\mathbf{q}}[n] where \mathbf{W} is some invertible square matrix:

    \[ \begin{array}{@{}ll@{}} \mathbf{W}^{-1}\hat{\mathbf{q}}[n+1] &= \mathbf{A}\mathbf{W}^{-1}\hat{\mathbf{q}}[n] + \mathbf{B}\mathbf{x}[n] \\ \hat{\mathbf{q}}[n+1] &= \mathbf{W}\mathbf{A}\mathbf{W}^{-1}\hat{\mathbf{q}}[n] + \mathbf{W}\mathbf{B}\mathbf{x}[n] \\ \mathbf{y}[n] &= \mathbf{C} \mathbf{W}^{-1}\hat{\mathbf{q}}[n] + \mathbf{D}\mathbf{x}[n] \end{array} \]

This changes the coordinate system that the state vector exists in. We can pick any invertible matrix \mathbf{W} and use it to modify a system’s \mathbf{A}, \mathbf{B} and \mathbf{C} matrices keeping the system output \mathbf{y}[n] identical. The quantity \mathbf{W}\mathbf{A}\mathbf{W}^{-1} is a similarity transform on \mathbf{A} and it preserves the eigenvalues/characteristic polynomial of \mathbf{A} for all invertible \mathbf{W}.

State-space degrees of freedom

The coefficients of a difference equation are unique for a particular impulse response. The coefficients of the state-space representation for N>1 are not. In-fact, we have already demonstrated this previously by showing we can change the coordinate system of the state-vector.

Let’s go through the process of taking a 2×2 SISO system and finding the transfer function.

    \[ \begin{small} \begin{array}{@{}ll@{}} \frac{Y(z)}{X(z)} &= \frac{\mathbf{C}\text{ adj}\left(\mathbf{I}z-\mathbf{A}\right) \mathbf{B}+\mathbf{D}\det\left(\mathbf{I}z-\mathbf{A}\right) }{\det\left(\mathbf{I}z-\mathbf{A}\right)} \\ &= \frac{\begin{pmatrix}c_0 & c_1\end{pmatrix}\text{ adj}\begin{pmatrix}z-a_{00} & -a_{01} \\ -a_{10} & z-a_{11}\end{pmatrix} \begin{pmatrix}b_0 \\ b_1\end{pmatrix}+d\det\begin{pmatrix}z-a_{00} & -a_{01} \\ -a_{10} & z-a_{11}\end{pmatrix} }{\det\begin{pmatrix}z-a_{00} & -a_{01} \\ -a_{10} & z-a_{11}\end{pmatrix}} \\ &= \frac{\begin{pmatrix}c_0 & c_1\end{pmatrix}\begin{pmatrix}z-a_{11} & a_{01} \\ a_{10} & z-a_{00}\end{pmatrix} \begin{pmatrix}b_0 \\ b_1\end{pmatrix}+d\left((z-a_{00})(z-a_{11})-a_{01}a_{10}\right)}{(z-a_{00})(z-a_{11})-a_{01}a_{10}} \\ &= \frac{c_0(b_0(z-a_{11}) + b_1a_{01}) + c_1(b_0a_{10} + b_1(z-a_{00}))+d\left((z-a_{00})(z-a_{11})-a_{01}a_{10}\right)}{(z-a_{00})(z-a_{11})-a_{01}a_{10}} \\ &= \frac{dz^2+z(c_0b_0+c_1b_1-d(a_{00}+a_{11}))+c_0(b_1a_{01}-b_0a_{11})+c_1(b_0a_{10}-b_1a_{00})+d(a_{00}a_{11}-a_{01}a_{10})}{z^2-z(a_{00}+a_{11}) + a_{00}a_{11}-a_{01}a_{10}} \end{array} \end{small} \]

We can select all of these quantities to match a given second order transfer function – but how should we choose them? What makes for good coefficients? What does good mean? We could force some values to zero or one to minimize multiplies? If we want to improve numeric performance, given that the \mathbf{A} matrix is responsible for moving states forward in time, it seems reasonable to minimize the impact that errors in the state vector can have when multiplied by \mathbf{A} – this is akin to making \mathbf{A} as well conditioned as possible. This might be a good strategy for fixing our example 6th order system.

Orthogonal matrices are well-conditioned and for those deeply familiar with IIR design, you may also know that the state update for a coupled form IIR implementation is effectively determined by an orthonormal matrix that has been attenuated by some scalar. Coupled form state-updates can only be used to mimic a conjugate pole pair, but for every classical IIR design (Butterworth, Chebyshev I/II, Cauer/Elliptic) of even order (like our example 6th order problematic system), all the poles are in conjugate pairs. Start by defining \mathbf{A} as an orthogonal matrix:

    \[ \mathbf{A} = \begin{pmatrix}a_{00} & -a_{10} \\ a_{10} & a_{00} \end{pmatrix} \]

The transfer function becomes:

    \[ \begin{tiny} \frac{Y(z)}{X(z)} = \frac{dz^2+z(c_0b_0+c_1b_1-2da_{00})+c_0(-b_1a_{10}-b_0a_{00})+c_1(b_0a_{10}-b_1a_{00})+d(a_{00}^2+a_{10}^2)}{z^2-2za_{00} + a_{00}^2+a_{10}^2} \end{tiny} \]

There are two remaining free variables we can pick. Somewhat arbitrarily, we will select c_0=1 and c_1=0 to get:

    \[ \frac{Y(z)}{X(z)} = \frac{dz^2+z(b_0-2da_{00})-b_1a_{10}-b_0a_{00}+d(a_{00}^2+a_{10}^2)}{z^2-2za_{00} + a_{00}^2+a_{10}^2} \]

We now have a system of equations we can solve for the unknown matrix quantities that will create a state-space system that mimics a 2nd order resonant transfer function.

Chaining systems together

Consider we have two systems which we would like to connect together:

    \[ \begin{array}{@{}ll@{}} z\mathbf{Q}_1(z) &= \mathbf{A}_1\mathbf{Q}_1(z) + \mathbf{B}_1\mathbf{X}(z) \\ \mathbf{U}(z) &= \mathbf{C}_1 \mathbf{Q}_1(z) + \mathbf{D}_1\mathbf{X}(z) \end{array} \]

And:

    \[ \begin{array}{@{}ll@{}} z\mathbf{Q}_2(z) &= \mathbf{A}_2\mathbf{Q}_2(z) + \mathbf{B}_2\mathbf{U}(z) \\ \mathbf{Y}(z) &= \mathbf{C}_2 \mathbf{Q}_2(z) + \mathbf{D}_2\mathbf{U}(z) \end{array} \]

We can make direct substitutions to get:

    \[ \begin{array}{@{}ll@{}}  z\mathbf{Q}_1(z) &= \mathbf{A}_1\mathbf{Q}_1(z) + \mathbf{B}_1\mathbf{X}(z) \\ z\mathbf{Q}_2(z) &= \mathbf{A}_2\mathbf{Q}_2(z) + \mathbf{B}_2\mathbf{C}_1 \mathbf{Q}_1(z) + \mathbf{B}_2 \mathbf{D}_1\mathbf{X}(z) \\ \mathbf{Y}(z) &= \mathbf{C}_2 \mathbf{Q}_2(z) + \mathbf{D}_2\mathbf{C}_1 \mathbf{Q}_1(z) + \mathbf{D}_2\mathbf{D}_1\mathbf{X}(z) \end{array} \]

We can re-write the above as a set of block matrix equations:

    \[ \begin{array}{@{}ll@{}} z\begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} &= \begin{pmatrix} \mathbf{A}_2 & \mathbf{B}_2 \mathbf{C}_1 \\ \mathbf{0} & \mathbf{A}_1 \end{pmatrix} \begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} + \begin{pmatrix}\mathbf{B}_2 \mathbf{D}_1 \\ \mathbf{B}_1\end{pmatrix} \mathbf{X}(z) \\ \mathbf{Y}(z) &= \begin{pmatrix}\mathbf{C}_2 & \mathbf{D}_2 \mathbf{C}_1 \end{pmatrix}\begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} + \mathbf{D}_2 \mathbf{D}_1 \mathbf{X}(z) \end{array} \]

This provides a direct method for connecting systems together into one larger system. As \mathbf{A} is an upper-triangular block-diagonal matrix, the eigenvalues of the new system are the same as those of the existing systems and the \mathbf{B}_2\mathbf{C}_1 term in the connected system has no influence. Given that, can we decouple the sections completely? i.e. can we completely eliminate the \mathbf{B}_2\mathbf{C}_1 term by using a similarity transform as described previously? Let’s pose the question using block matrices:

    \[ \begin{pmatrix} \mathbf{A}_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_1 \end{pmatrix} \begin{pmatrix} \mathbf{W}_{00} & \mathbf{W}_{01} \\ \mathbf{W}_{10} & \mathbf{W}_{11} \end{pmatrix} = \begin{pmatrix} \mathbf{W}_{00} & \mathbf{W}_{01} \\ \mathbf{W}_{10} & \mathbf{W}_{11} \end{pmatrix} \begin{pmatrix} \mathbf{A}_2 & \mathbf{B}_2 \mathbf{C}_1 \\ \mathbf{0} & \mathbf{A}_1 \end{pmatrix} \]

    \[ \begin{array}{@{}ll@{}} \mathbf{A}_2 \mathbf{W}_{00} &= \mathbf{W}_{00} \mathbf{A}_2 \\ \mathbf{A}_1 \mathbf{W}_{10} &= \mathbf{W}_{10} \mathbf{A}_2 \\ \mathbf{A}_2 \mathbf{W}_{01} &= \mathbf{W}_{00} \mathbf{B}_2 \mathbf{C}_1 + \mathbf{W}_{01} \mathbf{A}_1 \\ \mathbf{A}_1 \mathbf{W}_{11} &= \mathbf{W}_{10} \mathbf{B}_2 \mathbf{C}_1 + \mathbf{W}_{11} \mathbf{A}_1 \end{array} \]

There is only one valid set of solutions for these equations:

    \[ \begin{array}{@{}ll@{}} \mathbf{W}_{00} &= \mathbf{I} \\ \mathbf{W}_{11} &= \mathbf{I} \\ \mathbf{W}_{10} &= \mathbf{0} \\ \mathbf{A}_2 \mathbf{W}_{01} - \mathbf{W}_{01} \mathbf{A}_1 &= \mathbf{B}_2 \mathbf{C}_1 \end{array} \]

Where the final equation is a Sylvester equation which can be solved without too much pain programatically (i.e. using a matrix inverse) if and only if \mathbf{A}_1 and \mathbf{A}_2 share no common eigenvalues. If we do this, we end up with the following system:

    \[ \begin{array}{@{}ll@{}} z\begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} &= \begin{pmatrix} \mathbf{A}_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_1 \end{pmatrix} \begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} + \mathbf{W}\begin{pmatrix}\mathbf{B}_2 \mathbf{D}_1 \\ \mathbf{B}_1\end{pmatrix} \mathbf{X}(z) \\ \mathbf{Y}(z) &= \begin{pmatrix}\mathbf{C}_2 & \mathbf{D}_2 \mathbf{C}_1 \end{pmatrix}\mathbf{W}^{-1}\begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} + \mathbf{D}_2 \mathbf{D}_1 \mathbf{X}(z) \end{array} \]

This will be better conditioned and have some other benefits in terms of efficient implementations. This block-diagonalisation we just performed is the equivalent of performing the partial-fraction decomposition on the cascaded system (which I’ve never seen a nice looking algorithmic solution for).

Bringing it home

Let’s take our problematic 6th order system, break it up into 3 second order state-space sections using the coupled-form mapping we described and evaluate the performace again:

Performance of the coupled-form state-space systems vs. the double precision difference equation

Performance of the coupled-form state-space systems vs. the double precision difference equation


There are 4 data-sets:

  1. The original double-precision difference equation results.
  2. The concatenated coupled-form systems without block-diagonalisation storing the states in single precision.
  3. The concatenated coupled-form systems without block-diagonalisation storing the states in a q15 format (15 bits of fractional precision).
  4. The concatenated coupled-form systems with block-diagonalisation storing the states in q15 format.

The performance difference between 2 and the block-diagonalised system with states in signle precision is negligible and so is not included in the plots. The performance difference between 1, 2 and 4 is practiacally identical in the passband. There is some fudging going on with the q15 plots because the precision is fixed but the range is not. I’ve tried to make things more even by:

  • 2-Norm matching the \mathbf{B} and \mathbf{C} matrices of the systems without diagonalisation.
  • 2-Norm matching each sub-section of the \mathbf{B} and \mathbf{C} matrices of the system with diagonalisation (which we can do because they are all operating independetly).
  • 2-Norm matching the \mathbf{C} matrices of the diagonalised and undiagonalised systems after the above has happened (which also requires modifying \mathbf{B}).

The performance of 1, 2 and 4 are all incredibly good. The performance of 3 shows that cascading leads to error build-up that significantly compromises the output.

These coupled-form state-space filter realisations are good down to very low frequencies. I have had success synthesising sub 10 Hz 16th order elliptic filters that work just fine with single-precision floating point. Next time, we’ll look at preservation of state meanings when we discretise an analogue system.

An attempt at an intuitive description of the QR decomposition using Householder reflectors

I really like the Householder matrix. I’ve blogged about its ability to generate an orthonormal basis containing a particular vector in a previous blog post. This blog post is a bit of a tutorial which will use the Householder matrix to perform a QR decomposition. Applying Householder reflectors to compute a QR decomposition is nothing new, but I want this blog post to attempt to provide some intuition into how the algorithm works starting from almost nothing. We’ll briefly visit inner products, matrix multiplication, the Householder matrix and then build a QR decomposition in C. I’m not going to use formal language to define these operations unless it’s necessary. I’m going to assume that you understand things like matrix multiplication and what a matrix inverse is, but not much more than that. The post will be written assuming complex numbers are being used (the C implementation will not… maybe I will add one later).

The inner product

The inner product takes two vectors and produces a scalar. You can think of a vector as being an array of length N values representing coordinates in some N-dimensional space.

    \[ \langle \mathbf{a}, \mathbf{b} \rangle = \sum^{N}_{k=1} a_k \overline{b_k} \]

Note the bar in the above indicates conjugation. If \mathbf{a} and \mathbf{b} are row vectors, we could write the above as:

    \[ \langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a} \mathbf{b}^H \]

Where \mathbf{b}^H is the transposed conjugate of \mathbf{b}. Two vectors are said to be orthogonal if they have an inner product of zero. If the inner product of a vector with itself is equal to one, the vector is said to be a unit-vector. The inner product of a vector with a unit-vector is the proportion of that vector that is pointing in the same direction as the unit-vector i.e. if \mathbf{u} is some unit vector and I define a new vector as:

    \[ \mathbf{b} = \mathbf{a} - \langle \mathbf{a}, \mathbf{u} \rangle \mathbf{u} \]

The vector \mathbf{b} is now orthogonal to \mathbf{u} i.e. \langle\mathbf{b},\mathbf{u}\rangle = 0.

There are some good geometric visualisations of these for two and three dimensional vectors… but everything continues to work in arbitrary dimensions. The above image contains two plots which both contain two perpendicular unit vectors (red and green) and an arbitrary vector (black). In both plots we can verify the red and green vectors are unit vectors by applying the inner product. We can also verify they are orthogonal by applying the inner product. If we compute the inner product of the black vector on the two unit vectors, then multiply each of the scalar results by the corresponding unit vectors used to compute them and sum the results, we should get back the black vector.

Matrix multiplications

We can define matrix multiplication using inner products of the vectors which make up the rows of the left-hand side matrix with the vectors which make up the columns of the right-hand side matrix.

    \[ \begin{tiny} \begin{pmatrix} \langle r_0, \overline{c_0} \rangle & \langle r_0, \overline{c_1} \rangle & \dots & \langle r_0, \overline{c_{N-1}} \rangle \\ \langle r_1, \overline{c_0} \rangle & \langle r_1, \overline{c_1} \rangle & \dots & \langle r_1, \overline{c_{N-1}} \rangle \\ \vdots & \vdots &  & \vdots \\ \langle r_{M-1}, \overline{c_0} \rangle & \langle r_{M-1}, \overline{c_1} \rangle & \dots & \langle r_{M-1}, \overline{c_{N-1}} \rangle \end{pmatrix} = \begin{pmatrix} \dots & r_0 & \dots \\ \dots & r_1 & \dots \\   & \vdots &  \\ \dots & r_{M-1} & \dots \end{pmatrix} \begin{pmatrix} \vdots & \vdots &  & \vdots \\ c_0 & c_1 & \dots & c_{N-1} \\ \vdots & \vdots &  & \vdots \end{pmatrix} \end{tiny} \]

We can use this to demonstrate that a matrix that is made up of rows (or columns) that are all orthogonal to each other is trivially invertible – particularly if the vectors are unit vectors, in which case the inverse of matrix \mathbf{A} is \mathbf{A}^H. To make this particularly clear:

    \[ \begin{tiny} \begin{pmatrix} \langle r_0, r_0 \rangle & \langle r_0, r_1 \rangle & \dots & \langle r_0, r_{N-1} \rangle \\ \langle r_1, r_0 \rangle & \langle r_1, r_1 \rangle & \dots & \langle r_1, r_{N-1} \rangle \\ \vdots & \vdots &  & \vdots \\ \langle r_{N-1}, r_0 \rangle & \langle r_{N-1}, r_1 \rangle & \dots & \langle r_{N-1}, r_{N-1} \rangle \end{pmatrix} = \begin{pmatrix} \dots & r_0 & \dots \\ \dots & r_1 & \dots \\   & \vdots &  \\ \dots & r_{N-1} & \dots \end{pmatrix} \begin{pmatrix} \vdots & \vdots &  & \vdots \\ \overline{r_0} & \overline{r_1} & \dots & \overline{r_{N-1}} \\ \vdots & \vdots &  & \vdots \end{pmatrix} \end{tiny} \]

If the rows are orthogonal to each other, the inner products off the main diagonal will all be zero. If the rows are unit vectors, the diagonal entries are by definition 1 and \mathbf{I} = \mathbf{A}\mathbf{A}^H.

The Householder matrix

Define a matrix \mathbf{P} as:

    \[\mathbf{P}=\mathbf{I} - 2\mathbf{v}\mathbf{v}^H\]

Where \mathbf{v} is a unit vector. Then:

    \[\begin{array}{@{}ll@{}} \mathbf{P}\mathbf{P} &= \mathbf{I}-2\left(\mathbf{v}\mathbf{v}^H+\mathbf{v}\mathbf{v}^H\right)+4\mathbf{v}\mathbf{v}^H\mathbf{v}\mathbf{v}^H \\ &= \mathbf{I}-4\mathbf{v}\mathbf{v}^H+4\mathbf{v}\left(\mathbf{v}^H\mathbf{v}\right)\mathbf{v}^H \\ & = \mathbf{I} \end{array}\]

Thereore \mathbf{P} is its own inverse (\mathbf{P}=\mathbf{P}^{-1}) and must also be unitary (\mathbf{P}=\mathbf{P}^H).

If we can work out a method to choose \mathbf{v} such that \mathbf{P} will contain a particular unit vector \mathbf{u} and we multiply \mathbf{P} by any scaled version of vector \mathbf{u}, we will get a vector which has only one non-zero entry (because all other rows of the matrix will be orthogonal to \mathbf{u}). I have described this process before in a previous blog post but will repeat it here with some more detail. Define a vector \mathbf{u} that we want the first column of \mathbf{P} to equal:

    \[ \begin{array}{@{}ll@{}} \mathbf{u} &= \left( \mathbf{I} - 2 \mathbf{v} \mathbf{v}^H \right) \mathbf{e}_1 \\ \frac{\mathbf{e}_1-\mathbf{u}}{2} &= \mathbf{v} \overline{v_1} \end{array} \]

If u_1 is real, we can break \mathbf{v} into its individual coordinates to solve for their values:

    \[ \begin{array}{@{}ll@{}} v_1 &= \sqrt{\frac{1-u_1}{2}} \\ v_2 &= \frac{-u_2}{2\sqrt{\frac{1-u_1}{2}}} \\ &\vdots \\ v_n &= \frac{-u_n}{2\sqrt{\frac{1-u_1}{2}}} \end{array} \]

Given the way the matrix is defined, the computation of the square root is unnecessary in an implementation. This is best seen if we expand out the 2 \mathbf{v} \mathbf{v}^H matrix:

    \[ \begin{array}{@{}ll@{}} 2 \mathbf{v} \mathbf{v}^H &= \begin{pmatrix} 2 v_1 \overline{v_1} & 2 v_1 \overline{v_2} & \dots & 2 v_1 \overline{v_n} \\ 2 v_2 \overline{v_1} & 2 v_2 \overline{v_2} & \dots & 2 v_2 \overline{v_n} \\ \vdots & \vdots & \ddots & \vdots \\ 2 v_n \overline{v_1} & 2 v_n \overline{v_2} & \dots & 2 v_n \overline{v_n} \end{pmatrix} \\ &= \begin{pmatrix} 1 - u_1 & -\overline{u_2} & \dots & -\overline{u_n} \\ -u_2 & \frac{u_2 \overline{u_2}}{1-u_1} & \dots & \frac{u_2 \overline{u_n}}{1-u_1} \\ \vdots & \vdots & \ddots & \vdots \\ -u_n & \frac{u_n \overline{u_2}}{1-u_1} & \dots & \frac{u_n \overline{u_n}}{1-u_1} \end{pmatrix} \end{array} \]

Define a scalar \alpha=\frac{1}{1-u_1} and a vector \mathbf{w}:

    \[\mathbf{w} = \begin{pmatrix}1 - u_1 \\ -u_2 \\ \vdots \\ -u_n \end{pmatrix} \]

Then 2 \mathbf{v} \mathbf{v}^H = \alpha \mathbf{w} \mathbf{w}^H. This is a particularly useful formulation because we rarely want to compute \mathbf{P} in its entirity and perform a matrix mutliplication. Rather, we compute:

    \[ \mathbf{P}\mathbf{A} = \left(\mathbf{I}-\alpha\mathbf{w}\mathbf{w}^H\right)\mathbf{A}=\mathbf{A}-\alpha\mathbf{w}\mathbf{w}^H\mathbf{A} \]

Which is substantially more efficient. A final necessary change is necessary: if u_1 is close to one (which implies all other coefficients of \mathbf{u} are approaching zero), \alpha will begin to approach infinity. There is a relatively simple change here which is to recognise that \mathbf{u} and -\mathbf{u} are parallel but pointing in different directions (their inner-product is -1). If we don’t care about the direction of the vector, we can change its sign to be the most numerically suitiable for the selection of \alpha i.e. if u_1 < 0 use \mathbf{u} otherwise use -\mathbf{u}.

This is actually very useful in many applications, one of which is the QR decomposition.

The QR decomposition

The QR decomposition breaks down a matrix \mathbf{A} into a unitary matrix \mathbf{Q} and an upper-triangular matrix \mathbf{R}. There are a number of ways to do this, but we are going use the Householder matrix.

    \[\mathbf{A} = \mathbf{Q}\mathbf{R}\]

Let’s start by defining a series of unitary transformations applied to \mathbf{A} as:

    \[\begin{array}{@{}ll@{}} \mathbf{X}_1&=\mathbf{P}_N \mathbf{A} \\ \mathbf{X}_{2} &= \begin{pmatrix} \mathbf{I}_{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_{N-1} \end{pmatrix} \mathbf{X}_{1} \\ \mathbf{X}_{3} &= \begin{pmatrix} \mathbf{I}_{2} & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_{N-2} \end{pmatrix} \mathbf{X}_{2} \\ & \vdots \\ \mathbf{X}_{N-1} &= \begin{pmatrix} \mathbf{I}_{N-2} & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_{2} \end{pmatrix} \mathbf{X}_{N-2} \end{array} \]

Where \mathbf{P}_n are all n-by-n unitary matrices. Notice that the first row of \mathbf{X}_1 will be unmodified by all the transforms that follow. The first two rows of \mathbf{X}_2 will be unmodified by all the transforms that follow – and so-on.

If we can somehow force the first row of \mathbf{P}_N to be the first column of \mathbf{A} scaled to be a unit vector (while keeping \mathbf{P}_N unitary), the first column of \mathbf{X}_1 will contain only one non-zero entry. We then set about finding a way to select \mathbf{P}_{N-1} such that the second column contains only two non-zero entries… and so on. Once the process is finished, \mathbf{X}_{N-1} is upper triangular and the unitary matrix which converts \mathbf{A} into this form can be written as:

    \[ \mathbf{Q}^H = \begin{pmatrix} \mathbf{I}_{N-2} & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_{2} \end{pmatrix} \dots \begin{pmatrix} \mathbf{I}_{2} & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_{N-2} \end{pmatrix}  \begin{pmatrix} \mathbf{I}_{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_{N-1} \end{pmatrix} \mathbf{P}_N \]

And:

    \[ \begin{array}{@{}ll@{}} \mathbf{X}_{N-1} &= \mathbf{Q}^H\mathbf{A} \\ \mathbf{Q}\mathbf{X}_{N-1}&=\mathbf{A} \end{array} \]

When we follow the above process and the \mathbf{P}_n matricies are chosen to be Householder matrices, we have performed the QR decomposition using Householder matrices.

Follows is the C code that implements the decomposition. Because the algorithm repeatedly applies transformations to \mathbf{A} to eventually arrive at \mathbf{R}, we will design an API that operates in-place on the supplied matrix. The orthogonal matrix will be built in parallel. The N-1 iterations over the i variable, correspond to the N-1 transformation steps presented earlier. I make no claims about the computational performance of this code. The numerical performance could be improved by using a different method to compute the length of the vector (first thing that happens in the column loop).

/* Perform the QR decomposition of the given square A matrix.
 * A/R is stored and written in columns i.e.
 * a[0] a[3] a[6]
 * a[1] a[4] a[7]
 * a[2] a[5] a[8]
 *
 * q is stored in rows i.e.
 * q[0] q[1] q[2]
 * q[3] q[4] q[5]
 * q[6] q[7] q[8] */
void qr(double *a_r, double *q, int N) {
	int i, j, k;
 
	/* Initialise qt to be the identity matrix. */
	for (i = 0; i < N; i++) {
		for (j = 0; j < N; j++) {
			q[i*N+j] = (i == j) ? 1.0 : 0.0;
		}
	}
 
	for (i = 0; i < N-1; i++) {
		double  norm, invnorm, alpha;
		double *ccol = a_r + i*N;
 
		/* Compute length of vector (needed to perform normalisation). */
		for (norm = 0.0, j = i; j < N; j++) {
			norm += ccol[j] * ccol[j];
		}
		norm = sqrt(norm);
 
		/* Flip the sign of the normalisation coefficient to have the
		 * effect of negating the u vector. */
		if (ccol[0] > 0.0)
			norm = -norm;
		invnorm = 1.0 / norm;
 
		/* Normalise the column
		 * Convert the column in-place into the w vector
		 * Compute alpha */
		ccol[i] = (1.0 - ccol[i] * invnorm);
		for (j = i+1; j < N; j++) {
			ccol[j] = -ccol[j] * invnorm;
		}
		alpha = 1.0 / ccol[i];
 
		/* Update the orthogonal matrix */
		for (j = 0; j < N; j++) {
			double acc = 0.0;
			for (k = i; k < N; k++) {
				acc += ccol[k] * q[k+j*N];
			}
			acc *= alpha;
			for (k = i; k < N; k++) {
				q[k+j*N] -= ccol[k] * acc;
			}
		}
 
		/* Update the upper triangular matrix */
		for (j = i+1; j < N; j++) {
			double acc = 0.0;
			for (k = i; k < N; k++) {
				acc += ccol[k] * a_r[k+j*N];
			}
			acc *= alpha;
			for (k = i; k < N; k++) {
				a_r[k+j*N] -= ccol[k] * acc;
			}
		}
 
		/* Overwrite the w vector with the column data. */
		ccol[i] = norm;
		for (j = i+1; j < N; j++) {
			ccol[j] = 0.0;
		}
	}
}

The algorithm follows almost everthing described on this page – albiet with some minor obfuscations (the \mathbf{w} vector is stored temporarily in the output matrix to avoid needing any extra storage). It’s worth mentioning that the first step of initialising the \mathbf{Q} matrix to \mathbf{I} isn’t necessary. The initialisation could be performed explicitly in the first iteration of the loop over i – but that introduces more conditional code.