Category Archives: Uncategorized

Even More Householder

In several previous blogs (here and here), I’ve written about the orthonormal Householder matrix and how it can constructed such that it contains a particular unit-vector. A recent tweet from Eric Arnebäck posted a link to some code that was originally put together in 1999 by Tomas Akenine-Möller and John Hughes to find an orthogonal matrix which maps one unit-vector to another. This is a slightly more general problem which can be still solved in the same way using the Householder matrix. The vectors and matrices in this post are assumed to be real but the method can be modified to work with complex numbers.

In the previously linked posts, we described the method to find the Householder matrix \mathbf{P} given a unit-vector \mathbf{x} such that \mathbf{P} \mathbf{x}=\mathbf{e}_1. Another way of wording this is: find the orthogonal matrix \mathbf{P} that contains \mathbf{x} as the first row. Now, we will generalise this to find an orthogonal matrix \mathbf{P} given a unit-vector \mathbf{x} such that \mathbf{P} \mathbf{x}=\mathbf{y} where \mathbf{y} is another unit-vector. We start by defining \mathbf{P} to be the Householder reflector \mathbf{P}=\mathbf{I}-2\mathbf{v}\mathbf{v}^H and attempt to solve for \mathbf{v}:

    \[ \begin{array}{@{}ll@{}} \mathbf{y} &= \left(\mathbf{I} - 2 \mathbf{v} \mathbf{v}^H\right) \mathbf{x} \\ \mathbf{x} - \mathbf{y} &= 2 \mathbf{v} \mathbf{v}^H \mathbf{x} \end{array} \]

As the quantity \mathbf{v}^H \mathbf{x} is a scalar, the vector \mathbf{v} must be a scaled version of \mathbf{x}-\mathbf{y}. Define \mathbf{v}=\alpha\left(\mathbf{x}-\mathbf{y}\right) and make the substitution into the previous equation:

    \[ \begin{array}{@{}ll@{}} \mathbf{x} - \mathbf{y} &= 2 \alpha^2 \left(\mathbf{x}-\mathbf{y}\right) \left(\mathbf{x}-\mathbf{y}\right)^H \mathbf{x} \\ &= 2 \alpha^2 \left(\mathbf{x}-\mathbf{y}\right) \left(1-\mathbf{y}^H \mathbf{x}\right) \\ \frac{1}{\sqrt{2 \left(1-\mathbf{y}^H \mathbf{x}\right)}} &= \alpha \end{array} \]

Observe that \alpha has a singularity as \mathbf{y}^H \mathbf{x} approaches one. This can be addressed (as has been done in the previous posts) by negating the definition of the reflector (i.e. set \mathbf{P}=2\mathbf{v}\mathbf{v}^H-\mathbf{I}) and solving for \mathbf{v} again:

    \[ \begin{array}{@{}ll@{}} \mathbf{y} &= \left(2 \mathbf{v} \mathbf{v}^H - \mathbf{I}\right) \mathbf{x} \\ \mathbf{x} + \mathbf{y} &= 2 \mathbf{v} \mathbf{v}^H \mathbf{x} \end{array} \]

This time, \mathbf{v} will be defined as \mathbf{v} = \alpha\left(\mathbf{x}+\mathbf{y}\right) and the solution for \alpha becomes:

    \[ \frac{1}{\sqrt{2 \left(1+\mathbf{y}^H \mathbf{x}\right)}}\]

This shifts the singularity to \mathbf{y}^H \mathbf{x}=-1. If the quantity \mathbf{y}^H \mathbf{x} is positive, this latter solution will be more numerically robust; if it is negative, the former will be more numerically robust. We will build an algorithm that performs the flip based on the sign of \mathbf{y}^H \mathbf{x} which results in a discontinuity of the resulting matrix when the sign changes. This might be important to consider in some use-cases.

We can avoid computing the square root because \alpha always ends up being used squared. We define 2 \mathbf{v}\mathbf{v}^H=\beta\mathbf{w}\mathbf{w}^H where \mathbf{w} is defined to be either \mathbf{x}-\mathbf{y} or \mathbf{x}+\mathbf{y} leading to \beta being defined as either \frac{1}{1-\mathbf{y}^H \mathbf{x}} or \frac{1}{1+\mathbf{y}^H \mathbf{x}} respectively. We choose which quantity to use based on the sign of \mathbf{y}^H \mathbf{x}. The following C listing is not efficient but provides a reference in very little code for arbitrary dimension vectors.

/* Find the orthonormal symmetric reflection matrix which
 * maps the given unit-vector x to the given unit-vector y.
 * As the matrix is symmetric, it can be interpreted as
 * either row or column major.
 * This isn't efficient. */
void make_reflector(const float *y, const float *x, unsigned N, float *mat) {
	unsigned i, j;
	float proj, sgn, beta;
	for (i = 0, proj = 0.0f; i < N; i++)
		proj += y[i] * x[i];
	sgn  = (proj >= 0.0f) ? 1.0 : -1.0;
	beta = 1.0f / (proj + sgn);
	for (i = 0; i < N; i++) {
		for (j = 0; j < N; j++) {
			float w_i = x[i] + y[i] * sgn;
			float w_j = x[j] + y[j] * sgn;
			mat[i*N+j] = w_i * beta * w_j;
		mat[i*N+i] -= sgn;

Because the resulting matrix is symmetric only n(n+1)/2 elements need to be computed. The w vector could also be precomputed to avoid excessive repeated computation in the inner loop. The completely unrolled and branch-less 3-dimension case can be written as:

void make_reflector(float x[3], float y[3], float mat[3][3]) {
	float proj  = x[0]*y[0] + x[1]*y[1] + x[2]*y[2];
	float sgn   = copysignf(1.0f, proj);
	float beta  = 1.0f / (proj + sgn);
	float w0    = x[0] + y[0] * sgn;
	float w1    = x[1] + y[1] * sgn;
	float w2    = x[2] + y[2] * sgn;
	float w0a   = w0 * beta;
	float w1a   = w1 * beta;
	float w2a   = w2 * beta;
	mat[0][0]   = w0a * w0 - sgn;
	mat[0][1]   = w0a * w1;
	mat[0][2]   = w0a * w2;
	mat[1][0]   = mat[0][1];
	mat[1][1]   = w1a * w1 - sgn;
	mat[1][2]   = w1a * w2;
	mat[2][0]   = mat[0][2];
	mat[2][1]   = mat[1][2];
	mat[2][2]   = w2a * w2 - sgn;

This looks very similar to the listings used to reflect \mathbf{x} to \mathbf{e}_1. None of this is new. The Householder matrix has been used for a very long time to solve this exact problem in the context of finding orthogonal matrices for performing the QR decomposition (google “QR decomposition using reflectors” to find less-verbose descriptions of what I’ve written on this page).

State-Space Representation – An Engineering Primer.

This is the first blog in what will probably be a series on the benefits of using state-space representation to implement discrete infinite impulse response (IIR) filters. Before I write about state-space and some of the things you can do with it, I want to give an overview of the difference equation and explain (hopefully intuitively) why they can be a poor choice for implementing a filter.

The discrete difference equation

The discrete difference equation takes an input value, a list of previous input values and a list of previous output values and combines them linearly to produce the next output value. This is codified mathematically in the following equation which can be recognised as the sum of two convolutions:

    \[ y[n] = \sum_{k=0}^{N_b} b[k] x[n-k] - \sum_{k=1}^{N_a} a[k] y[n-k] \]

Instead of the above, we could write the difference equation as two distinct difference equations chained together:

    \[ \begin{array}{@{}ll@{}} w[n] &= \sum_{k=0}^{N_b} b[k] x[n-k] \\ y[n] &= w[n] - \sum_{k=1}^{N_a} a[k] y[n-k] \end{array} \]

w[n] is the output of an all-zero filter which is used as input into an all-pole filter to produce y[n]. If we were to take the z-transforms of the two filters, we would see quite plainly that we can also swap the order of the filters as:

    \[ \begin{array}{@{}ll@{}} q[n] &= x[n] - \sum_{k=1}^{N_a} a[k] q[n-k] \\ y[n] &= \sum_{k=0}^{N_b} b[k] q[n-k] \end{array} \]

This time, q[n] is the output of an all-pole filter which is used as input into an all-zero filter to produce y[n]. If we set N=N_a=N_b, the benefit of this representation is that the two filters share exactly the same state and so an implementation needs not have delay lines for both output and input. This formulation has the name “Direct Form II” and is a popular implementation choice.

Regardless of whether we implement the difference equation directly or use Direct Form II, the implementation is easy to realise in a C program given a set of filter coefficients… so why would we want to do anything differently?

When difference equations go bad…

Let’s build a 6-th order elliptic filter and run the difference equation explicitly using the following GNU Octave code. The filter is designed with a passband edge frequency of 240 Hz at a 48000 Hz sample rate (which isn’t particularly low) with 6 dB of ripple in the passband and 80 dB of stopband attenuation.

N = 6;
IRLen = 8000;
[b, a] = ellip(N, 6, 80.0, 240.0/24000);
dp_y = zeros(IRLen, 1);
q = zeros(N, 1);
x = 1;
for i = 1:IRLen
  qv      = x - a(2:end) * q;
  dp_y(i) = b * [qv; q];
  q       = [qv; q(1:(N-1))];
  x       = 0;

And here is a plot of the impulse response and frequency response (which both look pretty good!):

Frequency Response of Double Precision Difference Equation - Everything looks good

Frequency Response of Double Precision Difference Equation

Oh yeah, Octave by default is using double-precision for everything. Because we like fast things, let’s specify that the states should always be stored using single precision by replacing the initialisation of q with:

q = single(zeros(N, 1));

And here is the new impulse response:

Frequency Response of Single Precision Difference Equation - it explodes into instability

Frequency Response of Single Precision Difference Equation

Bugger! Storing the states in single precision has lead to an instability in the filter. Small errors in the state are somehow being amplified by the coefficients of the difference equation and the output is growing without bound.

I’m going to try and give some intuition here: imagine that you are trying to synthesise a low-frequency discretely-sampled sinusoid that follows a slowly decaying exponential envelope. You’ve found that there is a z-transform that does exactly what you want from a z-transform table:

    \[ a^n \sin(\omega_0 n) u[n] \Leftrightarrow \frac{a z \sin(\omega_0)}{z^2 - 2az\cos(\omega_0) + a^2} \]

You decide to implement the difference equation directly using Direct Form II leading to states that are the previous outputs of the impulse response of the all-pole component of the system.

The intuition is this: the values of the states close to t=0 are all very close to zero – but after sufficient time, will eventually reach a very high-peak relative to their starting point without any external influence. Tiny errors in these small state values could grow by multiple orders of magnitude as the state evolves. This effectively happens at every zero crossing of the signal – and that is just illustrative component of the problem.

Ideally, the energy in the states of the system would only ever decrease when x[n]=0 regardless of the values of y[n]. Unfortunately, this almost never happens with a resonant system implemented using a difference equation. We should ask the question: can we find better states than previous outputs of the all-pole filter?


The Direct Form II difference equations presented earlier utilising q[n] and y[n] can be mutated into some fairly gross matrix equations:

    \[ \begin{tiny} \begin{array}{@{}ll@{}} \begin{pmatrix} q[n-N+1] \\ \vdots \\ q[n-2] \\ q[n-1] \\ q[n] \end{pmatrix} &= \begin{pmatrix} 0 & 1 & 0 & \dots & 0 \\ 0 & \ddots & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & 1 & 0 \\ 0 & \dots & 0 & 0 & 1 \\ -a[N] & \dots & -a[3] & -a[2] & -a[1] \end{pmatrix} \begin{pmatrix} q[n-N] \\ \vdots \\ q[n-3] \\ q[n-2] \\ q[n-1] \end{pmatrix} + \begin{pmatrix} 0 \\ \vdots \\ \vdots \\ 0 \\ 1 \end{pmatrix} x[n] \\ y[n] &= \begin{pmatrix} b[N]-a[N]b[0] & \dots & b[2]-a[2]b[0] & b[1]-a[1]b[0] \end{pmatrix} \begin{pmatrix} q[n-N] \\ \vdots \\ q[n-2] \\ q[n-1] \end{pmatrix} + b[0] x[n] \end{array} \end{tiny} \]

Work through these and convince yourself they are equivalent. We can rewrite this entire system giving names to the matrices as:

    \[ \begin{array}{@{}ll@{}} \mathbf{q}[n+1] &= \mathbf{A}\mathbf{q}[n] + \mathbf{B}\mathbf{x}[n] \\ \mathbf{y}[n] &= \mathbf{C} \mathbf{q}[n] + \mathbf{D}\mathbf{x}[n] \end{array} \]

These matrix equations are a state-space representation. The explicitly listed matrices before it are known as a “controllable canonical form” representation (for the case when b[0]=0). \mathbf{q} is the state vector and in controllable canonical form, it holds previous values of q[n] as described in the difference equations earlier.

So why would we want to do this? Who cares that we mapped a difference equation into a set of matrix equations? I’m going to list out some reasons which probably deserve their own blogs:

  1. Ability to analyse system using a vast array of matrix mathematical tools.
  2. Improve the numerical stability of the system.
  3. Preserving physical meanings of states during discretisation.
  4. Ability to design for efficient implementations.

In this blog, we’re going to look at points one and two briefly to attempt to address the problems shown earlier.

Firstly, let’s write the state equations in terms of their z-transforms so that we can build an expression that maps input to output:

    \[ \begin{array}{@{}ll@{}} z\mathbf{Q}(z) &= \mathbf{A}\mathbf{Q}(z) + \mathbf{B}\mathbf{X}(z) \\ \mathbf{Y}(z) &= \mathbf{C} \mathbf{Q}(z) + \mathbf{D}\mathbf{X}(z) \end{array} \]

Doing some rearranging and substitution enables us to eliminate \mathbf{Q}(z):

    \[\begin{array}{@{}ll@{}} z\mathbf{Q}(z) &= \mathbf{A}\mathbf{Q}(z)+\mathbf{B}\mathbf{X}(z) \\ \left(\mathbf{I}z-\mathbf{A}\right)\mathbf{Q}(z) &= \mathbf{B}\mathbf{X}(z) \\ \mathbf{Q}(z) &= \left(\mathbf{I}z-\mathbf{A}\right)^{-1} \mathbf{B}\mathbf{X}(z) \\ \mathbf{Y}(z) &= \left( \mathbf{C}\left(\mathbf{I}z-\mathbf{A}\right)^{-1} \mathbf{B}+\mathbf{D} \right) \mathbf{X}(z) \\ &= \frac{\mathbf{C}\text{ adj}\left(\mathbf{I}z-\mathbf{A}\right) \mathbf{B}+\mathbf{D}\det\left(\mathbf{I}z-\mathbf{A}\right) }{\det\left(\mathbf{I}z-\mathbf{A}\right)} \mathbf{X}(z) \end{array}\]

The \det\left(\mathbf{I}z-\mathbf{A}\right) term in the denominator expands to the characteristic polynomial of \mathbf{A} in z and if we are trying to map a z-domain transfer function to a state-space system, it needs to equal the denominator of the transfer function. Just as the poles of a transfer function are roots of the denominator, the poles of a state-space system are the eigenvalues of the \mathbf{A} matrix. The numerator of the above expression is a matrix of polynomials in z showing how state-space is naturally able to handle multiple-input multiple-output (MIMO) systems.

Earlier, we asked can we pick better values for the states? Before getting to that, let’s answer a different question: can we pick different values for the states? The answer is: yes – and we can prove it trivially by substituting \mathbf{q}[n] with \mathbf{W}^{-1}\hat{\mathbf{q}}[n] where \mathbf{W} is some invertible square matrix:

    \[ \begin{array}{@{}ll@{}} \mathbf{W}^{-1}\hat{\mathbf{q}}[n+1] &= \mathbf{A}\mathbf{W}^{-1}\hat{\mathbf{q}}[n] + \mathbf{B}\mathbf{x}[n] \\ \hat{\mathbf{q}}[n+1] &= \mathbf{W}\mathbf{A}\mathbf{W}^{-1}\hat{\mathbf{q}}[n] + \mathbf{W}\mathbf{B}\mathbf{x}[n] \\ \mathbf{y}[n] &= \mathbf{C} \mathbf{W}^{-1}\hat{\mathbf{q}}[n] + \mathbf{D}\mathbf{x}[n] \end{array} \]

This changes the coordinate system that the state vector exists in. We can pick any invertible matrix \mathbf{W} and use it to modify a system’s \mathbf{A}, \mathbf{B} and \mathbf{C} matrices keeping the system output \mathbf{y}[n] identical. The quantity \mathbf{W}\mathbf{A}\mathbf{W}^{-1} is a similarity transform on \mathbf{A} and it preserves the eigenvalues/characteristic polynomial of \mathbf{A} for all invertible \mathbf{W}.

State-space degrees of freedom

The coefficients of a difference equation are unique for a particular impulse response. The coefficients of the state-space representation for N>1 are not. In-fact, we have already demonstrated this previously by showing we can change the coordinate system of the state-vector.

Let’s go through the process of taking a 2×2 SISO system and finding the transfer function.

    \[ \begin{small} \begin{array}{@{}ll@{}} \frac{Y(z)}{X(z)} &= \frac{\mathbf{C}\text{ adj}\left(\mathbf{I}z-\mathbf{A}\right) \mathbf{B}+\mathbf{D}\det\left(\mathbf{I}z-\mathbf{A}\right) }{\det\left(\mathbf{I}z-\mathbf{A}\right)} \\ &= \frac{\begin{pmatrix}c_0 & c_1\end{pmatrix}\text{ adj}\begin{pmatrix}z-a_{00} & -a_{01} \\ -a_{10} & z-a_{11}\end{pmatrix} \begin{pmatrix}b_0 \\ b_1\end{pmatrix}+d\det\begin{pmatrix}z-a_{00} & -a_{01} \\ -a_{10} & z-a_{11}\end{pmatrix} }{\det\begin{pmatrix}z-a_{00} & -a_{01} \\ -a_{10} & z-a_{11}\end{pmatrix}} \\ &= \frac{\begin{pmatrix}c_0 & c_1\end{pmatrix}\begin{pmatrix}z-a_{11} & a_{01} \\ a_{10} & z-a_{00}\end{pmatrix} \begin{pmatrix}b_0 \\ b_1\end{pmatrix}+d\left((z-a_{00})(z-a_{11})-a_{01}a_{10}\right)}{(z-a_{00})(z-a_{11})-a_{01}a_{10}} \\ &= \frac{c_0(b_0(z-a_{11}) + b_1a_{01}) + c_1(b_0a_{10} + b_1(z-a_{00}))+d\left((z-a_{00})(z-a_{11})-a_{01}a_{10}\right)}{(z-a_{00})(z-a_{11})-a_{01}a_{10}} \\ &= \frac{dz^2+z(c_0b_0+c_1b_1-d(a_{00}+a_{11}))+c_0(b_1a_{01}-b_0a_{11})+c_1(b_0a_{10}-b_1a_{00})+d(a_{00}a_{11}-a_{01}a_{10})}{z^2-z(a_{00}+a_{11}) + a_{00}a_{11}-a_{01}a_{10}} \end{array} \end{small} \]

We can select all of these quantities to match a given second order transfer function – but how should we choose them? What makes for good coefficients? What does good mean? We could force some values to zero or one to minimize multiplies? If we want to improve numeric performance, given that the \mathbf{A} matrix is responsible for moving states forward in time, it seems reasonable to minimize the impact that errors in the state vector can have when multiplied by \mathbf{A} – this is akin to making \mathbf{A} as well conditioned as possible. This might be a good strategy for fixing our example 6th order system.

Orthogonal matrices are well-conditioned and for those deeply familiar with IIR design, you may also know that the state update for a coupled form IIR implementation is effectively determined by an orthonormal matrix that has been attenuated by some scalar. Coupled form state-updates can only be used to mimic a conjugate pole pair, but for every classical IIR design (Butterworth, Chebyshev I/II, Cauer/Elliptic) of even order (like our example 6th order problematic system), all the poles are in conjugate pairs. Start by defining \mathbf{A} as an orthogonal matrix:

    \[ \mathbf{A} = \begin{pmatrix}a_{00} & -a_{10} \\ a_{10} & a_{00} \end{pmatrix} \]

The transfer function becomes:

    \[ \begin{tiny} \frac{Y(z)}{X(z)} = \frac{dz^2+z(c_0b_0+c_1b_1-2da_{00})+c_0(-b_1a_{10}-b_0a_{00})+c_1(b_0a_{10}-b_1a_{00})+d(a_{00}^2+a_{10}^2)}{z^2-2za_{00} + a_{00}^2+a_{10}^2} \end{tiny} \]

There are two remaining free variables we can pick. Somewhat arbitrarily, we will select c_0=1 and c_1=0 to get:

    \[ \frac{Y(z)}{X(z)} = \frac{dz^2+z(b_0-2da_{00})-b_1a_{10}-b_0a_{00}+d(a_{00}^2+a_{10}^2)}{z^2-2za_{00} + a_{00}^2+a_{10}^2} \]

We now have a system of equations we can solve for the unknown matrix quantities that will create a state-space system that mimics a 2nd order resonant transfer function.

Chaining systems together

Consider we have two systems which we would like to connect together:

    \[ \begin{array}{@{}ll@{}} z\mathbf{Q}_1(z) &= \mathbf{A}_1\mathbf{Q}_1(z) + \mathbf{B}_1\mathbf{X}(z) \\ \mathbf{U}(z) &= \mathbf{C}_1 \mathbf{Q}_1(z) + \mathbf{D}_1\mathbf{X}(z) \end{array} \]


    \[ \begin{array}{@{}ll@{}} z\mathbf{Q}_2(z) &= \mathbf{A}_2\mathbf{Q}_2(z) + \mathbf{B}_2\mathbf{U}(z) \\ \mathbf{Y}(z) &= \mathbf{C}_2 \mathbf{Q}_2(z) + \mathbf{D}_2\mathbf{U}(z) \end{array} \]

We can make direct substitutions to get:

    \[ \begin{array}{@{}ll@{}}  z\mathbf{Q}_1(z) &= \mathbf{A}_1\mathbf{Q}_1(z) + \mathbf{B}_1\mathbf{X}(z) \\ z\mathbf{Q}_2(z) &= \mathbf{A}_2\mathbf{Q}_2(z) + \mathbf{B}_2\mathbf{C}_1 \mathbf{Q}_1(z) + \mathbf{B}_2 \mathbf{D}_1\mathbf{X}(z) \\ \mathbf{Y}(z) &= \mathbf{C}_2 \mathbf{Q}_2(z) + \mathbf{D}_2\mathbf{C}_1 \mathbf{Q}_1(z) + \mathbf{D}_2\mathbf{D}_1\mathbf{X}(z) \end{array} \]

We can re-write the above as a set of block matrix equations:

    \[ \begin{array}{@{}ll@{}} z\begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} &= \begin{pmatrix} \mathbf{A}_2 & \mathbf{B}_2 \mathbf{C}_1 \\ \mathbf{0} & \mathbf{A}_1 \end{pmatrix} \begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} + \begin{pmatrix}\mathbf{B}_2 \mathbf{D}_1 \\ \mathbf{B}_1\end{pmatrix} \mathbf{X}(z) \\ \mathbf{Y}(z) &= \begin{pmatrix}\mathbf{C}_2 & \mathbf{D}_2 \mathbf{C}_1 \end{pmatrix}\begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} + \mathbf{D}_2 \mathbf{D}_1 \mathbf{X}(z) \end{array} \]

This provides a direct method for connecting systems together into one larger system. As \mathbf{A} is an upper-triangular block-diagonal matrix, the eigenvalues of the new system are the same as those of the existing systems and the \mathbf{B}_2\mathbf{C}_1 term in the connected system has no influence. Given that, can we decouple the sections completely? i.e. can we completely eliminate the \mathbf{B}_2\mathbf{C}_1 term by using a similarity transform as described previously? Let’s pose the question using block matrices:

    \[ \begin{pmatrix} \mathbf{A}_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_1 \end{pmatrix} \begin{pmatrix} \mathbf{W}_{00} & \mathbf{W}_{01} \\ \mathbf{W}_{10} & \mathbf{W}_{11} \end{pmatrix} = \begin{pmatrix} \mathbf{W}_{00} & \mathbf{W}_{01} \\ \mathbf{W}_{10} & \mathbf{W}_{11} \end{pmatrix} \begin{pmatrix} \mathbf{A}_2 & \mathbf{B}_2 \mathbf{C}_1 \\ \mathbf{0} & \mathbf{A}_1 \end{pmatrix} \]

    \[ \begin{array}{@{}ll@{}} \mathbf{A}_2 \mathbf{W}_{00} &= \mathbf{W}_{00} \mathbf{A}_2 \\ \mathbf{A}_1 \mathbf{W}_{10} &= \mathbf{W}_{10} \mathbf{A}_2 \\ \mathbf{A}_2 \mathbf{W}_{01} &= \mathbf{W}_{00} \mathbf{B}_2 \mathbf{C}_1 + \mathbf{W}_{01} \mathbf{A}_1 \\ \mathbf{A}_1 \mathbf{W}_{11} &= \mathbf{W}_{10} \mathbf{B}_2 \mathbf{C}_1 + \mathbf{W}_{11} \mathbf{A}_1 \end{array} \]

There is only one valid set of solutions for these equations:

    \[ \begin{array}{@{}ll@{}} \mathbf{W}_{00} &= \mathbf{I} \\ \mathbf{W}_{11} &= \mathbf{I} \\ \mathbf{W}_{10} &= \mathbf{0} \\ \mathbf{A}_2 \mathbf{W}_{01} - \mathbf{W}_{01} \mathbf{A}_1 &= \mathbf{B}_2 \mathbf{C}_1 \end{array} \]

Where the final equation is a Sylvester equation which can be solved without too much pain programatically (i.e. using a matrix inverse) if and only if \mathbf{A}_1 and \mathbf{A}_2 share no common eigenvalues. If we do this, we end up with the following system:

    \[ \begin{array}{@{}ll@{}} z\begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} &= \begin{pmatrix} \mathbf{A}_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_1 \end{pmatrix} \begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} + \mathbf{W}\begin{pmatrix}\mathbf{B}_2 \mathbf{D}_1 \\ \mathbf{B}_1\end{pmatrix} \mathbf{X}(z) \\ \mathbf{Y}(z) &= \begin{pmatrix}\mathbf{C}_2 & \mathbf{D}_2 \mathbf{C}_1 \end{pmatrix}\mathbf{W}^{-1}\begin{pmatrix}\mathbf{Q}_2(z) \\ \mathbf{Q}_1(z)\end{pmatrix} + \mathbf{D}_2 \mathbf{D}_1 \mathbf{X}(z) \end{array} \]

This will be better conditioned and have some other benefits in terms of efficient implementations. This block-diagonalisation we just performed is the equivalent of performing the partial-fraction decomposition on the cascaded system (which I’ve never seen a nice looking algorithmic solution for).

Bringing it home

Let’s take our problematic 6th order system, break it up into 3 second order state-space sections using the coupled-form mapping we described and evaluate the performace again:

Performance of the coupled-form state-space systems vs. the double precision difference equation

Performance of the coupled-form state-space systems vs. the double precision difference equation

There are 4 data-sets:

  1. The original double-precision difference equation results.
  2. The concatenated coupled-form systems without block-diagonalisation storing the states in single precision.
  3. The concatenated coupled-form systems without block-diagonalisation storing the states in a q15 format (15 bits of fractional precision).
  4. The concatenated coupled-form systems with block-diagonalisation storing the states in q15 format.

The performance difference between 2 and the block-diagonalised system with states in signle precision is negligible and so is not included in the plots. The performance difference between 1, 2 and 4 is practiacally identical in the passband. There is some fudging going on with the q15 plots because the precision is fixed but the range is not. I’ve tried to make things more even by:

  • 2-Norm matching the \mathbf{B} and \mathbf{C} matrices of the systems without diagonalisation.
  • 2-Norm matching each sub-section of the \mathbf{B} and \mathbf{C} matrices of the system with diagonalisation (which we can do because they are all operating independetly).
  • 2-Norm matching the \mathbf{C} matrices of the diagonalised and undiagonalised systems after the above has happened (which also requires modifying \mathbf{B}).

The performance of 1, 2 and 4 are all incredibly good. The performance of 3 shows that cascading leads to error build-up that significantly compromises the output.

These coupled-form state-space filter realisations are good down to very low frequencies. I have had success synthesising sub 10 Hz 16th order elliptic filters that work just fine with single-precision floating point. Next time, we’ll look at preservation of state meanings when we discretise an analogue system.

Arbitrary polynomial-segment signal generation (e.g. square, triangle, sawtooth waves)

There is a tool which is part of my Open Diapason project called “sampletune” which is designed to allow sample-set creators to tune samples by ear and save the tuning information back into the sampler chunk of the wave file. The utility plays back a fixed tone and there are keyboard controls to shift the pitch of the sample up and down until the beats disappear. I was hoping to prototype a sawtooth wave as the tuning signal, but realised during implementation that creating arbitrary frequency piecewise-linear waveforms is actually somewhat a difficult problem – fundamentally due to aliasing… so I parked that idea and ended up summing several harmonic sinusoids at some arbitrary levels.

Recently, I was thinking about this again and set about solving the problem. It was only in the last couple of weeks that I discovered tools like Band Limited Impulse Trains (see link at bottom of post) which seem to be the common way of generating these signals. These are fundamentally FIR techniques. I approached the problem using IIR filters and came up with a solution which I think is kind-of interesting – please let me know if you’ve seen it done this way so I can make some references!

The idea

If we generate a piecewise linear waveform wave at a frequency which is not a factor of the sampling rate, we will get aliasing. In fact, we can work out the number of shifted aliasing spectrum copies which we can expect to see by taking f_0 / \text{gcd}(f_0, f_s). If f_0 is irrational, we have an infinite number of copies. Let’s make the assumption that more-often than not, we are going to be generating a very large number of aliases. The next assumption we are going to make is that most of the signals we are interested in generating have a spectral envelope which more-or-less decays monotonically. With this assumption, we can say: if we generate our signal at a much higher sampling rate, we effectively push the level of the aliasing down in the section of the spectrum we care about (i.e. the audible part). Below is an image which hopefully illustrates this; the upper image shows the spectral envelope of the signal we are trying to create in blue, all of the black lines which follow are envelopes of the aliasing. The second image shows what would have happened if we created the same frequency signal at three times the sampling rate. The interesting content again is in blue and everything in black is either aliasing or components we don’t care about. What we can see is that by increasing the sample rate we generate our content at, we effectively push the aliasing components down in the frequency region we care about.

So let’s assume that instead of generating our waveforms at f_s, we will:

  • Generate them at a sample rate of M f_s where M is some large number.
  • Filter out the high-frequency components above f_s / 2 using some form of IIR filter structure.
  • Decimate the output sequence by a factor of M to get back to f_s (which we can do because of the previous operation).

Now that we’ve postulated a method, let’s figure out if it makes any sense by answering the questions “does it have an efficient implementation?” and “does it actually produce reasonable output for typical signals we may want to generate?”.

Does it have an efficient implementation?

Denote a state-space representation of our SISO discrete system as:

    \[   \begin{array}{@{}ll@{}}     \hat{q}\left[n\right] & = \mathbf{A} \hat{q}\left[n-1\right] + \mathbf{B} x\left[n\right] \\     y\left[n\right] & = \mathbf{C} \hat{q}\left[n-1\right] + D x\left[n\right]   \end{array} \]

For clarity, let’s just consider a single update of the system state where \hat{q}\left[-1\right] represents the “previous state” and x\left[0\right] represents the first sample of new data we are pushing in.

    \[   \begin{array}{@{}ll@{}}     \hat{q}\left[0\right] & = \mathbf{A} \hat{q}\left[-1\right] + \mathbf{B} x\left[0\right] \\     y\left[0\right] & = \mathbf{C} \hat{q}\left[-1\right] + D x\left[0\right]   \end{array} \]

We can derive an expression to insert two samples in via substitution and from that derive expressions for inserting groups of samples.

    \[   \begin{array}{@{}ll@{}}     \hat{q}\left[0\right] & = \mathbf{A} \hat{q}\left[-1\right] + \mathbf{B} x\left[0\right] \\     \hat{q}\left[1\right] & = \mathbf{A} \hat{q}\left[0\right] + \mathbf{B} x\left[1\right] \\  & = \mathbf{A}^2 \hat{q}\left[-1\right] + \mathbf{A}\mathbf{B} x\left[0\right] + \mathbf{B} x\left[1\right] \\ \hat{q}\left[\sigma\right] & = \mathbf{A}^{\sigma+1}\hat{q}\left[-1\right] + \sum_{n=0}^{\sigma}\mathbf{A}^{\sigma-n}\mathbf{B}x\left[n\right]  \\   \end{array} \]

    \[   \begin{array}{@{}ll@{}}     y\left[0\right] & = \mathbf{C} \hat{q}\left[-1\right] + D x\left[0\right] \\     y\left[1\right] & = \mathbf{C} \hat{q}\left[0\right] + D x\left[1\right] \\  & = \mathbf{C}\mathbf{A}\hat{q}\left[-1\right] + \mathbf{C}\mathbf{B}x\left[0\right] + D x\left[1\right] \\ y\left[\sigma\right] & = \mathbf{C}\mathbf{A}^{\sigma}\hat{q}\left[-1\right] + \mathbf{C}\sum_{n=0}^{\sigma-1}\mathbf{A}^{\sigma-1-n}\mathbf{B}x\left[n\right] + D x\left[\sigma\right]   \end{array} \]

We’re interested in inserting groups of samples which form linear segments into our filter. So let’s replace x\left[n\right] with some arbitrary polynomial. This gives us some flexibility; we need only a polynomial of zero degree (constant) for square waves, first degree for saw-tooth and triangle – but we could come up with more interesting waveforms which are defined by higher order polynomials. i.e.

    \[ x\left[n\right] = \sum_{c=0}^{C-1} a_c n^c \]

Let’s substitute this into our state expressions:

    \[   \begin{array}{@{}ll@{}}     \hat{q}\left[0\right] & = \mathbf{A} \hat{q}\left[-1\right] + \mathbf{B} a_0 \\     \hat{q}\left[\sigma\right] & = \mathbf{A}^{\sigma+1}\hat{q}\left[-1\right] + \sum_{n=0}^{\sigma}\mathbf{A}^{\sigma-n}\mathbf{B}\sum_{c=0}^{C-1} a_c n^c \\  & = \left(\mathbf{A}^{\sigma+1}\right)\hat{q}\left[-1\right] + \sum_{c=0}^{C-1} a_c \left(\sum_{n=0}^{\sigma}\mathbf{A}^{\sigma-n}\mathbf{B} n^c\right) \\   \end{array} \]

    \[   \begin{array}{@{}ll@{}}     y\left[0\right] & = \mathbf{C} \hat{q}\left[-1\right] + D a_0 \\     y\left[\sigma\right] & = \mathbf{C}\mathbf{A}^{\sigma}\hat{q}\left[-1\right] + \mathbf{C}\sum_{n=0}^{\sigma-1}\mathbf{A}^{\sigma-1-n}\mathbf{B}\sum_{c=0}^{C-1} a_c n^c + D \sum_{c=0}^{C-1} a_c \sigma^c \\  & = \left(\mathbf{C}\mathbf{A}^{\sigma}\right)\hat{q}\left[-1\right] + \sum_{c=0}^{C-1} a_c \left(\sum_{n=0}^{\sigma-1}\mathbf{C}\mathbf{A}^{\sigma-1-n}\mathbf{B} n^c\right) + \sum_{c=0}^{C-1} a_c \left(D \sigma^c\right)   \end{array} \]

There are some interesting things to note here:

  • The first output sample (y\left[0\right]) is only dependent on a_0 and the previous state. If we restrict our attention to always getting the value of the first sample, we don’t need to do any matrix operations to get an output sample – just vector operations.
  • All of the parenthesised bits in the equations have no dependency on the coefficients of the polynomial we are inserting into the filter – they only depend on the number of samples we want to insert (\sigma), the system state matrices and/or the index of the polynomial coefficient. This means that we can build lookup tables for the parenthesised matrices/vectors that are indexed by \sigma and the polynomial coefficient index and effectively insert a large number of samples of any polynomial into the filter state with a single matrix-matrix multiply and a number of scalar-vector multiplies dependent on the order of the polynomial.

It’s looking like we have an efficient implementation – particularly if we are OK with allowing a tiny (almost one sample at f_s) amount of latency into the system.

Does it work?

Below is a screen capture of Adobe Audition showing the output of a small program which implemented the above scheme. The system is configured using an elliptic filter with 1 dB of passband ripple and a stop-band gain of -60 dB.

Waveform with frequency of 600*pi Hz. Left channel generated using naive modulo, right generated using the proposed scheme.

Waveform with frequency of 600*pi Hz. Left channel generated using naive modulo, right generated using the proposed scheme. The spectrum of the left channel is green.

The waveform has a frequency of 600\pi which is irrational and therefore has an infinite number of aliases when sampled (this is not strictly true in a digital implementation – but it’s still pretty bad). We can see that after the cutoff frequency of the filter, the aliasing components are forced to decay at the rate of the filter – but we predicted this should happen.

Below is another example using an 4th order Bessel filter instead of an elliptic filter. This preserves the shape of the time-domain signal much more closely than the elliptic system, but rolls off much slower. But the performance is still fairly good and might be useful.

Waveform with frequency of 600*pi Hz. Left channel generated using naive modulo, right generated using the proposed scheme with a Bessel filter.

Waveform with frequency of 600*pi Hz. Left channel generated using naive modulo, right generated using the proposed scheme. The spectrum of the left channel is green.

Here is a link to a good paper describing how classic waveforms have been done previously “Alias-Free Digital Synthesis of Classic Analog Waveforms” [
T. Stilson and J. Smith, 1996]

Generating multiple low-quality random numbers… fast.


A while ago I was doing some profiling to find out ways to improve the load performance of my Open Diapason project. The load process is roughly as follows:

  • Load wave file from disk into memory.
  • Apply a linear phase filter to all loaded samples (this is to compensate for the response of the resampler which occurs during playback).
  • Normalise and quantise the filtered audio using TPDF dither. The normalisation ensures that we use the full 16-bits of precision for each sample. Obviously, the scaling coefficient is stored along with the sample to ensure we play the content back at the intended level.

All of these steps are done across multiple threads to maximise performance (there is one thread that is reading audio files from disk ready to spawn threads which apply the filter and requantise the data)… this is one of those times where the speedup was roughly linear with the first couple of threads which I added.

Anyway, back to the profiling: I was expecting that the vast majority of the cycles would be spent in the application of the filter, but this actually turned out not to be the case. The dithering and re-quantisation of the audio – which I assumed would be cheap – turned out to be chewing cycles on the order of 30% of the load time (all of my profiling was done on a mid 2014 MacBook Pro with an i7 processor).

There were multiple things which I did to speed it up, but I think what I pushed in this change list was the most interesting one – so what follows in this blog is a bit of a write up.

The idea

Uniform random numbers are useful and often we need several of them. In my use-case (performing TPDF dither on a sample pair) I needed four random variables. Generating them needs to be fast but the “quality” of the randomness is not that important. The obvious way to do this is to use a Linear Congruential Generator which ends up forming a code pattern like this:

r0 = (state * A + C) % M;
r1 = (r0 * A + C) % M;
r2 = (r1 * A + C) % M;
r3 = (r2 * A + C) % M;
state = r3;

This forms a dependency chain where each random number requires knowledge of the previous one and could end up leading to pipeline issues (this is really dependent on how the compiler re-orders the code and the architecture of the platform). In the code generated on my machine, this turned out to be a bottleneck. We could get around the pipeline issues by creating four different LCGs:

r0 = (state0 * A0 + C0) % M;
r1 = (state1 * A1 + C1) % M;
r2 = (state2 * A2 + C2) % M;
r3 = (state3 * A3 + C3) % M;
state0 = r0;
state1 = r1;
state2 = r2;
state3 = r3;

But this will require more registers to store the different LCG states and possibly more registers or code memory to hold all the new coefficients. It would be nice to have something a bit simpler…

LCGs are really cool toys. We can choose our A_n and C_n values in such a way that the pseudo-random sequence has period M. Given the algorithm generates the next value from the previous, this must mean that the random sequence contains every integer between 0 and M-1. The first part of the idea, is to modify the code as:

r0 = (state0 * A0) % M;
r1 = (state1 * A1) % M;
r2 = (state2 * A2) % M;
r3 = (state3 * A3) % M;
state0 = (r0 + C0 % M) % M;
state1 = (r1 + C1 % M) % M;
state2 = (r2 + C2 % M) % M;
state3 = (r3 + C3 % M) % M;

This produces a different sequence in r_0 to r_3 because the accumulation now happens at the state assignment, but still has a period M. Now for the dodgy part, we change the algorithm to this:

r0 = (state0 * A0) % M;
r1 = (state0 * A1) % M;
r2 = (state0 * A2) % M;
r3 = (state0 * A3) % M;
state0 = (r0 + C0 % M) % M;

Where C_0 = 1, M = 2^{32} and A_0 to A_3 are all different but satisfy the required conditions for an LCG to work. It should not be difficult to infer that r_0 to r_3 still all have period M – but produce different output sequences. I wrote a test program to look at the correlation of these sequences over their period which can be found on GitHub here – they are mostly independent over their period. I doubt they make great random numbers, but for performing dither, they are completely fine. The generated code seems pretty good also:

sampletest[0x10000a99f] <+3503>: imull  $0xd688014d, %r8d, %edi   ; imm = 0xD688014D 
sampletest[0x10000a9a6] <+3510>: imull  $0xdb71f7bd, %r8d, %r15d  ; imm = 0xDB71F7BD 
sampletest[0x10000a9ad] <+3517>: imull  $0xe05f354d, %r8d, %esi   ; imm = 0xE05F354D 
sampletest[0x10000a9b4] <+3524>: imull  $0xe54c7f35, %r8d, %ecx   ; imm = 0xE54C7F35 
sampletest[0x10000a9bb] <+3531>: movss  (%r11,%rdx,2), %xmm2      ; xmm2 = mem[0],zero,zero,zero 
sampletest[0x10000a9c1] <+3537>: mulss  %xmm1, %xmm2
sampletest[0x10000a9c5] <+3541>: movss  (%r10,%rdx,2), %xmm3      ; xmm3 = mem[0],zero,zero,zero 
sampletest[0x10000a9cb] <+3547>: mulss  %xmm1, %xmm3
sampletest[0x10000a9cf] <+3551>: leal   0x1(%rdi), %r8d
sampletest[0x10000a9d3] <+3555>: cvttss2si %xmm2, %r12
sampletest[0x10000a9d8] <+3560>: cvttss2si %xmm3, %r13
sampletest[0x10000a9dd] <+3565>: addq   %rdi, %rsi
sampletest[0x10000a9e0] <+3568>: addq   %r12, %rsi
sampletest[0x10000a9e3] <+3571>: addq   %r15, %rcx
sampletest[0x10000a9e6] <+3574>: addq   %r13, %rcx
sampletest[0x10000a9e9] <+3577>: shrq   $0x21, %rsi
sampletest[0x10000a9ed] <+3581>: shrq   $0x21, %rcx
sampletest[0x10000a9f1] <+3585>: movl   %edx, %edi
sampletest[0x10000a9f3] <+3587>: movw   %si, (%rax,%rdi,2)
sampletest[0x10000a9f7] <+3591>: leal   0x1(%rdx), %esi
sampletest[0x10000a9fa] <+3594>: movw   %cx, (%rax,%rsi,2)
sampletest[0x10000a9fe] <+3598>: addq   $0x2, %rdx
sampletest[0x10000aa02] <+3602>: decl   %ebx
sampletest[0x10000aa04] <+3604>: jne    0x10000a99f

See the code in the original commit for a better understanding of what’s going on. There is a hefty comment explaining the fixed-point dithering process.