A while ago I was doing some profiling to find out ways to improve the load performance of my Open Diapason project. The load process is roughly as follows:
- Load wave file from disk into memory.
- Apply a linear phase filter to all loaded samples (this is to compensate for the response of the resampler which occurs during playback).
- Normalise and quantise the filtered audio using TPDF dither. The normalisation ensures that we use the full 16-bits of precision for each sample. Obviously, the scaling coefficient is stored along with the sample to ensure we play the content back at the intended level.
All of these steps are done across multiple threads to maximise performance (there is one thread that is reading audio files from disk ready to spawn threads which apply the filter and requantise the data)… this is one of those times where the speedup was roughly linear with the first couple of threads which I added.
Anyway, back to the profiling: I was expecting that the vast majority of the cycles would be spent in the application of the filter, but this actually turned out not to be the case. The dithering and re-quantisation of the audio – which I assumed would be cheap – turned out to be chewing cycles on the order of 30% of the load time (all of my profiling was done on a mid 2014 MacBook Pro with an i7 processor).
There were multiple things which I did to speed it up, but I think what I pushed in this change list was the most interesting one – so what follows in this blog is a bit of a write up.
Uniform random numbers are useful and often we need several of them. In my use-case (performing TPDF dither on a sample pair) I needed four random variables. Generating them needs to be fast but the “quality” of the randomness is not that important. The obvious way to do this is to use a Linear Congruential Generator which ends up forming a code pattern like this:
r0 = (state * A + C) % M; r1 = (r0 * A + C) % M; r2 = (r1 * A + C) % M; r3 = (r2 * A + C) % M; state = r3;
This forms a dependency chain where each random number requires knowledge of the previous one and could end up leading to pipeline issues (this is really dependent on how the compiler re-orders the code and the architecture of the platform). In the code generated on my machine, this turned out to be a bottleneck. We could get around the pipeline issues by creating four different LCGs:
r0 = (state0 * A0 + C0) % M; r1 = (state1 * A1 + C1) % M; r2 = (state2 * A2 + C2) % M; r3 = (state3 * A3 + C3) % M; state0 = r0; state1 = r1; state2 = r2; state3 = r3;
But this will require more registers to store the different LCG states and possibly more registers or code memory to hold all the new coefficients. It would be nice to have something a bit simpler…
LCGs are really cool toys. We can choose our and values in such a way that the pseudo-random sequence has period . Given the algorithm generates the next value from the previous, this must mean that the random sequence contains every integer between 0 and . The first part of the idea, is to modify the code as:
r0 = (state0 * A0) % M; r1 = (state1 * A1) % M; r2 = (state2 * A2) % M; r3 = (state3 * A3) % M; state0 = (r0 + C0 % M) % M; state1 = (r1 + C1 % M) % M; state2 = (r2 + C2 % M) % M; state3 = (r3 + C3 % M) % M;
This produces a different sequence in to because the accumulation now happens at the state assignment, but still has a period . Now for the dodgy part, we change the algorithm to this:
r0 = (state0 * A0) % M; r1 = (state0 * A1) % M; r2 = (state0 * A2) % M; r3 = (state0 * A3) % M; state0 = (r0 + C0 % M) % M;
Where , and to are all different but satisfy the required conditions for an LCG to work. It should not be difficult to infer that to still all have period – but produce different output sequences. I wrote a test program to look at the correlation of these sequences over their period which can be found on GitHub here – they are mostly independent over their period. I doubt they make great random numbers, but for performing dither, they are completely fine. The generated code seems pretty good also:
sampletest[0x10000a99f] <+3503>: imull $0xd688014d, %r8d, %edi ; imm = 0xD688014D sampletest[0x10000a9a6] <+3510>: imull $0xdb71f7bd, %r8d, %r15d ; imm = 0xDB71F7BD sampletest[0x10000a9ad] <+3517>: imull $0xe05f354d, %r8d, %esi ; imm = 0xE05F354D sampletest[0x10000a9b4] <+3524>: imull $0xe54c7f35, %r8d, %ecx ; imm = 0xE54C7F35 sampletest[0x10000a9bb] <+3531>: movss (%r11,%rdx,2), %xmm2 ; xmm2 = mem,zero,zero,zero sampletest[0x10000a9c1] <+3537>: mulss %xmm1, %xmm2 sampletest[0x10000a9c5] <+3541>: movss (%r10,%rdx,2), %xmm3 ; xmm3 = mem,zero,zero,zero sampletest[0x10000a9cb] <+3547>: mulss %xmm1, %xmm3 sampletest[0x10000a9cf] <+3551>: leal 0x1(%rdi), %r8d sampletest[0x10000a9d3] <+3555>: cvttss2si %xmm2, %r12 sampletest[0x10000a9d8] <+3560>: cvttss2si %xmm3, %r13 sampletest[0x10000a9dd] <+3565>: addq %rdi, %rsi sampletest[0x10000a9e0] <+3568>: addq %r12, %rsi sampletest[0x10000a9e3] <+3571>: addq %r15, %rcx sampletest[0x10000a9e6] <+3574>: addq %r13, %rcx sampletest[0x10000a9e9] <+3577>: shrq $0x21, %rsi sampletest[0x10000a9ed] <+3581>: shrq $0x21, %rcx sampletest[0x10000a9f1] <+3585>: movl %edx, %edi sampletest[0x10000a9f3] <+3587>: movw %si, (%rax,%rdi,2) sampletest[0x10000a9f7] <+3591>: leal 0x1(%rdx), %esi sampletest[0x10000a9fa] <+3594>: movw %cx, (%rax,%rsi,2) sampletest[0x10000a9fe] <+3598>: addq $0x2, %rdx sampletest[0x10000aa02] <+3602>: decl %ebx sampletest[0x10000aa04] <+3604>: jne 0x10000a99f
See the code in the original commit for a better understanding of what’s going on. There is a hefty comment explaining the fixed-point dithering process.