After playing around with discrete cosine transforms, I thought I would implement an MDCT and document my understanding of how everything works. I use some similar techniques to those used on the Wikipedia page as they are helpful for understanding but will add some scanned drawings which I think help (I’m not even close to being clever enough to get a computer to draw these for me).
Prerequisites
The only real background knowledge which I think is relevant to understanding the MDCT is the data extensions which the DCT-4 transform assumes.

First DCT-4 Basis Function with Shifted 2N Sample Input
I’ll refer to the above image in the Forward Transform overview, but for the mean time, only pay attention to the solid quarter wave. This is the first basis function (i.e. ) for an
length DCT-4. If the basis is continued past
, it has a repeating symmetrical pattern (the dashed line in the image) which repeats every
. The symmetry is even around
and odd around
and holds for every basis function of the DCT-4. i.e. The DCT-4 assumes that the input data continues on forever, repeating itself in the following manner:
,
,
,
.
Forward Transform
The MDCT takes in real data points and produces
real outputs. These inputs are designed to overlap, so the first half of the input data should be the second half of the input data of the previous call. The definition is:
It should be trivial to see from the above that the MDCT can be computed using a DCT-4 with an extended number of input data points, all of which have been shifted by half a basis. Go back to the crappy drawing and notice the concatenated length sequences
,
,
and
. The total length of this sequence is
and begins at
(or half the length of a basis function). We need to get
,
and
back into the
point region if we want to compute the MDCT using a DCT-4, this can be achieved with the following concatenated sequence (I will subscript these sequences with
to denote a reversal of the sequence):
If we take the DCT-4 of this concatenated sequence, we have found the MDCT of the input sequence.
Inverse Transform
The inverse MDCT or IMDCT takes in real data points and produces
real outputs. In this transform, the outputs should overlap such that the first half of the output should be added to the second half of the output data in the previous call. The definition is:
Because we know how the DCT-4 assumes the input and output data repeats in a symmetric pattern, we can get this data trivially in exactly the same fashion as we did in the forward transform. In the following Illustration, we take the output from the forward transform and extend it along the basis:

Extended Projection of the MDCT Output on the First DCT-4 Basis
In output row zero, we can see how to extend the input sequence to obtain the 2N points required. We then see in rows two and three how summing the overlapping blocks causes the aliased sequences to cancel in subsequent calls to the IMDCT.
World’s Dumbest C MDCT Implementation
I validated all this actually works with a small C program. Follows are the MDCT/IMDCT implementations I came up with… ignore the “twid” input, I cache the modulation factors for the FFT which gets called in the dct4 routine:
/* state should contain double the number of elements as the input buffer (N) * and should have all elements initialized to zero prior to calling. The * output buffer is actually the first N elements of state after calling. */ void mdct(double *state, const double *input, double *twid, unsigned lenbits) { unsigned rl = 1u << lenbits; unsigned i; /* Alias the input data with the previous block. */ for (i = 0; i < rl / 2; i++) { state[i] = - input[rl/2+i] - input[rl/2-i-1]; state[rl/2+i] = state[rl+i] - state[rl+rl-i-1]; } /* Save the input block */ for (i = 0; i < rl; i++) state[rl+i] = input[i]; /* DCT-4 */ dct4(state, lenbits, twid); } /* state should contain double the number of elements as the input buffer (N) * and should have all elements initialized to zero prior to calling. The * output buffer is actually the first N elements of state after calling. */ void imdct(double *state, const double *input, double *twid, unsigned lenbits) { unsigned rl = 1u << lenbits; unsigned i; /* Collect contributions from the previous frame to the output buffer */ for (i = 0; i < rl / 2; i++) { state[i] = - state[rl+rl/2-i-1]; state[rl/2+i] = - state[rl+i]; } /* Load the input and run the DCT-4 */ for (i = 0; i < rl; i++) state[rl+i] = input[i]; dct4(state + rl, lenbits, twid); /* Sum contributions from this frame to the output buffer and perform the * required scaling. */ for (i = 0; i < rl / 2; i++) { state[i] = (state[i] + state[rl+rl/2+i]) / rl; state[rl/2+i] = (state[rl/2+i] - state[rl+rl-i-1]) / rl; } } |
Windowed MDCT Implementation
Typical MDCT implementations will window the input and output data (this can also be thought of as windowing the basis functions – which I think is a more helpful way to understand what is happening). It is really important to note that the window function must be carefully chosen to ensure that the basis functions remain orthogonal! The window makes the basis functions always begin and end near zero. The process has the side effect of de-normalising the basis functions (unless the window is rectangular) and means there will be a window-dependent scaling factor which will need to be applied at the output to achieve perfect reconstruction. The following images show the second basis function of the MDCT both un-windowed and windowed with a half-sine window (given at the end of the post).

Second MDCT Basis Function

Sine Windowed Second MDCT Basis Function
In a lossy codec this windowing process is somewhat necessary because if the start and end points are not close to zero, the output is likely to periodically glitch for even the slightest errors in the reconstructed MDCT data. This glitching will occur at the boundaries of the transform (i.e. every N points).
We can work out the necessary conditions for the window to obtain perfect reconstruction using the previous drawings (I’d steer away from equations for this one – it’s easier to validate the results visually) by applying a window function split into 4 segments to each of the input blocks. I’ll do the generic case for a symmetrical window which is applied to both the input and the output. We split the window (which has a length ) into four segments which will be applied to our original input segments
,
,
and
. Because we are defining this window to be symmetric, we can call the pieces:

Symmetrical Window Impact on MDCT
The above illustration shows how our window segments are applied to the input data and the impact that has on the DCT-4 analysed data blob. Following that is the output segments from two sequential IMDCT calls with the windows applied to the output here as well.
We need to make the overlapping terms equal the required output segment i.e.
It is clear from the above that the necessary condition to achieve reconstruction is for (which implies in this case that
must also be true).
A simple solution to this is:
The output requires a scaling factor of 2 for this window.
Hiii
Can you please explain me how a b c d folded back into N numbers and how the following originates ..
(-cr -d, a-br).
Can you please explain me….
How do you start the ball rolling?
What does the first block look like? Especially the first N/2 data points, if a starts after that.
There is a latency involved in the transform of one processing block. i.e.
state = [0, 0, … 0]
state, A = MDCT(state, a);
state, B = MDCT(state, b);
state, C = MDCT(state, c);
state = [0, 0, … 0]
state, ap = IMDCT(state, A) ; This will equal zero
state, bp = IMDCT(state, B) ; This will equal “a”
state, cp = IMDCT(state, C) ; This will equal “b”
state, dp = IMDCT(state, [0, 0, … 0]) ; This will equal “c”
Each call to MDCT, IMDCT transform takes N “state” points (which happen to be the previous N points) and N new data points to produce N new state points and N output bins. I hope this answers your question – the only initialisation is that the state must be zeroed prior to using the MDCT and the IMDCT.
The code for the inverse doesn’t make sense.
The 1st loop puts:
state[N to 3N/2]_reverse into state[0 to N/2]
state{2N/2 to N] into state[N/2 to N]
state[3N/2 to 2N] is ignored and written over with the new input
ie.
state_a = – rev_state_c
state_b = – state_c
state_d ignored and written over
The 3rd loop puts:
state_a = (state_a + rev-state_d) /N
state_b = (state_b + state_d) /N
state_c is ignored
Am I missing something?
fwiw this is what I did in python
https://gist.github.com/vv111y/ec3ba9a299406a1d84deabd9477459ea
Your description of DCT-IV worked perfectly, which is what I’m calling
I looked at the code and on first glance I also thought it might have been wrong! 2013 seems like a very long time ago now and I’m not particularly happy with this blog and probably would have done it differently now.
I ran the code and it appears to work so I dug into it a bit further. It turns out that in the inverse transform, I am using the last “rl/2” values as temporary storage purely as a convenience for running the DCT4. Only the first rl/2 in the second half of the state are actually “state”. The second half of the buffer was used in the previous call.
The same thing can be done in the forward transform by replacing:
for (i = 0; i < rl / 2; i++) {
state[i] = – input[rl/2+i] – input[rl/2-i-1];
state[rl/2+i] = state[rl+i] – state[rl+rl-i-1];
}
/* Save the input block */
for (i = 0; i < rl; i++)
state[rl+i] = input[i];
With:
for (i = 0; i < rl / 2; i++) {
state[i] = – input[rl/2+i] – input[rl/2-i-1];
state[rl/2+i] = state[rl+i];
state[rl+i] = input[i] – input[rl-i-1];
}
Which shows in this case the size for the state buffer can actually be reduced. I'm sure the same thing could be done in the inverse. When I write code, I tend to go for terseness – and sometimes that can be unhelpful.
Also for your post on DCT4 https://www.appletonaudio.com/blog/2013/derivation-of-fast-dct-4-algorithm-based-on-dft/, and referencing Bosi&Goldberg “Intro to Digital Audio Coding and Standards”
1) I assume any window function is applied at the same time as the pre-twiddle factor (I’m also using ). Also, the window is reapplied at the end of the inverse DCT4
2) You keep the complex portion after forward DCT4, appending to the end. In Bosi, it is discarded and only the reals are returned, which is N/2 elements. You should be right since we need N elements for the overlapping. Confused by their take. I’m still reading them.