I'm not an expert on the DCT, but I have written a few FFT implementations in my time, so I'll have a go at answering this. Please take the following with a pinch of salt.
void njRowIDCT(int* blk)
You are correct that the algorithm appears to be an 8-point radix-2 DCT that uses fixed-point arithmetic with 24:8 precision. I am guessing at the precision because the final stage shifts right by 8 to get the desired value (that is the telltale sign ;)
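To make the 24:8 idea concrete, here is a minimal sketch of fixed-point scaling with 8 fractional bits. The names (`FRAC_BITS`, `fix_from_double`, `fix_scale`, `fix_to_int`) are my own and hypothetical, and the 8-bit fraction is only the guess made above, not something confirmed against the function.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8                 /* assumption: 8 fractional bits (24:8) */
#define FIX_ONE   (1 << FRAC_BITS)  /* 1.0 in fixed point is 256 */

/* Convert a small real constant to fixed point, rounding to nearest. */
static int32_t fix_from_double(double v)
{
    assert(v > -8388608.0 && v < 8388608.0);  /* must fit in 24 integer bits */
    return (int32_t)(v * FIX_ONE + (v >= 0 ? 0.5 : -0.5));
}

/* Multiply an integer sample by a fixed-point coefficient.
   The product carries FRAC_BITS extra fractional bits. */
static int32_t fix_scale(int32_t sample, int32_t coeff)
{
    return sample * coeff;
}

/* Final scaling step: add half an LSB, then drop the fractional bits.
   Right-shifting a negative value is implementation-defined in C;
   most compilers do an arithmetic shift. */
static int32_t fix_to_int(int32_t v)
{
    return (v + (FIX_ONE >> 1)) >> FRAC_BITS;
}
```

For example, `fix_to_int(fix_scale(100, fix_from_double(0.70710678)))` scales 100 by roughly 1/sqrt(2) using integer arithmetic only.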
Since the length is 8, which is a power of two (2^3 = 8), there are 3 stages in the DCT. So far this is all very similar to an FFT. The "fourth stage" appears to be just scaling to restore the original precision after the fixed-point arithmetic.
As far as I can see, the input stage is just a bit reversal from the input blk array into the local variables x0-x7. x8 seems to be a temporary variable. Sorry, I can't be more descriptive than that.
Bit reversal stage
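As a rough illustration of what a bit-reversal stage does for a length of 8, here is a plain 3-bit reversal. The helper names `bit_reverse` and `bit_reverse_copy8` are hypothetical, and I have not checked that this matches the exact order in which the function loads x0-x7.

```c
#include <assert.h>

/* Reverse the lowest `bits` bits of i (bits = 3 for a length of 8). */
static unsigned bit_reverse(unsigned i, unsigned bits)
{
    assert(bits > 0 && bits <= 16);
    unsigned r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);  /* move the lowest bit of i onto r */
        i >>= 1;
    }
    return r;
}

/* Copy in[0..7] to out[0..7] in bit-reversed index order. */
static void bit_reverse_copy8(const int *in, int *out)
{
    for (unsigned i = 0; i < 8; ++i)
        out[bit_reverse(i, 3)] = in[i];
}
```

With this reversal, input indices 0..7 end up in the order 0, 4, 2, 6, 1, 5, 3, 7.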

Update
Take a look at DSP for Scientists and Engineers. It gives a clear and precise explanation of signal processing topics. This chapter covers the DCT (please skip to p. 497).
The Wn (twiddle factors) correspond to the basis functions in that chapter, although note that it describes the 8x8 (2D) DCT.
Regarding the 3 stages I mentioned, compare with this description of an 8-point FFT:

The FFT performs butterflies (which are essentially complex multiply-adds) on a bit-reversed input array, multiplying one path by a factor Wn, the twiddle factor, along the way. The FFT is performed in stages. I have yet to decipher what your DCT code is doing, but it may help to decompose it into a diagram like this.
That, or someone who knows what they're talking about will pipe up ;-)
I will revisit this answer and edit it as I decipher the code.
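For reference, a single radix-2 FFT butterfly and the stage count can be sketched as below. This is the generic FFT building block, not a decoding of the DCT code in question; the `cplx` type and the function names are hypothetical.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;  /* hypothetical minimal complex type */

/* Complex multiplication: (a.re + i a.im) * (b.re + i b.im). */
static cplx cmul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im,
               a.re * b.im + a.im * b.re };
    return r;
}

/* One radix-2 butterfly: the lower path is multiplied by the twiddle w,
   then the pair is replaced by its sum and its difference. */
static void butterfly(cplx *top, cplx *bot, cplx w)
{
    cplx t = cmul(*bot, w);
    cplx a = *top;
    top->re = a.re + t.re;  top->im = a.im + t.im;
    bot->re = a.re - t.re;  bot->im = a.im - t.im;
}

/* Number of butterfly stages for a power-of-two length n, i.e. log2(n). */
static unsigned stage_count(unsigned n)
{
    assert(n != 0 && (n & (n - 1)) == 0);  /* n must be a power of two */
    unsigned s = 0;
    while (n > 1) { n >>= 1; ++s; }
    return s;
}
```

An 8-point transform has `stage_count(8)` stages, each applying 4 butterflies with different twiddle factors.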