Fastest solution, least flexible: use native data types whose wrap-around happens in hardware.
The absolute fastest method for integers is to scale your data to int8 / int16 / int32 or whatever native data type fits. Then, when your data needs to wrap, the native data type does it at the hardware level! Very painless, and an order of magnitude faster than any of the software wrapping implementations shown here.
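To make the hardware wrap concrete, here is a minimal sketch. It assumes the data is an angle scaled so that one full turn is 65536 steps; the names angle16_t and angle_add are made up for illustration. It uses an unsigned type on purpose: in C, unsigned arithmetic is guaranteed to wrap, while signed overflow is undefined behavior.

```c
#include <stdint.h>

/* Assumption: one full turn is split into 65536 steps,
   so a uint16_t holds exactly one turn. */
typedef uint16_t angle16_t;

/* The sum is computed in a wider type, then the conversion back to
   uint16_t reduces it modulo 65536. That conversion is the wrap. */
angle16_t angle_add(angle16_t a, angle16_t b) {
    return (angle16_t)(a + b);
}
```

For example, angle_add(60000, 10000) yields 4464, because 70000 - 65536 = 4464. No branch and no mod is involved.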
As an example use case:
I have found this very useful when I need a fast implementation of sin / cos built on a look-up table. Basically, you scale your data so that INT16_MAX is pi and INT16_MIN is -pi. Then you are good to go.
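A minimal sketch of such a look-up table follows. The table size (16 entries), the names SIN_TABLE, sin_lut and cos_lut, and the choice of an unsigned 16-bit phase are all assumptions made for illustration; a real table would normally be much finer. The phase covers one full turn in 65536 steps, which is the same bit pattern as the signed scaling described above.

```c
#include <stdint.h>

/* Coarse 16-entry table: sin(2*pi*k/16) scaled to the int16_t range. */
static const int16_t SIN_TABLE[16] = {
         0,  12539,  23170,  30273,  32767,  30273,  23170,  12539,
         0, -12539, -23170, -30273, -32767, -30273, -23170, -12539
};

/* phase: 0..65535 covers one full turn.
   The top 4 bits select one of the 16 table entries. */
int16_t sin_lut(uint16_t phase) {
    return SIN_TABLE[phase >> 12];
}

/* cos(x) = sin(x + quarter turn). The conversion to uint16_t wraps
   the phase past a full turn without any explicit range check. */
int16_t cos_lut(uint16_t phase) {
    return sin_lut((uint16_t)(phase + 16384u));
}
```

An int16_t angle can be passed in through a cast to uint16_t: -pi/2 is stored as -16384, which converts to 49152 (three quarters of a turn), and sin_lut returns -32767 for it.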
As a side note, scaling your data adds some up-front computational overhead, which usually looks something like this:
int fixedPoint = (int)( floatingPoint * SCALING_FACTOR + 0.5 );
Feel free to swap int for something else you want, e.g. int8_t / int16_t / int32_t.
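Here is a slightly fuller sketch of that conversion. The scaling factor of 256 (8 fractional bits) and the names to_fixed and to_float are hypothetical. Note that adding 0.5 before truncating only rounds correctly for non-negative inputs, so this version subtracts 0.5 for negative ones.

```c
#include <stdint.h>

/* Hypothetical scale: 8 fractional bits, so 1.0 is stored as 256. */
#define SCALING_FACTOR 256.0f

/* Float to fixed point, rounding to nearest. The cast truncates toward
   zero, so the 0.5 offset must carry the sign of the value. */
int32_t to_fixed(float x) {
    float scaled = x * SCALING_FACTOR;
    return (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Fixed point back to float. */
float to_float(int32_t f) {
    return (float)f / SCALING_FACTOR;
}
```

With this scale, to_fixed(1.5f) gives 384 and to_fixed(-1.5f) gives -384.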
Next-fastest solution, more flexible: the mod operation is slow, so use bit masks instead where possible!
Most of the solutions I have skimmed are functionally correct ... but they depend on the mod operation.
The mod operation is very slow because it essentially performs a hardware division. The layman's explanation of why mod and division are slow is to equate the division operation to some pseudo-code: for(quotient = 0; inputNum >= divisor; inputNum -= divisor) { quotient++; } (definitions of quotient and divisor omitted). As you can see, the hardware division can be fast if the input is a small number relative to the divisor ... but it can also be horribly slow if the input is much larger than the divisor.
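The pseudo-code above can be turned into a runnable sketch like this. The function names slow_divide and slow_mod are made up, and this is only the layman's model of division, not how a real hardware divider is built. It does show that the loop count grows with the ratio of input to divisor.

```c
/* Quotient by repeated subtraction. divisor must be nonzero.
   The loop runs (input_num / divisor) times. */
unsigned slow_divide(unsigned input_num, unsigned divisor) {
    unsigned quotient = 0;
    while (input_num >= divisor) {
        input_num -= divisor;
        quotient++;
    }
    return quotient;
}

/* Remainder by the same loop: whatever is left once the
   divisor no longer fits is the result of the mod. */
unsigned slow_mod(unsigned input_num, unsigned divisor) {
    while (input_num >= divisor) {
        input_num -= divisor;
    }
    return input_num;
}
```

slow_mod(17, 5) takes three iterations and returns 2; slow_mod(1000000, 5) takes two hundred thousand.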
If you can scale your data to a power of two, then you can use a bit mask, which executes in one cycle (on 99% of all platforms), and your speedup will be roughly an order of magnitude (at the very least 2 or 3 times faster).
C code to implement wrapping:
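A minimal sketch of mask-based wrapping, assuming a wrap range of 2^16 values. BIT_MASK, wrapped_add and wrapped_sub are illustrative names, and unsigned types are used so that intermediate overflow is well defined:

```c
#include <stdint.h>

/* Hypothetical wrap range: 2^16 values, so the mask is 2^16 - 1. */
#define BIT_MASK 0xFFFFu

/* Add, then wrap the result into [0, BIT_MASK] with a single AND. */
uint32_t wrapped_add(uint32_t a, uint32_t b) {
    return (a + b) & BIT_MASK;
}

/* Subtract, then wrap. A "negative" result shows up as a large
   unsigned value whose low bits are exactly the wrapped answer. */
uint32_t wrapped_sub(uint32_t a, uint32_t b) {
    return (a - b) & BIT_MASK;
}
```

Here wrapped_add(0xFFFF, 1) gives 0 and wrapped_sub(0, 1) gives 0xFFFF, the same results as a mod by 65536 but without a division.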
Remember that you can turn the #define into a run-time value if you need to. And feel free to tweak the bit mask to match whatever power of two you need: the mask is that power of two minus one, e.g. 0xFFFF for 2^16 or 0xFFFFFFFF for 2^32.
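If the range is only known at run time, something along these lines could work. The helper names is_power_of_two and wrap_pow2 are hypothetical, and the caller is assumed to check the range before relying on the mask:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when n is a nonzero power of two, i.e. exactly one bit is set. */
bool is_power_of_two(uint32_t n) {
    return n != 0 && (n & (n - 1u)) == 0;
}

/* Wrap value into [0, range). Only valid when range is a power of two;
   range - 1 is then a mask of all the bits below that power. */
uint32_t wrap_pow2(uint32_t value, uint32_t range) {
    return value & (range - 1u);
}
```

For a range of 1024, wrap_pow2(1030, 1024) gives 6, matching 1030 % 1024. For a range such as 1000, is_power_of_two returns false and you would have to fall back to the mod.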
P.S. I highly recommend reading up on fixed-point processing when you are messing with wrapping / overflow conditions. I suggest reading:
Fixed-Point Arithmetic: Introduction, by Randy Yates, August 23, 2007