I'm not looking for a portable SIMD implementation.
All I need is a bit-accurate implementation. Performance does not matter much, as long as it is not painfully slow.
I want to use it for early-stage development and testing, so that I can compile and run on the host computer for the first 10+ iterations, and then cross-compile and fine-tune performance on the ARM target.
I am quite used to this development cycle from working with TI DSPs, such as described here. I want to keep working this way when I move to ARM NEON.
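To make concrete what I mean by bit-accurate, here is a minimal sketch of the kind of host-side fallback I have in mind. The intrinsic names (`vld1q_s16`, `vst1q_s16`, `vqaddq_s16`) are the real NEON ones; the scalar struct behind `int16x8_t` and the `add_sat_8` kernel are just my own illustration and assumptions, not taken from any existing library.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>   /* real intrinsics when building for the ARM target */
#else
/* Host fallback: hypothetical scalar stand-ins that reuse the intrinsic names. */
typedef struct { int16_t lane[8]; } int16x8_t;

static int16x8_t vld1q_s16(const int16_t *p) {
    int16x8_t v;
    for (int i = 0; i < 8; ++i) v.lane[i] = p[i];
    return v;
}

static void vst1q_s16(int16_t *p, int16x8_t v) {
    for (int i = 0; i < 8; ++i) p[i] = v.lane[i];
}

/* Saturating add: widen to 32 bits, then clamp each lane to the int16 range. */
static int16x8_t vqaddq_s16(int16x8_t a, int16x8_t b) {
    int16x8_t r;
    for (int i = 0; i < 8; ++i) {
        int32_t s = (int32_t)a.lane[i] + (int32_t)b.lane[i];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        r.lane[i] = (int16_t)s;
    }
    return r;
}
#endif

/* Example kernel (hypothetical): the same source builds on host and target. */
static void add_sat_8(const int16_t *a, const int16_t *b, int16_t *out) {
    vst1q_s16(out, vqaddq_s16(vld1q_s16(a), vld1q_s16(b)));
}
```

The idea is that `add_sat_8` compiles unchanged on both sides, so results from the host build can be compared against the target build. Doing this by hand for every intrinsic is exactly the work I would like to avoid.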
Has this already been done, or do I need to reinvent the wheel?