I have an image-processing routine that I believe could be made very fast. Roughly 2,000 operations need to be performed per pixel, and each pixel's result is independent of the results computed for its neighbors, so dividing the work across processing units is quite simple.
My question is: what is the best way to approach this change so that I get the most speedup for the least effort?
Ideally, the library/approach I'm looking for should meet these criteria:
- It will still exist in 5 years. Something like CUDA or ATI's offering could be replaced by a less hardware-specific solution in the not-too-distant future, so I would like something more future-proof. If my impression of CUDA is wrong, I welcome the correction.
- It performs fast. I've already written this code and it works serially, albeit very slowly. Ideally I would just take my code and recompile it to run in parallel, but I suspect that's a fantasy. If I have to rewrite it under a different paradigm (e.g. as shaders or something else), that would be fine too.
- It doesn't require too much knowledge of the hardware. I would rather not have to specify the number of threads or processing units myself; instead, something should determine that automatically based on the machine it runs on.
- It runs on cheap hardware. That could mean a $150 graphics card or something similar.
- It runs on Windows. Something like GCD might otherwise be the right answer, but the customer base I'm targeting won't switch to Mac or Linux any time soon. Note that this makes the answer a little different from this other question.
What libraries/approaches/languages should I be looking at? I've looked at things like OpenMP, CUDA, GCD, and so on, but I wonder whether there are other options I'm missing.
Right now I'm leaning toward something like shaders and OpenGL 2.0, but that may not be the right call, since I'm not sure how much memory access I can get that way: those 2,000 operations require access to all the neighboring pixels in many ways.