I'm using C# with CUDAfy.NET (yes, this problem would be easier in plain C with pointers, but given the size of the larger system I have reasons for this approach).
I have a video capture card that acquires 1024 x 1024 images at 30 FPS. Every 33.3 ms it fills the next slot in a circular buffer and returns a System.IntPtr that points to that unmanaged 1D byte* buffer; the circular buffer has 15 slots.
On the GPU device (a Tesla K40), I want a global dense 2D array that behaves like a circular queue:
byte[15, 1024*1024] rawdata;
How can I fill in another row every 33.3 ms? Should I use something like:
gpu.CopyToDevice<byte>(inputPtr, 0, rawdata, offset, length)
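In other words, I have something like this per-frame upload in mind (a sketch only, using the same CopyToDevice overload as my code further down; grawdata, FrameSize, and frameCount = 15 are defined elsewhere in my project, and capturePtr stands in for the IntPtr returned by the grabber):

```csharp
// Allocate the dense ring buffer once on the device: 15 rows of one frame each.
byte[,] grawdata = gpu.Allocate<byte>(frameCount, FrameSize);

// Every 33.3 ms, when the capture card reports a filled slot:
byte[] flat = gpu.Cast<byte>(grawdata, frameCount * FrameSize); // 2D -> 1D view
gpu.CopyToDevice<byte>(capturePtr, 0, flat, slot * FrameSize, FrameSize);
```

Here each tick copies exactly one frame (FrameSize bytes) into row `slot` of the flattened buffer; whether that is the right length to pass is part of what I'm unsure about.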
And my kernel header is:
[Cudafy]
public static void filter(GThread thread, byte[,] rawdata, int frameSize, byte[] result)
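Since gpu.Cast flattens the 2D device array (see below), I assume the kernel really has to take a 1D byte[] and compute the row offset itself. Here is a minimal pass-through sketch of what I mean; the extra slot parameter and the copy body are my own assumptions, not working code:

```csharp
[Cudafy]
public static void filter(GThread thread, byte[] rawdata, int frameSize, int slot, byte[] result)
{
    // One thread per pixel; flat index into the [15, 1024*1024] ring buffer.
    int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    if (i < frameSize)
        result[i] = rawdata[slot * frameSize + i]; // placeholder: plain copy, no filtering yet
}
```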
I tried something along these lines, but CUDAfy has no API overload of the form:
GPGPU.CopyToDevice<T>(IntPtr, Int32, T[,], Int32, Int32, Int32)
So I used gpu.Cast to view the 2D device array as a 1D one.
I tried the code below, but I get a CUDA.net exception: ErrorLaunchFailed.
FYI: when I run it under the CUDA emulator instead, it aborts on CopyToDevice, claiming that the data is not host-allocated.
public static byte[] process(System.IntPtr data, int slot)
{
    Stopwatch watch = new Stopwatch();
    watch.Start();

    byte[] output = new byte[FrameSize];
    int offset = slot * FrameSize;

    gpu.Lock();
    // View the 2D device allocation as a flat 1D array so the IntPtr overload can be used.
    byte[] rawdata = gpu.Cast<byte>(grawdata, FrameSize);
    gpu.CopyToDevice<byte>(data, 0, rawdata, offset, FrameSize * frameCount);

    byte[] goutput = gpu.Allocate<byte>(output);
    gpu.Launch(height, width).filter(rawdata, FrameSize, goutput);
    runTime = watch.Elapsed.ToString();

    gpu.CopyFromDevice(goutput, output);
    gpu.Free(goutput);
    gpu.Synchronize();
    gpu.Unlock();

    watch.Stop();
    totalRunTime = watch.Elapsed.ToString();
    return output;
}
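While writing this up I added a host-side sanity check on the sizes used in process (plain arithmetic, no GPU involved); with the lengths exactly as written above it fails for every slot except 0, which may be related to the launch failure:

```csharp
int total = frameCount * FrameSize;     // capacity of the ring buffer in bytes
int offset = slot * FrameSize;          // where this slot's frame starts
int copyLen = FrameSize * frameCount;   // length passed to CopyToDevice above
System.Diagnostics.Debug.Assert(offset + copyLen <= total,
    "CopyToDevice would write past the end of the device allocation");
```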