    for (uint j = i * 90000; j < (i + 1) * 90000; j++) {
        c[i] += a[j];
    }
1) You use global memory (c[]) for the running sum, and this is slow. Use a private variable to make it faster, something like this:
"__kernel void " + "sampleKernel(__global uint *a," + " __global uint *c)" + "{" + "__private uint intensity_core=0;" <---this is a private variable of each core + " uint i = get_global_id(0);" + " for(uint j=i*90000; j < (i+1)*90000; j++){ " + " intensity_core += a[j];" <---register is at least 100x faster than global memory
Now you have a c[number_of_images] array holding one intensity sum per image.
Your local work size is 1, so each work-item runs independently; if you have at least 160 images (160 is the number of cores on your GPU), the computation will keep all the cores busy.
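For reference, here is a minimal host-side sketch of the launch and the final reduction. Variable names such as commandQueue, kernel, memC, and numImages are hypothetical placeholders for the corresponding objects in your program (JOCL, i.e. org.jocl.*):

    // Hypothetical names: commandQueue, kernel, memC, numImages.
    // One work-item per image; passing null for the local size lets the driver choose.
    long global_work_size[] = new long[]{ numImages };
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
                           global_work_size, null, 0, null, null);

    // Read back the per-image sums and finish the reduction on the host.
    int c[] = new int[numImages];
    clEnqueueReadBuffer(commandQueue, memC, CL_TRUE, 0,
                        numImages * Sizeof.cl_uint, Pointer.to(c),
                        0, null, null);
    long total = 0;
    for (int k = 0; k < numImages; k++) {
        total += c[k] & 0xFFFFFFFFL; // OpenCL uint -> unsigned value in Java
    }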
Per launch you need 90,000 * num_images global-memory reads, num_images global-memory writes, and 90,000 * num_images register reads/writes. Moving the accumulator into a register should roughly halve the kernel time.
2) You do only one math operation per two memory accesses. You need at least 10 math operations per memory access to get beyond a small fraction of your GPU's peak GFLOPS (250 GFLOPS peak for the 6490M).
Your i7 CPU may reach 100 GFLOPS, but memory will be the bottleneck there too, and it gets even worse when you push the whole data set over PCI Express. (The HD Graphics 3000 is rated at 125 GFLOPS.)
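Purely as an illustration of raising that ratio (a hypothetical kernel, not part of your program): if you ever need more than the plain sum, computing several statistics in the same pass does more math per global-memory fetch, since each pixel is read only once:

    // Hypothetical kernel: one global read feeds several math operations.
    __kernel void statsKernel(__global const uint *a,
                              __global uint  *sum,
                              __global ulong *sumSq,
                              __global uint  *maxVal)
    {
        uint i = get_global_id(0);
        uint  s  = 0;               // all accumulators live in registers
        ulong sq = 0;               // ulong so the sum of squares cannot overflow
        uint  mx = 0;
        for (uint j = i * 90000; j < (i + 1) * 90000; j++) {
            uint v = a[j];          // one global read ...
            s  += v;                // ... three math operations
            sq += (ulong) v * v;
            mx  = max(mx, v);
        }
        sum[i]    = s;
        sumSq[i]  = sq;
        maxVal[i] = mx;
    }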
    // Obtain a device ID
    cl_device_id devices[] = new cl_device_id[numDevices];
    clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
    cl_device_id device = devices[deviceIndex];
    // One of the devices[] elements must be your HD 3000, for example:
    // devices[0] -> discrete GPU, devices[1] -> CPU, devices[2] -> HD 3000
In your program:
    // Obtain the cl_device_id for the first device
    int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
    cl_device_id devices[] = new cl_device_id[numDevices];
    clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
                     Pointer.to(devices), null);
This always takes the first device, which here is the discrete GPU, so the HD 3000 is never selected.
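Since device order is not guaranteed, a more robust option is to pick the HD 3000 by name. A minimal sketch using clGetDeviceInfo (the exact name string reported by the driver is an assumption here, so check what your platform actually returns):

    // Pick the Intel integrated GPU by its reported device name.
    cl_device_id hd3000 = null;
    for (cl_device_id d : devices) {
        long size[] = new long[1];
        clGetDeviceInfo(d, CL_DEVICE_NAME, 0, null, size);
        byte buffer[] = new byte[(int) size[0]];
        clGetDeviceInfo(d, CL_DEVICE_NAME, buffer.length, Pointer.to(buffer), null);
        String name = new String(buffer, 0, buffer.length - 1); // drop trailing '\0'
        if (name.contains("HD Graphics 3000")) { // assumed name string; verify on your system
            hd3000 = d;
        }
    }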