I have done a lot of OpenGL and shader work before, and I have now decided to try OpenCL. I watched some online tutorials and started reading books on the subject. Because I believe the best way to learn is to try something reasonable and learn from the problems that come up, I decided to start by implementing a kernel for a fully connected perceptron.
For those who do not know what that is, here is the basic idea. It is a neural network in which each neuron of a layer is connected to every neuron of the next layer. Each neuron has only one operation to perform: computing the sum of the values of all neurons in the previous layer, each multiplied by its own weight.
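To make the weighted sum concrete, here is a minimal CPU sketch of what a single neuron computes. The function name `neuron_output` and the idea of passing one weight per incoming connection as a separate vector are assumptions made for illustration only; they are not part of the kernel in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Output of one neuron: the weighted sum of all previous-layer values.
// weights[i] is the weight of the connection coming from previous-layer
// neuron i (hypothetical layout, chosen only for this sketch).
float neuron_output(const std::vector<float>& prev_values,
                    const std::vector<float>& weights) {
    assert(prev_values.size() == weights.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < prev_values.size(); ++i) {
        sum += weights[i] * prev_values[i];
    }
    return sum;
}
```

For example, calling `neuron_output({1.0f, 2.0f, 3.0f}, {0.5f, -1.0f, 2.0f})` adds up 0.5, -2 and 6.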
It seemed simple enough to implement, and after reading the article “Parallel training of a neural network using OpenCL”, I implemented it as follows.
Each layer depends on the previous one, so the host launches them sequentially.
To compute a layer, I launch my kernel with a global work size equal to the number of neurons in that layer (which can be quite large, for example tens of thousands). This makes all neurons compute their sums independently of each other.
Each neuron (identified by its global ID) computes a weighted sum over all neurons of the previous layer.
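The three points above can be emulated on the CPU, which is handy as a reference when checking the GPU output. This is only a sketch: the `Layer` struct and the `work_item` and `forward` functions are hypothetical names, and the inner loop over `id` merely stands in for one kernel launch with a global work size of `out_size`. The weight indexing follows the kernel in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One fully connected layer. weights[i * out_size + j] connects input i
// to output j, matching the indexing used by the kernel in this post.
struct Layer {
    std::size_t in_size;
    std::size_t out_size;
    std::vector<float> weights;
};

// What a single work-item does: the weighted sum for neuron `global_id`.
float work_item(const Layer& layer, const std::vector<float>& in_values,
                std::size_t global_id) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < layer.in_size; ++i) {
        sum += layer.weights[i * layer.out_size + global_id] * in_values[i];
    }
    return sum;
}

// Host side: layers run one after another because each one needs the
// previous layer's outputs. Within a layer the neurons are independent,
// so the inner loop could run in any order (or in parallel).
std::vector<float> forward(const std::vector<Layer>& layers,
                           std::vector<float> values) {
    for (const Layer& layer : layers) {
        assert(values.size() == layer.in_size);
        assert(layer.weights.size() == layer.in_size * layer.out_size);
        std::vector<float> out(layer.out_size);
        for (std::size_t id = 0; id < layer.out_size; ++id) {
            out[id] = work_item(layer, values, id);
        }
        values = std::move(out);  // this layer's output feeds the next layer
    }
    return values;
}
```

Comparing the result of `forward` with the buffer read back from the device (within a small floating-point tolerance) is one way to check that the kernel and the host loop agree.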
Here is my fully functional OpenCL kernel:
// Computes one fully connected layer; one work-item per output neuron.
// in_weights[i * out_size + j] is the weight from input i to output j.
kernel void perceptron(global const int* in_layer_size,
                       global const int* out_layer_size,
                       global const float* in_value,
                       global const float* in_weights,
                       global float* out_values)
{
    const int global_id = get_global_id(0);  // index of this output neuron
    const int out_layer_s = *out_layer_size;
    const int in_layer_s = *in_layer_size;

    float sum = 0.0f;  // 0.0f keeps the accumulator in single precision
    for (int i = 0; i < in_layer_s; i++) {
        sum += in_weights[i * out_layer_s + global_id] * in_value[i];
    }
    out_values[global_id] = sum;
}
And this is how I call it:
// num_neurons: number of neurons in the layer being computed
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(num_neurons), cl::NullRange);
Results on an Nvidia GTX 660M:
2500, 10 000, 2500: 0.018 s (~60 FPS), a factor of 4-5 relative to the CPU (Intel Core i7, 2.40 GHz)
100 000, 100 000, 500: 140
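To put the two configurations above side by side, the sketch below counts the multiply-adds and the weight storage of one forward pass. It assumes that each triple of numbers lists the sizes of three consecutive layers and that weights are stored as 4-byte `float` values; `Cost` and `forward_cost` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cost {
    std::uint64_t multiply_adds;  // one per connection
    std::uint64_t weight_bytes;   // storage for all weights as float
};

// Assumption: layer_sizes holds the neuron counts of consecutive layers,
// and every pair of adjacent layers is fully connected.
Cost forward_cost(const std::vector<std::uint64_t>& layer_sizes) {
    assert(layer_sizes.size() >= 2);
    Cost cost{0, 0};
    for (std::size_t l = 1; l < layer_sizes.size(); ++l) {
        const std::uint64_t connections = layer_sizes[l - 1] * layer_sizes[l];
        cost.multiply_adds += connections;
        cost.weight_bytes += connections * sizeof(float);
    }
    return cost;
}
```

Under those assumptions, `forward_cost({2500, 10000, 2500})` gives 50 000 000 multiply-adds and 200 000 000 bytes of weights, while `forward_cost({100000, 100000, 500})` gives 10 050 000 000 multiply-adds and 40 200 000 000 bytes, roughly 200 times more work and about 40 GB of weights.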