Slow CUDA Prime Generator Performance

I am writing my first CUDA program, a prime number generator. It works, but it is only 50% faster than the equivalent single-threaded C++ code. The CPU version uses 100% of one core, while the GPU version uses only 20% of the GPU. The CPU is an i5 (2310) and the GPU is a GF104.

How can I improve the performance of this algorithm?

Here is my complete program.

    #include <iostream>
    #include <vector>

    using namespace std;

    int* d_C;

    // Writes 0 for indices divisible by 2, 3, 5 or 7;
    // otherwise stores the candidate value i + N*multi.
    __global__ void primo(int* C, int N, int multi)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
        {
            if (i % 2 == 0 || i % 3 == 0 || i % 5 == 0 || i % 7 == 0)
            {
                C[i] = 0;
            }
            else
            {
                C[i] = i + N * multi;
            }
        }
    }

    int main()
    {
        cout << "Prime numbers \n";
        int N = 1000;
        int h_C[1000];
        size_t size = N * sizeof(int);
        cudaMalloc((void**)&d_C, size);
        int threadsPerBlock = 1024;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        vector<int> lista(100000000);
        int c_z = 0;  // number of zeroed (rejected) entries seen so far
        for (int i = 0; i < 100000; i++)
        {
            // One kernel launch and one device-to-host copy per chunk of N numbers.
            primo<<<blocksPerGrid, threadsPerBlock>>>(d_C, N, i);
            cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
            for (int c = 0; c < N; c++)
            {
                if (h_C[c] != 0)
                {
                    lista[c + N * i - c_z] = h_C[c];
                }
                else
                {
                    c_z++;
                }
            }
        }
        lista.resize(lista.size() - c_z + 1);
        return (0);
    }

I tried using a 2D array and a for loop in the kernel, but could not get the correct results.

1 answer

Welcome to Stack Overflow.

Here are some potential problems:

  • N = 1000 is too small. Since you have 1024 threadsPerBlock, your kernel launches only one block, which is not enough to keep the GPU busy. Try N = 1,000,000 so that each kernel launch uses nearly 1,000 blocks.

  • You do very little work on the GPU (4 modulo operations for each tested number). Therefore, it is probably faster to perform these operations on the CPU than to copy the results back from the GPU (over the PCIe bus).
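
To make the first point concrete, here is a small sketch that applies the question's own blocksPerGrid formula to both values of N and prints how many multiprocessors the device reports. The helper name blocksFor is made up for this illustration, and device 0 is assumed to be the GPU in question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ceiling division, the same formula the question uses for blocksPerGrid.
// blocksFor is a hypothetical helper name used only in this sketch.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int threadsPerBlock = 1024;

    // Assumption: device 0 is the GPU being used.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)
        printf("multiprocessors on device 0: %d\n", prop.multiProcessorCount);

    printf("N = 1000    -> %d block(s)\n", blocksFor(1000, threadsPerBlock));
    printf("N = 1000000 -> %d block(s)\n", blocksFor(1000000, threadsPerBlock));
    return 0;
}
```

With 1024 threads per block this gives 1 block for N = 1000 and 977 blocks for N = 1,000,000. A single block runs on a single multiprocessor, so the rest of the GPU sits idle in the original configuration.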

To use the GPU to search for primes, I think you need to implement the entire algorithm on the GPU, not just the modulo operations.
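
As a rough sketch of what "the entire algorithm on the GPU" could look like, the example below runs a full trial-division test for every candidate inside the kernel and copies the results back once at the end. This is only one possible approach, not the asker's method: the names isPrime and markPrimes, the limit of 1,000,000 and the 256 threads per block are all assumptions chosen for illustration, and a sieve would likely be faster still.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Trial division up to sqrt(n). Marked __host__ __device__ so the same
// test can also be checked on the CPU. (Hypothetical helper.)
__host__ __device__ bool isPrime(unsigned int n)
{
    if (n < 2) return false;
    if (n < 4) return true;                       // 2 and 3
    if (n % 2 == 0 || n % 3 == 0) return false;
    for (unsigned int d = 5; (unsigned long long)d * d <= n; d += 6)
        if (n % d == 0 || n % (d + 2) == 0) return false;
    return true;
}

// Grid-stride loop: each thread tests its candidates and stores a 0/1 flag.
__global__ void markPrimes(unsigned char* flags, unsigned int limit)
{
    unsigned int stride = gridDim.x * blockDim.x;
    for (unsigned int n = blockIdx.x * blockDim.x + threadIdx.x; n < limit; n += stride)
        flags[n] = isPrime(n) ? 1 : 0;
}

int main()
{
    const unsigned int limit = 1000000;           // assumed range
    const int threadsPerBlock = 256;              // assumed block size
    const int blocksPerGrid = (limit + threadsPerBlock - 1) / threadsPerBlock;

    unsigned char* d_flags = nullptr;
    if (cudaMalloc((void**)&d_flags, limit) != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // One launch covers the whole range instead of 100,000 small launches.
    markPrimes<<<blocksPerGrid, threadsPerBlock>>>(d_flags, limit);

    // One device-to-host copy at the end; cudaMemcpy waits for the kernel.
    std::vector<unsigned char> h_flags(limit);
    cudaError_t err = cudaMemcpy(h_flags.data(), d_flags, limit, cudaMemcpyDeviceToHost);
    cudaFree(d_flags);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Collect the primes on the host.
    std::vector<unsigned int> primes;
    for (unsigned int n = 0; n < limit; n++)
        if (h_flags[n]) primes.push_back(n);

    printf("found %zu primes below %u\n", primes.size(), limit);
    return 0;
}
```

Compared with the original program, each thread now does a meaningful amount of arithmetic per number, and the PCIe transfer happens once rather than once per 1000 numbers.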


Source: https://habr.com/ru/post/1433485/

