Reducing GPU-to-CPU data transfer in C++ AMP

I ran into the following problem while trying to optimize my application with C++ AMP: data transfer. Copying data from the CPU to the GPU is not a problem for me, because I can do it during the application's initial setup. The trouble is that I need fast access to the results computed by the C++ AMP kernels, so the transfer from the GPU back to the CPU is my bottleneck. I have read that Windows 8.1 brings a performance improvement here, but I am on Windows 7 and do not plan to change that. I have also read about staging arrays, but I do not see how they would help solve my problem. I need to return a single float value to the host, and that seems to be the most time-consuming operation.
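From what I have read, a staging array is an array bound to both the CPU and the GPU accelerator_view at construction time, so the runtime can allocate it for fast transfers between the two. A minimal sketch of how one is created (the names and sizes here are just illustrative):

#include <amp.h>
#include <vector>

void staging_sketch(const std::vector<float>& host_data)
{
    concurrency::accelerator cpu(concurrency::accelerator::cpu_accelerator);
    concurrency::accelerator gpu; // the default accelerator

    // Staging array: associated with both accelerator_views, so the
    // copy from it to a device array should take the fast path.
    concurrency::array<float, 1> staging(
        static_cast<int>(host_data.size()),
        cpu.default_view, gpu.default_view);
    concurrency::copy(host_data.begin(), host_data.end(), staging);

    // Device-only array that the kernels actually use.
    concurrency::array<float, 1> device_data(staging.extent, gpu.default_view);
    concurrency::copy(staging, device_data);
}

That addresses the upload direction, though, and my problem is the download. My reduction, where the copy back shows up, is below: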

// Requires <amp.h>, <cassert>, <numeric>, <vector>.
float Subset::reduction_cascade(unsigned element_count, concurrency::array<float, 1>& a)
{
    static_assert(_tile_count > 0, "Tile count must be positive!");
    //static_assert(IS_POWER_OF_2(_tile_size), "Tile size must be a positive integer power of two!");

    //assert(source.size() <= UINT_MAX); // 'source' no longer exists; the data is passed in as 'a'.
    //unsigned element_count = static_cast<unsigned>(source.size());
    assert(element_count != 0); // Cannot reduce an empty sequence.

    // Each pass of the kernel loop consumes tile_count * tile_size * 2 elements.
    unsigned stride = _tile_size * _tile_count * 2;

    // Tail elements: element_count is assumed to be a multiple of the stride;
    // otherwise the tail would need a separate pass.
    float tail_sum = 0.f;
    unsigned tail_length = element_count % stride;
    assert(tail_length == 0);
    // Using arrays as temporary memory.
    //concurrency::array<float, 1> a(element_count, source.begin());
    concurrency::array<float, 1> a_partial_result(_tile_count);

    concurrency::parallel_for_each(
        concurrency::extent<1>(_tile_count * _tile_size).tile<_tile_size>(),
        [=, &a, &a_partial_result] (concurrency::tiled_index<_tile_size> tidx) restrict(amp)
    {
        // Use tile_static as scratchpad memory.
        tile_static float tile_data[_tile_size];

        unsigned local_idx = tidx.local[0];

        // Reduce data strides of twice the tile size into tile_static memory.
        unsigned input_idx = (tidx.tile[0] * 2 * _tile_size) + local_idx;
        tile_data[local_idx] = 0;
        do
        {
            tile_data[local_idx] += a[input_idx] + a[input_idx + _tile_size];
            input_idx += stride;
        } while (input_idx < element_count);

        tidx.barrier.wait();

        // Reduce to the tile result using multiple threads.
        // ('s' rather than 'stride', to avoid shadowing the outer variable.)
        for (unsigned s = _tile_size / 2; s > 0; s /= 2)
        {
            if (local_idx < s)
            {
                tile_data[local_idx] += tile_data[local_idx + s];
            }

            tidx.barrier.wait();
        }

        // Store the tile result in global memory.
        if (local_idx == 0)
        {
            a_partial_result[tidx.tile[0]] = tile_data[0];
        }
    });

    // Reduce the results from all tiles on the CPU.
    std::vector<float> v_partial_result(_tile_count);
    copy(a_partial_result, v_partial_result.begin());
    return std::accumulate(v_partial_result.begin(), v_partial_result.end(), tail_sum);
}

Specifically, the call copy(a_partial_result, v_partial_result.begin()); appears to be where almost all of the time goes.
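One way to shrink that copy, as a sketch (this is not from the CodePlex sample): fold the per-tile partial sums with a trivial second kernel, so only a single float crosses back to the host. Note that for a copy this small the dominant cost is usually the synchronization with the device, which this does not remove. The end of reduction_cascade would then read:

// Fold the per-tile partial sums on the GPU; _tile_count is small,
// so a single-threaded loop is acceptable here.
concurrency::array<float, 1> a_result(1);
concurrency::parallel_for_each(concurrency::extent<1>(1),
    [&a_partial_result, &a_result](concurrency::index<1>) restrict(amp)
{
    float sum = 0.f;
    for (unsigned i = 0; i < _tile_count; ++i)
        sum += a_partial_result[i];
    a_result[0] = sum;
});

// Copy a single float back instead of _tile_count floats.
std::vector<float> v_result(1);
copy(a_result, v_result.begin());
return v_result[0] + tail_sum;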


First of all, how exactly are you measuring the time, and what numbers do you get? Your code looks very much like the cascading reduction from the C++ AMP sample project on CodePlex.

I built that project's Reduction sample in Release mode and ran it on my machine. Here is what it reports:

Running kernels with 16777216 elements, 65536 KB of data ...
Tile size:     512
Tile count:    128
Using device : NVIDIA GeForce GTX 570

                                                           Total : Calc

SUCCESS: Overhead                                           0.03 : 0.00 (ms)
SUCCESS: CPU sequential                                     9.48 : 9.45 (ms)
SUCCESS: CPU parallel                                       5.92 : 5.89 (ms)
SUCCESS: C++ AMP simple model                              25.34 : 3.19 (ms)
SUCCESS: C++ AMP simple model using array_view             62.09 : 20.61 (ms)
SUCCESS: C++ AMP simple model optimized                    25.24 : 1.81 (ms)
SUCCESS: C++ AMP tiled model                               29.70 : 7.27 (ms)
SUCCESS: C++ AMP tiled model & shared memory               30.40 : 7.56 (ms)
SUCCESS: C++ AMP tiled model & minimized divergence        25.21 : 5.77 (ms)
SUCCESS: C++ AMP tiled model & no bank conflicts           25.52 : 3.92 (ms)
SUCCESS: C++ AMP tiled model & reduced stalled threads     21.25 : 2.03 (ms)
SUCCESS: C++ AMP tiled model & unrolling                   22.94 : 1.55 (ms)
SUCCESS: C++ AMP cascading reduction                       20.17 : 0.92 (ms)
SUCCESS: C++ AMP cascading reduction & unrolling           24.01 : 1.20 (ms)

As you can see, in every C++ AMP variant the calculation itself takes only a few milliseconds; the bulk of the total time goes to allocating the arrays and moving the data.
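For reference, a Total : Calc split like the one above can be measured roughly as follows (a simplified sketch, not the sample's actual harness; the helper names are mine). parallel_for_each returns before the kernel completes, so the kernel is bracketed with accelerator_view::wait():

#include <amp.h>
#include <chrono>
#include <iostream>
#include <vector>

void measure(const std::vector<float>& source)
{
    using clk = std::chrono::high_resolution_clock;
    concurrency::accelerator_view av = concurrency::accelerator().default_view;

    auto total_start = clk::now();
    concurrency::array<float, 1> a(static_cast<int>(source.size()),
                                   source.begin(), av); // upload

    av.wait();
    auto calc_start = clk::now();
    concurrency::parallel_for_each(a.extent,
        [&a](concurrency::index<1> idx) restrict(amp) { a[idx] += 1.0f; });
    av.wait(); // make sure the kernel has actually finished
    auto calc_end = clk::now();

    std::vector<float> result(source.size());
    concurrency::copy(a, result.begin()); // download
    auto total_end = clk::now();

    std::chrono::duration<double, std::milli> total = total_end - total_start;
    std::chrono::duration<double, std::milli> calc = calc_end - calc_start;
    std::wcout << L"Total: " << total.count()
               << L" ms, Calc: " << calc.count() << L" ms" << std::endl;
}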

That is in the nature of a pure reduction: there is too little arithmetic per element to amortize moving the data to and from the GPU. The transfers only pay off if the data stays resident on the GPU and enough work runs on it between copies.
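A minimal sketch of that pattern (the function name, step count, and kernel body are placeholders of mine): one upload feeds many kernels, and only the final result comes back.

#include <amp.h>
#include <vector>

float resident_pipeline(const std::vector<float>& source, int num_steps)
{
    // Single upload; the array then stays on the accelerator.
    concurrency::array<float, 1> data(
        static_cast<int>(source.size()), source.begin());

    for (int step = 0; step < num_steps; ++step)
    {
        concurrency::parallel_for_each(data.extent,
            [&data](concurrency::index<1> idx) restrict(amp)
        {
            data[idx] *= 2.0f; // placeholder for the real per-step work
        });
        // No host copies inside the loop.
    }

    // Single download at the very end.
    std::vector<float> result(source.size());
    concurrency::copy(data, result.begin());
    return result[0];
}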

With that in mind, a few questions:

  • Have you tried running the reduction samples from CodePlex on your machine?
  • What timings do they report compared to your own code?
  • Is your code actually running on the GPU, or is it silently falling back to WARP (the software rasterizer)? A quick check is sketched below.
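To check the last point, enumerate the accelerators and inspect the default one (WARP and the REF emulator have well-known device paths):

#include <amp.h>
#include <iostream>

void print_accelerators()
{
    // List every accelerator C++ AMP can see, flagging emulated ones.
    for (const concurrency::accelerator& acc :
         concurrency::accelerator::get_all())
    {
        std::wcout << acc.description
                   << (acc.is_emulated ? L" (emulated)" : L"")
                   << std::endl;
    }

    concurrency::accelerator def; // the default accelerator
    if (def.device_path == concurrency::accelerator::direct3d_warp)
        std::wcout << L"Warning: the default accelerator is WARP." << std::endl;
}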

And, stepping back:

  • How large are your real data sets?
  • How much computation do you run per transfer, and is it enough to make the GPU worthwhile for this workload at all?
