Is computing the integral image on the GPU really faster than on the CPU?

I'm new to GPU computing, so this may be a really naive question.
I did a few searches, and it seems that computing the integral image on the GPU is a good idea.
However, when I actually dig into it, I wonder whether it really is faster than the CPU, especially for large images. So I'd like to hear your thoughts on this, and an explanation if the GPU really is faster.

So, assuming we have an MxN image, computing the integral image on the CPU requires roughly 3xMxN additions, which is O(MxN).
On the GPU, following the code provided in the 6th edition of the "OpenGL SuperBible", it requires something like KxMxNxlog2(N) + KxMxNxlog2(M) operations, where K is the number of operations per step (bit shifts, multiplications, additions, ...).
The GPU can process, say, 32 pixels in parallel (device dependent), but it is still O(MxNxlog2(M)).
I think that even at a resolution of 640x480 the CPU would still be faster.

Am I wrong here?
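For reference, the 3xMxN CPU estimate corresponds to the standard summed-area-table recurrence (three additions per pixel). A minimal Python sketch, purely for illustration (not code from the book):

```python
def integral_image(img):
    """Summed-area table: I[y][x] = sum of img[0..y][0..x].
    Recurrence: I[y][x] = img[y][x] + I[y-1][x] + I[y][x-1] - I[y-1][x-1],
    i.e. ~3 additions per pixel, O(M*N) total."""
    h, w = len(img), len(img[0])
    I = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            I[y][x] = (img[y][x]
                       + (I[y - 1][x] if y else 0)
                       + (I[y][x - 1] if x else 0)
                       - (I[y - 1][x - 1] if y and x else 0))
    return I

print(integral_image([[1, 2], [3, 4]]))  # -> [[1, 3], [4, 10]]
```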
[Edit] Here is the shader code, taken directly from the book. The idea is to use 2 passes: first compute the integral of each row, then compute the column integral of the result of pass 1. The shader code below implements one pass.

#version 430 core
layout (local_size_x = 1024) in;
shared float shared_data[gl_WorkGroupSize.x * 2];
layout (binding = 0, r32f) readonly uniform image2D input_image;
layout (binding = 1, r32f) writeonly uniform image2D output_image;
void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;
    // each invocation handles two horizontally adjacent pixels of one row
    ivec2 P = ivec2(id * 2, gl_WorkGroupID.x);
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;
    // load two pixels per invocation into shared memory
    shared_data[id * 2] = imageLoad(input_image, P).r;
    shared_data[id * 2 + 1] = imageLoad(input_image, P + ivec2(1, 0)).r;
    barrier();
    memoryBarrierShared();
    // parallel inclusive prefix sum over the 2 * gl_WorkGroupSize.x values
    for (step = 0; step < steps; step++)
    {
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);
        shared_data[wr_id] += shared_data[rd_id];
        barrier();
        memoryBarrierShared();
    }
    // store transposed (P.yx) so the second pass can reuse this same shader
    imageStore(output_image, P.yx, vec4(shared_data[id * 2]));
    imageStore(output_image, P.yx + ivec2(0, 1),
               vec4(shared_data[id * 2 + 1]));
}
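To see what the shader's loop actually computes, here is a sequential Python emulation of the same indexing scheme (my own sketch, not from the book). The work-group size n is a parameter here instead of the shader's fixed 1024, and it must be a power of two; the result is the inclusive prefix sum of the 2*n values:

```python
import math

def workgroup_prefix_sum(data):
    """Emulate the shader's shared-memory loop on a list of 2*n values,
    where n (the work-group size) must be a power of two.
    Each 'step' mirrors one barrier-separated iteration of the GLSL loop."""
    n = len(data) // 2
    out = list(data)
    steps = int(math.log2(n)) + 1
    for step in range(steps):
        mask = (1 << step) - 1
        # within one step no invocation writes an index another one reads,
        # so a sequential sweep matches the parallel result
        for inv in range(n):
            rd = ((inv >> step) << (step + 1)) + mask
            wr = rd + 1 + (inv & mask)
            out[wr] += out[rd]
    return out

print(workgroup_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
# -> [1, 3, 6, 10, 15, 21, 28, 36]
```

Each of the log2(n)+1 steps does n additions, which is where the MxNxlog2(M)-style operation count in the question comes from.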
Integral image — you mean a summed-area table?

Summing needs K operations per pixel of an MxN image, so the total work is O(K.M.N) on both the CPU and the GPU. The difference is that gfx hardware usually executes those operations faster than the CPU and runs many of them in parallel, so if the algorithm maps well onto the GPU architecture, the GPU wins.

The constant K matters too: if the work does not fit into a single pass, the GPU needs extra passes and the complexity can grow to something like O(K.M.N.log(K)/log(U)) for K>U, where U is the number of items processed in parallel. On top of that, moving the image between CPU and GPU memory takes time, so for small images the transfer overhead can dominate and the CPU can come out faster (especially when the data already lives on the CPU side).

[Edit1] example of an approach

Let's take an NxN image. The computation can be split into H-line and V-line passes (similar to a 2D FFT), which simplifies the process. On top of that, each line can be partitioned into groups of M pixels. So:

N = M.K

where N is the line resolution in pixels, M is the group size in pixels, and K is the number of groups per line.

  • 1st pass: compute the partial sums inside each group.

    Each group needs only its own M pixels, so a naive in-group prefix sum costs about 0.5*M^2 additions per group, with K groups per line and N lines. This takes T(0.5*K*M^2*N) if each line is done in a single QUAD pass, ...

  • 2nd pass: combine the group partial sums.

    Each group then needs the totals of the groups preceding it. Done naively over the K group totals, this takes T(0.5*K^3*N) if each line is done in a single QUAD pass, ...

  • doing both passes #1 and #2 for the horizontal and then the vertical direction gives:

T(2*N*(0.5*K*M^2+0.5*K^3))
T(N*(K*M^2+K^3))
O(N*(K*M^2+K^3))
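To see why the two sub-passes compose correctly, here is a small Python sketch of the grouped scheme for a single line (my own illustration, not code from the answer): pass 1 does the in-group prefix sums, pass 2 offsets each group by the running total of the groups before it.

```python
from itertools import accumulate

def grouped_prefix_sum(line, M):
    """Prefix-sum one line of N = M*K pixels in two sub-passes:
    pass 1: prefix sum inside each group of M pixels
            (accumulate here; the naive per-pixel version costs ~0.5*M^2 adds),
    pass 2: add the total of all preceding groups to each group."""
    assert len(line) % M == 0
    K = len(line) // M
    out = []
    # pass 1: independent in-group prefix sums
    for g in range(K):
        out.extend(accumulate(line[g * M:(g + 1) * M]))
    # pass 2: out[g*M - 1] already holds the full running total once the
    # groups before g are fixed, so a left-to-right sweep propagates offsets
    for g in range(1, K):
        offset = out[g * M - 1]
        for i in range(M):
            out[g * M + i] += offset
    return out

print(grouped_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8], 2))
# -> [1, 3, 6, 10, 15, 21, 28, 36]
```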

Now M is a free parameter we can tune. Rewriting the cost in terms of M and N with K = N/M:

T(N*((N/M)*M^2+(N/M)^3))
T(N*(N*M+(N/M)^3))

The total is minimal when the two terms are balanced, so:

N*M = (N/M)^3
N*M = N^3/M^3
M^4 = N^2
M^2 = N
M = sqrt(N) = N^0.5

Plugging the optimal M back into the cost:

T(N*(N*M+(N/M)^3))
T(N*(N*N^0.5+(N/N^0.5)^3))
T(N^2.5+N^1.5)
O(N^2.5)

So this lands between the naive O(N^4) direct summation and the O(N^2) serial CPU integral image, while mapping onto massively parallel hardware. PS: these are raw operation counts, not measured times, so real-world performance can differ. Note also that the H and V passes can reuse the same code (only the coordinates are swapped), and that the grouping brings the cost down from a naive O(N^3) to O(N^2.5) using just 2 sub-passes per direction.



Source: https://habr.com/ru/post/1686741/

