The reason your first test runs so much faster is that the two tests do very different amounts of work. In fact, the ratio is 50x.
The big-O for brute-force square-matrix multiplication is O(n^3). See: Why is the time complexity of square matrix multiplication defined as O(n^3)? As a result, multiplying one 10000x10000 matrix actually takes a million times more work than one 100x100 multiplication ((10000/100)^3 = 10^6). Your 20,000 multiplications of 100x100 matrices come nowhere near the huge amount of work needed to multiply the large matrices once.
Matrix multiplication is just a lot of dot products. Your algorithm simply splits the dot products into groups for manageability and uses no special tricks to reduce the operation counts in my calculations below.
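For reference, here is a minimal sketch (not your actual code) of the brute-force structure assumed in the counts below: each work-item computes one dot product of n multiply-adds, and there are n^2 of them per multiplication.

```c
// Brute-force C = A * B for n x n row-major matrices.
// One work-item per output element: n^2 dot products of n MADs each,
// i.e. n^3 multiply-adds per matrix multiplication.
__kernel void matmul_naive(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           const int n)
{
    int row = get_global_id(1);
    int col = get_global_id(0);

    float acc = 0.0f;
    for (int k = 0; k < n; ++k)      // n multiply-adds per dot product
        acc += A[row * n + k] * B[k * n + col];

    C[row * n + col] = acc;
}
```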
Small-matrix test:
Total dot products: 10^4
MADs per dot product: 10^2
Total matrix-multiply operations: 20,000 = 2 * 10^4
Grand total multiply-adds: 2 * 10^(4+2+4) = 2 * 10^10 = 20,000,000,000
20 billion.
Big matrix test:
Total dot products: 10^8
MADs per dot product: 10^4
Total matrix-multiply operations: 1 = 10^0
Grand total multiply-adds: 10^(8+4+0) = 10^12 = 1,000,000,000,000
1000 billion.
The 10000x10000 test actually ran faster in terms of throughput: it crunched 50 times as many operations in only 40 times the execution time.
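Putting the two totals together with the ~40x runtime difference you observed:

```
work ratio:   10^12 / (2 * 10^10) = 50
time ratio:   ~40
throughput:   50 / 40 = 1.25x higher on the 10000x10000 test
```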
More on the "special tricks" here: http://en.wikipedia.org/wiki/Strassen_algorithm, although that algorithm is not considered practical for GPU computing. Even more sophisticated algorithms exist as well, but the brute-force approach is what you will usually see on graphics hardware.
Why is your kernel slow in the first place? There are many different optimizations you can apply to speed things up. Below are just a few for you to Google and experiment with yourself; you will probably come across others I have not mentioned here.
- Optimize your work-group and block sizes; see OpenCL's CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query (sketch after this list).
- Use the float4 data type. OpenCL includes a built-in dot() function that computes the dot product of floatn vectors (sketch after this list).
- Transpose matrix B before launching the multiply kernel; you can use a separate kernel for the transpose (sketch after this list).
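For the work-group point, a minimal host-side sketch of querying the preferred size for an already-built kernel; the kernel and device handles are assumed to exist and error checking is omitted:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Query work-group size hints for a built kernel on a device.
   Pick your local work size as a multiple of the "preferred" value,
   without exceeding the kernel's maximum work-group size. */
void print_work_group_hints(cl_kernel kernel, cl_device_id device)
{
    size_t preferred = 0, max_wg = 0;

    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);

    printf("preferred multiple: %zu, max work-group size: %zu\n",
           preferred, max_wg);
}
```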
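For the float4 point, a sketch of one output element computed four multiply-adds at a time with the built-in dot(). It assumes n is a multiple of 4 and that Bt is B already transposed, so both operands are read contiguously (names are illustrative, not your actual kernel):

```c
// One work-item computes C[row][col] using float4 loads and dot().
// A and Bt are row-major float buffers viewed as float4.
__kernel void matmul_float4(__global const float4 *A,
                            __global const float4 *Bt,
                            __global float *C,
                            const int n)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    int n4  = n / 4;                 // row length in float4 elements

    float acc = 0.0f;
    for (int k = 0; k < n4; ++k)     // 4 multiply-adds per iteration
        acc += dot(A[row * n4 + k], Bt[col * n4 + k]);

    C[row * n + col] = acc;
}
```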
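And for the transpose point, a naive out-of-place transpose kernel you would run once before the multiply (a tuned version would stage tiles through __local memory, but this shows the idea):

```c
// One work-item copies one element of B into its transposed position.
__kernel void transpose(__global const float *B,
                        __global float *Bt,
                        const int n)
{
    int row = get_global_id(1);
    int col = get_global_id(0);

    Bt[col * n + row] = B[row * n + col];
}
```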