CUDA __umul24 function, useful or not?

Question


Is it worth replacing all multiplications with the __umul24 function in a CUDA kernel? I have read differing and opposing opinions, and I still can't put together a benchmark to figure this out.

+4

cuda multiplication

Marco A. Apr 4 '11 at 21:05


2 answers

Just wanted to chime in with a slightly different opinion than Ashwin / fabrizioM ...

If you are just trying to learn CUDA, their answer is probably more or less acceptable. But if you are actually trying to deploy a production-grade application for commercial or research use, that attitude is generally not acceptable unless you are absolutely sure that your end users (or you, if you are the end user) have Fermi or later hardware.

Most likely there are many users running CUDA on legacy machines who would benefit from code tailored to their compute capability. And it's not as complicated as Ashwin / fabrizioM make it out to be.

E.g., in the code I'm working on, I use:

//For prior to Fermi use umul, for Fermi on, use
//native mult.
//Each helper returns its result (a void function would discard the product).
__device__ inline unsigned int MultiplyFermi(unsigned int a, unsigned int b)
{ return a*b; }

__device__ inline unsigned int MultiplyAddFermi(unsigned int a, unsigned int b, unsigned int c)
{ return a*b+c; }

__device__ inline unsigned int MultiplyOld(unsigned int a, unsigned int b)
{ return __umul24(a,b); }

__device__ inline unsigned int MultiplyAddOld(unsigned int a, unsigned int b, unsigned int c)
{ return __umul24(a,b)+c; }

//Maximum Occupancy =
//16384
//ComputeCapabilityLimits_t is defined elsewhere in this code base.
//It is taken by reference so the caller sees the value written here.
void GetComputeCharacteristics(ComputeCapabilityLimits_t &MyCapability)
{
    cudaDeviceProp DeviceProperties;
    cudaGetDeviceProperties(&DeviceProperties, 0);
    MyCapability.ComputeCapability = double(DeviceProperties.major) + double(DeviceProperties.minor)*0.1;
}

Now there is a downside. What is it?

For any kernel in which you use multiplication, you must maintain two different versions of the kernel.
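As a rough sketch of one way to limit that duplication (not necessarily what the code above does), the choice between the MultiplyOld and MultiplyFermi styles can be folded into a single wrapper that branches on the `__CUDA_ARCH__` macro, which nvcc defines during device compilation. The names `MUL_HD`, `Multiply`, `MultiplyAdd`, and `ScaleKernel` are hypothetical, and the sketch assumes every operand fits in 24 bits and that the binary is built for both pre-Fermi and Fermi targets.

```cpp
// Sketch only: one wrapper, with the per-architecture choice made at compile time.
// MUL_HD, Multiply, MultiplyAdd and ScaleKernel are illustrative names.
#ifdef __CUDACC__
#define MUL_HD __host__ __device__
#else
#define MUL_HD  // plain C++ build: no CUDA qualifiers
#endif

MUL_HD inline unsigned int Multiply(unsigned int a, unsigned int b)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
    // Pre-Fermi device code: 24-bit multiply (operands assumed < 2^24).
    return __umul24(a, b);
#else
    // Fermi or later device code, and host code: native 32-bit multiply.
    return a * b;
#endif
}

MUL_HD inline unsigned int MultiplyAdd(unsigned int a, unsigned int b, unsigned int c)
{
    return Multiply(a, b) + c;
}

#ifdef __CUDACC__
// A kernel written once; each compiled architecture gets the matching multiply.
__global__ void ScaleKernel(const unsigned int* in, unsigned int* out,
                            unsigned int factor, unsigned int n)
{
    unsigned int i = MultiplyAdd(blockIdx.x, blockDim.x, threadIdx.x);
    if (i < n) {
        out[i] = Multiply(in[i], factor);
    }
}
#endif
```

With this arrangement the kernel source exists once, and the compiler emits one variant per target architecture. The trade-off is that the decision is made at build time rather than by a runtime query of the device.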

Is it worth it?

Well, consider that this is a trivial copy-and-paste job and that you are gaining efficiency, so yes, in my opinion. After all, CUDA is not the easiest form of programming (nor is any parallel programming). If performance is NOT critical, ask yourself: why are you using CUDA?

If performance is critical, it is negligent to code lazily and either abandon legacy devices or ship less-than-optimal execution, unless you are absolutely confident that you can drop legacy support for your deployment (which then allows optimal execution on the hardware you do target).

For most people it makes sense to provide legacy support, given that it is not that hard once you understand how to do it. Be aware that this also means you will need to update your code to adapt to changes in future architectures.

As a rule of thumb, you should note the latest compute capability the code was targeted at when it was written, and perhaps print some kind of warning to users whose device has a compute capability beyond the one your latest implementation is optimized for.
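A minimal sketch of such a warning follows, assuming the major and minor numbers come from the `major` and `minor` fields of `cudaDeviceProp`, as in the snippet earlier in this answer. The constants `kTunedMajor` and `kTunedMinor` and both function names are hypothetical. The comparison uses the two integers directly rather than a combined floating-point value.

```cpp
#include <cstdio>

// Hypothetical: the newest compute capability this code has been tuned for.
const int kTunedMajor = 2;
const int kTunedMinor = 0;

// True when major.minor is strictly newer than the tuned target.
inline bool IsNewerThanTuned(int major, int minor)
{
    return major > kTunedMajor ||
           (major == kTunedMajor && minor > kTunedMinor);
}

// Prints a warning to stderr for newer devices; returns whether it warned.
inline bool WarnIfNewerThanTuned(int major, int minor)
{
    if (!IsNewerThanTuned(major, minor)) {
        return false;
    }
    std::fprintf(stderr,
                 "Warning: device compute capability %d.%d is newer than "
                 "%d.%d, the latest version this code is optimized for.\n",
                 major, minor, kTunedMajor, kTunedMinor);
    return true;
}
```

In a real application the arguments would be something like `DeviceProperties.major` and `DeviceProperties.minor`, read once at startup.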

+3

Jason R. Mick Dec 11 '11 at 17:48


fabrizioM · Accepted Answer · 2011-04-04T22:44:53+0000

Only on devices with a pre-Fermi architecture, that is, with a CUDA compute capability below 2.0, where the integer arithmetic unit is 24 bits wide.

On CUDA devices with compute capability >= 2.0, the architecture is 32-bit, and __umul24 will be slower rather than faster. The reason is that it has to emulate the 24-bit operation on the 32-bit architecture.

The question now is: is it worth the effort for the speed gain? Probably not.
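To make the 24-bit behavior concrete, here is a host-side reference model of what `__umul24` computes according to the CUDA documentation: the low 32 bits of the product of the low 24 bits of each operand. The names `Umul24Model` and `FitsIn24Bits` are hypothetical, and this models the result only, not the hardware's performance. It also shows why a blanket replacement of every multiplication is not safe: the two forms agree only when both operands fit in 24 bits.

```cpp
#include <cstdint>

// Reference model (illustrative name) of the documented __umul24 result:
// the top 8 bits of each operand are ignored, and the low 32 bits of the
// product are returned.
inline std::uint32_t Umul24Model(std::uint32_t a, std::uint32_t b)
{
    const std::uint64_t lowA = a & 0x00FFFFFFu;
    const std::uint64_t lowB = b & 0x00FFFFFFu;
    return static_cast<std::uint32_t>(lowA * lowB);  // keep the low 32 bits
}

// True when a value can be used with a 24-bit multiply without losing bits.
inline bool FitsIn24Bits(std::uint32_t x)
{
    return (x & 0xFF000000u) == 0u;
}
```

For example, with a = 0x01000001 and b = 2 the native product is 0x02000002, while the 24-bit model yields 2, because bit 24 of a is discarded before multiplying.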
