If two kernels are launched as shown above, they will be serialized (executed one after the other). This is because, without any additional code to specify streams, both kernels are issued to the same CUDA stream. All CUDA calls issued to the same stream execute sequentially, even if you expect otherwise because you are using cudaMemcpyAsync or something similar.
Of course, it is possible for several kernels to run asynchronously with respect to each other (possibly concurrently), but to achieve that you must use the CUDA streams API.
See section 3.2.5 "Asynchronous Concurrent Execution" of the CUDA C Programming Guide to learn more about streams and concurrent kernel execution. In addition, the NVIDIA CUDA SDK includes several samples that illustrate these concepts, such as simpleStreams. The concurrentKernels sample shows how to run multiple kernels concurrently (using streams). Note that concurrent kernel execution requires hardware of compute capability 2.0 or higher.
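As a minimal sketch of the streams approach (the kernel, array sizes, and stream names here are illustrative, not taken from the SDK samples), launching work into two different non-default streams looks like this:

```cuda
#include <cuda_runtime.h>

// Trivial illustrative kernel; any two independent kernels would do.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Create two streams; kernels issued to different non-default
    // streams may execute concurrently on compute capability 2.0+.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The fourth launch-configuration parameter selects the stream.
    myKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    myKernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    // Without the stream arguments, both launches would go to the
    // same (default) stream and be serialized, as described above.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Whether the two kernels actually overlap also depends on resource availability: if the first launch fills the device, the second will still wait for free multiprocessors.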
Also, to answer your first question: from section 3.2.5.3 of the CUDA C Programming Guide, "The maximum number of kernel launches that a device can execute concurrently is sixteen."
For reference, a "grid" is the entire array of threads associated with a single kernel launch.