DMA memcpy operation on Linux

I want to perform a DMA memcpy using the dma_async_memcpy_buf_to_buf() function from dmaengine.c (linux/drivers/dma). To do this, I added the following function to dmatest.c (linux/drivers/dma):

    #include <linux/dmaengine.h>
    #include <linux/ktime.h>
    #include <linux/slab.h>

    void foo(void)
    {
        struct dma_chan *chan;
        dma_cookie_t cookie;
        size_t len = 0x20000;
        ktime_t start, end;
        s64 actual_time;
        u16 *dest, *src;
        int index;

        /* The original snippet referenced an undefined 'chan';
         * grab any channel capable of DMA_MEMCPY explicitly. */
        dmaengine_get();
        chan = dma_find_channel(DMA_MEMCPY);
        if (!chan)
            goto out_put;

        dest = kmalloc(len, GFP_KERNEL);
        src = kmalloc(len, GFP_KERNEL);
        if (!dest || !src)
            goto out_free;

        /* Fill both buffers with known patterns. */
        for (index = 0; index < len / 2; index++) {
            dest[index] = 0xAA55;
            src[index] = 0xDEAD;
        }

        start = ktime_get();
        cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
        /* dma_sync_wait() issues pending descriptors and polls for completion. */
        while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
            dma_sync_wait(chan, cookie);
        end = ktime_get();
        actual_time = ktime_to_ns(ktime_sub(end, start));
        printk("Time taken for function() execution dma: %lld\n",
               (long long)actual_time);

        memset(dest, 0, len);

        start = ktime_get();
        memcpy(dest, src, len);
        end = ktime_get();
        actual_time = ktime_to_ns(ktime_sub(end, start));
        printk("Time taken for function() execution non-dma: %lld\n",
               (long long)actual_time);

    out_free:
        kfree(dest);
        kfree(src);
    out_put:
        dmaengine_put();
    }

I have a few problems and questions about this DMA approach:

  • Interestingly, memcpy completes in less time than dma_async_memcpy_buf_to_buf. Perhaps this is due to a problem with the ktime_get() measurement?

  • Is my approach in foo() the right way to perform a DMA operation? I am not sure about that.

  • How can I compare the memcpy and dma_async_memcpy_buf_to_buf calls in terms of CPU usage?

  • Finally, is DMA possible from user space (application level)? So far I have only used it at the kernel level, as shown above (the code is added to dmatest.c and loaded as a kernel module).

1 answer

Your post bundles several separate questions together, which makes it hard to answer precisely, so let me take them one by one:

  • Yes, your general sequence of DMA calls (prepare, submit, wait for completion) is correct.

  • The fundamental point of using DMA instead of a plain memcpy to copy memory is not a direct performance gain, but rather (a) an indirect gain from preserving the CPU's cache and prefetcher state, which a plain old memcpy executed on the CPU itself would likely trash, and (b) a genuine background operation that leaves the CPU free for other work (see the callback sketch after this list).

  • Given (a), there is little point in using DMA for anything smaller than the CPU cache, i.e. for buffers under tens of megabytes. DMA is typically used for fast off-CPU streaming, i.e. moving data that is produced or consumed by external devices such as fast network cards or video streaming/capture/encoding hardware.

  • Comparing an asynchronous operation against a synchronous one by elapsed wall-clock time is not valid. There may be hundreds of threads and processes running, and nothing guarantees that your task gets scheduled on the very next tick rather than several thousand ticks later.

  • Using ktime_get for benchmarking purposes is also wrong: it is rather inaccurate, especially for tasks this short. Profiling kernel code is in fact a rather complex and involved task that goes far beyond the scope of this question. A quick recommendation: refrain from such micro-benchmarks altogether and profile a much larger, more complete workload, similar to what you are ultimately trying to achieve.

  • Counting raw ticks on a modern CPU is also largely meaningless, although you can use dedicated processor profiling tools, such as Intel VTune.

  • Using DMA copy operations at the application level is pretty pointless; at least I cannot come up with a single viable scenario where it would be worth the trouble. It is not inherently faster, and, more importantly, I seriously doubt that memory copying is your application's bottleneck. For DMA to pay off, everything else in the application would have to be faster than plain memory copying, and I cannot think of anything at the application level that would be faster than memcpy. And if we are talking about communicating with some other processing device outside the CPU, then that is not automatically available at the application level anyway.

  • Typically, memory copy performance is bound by memory speed, i.e. clock frequency and timings. You will not get any miraculous improvement over a regular memcpy in raw throughput, simply because memcpy running on the CPU is already fast enough: the CPU core usually runs 3x-5x-10x faster than the memory (a rough worked example follows below).
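To illustrate point (b) above, here is a minimal sketch of running the copy truly in the background with a completion callback instead of busy-waiting. This is a sketch under assumptions, not the asker's code: it presumes a channel obtained with dma_find_channel(DMA_MEMCPY) as in the question, and it uses the generic dmaengine prep/submit API; on kernels old enough to still have dma_async_memcpy_buf_to_buf, the dmaengine_prep_dma_memcpy() helper may not exist yet and you would call chan->device->device_prep_dma_memcpy() directly.

    #include <linux/completion.h>
    #include <linux/dma-mapping.h>
    #include <linux/dmaengine.h>

    /* Sketch only: error handling and unmapping on failure are abbreviated. */
    static void memcpy_done(void *arg)
    {
        complete((struct completion *)arg);   /* runs from the DMA completion path */
    }

    static int dma_copy_background(struct dma_chan *chan, void *dest, void *src,
                                   size_t len)
    {
        struct dma_async_tx_descriptor *tx;
        struct completion done;
        dma_addr_t dma_dest, dma_src;
        dma_cookie_t cookie;

        /* Map both buffers for the device that owns the channel. */
        dma_src = dma_map_single(chan->device->dev, src, len, DMA_TO_DEVICE);
        dma_dest = dma_map_single(chan->device->dev, dest, len, DMA_FROM_DEVICE);

        tx = dmaengine_prep_dma_memcpy(chan, dma_dest, dma_src, len,
                                       DMA_PREP_INTERRUPT);
        if (!tx)
            return -ENOMEM;

        init_completion(&done);
        tx->callback = memcpy_done;        /* fires when the copy completes */
        tx->callback_param = &done;

        cookie = dmaengine_submit(tx);
        dma_async_issue_pending(chan);     /* kick the engine */

        /* The CPU is free at this point; real code would do useful work
         * here instead of immediately sleeping on the completion. */
        wait_for_completion(&done);

        dma_unmap_single(chan->device->dev, dma_src, len, DMA_TO_DEVICE);
        dma_unmap_single(chan->device->dev, dma_dest, len, DMA_FROM_DEVICE);
        return dma_submit_error(cookie) ? -EIO : 0;
    }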
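As a rough worked example of that last point (illustrative numbers, not measurements): dual-channel DDR3-1600 peaks at roughly 25.6 GB/s. The 0x20000-byte (128 KiB) buffer from the question generates about 256 KiB of traffic (a read plus a write), so even at half the peak rate the copy finishes in roughly 256 KiB / 12.8 GB/s ≈ 20 µs. A fixed DMA setup cost of just a few microseconds (mapping, descriptor submission, interrupt) is already a large fraction of that, whereas for a multi-megabyte stream the same overhead would be negligible.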

