Off-chip memcpy?

Today I was profiling a program that does a lot of buffered network activity, and it spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.

This made me wonder why Intel doesn't have a memcpy instruction that lets the RAM itself (or some hardware outside the processor) move the data without involving the CPU at all. As it stands, every word has to be brought into the processor and then pushed back out again, when the whole transfer could be done asynchronously by the memory itself.

Is there an architectural reason this would be impractical? Obviously some copies would be between physical memory and virtual memory (pages not currently resident), but those cases are dwindling with RAM prices these days. And sometimes the processor would just end up waiting for the copy to finish so it could use the result, but not always.
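To make the pattern concrete, here is a minimal sketch of the kind of code I mean (the buffer names and sizes are made up, not from the actual program):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define BUF_SIZE 65536

    static char lib_buf[BUF_SIZE];   /* library-managed network buffer    */
    static char app_buf[BUF_SIZE];   /* the program's own internal buffer */

    /* The data lands in the library's buffer first and is then memcpy'd
     * into the application's buffer, so the CPU touches every byte twice. */
    ssize_t read_into_app_buffer(int sock)
    {
        ssize_t n = recv(sock, lib_buf, sizeof lib_buf, 0);
        if (n > 0)
            memcpy(app_buf, lib_buf, (size_t)n);
        return n;
    }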

+6
3 answers

This is a big topic that reaches into network stack efficiency, but I will stick to your specific question about an instruction. What you propose is an asynchronous, non-blocking copy instruction, as opposed to the synchronous, blocking memcpy available today via rep movs.
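For reference, that synchronous copy can be written out directly. A minimal sketch, assuming GCC or Clang on x86-64:

    #include <stddef.h>

    /* The blocking copy referred to above: the core executes "rep movsb"
     * with the destination in RDI, the source in RSI, and the byte count
     * in RCX, and does not move on until every byte has been copied. */
    static void rep_movsb_copy(void *dst, const void *src, size_t n)
    {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }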

Some architectural and practical problems:

1) A non-blocking memcpy has to consume some physical resource, such as a copy engine, with a lifetime potentially different from that of the corresponding operating-system process. That is quite nasty for the OS. Say thread A kicks off a memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and has a much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000 GB long? Providing more copy engines on chip postpones the problem but does not solve it. Fundamentally this breaks the traditional model of time slicing and OS scheduling. (A sketch of the per-copy state this implies appears after point 4 below.)

2) To be general, like most instructions, any code can issue the memcpy instruction at any time, with no regard for what other processes have done or will do. The chip has to impose some limit on the number of asynchronous memcpy operations in flight at any one time, so when the next process comes along, its memcpy may sit at the end of an arbitrarily long backlog. The asynchronous copy would have no determinism to it, and developers would simply fall back to the old-fashioned synchronous copy.

3) Cache locality has a first-order impact on performance. A traditional copy of a buffer that is already in the L1 cache is blazingly fast and relatively power-efficient, since at least the destination buffer remains local to the core's L1. In the network case, the copy from the kernel buffer to the user buffer happens just before the user buffer is handed to the application, so the application enjoys L1 hits and excellent efficiency. If an asynchronous memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) the lines away from the core, and the application would eat cache misses. Net system efficiency would probably be much worse than today. (The rough benchmark sketch at the end of this answer illustrates the bandwidth gap.)

4) The asynch memcpy instruction has to return some kind of token identifying the copy, to be used later to ask whether the copy is done (which requires yet another instruction). Given the token, the core would have to perform some sort of complex context lookup for that particular pending or in-flight copy; that kind of operation is better handled by software than by core microcode. What if the OS needs to kill the process and flush all its in-flight and pending memcpy operations? How does the OS know how many times a process used the instruction, and which outstanding tokens belong to which process? A hypothetical sketch of this instruction pair follows below.
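To make points 1 and 4 concrete, here is a hypothetical C-level view of the proposed instruction pair and the bookkeeping it implies. Nothing in it exists on any real CPU or OS; the intrinsic names, the token, and the engine-slot structure are invented for illustration, with memcpy standing in as a degenerate synchronous fallback so the sketch compiles:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>

    /* Hypothetical per-engine state the OS would have to track and, on a
     * context switch or process kill, find again from a token (point 4),
     * even though its lifetime is decoupled from the owning thread's time
     * slice (point 1). */
    struct copy_engine_slot {
        uint64_t src;        /* current source address          */
        uint64_t dst;        /* current destination address     */
        uint64_t remaining;  /* bytes still to move             */
        pid_t    owner;      /* which process started the copy  */
    };

    typedef uint64_t copy_token_t;

    /* Stub fallbacks so the sketch compiles; the "real" versions would be
     * the two instructions proposed above, backed by hardware engines. */
    static copy_token_t async_memcpy_start(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);   /* degenerate synchronous stand-in */
        return 1;
    }

    static bool async_memcpy_done(copy_token_t tok)
    {
        (void)tok;
        return true;
    }

    void example(void *dst, const void *src, size_t n)
    {
        copy_token_t tok = async_memcpy_start(dst, src, n);
        /* ... unrelated work while the copy engine runs ... */
        while (!async_memcpy_done(tok))
            ;  /* poll; the thread may be descheduled mid-copy */
    }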

--- EDIT ---

5) Another issue: any copy engine outside the core has to compete, on raw copy performance, with the core's bandwidth to cache, which is very high, much higher than external memory bandwidth. On cache misses the memory subsystem bottlenecks synchronous and asynchronous memcpy equally. For any case in which at least some of the data is in cache, which is a good bet, the core will finish the copy faster than an external copy engine, as the rough benchmark below suggests.
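A rough illustration of points 3 and 5, assuming GCC or Clang on a Linux/POSIX system: a buffer that is hot in cache copies at cache bandwidth, while a buffer far larger than any cache copies at memory bandwidth. The sizes and the measurement are deliberately crude; treat it as a sketch, not a rigorous benchmark:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Crude timing helper: seconds for `reps` copies of n bytes. */
    static double time_copy(char *dst, const char *src, size_t n, int reps)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < reps; i++) {
            memcpy(dst, src, n);
            __asm__ volatile("" ::: "memory");  /* keep repeats from being elided */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        size_t small = 16 * 1024;             /* fits easily in L1/L2      */
        size_t big = 256u * 1024 * 1024;      /* far larger than any cache */
        char *s1 = malloc(small), *d1 = malloc(small);
        char *s2 = malloc(big),   *d2 = malloc(big);
        memset(s1, 1, small);
        memset(s2, 1, big);

        time_copy(d1, s1, small, 1);          /* warm the small buffers    */
        double hot  = time_copy(d1, s1, small, 10000);
        double cold = time_copy(d2, s2, big, 1);

        printf("hot  copy: %6.2f GB/s\n", small * 10000.0 / hot / 1e9);
        printf("cold copy: %6.2f GB/s\n", (double)big / cold / 1e9);
        free(s1); free(d1); free(s2); free(d2);
        return 0;
    }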

+2

Memory-to-memory transfers used to be supported by the DMA controller in older PC architectures, and similar support exists in other architectures today (for example, the TI DaVinci or OMAP processors).

The problem is that it eats into your memory bandwidth, which can be a bottleneck on many systems. As srking hinted, reading the data into the CPU's cache and then copying it around there can be far more efficient than DMA through memory. Even though the DMA can run in the background, it will contend with the CPU for memory. No free lunches.
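On Linux, the dmaengine subsystem exposes exactly this kind of engine to kernel code. A condensed sketch of a client doing a memory-to-memory copy, assuming a memcpy-capable channel exists; error handling and the dma_map_single() calls that produce the dma_addr_t addresses are omitted:

    #include <linux/dmaengine.h>

    static void dma_memcpy_example(dma_addr_t dst, dma_addr_t src, size_t len)
    {
        dma_cap_mask_t mask;
        struct dma_chan *chan;
        struct dma_async_tx_descriptor *tx;
        dma_cookie_t cookie;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);
        chan = dma_request_channel(mask, NULL, NULL);  /* a memcpy-capable engine */

        tx = dmaengine_prep_dma_memcpy(chan, dst, src, len, DMA_PREP_INTERRUPT);
        cookie = dmaengine_submit(tx);
        dma_async_issue_pending(chan);                 /* copy runs in background */

        while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) != DMA_COMPLETE)
            cpu_relax();                               /* CPU is free meanwhile   */

        dma_release_channel(chan);
    }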

The better solution is some form of zero copy, where a buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and never needs to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I have seen this done in embedded real-time network stacks.
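One widely available form of this idea on Linux is sendfile(2), which pushes file data to a socket entirely inside the kernel, with no round trip through a user-space buffer:

    #include <sys/sendfile.h>
    #include <sys/types.h>

    /* Send `count` bytes of file_fd over sock_fd without ever copying the
     * data into user space; the kernel moves the pages directly. */
    ssize_t send_file_zero_copy(int sock_fd, int file_fd, size_t count)
    {
        off_t offset = 0;   /* start at the beginning of the file */
        return sendfile(sock_fd, file_fd, &offset, count);
    }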

+1

Net win?

It is not clear that implementing an asynchronous copy engine would be a net win. The complexity of such a thing would add overhead that might cancel out the benefits, and it would not be worth it just for the few programs that are memcpy()-bound.

Heavier user context?

An implementation would consume either user-context or kernel resources. One immediate problem is that, since this is a potentially long-running operation, it must allow interrupts and automatically resume.

And that means that if the implementation is part of the user context, it represents additional state that has to be saved on every context switch, or else it must overlay existing state.

Overlaying existing state is exactly how the string-move instructions work: they keep their parameters in general-purpose registers (for x86 rep movs, the source in RSI, the destination in RDI, and the count in RCX). But if existing state is consumed, that state is of no use while the operation is in flight, and at that point one might as well just use the string-move instructions, which is how memory-copy functions actually work anyway.

Or a kernel resource?

If it instead uses some kind of shared, global state, then it has to be a kernel-managed resource. The kernel-crossing overhead that implies (trap and return) is quite expensive and would further limit the benefit, or turn it into an outright penalty.

Idea! Have that super-fast CPU do it!

Another way to look at this is that there already is a highly tuned, very fast engine for moving memory, sitting right at the center of all those rings of cache that have to be kept coherent with the results of the move: the CPU. If a program needs to do this, why not throw that fast and elaborate piece of hardware at the problem?

+1

Source: https://habr.com/ru/post/894961/

