This is a big topic that also touches on network-stack efficiency, but I'll stick to your specific question about the instruction. What you are proposing is an asynchronous, non-blocking copy instruction, as opposed to the synchronous, blocking memcpy that is available today via "rep movs".
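For reference, the synchronous copy in question can be written today as a thin inline-assembly wrapper around "rep movsb" (x86-64, GCC/Clang syntax; the wrapper name is mine, purely for illustration). The instruction does not retire until every byte has been moved, which is exactly the blocking behaviour being discussed:

```c
#include <stddef.h>

/* Synchronous, blocking copy via "rep movsb": the instruction only
 * completes once all n bytes have been moved, so the calling thread
 * cannot do anything else while the copy is in flight. */
static inline void rep_movsb_copy(void *dst, const void *src, size_t n)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}
```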
Some architectural and practical problems:
1) A non-blocking memcpy must consume some physical resource, such as a copy engine, whose lifetime is potentially different from that of the corresponding operating-system process. That is rather unpleasant for the OS. Say thread A issues a memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and has a much higher priority than A. Must it wait for A's copy to finish? What if A's memcpy was 1000 GB? Providing more copy engines in the system postpones the problem but does not solve it. This fundamentally breaks the traditional role of time slices and OS scheduling.
2) To be general, like most instructions, the asynchronous memcpy could be issued by any code at any time, with no regard for what other processes have done or will do. The kernel would have to impose some limit on the number of async memcpy operations in flight at any given time, so when the next process comes along, its memcpy may be queued behind an arbitrarily long backlog. An asynchronous copy with no determinism is of little use, and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first-order effect on performance. A traditional copy of a buffer that is already in the L1 cache is blindingly fast and relatively power efficient, since at the end the destination buffer is local to the core's L1. In the networking case, the copy from the kernel buffer to the user buffer happens immediately before the user buffer is handed to the application, so the application enjoys L1 hits and excellent performance. If an asynchronous memcpy engine lived anywhere other than the core that will run the application, the copy would end up snooping the lines away from that core's cache, and the application would take misses on the very data it is about to use. Net system performance would likely be much worse than it is today.
4) The async memcpy instruction would have to return some kind of token identifying the copy, for use later when asking whether the copy has completed (which requires yet another instruction). Given the token, the CPU would have to perform some kind of complex context lookup for that particular pending or in-flight copy, which is the sort of operation better handled by software than by microcode. What if the OS needs to kill the process and flush all of its pending and in-flight memcpy operations? How does the OS know how many times a process has used this instruction, and which outstanding tokens belong to which process? (A sketch of what such an interface and its bookkeeping might look like follows after this list.)
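Here is a minimal sketch, in C, of the bookkeeping that point 4 implies. Everything in it is hypothetical: async_memcpy(), async_memcpy_done() and the token type are invented stand-ins for the proposed instructions (stubbed with an ordinary synchronous memcpy so the code compiles and runs), and the per-process list is the kind of state the kernel would be forced to maintain and tear down when a process dies:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token returned by the proposed async-memcpy instruction. */
typedef uint64_t copy_token_t;

/* Per-process record of one in-flight copy.  In a real OS this would
 * live in kernel memory, attached to the process control block. */
struct inflight_copy {
    copy_token_t          token;
    struct inflight_copy *next;
};

struct process {
    struct inflight_copy *inflight;   /* copies this process has started */
};

static copy_token_t next_token = 1;

/* Stand-in for the hypothetical instruction: it simply performs the copy
 * synchronously and records a token, so the sketch is runnable. */
static copy_token_t async_memcpy(struct process *p,
                                 void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);                     /* pretend this were asynchronous */

    struct inflight_copy *c = malloc(sizeof *c);
    if (!c)
        abort();
    c->token    = next_token++;
    c->next     = p->inflight;
    p->inflight = c;                         /* bookkeeping the OS must keep */
    return c->token;
}

/* Stand-in for the hypothetical "is it done yet?" instruction. */
static int async_memcpy_done(const struct process *p, copy_token_t tok)
{
    (void)p; (void)tok;
    return 1;                                /* the stub completes immediately */
}

/* What the OS would have to do when killing the process: walk every
 * outstanding token and cancel/reclaim the associated hardware state. */
static void process_exit(struct process *p)
{
    while (p->inflight) {
        struct inflight_copy *c = p->inflight;
        p->inflight = c->next;
        /* cancel_copy_in_hardware(c->token);   another hypothetical op */
        free(c);
    }
}
```

Even this toy version shows the cost: every process needs a table of outstanding tokens, every exit and kill path has to walk it, and the hardware needs a matching notion of which tokens belong to which address space.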
--- EDIT ---
5) Another problem: any copy engine outside the core has to compete with the raw copy performance of the CPU running at cache bandwidth, which is very high, much higher than external memory bandwidth. For cache misses, the memory subsystem bottlenecks synchronous and asynchronous memcpy about equally. For any case where at least some of the data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine would.
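A rough way to see the bandwidth gap that point 5 relies on is to time an ordinary memcpy on a buffer that fits in cache versus one far larger than the last-level cache. The sketch below is a quick-and-dirty measurement, not a rigorous benchmark; the buffer sizes and iteration counts are arbitrary choices, and the empty asm statement is just a GCC/Clang compiler barrier to keep the repeated copies from being optimised away:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy n bytes 'iters' times and report the achieved rate in GB/s. */
static double copy_gbps(size_t n, int iters)
{
    unsigned char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) { perror("malloc"); exit(1); }
    memset(src, 1, n);
    memset(dst, 0, n);                       /* fault both buffers in */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, n);
        /* Compiler barrier so repeated copies are not eliminated. */
        __asm__ volatile("" : : "r"(dst), "r"(src) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbps = (double)n * iters / secs / 1e9;

    free(src);
    free(dst);
    return gbps;
}

int main(void)
{
    /* One buffer that sits comfortably in L1/L2, one much larger than
     * a typical last-level cache. */
    printf("cache-resident (32 KiB): %6.1f GB/s\n", copy_gbps(32u << 10, 200000));
    printf("DRAM-sized    (256 MiB): %6.1f GB/s\n", copy_gbps(256u << 20, 8));
    return 0;
}
```

On a typical machine the cache-resident case comes out several times faster than the DRAM-sized one, which is the headroom any external copy engine would have to beat.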