Since you already have a number of good answers, I feel obliged to add one more point: even if the code is theoretically less efficient, that is rarely likely to make any real difference.
The reason is quite simple: the processor is much faster than memory in any case. Even fairly crappy code will still easily saturate the bandwidth between the processor and memory. Even if the data is in the cache, the same generally remains true, and (again) even with crappy code, the move will finish too quickly to be worth caring about anyway.
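If you want to check this on your own machine rather than take my word for it, here is a minimal sketch that times a deliberately naive byte-by-byte copy against `memcpy`. The names `naive_copy`, `time_naive` and `time_memcpy` are made up for this example, and `clock()` is only a coarse timer. Note also that an optimizing compiler may recognize the naive loop and replace it with a call to `memcpy`, or drop repeated copies whose result is never read, so treat any numbers as a rough indication only.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Deliberately naive copy: one byte per iteration, nothing clever. */
static void naive_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* CPU seconds spent doing `reps` naive copies of `n` bytes. */
static double time_naive(unsigned char *dst, const unsigned char *src,
                         size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        naive_copy(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* CPU seconds spent doing `reps` library copies of `n` bytes. */
static double time_memcpy(unsigned char *dst, const unsigned char *src,
                          size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        memcpy(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call both with a buffer much larger than your last-level cache and a handful of repetitions, then compare the two return values. The buffer size that is "large enough" depends on your hardware, so that part is up to you.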
Quite a few processors (for example, Intel x86) have a special hardware path that is used for most moves anyway, so there is often no difference in speed between implementations that look quite different at the assembly code level.
Ultimately, if you care about the speed of moving things around in memory, you should worry more about eliminating the moves than about speeding them up.
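As one illustration of what eliminating a move can look like, here is a sketch of a double-buffer setup where publishing new data swaps two pointers instead of copying the bytes. The `struct frames` type and both function names are hypothetical, invented for this example. Whether a swap is possible at all depends on who else holds pointers into the buffers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical double buffer: readers use `front`, the writer fills `back`. */
struct frames {
    unsigned char *front;
    unsigned char *back;
    size_t size;
};

/* Copying version: moves `size` bytes every time new data is published. */
static void publish_by_copy(struct frames *f)
{
    memcpy(f->front, f->back, f->size);
}

/* Swapping version: exchanges two pointers, so the cost does not
   depend on `size` and no buffer contents are moved at all. */
static void publish_by_swap(struct frames *f)
{
    unsigned char *tmp = f->front;
    f->front = f->back;
    f->back = tmp;
}
```

With the swap, the old front buffer becomes the new scratch area, so the writer has to overwrite it fully before the next publish. That is the usual trade-off of this approach.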