Brief explanation
At the assembly level, two forms of this instruction are allowed: the explicit-operands form and the no-operands form. The explicit-operands form lets you write the source and destination memory operands symbolically. It exists for documentation purposes; note, however, that the documentation it provides can be misleading. The operand symbols must specify the correct operand size, but they do not have to indicate the correct source and destination addresses. The source address is always taken from DS:(RSI/ESI/SI) and the destination address from ES:(RDI/EDI/DI), and these registers must be loaded correctly before the MOVS instruction executes. This is how I understand Intel's official position on this.
Long explanation
REP MOVS DWORD PTR ES:[EDI], DWORD PTR [ESI] is synonymous with REP MOVSD, and REP MOVS BYTE PTR ES:[EDI], BYTE PTR [ESI] is synonymous with REP MOVSB.
The MOVS instruction comes in the following variants, one per data size:
- MOVSB (byte, 8 bits)
- MOVSW (word, 16 bits)
- MOVSD (dword, 32 bits)
- MOVSQ (qword, 64 bits), available only in 64-bit mode
The MOVS instruction copies data from DS:(SI/ESI/RSI) to ES:(DI/EDI/RDI); the width of the SI/DI registers depends on the current mode: 16-bit, 32-bit or 64-bit. After each element it also increments or decrements the SI and DI registers by the element size, depending on the direction flag (DF): CLD clears the flag so the registers are incremented, and STD sets it so they are decremented.
MOVS cannot address memory through any registers other than SI/DI, so there is no need to specify them.
If MOVS carries the REP prefix, the copy is repeated CX (ECX/RCX) times, with CX decremented after each element, so CX is zero when the instruction finishes. The count is in elements of the instruction's data size (bytes for MOVSB, dwords for MOVSD), not always in bytes.
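To make the register behaviour concrete, here is a minimal sketch in C with GCC/Clang inline assembly. It assumes an x86 or x86-64 target and falls back to a plain loop that mimics the same behaviour elsewhere. The names `movs_result` and `rep_movs` are made up for this illustration. Note that `movsl` is the AT&T (GNU assembler) spelling of MOVSD.

```c
#include <stddef.h>

/* Final register state after the instruction (hypothetical helper type). */
struct movs_result {
    const unsigned char *src_end;  /* (R|E)SI after the copy */
    unsigned char *dst_end;        /* (R|E)DI after the copy */
    size_t count_left;             /* (R|E)CX after the copy */
};

/* Copies `count` elements of `elem_size` bytes (1 or 4) with REP MOVS.
   Any elem_size other than 4 is treated as 1. */
static struct movs_result rep_movs(void *dst, const void *src,
                                   size_t count, size_t elem_size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "+D" binds (R|E)DI, "+S" binds (R|E)SI, "+c" binds (R|E)CX.
       CLD is written out for illustration; the common ABIs already
       guarantee DF = 0 on function entry. */
    if (elem_size == 4)
        __asm__ volatile ("cld\n\trep movsl"   /* MOVSD: 4 bytes per step */
                          : "+D"(d), "+S"(s), "+c"(count)
                          :
                          : "memory", "cc");
    else
        __asm__ volatile ("cld\n\trep movsb"   /* MOVSB: 1 byte per step */
                          : "+D"(d), "+S"(s), "+c"(count)
                          :
                          : "memory", "cc");
#else
    /* Portable stand-in that mimics the forward (DF = 0) behaviour. */
    if (elem_size != 4)
        elem_size = 1;
    for (; count != 0; --count) {
        for (size_t i = 0; i < elem_size; ++i)
            d[i] = s[i];
        d += elem_size;
        s += elem_size;
    }
#endif

    struct movs_result r = { s, d, count };
    return r;
}
```

Calling `rep_movs(dst, src, 3, 4)` copies 12 bytes: the count is three dwords, both pointers end up 12 bytes further on, and `count_left` is zero.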
Starting with the first Pentium processor, released in 1993, Intel began making simple instructions faster and complex instructions (such as REP MOVS) slower.
As a result, REP MOVS became very slow, and there was no longer any reason to use it.
In 2013, Intel decided to revisit REP MOVS. If a processor has the CPUID ERMSB bit (Enhanced REP MOVSB, found only on processors from that period onward), the REP MOVSB and REP STOSB instructions are executed differently than on older processors and are supposed to be fast. In practice, they are fast only for large blocks, 256 bytes or more, and only if certain conditions are met:
- both the source and destination addresses must be aligned to a 16-byte boundary (this boundary is the recommendation for Ivy Bridge processors; on later processors it can be larger, up to 64 bytes for Cannonlake);
- the source region must not overlap the destination region;
- the length must be a multiple of 64 bytes to get the higher performance;
- the direction must be forward (CLD).
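As a rough sketch of how the checks above could look in code, the following C fragment reads the ERMSB bit (CPUID leaf 7, subleaf 0, EBX bit 9) through the `<cpuid.h>` header shipped with GCC and Clang, and tests one copy request against the listed conditions. The function names are invented for this illustration, and the 256-byte, 16-byte and 64-byte figures are simply the ones quoted above, not values tuned for any particular processor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* True if CPUID leaf 7, subleaf 0 reports ERMSB in EBX bit 9.
   Always false when not built for x86 with GCC or Clang. */
static bool cpu_has_ermsb(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_max(0, NULL) < 7)
        return false;                      /* leaf 7 is not available */
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return ((ebx >> 9) & 1u) != 0;
#else
    return false;
#endif
}

/* Hypothetical predicate: does this copy meet the conditions listed above?
   The direction flag is not checked here; the caller is assumed to copy
   forward (DF = 0). */
static bool ermsb_fast_path_likely(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    bool aligned  = (d % 16 == 0) && (s % 16 == 0);  /* 16-byte boundary  */
    bool disjoint = (d + n <= s) || (s + n <= d);    /* no overlap        */
    bool large    = n >= 256;                        /* large block       */
    bool multiple = (n % 64 == 0);                   /* multiple of 64    */

    return aligned && disjoint && large && multiple;
}
```

A caller would typically combine the two, for example `cpu_has_ermsb() && ermsb_fast_path_likely(dst, src, n)`, and fall back to another copy routine otherwise.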
See the Intel Optimization Manual, Section 3.7.6, Enhanced REP MOVSB and STOSB operation (ERMSB): http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
REP MOVS instructions are very slow on small blocks because the startup cost is about 35 cycles. If you use a plain MOV through EAX in a loop, there is no startup cost, and you can copy a lot of data in those 35 cycles.
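For comparison, a plain copy loop of the kind described here might look like the following C sketch. The name `copy_small` is made up for the example. Each 4-byte `memcpy` is normally compiled to a single MOV load or store on x86, but that is an assumption about the compiler; an optimizing compiler may also vectorize the loop or replace it with a library call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for small blocks: moves one 32-bit value at a time
   through a temporary, then finishes the remainder byte by byte. */
static void copy_small(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 4) {
        uint32_t tmp;
        memcpy(&tmp, s, 4);   /* load:  usually one MOV into a register  */
        memcpy(d, &tmp, 4);   /* store: usually one MOV back to memory   */
        d += 4;
        s += 4;
        n -= 4;
    }
    while (n > 0) {           /* 0 to 3 trailing bytes */
        *d++ = *s++;
        --n;
    }
}
```

Using `memcpy` for the 4-byte moves keeps the sketch free of alignment and aliasing problems while still expressing the load/store pair.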
Note that ERMSB gives its best results for REP MOVSB rather than REP MOVSD (MOVSQ). With ERMSB all REP MOVS instructions become much faster, but REP MOVSB is the fastest of them.
So the code you showed is optimal neither for processors without ERMSB (where a plain MOV copy through EAX will be faster) nor for processors with ERMSB (where REP MOVSB is the fast form rather than REP MOVSD, although the difference is not that big).
The code you provided can give the best results only on very old processors, such as the 80386, released in 1985.