Expanding on the comments of Raymond Chen and Hans Passant, there are historical reasons why there are two separate instructions and why they don't have exactly the same effect.
Neither NOP nor FNOP was originally designed as an explicit no-operation instruction. The NOP instruction is actually just an alias for XCHG AX,AX (or XCHG EAX,EAX in 32-bit mode). On early Intel processors it wasn't treated specially. Although it had no externally visible effect, internally it was executed exactly like any other XCHG instruction and took just as many cycles. The 486 was the first Intel processor to treat it specially: it could execute NOP in 1 cycle, while any other register-to-register XCHG took 3 cycles.
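The one-byte XCHG encodings show where the alias comes from. Opcodes 90h through 97h each exchange the accumulator with the register whose number is in the low three bits of the opcode, so the slot for register 0 is XCHG AX,AX. Here is a minimal Python decoder for just these eight bytes. It assumes 16-bit operand size, and `decode_short_xchg` is a name made up for this sketch.

```python
# Register numbers as used by x86 encodings (16-bit names; in 32-bit code
# the same numbers mean EAX, ECX, EDX, ...).
REGS16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]


def decode_short_xchg(opcode: int) -> str:
    """Decode a one-byte 90h-97h opcode, assuming 16-bit operand size."""
    if not 0x90 <= opcode <= 0x97:
        raise ValueError(f"{opcode:#04x} is not a one-byte XCHG")
    return f"XCHG AX,{REGS16[opcode - 0x90]}"


for op in range(0x90, 0x98):
    print(f"{op:02X}  {decode_short_xchg(op)}")   # first line: 90  XCHG AX,AX
```

Register 0 is the accumulator itself, so opcode 90h names an exchange of AX with AX. This no-effect instruction is what got documented as NOP.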
Treating XCHG AX,AX specially became very important on more modern Intel processors. If it still actually exchanged the register with itself, it could cause pipeline stalls whenever a nearby instruction also used the AX register. Because it is treated specially, the CPU doesn't end up thinking that the NOP has to wait for a previous instruction that sets AX, or that the following instruction has to wait for the NOP.
There are many different instructions that do nothing, but XCHG AX,AX is the only one that is one byte long (as a special case of the one-byte exchange-register-with-accumulator encoding of XCHG). These instructions are often used in place of several consecutive NOP instructions, for example when aligning the start of a loop for performance reasons. If you need a 6-byte NOP, for instance, you can use LEA EAX,[EAX + 00000000]. Intel eventually added an explicit multi-byte NOP instruction. (Well, not so much added as officially documented an instruction that had been there since the Pentium Pro.) However, only the one-byte form is treated specially; the multi-byte NOPs will generate stalls if adjacent instructions use the same registers.
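Loop alignment with multi-byte NOPs can be sketched as a small padding routine. It assumes the recommended sequences listed in the NOP entry of Intel's Software Developer's Manual, copied by hand here, so check them against the manual before relying on them. `nop_padding` is a hypothetical helper. Real assemblers pick their own padding sequences.

```python
from __future__ import annotations

# Recommended multi-byte NOP sequences (NOP entry of Intel's SDM, copied by
# hand). Lengths 3-9 use the 0F 1F /0 "NOP r/m" form; 66h is an operand-size prefix.
RECOMMENDED_NOPS = {
    1: bytes.fromhex("90"),                  # one-byte NOP
    2: bytes.fromhex("6690"),                # 66 NOP
    3: bytes.fromhex("0f1f00"),              # NOP DWORD [EAX]
    4: bytes.fromhex("0f1f4000"),            # NOP DWORD [EAX+00h]
    5: bytes.fromhex("0f1f440000"),          # NOP DWORD [EAX+EAX*1+00h]
    6: bytes.fromhex("660f1f440000"),        # 66 NOP DWORD [EAX+EAX*1+00h]
    7: bytes.fromhex("0f1f8000000000"),      # NOP DWORD [EAX+00000000h]
    8: bytes.fromhex("0f1f840000000000"),    # NOP DWORD [EAX+EAX*1+00000000h]
    9: bytes.fromhex("660f1f840000000000"),  # 66 NOP DWORD [EAX+EAX*1+00000000h]
}
MAX_NOP = max(RECOMMENDED_NOPS)


def nop_padding(offset: int, alignment: int) -> list[bytes]:
    """Return NOP instructions that pad `offset` up to a multiple of `alignment`.

    Greedy: always emits the longest NOP that still fits, which gives the
    fewest instructions because every length from 1 to MAX_NOP is available.
    """
    gap = -offset % alignment   # bytes left until the next boundary
    pads = []
    while gap:
        size = min(gap, MAX_NOP)
        pads.append(RECOMMENDED_NOPS[size])
        gap -= size
    return pads


# A loop that would start at 0x1003 needs 13 bytes of padding to reach 0x1010.
print([p.hex() for p in nop_padding(0x1003, 16)])  # ['660f1f840000000000', '0f1f4000']
```

The 6-byte LEA EAX,[EAX + 00000000] from the text is another filler of this kind. In 32-bit code it assembles to 8D 80 00 00 00 00.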
When AMD added 64-bit support to its processors, it went even further: in 64-bit mode, NOP is no longer equivalent to XCHG EAX,EAX. One of the problems with Intel's instruction set is that many instructions modify only part of a register. For example, MOV BX,AX changes only the lower 16 bits of EBX, leaving the upper 16 bits unchanged. These partial modifications make it hard for the CPU to avoid stalls, so AMD decided to eliminate them for 32-bit instructions in 64-bit mode. Whenever the result of a 32-bit operation is stored in a (64-bit) register, the value is zero-extended to 64 bits, so that the entire register is modified. This means that XCHG EAX,EAX is no longer a NOP, since it clears the upper 32 bits of RAX. (So if you explicitly write XCHG EAX,EAX, the assembler can't encode it as 0x90 and must use the 87 C0 encoding.) In 64-bit mode, NOP is now an explicit no-operation instruction with no other interpretation.
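Here is a toy Python model of the 64-bit register-write rule described above. It is our own sketch, not any emulator's API, and `write_reg` is a hypothetical name. Under this rule, 8- and 16-bit results merge into the old register value, while 32-bit results are zero-extended. That difference is why XCHG EAX,EAX can no longer double as NOP.

```python
MASK64 = (1 << 64) - 1


def write_reg(old: int, value: int, width: int) -> int:
    """Model writing a `width`-bit result into a 64-bit register in 64-bit mode.

    8-bit writes are modelled as low-byte writes (AL, BL, ...); the legacy
    high-byte registers (AH, BH, ...) are not modelled.
    """
    if width == 64:
        return value & MASK64
    if width == 32:
        # 32-bit results are zero-extended: the upper 32 bits become 0.
        return value & 0xFFFFFFFF
    if width in (8, 16):
        # 8/16-bit results merge: bits above `width` keep their old value.
        mask = (1 << width) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("width must be 8, 16, 32 or 64")


rax = 0xFFFF_FFFF_FFFF_FFFF
after_nop = rax                                    # 90: no register is written
after_xchg = write_reg(rax, rax & 0xFFFFFFFF, 32)  # 87 C0: EAX is written back
print(hex(after_nop), hex(after_xchg))             # 0xffffffffffffffff 0xffffffff
```

The same model shows why MOV BX,AX is a partial write. Only the low 16 bits change, so the result depends on the register's previous contents.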
As for the FNOP instruction, it's not entirely clear how the original 8087 handled it, but I'm fairly sure it wasn't treated as an explicit no-operation instruction either. At least one old Intel manual, the ASM86 Language Reference Manual, documents it as doing something with no effect ("stores the top of the stack to the top of the stack"). From its position in the opcode map, it looks like an alias for FST ST or FLD ST, both of which would copy the top of the stack to the top of the stack. However, it did receive special treatment: it executed in an average of 13 cycles, instead of the average 18 or 20 cycles for the stack forms of FST and FLD respectively. If it were treated as a true no-operation instruction, I would expect it to be even faster, since a number of 8087 instructions execute in half that time.
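The opcode-map argument can be checked by splitting FNOP's encoding, D9 D0, into its ModRM fields. mod = 3 selects a register operand, rm = 0 names ST(0), and reg = 2 is the slot the D9 row uses for FST in its memory form. This Python sketch uses `split_modrm`, a helper written for illustration, and a partial, hand-copied excerpt of the D9 row.

```python
from __future__ import annotations

# Partial excerpt of the D9 row of the x87 opcode map: the instruction each
# ModRM "reg" value selects when the operand is in memory (mod != 3).
D9_MEMORY_FORMS = {0: "FLD m32fp", 2: "FST m32fp", 3: "FSTP m32fp"}


def split_modrm(modrm: int) -> tuple[int, int, int]:
    """Split a ModRM byte into its (mod, reg, rm) bit fields."""
    return modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111


mod, reg, rm = split_modrm(0xD0)               # FNOP is encoded D9 D0
print(mod, reg, rm, D9_MEMORY_FORMS[reg])      # 3 2 0 FST m32fp
```

By the same reasoning, D9 C0 (reg = 0) decodes to FLD ST(0), which really is the register form of FLD. The register form of FST ST(i), however, is encoded DD D0+i. So FNOP resembles FST ST only by its position in the D9 row.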
More importantly, FNOP behaves differently from NOP because of the way FPU instructions used to be implemented on Intel processors. The CPU itself didn't support floating-point arithmetic; instead, that work was offloaded to a separate floating-point coprocessor, originally the 8087. One of the nice things about the coprocessor was that it executed instructions in parallel with the CPU. However, this means the CPU sometimes needs to wait for an FPU operation to finish. The CPU automatically waits for the previous FPU instruction to complete before handing the coprocessor another one, but a program must explicitly wait (using the WAIT instruction) before it can read a result that the coprocessor wrote to memory.
Because the coprocessor worked in parallel, it also meant that if an FPU instruction raised a floating-point exception, the CPU had already moved on to later instructions by the time the exception was detected. Normally, when an instruction raises an exception on the CPU, the exception is handled while that instruction is still executing. When an FPU instruction raises an exception, however, the CPU has already finished with that instruction by handing it to the FPU. So instead of interrupting the CPU and delivering the floating-point exception asynchronously, the exception is delivered only when the CPU waits for the coprocessor, either explicitly or implicitly.
In modern processors the FPU is no longer a coprocessor; it is an integral part of the CPU. This means programs no longer have to wait for the FPU to write values to memory. However, the way FPU exceptions are handled hasn't changed. (It turns out that delivering exceptions immediately is difficult to implement on modern processors, so they took advantage of the one case where they don't have to.) So if the previous FPU instruction raised an unmasked floating-point exception, NOP leaves the exception pending, whereas FNOP, because it is an FPU instruction, performs an implicit "wait" that causes the floating-point exception to be delivered.
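This example demonstrates the difference. It assumes the zero-divide exception has been unmasked in the FPU control word; after FNINIT all x87 exceptions are masked, in which case FDIV just produces infinity and no exception is delivered: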
    FLD1    ; push 1.0 onto the FPU stack
    FLDZ    ; push 0.0
    FDIV    ; divide 1.0 by 0.0
    NOP     ; does nothing
    NOP     ; does nothing
    FNOP    ; signals a FP zero-divide exception and then does nothing
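The pending-exception mechanism behind the assembly sequence above can be modelled as a toy Python sketch. `ToyX87`, `FPUException` and `nop` are hypothetical names. The model assumes the zero-divide exception is unmasked and ignores everything else the real FPU does (status flags, precision, stack faults, how delivery reaches the OS). The only thing it captures is where the wait happens: every FPU instruction waits first, and the integer NOP never does.

```python
class FPUException(Exception):
    """Raised when the CPU waits on the FPU while an unmasked exception is pending."""


class ToyX87:
    """Toy model of deferred x87 exception delivery (zero-divide unmasked)."""

    def __init__(self):
        self.stack = []       # ST(0) is the last element
        self.pending = None   # name of an exception awaiting delivery, if any

    def wait(self):
        """WAIT, or the implicit wait that precedes every FPU instruction."""
        if self.pending is not None:
            name, self.pending = self.pending, None
            raise FPUException(name)

    def fld1(self):
        self.wait()
        self.stack.append(1.0)

    def fldz(self):
        self.wait()
        self.stack.append(0.0)

    def fdiv(self):
        """No-operand FDIV: ST(1) := ST(1) / ST(0), then pop."""
        self.wait()
        if self.stack[-1] == 0.0:
            self.pending = "zero divide"   # recorded now, reported later
            return
        divisor = self.stack.pop()
        self.stack[-1] /= divisor

    def fnop(self):
        self.wait()           # the implicit wait delivers any pending exception


def nop(_fpu):
    """Integer NOP: never consults the FPU, so a pending exception stays pending."""


fpu = ToyX87()
fpu.fld1()
fpu.fldz()
fpu.fdiv()     # exception recorded, nothing raised yet
nop(fpu)
nop(fpu)       # still nothing raised
try:
    fpu.fnop()
except FPUException as exc:
    print("delivered at FNOP:", exc)   # delivered at FNOP: zero divide
```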