With variable-length instructions, how does the computer know the length of the instruction being fetched?

In architectures where not all instructions are the same length, how does a computer know how much to read for a single instruction? For example, in Intel IA-32, some instructions are 4 bytes and some are 8 bytes, so how does it know whether to read 4 or 8 bytes? Does the first instruction read when the machine is turned on have a known size, with each instruction then encoding the size of the next?

3 answers

First, the processor does not need to know how many bytes to fetch; it can fetch a convenient number of bytes, enough to provide the target throughput at a typical or average instruction length. Any extra bytes can be placed in a buffer and used in the next group of bytes to be decoded. There are trade-offs in the width and alignment of the fetch relative to the supported decode width and even relative to the width of later stages of the pipeline. Fetching more bytes than the average instruction needs can reduce the impact of variability in instruction length and of the effective fetch bandwidth lost to taken control-flow instructions.

(Taken control-flow instructions may introduce a fetch bubble if the [predicted] target is not available until the cycle after the next fetch, and they reduce effective fetch bandwidth when targets are less aligned than the instruction fetch. For example, if instruction fetch is 16-byte aligned, as is usual for high-performance x86, a taken branch that targets the 16th [last] byte in a chunk effectively delivers only one byte of code, since the remaining 15 bytes are discarded.)
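
To make that arithmetic concrete, here is a minimal sketch (plain C, my own illustration rather than anything from the original answer) of how many useful bytes a 16-byte aligned fetch delivers for a given branch-target address:

```c
#include <stdint.h>
#include <stdio.h>

#define FETCH_WIDTH 16u  /* bytes fetched per cycle, 16-byte aligned */

/* Useful bytes delivered by the first fetch after a taken branch:
 * everything in the aligned chunk before the target byte is discarded. */
static unsigned useful_fetch_bytes(uint64_t target)
{
    return FETCH_WIDTH - (unsigned)(target % FETCH_WIDTH);
}

int main(void)
{
    printf("%u\n", useful_fetch_bytes(0x100F)); /* target is last byte of chunk: 1 */
    printf("%u\n", useful_fetch_bytes(0x1010)); /* target is chunk-aligned: 16 */
    return 0;
}
```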

Even with fixed-length instructions, fetching multiple instructions per cycle raises similar issues. Some implementations (e.g., MIPS R10000) fetch as many instructions as can be decoded, even if they are not aligned, as long as the group of instructions does not cross a cache-line boundary. (I seem to remember that one RISC implementation used two banks of Icache tags to allow a fetch to cross a cache-block boundary, but not a page boundary.) Other implementations (like POWER4) fetch aligned chunks of code even for a branch that targets the last instruction in such a chunk. (POWER4 used 32-byte chunks containing 8 instructions, but no more than five instructions could be decoded per cycle. This excess fetch width can be used to save energy through cycles in which no fetch is performed and gives the Icache spare cycles to fill a cache line after a miss while having only a single read/write port.)

There are two strategies for decoding several instructions per cycle: speculatively decode in parallel, or wait for the lengths to be determined and use that information to parse the instruction stream into separate instructions. For an ISA such as IBM zArchitecture (a descendant of S/360), the length in 16-bit parcels is trivially determined by two bits in the first parcel, so waiting for the lengths to be determined makes more sense. (RISC-V's slightly more complex length-encoding mechanism would still be friendly to non-speculative decoding.) For encodings such as microMIPS or Thumb2, which have only two lengths determined by the major opcode and in which instruction encodings of different lengths differ significantly, non-speculative decoding may be preferable, especially given the likely narrow decode and the emphasis on energy efficiency, although with only two lengths speculation may be reasonable at small decode widths.
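
As an illustration of the determine-lengths-first approach, here is a hedged sketch in C (my own toy code, not from the answer) showing how the instruction length falls out of the first few bits in z/Architecture and in base RISC-V:

```c
#include <stdint.h>
#include <stdio.h>

/* z/Architecture: the two most significant bits of the first halfword
 * encode the length: 00 -> 2 bytes, 01 or 10 -> 4 bytes, 11 -> 6 bytes. */
static unsigned zarch_length_bytes(uint8_t first_byte)
{
    switch (first_byte >> 6) {
    case 0:  return 2;
    case 3:  return 6;
    default: return 4;
    }
}

/* Base RISC-V: if the two low bits of the first 16-bit parcel are not
 * both 1, the instruction is a 16-bit compressed one; otherwise the
 * standard encodings are 32 bits long (longer formats are reserved). */
static unsigned riscv_length_bytes(uint16_t first_parcel)
{
    return ((first_parcel & 0x3) != 0x3) ? 2 : 4;
}

int main(void)
{
    printf("%u\n", zarch_length_bytes(0x18));   /* top bits 00 -> 2 */
    printf("%u\n", zarch_length_bytes(0xC0));   /* top bits 11 -> 6 */
    printf("%u\n", riscv_length_bytes(0x4501)); /* low bits 01 -> 2 */
    printf("%u\n", riscv_length_bytes(0x0513)); /* low bits 11 -> 4 */
    return 0;
}
```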

For x86, one strategy used by AMD to avoid spending excessive energy on decoding is to keep marker bits in the instruction cache that indicate which byte ends an instruction. With such marker bits it is easy to find the start of each instruction. This method has the disadvantage of adding to instruction cache miss latency (instructions must be predecoded), and it still requires the decoders to check that the lengths are correct (e.g., if a branch jumps into what was previously the middle of an instruction).
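
Here is a hedged sketch of that marker-bit idea (C, my own simplified model; the actual layout of AMD's predecode information is not claimed to look like this). Assume one bit per cached byte, set at predecode time when that byte ends an instruction; finding the instruction starts within a fetched chunk then becomes a simple scan:

```c
#include <stdint.h>
#include <stdio.h>

/* end_bits: bit i is set if byte i of the 16-byte chunk is the last byte
 * of an instruction (filled in when the line was brought into the Icache).
 * Writes the offsets of instruction starts into starts[] and returns how
 * many instructions begin at or after entry_offset within the chunk. */
static int find_starts(uint16_t end_bits, int entry_offset, int starts[16])
{
    int n = 0;
    int next_start = entry_offset;      /* first instruction begins here */
    for (int i = entry_offset; i < 16; i++) {
        if (i == next_start)
            starts[n++] = i;
        if (end_bits & (1u << i))       /* instruction ends at byte i,  */
            next_start = i + 1;         /* so the next one starts at i+1 */
    }
    return n;
}

int main(void)
{
    /* Example: instructions end at bytes 2, 3, 8 and 15 of the chunk. */
    int starts[16];
    int n = find_starts(0x810C, 0, starts);
    for (int i = 0; i < n; i++)
        printf("instruction %d starts at offset %d\n", i, starts[i]);
    return 0;
}
```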

Intel seems to prefer speculative parallel decoding. Since the length of an earlier instruction in the chunk being decoded becomes available after only a short delay, the second and later decoders may not need to fully decode the instruction at every possible starting point.

Since x86 instructions can be relatively complex, there are often decode-template restrictions, and at least one earlier design limited the number of prefixes that could be used while maintaining full decode bandwidth. For example, Haswell restricts the second through fourth instructions decoded in a cycle to generating only one μop each, while the first instruction can decode into up to four μops (with longer μop sequences handled by the microcode engine). In essence, this is an optimization for the common case (relatively simple instructions) at the expense of the less common case.

In later performance-oriented x86 designs, Intel has used a μop cache, which stores instructions in an already decoded format, avoiding the template and fetch-width limits and reducing the energy spent on decoding.


The first bytes of each instruction indicate its length. If things were simple, the first byte would give the length, but there are prefixes indicating that the following byte is the real instruction, plus variable-length suffixes that hold the instruction's operands.

The real question is: since a modern out-of-order processor decodes 3 or 4 instructions per cycle, how does it know where the 2nd, 3rd, ... instructions begin?

The answer is that it decodes all possible starting points within the current 16-byte line of code in parallel, brute-force style. I am fairly sure the source of this remark/hunch is Agner Fog, but I cannot find the link. I googled "Agner Fog speculative instruction decoding", but apparently he spends his time speculating about a great many things related to instruction decoding.
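
Here is a hedged sketch in C of that brute-force scheme (my own simplified model, with a toy length decoder standing in for real x86 length decoding, which is far more involved): every offset of a 16-byte window is decoded speculatively, and only the results that fall on the actual instruction chain are kept:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for a real length decoder: here the low two bits of the
 * first byte give the length (1-4 bytes). This is only a placeholder. */
static unsigned decode_length_at(const uint8_t *p)
{
    return (*p & 0x3u) + 1u;
}

/* "Decode" at every offset of the 16-byte window (done in parallel in
 * hardware), then walk the chain from the known entry offset and keep
 * only the decodes that actually lie on it; the rest are thrown away. */
static int pick_real_instructions(const uint8_t window[16], int entry,
                                  int starts[16])
{
    unsigned len_at[16];
    for (int i = 0; i < 16; i++)
        len_at[i] = decode_length_at(&window[i]);

    int n = 0;
    for (int off = entry; off < 16; off += (int)len_at[off])
        starts[n++] = off;
    return n;
}

int main(void)
{
    uint8_t window[16] = { 0x02, 0, 0, 0x01, 0, 0x03, 0, 0,
                           0, 0x00, 0x01, 0, 0x00, 0x00, 0x00, 0x00 };
    int starts[16];
    int n = pick_real_instructions(window, 0, starts);
    for (int i = 0; i < n; i++)
        printf("instruction %d starts at offset %d\n", i, starts[i]);
    return 0;
}
```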


Expanding on Pascal's answer: on the x86 architecture, the very first byte indicates which category the instruction to be decoded belongs to:

  • a 1-byte instruction, which means it has already been read in full and can be processed further,

  • a 1-byte opcode followed by additional bytes (the so-called ModRM and SIB bytes) indicating which operands follow (registers, memory addresses) and how they are addressed,

  • an instruction prefix, which can:

    • change the meaning of the instruction (repetition - REP, locking semantics - LOCK),
    • indicate that the following bytes encode an instruction introduced in later iterations of the original 8086, extend the size of its operands to 32 or 64 bits, or redefine the opcode entirely.

In addition, depending on the mode in which the processor operates, some prefixes may or may not be valid: for example, the REX and VEX prefixes were introduced for 64-bit and vector instructions respectively, and REX is interpreted as a prefix only in 64-bit mode. Because of its format, REX reuses the encodings of a number of instructions from the original instruction set, which can therefore no longer be used in 64-bit mode (I believe the VEX prefix works in a similar way, although I don't know the details). Its fields select the operand size of the following instruction or give access to the additional registers that are available only in 64-bit mode (R8 to R15 and XMM8 to XMM15).
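
To make those categories concrete, here is a minimal, hedged sketch in C (my own simplification that covers only a handful of well-known prefix bytes, nowhere near a complete x86 decoder) classifying the first byte of an instruction stream in 64-bit mode:

```c
#include <stdint.h>
#include <stdio.h>

enum byte_kind { KIND_LEGACY_PREFIX, KIND_REX_PREFIX, KIND_ESCAPE, KIND_OPCODE };

/* Classify the first byte of an x86 instruction stream in 64-bit mode.
 * Only a few well-known byte values are handled; everything else is
 * treated as a one-byte opcode, which is a simplification. */
static enum byte_kind classify_first_byte(uint8_t b)
{
    switch (b) {
    case 0xF0:                         /* LOCK */
    case 0xF2: case 0xF3:              /* REPNE / REP */
    case 0x2E: case 0x36: case 0x3E:   /* segment overrides */
    case 0x26: case 0x64: case 0x65:
    case 0x66:                         /* operand-size override */
    case 0x67:                         /* address-size override */
        return KIND_LEGACY_PREFIX;
    case 0x0F:                         /* escape to the two-byte opcode map */
        return KIND_ESCAPE;
    default:
        if (b >= 0x40 && b <= 0x4F)    /* REX.* (a prefix only in 64-bit mode) */
            return KIND_REX_PREFIX;
        return KIND_OPCODE;            /* one-byte opcode, may need ModRM/SIB */
    }
}

int main(void)
{
    /* F3 48 A5 = REP MOVSQ: REP prefix, REX.W prefix, then the opcode. */
    uint8_t insn[] = { 0xF3, 0x48, 0xA5 };
    for (unsigned i = 0; i < sizeof insn; i++)
        printf("byte %02X -> kind %d\n", insn[i], classify_first_byte(insn[i]));
    return 0;
}
```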

If you study the internal structure of the opcodes, you will notice that certain bits consistently indicate which category an instruction belongs to, which allows reasonably fast decoding.

VAX is another architecture (popular from the late 70s to the late 80s) that uses variable-length instructions built on similar principles. In its first implementations the instructions were probably decoded sequentially, so the end of one instruction marked where the next one began in the following byte. As you may know, the company that created it, DEC, also created its polar opposite, the RISC Alpha CPU, which became one of the fastest processors of its time (if not the fastest), with fixed-length instructions; that choice was certainly made in response to the demands of the pipelined, superscalar techniques that were taking hold at the time.


Source: https://habr.com/ru/post/970957/
