First, the processor does not need to know how many bytes to fetch; it can simply fetch a convenient number of bytes, enough to provide the target throughput for a typical or average instruction length. Any extra bytes can be placed in a buffer and used in the next group of bytes to be decoded. There are trade-offs in the width and alignment of the fetch relative to the supported decode width and even relative to the width of later stages of the pipeline. Fetching more bytes than the average instruction length requires can reduce the effect of variation in instruction length, and of the loss in effective fetch bandwidth caused by taken control flow instructions.
(Taken control flow instructions may introduce a fetch bubble if the [predicted] target is not available until the cycle after the next fetch, and they reduce effective fetch bandwidth when targets are less aligned than the fetch width. For example, if fetch is aligned on 16-byte boundaries, as is common for high-performance x86, a taken branch that targets the 16th [last] byte in a chunk will effectively deliver only one byte of code, since the other 15 bytes are discarded.)
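The alignment loss described above is just arithmetic; a toy sketch (the 16-byte aligned chunk size is taken from the example above):

```python
CHUNK = 16  # fetch chunk size in bytes, fetch is chunk-aligned

def effective_bytes(target_addr: int) -> int:
    """Bytes of useful code delivered by the fetch containing target_addr.

    Bytes in the chunk before the branch target are discarded, so a target
    at the last byte of an aligned chunk yields only one useful byte.
    """
    offset = target_addr % CHUNK
    return CHUNK - offset

# A target at the start of a chunk delivers the full 16 bytes...
assert effective_bytes(0x1000) == 16
# ...while a target at the 16th (last) byte delivers only one.
assert effective_bytes(0x100F) == 1
```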
Even for fixed-length instructions, fetching multiple instructions per cycle raises similar issues. Some implementations (e.g., the MIPS R10000) will fetch as many instructions as can be decoded even if they are not aligned, as long as the group of instructions does not cross a cache line boundary. (I seem to recall that one RISC implementation used two banks of Icache tags to allow a fetch to cross a cache block boundary, but not a page boundary.) Other implementations (e.g., POWER4) fetch aligned chunks of code even for a branch targeting the last instruction in such a chunk. (POWER4 used 32-byte chunks containing eight instructions, but at most five instructions could pass decode per cycle. This excess fetch width can be exploited to save energy via cycles with no fetch, and to provide spare Icache cycles for filling a cache block after a miss while having only one read/write port to the Icache.)
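The R10000-style constraint can be sketched as follows (a toy model; the 4-wide decode and 32-byte line size are illustrative assumptions, not R10000 parameters):

```python
LINE = 32   # assumed cache line size in bytes
INSN = 4    # fixed instruction size in bytes for a RISC ISA
WIDTH = 4   # assumed maximum fetch/decode width in instructions

def fetch_group(pc: int) -> int:
    """Number of instructions fetched starting at pc: up to WIDTH, even if
    unaligned, but the group is cut short rather than crossing a cache
    line boundary."""
    room_in_line = (LINE - pc % LINE) // INSN  # instructions left in this line
    return min(WIDTH, room_in_line)

assert fetch_group(0x00) == 4  # aligned: a full group
assert fetch_group(0x18) == 2  # only two instructions remain before the line ends
```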
There are two strategies for decoding multiple instructions per cycle: speculative parallel decode, or waiting until the lengths have been determined and using that information to parse the instruction stream into separate instructions. For an ISA like IBM's zArchitecture (a descendant of S/360), the length in 16-bit parcels is trivially determined by two bits in the first parcel, so waiting for the lengths to be determined makes more sense. (The slightly more complex RISC-V length encoding is still friendly to non-speculative decode.) For encodings like microMIPS or Thumb2, which have only two lengths, determined by the major opcode, and in which the encodings of different-length instructions are substantially different, non-speculative decode may be preferred, especially given the likely narrow decode width and an emphasis on energy efficiency, although with only two lengths speculation may still be reasonable at narrow decode widths.
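Both length-determination schemes mentioned here are simple enough to write out. In zArchitecture the two most significant bits of the first 16-bit parcel give the length in parcels (00 → one, 01 or 10 → two, 11 → three); in RISC-V the instruction is a 16-bit compressed one whenever the two least significant bits of the first parcel are not 11. A sketch:

```python
def zarch_length_parcels(first_parcel: int) -> int:
    """zArchitecture: the two most significant bits of the first 16-bit
    parcel (the instruction-length code) give the length in parcels."""
    ilc = (first_parcel >> 14) & 0b11
    return {0b00: 1, 0b01: 2, 0b10: 2, 0b11: 3}[ilc]

def riscv_length_bytes(first_parcel: int) -> int:
    """RISC-V (base 16/32-bit encodings only): if the two low bits are not
    11, the instruction is 16-bit compressed; otherwise it is 32-bit.
    (Longer reserved encodings are ignored in this sketch.)"""
    return 2 if (first_parcel & 0b11) != 0b11 else 4

assert zarch_length_parcels(0x1800) == 1  # RR-format instruction: one parcel
assert riscv_length_bytes(0x4501) == 2    # low bits 01 -> compressed
```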
For x86, one strategy used by AMD to avoid excessive decode energy use is to keep marker bits in the instruction cache indicating which byte ends an instruction. With these marker bits it is easy to find the start of each instruction. This technique has the disadvantages that it adds to the latency of an instruction cache miss (the instructions must be predecoded), and that it still requires the decoders to check that the lengths are correct (for example, in case a jump is made into what was previously the middle of an instruction).
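A minimal sketch of parsing with such end-of-instruction marker bits (the list-of-bits representation is invented for illustration; real designs store predecode bits alongside the cache line):

```python
def instruction_starts(entry_point: int, end_markers: list) -> list:
    """Given per-byte end-of-instruction marker bits for a cached chunk,
    list the starting offsets of instructions from entry_point onward.
    Each marked byte ends an instruction, so the following byte begins
    one; no opcode examination is needed to find the boundaries."""
    starts = [entry_point]
    for i in range(entry_point, len(end_markers) - 1):
        if end_markers[i]:
            starts.append(i + 1)
    return starts

# Three instructions of lengths 2, 3, and 1 starting at offset 0:
markers = [0, 1, 0, 0, 1, 1]
assert instruction_starts(0, markers) == [0, 2, 5]
```

Note that a jump into offset 1 here would yield a parse the markers never validated, which is why the decoders must still verify the lengths as mentioned above.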
Intel seems to prefer speculative parallel decode. Since the length of the previous instruction in the chunk being decoded becomes available after only a modest delay, the second and later decoders may not need to fully decode the instruction at every possible starting point.
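The speculative approach can be sketched as length-finding at every candidate start offset "in parallel", followed by a quick serial select once each predecessor's length is known (a toy model; the `length_at` predecoder is a hypothetical stand-in for real x86 length decoding):

```python
def speculative_parallel_decode(chunk: bytes, length_at) -> list:
    """Return the instruction start offsets in chunk. A length is computed
    speculatively at every byte offset (in hardware these run in parallel),
    then a fast select walks the chain of real boundaries, discarding the
    speculative work done at offsets that were not instruction starts."""
    lengths = [length_at(chunk, off) for off in range(len(chunk))]
    starts, off = [], 0
    while off < len(chunk):
        starts.append(off)
        off += lengths[off]
    return starts

# Hypothetical toy ISA: the first byte of each instruction holds its length.
toy_length = lambda chunk, off: chunk[off]
assert speculative_parallel_decode(bytes([2, 0, 3, 0, 0, 1]), toy_length) == [0, 2, 5]
```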
Since x86 instructions can be relatively complex, there are also often decode template constraints, and at least one earlier design limited the number of prefixes that could be used while maintaining full decode bandwidth. For example, Haswell restricts the second through fourth instructions decoded to producing only one µop each, while the first instruction decoded can produce up to four µops (longer µop sequences use a microcode engine). In essence, this is an optimization for the common case (relatively simple instructions) at the cost of the less common case.
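Such a template constraint can be modeled as a grouping rule (a sketch only; the 4-1-1-1 pattern follows the Haswell description above, everything else is simplified and microcoded sequences are not modeled):

```python
def decode_groups(uop_counts, slots=4, first_max=4, other_max=1):
    """Group a stream of instructions (given as their µop counts) into
    decode cycles: only the first decoder in a group may emit up to
    first_max µops; the remaining slots take single-µop instructions."""
    groups, current = [], []
    for n in uop_counts:
        limit = first_max if not current else other_max
        if current and (n > limit or len(current) == slots):
            groups.append(current)   # start a new decode cycle
            current = []
        current.append(n)
    if current:
        groups.append(current)
    return groups

# A 2-µop instruction amid simple ones must wait for the first decoder:
assert decode_groups([1, 1, 2, 1, 1, 1, 1]) == [[1, 1], [2, 1, 1, 1], [1]]
```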
In more recent performance-oriented x86 designs, Intel has used a µop cache, which stores instructions in decoded format, avoiding the template and fetch-width constraints and reducing the energy use associated with decoding.
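The benefit can be sketched as a simple front-end dispatch (a toy model; real µop caches are organized in ways and lines indexed by fetch address, not per-instruction):

```python
class FrontEnd:
    """Toy model: a hit in the µop cache supplies already-decoded µops,
    bypassing the fetch/length-find/decode path and its template limits."""

    def __init__(self, decode_fn):
        self.uop_cache = {}       # instruction address -> decoded µops
        self.decode_fn = decode_fn

    def supply_uops(self, pc):
        if pc in self.uop_cache:      # hit: no x86 decode energy spent
            return self.uop_cache[pc]
        uops = self.decode_fn(pc)     # legacy decode path
        self.uop_cache[pc] = uops     # fill, so a loop hits next iteration
        return uops
```

In a loop, only the first iteration pays for decode; later iterations are served entirely from the decoded format.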