They do not extract it directly: because of the access checks, the out-of-bounds bytes are never returned by the processor and can never be seen by the attacker as ordinary data.
The attack must therefore "extract" the secret one bit at a time. First the CPU cache is prepared (flushing the lines that need to be flushed) and the branch predictor is "taught" that the if branch is taken, while the branch condition itself depends on uncached data. The processor then speculatively executes a couple of lines from the if body, including the out-of-bounds access (yielding a secret byte B), and immediately accesses some permitted, uncached array at an index that depends on one bit of B (B itself is never directly visible to the attacker). Finally, the attacker reads that permitted array at the index corresponding to the bit being zero, and times the read: if the read is fast, the data was already in the cache, so the bit of B was zero; if the read is (relatively) slow, the processor had to load the data into the cache just now, meaning it was not there before, so the bit of B was one.
For example: the data Cond depends on is not cached, ValidArray is not cached, and LargeEnough is large enough (at least a cache line, and in practice more, to defeat prefetching) that the CPU will not load both ValidArray[ valid-index + 0 ] and ValidArray[ valid-index + LargeEnough ] into its cache in one shot:
```
if ( Cond ) {
    // the next 2 lines are only speculatively executed
    V = SomeArray[ out-of-bounds-attacked-index ]
    Dummy = ValidArray[ valid-index + ( V & bit ) * LargeEnough ]
}
// the next code is always retired (executed, not only speculatively)
t1 = get_cpu_precise_time()
Dummy2 = ValidArray[ valid-index ]
diff = get_cpu_precise_time() - t1
if ( diff > SOME_CALCULATED_VALUE ) {
    // bit had its value (1, or 2, or 4, ... or 128)
} else {
    // bit was 0
}
```
where bit is checked sequentially: first 0x01, then 0x02, and so on up to 0x80. Measuring the time (the number of CPU cycles) the "next" code takes for each bit reveals the value of V:
- If ValidArray[ valid-index + 0 ] is in the cache, V & bit is 0;
- otherwise, V & bit equals bit.
All of this takes time: each bit requires preparing the CPU L1 cache again, probing the same bit several times to minimize timing errors, and so on.
Then the attacker needs to determine the correct "offset" for the out-of-bounds index in order to read a memory area of interest.
A clever attack, but not so easy to implement.