EOF's answer has a good explanation for the reduced floating-point operation counts with -ffast-math, so I'll just answer the other part.
The question doesn't say which processor microarchitecture was used, but at least it's tagged intel.
Intel CPUs have some prefetch logic in L1, and more complex logic for prefetching into L2 (from L3 or main memory). Each core has its own L2, while the outer levels of the cache hierarchy are shared, so L2 is the obvious place to put the main prefetch logic.
If you read more slowly than the memory-bandwidth limit allows, your loads will hit in L2, because the hardware prefetcher will already have pulled those lines into L2. If the prefetcher can't keep up, you get L2 cache misses.
Fewer, wider loads instead of many scalar loads also means that the miss % will look worse with vectors. (EOF's answer already made this point.) This effect doesn't explain the increase in the absolute number of L2 misses, though, only (part of) the change in miss %. It's still important to keep in mind when looking at the data, however.
From Intel's optimization guide (links in the x86 tag wiki), section 2.3.5.4: Data Prefetching:
Data Prefetch to the L2 and Last Level Cache
Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses ... When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
- The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
- Adjusts dynamically to the number of outstanding requests per core. If there are not many outstanding requests, the streamer prefetches further ahead. If there are many outstanding requests, it prefetches into the LLC only and less far ahead.
- When cache lines are far ahead, it prefetches into the last level cache only and not into the L2. This avoids replacement of useful cache lines in the L2 cache.
- Detects and maintains up to 32 streams of data accesses. For each 4K-byte page, one forward and one backward stream can be maintained.
This is from the Sandybridge section, but the Haswell and Skylake sections don't go into detail about changes to prefetching. They say "improved prefetching", but apparently it's the same basic design, just with better heuristics and/or better tuning of the existing heuristics, and that sort of thing.
Thanks, @HansPassant: his comment on this question made me think of the prefetchers failing to keep up.