While the hardware may be fine with unaligned accesses, a code implementation may rely on stealing the least significant 2 or 3 bits of a pointer (always zero for 32-bit or 64-bit aligned pointers, respectively).
For example, the interlocked SList functions in Win32 (such as InterlockedPushEntrySList) do not preserve the low 2 or 3 bits of the pointer, so any attempt to push or pop an unaligned object will not work as intended. Lock-free code typically encodes additional information into a pointer-sized object this way. In most cases, this is not a problem.
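To illustrate the bit-stealing idea, here is a minimal sketch in C. It assumes 4-byte aligned pointers, so only the low 2 bits are free. The names `TAG_MASK`, `tag_pack`, `tag_pointer`, and `tag_value` are hypothetical and do not come from any Win32 API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: pointers are at least 4-byte aligned, so the low 2 bits are zero. */
#define TAG_MASK ((uintptr_t)0x3)

/* Pack a small tag into the unused low bits of an aligned pointer. */
static uintptr_t tag_pack(void *p, unsigned tag) {
    uintptr_t bits = (uintptr_t)p;
    assert((bits & TAG_MASK) == 0); /* an unaligned pointer would lose its low bits */
    assert(tag <= TAG_MASK);        /* the tag must fit in the stolen bits */
    return bits | tag;
}

/* Recover the original pointer by clearing the tag bits. */
static void *tag_pointer(uintptr_t packed) {
    return (void *)(packed & ~TAG_MASK);
}

/* Recover the tag stored in the low bits. */
static unsigned tag_value(uintptr_t packed) {
    return (unsigned)(packed & TAG_MASK);
}
```

If an unaligned pointer were passed to `tag_pack`, its low bits would be indistinguishable from the tag, and `tag_pointer` would return a different address. That is the failure mode described above.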
Intel processors have always had excellent unaligned access performance. On Nehalem (Core i7) they went all the way: any misaligned access that falls entirely within a cache line has no penalty, and misaligned accesses that cross a cache line boundary incur an average penalty of 4.5 cycles, which is very small.
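To make the distinction between the two cases concrete, the sketch below checks whether an access crosses a cache line boundary. It assumes a 64-byte cache line, which is typical for x86 but is not stated above. `CACHE_LINE_SIZE` and `crosses_cache_line` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_SIZE 64u

/* Return true if an access of `size` bytes starting at `addr` spans two cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size) {
    if (size == 0) {
        return false; /* an empty access touches no line */
    }
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

Under this assumption, a 2-byte access at address 61 is misaligned but stays within one line, while a 4-byte access at address 62 spans two lines and would be the case that pays the penalty.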