Where is the L1 memory cache of Intel x86 processors documented?

I am trying to profile and optimize algorithms, and I would like to understand the specific effect of the caches on various processors. For recent Intel x86 processors (e.g. the Q9300), it is very hard to find detailed information about cache structure. In particular, most websites that post processor specifications (including Intel.com) do not mention the L1 cache at all. Is this because the L1 cache does not exist, or is this information considered unimportant for some reason? Are there any articles or discussions about the elimination of the L1 cache?

[edit] After running various tests and diagnostic programs (mostly the ones discussed in the answers below), I have concluded that my Q9300 seems to have a 32K L1 data cache. I still have not found a clear explanation of why this information is so hard to come by. My current working theory is that the details of L1 caching are now treated as trade secrets by Intel.

+50
performance intel cpu-cache cpu-architecture
Apr 04 '09 at 0:08
7 answers

It is nearly impossible to find specifications for Intel caches. When I taught a class on caches last year, I asked friends inside Intel (in the compiler group) and they could not find the specifications either.

But wait!!! Jed, bless his soul, tells us that on Linux systems you can squeeze a lot of information out of the kernel:

grep . /sys/devices/system/cpu/cpu0/cache/index*/* 

This will give you the associativity, set size, and a bunch of other information (but not latency). For example, I learned that although AMD advertises a 128K L1 cache, my AMD machine has a split I and D cache of 64K each.
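The same sysfs files can also be read from a program. Below is a minimal sketch in C, assuming the usual Linux layout under /sys/devices/system/cpu/cpuN/cache/indexM/ with the attribute files `level`, `type`, `size` and `ways_of_associativity`. The function names (`read_cache_attr`, `parse_cache_size`, `print_caches`) are made up for this example, and the size parser assumes the kernel reports values such as "32K".

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one attribute file such as "size" or "ways_of_associativity" for a
 * given CPU and cache index. Returns 0 on success, -1 if the file is missing
 * (for example on a non-Linux system or for an index that does not exist). */
int read_cache_attr(int cpu, int index, const char *attr,
                    char *buf, size_t len)
{
    char path[256];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cache/index%d/%s",
             cpu, index, attr);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/* Convert a sysfs size string such as "32K" or "8M" to bytes.
 * Returns -1 if the string does not start with a number. */
long parse_cache_size(const char *s)
{
    char *end;
    long value = strtol(s, &end, 10);
    if (end == s)
        return -1;
    if (*end == 'K')
        return value * 1024L;
    if (*end == 'M')
        return value * 1024L * 1024L;
    return value;
}

/* Print level, type, size and associativity for every cache of one CPU.
 * Stops at the first index that has no "level" file. */
void print_caches(int cpu)
{
    char level[32], type[32], size[32], ways[32];
    for (int i = 0;
         read_cache_attr(cpu, i, "level", level, sizeof level) == 0; i++) {
        if (read_cache_attr(cpu, i, "type", type, sizeof type) != 0 ||
            read_cache_attr(cpu, i, "size", size, sizeof size) != 0 ||
            read_cache_attr(cpu, i, "ways_of_associativity",
                            ways, sizeof ways) != 0)
            continue;   /* skip entries with missing attributes */
        printf("L%s %-12s %8ld bytes, %s-way\n",
               level, type, parse_cache_size(size), ways);
    }
}
```

Calling `print_caches(0)` from `main` should list the caches of the first CPU. On a system without these sysfs files it simply prints nothing.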




Two suggestions that are now mostly obsolete thanks to Jed:

  • AMD publishes much more information about its caches, so you can get at least some information about a modern cache. For example, last year's AMD L1 caches delivered two words per cycle (peak).

  • The open-source tool valgrind contains all sorts of cache models, and it is invaluable for profiling and understanding cache behavior. It comes with a very nice visualization tool, kcachegrind, which is part of the KDE SDK.




For example: as of Q3 2008, AMD K8 / K10 processors use 64-byte cache lines, with split L1I / L1D caches of 64 KB each. L1D is 2-way associative and exclusive with L2, with 3-cycle latency. The L2 cache is 16-way associative, and its latency is about 12 cycles.

AMD Bulldozer-family processors use a split L1 with a 16 KB 4-way associative L1D per cluster (2 per core).

Intel processors have kept L1 the same for a long time (from Pentium M to Haswell to Skylake and, presumably, many generations after that): split I and D caches of 32 kB each, with L1D being 8-way associative. Cache lines are 64 bytes, matching the burst transfer size of DDR DRAM. Load-use latency is ~4 cycles.
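To make those numbers concrete, here is a small sketch in C of how a 32 kB, 8-way cache with 64-byte lines breaks down into sets. It assumes the textbook indexing scheme, in which the set is chosen by the address bits just above the line offset. The names (`cache_sets`, `line_offset`, `set_index`) are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Geometry quoted above for Intel L1D: 32 KiB, 8-way, 64-byte lines. */
enum {
    CACHE_BYTES = 32 * 1024,
    WAYS        = 8,
    LINE_BYTES  = 64
};

/* Number of sets = total size / (ways * line size). */
size_t cache_sets(void)
{
    return CACHE_BYTES / (WAYS * LINE_BYTES);   /* 32768 / 512 = 64 */
}

/* Byte offset inside the cache line: the low 6 bits for 64-byte lines. */
size_t line_offset(uintptr_t addr)
{
    return addr % LINE_BYTES;
}

/* Set index: the next 6 bits of the address for 64 sets
 * (assumes simple modulo indexing, as in textbook caches). */
size_t set_index(uintptr_t addr)
{
    return (addr / LINE_BYTES) % cache_sets();
}
```

Under that assumption, addresses that are 4096 bytes apart (64 sets × 64 bytes) map to the same set, so more than 8 such lines cannot all stay in L1D at once. That is one reason power-of-two strides can perform badly.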

Also see the x86 tag wiki for links to better performance and microarchitecture data.

+61
Apr 04 '09 at 1:05

This Intel manual, the Intel® 64 and IA-32 Architectures Optimization Reference Manual, has a decent discussion of cache considerations.


Page 46, Section 2.2.5.1, Intel® 64 and IA-32 Architectures Optimization Reference Manual

Even MicroSlop is waking up to the need for more tools to monitor cache usage and performance, and has a GetLogicalProcessorInformation() function example (... while blazing new trails in creating ridiculously long function names in the process) that I think I will code up.

UPDATE I: Haswell increases cache load performance by 2X, from "Inside the Tock; Haswell's Architecture"

If there were any doubt about how critical it is to make the best possible use of the cache, this presentation by Cliff Click, formerly of Azul, should dispel it. In his words, "memory is the new disk!".

Haswell’s URS (Unified Reservation Station)

UPDATE II: Skylake's significantly improved cache performance specifications.

SkyLake Cache Features

+26
Nov 02 '13 at 21:38

You are looking at the consumer specifications, not the developer specifications. Here is the documentation you want. Cache sizes vary between sub-models of a processor family, so they are usually not listed in the IA-32 development manuals, but you can easily look them up on NewEgg, etc.

Edit: More specifically: Chapter 10 of Volume 3A (System Programming Guide), Chapter 7 of the Optimization Reference Manual, and possibly parts of the TLB page-caching manual, although I would assume that one is further out from the L1 than you care about.

+8
Apr 04 '09 at 1:06

I did some more investigating. There is a group at ETH Zurich that built a memory performance evaluation tool which may be able to get at least the size (and possibly also the associativity) of the L1 and L2 caches. The program works by running experiments and measuring the resulting throughput. A simplified version was used for the popular Bryant and O'Hallaron textbook.
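The general idea of measuring a cache by experiment can be sketched with a pointer-chasing loop. This is not the ETH tool or the textbook's code, only a rough illustration in C. It builds a random cycle through a buffer and times dependent loads for a given working-set size. The names `build_cycle` and `ns_per_access` are made up here, and `clock()` is used for portability even though its resolution is coarse.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with a random single-cycle permutation (Sattolo's algorithm),
 * so that following next[] visits every slot once before returning to 0. */
void build_cycle(size_t *next, size_t n)
{
    if (n == 0)
        return;
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* j in [0, i) */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Average time per dependent load, in nanoseconds, for a working set of
 * roughly `bytes` bytes. Returns -1.0 on bad arguments or failed allocation. */
double ns_per_access(size_t bytes, size_t steps)
{
    size_t n = bytes / sizeof(size_t);
    if (n < 2 || steps == 0)
        return -1.0;
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1.0;
    build_cycle(next, n);

    size_t pos = 0;
    for (size_t i = 0; i < n; i++)   /* warm-up pass to load the working set */
        pos = next[pos];

    clock_t start = clock();
    for (size_t i = 0; i < steps; i++)
        pos = next[pos];             /* each load depends on the previous one */
    clock_t stop = clock();

    volatile size_t sink = pos;      /* keep the loop from being optimized away */
    (void)sink;
    free(next);
    return 1e9 * (double)(stop - start) / CLOCKS_PER_SEC / (double)steps;
}
```

Calling `ns_per_access` for working sets that double from a few KB up to tens of MB, with a large step count, and looking for the sizes where the time per access jumps should give a rough estimate of the cache capacities. Treat the results as approximate, since prefetching, TLB misses and timer resolution all affect them.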

+8
Apr 04 '09 at 19:03

L1 caches do exist on these platforms. This will almost certainly remain true until memory and front-side bus speeds exceed the speed of the CPU, which is very likely a long way off.

On Windows, you can use GetLogicalProcessorInformation to get some cache information (size, line size, associativity, etc.). The Ex version on Win7 gives even more data, such as which cores share a given cache. CPU-Z also provides this information.

+2
Apr 04 '09 at 0:11

Locality of reference has a big impact on the performance of some algorithms; the size and speed of the L1, L2 (and, on later CPUs, L3) caches obviously play a big role in this. Matrix multiplication is one such algorithm.
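As a small illustration of the matrix multiplication case, here is a sketch in C of the same product written with two loop orders. The function names are made up for this example, and the matrices are assumed to be square, row-major arrays of `double`. How much the second version helps depends on the matrix size and the machine, and is not measured here.

```c
#include <stddef.h>

/* C = A * B for n x n row-major matrices, naive i-j-k order.
 * The inner loop walks B down a column, touching a new cache line
 * on every step once n is large. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Same product in i-k-j order. The inner loop now walks rows of B and C
 * sequentially, so consecutive iterations reuse the same cache lines. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both functions compute the same result; only the memory access pattern differs. Timing them on matrices that do not fit in L1 or L2 is a simple way to see the effect of locality.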

+2
Apr 04 '09 at 0:59

Intel Manual Vol. 2 specifies the following formula for computing the cache size:

This cache size in bytes
= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)
= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)

where Ways, Partitions, Line_Size and Sets are queried using cpuid with eax set to 0x04 and ecx set to the index of the cache to query.
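The formula can also be written directly in C, which makes the bit fields easier to check. This is only a sketch: `cache_size_from_regs` is a made-up name, and it takes the raw EBX and ECX values that cpuid returns for leaf 0x04 as arguments instead of executing the instruction itself.

```c
#include <stdint.h>

/* Cache size in bytes from the EBX and ECX values returned by CPUID leaf 4.
 * Each field stores the real value minus one, hence the "+ 1" on every term. */
uint32_t cache_size_from_regs(uint32_t ebx, uint32_t ecx)
{
    uint32_t line_size  = (ebx & 0xFFF) + 1;          /* EBX[11:0]  */
    uint32_t partitions = ((ebx >> 12) & 0x3FF) + 1;  /* EBX[21:12] */
    uint32_t ways       = ((ebx >> 22) & 0x3FF) + 1;  /* EBX[31:22] */
    uint32_t sets       = ecx + 1;                    /* ECX[31:0]  */
    return ways * partitions * line_size * sets;
}
```

As a worked example with hypothetical register values: an 8-way cache with 1 partition, 64-byte lines and 64 sets is encoded as Ways = 7, Partitions = 0, Line_Size = 63, Sets = 63, and the formula gives 8 * 1 * 64 * 64 = 32768 bytes.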

Given the following header file declaration

x86_cache_size.h :

 unsigned int get_cache_line_size(unsigned int cache_level); 

The implementation is as follows:

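    ; NASM syntax and the System V AMD64 calling convention are assumed.
    ; Despite its name, the function returns the total cache size in bytes.
    global get_cache_line_size

    ; 1st argument (edi): the cache level, passed to CPUID as the leaf 4 subleaf index
    get_cache_line_size:
        push rbx
        ; set the subleaf argument to be used with the CPUID instruction
        mov ecx, edi
        ; set the CPUID leaf
        mov eax, 0x04
        cpuid

        ; cache line size: EBX[11:0] + 1
        mov eax, ebx
        and eax, 0xfff
        inc eax

        ; partitions: EBX[21:12] + 1
        shr ebx, 12
        mov edx, ebx
        and edx, 0x3ff
        inc edx
        mul edx

        ; ways of associativity: EBX[31:22] + 1
        shr ebx, 10
        mov edx, ebx
        and edx, 0x3ff
        inc edx
        mul edx

        ; number of sets: ECX + 1
        inc ecx
        mul ecx

        pop rbx
        ret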

It can be used as follows (the output on my machine is shown in the comment):

    #include <stdio.h>
    #include "x86_cache_size.h"

    int main(void) {
        unsigned int L1_cache_size = get_cache_line_size(1);
        unsigned int L2_cache_size = get_cache_line_size(2);
        unsigned int L3_cache_size = get_cache_line_size(3);
        // L1 size = 32768, L2 size = 262144, L3 size = 8388608
        printf("L1 size = %u, L2 size = %u, L3 size = %u\n",
               L1_cache_size, L2_cache_size, L3_cache_size);
        return 0;
    }
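One caveat: the value passed in ecx is a subleaf index, not strictly a cache level. On many Intel processors index 0 is the L1 data cache and index 1 is the L1 instruction cache, but that ordering is an assumption worth checking on your own machine. The sketch below lists what each index reports, using the level and type fields in EAX. It relies on `__get_cpuid_count` from the GCC/Clang `<cpuid.h>` header. The helper names are made up for this example.

```c
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* EAX[4:0]: 0 = no more caches, 1 = data, 2 = instruction, 3 = unified. */
unsigned cache_type(uint32_t eax)  { return eax & 0x1F; }

/* EAX[7:5]: cache level, starting at 1. */
unsigned cache_level(uint32_t eax) { return (eax >> 5) & 0x7; }

/* Print the level and type reported by each CPUID leaf 4 subleaf.
 * Returns the number of caches found, or -1 if leaf 4 is unavailable. */
int list_cache_subleaves(void)
{
#if defined(__x86_64__) || defined(__i386__)
    static const char *names[] = { "none", "data", "instruction", "unified" };
    int count = 0;
    for (unsigned sub = 0; sub < 16; sub++) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx))
            return -1;               /* leaf 4 not supported by this CPU */
        unsigned type = cache_type(eax);
        if (type == 0)
            break;                   /* no more caches to enumerate */
        printf("subleaf %u: L%u %s\n", sub, cache_level(eax),
               type < 4 ? names[type] : "unknown");
        count++;
    }
    return count;
#else
    return -1;                       /* not an x86 target */
#endif
}
```

Calling `list_cache_subleaves()` from `main` shows which index corresponds to which cache, so the right index can be passed to the assembly routine above. On processors that do not report caches through leaf 4 it may find no caches at all.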
+1
Aug 17 '19 at 12:53


