Why isn't thread-local storage implemented with page table mappings?

I was hoping to use the C++11 thread_local for a per-thread boolean flag that is accessed very frequently.

However, most compilers seem to implement thread-local storage with a table that maps integer IDs (slots) to the variable's address in the current thread. This lookup would happen inside a performance-critical code path, so I have some concerns about its cost.

The way I expected thread-local storage to be implemented is by allocating virtual memory ranges that are backed by different physical pages depending on the thread. That way, accessing the flag would cost the same as any other memory access, since the MMU takes care of the mapping.

Why don't any of the mainstream compilers take advantage of page table mappings in this way?

I believe that I can implement my own “thread-specific page” using mmap on Linux and VirtualAlloc on Win32, but this seems like a fairly common use case. If anyone knows about existing or better solutions, please point me to them.

I also considered a std::atomic<std::thread::id> inside each object to represent the active thread, but profiling shows that the check std::this_thread::get_id() == active_thread is quite expensive.

+5
6 answers

On Linux/x86-64, thread-local storage is implemented through the special segment register %fs (see the x86-64 ABI, p. 23 ...)

So the following code (I'm using C with the GCC extension __thread, but it behaves the same as C++11 thread_local)

    __thread int x;
    int f(void) { return x; }

is compiled (with gcc -O -fverbose-asm -S) to:

            .text
    .Ltext0:
            .globl  f
            .type   f, @function
    f:
    .LFB0:
            .file 1 "tl.c"
            .loc 1 3 0
            .cfi_startproc
            .loc 1 3 0
            movl    %fs:x@tpoff, %eax       # x,
            ret
            .cfi_endproc
    .LFE0:
            .size   f, .-f
            .globl  x
            .section        .tbss,"awT",@nobits
            .align 4
            .type   x, @object
            .size   x, 4
    x:
            .zero   4

So, to address your concern, access to TLS is very fast on Linux/x86-64. It is not really implemented as a table (instead, the kernel and the runtime set up the %fs segment register to point to a thread-specific memory area, and the compiler and linker manage the offset within it). The older pthread_getspecific did go through a table, but it is almost useless once you have TLS.

BTW, by definition, all threads in the same process share the same address space in virtual memory, since a process has its own single address space (see /proc/self/maps etc.; see proc(5) for more about /proc/, and also mmap(2); the C++11 thread library is based on pthreads, which are implemented using clone(2)). So "thread-specific memory mapping" is a contradiction: as soon as a task (the thing scheduled by the kernel) has its own address space, it is called a process, not a thread. The defining characteristic of threads in the same process is that they share a common address space (and some other entities, such as file descriptors).
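As a rough illustration of the flag from the question, here is a minimal sketch using C++11 thread_local. The names in_critical_section and worker_sees_own_flag are made up for this example. With a constant initializer like this one, GCC on Linux/x86-64 would typically be expected to emit a single %fs-relative load or store in the main executable, though shared libraries and dynamically initialized variables can be more expensive.

```cpp
#include <thread>

// Hypothetical per-thread flag; every thread gets its own copy,
// initialized to false.
thread_local bool in_critical_section = false;

// Runs a worker that sets its own copy of the flag and reports
// whether it first saw `false` and then saw its own `true`.
bool worker_sees_own_flag() {
    bool seen_before = true;
    bool seen_after = false;
    std::thread t([&] {
        seen_before = in_critical_section;  // fresh copy in this thread
        in_critical_section = true;         // affects only this thread
        seen_after = in_critical_section;
    });
    t.join();
    return !seen_before && seen_after;
}
```

The calling thread's copy of in_critical_section is untouched by the worker, so it still reads false after worker_sees_own_flag() returns.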

+6

Mainstream operating systems, such as Linux, OS X and Windows, make the page mappings a per-process property, not a per-thread one. There is a very good reason for this: the page mapping tables are stored in RAM, and reading them to calculate the effective physical address would be excessively expensive if it had to be done for every instruction.

So the processor doesn't do that. It keeps a copy of recently used mapping table entries in fast memory close to the execution core. This is called the TLB cache.

Invalidating the TLB cache is very expensive: it has to be reloaded from RAM, with low odds that the data is available in one of the memory caches. The processor can stall for thousands of cycles when this has to happen.

So your proposed scheme is in fact likely to be very inefficient, even assuming an operating system would support it; using an indexed lookup is cheaper. Processors are very good at simple math, which happens at gigahertz speeds, while accessing memory happens at megahertz speeds.

+3

The suggestion doesn't work, because it would prevent other threads from accessing your thread_local variables through a pointer. Those threads would end up accessing their own copy of the variable instead.

Say, for example, that you have a main thread and 100 worker threads. Each worker thread passes a pointer to its own thread_local variable back to the main thread. The main thread now has 100 pointers to those 100 variables. If TLS memory were mapped through the page tables as suggested, the main thread would instead have 100 identical pointers to a single, uninitialized variable in the main thread's own TLS, which is certainly not what was intended!
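The scenario above can be sketched as follows. This is only an illustration: counter and count_distinct_tls_addresses are hypothetical names, and the workers are kept alive with a simple spin-wait so that their TLS blocks cannot be reused while the addresses are compared.

```cpp
#include <atomic>
#include <cstddef>
#include <set>
#include <thread>
#include <vector>

thread_local int counter = 0;  // hypothetical per-thread variable

// Starts n workers that each publish the address of their own `counter`.
// Returns the number of distinct addresses seen while all workers are
// alive, or 0 if a pointer did not lead to that worker's value.
std::size_t count_distinct_tls_addresses(std::size_t n) {
    std::vector<int*> addresses(n, nullptr);
    std::atomic<std::size_t> ready{0};
    std::atomic<bool> release{false};
    std::vector<std::thread> workers;

    for (std::size_t i = 0; i < n; ++i) {
        workers.emplace_back([&, i] {
            counter = static_cast<int>(i) + 1;  // this worker's own copy
            addresses[i] = &counter;            // hand the pointer out
            ready.fetch_add(1);
            while (!release.load()) std::this_thread::yield();
        });
    }
    while (ready.load() < n) std::this_thread::yield();

    // The calling thread dereferences pointers into other threads' TLS.
    bool values_ok = true;
    for (std::size_t i = 0; i < n; ++i) {
        values_ok = values_ok && (*addresses[i] == static_cast<int>(i) + 1);
    }
    std::set<int*> distinct(addresses.begin(), addresses.end());

    release.store(true);
    for (auto& w : workers) w.join();
    return values_ok ? distinct.size() : 0;
}
```

With the usual TLS implementation, count_distinct_tls_addresses(100) yields 100 different addresses in the shared address space. Under per-thread page mappings, all of them would have been the same virtual address.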

+3

Memory mappings are not per thread, but per process. All threads share the same mappings.

The kernel could offer per-thread mappings, but currently it does not.

+2

You are using C++. Have a thread object per thread, with the thread's work procedure and all/most of the functions it calls being member functions of that object. Then you can keep the thread ID, or any other thread-specific data, as member variables.
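A minimal sketch of that idea, assuming a hypothetical Worker class (none of these names come from the question): the per-thread flag and thread ID live in the object, and member functions reach them through this rather than through TLS.

```cpp
#include <thread>

// Hypothetical worker object: thread-specific state is kept in members.
class Worker {
public:
    void start() { thread_ = std::thread(&Worker::run, this); }
    void join() { thread_.join(); }
    int steps_done() const { return steps_done_; }

private:
    // The thread's work procedure.
    void run() {
        id_ = std::this_thread::get_id();  // stored once, no repeated lookup
        active_ = true;
        for (int i = 0; i < 3; ++i) step();
        active_ = false;
    }

    // Called only from run(), so it reads the flag as a plain member.
    void step() {
        if (active_) ++steps_done_;
    }

    std::thread thread_;
    std::thread::id id_;
    bool active_ = false;
    int steps_done_ = 0;
};

// Runs one worker to completion and returns how many steps it performed.
int run_worker_steps() {
    Worker w;
    w.start();
    w.join();  // join() makes the worker's writes visible to the caller
    return w.steps_done();
}
```

This only helps when the code that needs the flag can be given a pointer or reference to the worker object. Code that cannot be reached that way would still need TLS.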

0

One of the current problems is hardware limitations (although I'm sure this predates the situations below).

On SPARC T5 processors, each hardware thread has its own MMU, but shares a TLB with seven sibling threads on the same core, and that TLB can come under very heavy pressure.

On MIPS, differing memory mappings for threads can force them to be serialized onto a single virtual thread execution context. This is because the hardware thread contexts share the MMU. The kernel already cannot run several processes on neighboring thread contexts, and per-thread memory mappings would be subject to the same restriction.

0

Source: https://habr.com/ru/post/1204972/

