How to reduce or eliminate __tls_init calls?

I use a third-party library that relies on thread_local. As a result, my program calls __tls_init() repeatedly, even in each iteration of some loops (I have not checked all of them), even though the thread_local variables were unconditionally initialized by an earlier call within the same function (and in fact near the beginning of the entire program).

The first instructions of __tls_init() on my x86_64 machine are:

cmpb    $0, %fs:__tls_guard@tpoff
je      .L530
ret
.L530:
pushq   %rbp
pushq   %rbx
subq    (some stack space), %rsp
movb    $1, %fs:__tls_guard@tpoff

So the first time this is called in each thread, the value at %fs:__tls_guard@tpoff is set to 1, and subsequent calls return immediately. But this still means paying the full overhead of a call every time a thread_local variable is accessed, right?
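To make the guard logic concrete, here is a rough C++ model of what the assembly above appears to do. The names tls_guard, init_runs, tls_init_model and access_model are hypothetical stand-ins, not the compiler's real symbols; this is only a sketch of the pattern, assuming the guard is a per-thread flag that is checked on every access.

```cpp
// Hypothetical stand-ins for the compiler-generated symbols.
static thread_local bool tls_guard = false;  // plays the role of __tls_guard
static thread_local int  init_runs = 0;      // counts real initializations

// Plays the role of __tls_init: cheap after the first call, but still a call.
void tls_init_model() {
    if (tls_guard) return;   // fast path: guard already set, return at once
    tls_guard = true;        // slow path: mark this thread as initialized
    ++init_runs;             // constructors of thread_local objects would run here
}

// Models one access to a thread_local variable: the guard check comes first.
int access_model() {
    tls_init_model();
    return init_runs;
}
```

Calling access_model() any number of times on one thread runs the slow path once, but every call still pays for the call and the guard comparison, which is the overhead in question.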

Note that this is a statically linked (actually generated!) function, so the compiler "knows" that it starts with this condition, and it is quite conceivable that flow analysis could discover that the function does not need to be called more than once. But it does not.

Is it possible to get rid of the superfluous call __tls_init instructions, or at least to stop the compiler from emitting them in time-critical sections?

An example from a real compilation (-O3):

pushq   %r13
movq    %rdi, %r13
pushq   %r12
pushq   %rbp
pushq   %rbx
movq    %rsi, %rbx
subq    $88, %rsp
call    __tls_init              // always gets called
movq    (%rbx), %rdi
call    <some local function>
movq    8(%rax), %r12
subq    (%rax), %r12
movq    %rax, %rbp
sarq    $4, %r12
cmpq    $1, %r12
jbe .L6512
leaq    -2(%r12), %rax
movq    $0, (%rsp)
leaq    48(%rsp), %rbx
movq    %rax, 8(%rsp)
.L6506:
call    __tls_init              // needless and called potentially very many times!
movq    %rsp, %rsi
movq    %rsp, %rdi
addq    $8, %rbx
call    <some other local function>
movq    %rax, -8(%rbx)
leaq    80(%rsp), %rax
cmpq    %rbx, %rax
jne .L6506                      // loop back

Update: the source code for the above is too complicated. Here is an MWE:

void external(int);

struct X {
  volatile int a;   // to prevent optimizing to a constexpr
  X() { a = 5; }    // to enforce calling a c-tor for thread_local
  void f() { external(a); } // to prevent disregarding the value of a
};

thread_local X x;

void f() {
  x.f();
  for(int j = 0; j < 10; j++)
    x.f();  // x is totally initialized now
}

In the generated code, fs:__tls_guard@tpoff is still tested (0 before initialization, 1 after), and the branch at .L4 calls __tls_init.

This was checked with g++ and Clang (see Compiler Explorer).



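One workaround is to take the address of the thread_local variable once and then work through that pointer, so the TLS initialization check happens only once: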

void f() {
  auto ptr = &x;
  ptr->f();
  for(int j = 0; j < 10; j++)
    ptr->f(); 
}
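For anyone who wants to try this out, below is a self-contained variant of the same idea using a reference instead of a pointer. The counters ctor_runs and sink, the body of external(), and the name f_cached are assumptions added only so the snippet compiles on its own and the effect can be observed; they are not part of the original MWE.

```cpp
#include <atomic>

std::atomic<int> ctor_runs{0};  // hypothetical: counts constructor calls
std::atomic<int> sink{0};       // hypothetical: stand-in target for external()

void external(int v) { sink += v; }  // assumed body, the MWE only declares it

struct X {
    volatile int a;
    X() { a = 5; ++ctor_runs; }   // counts each per-thread construction
    void f() { external(a); }
};

thread_local X x;

void f_cached() {
    X& ref = x;   // single access through the TLS machinery
    ref.f();
    for (int j = 0; j < 10; j++)
        ref.f();  // goes through the cached reference, not through x
}
```

The reference (or pointer) is only meaningful on the thread that obtained it, so it should be taken again in each thread rather than shared between threads. Whether the compiler then drops every repeated guard check is worth confirming in the generated assembly for your own build.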

Source: https://habr.com/ru/post/1659065/

