How are user-level threads scheduled/created, and how are kernel-level threads created?

Sorry if this question is stupid. I tried to find an answer on the net for quite some time but could not, so I am asking here. I have been studying this topic, and I went through this link and this Linux Plumbers Conference 2013 talk on kernel-level and user-level threads. As I understand it, using pthreads creates threads in user space, and the kernel does not know about them and sees only a single process, unaware of how many threads are inside. In that case:

  • Who decides how these user threads are scheduled within the time slice the process receives, given that the kernel sees it as a single process and knows nothing about the threads, and how is that scheduling done?
  • If pthreads creates user-level threads, then how are kernel-level (OS-level) threads created from user-space programs?
  • According to the link above, the kernel provides system calls for creating and managing threads. So does the clone() system call create a kernel-level thread or a user-level thread?
    • If it creates a kernel-level thread, then strace on a simple pthreads program also shows clone() being used at runtime, so why would that be considered a user-level thread?
    • If it does not create a kernel-level thread, then how are kernel threads created from user-space programs?
  • According to this link: "For each thread, a full thread control block (TCB) is needed to maintain information about the thread. As a result there is significant overhead and increased kernel complexity." So with kernel-level threads, is only the heap shared, with everything else separate per thread?

Edit:

I asked about how user-level threads are created and scheduled because here there is a reference to the Many-to-One model, where many user-level threads map to one kernel-level thread and thread management is done in user space by the thread library. I have only seen references to the use of pthreads, but I don't know whether it creates user-level or kernel-level threads.

+20
c++ c multithreading linux linux-kernel
Aug 27 '16 at 19:49
3 answers

This answer builds on the top comments.

The documentation you were reading is generic (not Linux-specific) and somewhat dated. Moreover, it uses different terminology. That, I believe, is the source of the confusion. So, read on ...

What it calls a "user-level thread" is what I would call an "old-style" LWP thread. What it calls a "kernel-level thread" is what is called a native thread in Linux. Under Linux, what is called a "kernel thread" is something else entirely [see below].

Using pthreads creates threads in user space, and the kernel does not know about them and treats it as a single process, unaware of how many threads are inside.

This is how user-space threads were created before NPTL (the Native POSIX Threads Library). This is also what SunOS/Solaris called an LWP, a lightweight process.

There was one process that multiplexed itself to create the threads. IIRC, it was called the thread master process [or some such]. The kernel knew nothing about this. The kernel did not yet understand or provide support for threads.

But, because these "lightweight" threads were switched by code in a user-space thread library (a.k.a. the "lightweight process scheduler"), which was just an ordinary user program/process, context switches were very slow.

Also, before the advent of "native" threads, you might have 10 processes, each getting 10% of the CPU. If one of those processes was an LWP with 10 threads inside, those threads had to share that 10% and thus each got only 1% of the CPU.

All of this was replaced by native threads, which the kernel's scheduler knows about. That changeover happened some 10-15 years ago.

Now, in the example above, we have 20 threads/processes, each of which gets 5% of the CPU. And the context switches are much faster.

It is still possible to have an LWP system on top of a native thread, but now that is a design choice rather than a necessity.

Further, LWPs work fine if every thread "cooperates". That is, each thread's loop periodically makes an explicit call to a "context switch" function, voluntarily giving up its slot so another LWP can run.

However, the pre-NPTL implementation in glibc also had to [forcibly] preempt LWP threads (i.e. implement time slicing). I don't remember the exact mechanism used, but here's an example: the thread master had to set an alarm, go to sleep, wake up, and then send a signal to the active thread. The signal handler would effect the context switch. It was messy, ugly, and somewhat unreliable.

Joachim mentioned that the pthread_create function creates a kernel thread

It is [technically] incorrect to call that a kernel thread. pthread_create creates a native thread. This runs in user space and competes for time slices on an equal footing with processes. Once created, there is little difference between a thread and a process.

The main difference is that a process has its own unique address space. A thread, however, is a process that shares its address space with the other processes/threads that are members of the same thread group.

If it does not create a kernel-level thread, then how are kernel threads created from user programs?

Kernel threads are not user-space threads at all, whether NPTL, native, or otherwise. They are created by the kernel via the kernel_thread function. They run as part of the kernel and are not associated with any user-space program/process/thread. They have full access to the machine: devices, the MMU, etc. Kernel threads run at the highest privilege level, ring 0, and they run in the kernel's address space, not in the address space of any user process/thread.

A user-space program/process cannot create a kernel thread. Remember, it creates a native thread with pthread_create , which issues the clone syscall to do so.

Threads are useful for getting work done, even for the kernel, so it runs pieces of its own code in various threads. You can see these threads by running ps ax . Look and you will see kthreadd, ksoftirqd, kworker, rcu_sched, rcu_bh, watchdog, migration , etc. These are kernel threads, not programs/processes.




UPDATE:

You mentioned that the kernel does not know about user threads.

Remember that, as mentioned above, there are two “eras”.

(1) Before the kernel had thread support (circa 2004?). This used the thread master (which, here, I will call the LWP scheduler). The kernel just had the fork syscall.

(2) All kernels after that, which understand threads. There is no thread master; instead, we have pthreads and the clone syscall. Now, fork is implemented as clone . clone is similar to fork but takes some additional arguments. Notably, a flags argument and a child_stack argument.

More on this below ...

then how can user-level threads have separate stacks?

There is nothing "magic" about a processor stack. I will confine the discussion [mostly] to x86, but this applies to any architecture, even those that have no stack register at all (e.g. old IBM mainframes, such as the IBM System/370).

On x86, the stack pointer is %rsp . The x86 has push and pop instructions. We use them to save and restore things: push %rcx and [later] pop %rcx .

But suppose we had an x86 with no %rsp and no push/pop instructions. Could we still have a stack? Sure, by convention. We [as programmers] would simply agree that (say) %rbx is the stack pointer.

In that case, the "push" of %rcx would be [using AT&T assembler syntax]:

    subq    $8,%rbx
    movq    %rcx,0(%rbx)

And the "pop" of %rcx would be:

    movq    0(%rbx),%rcx
    addq    $8,%rbx

To keep things simple, I will switch to C-like pseudo-code. Here are the above push/pop in pseudo-code:

    // push %rcx
    %rbx -= 8;
    0(%rbx) = %rcx;

    // pop %rcx
    %rcx = 0(%rbx);
    %rbx += 8;
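The same convention-based stack can be written as real, runnable C, with an ordinary pointer playing the role of %rbx (the names here are mine, purely for illustration):

```c
#include <stdint.h>
#include <stdlib.h>

// a software "stack": just memory plus a pointer we agree to treat
// as the stack pointer; it grows downward, like the x86 stack
typedef struct {
    uint64_t *base;     // malloc'd area
    uint64_t *sp;       // our stand-in for %rbx
} swstack_t;

swstack_t swstack_create(size_t nwords) {
    swstack_t s;
    s.base = malloc(nwords * sizeof(uint64_t));
    s.sp = s.base + nwords;     // start at the top; pushes move downward
    return s;
}

// push: subq $8,%rbx / movq %rcx,0(%rbx)
void swstack_push(swstack_t *s, uint64_t val) {
    s->sp -= 1;
    *s->sp = val;
}

// pop: movq 0(%rbx),%rcx / addq $8,%rbx
uint64_t swstack_pop(swstack_t *s) {
    uint64_t val = *s->sp;
    s->sp += 1;
    return val;
}
```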



To create a thread, the LWP scheduler had to create a stack area, usually with malloc . It then had to save that pointer in a per-thread structure and then start the child LWP. The actual code is a bit tricky, so assume we have a function (e.g.) LWP_create that is similar to pthread_create :

    typedef void *(*LWP_func)(void *);

    // per-thread control
    typedef struct tsk tsk_t;
    struct tsk {
        tsk_t *tsk_next;
        tsk_t *tsk_prev;
        void *tsk_stack;                    // stack base
        u64 tsk_regsave[16];
    };

    // list of tasks
    typedef struct tsklist tsklist_t;
    struct tsklist {
        tsk_t *tsk_next;
        tsk_t *tsk_prev;
    };

    tsklist_t tsklist;                      // list of tasks
    tsk_t *tskcur;                          // current thread

    // LWP_switch -- switch from one task to another
    void
    LWP_switch(tsk_t *to)
    {
        // NOTE: we use (i.e. burn) register values as we do our work. in a
        // real implementation, we'd have to push/pop these in a special way.
        // so, just pretend that we do that ...

        // save all registers into tskcur->tsk_regsave
        tskcur->tsk_regsave[RAX] = %rax;
        // ...

        tskcur = to;

        // restore most registers from tskcur->tsk_regsave
        %rax = tskcur->tsk_regsave[RAX];
        // ...

        // set the stack pointer to the new task's stack
        %rsp = tskcur->tsk_regsave[RSP];

        // set the resume address for the task
        push(%rsp,tskcur->tsk_regsave[RIP]);

        // issue a "ret" instruction
        ret();
    }

    // LWP_create -- start a new LWP
    tsk_t *
    LWP_create(LWP_func start_routine,void *arg)
    {
        tsk_t *tsknew;

        // get a per-thread struct for the new task
        tsknew = calloc(1,sizeof(tsk_t));
        append_to_tsklist(tsknew);

        // get the new task's stack
        tsknew->tsk_stack = malloc(0x100000);
        tsknew->tsk_regsave[RSP] = tsknew->tsk_stack;

        // give the task its argument
        tsknew->tsk_regsave[RDI] = arg;

        // switch to the new task
        LWP_switch(tsknew);

        return tsknew;
    }

    // LWP_destroy -- destroy an LWP
    void
    LWP_destroy(tsk_t *tsk)
    {
        // free the task's stack
        free(tsk->tsk_stack);
        remove_from_tsklist(tsk);

        // free the per-thread struct for the dead task
        free(tsk);
    }



With a kernel that understands threads, we use pthread_create and clone , but we still have to create the new thread's stack. The kernel does not create/assign a stack for a new thread. The clone syscall takes a child_stack argument. So, pthread_create must allocate a stack for the new thread and pass that to clone :

    // pthread_create -- start a new native thread
    tsk_t *
    pthread_create(LWP_func start_routine,void *arg)
    {
        tsk_t *tsknew;

        // get a per-thread struct for the new task
        tsknew = calloc(1,sizeof(tsk_t));
        append_to_tsklist(tsknew);

        // get the new task's stack
        tsknew->tsk_stack = malloc(0x100000);

        // start up the thread
        clone(start_routine,tsknew->tsk_stack,CLONE_THREAD,arg);

        return tsknew;
    }

    // pthread_join -- wait for and destroy a native thread
    void
    pthread_join(tsk_t *tsk)
    {
        // wait for the thread to die ...

        // free the task's stack
        free(tsk->tsk_stack);
        remove_from_tsklist(tsk);

        // free the per-thread struct for the dead task
        free(tsk);
    }



Only the process's initial/main thread gets its stack assigned by the kernel, usually at a high memory address. So, if the process does not use threads, it normally just uses that preassigned stack.

But, if a thread is created, either an LWP or a native one, the creating process/thread must allocate the area for the new thread's stack beforehand, with malloc . Side note: using malloc is the usual way, but the thread creator could just as well use a big pool of global memory: char stack_area[MAXTASK][0x100000]; if it wanted to do it that way.

Even an ordinary program that does not use threads [of any kind] might want to "override" the default stack it was given.

Such a process could decide to use malloc and the assembler trickery above to switch to a much larger stack if it were doing a hugely recursive computation.

See my answer here about the difference between the user stack and the kernel stack.

+20
Aug 27 '16 at 21:16
source share

User-level threads are usually coroutines in one form or another: context switching between such threads happens in user mode, without kernel involvement. From the kernel's POV it is all one thread. What that thread actually executes is managed in user mode, and user mode can pause, switch, and resume logical threads of execution (i.e. coroutines). It all happens within the quanta scheduled for the actual thread. The kernel can still unceremoniously preempt that actual (kernel-scheduled) thread and hand the CPU to another one.

User-mode contexts require cooperative multitasking: user-mode threads must periodically yield control to the other user-mode threads (essentially, execution context-switches into a new user-mode thread without the underlying kernel-scheduled thread ever noticing anything). The upside is that the code usually knows far better than the kernel when it wants to give up control. The downside is that a badly written coroutine can hog control and starve all the other coroutines.

A historical implementation used setcontext , but that is now deprecated. Boost.Context offers a replacement for it, though it is not fully portable:

Boost.Context is a foundational library that provides a sort of cooperative multitasking on a single thread. By providing an abstraction of the current execution state in the current thread, including the stack (with local variables), the stack pointer, all CPU registers and flags, and the instruction pointer, a context represents a specific point in the application's execution path.

Unsurprisingly, Boost.Coroutine is built on top of Boost.Context.

Windows provides Fibers. The .NET runtime has Tasks and async/await.

+8
Aug 27 '16 at 21:33

LinuxThreads follows the so-called "one-to-one" model: each thread is actually a separate process inside the kernel. The kernel scheduler takes care of thread scheduling, just as it schedules regular processes. Threads are created with the Linux clone() system call, which is a generalization of fork() that allows the new process to share the memory space, file descriptors, and signal handlers of the parent.

Source: an interview with Xavier Leroy (the author of LinuxThreads), http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html#K

+1
Aug 28 '16 at 15:02


