This is prefaced by the top comments.
The documentation you are reading is generic (not Linux-specific) and a bit dated. Moreover, it uses different terminology. That, I believe, is the source of the confusion. So, read on ...
What it calls a "user-level" thread is what I've been calling the "outdated" LWP thread. What it calls a "kernel-level" thread is what is called a native thread in Linux. Under Linux, what is called a "kernel" thread is something else entirely [see below].
Threads created purely in user space are invisible to the kernel: it sees only a single process and has no idea how many threads are running inside it.
This is how user-space threads were created before NPTL (the Native POSIX Thread Library). It is also what SunOS/Solaris called an LWP, a lightweight process.
There was one process that multiplexed itself and created threads. IIRC, it was called the thread master process [or something like that]. The kernel knew nothing about this. The kernel did not yet understand or provide any support for threads.
But, because these "lightweight" threads were switched by code in the user-space-based thread master (a.k.a. the "lightweight process scheduler") [just a special user program/process], context switches between them were very slow.
Also, before the advent of "native" threads, suppose you had 10 processes. Each process gets 10% of the CPU. If one of those processes was an LWP that had 10 threads inside it, those threads had to share that 10%, and thus each got only about 1% of the CPU.
All of this was replaced by native threads, which the kernel's scheduler knows about. That changeover was made 10-15 years ago.
Now, in the example above, there are 20 threads/processes, each of which gets 5% of the CPU. And the context switching is much faster.
It is still possible to run an LWP system on top of a native thread, but now that is a design choice, not a necessity.
Further, LWPs work fine if every thread "cooperates". That is, each thread's loop periodically makes an explicit call to a "context switch" function. It voluntarily gives up its time slot so that another LWP can run.
However, the pre-NPTL implementation in glibc also had to [forcibly] preempt LWP threads (i.e. implement time slicing). I can't remember the exact mechanism used, but here's an example: the thread master had to set an alarm, go to sleep, wake up, and then send a signal to the active thread. The signal handler would then effect the context switch. This was messy, ugly, and somewhat unreliable.
Joachim mentioned that the pthread_create function creates a kernel thread
It is [technically] incorrect to call that a kernel thread. pthread_create creates a native thread. This runs in user space and vies for time slices on an equal footing with processes. Once created, there is little difference between a thread and a process.
The primary difference is that a process has its own unique address space. A thread, however, is a process that shares its address space with the other processes/threads that are part of the same thread group.
If it does not create a kernel-level thread, then how are kernel threads created from user-space programs?
Kernel threads are not user-space threads, NPTL, native, or otherwise. They are created by the kernel itself via the kernel_thread function. They run as part of the kernel and are not associated with any user-space program/process/thread. They have full access to the machine: devices, MMU, etc. Kernel threads run at the highest privilege level: ring 0. They also run in the kernel's address space and not in the address space of any user process/thread.
A user-space program/process can not create a kernel thread. Remember, it creates a native thread using pthread_create, which invokes the clone syscall to do so.
Threads are useful work-wise, even for the kernel. So, it runs some of its code in various threads. You can see these threads by running ps ax. Look and you will see kthreadd, ksoftirqd, kworker, rcu_sched, rcu_bh, watchdog, migration, etc. These are kernel threads, not programs/processes.
UPDATE:
You mentioned that the kernel does not know about user threads.
Remember that, as mentioned above, there are two “eras”.
(1) Before the kernel got thread support (circa 2004?). This used the thread master (which, here, I'll call the LWP scheduler). The kernel just had the fork syscall.
(2) All kernels after that, which do understand threads. There is no thread master, but we have pthreads and the clone syscall. Now, fork is implemented as clone. clone is similar to fork, but takes some extra arguments. Notably, a flags argument and a child_stack argument.
More on this below ...
then how can user-level threads have separate stacks?
There is nothing "magic" about a processor stack. I'll confine the discussion [mostly] to x86, but this applies to any architecture, even those that don't have a stack register at all (e.g. older IBM mainframes, such as the IBM System/370).
On x86, the stack pointer is %rsp. The x86 has push and pop instructions. We use these to save and restore things: push %rcx and [later] pop %rcx.
But, suppose the x86 did not have %rsp or push/pop instructions? Could we still have a stack? Sure, by convention. We [as programmers] agree that (e.g.) %rbx is the stack pointer.
In that case, the "push" of %rcx would be [using AT&T syntax]:
    subq    $8,%rbx
    movq    %rcx,0(%rbx)
And, the "pop" of %rcx would be:
    movq    0(%rbx),%rcx
    addq    $8,%rbx
To make things easier, I'll switch to C-like pseudo-code. Here are the above push/pop in pseudo-code:
    // push %rcx
    %rbx -= 8;
    0(%rbx) = %rcx;

    // pop %rcx
    %rcx = 0(%rbx);
    %rbx += 8;
To create a thread, the LWP scheduler had to create a stack area, usually with malloc. It then had to save this pointer in a per-thread struct, and then start up the child LWP. The actual code is a bit tricky, so assume we have a function (e.g.) LWP_create that is similar to pthread_create:
    typedef void *(*LWP_func)(void *);

    // per-thread control
    typedef struct tsk tsk_t;
    struct tsk {
        tsk_t *tsk_next;                // forward link
        tsk_t *tsk_prev;                // reverse link
        void *tsk_stack;                // stack base
        u64 tsk_regsave[16];
    };

    // list of tasks
    typedef struct tsklist tsklist_t;
    struct tsklist {
        tsk_t *tsk_next;                // forward link
        tsk_t *tsk_prev;                // reverse link
    };

    tsklist_t tsklist;                  // list of tasks
    tsk_t *tskcur;                      // current thread

    // LWP_switch -- switch from one task to another
    void
    LWP_switch(tsk_t *to)
    {

        // NOTE: we (e.g.) burn register values as we do our work. in a real
        // implementation, we'd have to push/pop these in a special way. so,
        // just pretend that we do that ...

        // save all registers into tskcur->tsk_regsave
        tskcur->tsk_regsave[RAX] = %rax;
        // ...

        tskcur = to;

        // restore most registers from tskcur->tsk_regsave
        %rax = tskcur->tsk_regsave[RAX];
        // ...

        // set stack pointer to new task stack
        %rsp = tskcur->tsk_regsave[RSP];

        // set resume address for task
        push(%rsp,tskcur->tsk_regsave[RIP]);

        // issue "ret" instruction
        ret();
    }

    // LWP_create -- start a new LWP
    tsk_t *
    LWP_create(LWP_func start_routine,void *arg)
    {
        tsk_t *tsknew;

        // get per-thread struct for new task
        tsknew = calloc(1,sizeof(tsk_t));
        append_to_tsklist(tsknew);

        // get new task stack
        tsknew->tsk_stack = malloc(0x100000);
        tsknew->tsk_regsave[RSP] = tsknew->tsk_stack;

        // give task its argument
        tsknew->tsk_regsave[RDI] = arg;

        // switch to new task
        LWP_switch(tsknew);

        return tsknew;
    }

    // LWP_destroy -- destroy an LWP
    void
    LWP_destroy(tsk_t *tsk)
    {

        // free the task stack
        free(tsk->tsk_stack);
        remove_from_tsklist(tsk);

        // free per-thread struct for dead task
        free(tsk);
    }
With a kernel that understands threads, we use pthread_create and clone, but we still have to create the new thread's stack. The kernel does not create/assign a stack for a new thread. The clone syscall accepts a child_stack argument. Thus, pthread_create must allocate a stack for the new thread and pass that to clone:
    // pthread_create -- start a new native thread
    tsk_t *
    pthread_create(LWP_func start_routine,void *arg)
    {
        tsk_t *tsknew;

        // get per-thread struct for new task
        tsknew = calloc(1,sizeof(tsk_t));
        append_to_tsklist(tsknew);

        // get new task stack
        tsknew->tsk_stack = malloc(0x100000);

        // start up thread
        clone(start_routine,tsknew->tsk_stack,CLONE_THREAD,arg);

        return tsknew;
    }

    // pthread_join -- destroy a native thread
    void
    pthread_join(tsk_t *tsk)
    {

        // wait for thread to die ...

        // free the task stack
        free(tsk->tsk_stack);
        remove_from_tsklist(tsk);

        // free per-thread struct for dead task
        free(tsk);
    }
Only a process or main thread gets its initial stack assigned by the kernel, usually at a high memory address. So, if the process does not use threads, normally, it just uses that preassigned stack.
But, if a thread is created, either an LWP or a native one, the starting process/thread must preallocate the area for the proposed thread with malloc. Side note: using malloc is the normal way, but the thread creator could just have a large pool of global memory: char stack_area[MAXTASK][0x100000]; if it wished to do it that way.
If we had an ordinary program that does not use threads [of any kind], it might still want to "override" the default stack it was given.
That process could decide to use malloc and the assembler trick described above to create a much larger stack if it were doing a hugely recursive function.
See my answer here: What is the difference between the user stack and the kernel stack in memory usage?