There are several aspects to your question.
A key reference for the design decisions in the GHC runtime is the paper "Runtime Support for Multicore Haskell".
Recall that
The GHC runtime system supports millions of lightweight threads by multiplexing them onto a handful of operating system threads, roughly one for each physical CPU.
and
Each Haskell thread runs on a finite-sized stack, which is allocated in the heap. The state of a thread, together with its stack, is kept in a heap-allocated thread state object (TSO). The size of a TSO is around 15 words plus the stack, and constitutes the whole state of a Haskell thread. The stack may grow by copying the TSO into a larger area, and may subsequently shrink again.
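To make the "lightweight" part concrete, here is a minimal sketch that forks a large number of Haskell threads with `forkIO` and collects one value from each. The helper name `spawnAndSum` and the thread count are my own choices for illustration, not something taken from the paper:

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- | Hypothetical helper: fork n lightweight threads, where thread i
-- reports the value i through its own MVar, then sum all reports.
spawnAndSum :: Int -> IO Int
spawnAndSum n = do
  boxes <- forM [1 .. n] $ \i -> do
    box <- newEmptyMVar
    _ <- forkIO (putMVar box i)   -- each forkIO creates a new Haskell thread
    pure box
  results <- mapM takeMVar boxes  -- wait for every thread to report
  pure (sum results)

main :: IO ()
main = do
  -- Number of capabilities, i.e. how many Haskell threads can run
  -- simultaneously (set with +RTS -N when built with -threaded).
  caps <- getNumCapabilities
  total <- spawnAndSum 100000
  putStrLn ("capabilities: " ++ show caps)
  putStrLn ("sum reported by threads: " ++ show total)
```

The number of Haskell threads here is far larger than the number of capabilities, which is the multiplexing the quote describes.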
GHC does not compile via CPS. Each thread makes recursive calls, and those calls must allocate on a stack. Representing the stack as a heap-allocated object keeps things simple.
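As a rough illustration of why each thread needs a stack, the sketch below runs a deliberately non-tail-recursive function inside a forked thread. The names `depth` and `depthInThread` are made up for this example, and it assumes the default RTS stack limit, which is normally large enough for this depth:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- | Deliberately non-tail-recursive: each call leaves a pending (1 +)
-- on the thread's stack until the recursive call returns.
depth :: Int -> Int
depth 0 = 0
depth n = 1 + depth (n - 1)

-- | Hypothetical helper: evaluate 'depth' in a fresh Haskell thread
-- and wait for the result.
depthInThread :: Int -> IO Int
depthInThread n = do
  done <- newEmptyMVar
  _ <- forkIO $ do
    let r = depth n
    r `seq` putMVar done r   -- force evaluation inside the forked thread
  takeMVar done

main :: IO ()
main = do
  r <- depthInThread 1000000
  putStrLn ("recursion depth reached: " ++ show r)
```

The forked thread starts with a small stack, and the runtime grows it as the recursion deepens. Lowering the limit with `+RTS -K` should make the same program fail with a stack overflow in that thread.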
A thread is more than a closure.
As a thread executes, it allocates on both the heap and its stack. Thus:
The stack of a thread, and hence its TSO, is mutable. When a thread executes, the stack will accumulate pointers to new objects, and so if the TSO resides in an old generation it must be added to the garbage collector's remembered set.
Garbage collection of the objects pointed to by stacks can be optimized by ensuring that the collection happens on the same physical thread that ran the Haskell thread.
Furthermore, when the garbage collector runs, it is highly desirable that the TSOs that have been executing on a given CPU are traversed by the garbage collector on the same CPU, because the TSO and the data it refers to are likely to be in the local cache of that CPU.
So, GHC has a stack for each thread because its compilation scheme requires that threads have access to a stack and a heap. Giving each thread its own stack lets threads execute in parallel more effectively. Threads are more than "just a closure" because they have a mutable stack.