Edit 2 - Further development of the "lack of performance penalty" part due to popular demand:
... I would expect a significant reduction in calculation time as more threads are added, but the results do not seem to show that.
I made this giant diagram to better illustrate the scaling.

To explain the results:
The blue bars show the total execution time across all runs. Although this time keeps decreasing up to 256 threads, the gains from each doubling of the thread count get smaller and smaller. The CPU I ran this test on has 4 physical and 8 logical cores. Scaling is pretty good up to 4 threads and decent up to 8, and then it drops sharply. Pipeline saturation lets you squeeze out marginal gains all the way up to 256 threads, but it's simply not worth it.
The red bars show the time per run. It is nearly identical for 1 and 2 threads, since the CPU pipeline is not yet saturated. It takes a minor hit at 4 threads: things still work fine, but the pipeline is now saturated. At 8 threads it really shows that logical cores are not the same as physical ones, and it keeps getting worse beyond 8 threads.
The green bars show the overhead: how much lower actual performance is than the ideal 2x speedup expected from doubling the thread count. Going past the available logical cores makes the overhead climb rapidly. Note that this is mostly thread synchronization; the actual cost of thread scheduling is probably constant after a certain point. There is a minimal window of time a thread is guaranteed to run before being switched out, which explains why thread switching never completely overwhelms performance. In fact, there is no serious drop in performance all the way up to 4k threads, which is expected, since modern systems can and often do run more than a thousand threads in parallel. And again, most of the drop is caused by thread synchronization, not by thread switching.
The black outline shows the time difference relative to the fastest time. At 8 threads we lose only ~14% of absolute performance by not oversaturating the pipeline, which is good, because in most cases it is not worth stressing the whole system for that. It also shows that 1 thread is only ~6 times slower than the maximum the CPU can deliver. This gives a measure of how good logical cores are compared to physical cores: 100% more logical cores yields a ~50% performance increase, i.e. a logical thread is worth ~50% of a physical one, which also matches the ~47% increase we see going from 4 to 8 threads. Note that this is a very simple workload; in more complex cases the figure is closer to 20-25% for this particular CPU, and in some extreme cases hyper-threading is actually a performance hit.
Edit 1 - I foolishly forgot to isolate the computational workload from the thread-synchronization workload.
Running the test with a tiny amount of work per thread reveals that, for large thread counts, thread management takes up most of the time. Thus the penalty of thread switching itself is actually very small and is possibly constant after a certain point.
And that makes a lot of sense if you put yourself in the shoes of a thread-scheduler author. The scheduler can easily protect itself from being clogged by an unreasonably high switch-to-work ratio: there is probably a minimum time a thread is allowed to run before it is switched out while the rest are put on hold. This guarantees that the switch-to-work ratio never exceeds reasonable limits. It is much better to stall the other threads than to thrash on thread switching, where the CPU would mostly switch contexts and do very little actual work.
The optimal number of threads equals the number of logical CPU cores available. This gives optimal pipeline saturation.
If you use more, you will suffer performance degradation from the cost of thread context switching. The more threads, the bigger the penalty.
If you use fewer, you will not exploit the hardware's full potential.
There is also the issue of workload scaling, which matters a lot once you use synchronization primitives such as a mutex. If your concurrency is too fine-grained, you can see a performance drop even going from 1 to 2 threads on an 8-thread machine. You want to synchronize as rarely as possible, doing as much work as possible between synchronization points; otherwise you can face big performance losses.
Note the difference between physical and logical CPU cores. Hyper-threaded processors have more than one logical core per physical core. The secondary logical cores do not have the same processing power as the primary ones, since they merely exploit idle slots in the processor's pipeline.
For example, if you have a hyper-threaded quad-core processor (4 physical, 8 logical cores), then for a perfectly scalable workload you will see a roughly 4x performance increase going from 1 to 4 threads, but a much smaller one going from 4 to 8 threads, as vu1p3n0x's answer shows.
See here for how to determine the number of available CPU cores.