I suspect your program depends on shared resources, so as the number of processes grows, each one spends more time waiting for a resource to be freed by another process before it can use it.
You may have 24 cores, but that does not mean the rest of your system lets every core do everything at the same time. As mentioned in the comments, memory access is one thing that can cause delays (due to traffic), and the same goes for the disk.
Also consider the network connection, which can likewise suffer under many simultaneous accesses. In short, these hardware delays can be large enough to dominate the processing time.
General note: recall how a program's efficiency is defined:
E = S / p, where S is the speedup and p is the number of nodes / processes / threads
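As a rough illustration of the formula above, here is a minimal Python sketch. It assumes the usual definition of speedup, S = T_1 / T_p (run time with one process divided by run time with p processes), which the answer itself does not spell out. The function names and the timings are hypothetical and made up for the example, so substitute your own measurements.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S = T_1 / T_p: how many times faster the parallel run is."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """E = S / p: fraction of the ideal linear speedup reached with p workers."""
    return speedup(t_serial, t_parallel) / p


# Hypothetical wall-clock timings in seconds, keyed by process count p.
# These numbers are invented for illustration only.
timings = {1: 120.0, 2: 60.0, 4: 40.0, 8: 30.0}

for p, t in timings.items():
    s = speedup(timings[1], t)
    e = efficiency(timings[1], t, p)
    print(f"p={p}: S={s:.2f}, E={e:.2f}")
```

With these made-up numbers, efficiency falls from 1.00 at p=2 to 0.50 at p=8, which is the kind of pattern you would expect to see when the processes contend for shared hardware.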
Now consider scalability. Programs are usually weakly scalable, i.e. to keep efficiency constant you need to increase the problem size and p at the same rate. A program that keeps efficiency constant when only p increases, while the problem size (n in your case) stays fixed, is strongly scalable.