One of our servers is experiencing very high CPU load from our application. We have looked at various statistics and are having trouble pinpointing the source of the problem.
One of the current theories is that too many threads are involved and that we should try to reduce the number of concurrently running threads. There is just one main thread pool, with 3000 threads, and a WorkManager working with it (this is Java EE, GlassFish). At any given moment there are about 620 separate network I/O operations that need to be performed in parallel (switching to java.NIO is not an option, so each blocking I/O operation occupies its own thread). In addition, there are about 100 operations that involve no I/O and also run in parallel.
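To make the "fewer threads" theory concrete, here is a minimal sketch (not our actual code, and it ignores the GlassFish WorkManager entirely) of a pool sized near the real peak parallelism, using plain java.util.concurrent, which does exist on Java 1.5. The class name, margin, and task counts are assumptions taken from the figures above:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizingSketch {
    // Assumed figures from the description above:
    // ~620 blocking network I/O tasks + ~100 non-I/O tasks in parallel.
    private static final int PEAK_PARALLEL_TASKS = 620 + 100;

    public static void main(String[] args) {
        // A fixed pool slightly above the observed peak, instead of 3000 threads.
        // Each idle thread still costs a stack plus scheduler bookkeeping.
        ExecutorService pool = Executors.newFixedThreadPool(PEAK_PARALLEL_TASKS + 80);

        pool.execute(new Runnable() {   // anonymous class: Java 1.5 has no lambdas
            public void run() {
                // a blocking network I/O operation (or a non-I/O task) would run here
            }
        });

        pool.shutdown();
    }
}
```

The only point of the sketch is the pool size; whether the WorkManager's pool can be bounded this way in our GlassFish configuration is exactly what we would have to find out.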
This architecture looks inefficient, and we want to find out whether it is actually causing harm or is merely bad practice. Any change to this system is quite expensive (in man-hours), so we need some evidence that this really is the problem before committing to a fix.
So we are now wondering whether thread context switching is the cause, since there are far more threads than the number of required parallel operations. Looking at the logs, we see an average of 14 different threads executing within any given second. Given the two processors (see below), that is 7 threads per processor. That does not seem like much, but we wanted to verify it.
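As a rough cross-check of that log-based estimate, something like the following could be run inside the same JVM (for example from a temporary servlet or scheduled task; this is our own sketch, not code the application already contains) to count how many threads actually consumed CPU during a one-second window. ThreadMXBean and these methods are available from Java 5:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ActiveThreadSampler {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean tm = ManagementFactory.getThreadMXBean();
        if (!tm.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time is not supported on this JVM");
            return;
        }
        if (!tm.isThreadCpuTimeEnabled()) {
            tm.setThreadCpuTimeEnabled(true);
        }

        long[] ids = tm.getAllThreadIds();
        long[] before = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            before[i] = tm.getThreadCpuTime(ids[i]);   // -1 if the thread has died
        }

        Thread.sleep(1000);                            // sample over one second

        int active = 0;
        for (int i = 0; i < ids.length; i++) {
            long after = tm.getThreadCpuTime(ids[i]);
            if (before[i] >= 0 && after > before[i]) {
                active++;                              // this thread burned CPU in the window
            }
        }
        System.out.println(ids.length + " threads total, " + active
                + " consumed CPU during the last second");
    }
}
```

If the number of threads that actually burn CPU per second stays in the low tens, that would support the log-based figure of ~14; a much higher number would point back at the thread count.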
So, can we rule out context switching or too many threads as the problem?
General Information:
- Java 1.5 (yes, it is old), running on CentOS 5, 64-bit, Linux 2.6.18-128.el5 kernel
- There is only one Java process on the machine; nothing else runs there.
- Two processors under VMware.
- RAM 8 GB
- We are unable to run a profiler on the machine.
- We are unable to upgrade Java or the OS.
UPDATE: As recommended below, we captured the load average (using uptime) and CPU statistics (using vmstat 1 120) on our test server under different workloads. We waited 15 minutes between each workload change and its measurements so that the system could stabilize around the new load and the load-average values could settle:
50% of the production server's workload: http://pastebin.com/GE2kGLkk
34% of the production server's workload: http://pastebin.com/V2PWq8CG
25% of the production server's workload: http://pastebin.com/0pxxK0Fu
CPU usage does seem to decrease as the workload decreases, but not nearly proportionally (dropping from 50% to 25% of the workload does not cut CPU usage in half). The load average seems to bear little relation to the amount of work.
This also raises a question: since our test server is itself a virtual machine, could its CPU measurements be affected by other virtual machines running on the same host (which would make these measurements useless)?
UPDATE 2: Attaching a thread dump in three parts (due to pastebin restrictions):
Part 1: http://pastebin.com/DvNzkB5z
Part 2: http://pastebin.com/72sC00rc
Part 3: http://pastebin.com/YTG9hgF5