We run an in-house application on our intranet. After a recent update, IIS started hanging at 100% CPU usage, and only a reset brings it back.
Rather than leave users with a frozen site, we reverted to the previous release while we work out a fix. The first step is to reproduce the problem, but so far we cannot.
Here is some background:
Production has one virtual (VMware) web server with two CPUs and 2 GB of RAM. The database server has two CPUs and 4 GB of RAM. It also runs on VMware, but on separate physical hardware.
Under normal use, the application works fine. The w3wp.exe process typically uses between 5% and 20% CPU and about 200 MB of RAM. Both fluctuate a little, but nothing unusual.
However, when the problem starts, RAM rises sharply and the CPU is pegged at 98% (or as much as it can get). The site becomes unresponsive. Resetting the application pool does nothing in this situation; a full restart of IIS is required.
This does not happen at night, when the site is not in use. It happens more often when the site is under load, but it has also occurred during off-peak periods.
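To make the symptom concrete, here is a minimal Python sketch of the signature we are looking for: CPU pegged and memory well above the usual w3wp.exe footprint for several samples in a row. The function name, the sample format, and the threshold defaults (95% CPU, twice the roughly 200 MB baseline, three consecutive samples) are assumptions for illustration, not values we have tuned.

```python
from typing import Iterable, Tuple


def looks_like_hang(
    samples: Iterable[Tuple[float, float]],
    cpu_threshold: float = 95.0,     # assumed cutoff for "pegged" CPU, in percent
    mem_baseline_mb: float = 200.0,  # typical w3wp.exe memory under normal use
    mem_factor: float = 2.0,         # assumed multiple that counts as a sharp rise
    min_consecutive: int = 3,        # assumed number of bad samples in a row
) -> bool:
    """Return True if (cpu_percent, mem_mb) samples show the hang signature.

    High CPU alone is not enough: memory must also be well above the
    baseline, and both conditions must hold for consecutive samples.
    """
    streak = 0
    for cpu_percent, mem_mb in samples:
        if cpu_percent >= cpu_threshold and mem_mb >= mem_baseline_mb * mem_factor:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            # Any normal sample resets the run of bad samples.
            streak = 0
    return False
```

A check like this separates a busy but healthy server (high CPU, normal memory) from the state we actually see in production, which matters when judging whether a load test really reproduced the hang.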
To reproduce the problem, we started using JMeter to simulate load. Our load script is based on actual usage at the time of the crash. With JMeter we can push usage to 2-3 times the load seen during the crash, but the site behaves perfectly: the CPU is high and the site becomes sluggish, but memory usage stays reasonable and nothing hangs.
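To illustrate what "2-3 times the load" means for a replay, here is a small Python sketch that compresses recorded request timestamps by a speed-up factor while keeping the original burst pattern. This is not our actual JMeter test plan; the function name and the sample timestamps are hypothetical.

```python
from typing import List, Sequence


def compress_schedule(timestamps: Sequence[float], speedup: float) -> List[float]:
    """Turn recorded request times (in seconds) into replay offsets.

    Offsets are measured from the first request and divided by `speedup`,
    so a speedup of 2.0 replays the same traffic in half the time without
    flattening bursts into a uniform rate.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    start = ordered[0]
    return [(t - start) / speedup for t in ordered]


# Hypothetical timestamps taken from a request log around the crash.
recorded = [10.0, 12.0, 12.5, 20.0]
print(compress_schedule(recorded, speedup=2.0))
```

The point of scaling the recorded timings, rather than firing requests at a constant rate, is that the replay keeps the same mix of quiet periods and bursts that production saw.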
Does anyone have any tips on how to reproduce a problem like this in a non-production environment? We would really like to reproduce the error, determine the fix, and then test again to confirm that it is resolved. Along the way we found and fixed a number of small things that might solve the problem, but I would feel much more confident if we could reproduce the problem and test the improved version against it.
Any tools, techniques or theories are greatly appreciated!