First of all, you need to be specific about which kind of timeout you are talking about.
TCP timeouts: TCP splits a message into packets that are sent one by one. The receiver must acknowledge receipt of each packet. If the receiver does not acknowledge a packet within a certain period of time, a TCP retransmission occurs and the same packet is sent again. If this happens a couple more times, the sender gives up and kills the connection.
HTTP timeout: an HTTP client, such as a browser, or your server acting as a client (for example, sending requests to other HTTP servers), can set an arbitrary timeout. If a response is not received within that period of time, it will disconnect and report it as a timeout.
Now, there are many, many possible causes for this... from the more trivial to the less trivial:
Invalid Content-Length calculation: if you send a request with a Content-Length: 20 header, it means "I am going to send you 20 bytes". If you only send 19, the other end will keep waiting for the remaining one. If that takes too long... timeout.
Insufficient infrastructure: maybe you need to assign more machines to your application. If (total load / # of CPU cores) is above 1, or your memory usage is high, your system may be over capacity. However, read on...
Silent exception: an error was thrown but never logged anywhere. The request never finished being serviced, which leads to the next item.
Resource leaks: every request must eventually be serviced to completion. If you don't do this, the connection will remain open. In addition, the IncomingMessage object (aka: what is commonly called req in express code) will remain referenced by other objects (for example: express itself). Each of those objects can use a lot of memory.
Node event loop gotcha: I will get to this at the end.
For memory leaks, the symptoms will be: the node process will use an increasing amount of memory.
To make matters worse: if available memory is low and your server is misconfigured to use swap, Linux will start moving memory to disk (swapping), which is very I/O- and CPU-intensive. Servers should not have swapping enabled.
cat /proc/sys/vm/swappiness
will return the swappiness level configured on your system (it goes from 0 to 100). You can change it persistently via /etc/sysctl.conf (a reboot is required) or non-persistently with: sysctl vm.swappiness=10
Once you have determined that you have a memory leak, you need to get a core dump and load it for analysis. A way to do this can be found in this other Stack Overflow answer: Tools to analyze core dump from Node.js
For connection leaks (you leaked a connection by never servicing the request to completion), you will have more and more established connections to your server. You can count the established connections with: netstat -a -p tcp | grep ESTABLISHED | wc -l
Now, the event loop gotcha is the nastiest problem. If your code only ever runs in short bursts, node works very well. But if you do CPU-bound work, a function that keeps the CPU busy for an excessive amount of time... say 50 ms (50 ms of solid, blocking, synchronous CPU time, not asynchronous code that takes 50 ms), then the operations being handled by the event loop, such as servicing HTTP requests, start lagging behind and eventually time out.
The way to find CPU bottlenecks is with a performance profiler. nodegrind/qcachegrind are my preferred profiling tools, but there are others. However, it can be difficult to run a profiler in production, so just take a development server and hammer it with requests. aka: a load test. There are many tools for this.
Finally, another way to debug the problem:
env NODE_DEBUG=tls,net node <...arguments for your app>
Node has optional debug statements that are enabled via the NODE_DEBUG environment variable. Setting NODE_DEBUG to tls,net will make node emit debug information for the tls and net modules... so, basically, everything being sent or received. If there is a timeout, you will see where it is coming from.
Source: experience running node services in production for a long time.