I am currently experimenting with a small Haskell web server written in Snap that loads a lot of data and serves it to clients. I am having a very hard time getting control over the server process. At random moments the process uses a lot of CPU for seconds to minutes and becomes unresponsive to client requests. Sometimes memory usage builds up (and sometimes drops) by hundreds of megabytes within seconds.
Hopefully someone has more experience with long-running Haskell processes that use lots of memory and can give me some pointers to make this thing more stable. I have been debugging this for days now and I'm starting to get a bit desperate here.
A small overview of my setup:
On startup the server reads about 5 gigabytes of data into a big (nested) Data.Map structure in memory. The nested map is strict in its values, all values inside the map are data types, and all their fields are strict as well. I put a lot of time into making sure no unevaluated thunks are left around. The import (depending on system load) takes around 5-30 minutes. The strange thing is the fluctuation between consecutive runs is much larger than I would expect, but that's a different problem.
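For illustration, a minimal sketch of what such a strict nested map could look like (the type and field names here are made up, and it assumes a containers version that ships Data.Map.Strict):

    {-# LANGUAGE BangPatterns #-}

    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map

    -- Hypothetical value type: every field is marked strict, so building
    -- a Record forces its fields instead of storing thunks for them.
    data Record = Record
      { recName  :: !String
      , recValue :: !Double
      } deriving Show

    -- A nested map, value-strict on both levels.
    type Store = Map String (Map String Record)

    -- Insert a record into the nested map; the strict map forces values
    -- as they are inserted, and the bang forces the record itself.
    insertRecord :: String -> String -> Record -> Store -> Store
    insertRecord outer inner !rec =
      Map.insertWith Map.union outer (Map.singleton inner rec)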
The big data structure lives inside a "TVar" that is shared by all the client threads spawned by the Snap server. Clients can request arbitrary parts of the data using a small query language. The amount of data requested is usually small (up to 300 kB or so) and touches only a small part of the data structure. All read-only requests are done using "readTVarIO", so they do not require any STM transactions.
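As a rough sketch of what such a read-only handler looks like (simplified to a flat map of ByteStrings; the handler and parameter names are made up):

    {-# LANGUAGE OverloadedStrings #-}

    import           Control.Concurrent.STM (TVar, readTVarIO)
    import           Control.Monad.IO.Class (liftIO)
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Map.Strict as Map
    import           Snap.Core (Snap, getParam, writeBS)

    -- Read-only lookup against the shared store: readTVarIO is a plain
    -- IO read of the current value, so no STM transaction is started.
    lookupHandler :: TVar (Map.Map B.ByteString B.ByteString) -> Snap ()
    lookupHandler storeVar = do
      mKey  <- getParam "key"
      store <- liftIO (readTVarIO storeVar)
      case mKey >>= (\k -> Map.lookup k store) of
        Just val -> writeBS val
        Nothing  -> writeBS "not found"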
The server is started with the following flags: +RTS -N -I0 -qg -qb. This starts the server in multi-threaded mode and disables idle-time GC and parallel GC. This seems to speed things up a lot.
The server mostly runs without any problem. However, every once in a while a client request times out and the CPU spikes to 100% (or even over 100%) and keeps doing this for a long time. Meanwhile the server does not respond to requests anymore.
There are a few causes I can think of that might lead to the CPU usage:
The request simply takes a lot of time because there is a lot of work to do. Somewhat unlikely, because it sometimes happens for queries that proved to be very fast in previous runs (by fast I mean 20-80 ms or so).
There are still some unevaluated thunks that need to be computed before the data can be processed and sent to the client. Also unlikely, for the same reason as the previous point.
Somehow a (full?) garbage collection kicks in and starts scanning my entire 5 GB heap. I can imagine this could take a lot of time.
The problem is that I have no clue how to figure out what exactly is going on and what to do about it. Because the import process takes so long, the profiling results do not show me anything useful. There seems to be no way to conditionally turn the profiler on and off from within the code.
I personally suspect the GC is the problem here. I am using GHC 7, which seems to have a lot of options to tweak how the GC works.
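To at least confirm the GC suspicion without the profiler, I'm considering polling the RTS statistics from the running server. A minimal sketch, assuming a base version that provides GHC.Stats.getGCStats (it ships with the newer GHC 7.x releases) and that the process is started with +RTS -T:

    import GHC.Stats (GCStats (..), getGCStats)

    -- Dump a few GC counters; this needs the process to be run with
    -- +RTS -T, otherwise the RTS does not collect these statistics.
    logGcStats :: IO ()
    logGcStats = do
      s <- getGCStats
      putStrLn $  "GCs: "           ++ show (numGcs s)
               ++ ", GC cpu (s): "  ++ show (gcCpuSeconds s)
               ++ ", GC wall (s): " ++ show (gcWallSeconds s)
               ++ ", live bytes: "  ++ show (currentBytesUsed s)

Calling something like this periodically from a background thread (or right after a slow request) should at least show whether the pauses line up with major collections.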
What GC settings do you recommend when using large heaps with very stable data?
Tags: performance, garbage-collection, memory-management, haskell
Sebastiaan Visser