I have a program in which each thread reads a file a batch of lines at a time, processes those lines, and writes them to a different output file. Four threads split the list of files to process among themselves. I'm seeing performance problems in two cases:
- Four files with 50,000 lines each:
  - Throughput starts at about 700 lines/sec and decreases to ~100 lines/sec.
- 30,000 files with 12 lines each:
  - Throughput starts at about 800 lines/sec and remains steady.
This is internal software I'm working on, so unfortunately I can't share any source code, but the main steps of the program are listed below (a rough equivalent of the per-thread loop is sketched after the list):
- Split the file list among four threads.
- Run all threads.
- Thread reads up to 100 lines at a time and stores them in a String[] array.
- Thread applies the conversion to all lines in the array.
- Thread writes the converted lines to an output file (not the same file as the input).
- Each thread repeats the read/convert/write steps until all of its files are fully processed.
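
Since I can't post the real code, here is a rough, simplified equivalent of what one worker thread does. This is only a sketch based on the description above; names like `Worker`, `transform`, the `.out` output suffix, and the batch size of 100 are placeholders, not the actual implementation:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of one worker thread. Each thread owns its own subset of the file list.
class Worker implements Runnable {
    private final List<File> files; // this thread's share of the file list

    Worker(List<File> files) {
        this.files = files;
    }

    @Override
    public void run() {
        for (File inFile : files) {
            File outFile = new File(inFile.getPath() + ".out"); // placeholder output name
            try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(inFile), StandardCharsets.UTF_8));
                 BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(outFile), StandardCharsets.UTF_8))) {

                List<String> batch = new ArrayList<>(100);
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == 100) {      // read up to 100 lines at a time
                        writeBatch(writer, batch);  // convert and write the batch
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    writeBatch(writer, batch);      // flush the last partial batch
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    private void writeBatch(BufferedWriter writer, List<String> batch) throws IOException {
        for (String s : batch) {
            writer.write(transform(s));             // apply the conversion to each line
            writer.newLine();
        }
    }

    private String transform(String s) {
        return s; // stand-in for the real per-line conversion
    }
}
```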
What I don't understand is why 30,000 files of 12 lines each give me better performance than a few files with many lines each. I would have expected the overhead of opening and closing all those files to outweigh reading from a single file. On top of that, the throughput drop in the first case looks exponential.
I set the maximum heap size to 1024 MB and the program seems to use no more than 100 MB, so an overloaded GC doesn't appear to be the problem. Do you have any other ideas?