Why is Scala parsing slow on large files? What can I do?

I need to parse files with millions of lines. I noticed that my parser-combinator parser gets slower and slower as it consumes more lines. The problem seems to lie in Scala's "rep" or regex parsers, because the behavior shows up even with the trivial example parser below:

def file: Parser[Int] = rep(line) ^^^ 1        // a file is a repetition of lines

def line: Parser[Int] = """(?m)^.*$""".r ^^^ 0 // reads a line and returns 0

When I parse a file with millions of equal-length lines using this simple parser, it starts out at 46 lines/ms. After 370,000 lines the speed drops to 20 lines/ms; after 840,000 lines, to 10 lines/ms; after 1,790,000 lines, to 5 lines/ms...

My questions:

  • Why is this happening?

  • What can I do to prevent this?

1 answer

This is probably a result of the change in Java 7u6, after which substrings no longer share the backing array of the original string. As a result, large strings get copied over and over, producing enormous amounts of garbage (which is collected promptly, incidentally). As the amount of parsed material grows (I assume you keep at least part of it), the garbage collector has more and more live data to trace, so all that excess garbage carries a steeper and steeper penalty.
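The quadratic cost described above can be made concrete with a small arithmetic sketch (this is an illustration of the copying behavior, not the parser itself): if every parse step takes a substring of the remaining input, and each substring copies its characters, the total number of characters copied grows with the square of the number of lines.

```java
// Sketch: total characters copied when each of `lines` parse steps takes a
// post-7u6 substring of the remaining input (each line is `lineLen` chars).
public class SubstringCost {
    static long charsCopied(long lines, long lineLen) {
        long remaining = lines * lineLen; // chars left in the input
        long total = 0;
        for (long i = 0; i < lines; i++) {
            total += remaining;   // substring(lineLen) copies the whole tail
            remaining -= lineLen; // one line consumed
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(charsCopied(1_000, 10));  // 5,005,000
        System.out.println(charsCopied(10_000, 10)); // 500,050,000
    }
}
```

Note that 10x more lines means roughly 100x more characters copied, which matches the steadily dropping lines/ms figures in the question.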

There is a workaround, pointed out by Zach Moazeni, where you wrap the string in a construct that produces substrings without copying the underlying data (linked in the comments).

This does not change the result of the parse, but it should greatly reduce the amount of garbage created and keep the speed from degrading.

In the long run, the library itself ought to avoid creating those substring copies; until then, the wrapper is the practical fix. (The design was reasonable back when substrings were cheap to create.)
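A minimal sketch of the kind of wrapper the workaround describes (this is an illustrative class of my own, not the exact construct from Zach Moazeni's post): a CharSequence view over a shared String whose subSequence() returns another view in O(1) instead of copying characters. Since java.util.regex matches against any CharSequence, the regex parsers can advance through the input without per-step copies.

```java
// Hypothetical copy-free view over a shared String.
public final class SubSeq implements CharSequence {
    private final String base;  // shared backing string, never copied
    private final int from, to; // half-open window [from, to)

    public SubSeq(String base) { this(base, 0, base.length()); }

    private SubSeq(String base, int from, int to) {
        this.base = base; this.from = from; this.to = to;
    }

    @Override public int length() { return to - from; }
    @Override public char charAt(int i) { return base.charAt(from + i); }

    @Override public CharSequence subSequence(int start, int end) {
        return new SubSeq(base, from + start, from + end); // O(1) view, no copy
    }

    @Override public String toString() { return base.substring(from, to); }

    public static void main(String[] args) {
        SubSeq whole = new SubSeq("line1\nline2");
        CharSequence rest = whole.subSequence(6, whole.length()); // view of "line2"
        java.util.regex.Matcher m =
            java.util.regex.Pattern.compile("(?m)^.*$").matcher(rest);
        if (m.find()) System.out.println(m.group()); // prints "line2"
    }
}
```

Only toString() materializes a copy, so as long as the parser keeps working with the view, the quadratic copying disappears.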


Source: https://habr.com/ru/post/1536847/

