Clojure - process huge files with low memory

I process text files of 60 GB or more. The files are divided into a variable-length header section and a data section. I have three functions:

  • head?, a predicate that distinguishes header lines from data lines
  • process-header, which processes one header line
  • process-data, which processes one data line
  • The processing functions access and modify an in-memory database asynchronously (a rough sketch of the shapes involved follows right after this list)
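
Just to make those shapes concrete, here are purely illustrative stubs; they are not my real implementations, and the "#" convention for header lines is made up:

    ;; Illustrative stubs only; the real functions talk to an in-memory database.
    (defn head? [line]
      (.startsWith line "#"))          ; pretend header lines start with "#"

    (defn process-header [line]
      (println "header:" line))

    (defn process-data [line]
      (println "data:" line))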

I adopted a way of reading files from another SO thread, which builds a lazy sequence of lines. The idea was to process the first lines with one function, then switch functions once and continue processing with the other function.

    (defn lazy-file [file-name]
      (letfn [(helper [rdr]
                (lazy-seq
                  (if-let [line (.readLine rdr)]
                    (cons line (helper rdr))
                    (do (.close rdr) nil))))]
        (try
          (helper (clojure.java.io/reader file-name))
          (catch Exception e
            (println "Exception while trying to open file" file-name)))))

I use it with something like

    (let [lfile (lazy-file "my-file.txt")]
      (doseq [line lfile
              :while (head? line)]
        (process-header line))
      (doseq [line (drop-while head? lfile)]
        (process-data line)))

Although this works, it is quite inefficient for several reasons:

  • Instead of simply calling process-header until I reach the data and then continuing with process-data, I have to filter the header lines and process them, and then restart parsing the whole file, skipping all the header lines, to process the data. That is the exact opposite of what lazy-file was intended to achieve.
  • Watching memory consumption shows that the program, although seemingly lazy, uses about as much RAM as it would take to store the whole file in memory.

So, what would be a more efficient, idiomatic way to process my data?

One idea would be to use a multimethod to process header and data lines depending on the value of the head? predicate, but I suspect this would hurt throughput, especially since there is only a single point at which the predicate's result flips from always true to always false. I haven't benchmarked it yet.
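
For illustration, the multimethod version I have in mind would look roughly like this (untested, and it still calls head? on every line):

    ;; Untested sketch: dispatch on a booleanised head?.
    (defmulti process-line (comp boolean head?))
    (defmethod process-line true  [line] (process-header line))
    (defmethod process-line false [line] (process-data line))

    ;; the whole file in a single pass:
    ;; (doseq [line (lazy-file "my-file.txt")] (process-line line))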

Would it be better to build the line seq in a different way and consume it with iterate? That would still leave me needing :while and drop-while, I think.

My research has turned up NIO file access several times, which is supposed to improve memory usage, but I could not yet figure out how to use it idiomatically from Clojure.

Or maybe I am missing something about how such a file should be processed in general?

As always, any help, ideas or pointers to further reading are very welcome.

+5
2 answers

You should use standard library functions.

line-seq, with-open and doseq can easily do the job.

Something along the lines of:

    (with-open [rdr (clojure.java.io/reader file-path)]
      (doseq [line (line-seq rdr)]
        (if (head? line)
          (process-header line)
          (process-data line))))
+2

There are a few things here:

  • Memory usage

    There are reports that leiningen may add stuff that keeps a reference to the head of a sequence, even though doseq specifically does not retain the head of the sequence it processes, cf. this SO question. Try to verify your observation "uses about as much RAM as it would take to store the whole file in memory" without using lein repl.

  • Parsing strings

    Instead of using two loops with doseq you could also use a loop/recur combination. Whether you are still parsing the header then becomes the second loop argument, like this (untested):

      (loop [lfile (lazy-file "my-file.txt")
             parse-header true]
        (when-let [line (first lfile)]       ; stop at end of file
          (if (and parse-header (head? line))
            (do (process-header line)
                (recur (rest lfile) true))
            (do (process-data line)
                (recur (rest lfile) false)))))

    Another option would be to incorporate your processing functions into your file-reading function. So instead of just consing a new line onto the sequence and returning it, you could process it right away; ideally, you would pass the processing function in as an argument rather than hard-coding it.

    Your current code looks like the processing is done purely for side effects. If so, you could probably drop laziness altogether once you fold the processing in. You need to process the whole file anyway (or so it seems), and you do it line by line. The lazy-seq approach basically just pairs each line read with one processing call. Your need for laziness only arises in the current solution because you separate reading (the whole file, line by line) from processing. If you instead move the processing of a line into the reading, you don't need to do it lazily.
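
    To make that concrete, here is a rough, untested sketch of what folding the processing into the reading could look like; the function name is mine, and head?, process-header and process-data are passed in as arguments:

      (defn read-and-process
        [file-name head? process-header process-data]
        ;; Read and process in one pass; the loop never holds on to the head
        ;; of the line seq, so processed lines can be garbage collected.
        (with-open [rdr (clojure.java.io/reader file-name)]
          (loop [lines (line-seq rdr)
                 in-header true]
            (when-let [line (first lines)]
              (if (and in-header (head? line))
                (do (process-header line)
                    (recur (rest lines) true))
                (do (process-data line)
                    (recur (rest lines) false)))))))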

0

Source: https://habr.com/ru/post/1238494/

