I process text files of 60 GB or more. Files are divided into a variable length header section and a data section. I have three functions:
- head? is a predicate that distinguishes header lines from data lines
- process-header processes one header line
- process-data processes one data line

The processing functions asynchronously access and modify an in-memory database.
I built on a file-reading method from another SO thread, which should build a lazy sequence of lines. The idea was to process some lines with one function, then switch functions once and keep processing with the next function.
    (defn lazy-file
      [file-name]
      (letfn [(helper [rdr]
                (lazy-seq
                  (if-let [line (.readLine rdr)]
                    (cons line (helper rdr))
                    (do (.close rdr) nil))))]
        (try
          (helper (clojure.java.io/reader file-name))
          (catch Exception e
            (println "Exception while trying to open file" file-name)))))
I use it with something like
    (let [lfile (lazy-file "my-file.txt")]
      ;; :while needs the predicate applied to the line; a bare head? is always truthy
      (doseq [line lfile :while (head? line)]
        (process-header line))
      (doseq [line (drop-while head? lfile)]
        (process-data line)))
Although this works, it is quite inefficient for several reasons:
- Instead of simply calling process-header until I reach the data and then continuing with process-data, I have to filter the header lines and process them, then restart parsing the whole file and drop all the header lines to process the data. This is the exact opposite of what lazy-file was intended to do.
- Watching memory consumption shows me that the program, although seemingly lazy, uses as much RAM as it would take to hold the whole file in memory.
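The single-pass behaviour described in the first point could be sketched roughly as follows. This is only an illustration under assumptions: `process-file` is a hypothetical name, and `head?`, `process-header` and `process-data` are passed in as arguments so that the block stands on its own. It reads directly from the reader in a loop instead of going through lazy-file.

```clojure
(require '[clojure.java.io :as io])

(defn process-file
  "Reads `file-name` once, line by line. Lines go to `process-header`
  until the first line for which `head?` is false; that line and every
  later line go to `process-data`. (Hypothetical helper, not part of
  the original code.)"
  [file-name head? process-header process-data]
  (with-open [^java.io.BufferedReader rdr (io/reader file-name)]
    (loop [line       (.readLine rdr)
           in-header? true]
      (when line
        (if (and in-header? (head? line))
          (do (process-header line)
              (recur (.readLine rdr) true))
          ;; Once in-header? is false, head? is never called again.
          (do (process-data line)
              (recur (.readLine rdr) false)))))))
```

It would be called as `(process-file "my-file.txt" head? process-header process-data)`. Only the current line is bound at any time, so nothing in the loop keeps a reference to earlier lines.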
So, what is a more efficient, idiomatic way to work with my database?
One idea would be to use a multimethod that dispatches on the value of the head? predicate to process header and data lines, but I assume this would have a serious impact on speed, especially since the predicate result changes only once, from always true to always false. I haven't benchmarked that yet.
Would it be better to build the line seq in a different way and parse it with iterate? I think that would still leave me needing :while and :drop-while.
My research turned up several mentions of NIO file access, which is supposed to improve memory usage. I have not yet found out how to use it idiomatically in Clojure.
Maybe I still have a poor grasp of the general idea of how the file should be processed?
As always, any help, ideas, or pointers to further reading are very welcome.
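For reference, the multimethod idea might look like the sketch below. The `head?` defined here is a hypothetical stand-in (it treats lines starting with "#" as header lines), and the two methods just tag the line instead of calling the real process-header and process-data.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the real predicate: header lines start with "#".
(defn head? [line]
  (str/starts-with? line "#"))

(defmulti process-line
  "Dispatches on whether the line is a header line (true) or not (false)."
  (fn [line] (boolean (head? line))))

;; Placeholder bodies; the real methods would call process-header / process-data.
(defmethod process-line true  [line] [:header line])
(defmethod process-line false [line] [:data line])

(mapv process-line ["# created 2020" "1,2,3"])
;; => [[:header "# created 2020"] [:data "1,2,3"]]
```

Note that this calls the predicate on every line, including all data lines after the header has ended, and that a data line which happens to satisfy head? would be dispatched as a header line.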
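One way to avoid the :while and drop-while pair might be to fold over the line sequence while carrying a small piece of state. Note that this sketch uses reduce, not iterate, and that `process-lines` is a hypothetical name; the three functions are passed in as arguments.

```clojure
(defn process-lines
  "Walks `lines` exactly once. The accumulator is :header until the first
  line for which `head?` is false, and :data from then on. Returns the
  final state. (Hypothetical helper for illustration.)"
  [lines head? process-header process-data]
  (reduce (fn [state line]
            (if (and (= state :header) (head? line))
              (do (process-header line) :header)
              (do (process-data line) :data)))
          :header
          lines))
```

With a reader it could be used as `(with-open [rdr (clojure.java.io/reader "my-file.txt")] (process-lines (line-seq rdr) head? process-header process-data))`, so that the reader is closed when the fold finishes.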
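As far as opening a file through NIO goes, a minimal sketch via Java interop could look like this. `nio-reader` and `count-lines` are hypothetical names, and UTF-8 is an assumed encoding. The sketch only shows how to obtain a reader from `java.nio.file.Files`; it does not demonstrate any change in memory use.

```clojure
(import '(java.nio.file Files Paths)
        '(java.nio.charset StandardCharsets))

(defn nio-reader
  "Opens `file-name` as a BufferedReader through java.nio.file.Files.
  Assumes the file is UTF-8 encoded."
  [^String file-name]
  (Files/newBufferedReader
    ;; Paths/get is a varargs method, so it needs an (empty) String array.
    (Paths/get file-name (make-array String 0))
    StandardCharsets/UTF_8))

(defn count-lines
  "Counts the lines of `file-name`, closing the reader afterwards."
  [file-name]
  (with-open [rdr (nio-reader file-name)]
    (reduce (fn [n _line] (inc n)) 0 (line-seq rdr))))
```

The returned reader is an ordinary `java.io.BufferedReader`, so it works with `line-seq` and `with-open` in the same way as one from `clojure.java.io/reader`.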