Huge file in Clojure and Java heap space error

I asked about this earlier: I have a huge XML file - a 287 GB Wikipedia dump - from which I want to extract data into a CSV file (revision editors and timestamps). I got it working to some extent: before I was getting a StackOverflowError, but now, after solving that first problem, I get: java.lang.OutOfMemoryError: Java heap space.

My code (partially taken from Justin Cramer's answer) looks like this:

    (defn process-pages [page]
      (let [title (article-title page)
            revisions (filter #(= :revision (:tag %)) (:content page))]
        (for [revision revisions]
          (let [user (revision-user revision)
                time (revision-timestamp revision)]
            (spit "files/data.csv"
                  (str "\"" time "\";\"" user "\";\"" title "\"\n")
                  :append true)))))

    (defn open-file [file-name]
      (let [rdr (BufferedReader. (FileReader. file-name))]
        (->> (:content (data.xml/parse rdr :coalescing false))
             (filter #(= :page (:tag %)))
             (map process-pages))))

I have not shown the functions article-title, revision-user and revision-timestamp, because they simply pull data out of a specific place in the page or revision map. Can anyone help me with this? I'm really new to Clojure and don't understand the problem.

+6
3 answers

Just to be clear, (:content (data.xml/parse rdr :coalescing false)) is lazy. Check its class or pull its first element (which will return instantly) if you are not sure.
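For example, a quick REPL check along these lines (the dump path here is just a placeholder) should show that :content is a lazy seq and that pulling the first :page returns without reading the whole 287 GB file:

    (require '[clojure.data.xml :as data.xml]
             '[clojure.java.io :as io])

    (with-open [rdr (io/reader "dump.xml")]              ; placeholder path
      (let [content (:content (data.xml/parse rdr :coalescing false))]
        (println (class content))                        ; a lazy seq type, e.g. clojure.lang.LazySeq
        ;; realizing only the first :page forces just a small prefix of the file
        (println (:tag (first (filter #(= :page (:tag %)) content))))))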

However, a couple of things to watch out for when processing large sequences are holding onto the head and unrealized / nested laziness. I think your code suffers from the latter.
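To see what nested laziness looks like in isolation, here is a toy example (unrelated to the XML code, purely to illustrate the behavior): both the outer map and the inner for are lazy, so none of the side effects run until both levels are forced.

    (def results
      (map (fn [x] (for [y (range 3)] (println x y)))  ; the inner for is lazy too
           (range 3)))
    ;; nothing has been printed yet - results is an unrealized seq of unrealized seqs
    (dorun (map dorun results))                         ; force both levels; now the printlns run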

Here is what I recommend:

1) Add (dorun) to the end of the ->> chain of calls. This will force the sequence to be fully realized without holding onto the head.

2) Change the for in process-pages to doseq. You are spitting to a file, which is a side effect, and you don't want to do that lazily.

Also, as Arthur recommends, open the output file once and keep writing to it, rather than opening it and writing for every Wikipedia entry; a sketch of all three changes follows.
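A minimal sketch of how those changes might look applied to the code from the question (article-title, revision-user and revision-timestamp are the asker's own helpers, and the file names are the ones from the question):

    (defn process-pages [w page]
      (let [title (article-title page)
            revisions (filter #(= :revision (:tag %)) (:content page))]
        (doseq [revision revisions]                        ; 2) doseq instead of for: eager, for side effects
          (let [user (revision-user revision)
                time (revision-timestamp revision)]
            (.write w (str "\"" time "\";\"" user "\";\"" title "\"\n"))))))

    (defn open-file [file-name]
      (with-open [rdr (clojure.java.io/reader file-name)
                  w   (clojure.java.io/writer "files/data.csv")]  ; open the output once
        (->> (:content (data.xml/parse rdr :coalescing false))
             (filter #(= :page (:tag %)))
             (map #(process-pages w %))
             (dorun))))                                    ; 1) realize everything without holding the head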

UPDATE

Here is a rewrite that attempts to address these problems more clearly:

    (defn filter-tag [tag xml]
      (filter #(= tag (:tag %)) xml))

    ;; lazy
    (defn revision-seq [xml]
      (for [page (filter-tag :page (:content xml))
            :let [title (article-title page)]
            revision (filter-tag :revision (:content page))
            :let [user (revision-user revision)
                  time (revision-timestamp revision)]]
        [time user title]))

    ;; eager
    (defn transform [in out]
      (with-open [r (io/input-stream in)
                  w (io/writer out)]
        (binding [*out* w]  ; bind *out* to the writer, not the output file name
          (let [xml (data.xml/parse r :coalescing false)]
            (doseq [[time user title] (revision-seq xml)]
              (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))  ; println supplies the newline

    (transform "dump.xml" "data.csv")

I don't see anything here that should lead to excessive memory usage.

+4

Unfortunately, data.xml/parse is not lazy; it tries to read the whole file into memory and then parse it.

Instead, use a lazy XML library that keeps only the part it is currently working on in RAM. You will then need to restructure the code so that it writes the output as it reads the input, instead of collecting all the XML first and then outputting it.

Your line

    (:content (data.xml/parse rdr :coalescing false))

loads all of the XML into memory and then asks it for the :content key. That will blow the heap.

A rough outline of a lazy solution would look something like this:

    (with-open [input  (java.io.FileInputStream. "/tmp/foo.xml")
                output (java.io.FileOutputStream. "/tmp/foo.csv")]  ; an output stream for the CSV
      ;; write-to-file, is-the-tag-i-want? and parse are placeholders for your own code
      (dorun
        (map #(write-to-file output %)
             (filter is-the-tag-i-want? (parse input)))))

It always takes some patience to work with (> data ram) :)

+1

I don't know about Clojure, but in plain Java you can use a SAX-based parser, for example http://docs.oracle.com/javase/1.4.2/docs/api/org/xml/sax/XMLReader.html , which does not need to load the whole XML into RAM.
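For what it is worth, that same JDK SAX machinery can also be driven from Clojure via Java interop. A minimal, purely illustrative sketch (the element name, handler logic, and file name are made up for the example):

    (import '(javax.xml.parsers SAXParserFactory)
            '(org.xml.sax.helpers DefaultHandler))

    (defn count-pages [file-name]
      ;; SAX pushes events to the handler as it streams the file,
      ;; so only the current element needs to be in memory.
      (let [counter (atom 0)
            handler (proxy [DefaultHandler] []
                      (startElement [uri local-name q-name attrs]
                        (when (= "page" q-name)
                          (swap! counter inc))))]
        (.parse (.newSAXParser (SAXParserFactory/newInstance))
                (java.io.File. file-name)
                handler)
        @counter))

    ;; (count-pages "dump.xml")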

0

Source: https://habr.com/ru/post/912264/

