Just to be clear, (:content (data.xml/parse rdr :coalescing false)) is lazy. Check its class, or pull the first element (it will return instantly), if you are not convinced.
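For example, a quick REPL check, assuming rdr is an open reader over the dump, might look like this:

    (require '[clojure.data.xml :as data.xml])

    ;; `rdr` is assumed to be an open java.io.Reader (or InputStream) over the XML dump
    (def xml (data.xml/parse rdr :coalescing false))

    (class (:content xml)) ;; a lazy seq, e.g. clojure.lang.LazySeq
    (first (:content xml)) ;; returns immediately; only the first element is realized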
However, two things to watch out for when processing large sequences are holding onto the head and unrealized / nested laziness. I think your code suffers from the latter.
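To make the two failure modes concrete, here is a small, contrived sketch (not taken from the question):

    ;; holding onto the head: `xs` is needed again after the full traversal,
    ;; so every element realized by (count xs) stays reachable in memory
    (let [xs (map inc (range 1000000))]
      [(count xs) (first xs)])

    ;; nested laziness: dorun walks the outer seq, but the inner
    ;; (map println ...) seqs are never realized, so nothing prints
    (dorun (map (fn [x] (map println (range x))) (range 3)))

    ;; forcing the inner seqs as well makes the side effects actually happen
    (dorun (map (fn [x] (dorun (map println (range x)))) (range 3)))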
Here is what I recommend:
1) Add (dorun) to the end of the ->> chain of calls. This will cause the sequence to be fully realized without holding onto the head.
2) Change the for in process-page to a doseq. You spit to a file, which is a side effect, and you don't want to do that lazily. A rough sketch of both changes follows this list.
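In terms of the names from the question (process-page and the ->> pipeline are yours; format-revision here is a hypothetical placeholder for whatever builds the output line):

    ;; 1) dorun at the end of the pipeline: realizes everything, retains nothing
    (->> (:content xml)
         (filter #(= :page (:tag %)))
         (map process-page)
         (dorun))

    ;; 2) doseq instead of for inside process-page, so the write happens
    ;;    eagerly rather than hiding inside an unrealized lazy seq
    (defn process-page [page]
      (doseq [revision (filter #(= :revision (:tag %)) (:content page))]
        (spit "data.csv" (format-revision revision) :append true)))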
As Arthur recommends, you can open the output file once and keep writing to it, rather than opening it and calling spit again for every Wikipedia entry.
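A minimal sketch of that pattern, with placeholder names (lines stands in for whatever sequence you are writing out):

    (require '[clojure.java.io :as io])

    ;; open the writer once, write many times, and let with-open close it
    (with-open [w (io/writer "data.csv")]
      (doseq [line lines]
        (.write w (str line "\n"))))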
UPDATE
Here is a rewrite that tries to lay out the problems more clearly:
    (defn filter-tag [tag xml]
      (filter #(= tag (:tag %)) xml))

    ;; lazy
    (defn revision-seq [xml]
      (for [page (filter-tag :page (:content xml))
            :let [title (article-title page)]
            revision (filter-tag :revision (:content page))
            :let [user (revision-user revision)
                  time (revision-timestamp revision)]]
        [time user title]))

    ;; eager
    (defn transform [in out]
      (with-open [r (io/input-stream in)
                  w (io/writer out)]
        (binding [*out* w]  ;; bind *out* to the writer, not the file name
          (let [xml (data.xml/parse r :coalescing false)]
            (doseq [[time user title] (revision-seq xml)]
              (println (str "\"" time "\";\"" user "\";\"" title "\"\n")))))))

    (transform "dump.xml" "data.csv")
I donโt see anything here, which can lead to excessive memory usage.