Clojure head retention in doseq, run! loops

Clojure beginner / intermediate here,

I have a large XML file (~240 MB) that I need to process lazily, element by element, for ETL purposes. There is a run-processing function that does a lot of side-effecting work: it talks to the db, writes logs, and so on.

When I apply the specified function to the file, everything runs smoothly:

 ...
 (with-open [source (-> "in.xml" io/file io/input-stream)]
   (-> source
       xml/parse
       ((fn [x]
          ;; runs fine
          (run-processing conn config x)))))

But when I put the same function call inside any loop (e.g. doseq), I get an OutOfMemoryError (GC overhead limit exceeded).

 ...
 (with-open [source (-> "in.xml" io/file io/input-stream)]
   (-> source
       xml/parse
       ((fn [x]
          ;; throws OOM GC overhead exception
          (doseq [i [0]]
            (run-processing conn config x))))))

I don't understand where the head retention occurs that causes the GC overhead exception. I already tried run! and even loop/recur instead of doseq, and the same thing happens.

Is there something wrong with my run-processing function? Then why does it behave normally when I call it directly? I'm rather lost; any help is appreciated.

+5
3 answers

To understand why your doseq version does not work, we first need to understand why the plain (run-processing conn config x) does:

The magic of Clojure here is locals clearing: the compiler analyzes the code, and as soon as a local binding is used for the last time, it sets the binding to (Java) null before evaluating that expression. So for

 (fn [x] (run-processing conn config x))

x will be cleared before run-processing starts. Note: you can get the same OOM error by disabling locals clearing (the :disable-locals-clearing compiler option).

Now what happens when you write:

 (doseq [_ [0]] (run-processing conn config x))

How should the compiler know when x is used for the last time, so that it can clear it? It cannot know: x is used inside a loop. Therefore x is never cleared, and it holds the head.

Note: a smart JVM implementation may change this in the future, once it realizes that the local can no longer be accessed by the calling function and makes the binding eligible for garbage collection. Current implementations are not that smart, though.

Of course, this is easy to fix: do not use x inside the loop. Use other constructs such as run!, which is just a function call, so the local is properly cleared before run! is invoked. However, if you pass the head of a seq to a function, that function will hold on to the head until the function (closure) goes out of scope.
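One way to keep x out of the loop when several passes are needed is to build a fresh lazy seq inside each iteration and hand it straight to the consumer. The following is only a sketch under stated assumptions: run-processing and open-source are hypothetical stand-ins, an atom plays the role of the db connection, and line-seq over an in-memory reader replaces xml/parse on the real file.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical stand-in for the question's run-processing: consumes the
;; whole seq for side effects and returns nil.
(defn run-processing [conn config items]
  (run! (fn [item] (swap! conn conj [config item])) items))

;; Hypothetical stand-in for opening in.xml: every call returns a new reader,
;; so every pass starts from its own lazy seq.
(defn open-source []
  (BufferedReader. (StringReader. "a\nb\nc")))

(defn process-all! [conn configs]
  (doseq [config configs]
    (with-open [source (open-source)]
      ;; The lazy seq is created inside the loop body and passed directly as
      ;; an argument; no local outside the loop is bound to its head.
      (run-processing conn config (line-seq source)))))

(def conn (atom []))
(process-all! conn [:first-pass :second-pass])
```

The trade-off is that the input is read once per pass, which is usually cheaper than keeping every parsed element alive in memory.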

+4

While I do not know exactly what is causing the OOM, I would like to offer some general suggestions and elaborate on our discussion in the comments.

So the sequence is kept in memory when I use some kind of loop, but not if I call run-processing directly? But the doseq docstring clearly states that it "does not retain the head of the sequence". Then what should I do when I need to call run-processing several times (for example, with different arguments)?

So this is our function:

 (defn process-file! [conn config name]
   (with-open [source (io/input-stream (io/file name))]
     (-> (xml/parse source)
         ((fn [x]
            (doseq [i [0]]
              (run-processing conn config x)))))))

where x is a lazy seq (if you are using data.xml), something like:

 x <- xml iterator <- file stream 

If run-processing does everything right (consumes x completely and returns nil), there is nothing wrong with it; the problem is the x binding itself. While run-processing works, it fully realizes the sequence of which x is the head.

 (defn process-xml! [conn config x]
   (run-processing conn config x)
   ;; X IS FULLY REALIZED IN MEMORY
   (run-reporting conn config x))

 (defn process-file! [conn config name]
   (with-open [source (io/input-stream (io/file name))]
     (->> (xml/parse source)
          (process-xml! conn config))))

As you can see, we do not consume the file element by element and discard each element right away, all because of x. doseq has nothing to do with this: it "does not retain the head of the sequence" that it consumes, which in our case is [0].


This approach is not very idiomatic for two reasons:

1. run-processing does too much

It knows where the data comes from, what shape the data has, and what to do with it. A more typical process-file! would look like this:

 (defn process-file! [conn config name]
   (with-open [source (io/input-stream (io/file name))]
     (->> (xml/parse source)
          (find-item-nodes)
          (map node->item)
          (run! (partial process-item! conn config)))))

This is not always viable and does not suit every use case, but there is another reason to do it this way.
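To make the shape of that pipeline concrete, here is a minimal sketch that runs without a file or a database. All three helpers are hypothetical (the answer only names them), the parsed document is imitated with plain maps using :tag, :attrs and :content keys, and an atom stands in for conn.

```clojure
;; Hypothetical helper: pick the child elements we care about.
(defn find-item-nodes [root]
  (filter #(= :item (:tag %)) (:content root)))

;; Hypothetical helper: turn an element into a plain data item.
(defn node->item [node]
  {:id (get-in node [:attrs :id])})

;; Hypothetical helper: the side-effecting step, one item at a time.
(defn process-item! [conn config item]
  (swap! conn conj (:id item)))

(defn process-root! [conn config root]
  (->> root
       (find-item-nodes)
       (map node->item)
       (run! (partial process-item! conn config))))

(def root {:tag :items
           :content [{:tag :item :attrs {:id "a"}}
                     {:tag :note :attrs {}}
                     {:tag :item :attrs {:id "b"}}]})
(def conn (atom []))
(process-root! conn {} root)
;; @conn now holds ["a" "b"]
```

Each stage only sees one element at a time, and nothing in process-root! is bound to the head of the intermediate lazy seqs.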

2. process-file! should (ideally) never give x away to anyone

This is immediately apparent from looking at the source code: with-open. query from clojure.java.jdbc is a good example. It obtains a ResultSet, maps it to pure Clojure data structures, and forces it to be fully read (using a result-set-fn of doall) in order to free the connection.

Note that it never leaks the ResultSet, and the only way to get at the result seq is result-set-fn, which is a "callback": query wants to control the life cycle of the ResultSet and make sure it is closed once query returns. Otherwise it is too easy to make a similar mistake.

(But we can still make that mistake if we pass it a function similar to process-xml! as result-set-fn.)
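The callback style described above can be sketched without JDBC. In this illustration with-lines is a hypothetical helper, not part of any library: it owns the reader, exposes the lazy seq only to the callback, and closes the reader when the callback returns.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: make-reader is a zero-argument function returning a
;; BufferedReader; f receives the lazy seq of lines and must finish consuming
;; it before returning, because the reader is closed right after.
(defn with-lines [make-reader f]
  (with-open [rdr (make-reader)]
    (f (line-seq rdr))))

(def line-count
  (with-lines #(BufferedReader. (StringReader. "a\nb\nc"))
              (fn [lines] (count lines))))
```

The caller never holds the seq itself, only whatever the callback chose to return, so the helper decides when the underlying resource is released.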


Replies to comments

As I said, I cannot tell exactly what causes the OOM. It could be:

  • run-processing itself. The JVM is already short of memory anyway, and adding a simple doseq tips it into OOM. That is why I suggested slightly increasing the heap size as a test.

  • Clojure optimizing away the x binding.

  • (fn [x] (run-processing conn config x)) simply being inlined by the JVM, which in turn removes the problem with the x binding.

So why does wrapping it in doseq make x retain the head? In my examples I do not use x more than once (unlike your "run-processing x THEN run-report on SAME x").

The root of the problem is not the reuse of x but the mere fact that x exists. Let's make a simple lazy seq:

 (let [x (range 1 1e6)]) 

(Let's forget that range is implemented as a Java class.)

What is x? x is the head of a lazy seq, which is a recipe for constructing the next value.

 x = (recipe) 

Let's advance it:

 (let [x (range 1 1e6)
       y (drop 5 x)
       z (first y)])

Now x, y and z are:

 x = (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (recipe)
 y = (6) -> (recipe)
 z = 6

Hopefully now you can see what I mean by "x is the head of the seq, and run-processing realizes it".
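The same effect can be observed at the REPL with realized?. This is a small sketch, not the original example: it uses (map inc (range 10)) instead of a bare range, because realized? needs a real lazy seq and range is implemented as a Java class.

```clojure
(def observed
  (let [x      (map inc (range 10)) ; lazy seq (1 2 ... 10), nothing computed yet
        before (realized? x)        ; checked before anything walks x
        y      (drop 5 x)           ; still lazy
        z      (first y)]           ; forces y, which walks x from its head
    {:before before :after (realized? x) :z z}))
;; => {:before false, :after true, :z 6}
```

Asking for the first element of y was enough to realize x, and as long as the x binding is alive, everything realized from it stays reachable.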

About "process-file! should (ideally) never give x to anyone": correct me if I'm wrong, but doesn't mapping to pure Clojure data structures with doall make them reside in memory, which would be bad if the file is too large (as in my case)?

process-file! does not use doall. run! is a reduce and returns nil.
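The difference between the two can be shown with a small sketch (the names seen, result and kept are only for illustration): run! walks the seq for side effects and discards each result, while doall returns the fully realized seq, so whoever holds that return value holds every element.

```clojure
(def seen (atom 0))

;; run! calls the function on each element and returns nil.
(def result (run! (fn [_] (swap! seen inc)) (map inc (range 1000))))

;; doall realizes the seq and returns it, so kept refers to all 1000 elements.
(def kept (doall (map inc (range 1000))))
```

Here result is nil and @seen is 1000, whereas kept keeps the whole seq reachable for as long as the var exists.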

+2

Can you post a specific example even if it is too small to throw an OOM exception?

The first thing I see is that you create a function using (fn [x] ...) , and then immediately call it with a second pair of parentheses:

 (-> source
     xml/parse
     ((fn [x]
        ;; runs fine
        (run-processing conn config x)))))

It looks very strange to me. Why do you structure your code this way?

In the failed doseq example, you have the same structure:

 (-> source
     xml/parse
     ((fn [x]
        ;; throws OOM GC overhead exception
        (doseq [i [0]]
          (run-processing conn config x))))))

You will also notice that the upper bound in the doseq is a singleton vector with a strange symbol inside. Is it meant to be "infinity" or something else? If so, why is it wrapped in a vector? This seems like a problem (or possibly a bug in clojure.core), since a doseq over a singleton vector should execute exactly once.

Another point: the loop variable i is never used. Is that intentional? This seems very different from the first (working) example.

Furthermore, it is possible that (depending on the details of your code) there is some interaction between creating a function that contains a doseq and then calling it immediately.

Update:

Instead of the (fn [x] ...) form, I would write it like this:

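 (-> source
     xml/parse
     ;; the extra parentheses make -> call the function
     ;; instead of threading into the #() form itself
     (#(run-processing conn config %))))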

or

 (->> source ; note "thread-last" macro
      xml/parse
      (run-processing conn config)))

Perhaps with the doseq you intended something more like this:

 (-> source
     xml/parse
     ;; wrapped in parentheses so that -> calls the function
     (#(doseq [single-item %]
         (run-processing conn config single-item))))

However, in this case we call run-processing many times, one element at a time, whereas before we called run-processing once and passed it the whole lazy result of xml/parse.

-1

Source: https://habr.com/ru/post/1275509/

