Should imperative-style recursive file system algorithms be handled?

Question

Should imperative-style recursive file system algorithms be handled?

I just finished reading Concurrency Programming on the JVM by Venkata Subramaniyama, and in this book the author uses one of his examples, counting the file sizes in the directory tree. It shows implementations without using concurrency, using queues, using latch, and using scala members. In my system, all parallel implementations (queues, latches, and scala actors) can work in less than 9 seconds when repeated through the / usr directory (OSX 10.6.8, Core Duo 2 Ghz, Intel G1 ssd 160GB).

I am studying Clojure and decided that I am transferring the scala actor version to Clojure using agents. Unfortunately, I averaged 11-12 seconds, which is much slower than others. After spending DAYS pulling my hair out, I found that the next bit of code was the culprit (processFile is a function that I send to the file processing agent (s):

(defn processFile [fileProcessor collectorAgent ^String fileName] (let [^File file-obj (File. ^String fileName) fileTotals (transient {:files 0, :bytes 0})] (cond (.isDirectory file-obj) (do (doseq [^File dir (.listFiles file-obj) :when (.isDirectory dir)] (send collectorAgent addFileToProcess (.getPath dir))) (send collectorAgent tallyResult *agent*) (reduce (fn [currentTotal newItem] (assoc! currentTotal :files (inc (:files currentTotal)) :bytes (+ (:bytes currentTotal) newItem))) fileTotals (map #(.length ^File %) (filter #(.isFile ^File %) (.listFiles file-obj)))) (persistent! fileTotals)) (.isFile file-obj) (do (send collectorAgent tallyResult *agent*) {:files 1, :bytes (.length file-obj)}))))

You will notice that I tried using type hints and transients to improve performance, but to no avail. I replaced the above code with the following:

 (defn processChildren [children] (loop [entries children files 0 bytes 0 dirs '()] (let [^File child (first entries)] (cond (not (seq entries)) {:files files, :bytes bytes, :dirs dirs} (.isFile child) (recur (rest entries) (inc files) (+ bytes (.length child)) dirs) (.isDirectory child) (recur (rest entries) files bytes (conj dirs child)) :else (recur (rest entries) files bytes dirs))))) (defn processFile [fileProcessor collectorAgent ^String fileName] (let [{files :files, bytes :bytes, dirs :dirs} (processChildren (.listFiles (File. fileName)))] (doseq [^File dir dirs] (send collectorAgent addFileToProcess (.getPath dir))) (send collectorAgent tallyResult *agent*) {:files files, :bytes bytes}))

This version runs in pairs, if not faster than the scala version and is almost identical to the algorithm used in the scala version. I just assumed that a functional approach to the algorithm would work just as well.

So ... this lengthy question boils down to the following: why is the second version faster?

My hypothesis is that although the first version using map / filter / reduce in the contents of the directory is more functional than the second version and does not necessarily process the directory, it is much less efficient because the contents of the directory are repeated several times. Because file system I / Os are slow, the entire program suffers.

Assuming I'm right, is it possible to say with certainty that any recursive file system algorithm should prefer an imperative approach to performance?

I am new to Clojure, so feel free to copy my code to shreds if I am doing something stupid or non-idiomatic.

+6

clojure

acarlow Jul 07 '11 at 6:38

source share

1 answer

amalloy · Accepted Answer · 2011-07-07T07:51:43+0000

I edited the first version to make it more readable. I have a few comments, but no final helpful sayings:

You added transients and hints without any real evidence that slowed down. It is completely possible to drastically slow down the work with the careless use of these operations, so it's a good idea to consult to find out what actually slows down the work. Your options seem reasonable, but I removed the types that obviously had no effect (for example, the compiler does not need any hint to know what (File ...) gives the File object).
Clojure (indeed, any lisp) strongly prefers some-agent to someAgent . The prefix syntax means you don’t have to worry about that - can be analyzed as a subtraction by the compiler, so we can afford more well-distributed names.
You include calls to a bunch of functions that you don't define here at all, like tallyResult and addFileToProcess. Presumably, they execute perfectly, because you use them in the executable version, but apart from them, you made it difficult for someone else to argue with him and see what speeds up the work.
Consider sending instead of sending for I / O-bound operations: send uses a limited threadpool to avoid loading your processor. This probably doesn’t matter here, since you use only one agent and serialize it, but in the future you will encounter situations when it is important.

In any case, as promised, a more understandable rewrite of your first version:

 (defn process-file [_ collector-agent ^String file-name] (let [file-obj (File. file-name) file-totals (transient {:files 0, :bytes 0})] (cond (.isDirectory file-obj) (do (doseq [^File dir (.listFiles file-obj) :when (.isDirectory dir)] (send collector-agent addFileToProcess (.getPath dir))) (send collector-agent tallyResult *agent*) (reduce (fn [current-total new-item] (assoc! current-total :files (inc (:files current-total)) :bytes (+ (:bytes current-total) new-item))) file-totals (map #(.length ^File %) (filter #(.isFile ^File %) (.listFiles file-obj)))) - (persistent! file-totals)) (.isFile file-obj) (do (send collector-agent tallyResult *agent*) {:files 1, :bytes (.length file-obj)}))))

Edit: you are using transients incorrectly, discarding the result of your reduction. (assoc! mkv) allowed to modify and return the object m , but may return another object if it is more convenient or efficient. So you need something more like (persistent! (reduce ...))

Should imperative-style recursive file system algorithms be handled?

More articles: