I just finished reading Concurrency Programming on the JVM by Venkata Subramaniyama, and in this book the author uses one of his examples, counting the file sizes in the directory tree. It shows implementations without using concurrency, using queues, using latch, and using scala members. In my system, all parallel implementations (queues, latches, and scala actors) can work in less than 9 seconds when repeated through the / usr directory (OSX 10.6.8, Core Duo 2 Ghz, Intel G1 ssd 160GB).
I am studying Clojure and decided that I am transferring the scala actor version to Clojure using agents. Unfortunately, I averaged 11-12 seconds, which is much slower than others. After spending DAYS pulling my hair out, I found that the next bit of code was the culprit (processFile is a function that I send to the file processing agent (s):
(defn processFile [fileProcessor collectorAgent ^String fileName] (let [^File file-obj (File. ^String fileName) fileTotals (transient {:files 0, :bytes 0})] (cond (.isDirectory file-obj) (do (doseq [^File dir (.listFiles file-obj) :when (.isDirectory dir)] (send collectorAgent addFileToProcess (.getPath dir))) (send collectorAgent tallyResult *agent*) (reduce (fn [currentTotal newItem] (assoc! currentTotal :files (inc (:files currentTotal)) :bytes (+ (:bytes currentTotal) newItem))) fileTotals (map
You will notice that I tried using type hints and transients to improve performance, but to no avail. I replaced the above code with the following:
(defn processChildren [children] (loop [entries children files 0 bytes 0 dirs '()] (let [^File child (first entries)] (cond (not (seq entries)) {:files files, :bytes bytes, :dirs dirs} (.isFile child) (recur (rest entries) (inc files) (+ bytes (.length child)) dirs) (.isDirectory child) (recur (rest entries) files bytes (conj dirs child)) :else (recur (rest entries) files bytes dirs))))) (defn processFile [fileProcessor collectorAgent ^String fileName] (let [{files :files, bytes :bytes, dirs :dirs} (processChildren (.listFiles (File. fileName)))] (doseq [^File dir dirs] (send collectorAgent addFileToProcess (.getPath dir))) (send collectorAgent tallyResult *agent*) {:files files, :bytes bytes}))
This version runs in pairs, if not faster than the scala version and is almost identical to the algorithm used in the scala version. I just assumed that a functional approach to the algorithm would work just as well.
So ... this lengthy question boils down to the following: why is the second version faster?
My hypothesis is that although the first version using map / filter / reduce in the contents of the directory is more functional than the second version and does not necessarily process the directory, it is much less efficient because the contents of the directory are repeated several times. Because file system I / Os are slow, the entire program suffers.
Assuming I'm right, is it possible to say with certainty that any recursive file system algorithm should prefer an imperative approach to performance?
I am new to Clojure, so feel free to copy my code to shreds if I am doing something stupid or non-idiomatic.