Should I collect all the files first and then process them, or process them during the traversal?

I need to go through all the files in a specific folder (and its subfolders) and execute something on each file. Looking for a way to traverse all files recursively, I found one solution in Apache Commons IO: FileUtils.iterateFiles. It returns an iterator. I checked how it is implemented and saw that it walks through all the files, adds them to a collection, and then returns an iterator over that collection. Well, that does do what I was looking for :)
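For context, a minimal sketch of how FileUtils.iterateFiles is called (null for the extensions argument means all files, true means recurse into subfolders; the folder path is just a placeholder):

    import java.io.File;
    import java.util.Iterator;
    import org.apache.commons.io.FileUtils;

    public class IterateExample {
        public static void main(String[] args) {
            // Iterate over every file under /some/folder, recursively.
            Iterator<File> files = FileUtils.iterateFiles(new File("/some/folder"), null, true);
            while (files.hasNext()) {
                File file = files.next();
                System.out.println(file.getAbsolutePath());
            }
        }
    }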

But then I thought: is it efficient to collect all the files first and then iterate over them to do what I want? Or should I, instead of collecting them, simply perform the action on each file during the recursive traversal?
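For illustration, a minimal sketch of the second option, acting on each file as it is visited instead of collecting anything first (this uses java.nio.file.Files.walkFileTree; processFile is a hypothetical placeholder for the per-file action):

    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.SimpleFileVisitor;
    import java.nio.file.attribute.BasicFileAttributes;

    public class WalkAndProcess {
        public static void main(String[] args) throws IOException {
            Files.walkFileTree(Paths.get("/some/folder"), new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    processFile(file); // handle each file as it is found; no collection is built
                    return FileVisitResult.CONTINUE;
                }
            });
        }

        // Hypothetical placeholder for whatever needs to be done per file.
        private static void processFile(Path file) {
            System.out.println("Visiting " + file);
        }
    }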

It should be noted that the actions I need to perform on the files involve I/O operations on them, which may fail (failures can be handled with either approach; I only mention it in case I have missed something in my reasoning). Also, the set of folders and files I am traversing MAY reach about 400 folders or 5000 files, and file sizes can reach several gigabytes (again, that matters little for just listing the files, but it is relevant because I intend to perform I/O on them).

Any thoughts?

thanks.

+4
3 answers

You should start traversing the file system, create a Runnable/Callable implementation of whatever you would like to do with these files, and, for each file found, submit a task to a thread pool (you can create one via Executors).

In this case you should probably use a fixed thread pool. The appropriate size may vary; you should test and see how the number of threads working on your files affects performance.
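A minimal sketch of this suggestion, combining the iterator from the question with a fixed thread pool (the pool size of 4 and the processFile method are placeholders to tune and fill in):

    import java.io.File;
    import java.util.Iterator;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.commons.io.FileUtils;

    public class PooledFileProcessor {
        public static void main(String[] args) throws InterruptedException {
            // Fixed pool; measure how the thread count affects throughput.
            ExecutorService pool = Executors.newFixedThreadPool(4);

            Iterator<File> files = FileUtils.iterateFiles(new File("/some/folder"), null, true);
            while (files.hasNext()) {
                File file = files.next();
                pool.submit(() -> processFile(file)); // one Runnable per file found
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        // Hypothetical placeholder for the per-file I/O work.
        private static void processFile(File file) {
            System.out.println("Processing " + file);
        }
    }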

0

Reading from disk is slow and expensive, so the best approach is to use multiple threads: that way you do not sit idle waiting on the I/O that reads a file's contents. Once a file has been read, the reader/writer thread can block while another thread processes the content; when the reader/writer thread resumes, it writes the results back to disk.

To answer your question: it is not feasible to load all the files (and their contents) at once and then process them, because of memory limits. Use multithreading to process several files at a time, or use MapReduce, depending on the task.
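A minimal sketch of the reader/worker split this answer describes, with a bounded BlockingQueue handing file contents from a reading thread to a processing thread (the folder path and processContent are hypothetical placeholders; a real version would also need a proper shutdown protocol and would stream rather than read whole multi-gigabyte files into memory):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ReaderWorkerPipeline {
        // Bounded queue, so the reader blocks instead of exhausting memory.
        private static final BlockingQueue<byte[]> QUEUE = new ArrayBlockingQueue<>(10);

        public static void main(String[] args) {
            Thread reader = new Thread(() -> {
                try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("/some/folder"))) {
                    for (Path file : dir) {
                        if (Files.isRegularFile(file)) {
                            QUEUE.put(Files.readAllBytes(file)); // blocks while the queue is full
                        }
                    }
                } catch (IOException | InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        processContent(QUEUE.take()); // blocks until content arrives
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            worker.setDaemon(true); // this sketch omits orderly shutdown of the worker
            reader.start();
            worker.start();
        }

        // Hypothetical placeholder for processing one file's content.
        private static void processContent(byte[] content) {
            System.out.println("Got " + content.length + " bytes");
        }
    }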

0

It probably depends on how big the list is. If holding the list in memory is not a problem, I would fill the list before working on the files. The reason is quite simple: on the one hand, scanning a directory tree is usually fast because of how the file system is organized; on the other hand, you should probably work on one file at a time, sequentially, to get the best performance (if you multithread and work on many files at the same time, your disk will slow you down).
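A minimal sketch of this collect-first approach, using the Commons IO sibling of the method from the question (FileUtils.listFiles returns the whole collection up front; only File objects are held, not file contents, so a few thousand entries are cheap; processFile is a hypothetical placeholder):

    import java.io.File;
    import java.util.Collection;
    import org.apache.commons.io.FileUtils;

    public class SequentialFileProcessor {
        public static void main(String[] args) {
            // Collect the complete file list first.
            Collection<File> files = FileUtils.listFiles(new File("/some/folder"), null, true);

            // Then work through it one file at a time, keeping disk access sequential.
            for (File file : files) {
                processFile(file);
            }
        }

        // Hypothetical placeholder for the per-file work.
        private static void processFile(File file) {
            System.out.println("Processing " + file.getAbsolutePath());
        }
    }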

0

Source: https://habr.com/ru/post/1391308/

