Grand Central Dispatch strategy for opening multiple files

I have a working implementation using Grand Central Dispatch queues that (1) opens a file and computes its OpenSSL SHA hash on "queue1", then (2) writes the hash to a new "sidecar" file for later verification on "queue2".

I would like to open several files at the same time, but with some limiting logic so that the OS doesn't choke — for instance, opening a 100th file would exceed the sustainable throughput of the hard drive. Photo viewer apps like iPhoto or Aperture seem to open several files and display them at once, so I assume it can be done.

I assume the biggest bottleneck will be disk I/O, since an application can (theoretically) read and write multiple files at once.

Any suggestions?

TIA

+4
5 answers

You are correct that you will be I/O bound, of course. And it will be compounded by the random access that results from opening several files and actively reading them at the same time.

Thus, you need to strike a balance. Most likely, one file at a time is not the most efficient, as you noticed.

Personally?

I would use a dispatch semaphore.

Something like:

    @property(nonatomic, assign) dispatch_queue_t dataQueue;
    @property(nonatomic, assign) dispatch_semaphore_t execSemaphore;

and

    - (void)process:(NSData *)d
    {
        dispatch_async(self.dataQueue, ^{
            if (!dispatch_semaphore_wait(self.execSemaphore, DISPATCH_TIME_FOREVER)) {
                dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
                    ... do calculation work here on d ...
                    dispatch_async(dispatch_get_main_queue(), ^{
                        ... update main thread w/ new data here ...
                    });
                    dispatch_semaphore_signal(self.execSemaphore);
                });
            }
        });
    }

And it would be kicked off with something like:

    self.dataQueue = dispatch_queue_create("com.yourcompany.dataqueue", NULL);
    self.execSemaphore = dispatch_semaphore_create(3);
    [self process: ...];
    [self process: ...];
    [self process: ...];
    [self process: ...];
    [self process: ...];
    .... etc ....

You will need to decide how best to handle the queueing. If there are many items and there is a notion of cancellation, enqueueing everything up front is likely to be wasteful. Similarly, you probably want to enqueue URLs to the files to be processed, rather than NSData objects as above.

In any case, the above will process at most three items concurrently, regardless of how many have been queued.

+7

I would use NSOperation for this, because of how easily it handles both dependencies and cancellation.

I would create one operation each to read the data file, calculate the hash of the data file, and write the sidecar file. I would make each write operation dependent on its associated calculation operation, and each calculation operation dependent on its associated read operation.

Then I would add the read and write operations to a single NSOperationQueue, the "I/O queue", with a restricted width. The calculation operations I would add to a second NSOperationQueue, the "computation queue", with unrestricted width.

The reason for the restricted width of the I/O queue is that your work will most likely be I/O bound; you may want its width to be greater than 1, but the right value will most likely be directly related to the number of physical disks your input files live on (perhaps something like 2x the disk count; you will want to determine this experimentally).

The code will look something like this:

    @implementation FileProcessor

    static NSOperationQueue *FileProcessorIOQueue = nil;
    static NSOperationQueue *FileProcessorComputeQueue = nil;

    + (void)initialize
    {
        if (self == [FileProcessor class]) {
            FileProcessorIOQueue = [[NSOperationQueue alloc] init];
            [FileProcessorIOQueue setName:@"FileProcessorIOQueue"];
            [FileProcessorIOQueue setMaxConcurrentOperationCount:2]; // limit width

            FileProcessorComputeQueue = [[NSOperationQueue alloc] init];
            [FileProcessorComputeQueue setName:@"FileProcessorComputeQueue"];
        }
    }

    - (void)processFilesAtURLs:(NSArray *)URLs
    {
        for (NSURL *URL in URLs) {
            __block NSData *fileData = nil;     // set by readOperation
            __block NSData *fileHashData = nil; // set by computeOperation

            // Create operations to do the work for this URL
            NSBlockOperation *readOperation = [NSBlockOperation blockOperationWithBlock:^{
                fileData = CreateDataFromFileAtURL(URL);
            }];
            NSBlockOperation *computeOperation = [NSBlockOperation blockOperationWithBlock:^{
                fileHashData = CreateHashFromData(fileData);
                [fileData release]; // created in readOperation
            }];
            NSBlockOperation *writeOperation = [NSBlockOperation blockOperationWithBlock:^{
                WriteHashSidecarForFileAtURL(fileHashData, URL);
                [fileHashData release]; // created in computeOperation
            }];

            // Set up dependencies between operations
            [computeOperation addDependency:readOperation];
            [writeOperation addDependency:computeOperation];

            // Add operations to appropriate queues
            [FileProcessorIOQueue addOperation:readOperation];
            [FileProcessorComputeQueue addOperation:computeOperation];
            [FileProcessorIOQueue addOperation:writeOperation];
        }
    }

    @end

It is pretty straightforward; instead of dealing with interleaved sync/async layers as with the dispatch_* API, NSOperation lets you define your units of work and the dependencies between them independently. In some situations this can be easier to understand and debug.

+6

You have already gotten great answers, but I wanted to add a couple of points. I have worked on projects that enumerate all the files in a file system and calculate the MD5 and SHA1 hashes of each file (in addition to other processing). If you are doing something similar, where you are walking a large number of files and the files may have arbitrary content, there are some points to consider:

  • As already noted, you will be I/O bound. If you read more than one file at a time, each computation will take a performance hit. Obviously, the point of scheduling computations in parallel is to keep the disk busy between files, but you may want to consider structuring your work differently. For example, set up one thread that enumerates and opens the files, and a second thread that takes open file handles from the first thread one at a time and processes them. The file system will have cached the catalog information, so the enumeration will not seriously impact reading the data, which actually has to hit the disk.

  • If the files can be arbitrarily large, Chris’s approach may be impractical, since the entire contents are read into memory.

  • If you have no use for the data other than computing the hash, I suggest disabling file system caching before reading the data.

If you are using NSFileHandle, a simple category method will do this:

    @interface NSFileHandle (NSFileHandleCaching)
    - (BOOL)disableFileSystemCache;
    @end

    #include <fcntl.h>

    @implementation NSFileHandle (NSFileHandleCaching)
    - (BOOL)disableFileSystemCache
    {
        return (fcntl([self fileDescriptor], F_NOCACHE, 1) != -1);
    }
    @end
  • If the sidecar files are small, you may want to accumulate them in memory and write them out in batches to minimize interruption of your processing.

  • The file system (at least HFS) stores the file records for the files within a directory sequentially, so traverse the file system breadth-first (i.e., process each file in a directory before descending into its subdirectories).

The above are only suggestions, of course. You will want to experiment and measure performance to confirm the actual impact.

+6

libdispatch actually provides an API explicitly for this! Check out dispatch_io; it will parallelize the I/O when appropriate, and otherwise serialize it to avoid thrashing the disk.

+2

The following is a link to a BitBucket project I set up using NSOperation and Grand Central Dispatch, underlying a primitive file-integrity application.

https://bitbucket.org/torresj/hashar-cocoa

I hope it is of help/use.

+1

Source: https://habr.com/ru/post/1333378/
