Process files concurrently as they arrive in C#

I have an application that works great for processing files dropped into a directory on my server. The process:

1) check for files in a directory
2) queue a user work item to handle each file in the background
3) wait until all workers have completed
4) go to 1
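In code, the loop above looks roughly like this (a sketch: `ProcessFile` stands in for the real per-file work, and it just counts here so the behaviour is observable; the `CountdownEvent` for step 3 is an assumption):

```csharp
using System;
using System.IO;
using System.Threading;

public static class BlockingScanner
{
    public static int Processed;   // observable stand-in for real work

    // Placeholder for the real per-file handler.
    public static void ProcessFile(string path)
    {
        Interlocked.Increment(ref Processed);
    }

    public static void ScanOnce(string directory)
    {
        string[] files = Directory.GetFiles(directory);  // step 1
        if (files.Length == 0) return;

        using (var allDone = new CountdownEvent(files.Length))
        {
            foreach (string file in files)
            {
                // Step 2: one work item per file.
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try { ProcessFile((string)state); }
                    finally { allDone.Signal(); }
                }, file);
            }
            // Step 3: block until every worker finishes — one slow file
            // stalls the next scan of the directory.
            allDone.Wait();
        }
    }
}
```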

This works well, and I never have to worry about the same file being processed twice or multiple threads being created for the same file. However, if one file takes too long to process, step 3 blocks on that one file and holds up all other processing.

So my question is: what is the correct paradigm for spawning exactly one thread per file that needs processing, without blocking when one file takes too long? I considered FileSystemWatcher, but the files may not be readable right away, so I continually poll all the files and kick off processing for each one (the worker exits immediately if the file is locked).

Should I eliminate step 3 and keep a list of the files I have already processed? That seems messy, and the list would grow very large over time, so I suspect there is a more elegant solution.

+6
4 answers

I would suggest maintaining a list of the files you are currently processing. Remove a file from that list when its thread finishes. When scanning for new files, skip any that are already in the list.
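A minimal sketch of this idea, assuming a `ConcurrentDictionary` as the thread-safe "currently processing" list and a placeholder `ProcessFile`:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

public static class InFlightScanner
{
    // Files currently being processed; the bool value is unused.
    static readonly ConcurrentDictionary<string, bool> InFlight =
        new ConcurrentDictionary<string, bool>(StringComparer.OrdinalIgnoreCase);

    public static void ProcessFile(string path) { /* real work here */ }

    // Returns how many new workers this scan started.
    public static int ScanOnce(string directory)
    {
        int started = 0;
        foreach (string file in Directory.GetFiles(directory))
        {
            // TryAdd fails if the file is already being processed,
            // so each file gets at most one worker.
            if (!InFlight.TryAdd(file, true)) continue;
            started++;
            ThreadPool.QueueUserWorkItem(state =>
            {
                try { ProcessFile(file); }
                finally { InFlight.TryRemove(file, out _); }  // done: allow rescans
            });
        }
        return started;  // note: no "wait for all" step — a slow file blocks nothing
    }
}
```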

+6

Move the files into a processing directory before starting the threads. Then you can fire and forget the threads, and any administrator can see at a glance what is being processed.
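A sketch of that approach (the `processing` directory and `ProcessFile` are placeholders); the move is what claims the file, so nothing is ever picked up twice:

```csharp
using System;
using System.IO;
using System.Threading;

public static class MoveThenProcess
{
    public static void ProcessFile(string path) { /* real work here */ }

    public static void ScanOnce(string inbox, string processing)
    {
        Directory.CreateDirectory(processing);
        foreach (string file in Directory.GetFiles(inbox))
        {
            string claimed = Path.Combine(processing, Path.GetFileName(file));
            try
            {
                // On the same volume the move is atomic: whichever scan
                // moves the file first owns it.
                File.Move(file, claimed);
            }
            catch (IOException)
            {
                continue;  // still being written, or claimed elsewhere
            }
            // Fire and forget: a slow file never delays the next scan, and
            // admins can list the 'processing' folder to see in-flight work.
            ThreadPool.QueueUserWorkItem(_ => ProcessFile(claimed));
        }
    }
}
```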

+3

Spawning one thread per work item is almost never a good approach. In your case, once the number of files exceeds a few hundred, one thread per file would make the application perform very poorly, and in a 32-bit process you would start running out of address space.

Dark Falcon's list solution is quite simple and fits your algorithm. I would actually use a queue (like ConcurrentQueue - http://msdn.microsoft.com/en-us/library/dd267265.aspx ) to enqueue items for processing on one side (i.e. from the periodic directory scans) and have one or more threads dequeue and process them on the other. You usually want few threads (e.g. 1-2x the number of processor cores for CPU-heavy work).
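A sketch of this producer-consumer split, using a `BlockingCollection` (which wraps a `ConcurrentQueue` by default) and a placeholder `ProcessFile` that just counts:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class FileQueue
{
    // BlockingCollection over a ConcurrentQueue gives consumers a blocking take.
    static readonly BlockingCollection<string> Pending =
        new BlockingCollection<string>(new ConcurrentQueue<string>());

    public static int ProcessedCount;   // observable stand-in for real work

    public static void ProcessFile(string path)
    {
        Interlocked.Increment(ref ProcessedCount);
    }

    // Producer side: called from the periodic scan (or a watcher event).
    public static void Enqueue(string path) => Pending.Add(path);

    // Consumer side: a small, fixed number of workers
    // (roughly 1-2x the core count for CPU-bound work).
    public static Task[] StartConsumers(int count)
    {
        var workers = new Task[count];
        for (int i = 0; i < count; i++)
        {
            workers[i] = Task.Run(() =>
            {
                // Blocks until an item arrives; exits after CompleteAdding().
                foreach (string path in Pending.GetConsumingEnumerable())
                    ProcessFile(path);
            });
        }
        return workers;
    }

    public static void Shutdown() => Pending.CompleteAdding();
}
```

The key property is that a slow file only occupies one consumer; the rest keep draining the queue.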

Also consider using the Task Parallel Library (e.g. Parallel.ForEach - http://msdn.microsoft.com/en-us/library/dd989744.aspx ) to spread the work over multiple threads.
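For batch-style processing of one scan's worth of files, the TPL version is nearly a one-liner (the degree of parallelism and the counting `ProcessFile` are illustrative choices):

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class TplScan
{
    public static int Processed;   // observable stand-in for real work

    public static void ProcessFile(string path)
    {
        Interlocked.Increment(ref Processed);
    }

    public static void ScanOnce(string directory)
    {
        // The runtime partitions the files across a bounded number of
        // pool threads instead of one thread per file.
        Parallel.ForEach(
            Directory.GetFiles(directory),
            new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
            ProcessFile);
    }
}
```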

To minimize the number of files to examine, I would keep a persistent (i.e. on-disk) list of items already processed - the file path plus its last-modified date (if you cannot get this information from another source).
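One way to sketch such a persistent list, using a plain text file of `path|last-modified` keys (the file format and helper names here are assumptions, not part of the answer):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ProcessedLog
{
    // One "path|lastWriteUtcTicks" line per processed file.
    public static HashSet<string> Load(string logPath) =>
        File.Exists(logPath)
            ? new HashSet<string>(File.ReadAllLines(logPath))
            : new HashSet<string>();

    public static string KeyFor(string file) =>
        file + "|" + File.GetLastWriteTimeUtc(file).Ticks;

    public static void Append(string logPath, string key) =>
        File.AppendAllLines(logPath, new[] { key });

    // Keying on path + timestamp means a re-dropped (modified)
    // file is picked up again, while unchanged files are skipped.
    public static bool NeedsProcessing(HashSet<string> seen, string file) =>
        !seen.Contains(KeyFor(file));
}
```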

+3

My two main questions:

  • What is the size of the files?
  • How often will files arrive?

Depending on your answers, I would go with the following producer-consumer algorithm:

  • Use FileSystemWatcher to detect activity in the directory you are monitoring.
  • When activity occurs, start polling "lightly": check each available file to see whether it is locked (i.e. try to open it with read/write access using a simple IsLocked extension method that wraps the attempt in try..catch); if one or more files are not free, set a timer to fire after some interval (longer if you expect more files, shorter if fewer) and re-test them.
  • Once you see that a file is free, process it (i.e. move it to another folder, put an item on a concurrent queue, have your consumer threads work the queue, archive the file/results).
  • Have some kind of persistence mechanism, as Alex mentions (i.e. disk/database), so you can resume processing where you left off after a system crash.
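The IsLocked extension method mentioned above is not a framework API; a sketch of what it might look like:

```csharp
using System;
using System.IO;

public static class FileLockExtensions
{
    // True if the file cannot currently be opened for exclusive
    // read/write access (still being copied, or held by another process).
    public static bool IsLocked(this FileInfo file)
    {
        try
        {
            using (file.Open(FileMode.Open, FileAccess.ReadWrite, FileShare.None))
            {
                return false;
            }
        }
        catch (IOException)
        {
            return true;
        }
    }
}
```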

I feel this is a good combination of non-blocking, low-CPU behaviour, but measure the results before and after. I would recommend using the ThreadPool and trying not to block threads (i.e. ensure threads are reused and avoid blocking calls such as Thread.Sleep).

Notes:

  • Set the number of file-processing threads according to the number of processors and cores available on the machine; also take server load into account.
  • FileSystemWatcher can be unreliable; make sure it runs on the same machine as the directory you are monitoring (i.e. do not watch a remote server), otherwise you will need to re-initialize the connection from time to time.
  • I would definitely not spawn a separate process for each file; multiple threads should be sufficient, and thread reuse is better still. Spawning a process is a very, very expensive operation, and spawning a thread is an expensive one. Alex has good information on the Task Parallel Library; it uses the ThreadPool.
+1

Source: https://habr.com/ru/post/892544/

