Adding an AsParallel() call causes my code to fail when writing a file

I am creating a console application that needs to process a bunch of documents.

To stay simple, the process:

  • for each year between X and Y, query the database for the list of document links to process
  • for each of those links, process the local file

The process method is, I think, independent and should be parallelizable as long as the input arguments are different:

    private static bool ProcessDocument(DocumentsDataset.DocumentsRow d, string langCode)
    {
        try
        {
            var htmFileName = d.UniqueDocRef.Trim() + langCode + ".htm";
            var htmFullPath = Path.Combine(@"x:\path", htmFileName);

            bool missingHtmlFile = !File.Exists(htmFullPath);
            if (!missingHtmlFile)
            {
                var html = File.ReadAllText(htmFullPath);

                // ProcessHtml is quite long: it uses a regex to search for a list of references
                // (which are other documents), then sends the result to a custom web service
                ProcessHtml(ref html);

                File.WriteAllText(htmFullPath, html);
            }
            return true;
        }
        catch (Exception exc)
        {
            Trace.TraceError("{0,8}Fail processing {1} : {2}", "[FATAL]", d.UniqueDocRef, exc.ToString());
            return false;
        }
    }

To list my document, I have this method:

    private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
    {
        return Enumerable.Range(1990, 2020 - 1990)
            .AsParallel()
            .SelectMany(year => Document.FindAll((short)year).Documents);
    }

Document is a business class that handles document searches. The result of this method is a typed dataset (I return the Documents table). The method takes a year as input, and I am sure a given document cannot be returned for more than one year (the year is actually part of the key).

Note the use of AsParallel() here; I have never had a problem with it.

Now my main method:

    var documents = EnumerateDocuments();

    var result = documents.Select(d =>
    {
        bool success = true;
        foreach (var langCode in new string[] { "-e", "-f" })
        {
            success &= ProcessDocument(d, langCode);
        }
        return new { d.UniqueDocRef, success };
    });

    using (var sw = File.CreateText("summary.csv"))
    {
        sw.WriteLine("Level;UniqueDocRef");
        foreach (var item in result)
        {
            string level;
            if (!item.success)
                level = "[ERROR]";
            else
                level = "[OK]";
            sw.WriteLine("{0};{1}", level, item.UniqueDocRef);
            //sw.WriteLine(item);
        }
    }

This method works as expected in this form. However, if I replaced

  var documents = EnumerateDocuments(); 

by

  var documents = EnumerateDocuments().AsParallel(); 

it stops working, and I do not understand why.

The error occurs here (in my process method):

 File.WriteAllText(htmFullPath, html); 

The exception says the file is already in use by another process.

I do not understand what could make my program misbehave. Since my documents variable is an IEnumerable returning unique values, why does my process method break?
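For reference, here is a minimal, self-contained sketch of the symptom (file names and the temp path are illustrative): File.WriteAllText opens its target for exclusive write access, so if another worker still holds the same path open for writing, the call throws the "in use by another process" IOException.

```csharp
using System;
using System.IO;

class WriteConflictDemo
{
    static void Main()
    {
        var path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".htm");
        File.WriteAllText(path, "initial");

        // Simulate another worker that still holds the file open for writing:
        using (var other = new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.Read))
        {
            try
            {
                File.WriteAllText(path, "update"); // same call as in ProcessDocument
                Console.WriteLine("no conflict");
            }
            catch (IOException)
            {
                Console.WriteLine("IOException: file in use");
            }
        }
        File.Delete(path);
    }
}
```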

Thanks for any tips.

[Edit] Code for retrieving the documents:

    /// <summary>
    /// Get all documents in the data store
    /// </summary>
    public static DocumentsDS FindAll(short? year)
    {
        Database db = DatabaseFactory.CreateDatabase(connStringName); // MS Entlib
        DbCommand cm = db.GetStoredProcCommand("Document_Select");
        if (year.HasValue)
            db.AddInParameter(cm, "Year", DbType.Int16, year.Value);

        string[] tableNames = { "Documents", "Years" };

        DocumentsDS ds = new DocumentsDS();
        db.LoadDataSet(cm, ds, tableNames);

        return ds;
    }

[Edit2] A possible source of my problem, thanks to mquander. If I write:

    var test = EnumerateDocuments().AsParallel().Select(d => d.UniqueDocRef);
    var testGr = test.GroupBy(d => d)
        .Select(g => new { g.Key, Count = g.Count() })
        .Where(c => c.Count > 1);
    var testLst = testGr.ToList();

    Console.WriteLine(testLst.Where(x => x.Count == 1).Count());
    Console.WriteLine(testLst.Where(x => x.Count > 1).Count());

I get this result:

    0
    1758

Removing AsParallel() gives the same result.

Conclusion: something is wrong in my EnumerateDocuments; every document is returned twice.

I'll have to dig into this; my source listing is probably the cause.
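If the duplication cannot easily be fixed at the source, one possible stopgap (a sketch only; plain strings stand in for DocumentsRow here) is to collapse the enumeration by key before processing, so no two workers ever receive the same file:

```csharp
using System;
using System.Linq;

class DedupSketch
{
    static void Main()
    {
        // Stand-in for EnumerateDocuments(): every ref comes back twice,
        // matching the Edit2 diagnostic (0 singletons, 1758 pairs).
        var refs = new[] { "DOC-1", "DOC-2", "DOC-1", "DOC-2" };

        // Keep one item per key before handing the sequence
        // to the parallel processing stage.
        var distinct = refs.GroupBy(r => r).Select(g => g.First()).ToList();

        Console.WriteLine(distinct.Count); // 2
    }
}
```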

+4

3 answers

Is Document.FindAll((short)year).Documents thread-safe? The difference between the first and second versions is that in the second (broken) version, this call is made several times concurrently. That may be the cause.

+1

I suggest having each task put its file data into a global queue, with a separate writer thread that pulls write requests from the queue and performs the actual writes.

In any case, writing in parallel to a single disk performs much worse than writing sequentially, because the disk head has to seek to each new write location, so you just bounce around the disk between writes. It is better to perform the writes sequentially.
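A sketch of that producer/consumer shape using BlockingCollection (type and file names are illustrative; the real code would enqueue (htmFullPath, html) pairs from ProcessDocument):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class SingleWriterSketch
{
    static void Main()
    {
        var queue = new BlockingCollection<Tuple<string, string>>();
        var dir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
        Directory.CreateDirectory(dir);

        // One dedicated consumer performs every disk write sequentially.
        var writer = Task.Factory.StartNew(() =>
        {
            foreach (var job in queue.GetConsumingEnumerable())
                File.WriteAllText(job.Item1, job.Item2);
        });

        // Parallel workers only enqueue; they never touch the disk.
        Parallel.For(0, 10, i =>
            queue.Add(Tuple.Create(Path.Combine(dir, i + ".txt"), "content " + i)));

        queue.CompleteAdding(); // lets GetConsumingEnumerable finish
        writer.Wait();

        Console.WriteLine(Directory.GetFiles(dir).Length); // 10
    }
}
```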

+3

It looks like you are trying to write to the same file. Only one thread or program can write to a file at a time, so you cannot do this in parallel.

If several threads only read the same file, open it with read-only access so as not to take a write lock on it.

The easiest fix is to place a lock around the File.WriteAllText call, assuming the write is fast and parallelizing the rest of the code is still worthwhile.
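A minimal sketch of that lock (the loop simply simulates many workers hitting the same file; the path is illustrative):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class LockedWriteSketch
{
    private static readonly object WriteLock = new object();

    static void Main()
    {
        var path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".htm");

        // Serialize the writes: only one thread at a time calls WriteAllText,
        // so concurrent workers can no longer collide on the same file.
        Parallel.For(0, 20, i =>
        {
            lock (WriteLock)
            {
                File.WriteAllText(path, "pass " + i);
            }
        });

        Console.WriteLine("completed without IOException");
        File.Delete(path);
    }
}
```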

-1

Source: https://habr.com/ru/post/1385565/
