I am creating a console application that needs to process a large number of documents.
To keep it simple, the process is:
- for each year between X and Y, query a database to get the list of document links to process
- for each of these links, process the local file
The process method is, I think, independent of the others and should be parallelizable as long as the input arguments are different:
    private static bool ProcessDocument(
        DocumentsDataset.DocumentsRow d,
        string langCode)
    {
        try
        {
            var htmFileName = d.UniqueDocRef.Trim() + langCode + ".htm";
            var htmFullPath = Path.Combine(@"x:\path", htmFileName);
            missingHtmlFile = !File.Exists(htmFullPath);
            if (!missingHtmlFile)
            {
                var html = File.ReadAllText(htmFullPath);
                // ...
To list my documents, I have this method:

    private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
    {
        return Enumerable.Range(1990, 2020 - 1990)
            .AsParallel()
            .SelectMany(year => Document.FindAll((short)year).Documents);
    }
Document is a business class that wraps the document searches. The result of FindAll is a typed dataset (I return its Documents table). The method takes a year, and I am sure that a document cannot be returned for more than one year (the year is actually part of the key).
Note the use of AsParallel() here; I have never had a problem with it.
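To convince myself that AsParallel().SelectMany(...) does not duplicate items on its own, here is a minimal sketch. FakeFindAll is a hypothetical stand-in for Document.FindAll(year).Documents and returns made-up keys; it is not my real data access code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlinqSanityCheck
{
    // Hypothetical stand-in for Document.FindAll(year).Documents:
    // three made-up keys per year, each embedding the year so keys are unique across years.
    private static IEnumerable<string> FakeFindAll(int year)
    {
        return Enumerable.Range(1, 3).Select(i => year + "-" + i);
    }

    // Same shape as EnumerateDocuments(): one parallel lookup per year, flattened.
    public static List<string> Enumerate()
    {
        return Enumerable.Range(1990, 2020 - 1990)
            .AsParallel()
            .SelectMany(year => FakeFindAll(year))
            .ToList();
    }

    public static void Main()
    {
        var keys = Enumerate();
        // Order is not guaranteed with AsParallel(), but each key should appear exactly once,
        // so the total count and the distinct count should match.
        Console.WriteLine(keys.Count);
        Console.WriteLine(keys.Distinct().Count());
    }
}
```

With 30 years and 3 fake keys per year, both counts should be 90. If that holds, any duplicates would have to come from the data source rather than from PLINQ.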
Now my main method:

    var documents = EnumerateDocuments();

    var result = documents.Select(d =>
    {
        bool success = true;
        foreach (var langCode in new string[] { "-e", "-f" })
        {
            success &= ProcessDocument(d, langCode);
        }
        return new { d.UniqueDocRef, success };
    });

    using (var sw = File.CreateText("summary.csv"))
    {
        sw.WriteLine("Level;UniqueDocRef");
        foreach (var item in result)
        {
            string level;
            if (!item.success) level = "[ERROR]";
            else level = "[OK]";

            sw.WriteLine("{0};{1}", level, item.UniqueDocRef);
        }
    }
This method works as expected in this form. However, if I replace

    var documents = EnumerateDocuments();

with

    var documents = EnumerateDocuments().AsParallel();

it stops working, and I do not understand why.
The error occurs here (in my process method):

    File.WriteAllText(htmFullPath, html);

The exception says that the file is already open in another process.
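For reference, this kind of error can be reproduced in isolation whenever one writer touches a path that another still holds open. The following is only a minimal sketch under assumptions: the temp file name is made up, and the two "workers" are simulated sequentially on one thread rather than by PLINQ.

```csharp
using System;
using System.IO;

public static class SharingViolationDemo
{
    // Returns true if writing to a file that is already held open for writing
    // fails with an IOException.
    public static bool SecondWriterFails()
    {
        // Hypothetical temp file, standing in for one document's .htm file.
        var path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".htm");
        try
        {
            // First "worker": holds the file open with no sharing allowed.
            using (var first = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
            {
                try
                {
                    // Second "worker": writes to the same path while it is still open.
                    File.WriteAllText(path, "<html></html>");
                    return false;
                }
                catch (IOException)
                {
                    return true;
                }
            }
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        Console.WriteLine(SecondWriterFails());
    }
}
```

On Windows I would expect the second write to throw. The point is only that the error needs two operations on the same path at the same moment, which should not happen if every (document, langCode) pair is really unique.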
I do not understand what could make my program misbehave like this. Since my documents variable is an IEnumerable returning unique values, why does my process method break?
Thanks for any tips.
[Edit] Code for retrieving the documents:

    /// <summary>
    /// Get all documents in data store
    /// </summary>
    public static DocumentsDS FindAll(short? year)
    {
        Database db = DatabaseFactory.CreateDatabase(connStringName); // MS Entlib
        DbCommand cm = db.GetStoredProcCommand("Document_Select");
        if (year.HasValue)
            db.AddInParameter(cm, "Year", DbType.Int16, year.Value);

        string[] tableNames = { "Documents", "Years" };
        DocumentsDS ds = new DocumentsDS();
        db.LoadDataSet(cm, ds, tableNames);
        return ds;
    }
[Edit2] A possible source of my problem, thanks to mquander. If I write:

    var test = EnumerateDocuments().AsParallel().Select(d => d.UniqueDocRef);
    var testGr = test.GroupBy(d => d)
        .Select(d => new { d.Key, Count = d.Count() })
        .Where(c => c.Count > 1);
    var testLst = testGr.ToList();
    Console.WriteLine(testLst.Where(x => x.Count == 1).Count());
    Console.WriteLine(testLst.Where(x => x.Count > 1).Count());
I get this result:

    0
    1758
Removing AsParallel returns the same result.
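One caveat about the check above: because the Where(c => c.Count > 1) filter runs before ToList(), groups with a count of 1 are already gone, so the first number is always 0 by construction. Here is a sketch of a variant that reports both numbers. The keys are made-up sample values, and CountUniqueAndDuplicated is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateReport
{
    // Returns (number of keys seen exactly once, number of keys seen more than once).
    public static Tuple<int, int> CountUniqueAndDuplicated(IEnumerable<string> keys)
    {
        // Group first and filter only when counting, so both categories survive.
        var counts = keys
            .GroupBy(k => k)
            .Select(g => new { g.Key, Count = g.Count() })
            .ToList();

        return Tuple.Create(
            counts.Count(c => c.Count == 1),
            counts.Count(c => c.Count > 1));
    }

    public static void Main()
    {
        // Made-up keys: "A" and "B" appear twice, "C" appears once.
        var sample = new[] { "A", "B", "A", "C", "B" };
        var result = CountUniqueAndDuplicated(sample);
        Console.WriteLine(result.Item1); // 1 key seen once ("C")
        Console.WriteLine(result.Item2); // 2 keys seen more than once ("A", "B")
    }
}
```

Feeding it EnumerateDocuments().Select(d => d.UniqueDocRef) would show whether all documents are duplicated or only some of them.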
Conclusion: something is wrong with my EnumerateDocuments, and documents are returned more than once.
I'll have to dig into this, I think.
The way my source list is built is probably the cause.
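Until I find the real cause, one stopgap I could try is to drop repeated keys before going parallel. This is only a sketch under assumptions: Doc is a hypothetical stand-in for DocumentsDataset.DocumentsRow, and the Select at the end is where ProcessDocument would be called. It hides the symptom and does not explain why the rows come back more than once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctBeforeParallel
{
    // Hypothetical stand-in for DocumentsDataset.DocumentsRow; only the key matters here.
    public sealed class Doc
    {
        public string UniqueDocRef { get; set; }
    }

    // Keeps the first row for each UniqueDocRef, then hands the remaining rows to PLINQ,
    // so no two workers ever receive the same document.
    public static List<string> ProcessEachOnce(IEnumerable<Doc> documents)
    {
        return documents
            .GroupBy(d => d.UniqueDocRef.Trim())
            .Select(g => g.First())
            .AsParallel()
            .Select(d => d.UniqueDocRef.Trim()) // ProcessDocument(d, langCode) would go here
            .ToList();
    }

    public static void Main()
    {
        var docs = new[]
        {
            new Doc { UniqueDocRef = "A" },
            new Doc { UniqueDocRef = "B" },
            new Doc { UniqueDocRef = "A" },
        };
        // "A" appears twice in the input but is processed only once.
        Console.WriteLine(ProcessEachOnce(docs).Count);
    }
}
```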