Parse a large list of Excel files

This is a C# / VSTO program. I am working on a data collection project. The scope is basically "process Excel files sent by various third-party companies." In practice, this means:

  • Find the columns containing the data I want using a search method.
  • Extract the data from the workbooks.
  • Clean the data, do some calculations, etc.
  • Output the cleaned data to a new workbook.

The program I wrote works well for small data sets: ~25 workbooks with a total of ~1000 rows of relevant data. I grab 7 columns of data from these workbooks. However, I have one edge case: sometimes I need to run a much larger data set, ~50 workbooks with a total of ~8000 rows of relevant data (and maybe another 2000 rows of duplicate data that I also need to delete).

Currently, I run the list of files through a Parallel.ForEach loop, inside which I open a new Excel.Application() to process each file with multiple ActiveSheets. On the smaller data set, the parallel loop runs much faster than processing each file sequentially. But on the large data set, I seem to hit a wall.

I start getting the message "Microsoft Excel is waiting for another application to complete an OLE action", and eventually it just fails. Switching back to a sequential foreach allows the program to finish, but it just grinds along, going from 1-3 minutes for a parallel medium-sized data set to 20+ minutes for a sequential large data set. If I set ParallelOptions.MaxDegreeOfParallelism to 10, it completes the loop, but still takes 15 minutes. If I set it to 15, it fails. I also really don't like messing with TPL settings if I don't need to. I also tried inserting Thread.Sleep to slow things down manually, but that only made the failure take longer to happen.
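
For reference, the capped loop has roughly the shape sketched below. This is a minimal, simplified sketch rather than my real code: `ProcessWorkbook` is a hypothetical placeholder for the per-file Excel work, and the file names are made up. Only `Parallel.ForEach` and `ParallelOptions.MaxDegreeOfParallelism` are taken from what I described above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public class BoundedParallelSketch
{
    // Hypothetical stand-in for the per-file work (open Excel, extract data, close).
    public static int ProcessWorkbook(string path) => path.Length;

    public static void Main()
    {
        // Made-up file names, for illustration only.
        string[] files = Enumerable.Range(1, 50).Select(i => $"book{i}.xlsx").ToArray();

        // Cap how many loop bodies (and therefore Excel instances) may run at once.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };

        // Thread-safe collection, since several loop bodies add results concurrently.
        var results = new ConcurrentBag<int>();

        Parallel.ForEach(files, options, file => results.Add(ProcessWorkbook(file)));

        Console.WriteLine("{0} files processed", results.Count);
    }
}
```

The cap is an upper bound, not a guarantee: the scheduler may use fewer threads than the limit, which is part of why I would rather not tune it by hand.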

I close the workbook, quit the application, then call ReleaseComObject on the Excel object, followed by GC.Collect and GC.WaitForPendingFinalizers, at the end of each loop iteration.

My ideas at the moment:

  • new Excel.Application() , Excel ( # 1, )
  • /, ,

:

  • , ( , Process.Id ?)
  • - , "" , .

: http://reedcopsey.com/2010/01/26/parallelism-in-net-part-5-partitioning-of-work/, : " , , Partitioner". , / .

!

UPDATE

, , Excel 2010, 2010, 2013 . 2013 , - 4 , , . 2010 , ? 2010 - 64- 64- Office, 2013 - 64- 32- Office. ?


excel . - . , , , .

, , . , .

1) :

workBook.ActiveSheet.PageSetup

.. relase null .

: :

m_currentWorkBook.ActiveSheet.PageSetup.LeftFooter = str.ToString();

. ( - excel.)

    private bool SetBarcode(string text)
    {
        // Hold every intermediate COM object in its own variable so it can be released explicitly.
        Excel._Worksheet sheet = (Excel._Worksheet)m_currentWorkbook.ActiveSheet;
        try
        {
            StringBuilder str = new StringBuilder();
            str.Append(@"&""IDAutomationHC39M,Regular""&22(");
            str.Append(text);
            str.Append(")");

            Excel.PageSetup setup = sheet.PageSetup;
            try
            {
                setup.LeftFooter = str.ToString();
            }
            finally
            {
                // Release the PageSetup wrapper before the sheet it came from.
                RemoveReference(setup);
                setup = null;
            }
        }
        finally
        {
            RemoveReference(sheet);
            sheet = null;
        }

        return true;
    }

RemoveReference ( )

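    private void RemoveReference(object o)
    {
        if (o == null)
            return;

        try
        {
            System.Runtime.InteropServices.Marshal.ReleaseComObject(o);
        }
        catch
        {
            // Deliberately ignored: a failed release must not abort the remaining cleanup.
        }
        // The caller must null out its own variable; assigning to the parameter here would not affect it.
    }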

, , - ..

2) excel excel, excel OleDB. excel sql-, datatables ..

: ( datareader )

    private List<DataTable> getMovieTables()
    {
        List<DataTable> movieTables = new List<DataTable>();
        var connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + excelFilePath + ";Extended Properties=\"Excel 12.0;IMEX=1;HDR=NO;TypeGuessRows=0;ImportMixedTypes=Text\"";
        using (var conn = new OleDbConnection(connectionString))
        {
            conn.Open();

            DataRowCollection sheets = conn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, new object[] { null, null, null, "TABLE" }).Rows;

            foreach (DataRow sheet in sheets)
            {

                using (var cmd = conn.CreateCommand())
                {
                    cmd.CommandText = "SELECT * FROM [" + sheet["TABLE_NAME"].ToString() + "] ";

                    var adapter = new OleDbDataAdapter(cmd);
                    var ds = new DataSet();
                    try
                    {
                        adapter.Fill(ds);
                        movieTables.Add(ds.Tables[0]);
                    }
                    catch (Exception ex)
                    {
                        //Debug.WriteLine(ex.ToString());
                        continue;
                    }
                }
            }
        }
        return movieTables;
    }

, @Mustafa Düman, 4 beta EPPlus. .

:

  • Fast
  • ( < 4)
  • , Office , .

:

  • .xlsx (Excel 2007/2010)

20 excel 12,5 ( 50 . ), , , :)

 Console.Write("Path: ");
 var path = Console.ReadLine();

 // Validate the input first; constructing a DirectoryInfo from a null or empty path throws.
 while (string.IsNullOrWhiteSpace(path) || !Directory.Exists(path))
 {
     Console.WriteLine("Invalid path");
     Console.Write("Path: ");
     path = Console.ReadLine();
 }

 string[] files = null;
 try
 {
     files = Directory.GetFiles(path, "*.xlsx", SearchOption.AllDirectories);
 }
 catch (Exception ex)
 {
     Console.WriteLine(ex.Message);
     Console.ReadLine();
     return;
 }

 Console.WriteLine("{0} files found.", files.Length);

 if (files.Length == 0)
 {
     Console.ReadLine();
     return;
 }

 int succeded = 0;
 int failed = 0;


 Action<string> LoadToDataSet = (filePath) =>
 {
     try
     {
         FileInfo fileInfo = new FileInfo(filePath);
         using (ExcelPackage excel = new ExcelPackage(fileInfo))
         using (DataSet dataSet = new DataSet())
         {
             int workSheetCount = excel.Workbook.Worksheets.Count;

             for (int i = 1; i <= workSheetCount; i++)
             {
                 var worksheet = excel.Workbook.Worksheets[i];

                 var dimension = worksheet.Dimension;
                 if (dimension == null)
                     continue;

                 bool hasData = dimension.End.Row >= 1;

                 if (!hasData)
                     continue;

                 DataTable dataTable = new DataTable();

                 //add columns
                 foreach (var firstRowCell in worksheet.Cells[1, 1, 1, dimension.End.Column])
                     dataTable.Columns.Add(firstRowCell.Start.Address);

                 for (int j = 0; j < dimension.End.Row; j++)
                     dataTable.Rows.Add(worksheet.Cells[j + 1, 1, j + 1, dimension.End.Column].Select(erb => erb.Value).ToArray());

                 dataSet.Tables.Add(dataTable);
             }

             dataSet.Clear();
             dataSet.Tables.Clear();
         }

         Interlocked.Increment(ref succeded);
     }
     catch (Exception)
     {
         Interlocked.Increment(ref failed);
     }
 };

 Stopwatch sw = new Stopwatch();

 sw.Start();
 files.AsParallel().ForAll(LoadToDataSet);
 sw.Stop();

 Console.WriteLine("{0} succeeded, {1} failed in {2} seconds", succeded, failed, sw.Elapsed.TotalSeconds);
 Console.ReadLine();

Source: https://habr.com/ru/post/1546361/

