Is there an existing library for extracting text from Open XML file formats (such as docx, pptx and xlsx)?
I need the extracted text to populate a Lucene.NET index.
I found this example that extracts text from docx and seems to work fine. But before building my own solution based on this, I was wondering if there is anything already available for other file formats?
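For context, the approach such examples take carries over to the other formats, because docx, pptx and xlsx files are all ZIP packages of XML parts. Below is a minimal sketch of that idea, written in Python purely for illustration (the actual target here is .NET). The function name `extract_text` and the choice of which parts to read are assumptions of this sketch. It skips headers, footers, speaker notes, and inline or numeric spreadsheet cells.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces of the text-bearing elements in each Open XML format.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def _texts(xml_bytes, tag):
    """Return the non-empty text of every element named `tag` in one XML part."""
    root = ET.fromstring(xml_bytes)
    return [el.text for el in root.iter(tag) if el.text]


def extract_text(source):
    """Collect plain text from a docx, pptx or xlsx package.

    `source` is a file path or a binary file-like object.
    Hypothetical helper: it only reads the main document body, slide
    bodies and the shared-string table.
    """
    chunks = []
    with zipfile.ZipFile(source) as pkg:
        # Note: sorting is lexicographic, so slide10.xml comes before slide2.xml.
        for name in sorted(pkg.namelist()):
            if name == "word/document.xml":
                tag = "{%s}t" % W_NS          # <w:t> runs in a Word document
            elif name.startswith("ppt/slides/slide") and name.endswith(".xml"):
                tag = "{%s}t" % A_NS          # <a:t> runs in a slide
            elif name == "xl/sharedStrings.xml":
                tag = "{%s}t" % S_NS          # <t> entries of the shared-string table
            else:
                continue
            chunks.extend(_texts(pkg.read(name), tag))
    # Joining with spaces is crude: a word split across two runs gets a gap.
    return " ".join(chunks)
```

The returned string could then be fed to the indexer as the document body. For search indexing the crude whitespace joining is usually tolerable, but it is not a faithful rendering of the document.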
Before spending money, it might be worth looking at the IFilter interface. IFilters are designed to do exactly what you want: extract text from documents for indexing.
http://msdn.microsoft.com/en-us/library/ms691105
http://www.codeproject.com/KB/cs/IFilter.aspx
(There are some more links at the bottom of the CodeProject article.)
Microsoft provides IFilters for the Office file types: http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en
I know that we use this technology to index PDF files with Lucene, but I did not write the actual code and cannot share it, I'm afraid.
If your Google-fu is strong, I'm sure you can dig out more examples of using IFilters to do exactly what you want.
See aspose.com; they have a good library that handles both ppt and pptx.
You can try Toxy, a text/data extraction framework for .NET. It currently supports xls, xlsx, doc and docx; pptx support is planned for version 1.5.
See here for more details.
Source: https://habr.com/ru/post/1308960/