ASP.NET library for extracting plaintext from Open XML file formats

Is there an existing library for extracting text from Open XML file formats (such as docx, pptx and xlsx)?

I need it to populate a Lucene.NET index.

I found this example that extracts text from docx and seems to work fine. But before building my own solution based on this, I was wondering if there is anything already available for other file formats?
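For a sense of what a hand-rolled solution involves: docx, pptx and xlsx files are all ZIP packages of XML parts, and their text runs sit in elements whose local name is `t` (`w:t` in Word, `a:t` in PowerPoint, `t` in Excel's shared-strings table). The following is a minimal sketch of that idea, not the linked example. It is written in Python for brevity, and the helper name `extract_text` and the choice of which parts to read are assumptions made for illustration. In .NET the same steps map onto a ZIP reader such as `ZipArchive` plus an XML parser.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Parts assumed to carry the main visible text in each format. Headers,
# footers, notes and comments live in other parts and are skipped here.
TEXT_PARTS = re.compile(
    r"^(word/document\.xml|ppt/slides/slide\d+\.xml|xl/sharedStrings\.xml)$"
)

# Elements treated as one block of text: paragraphs (w:p, a:p) and
# shared-string items (si).
BLOCK_NAMES = {"p", "si"}


def _local(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def _natural_key(name):
    """Sort key that orders slide2.xml before slide10.xml."""
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", name)]


def extract_text(source):
    """Return the text of a .docx, .pptx or .xlsx file, one block per line.

    `source` is a file path or a binary file-like object.
    """
    blocks = []
    with zipfile.ZipFile(source) as package:
        for name in sorted(package.namelist(), key=_natural_key):
            if not TEXT_PARTS.match(name):
                continue
            root = ET.fromstring(package.read(name))
            for block in root.iter():
                if _local(block.tag) not in BLOCK_NAMES:
                    continue
                # Runs inside one block are joined without separators, because
                # a single word may be split across several runs.
                text = "".join(
                    run.text
                    for run in block.iter()
                    if _local(run.tag) == "t" and run.text
                )
                if text:
                    blocks.append(text)
    return "\n".join(blocks)
```

Calling `extract_text("report.docx")` (a hypothetical file name) returns a string that can be passed to the indexer as a field value. The sketch has known gaps: for xlsx it only sees shared strings, so numeric cells and inline strings are missed; paragraphs nested inside text boxes may be emitted twice; and slide order follows file names rather than the presentation's declared order.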

+4
3 answers

Before spending money, it might be worth looking at the IFilter interface. IFilters were, and still are, designed to do exactly what you want.

http://msdn.microsoft.com/en-us/library/ms691105

http://www.codeproject.com/KB/cs/IFilter.aspx

(There are some further links at the bottom of the CodeProject article.)

Microsoft provides IFilters for the Office file types: http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en

I know that we use this technology to index PDF files with Lucene, but I did not write the actual code and cannot use it, I'm afraid.

If your Google-fu is strong, I'm sure you can dig out more examples of using IFilters to do exactly what you want.

+1

See aspose.com; they have a good library that handles both ppt and pptx.

0

You can try Toxy, a text/data extraction framework for .NET. It currently supports xls, xlsx, doc and docx, and pptx support is planned for version 1.5, which is due very soon.

See here for more details.

0

Source: https://habr.com/ru/post/1308960/

