Getting plain text from a document using C#

How can I get plain text from a document, excluding all images, tables, and numbers? I want to process these documents and build a list of words from them, so I need only the text part of each document, using C#.

+3
2 answers

You probably need to look into IFilters. This is how most search indexers on Windows access plain text from documents. Here's a tutorial and sample project with source code that you can use to extract text from Office documents, PDF files, and so on.

You just need to make sure that the correct IFilters are installed on your computer. Microsoft provides a free set of filters for Office documents. Adobe also provides a filter for PDF files, but it works poorly. If you can, try the FoxIt IFilter instead; it is much better.
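Once a filter has handed back plain text, the word list from the question could be built roughly as follows. This is only a sketch: `WordList` and `FromPlainText` are hypothetical names chosen for illustration, not part of any IFilter API, and it assumes that "word" means a run of Unicode letters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordList
{
    // \p{L}+ matches runs of Unicode letters only,
    // so digits, punctuation and whitespace never end up in a word.
    private static readonly Regex WordPattern =
        new Regex(@"\p{L}+", RegexOptions.Compiled);

    // Returns the distinct words of the text, lower-cased.
    public static List<string> FromPlainText(string text)
    {
        return WordPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .Distinct()
            .ToList();
    }
}
```

For example, `WordList.FromPlainText("Order 66 shipped, order confirmed.")` yields the three words "order", "shipped" and "confirmed"; the number is dropped. Note that an apostrophe splits a word under this pattern ("don't" becomes "don" and "t"), so the pattern may need adjusting for your texts.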

+1

You must handle each document format separately; there is no general method for reading all document formats.
For example, Microsoft Office Word document files must be read with their own library, which is different from the one needed for OpenOffice document files.
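To illustrate the per-format handling, here is a minimal sketch that covers only two formats, .docx and .odt. Both are ZIP archives containing XML, so they can be read with the standard `System.IO.Compression` and `System.Xml.Linq` classes (on .NET Framework this needs a reference to the System.IO.Compression.FileSystem assembly). `PlainTextExtractor` and its methods are hypothetical names; the older binary .doc format is not handled, and nested structures such as text boxes or footnotes may be missed or duplicated.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class PlainTextExtractor
{
    static readonly XNamespace Word =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace OdfText =
        "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static readonly XNamespace OdfTable =
        "urn:oasis:names:tc:opendocument:xmlns:table:1.0";

    // Picks an extraction routine based on the file extension.
    public static string Extract(string path)
    {
        switch (Path.GetExtension(path).ToLowerInvariant())
        {
            case ".docx": // text lives in w:t runs inside w:p paragraphs
                return Paragraphs(Load(path, "word/document.xml"),
                    Word + "p", Word + "tbl",
                    p => string.Concat(p.Descendants(Word + "t").Select(t => t.Value)));
            case ".odt":  // text is the mixed content of text:p paragraphs
                return Paragraphs(Load(path, "content.xml"),
                    OdfText + "p", OdfTable + "table",
                    p => p.Value);
            default:
                throw new NotSupportedException("No extractor for " + path);
        }
    }

    // Opens the archive and parses one XML entry from it.
    static XDocument Load(string path, string entryName)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = zip.GetEntry(entryName);
            if (entry == null)
                throw new InvalidDataException(entryName + " not found in " + path);
            using (Stream stream = entry.Open())
                return XDocument.Load(stream);
        }
    }

    // Joins the text of all paragraphs that are not inside a table.
    static string Paragraphs(XDocument doc, XName paragraph, XName table,
                             Func<XElement, string> textOf)
    {
        var lines = doc.Descendants(paragraph)
            .Where(p => !p.Ancestors(table).Any())
            .Select(textOf)
            .Where(s => s.Length > 0);
        return string.Join(Environment.NewLine, lines);
    }
}
```

Images drop out because they carry no paragraph text, and table cells are skipped by the ancestor check. Every further format (PDF, .doc, RTF, and so on) would need its own `case` backed by a suitable library.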

0

Source: https://habr.com/ru/post/1776521/

