How can I extract plain text from an MS Word document in pure C++?

Is there a pure C++ library for extracting plain text from a .doc file?

I am developing a C++ program that reads .doc and .pdf files. I need to extract the text from a file and write it to a .txt file.

+6
4 answers

You might take a look at wv, the open source C library used by AbiWord.

You can also call the batch conversion tool
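Calling an external converter from C++ could look roughly like the sketch below. It assumes a POSIX system (for `popen`) and a command-line converter that takes the input path as its only argument and prints the text to stdout; antiword is used as the default purely as an assumption, and `shell_quote`, `run_and_capture` and `doc_to_text` are hypothetical helper names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <stdio.h>   // popen/pclose are POSIX, not standard C++
#include <string>

// Wrap an argument in single quotes so a POSIX shell treats it as one word.
std::string shell_quote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') out += "'\\''";  // close quote, escaped quote, reopen
        else out += c;
    }
    out += "'";
    return out;
}

// Run a shell command and return everything it wrote to stdout.
std::string run_and_capture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed");
    std::string output;
    char buffer[4096];
    std::size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);
    }
    if (pclose(pipe) != 0) {
        throw std::runtime_error("converter exited with an error");
    }
    return output;
}

// Assumption: `converter` is installed and prints plain text to stdout
// when given the document path as its single argument.
std::string doc_to_text(const std::string& doc_path,
                        const std::string& converter = "antiword") {
    return run_and_capture(converter + " " + shell_quote(doc_path));
}
```

The returned string can then be written to a .txt file with `std::ofstream`. Note that this is not a "pure C++" solution, since it depends on the external program being present at run time.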

+3

If you want to manipulate or read .doc files, you can simply take the time to study the format and handle the .doc file yourself. The MSDN page links to the format specification (a PDF file).
I admit it is a fair amount of reading, but if you want to write software that manages or reads these files, you need the underlying knowledge to support it.
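As a first step in that direction, here is a minimal sketch that only checks whether a file is an OLE2 compound file (the container that binary .doc files use) and reads two fields from its header. The offsets follow the published compound file header layout (signature at byte 0, major version at byte 26, sector shift at byte 30, little-endian); `CfbHeaderInfo`, `read_prefix` and `parse_cfb_header` are hypothetical names. It does not extract any text: for that you would still have to walk the sector chains, locate the WordDocument stream and follow the Word specification from there.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>
#include <vector>

struct CfbHeaderInfo {
    std::uint16_t major_version;  // 3 or 4 in practice
    std::uint32_t sector_size;    // 512 for version 3, 4096 for version 4
};

// Read up to `count` bytes from the start of a file (empty on failure).
std::vector<unsigned char> read_prefix(const std::string& path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> buf(count);
    in.read(reinterpret_cast<char*>(buf.data()),
            static_cast<std::streamsize>(count));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Returns header info if `bytes` starts with a compound file header.
std::optional<CfbHeaderInfo> parse_cfb_header(const std::vector<unsigned char>& bytes) {
    static const std::array<unsigned char, 8> kSignature = {
        0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    if (bytes.size() < 32) return std::nullopt;
    if (!std::equal(kSignature.begin(), kSignature.end(), bytes.begin())) {
        return std::nullopt;
    }
    // All multi-byte header fields are stored little-endian.
    auto u16 = [&](std::size_t off) {
        return static_cast<std::uint16_t>(bytes[off] | (bytes[off + 1] << 8));
    };
    const std::uint16_t major = u16(26);
    const std::uint16_t shift = u16(30);
    if (shift >= 32) return std::nullopt;  // nonsensical sector size
    return CfbHeaderInfo{major, std::uint32_t{1} << shift};
}
```

Calling `parse_cfb_header(read_prefix(path, 512))` gives a quick way to reject files that are not in the binary Word container at all (for example .docx, which is a ZIP archive).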

The same applies to the PDF format (which is an open format, so its specification should be easy to find).
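For PDF, the equivalent first check is the `%PDF-M.N` header at the start of the file. The sketch below parses only that header; `parse_pdf_version` is a hypothetical name, and the check is deliberately strict (header at byte 0, single-digit major and minor version), whereas real-world readers tend to be more lenient. Getting at the text itself means parsing the object structure, decompressing streams and decoding the content streams as described in the specification.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Parse "%PDF-M.N" from the first bytes of a file.
// Returns {major, minor} on success, std::nullopt otherwise.
std::optional<std::pair<int, int>> parse_pdf_version(const std::string& head) {
    const std::string marker = "%PDF-";
    if (head.compare(0, marker.size(), marker) != 0) return std::nullopt;

    const std::size_t i = marker.size();
    auto is_digit_at = [&](std::size_t k) {
        return k < head.size() &&
               std::isdigit(static_cast<unsigned char>(head[k])) != 0;
    };
    // Expect exactly: digit, '.', digit
    if (!is_digit_at(i) || i + 1 >= head.size() || head[i + 1] != '.' ||
        !is_digit_at(i + 2)) {
        return std::nullopt;
    }
    return std::make_pair(head[i] - '0', head[i + 2] - '0');
}
```

Passing the first few bytes of a file to this function is enough to tell a PDF apart from a .doc before choosing which extraction path to take.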

+1

For .doc: use the Word object model to open the document and extract the text. This example uses OLE Automation and C. Here is another link for DOCX that might help you.

For PDF: use Haru.

+1

You can always use OIVT (I believe it stands for Outside In Viewer Technology), which is now owned by Oracle.

I will be honest: this is not a cheap solution, and this product is mainly for viewing, printing, and so on. If I remember correctly, they also offer the ability to extract content as text, although that may be a separate product of theirs. It can do this for virtually any document type, including doc, docx and pdf (just to name a few), without needing the "original" application installed, since it ships with its own set of filters.

Here is the link to get started.

Outside In Viewer Technology

Good luck.

+1

Source: https://habr.com/ru/post/902273/

