.doc to plain text converter

Is there an open-source C/C++ library that converts MS Word .doc/.docx files to plain text?

+4
4 answers
+3

These are not libraries, but they may be useful. I know of two console applications: antiword and catdoc. Antiword is licensed under the GPL; the source of catdoc is also available, but I'm not sure about its license. Both are written in C, so using them from C++ should be possible.
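Since both tools are console programs, the simplest way to use one from C++ is probably to run it as a child process and read its standard output. Below is a minimal sketch of that idea, assuming a POSIX system (for popen) and an antiword binary on the PATH. The names shell_quote, run_and_capture and doc_to_text are made up for this example; none of them come from antiword itself.

```cpp
#include <stdio.h>    // popen, pclose, fread (POSIX)
#include <stdexcept>
#include <string>

// Wrap a string in single quotes so the shell treats it as one argument.
std::string shell_quote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') out += "'\\''";  // close quote, escaped quote, reopen
        else out += c;
    }
    out += "'";
    return out;
}

// Run a shell command and return everything it wrote to stdout.
std::string run_and_capture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed");
    std::string output;
    char buffer[4096];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);
    }
    // A non-zero status means the command could not run or reported an error.
    if (pclose(pipe) != 0) {
        throw std::runtime_error("command failed: " + command);
    }
    return output;
}

// Hypothetical helper: assumes `antiword <file>` prints the text to stdout.
std::string doc_to_text(const std::string& path) {
    return run_and_capture("antiword " + shell_quote(path));
}
```

Calling doc_to_text("report.doc") would then return the document text as a std::string. Swapping "antiword" for "catdoc" should work the same way, since catdoc also writes the converted text to standard output.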

+2

If all else fails, a .docx file is actually a ZIP archive with several directories in it. One of the files in one of these directories holds the text of the document as XML with markup. A few of the tags need handling because they mark the ends of lines, but most of the rest either flag things such as autocorrect marks or are formatting tags, nested up to five levels deep and seemingly distributed at random.

(I once had to do this by hand on a machine with no Internet access. Someone had saved a file from Office 2011 and wanted to open it in Office 2005 or so, somewhere out in the boonies.)
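For what it's worth, the file in question is word/document.xml, and the manual procedure can be approximated in a few dozen lines. The sketch below assumes the XML has already been pulled out of the archive (for example with `unzip -p file.docx word/document.xml`) and is held in a string. It keeps only the contents of w:t elements and emits a newline at each paragraph end (w:p) and line break (w:br or w:cr). It is a naive tag scanner, not a real XML parser: it assumes the usual "w:" namespace prefix and no '>' characters inside attribute values, and it ignores tabs, tables, footnotes, and numeric character references. The function names are invented for this example.

```cpp
#include <string>

// Decode the five predefined XML entities; anything else is copied through.
std::string decode_entities(const std::string& s) {
    static const struct { const char* name; char ch; } table[] = {
        {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'},
        {"&quot;", '"'}, {"&apos;", '\''}};
    std::string out;
    for (size_t i = 0; i < s.size();) {
        bool matched = false;
        if (s[i] == '&') {
            for (const auto& e : table) {
                const std::string name = e.name;
                if (s.compare(i, name.size(), name) == 0) {
                    out += e.ch;
                    i += name.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) out += s[i++];
    }
    return out;
}

// Extract plain text from the contents of word/document.xml.
std::string docx_xml_to_text(const std::string& xml) {
    std::string text;
    bool in_text_run = false;  // true while between <w:t ...> and </w:t>
    size_t pos = 0;
    while (pos < xml.size()) {
        size_t lt = xml.find('<', pos);
        if (lt == std::string::npos) lt = xml.size();
        if (in_text_run) text += decode_entities(xml.substr(pos, lt - pos));
        if (lt == xml.size()) break;
        size_t gt = xml.find('>', lt);
        if (gt == std::string::npos) break;  // truncated tag: stop scanning

        // Tag body without the angle brackets, e.g. "w:t xml:space=..." or "/w:p".
        const std::string tag = xml.substr(lt + 1, gt - lt - 1);
        const bool self_closing = !tag.empty() && tag.back() == '/';
        // Element name, keeping a leading '/' for closing tags.
        const std::string name = tag.substr(0, tag.find_first_of(" \t\r\n/", 1));

        if (name == "w:t") {
            in_text_run = !self_closing;
        } else if (name == "/w:t") {
            in_text_run = false;
        } else if (name == "/w:p" || name == "w:br" || name == "w:cr" ||
                   (name == "w:p" && self_closing)) {
            text += '\n';  // paragraph end or explicit line break
        }
        pos = gt + 1;
    }
    return text;
}
```

Everything outside w:t (formatting properties, proofing marks and so on) is simply skipped, which is the point: the nesting depth of the formatting tags does not matter if only the text runs and line ends are tracked.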

+2

I don't know of a library for this task, but perhaps you can extract the important bits from Antiword. I'm not sure whether Antiword handles .docx, though.

+1

Source: https://habr.com/ru/post/1387376/

