How to count words in complex documents (.rtf, .doc, .odt, etc.)?

I am trying to write a Python function that, given the path to a document file, returns the number of words in this document. This is pretty easy to do with .txt files, and there are tools that let me hack support for several more complex document formats together, but I want a really comprehensive solution.

Looking at the OpenOffice.org py-uno scripting interface and the list of supported formats, the ideal approach would be to load documents into a headless OOo and call its word count function. However, I cannot find any tutorials or code samples that go beyond basic document generation, and even the code fragments I did find are more than half a decade out of date and no longer work.

With or without OOo and UNO, how can I get reliable word counts for documents in different formats?

+4
2 answers

load documents into headless OOo and call the word count function

PyODConverter is a recent (November 2009) script that uses OOo to convert between multiple file types. Looking at the script, it contains the basic code for loading any document type that OOo supports.

Here's how you start OOo as a headless service:

soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard

Then you just need to write a small wrapper that starts OOo from the command line, runs your script, and then shuts OOo down.
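A minimal sketch of such a wrapper, assuming `soffice` is on the PATH and the script runs under a Python interpreter that can import OOo's `uno` module. The helper names (`soffice_command`, `wait_for_port`, `count_words`, `count_words_with_ooo`) are made up for illustration, and reading the `WordCount` property assumes the file opens as a Writer text document. I have not checked it against every format OOo can load.

```python
import os
import socket
import subprocess
import time

HOST, PORT = "127.0.0.1", 8100


def soffice_command(host=HOST, port=PORT):
    """Build the command line shown above as an argument list."""
    accept = "socket,host=%s,port=%d;urp;" % (host, port)
    return ["soffice", "-headless", "-accept=" + accept, "-nofirststartwizard"]


def wait_for_port(host=HOST, port=PORT, timeout=30.0):
    """Poll until something accepts TCP connections on host:port."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            socket.create_connection((host, port), timeout=1.0).close()
            return True
        except OSError:
            time.sleep(0.5)
    return False


def count_words(path, host=HOST, port=PORT):
    """Load a document in the running OOo and read its WordCount property."""
    # Imported here so the helpers above still work without PyUNO installed.
    import uno
    from com.sun.star.beans import PropertyValue

    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext"
        % (host, port))
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    hidden = PropertyValue()
    hidden.Name = "Hidden"
    hidden.Value = True
    url = uno.systemPathToFileUrl(os.path.abspath(path))
    doc = desktop.loadComponentFromURL(url, "_blank", 0, (hidden,))
    if doc is None:
        raise IOError("OOo could not load %r" % path)
    try:
        # Assumption: a text document; spreadsheets and the like lack this.
        return doc.WordCount
    finally:
        doc.close(True)


def count_words_with_ooo(path):
    """Start OOo, count the words in one document, then shut OOo down."""
    office = subprocess.Popen(soffice_command())
    try:
        if not wait_for_port():
            raise RuntimeError("OOo did not start listening in time")
        return count_words(path)
    finally:
        office.terminate()
```

Calling `count_words_with_ooo("/path/to/report.doc")` would then return the count for one file. Two caveats: on some platforms `soffice` is a launcher script, so `terminate()` may not stop the real office process, and newer LibreOffice builds spell the options with two dashes (`--headless`, `--accept=...`).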


+3

This may not be an option for you, but in case it is: you can upload the documents to Google Docs and then export them in .txt format. Google usually does a very good job with conversion.

You can find the relevant APIs here: http://code.google.com/intl/pl/apis/documents/docs/1.0/developers_guide_python.html

Take a look at the login, download, and export sections.
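Once a document has been exported to .txt, the counting itself is the easy part. A minimal sketch, assuming a "word" is any run of non-whitespace characters and that the export is UTF-8 encoded; `count_words_in_text` and `count_words_in_txt` are illustrative names, not part of any Google API.

```python
import io


def count_words_in_text(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())


def count_words_in_txt(path, encoding="utf-8"):
    """Count words in a plain-text file, reading it line by line."""
    total = 0
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            total += count_words_in_text(line)
    return total
```

Word processors apply their own rules for hyphens, numbers, and punctuation, so this count may differ slightly from what OOo or Word would report for the same document.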

+2

Source: https://habr.com/ru/post/1301179/

