How to count the characters, words, or sentences in a downloaded PDF, Doc, Xls, Csv, etc. file

How can I count the words in a downloaded file (PDF, Doc, Xls, Csv, etc.) using PHP, Zend Framework, or a Java-based CLI?

+4
2 answers

There is a third-party application that does this: http://www.globalrendering.com/download.html . You can write a simple shell wrapper around it. As for wc, it is not accurate for these file types. See http://ubuntuforums.org/showthread.php?t=566407

+1

First of all, take a look at Apache Tika, which is written in Java, is free (Apache-licensed), and can convert all the formats mentioned in the question to text. After that, counting the words should be trivial.
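As a rough illustration of the "convert, then count" idea, here is a minimal Python sketch. It assumes the Tika command-line app has been downloaded as a JAR (the `tika-app.jar` path is a placeholder), that `java` is on the PATH, and that the extracted text is UTF-8. The function names `extract_text_with_tika` and `text_stats` are made up for this example, and the sentence count is only a crude approximation based on `.`, `!` and `?`.

```python
import re
import subprocess


def extract_text_with_tika(path, tika_jar="tika-app.jar"):
    """Run the Tika command-line app and return the document's plain text.

    tika_jar is a placeholder; point it at the tika-app JAR you downloaded.
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "--text", path],
        capture_output=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def text_stats(text):
    """Return character, word, and approximate sentence counts for text."""
    words = text.split()  # whitespace-separated tokens
    # Crude sentence split on runs of ., ! or ?; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }
```

Something like `text_stats(extract_text_with_tika("report.pdf"))` would then give all three counts for one file. Whether whitespace counts as a character and how abbreviations affect sentence counts are choices you may want to adjust.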

You can also use Linux command-line utilities to convert the files to text and write a simple wrapper around them.

(I cannot link to them due to lack of reputation. Use your Google-fu.)

  • pdf: pdftotext (part of xpdf). See also question #221359 on SuperUser.
  • doc(x): abiword, catdoc, antiword, docxtotxt ... See also question #165978 on SuperUser.
  • xls (and almost everything else, but it needs OpenOffice): unoconv
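
A wrapper around these utilities could look roughly like the Python sketch below. The extension-to-command table is an assumption for illustration: it uses `pdftotext FILE -` and `antiword FILE` to print text to standard output, and `unoconv -f csv --stdout FILE` for spreadsheets. Each tool must be installed separately, and the exact options should be checked against the versions you have. The names `CONVERTERS`, `command_for` and `count_words_in_file` are hypothetical.

```python
import os
import re
import subprocess

# Assumed command lines that write the extracted text to standard output.
CONVERTERS = {
    ".pdf": lambda p: ["pdftotext", p, "-"],
    ".doc": lambda p: ["antiword", p],
    ".xls": lambda p: ["unoconv", "-f", "csv", "--stdout", p],
}


def command_for(path):
    """Pick the converter command for a file based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return CONVERTERS[ext](path)
    except KeyError:
        raise ValueError(f"no converter configured for {ext!r}") from None


def count_words_in_file(path):
    """Convert the file to text with an external tool and count its words."""
    output = subprocess.run(
        command_for(path), capture_output=True, check=True
    ).stdout
    # Assumption: the converter emits UTF-8 text.
    text = output.decode("utf-8", errors="replace")
    # \w+ counts runs of letters/digits, so comma-separated cells in the
    # CSV output of a spreadsheet are counted as separate words.
    return len(re.findall(r"\w+", text))
```

Counting with `\w+` instead of splitting on whitespace is a design choice here: it keeps `a,b,c` from a spreadsheet row from being counted as a single word.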
+1

Source: https://habr.com/ru/post/1333795/

