How to save a text file in UTF-8 format using pdftotext

I am using the pdftotext tool to convert PDF files to text files. How can I save the output in UTF-8 so that all accented characters are preserved? I use the following command, which extracts the contents to a text file, but the accented characters do not appear in the output.

pdftotext -enc UTF-8 book1.pdf book1.txt

Please help me solve this problem.

Thanks in advance,

+3
2 answers

You can get a list of available encodings using the command:

pdftotext -listenc

Then choose one with the -enc option. UTF-8 should be in that list; pass it exactly as "UTF-8":

pdftotext -enc UTF-8 your.pdf
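The two commands above can be combined into one guarded step. This is only a minimal sketch: `has_encoding` is a hypothetical helper, `your.pdf` is a placeholder name, and it assumes `-listenc` prints one encoding name per line, which may differ between pdftotext builds.

```shell
# Placeholder file name; replace with your own PDF.
pdf="your.pdf"

# has_encoding NAME: succeeds if NAME appears as a whole word on stdin.
has_encoding() {
    grep -qwF -- "$1"
}

if command -v pdftotext >/dev/null 2>&1 && [ -f "$pdf" ]; then
    # Capture both streams in case the list is written to stderr.
    if pdftotext -listenc 2>&1 | has_encoding "UTF-8"; then
        pdftotext -enc UTF-8 "$pdf" "${pdf%.pdf}.txt"
    else
        echo "UTF-8 is not in the list printed by -listenc" >&2
    fi
fi
```

If the warning is printed, the output of `pdftotext -listenc` shows which names that build accepts.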

If that does not help, check your locale settings (LC_ALL, LANG, ...).
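A quick way to look at the locale variables mentioned here is sketched below. `is_utf8_locale` is a hypothetical helper, and the LC_ALL, then LC_CTYPE, then LANG precedence is the usual POSIX order. Whether pdftotext itself reads these variables is not established in this thread, so treat the check as a diagnostic only.

```shell
# Show the variables that usually decide the character encoding.
printf 'LC_ALL=%s\n'   "${LC_ALL-}"
printf 'LC_CTYPE=%s\n' "${LC_CTYPE-}"
printf 'LANG=%s\n'     "${LANG-}"

# is_utf8_locale NAME: succeeds if the locale name carries a UTF-8 codeset,
# e.g. en_US.UTF-8 or de_DE.utf8 (optionally followed by an @modifier).
is_utf8_locale() {
    case "$1" in
        *.[Uu][Tt][Ff]-8|*.[Uu][Tt][Ff]8) return 0 ;;
        *.[Uu][Tt][Ff]-8@*|*.[Uu][Tt][Ff]8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# The first non-empty variable wins.
effective="${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
if is_utf8_locale "$effective"; then
    echo "effective locale uses UTF-8"
else
    echo "effective locale does not name UTF-8: '${effective}'"
fi
```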

EDIT: I tested with this PDF: http://www.i18nguy.com/unicode/unicodeexample.pdf

On Windows 7 with XPDF 3.02PL5, I used the command:

pdftotext.exe -enc UTF-8 unicodeexample.pdf

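With this command the output file is UTF-8 encoded. What are you using to view the text file? If the viewer does not read it as UTF-8, the accented characters will look wrong or missing.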

(I checked in Firefox by switching the encoding between ISO-8859-1 and UTF-8.)
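The encoding of the output can also be checked without relying on a viewer. The sketch below assumes `iconv` is installed. `is_valid_utf8` is a hypothetical helper, and `unicodeexample.txt` is the name pdftotext gives the output when no output file is specified.

```shell
# is_valid_utf8 FILE: succeeds if FILE decodes cleanly as UTF-8.
# iconv exits with a non-zero status on the first invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

txt="unicodeexample.txt"
if [ -f "$txt" ]; then
    if is_valid_utf8 "$txt"; then
        echo "$txt is valid UTF-8"
    else
        echo "$txt is not valid UTF-8"
    fi
fi
```

A file that passes this check but still looks wrong on screen points to the viewer's encoding setting, not to pdftotext.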

+9


PDF, "" :

  • PDF Acrobat Reader
  • Unicode ( "" OCR, )


-4

Source: https://habr.com/ru/post/1771860/

