To extract text from a PDF, try this on a machine running Linux, BSD, etc., or under Cygwin if you are on Windows:
pdftotext -layout some_pdf_file.pdf
This creates a plain text file named some_pdf_file.txt. The simpler the layout of the PDF file, the cleaner the .txt output will be.
Octal escape codes (a backslash followed by three digits) are often present in the .txt output and will look strange in text editors. These codes usually stand for curly single and double quotes, bullet markers, hyphens, etc. in the PDF.
To see the context in which these codes appear, run this grep command, and keep the original PDF file handy so you can check which character each code represents:
grep -a --color=always "\\\\[0-9][0-9][0-9]" some_pdf_file.txt
This will provide a unique list of the various octal codes in the document:
grep -ao "\\\\[0-9][0-9][0-9]" some_pdf_file.txt | sort | uniq
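If you also want to know which codes matter most, a small variation on the command above counts each one. This is only a sketch: the function name count_octal_codes is made up for illustration, and it assumes a grep that supports -a and -o, as the commands above already do.

```shell
# Hypothetical helper (name is illustrative): count how often each
# backslash-plus-three-digits code occurs in a text file, most frequent first.
count_octal_codes() {
  # -a treats the file as text, -o prints only the matching part of each line.
  grep -ao '\\[0-9][0-9][0-9]' "$1" | sort | uniq -c | sort -rn
}

# Example call, using the file name from this answer:
#   count_octal_codes some_pdf_file.txt
```

The most frequent codes are the ones worth looking up in the PDF first.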
To convert these codes to their ASCII equivalents, you can use a combination of grep, sed and bc. I will publish this procedure soon.
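In the meantime, here is one possible way to do the replacement with sed alone. It is a rough sketch and not the procedure promised above. The function name octal_to_ascii is made up, and the code-to-character table is an assumption (it treats the codes as Windows-1252 values, e.g. \222 as a right curly single quote), so check each code against your own PDF and adjust the table before relying on it.

```shell
# Hypothetical helper (name is illustrative): read text on stdin and replace
# a few backslash-plus-three-digits codes with plain ASCII look-alikes.
# ASSUMPTION: the codes are Windows-1252 values; verify against the PDF.
octal_to_ascii() {
  # \221, \222 -> straight single quote; \223, \224 -> straight double quote;
  # \225 -> asterisk in place of a bullet; \226, \227 -> hyphen in place of a dash.
  sed \
    -e "s/\\\\22[12]/'/g" \
    -e 's/\\22[34]/"/g' \
    -e 's/\\225/*/g' \
    -e 's/\\22[67]/-/g'
}

# Example call, using the file name from this answer:
#   octal_to_ascii < some_pdf_file.txt > some_pdf_file_ascii.txt
```

Codes that are not in the table pass through unchanged, so you can rerun the unique-list grep on the result to see what is still left to map.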
keithchristian Jul 26 '19 at 12:28