How to parse PDF files generated from LaTeX using Java (to recover structure such as chapters or sections)

I am trying to extract structured text from PDF documents. Since PDF files usually carry no logical structure, I thought I could start with PDF files produced by LaTeX, which should have some structure.

Are there any patterns in LaTeX-generated PDF files that I could rely on when parsing them?

2 answers

Take a look at PDFBox for extracting text from PDF documents. Alternatively, you can use Apache Tika, which parses many document types through a common interface (possibly overkill here). I would not recommend doing this by hand.
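
Once the text has been extracted (for example with PDFBox's PDFTextStripper), one possible next step is to look for lines that resemble numbered headings. The sketch below uses only the Java standard library. HeadingFinder and its regular expression are hypothetical, not part of PDFBox or Tika, and they assume that headings survive extraction as single lines such as "2.1 Background". That is often, but not always, the case for LaTeX output, and lines like "2019 Annual report" would be false positives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
class HeadingFinder {
    // Assumption: a heading is a dotted section number, whitespace,
    // then a short title that starts with an uppercase letter.
    private static final Pattern HEADING =
            Pattern.compile("^(\\d+(?:\\.\\d+)*)\\s+(\\p{Lu}.{0,80})$");

    /** Returns "number<TAB>title" for each line that looks like a numbered heading. */
    static List<String> findHeadings(List<String> lines) {
        List<String> headings = new ArrayList<>();
        for (String line : lines) {
            Matcher m = HEADING.matcher(line.trim());
            if (m.matches()) {
                headings.add(m.group(1) + "\t" + m.group(2));
            }
        }
        return headings;
    }

    /** Nesting depth of a section number: "2" is depth 1, "2.1" is depth 2. */
    static int depth(String number) {
        return number.split("\\.").length;
    }
}
```

Passing the extracted text line by line to HeadingFinder.findHeadings gives a rough outline, and depth can be used to tell chapters from sections and subsections. Expect to tune the pattern for each document class.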


InftyReader (commercial solution)

http://www.sciaccess.net/en/InftyReader/index.html

In trial mode, recognition is limited to one page at a time and 5 pages per day.

Using the terminal

  • A quick and dirty solution that will probably take some trial and error.

    • First, convert your PDF to plain text:

      • pdftotext 'your-file.pdf' your-file.txt
    • You need a recurring pattern in your PDF (for example, a copyright notice on each slide):

      • sed -n '/<PATTERN>/{n;n;n;p}' your-file.txt | awk '!x[$0]++' > your-file-structure.txt
      • Adjust {n;n;n;p} as needed: as written, it skips ahead three lines (n;n;n) after each match of your pattern and prints (p) the third one.
      • awk '!x[$0]++' removes duplicate lines, keeping the first occurrence of each.
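
Since the question asks about Java, the same idea can be sketched without sed and awk. The class below is a hypothetical stand-in for the pipeline above, using only the standard library: PatternOffsetExtractor, extract, marker and offset are names made up for illustration. It mimics sed's behaviour in that the lines skipped after a match are not tested against the pattern again. Note that Java regular expressions differ from sed's basic regular expressions, so the pattern may need adjusting.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper mirroring: sed -n '/<PATTERN>/{n;n;n;p}' | awk '!x[$0]++'
class PatternOffsetExtractor {
    /**
     * For every line containing the marker, takes the line `offset` lines later.
     * Duplicates are dropped and the first occurrence of each line keeps its position.
     */
    static List<String> extract(List<String> lines, Pattern marker, int offset) {
        Set<String> unique = new LinkedHashSet<>(); // preserves insertion order
        for (int i = 0; i < lines.size(); i++) {
            if (marker.matcher(lines.get(i)).find()) {
                int target = i + offset;
                if (target < lines.size()) {
                    unique.add(lines.get(target));
                }
                // As with sed's n command, the skipped lines are not re-tested.
                i = target;
            }
        }
        return new ArrayList<>(unique);
    }
}
```

With the text file from pdftotext, a call could look like extract(Files.readAllLines(Path.of("your-file.txt")), Pattern.compile("<PATTERN>"), 3), where <PATTERN> is your own marker and 3 matches the three n commands.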

Source: https://habr.com/ru/post/1444833/

