How to parse PDF files generated by LaTeX using Java (to get structure like chapters or sections)

Question

I am trying to extract structured text from PDF documents. Since PDF files usually carry no structural information, I thought I could start by parsing PDF files generated with LaTeX, which should have some structure.

Do you know of any patterns in LaTeX-generated PDF files that I could use when parsing them?

+4

java parsing pdf structure latex

user1692091 Nov 08 '12 at 15:04

2 answers

Perception · Answer 1 · 2012-11-08T15:11:41+0000

Take a look at Apache PDFBox for extracting text from PDF documents. Alternatively, you can use Apache Tika, which parses many document types through a single standard interface (though it may be overkill here). I would not recommend doing this by hand.
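
Once the text has been extracted (for example with one of the libraries above), chapter and section lines still have to be picked out of it. The following is only a minimal sketch of one possible heuristic, not something either library provides: it assumes the PDF was typeset with a default LaTeX class, so that headings end up on their own line as "1 Introduction", "2.3 Related Work" or "Chapter 4 Results". `HeadingFinder` and `findHeadings` are hypothetical names, and the regular expressions will produce false positives on short body lines that happen to start with a number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of PDFBox or Tika.
final class HeadingFinder {

    // Assumption: numbered headings look like "1 Title" or "2.3 Title",
    // start with an uppercase letter and are short (at most 81 characters).
    private static final Pattern NUMBERED =
            Pattern.compile("^(\\d+(?:\\.\\d+)*)\\.?\\s+(\\p{Lu}.{0,80})$");

    // Assumption: chapter headings look like "Chapter 4" or "Chapter 4 Title".
    private static final Pattern CHAPTER =
            Pattern.compile("^Chapter\\s+(\\d+)\\b\\s*(.*)$");

    /**
     * Returns one "level<TAB>number<TAB>title" entry per line that looks
     * like a heading. Level 0 is a chapter, 1 a section, 2 a subsection, ...
     */
    static List<String> findHeadings(List<String> lines) {
        List<String> headings = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            Matcher chapter = CHAPTER.matcher(line);
            if (chapter.matches()) {
                headings.add("0\t" + chapter.group(1) + "\t" + chapter.group(2).trim());
                continue;
            }
            Matcher numbered = NUMBERED.matcher(line);
            if (numbered.matches()) {
                String number = numbered.group(1);
                // "2" -> level 1, "2.3" -> level 2, "2.3.1" -> level 3
                int level = number.split("\\.").length;
                headings.add(level + "\t" + number + "\t" + numbered.group(2).trim());
            }
        }
        return headings;
    }
}
```

The input list could come from splitting the extracted text on line breaks. Documents that use a custom class or unnumbered headings would need different patterns.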

jaminka evening · Answer 2 · 2014-07-07T13:55:37+0000

Commercial solution: InftyReader

http://www.sciaccess.net/en/InftyReader/index.html

In trial mode, recognition is limited to one page at a time and 5 pages per day.

Using the terminal

A quick and dirty solution that will likely take a lot of trial and error.
- Convert your PDF to plain text
  - pdftotext 'your-file.pdf' your-file.txt
- You need a recurring pattern in your PDF (for example, a copyright notice on each slide)
  - sed -n '/<PATTERN>/{n;n;n;p}' your-file.txt | awk '!x[$0]++' > your-file-structure.txt
  - adjust {n;n;n;p} as needed: as written, it prints (p) the third line after the line matching your pattern (each n moves on to the next line)
  - awk '!x[$0]++' removes duplicate lines, keeping the first occurrence
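
Since the question asks about Java, the sed/awk step above can be approximated in plain Java. This is a rough sketch under assumptions, not a drop-in replacement: `LineAfterPattern` and `extract` are hypothetical names, the input is assumed to be the lines of the text file produced by pdftotext, and `offset` plays the role of the number of `n` commands.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper mirroring: sed -n '/<PATTERN>/{n;n;n;p}' | awk '!x[$0]++'
final class LineAfterPattern {

    /**
     * For every line containing a match of marker, collects the line that
     * sits offset lines further down, dropping duplicates while keeping
     * the order of first appearance.
     */
    static List<String> extract(List<String> lines, Pattern marker, int offset) {
        // LinkedHashSet keeps insertion order, like awk '!x[$0]++'
        Set<String> seen = new LinkedHashSet<>();
        for (int i = 0; i < lines.size(); i++) {
            if (marker.matcher(lines.get(i)).find()) {
                int target = i + offset;
                if (target >= lines.size()) {
                    break; // ran out of input before reaching the target line
                }
                seen.add(lines.get(target));
                i = target; // the skipped lines are consumed, as with sed's n
            }
        }
        return new ArrayList<>(seen);
    }
}
```

The lines can be read with `java.nio.file.Files.readAllLines`, and calling `extract(lines, Pattern.compile("<PATTERN>"), 3)` corresponds to `{n;n;n;p}`. As with the shell version, the pattern and the offset have to be tuned to the document at hand.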
