I am trying to extract structured text from PDF documents. Since PDF files usually carry no structural information, I thought I could start with PDFs generated by LaTeX, which should have some regularity.
Do you know of any recurring patterns in LaTeX-generated PDF files that I could use for parsing?
Take a look at Apache PDFBox for extracting text from PDF documents. Alternatively, you can use Apache Tika, which parses several document types through a single standard interface (though it may be overkill here). I would not recommend parsing PDFs by hand.
InftyReader (commercial solution)
http://www.sciaccess.net/en/InftyReader/index.html
In trial mode, recognition is limited to one page at a time and 5 pages per day.
From the terminal
A quick and dirty solution that will probably take a lot of trial and error.
First, convert your PDF to plain text:
pdftotext 'your-file.pdf' your-file.txt
Then you need a recurring pattern in your PDF (for example, a copyright line on each slide) to anchor the extraction:
sed -n '/<PATTERN>/{n;n;n;p}' your-file.txt | awk '!x[$0]++' > your-file-structure.txt
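{n;n;n;p} runs on every line that matches <PATTERN>: it prints (p) the third line after the match (n;n;n reads the next line three times).
awk '!x[$0]++' removes duplicate lines, keeping the first occurrence of each.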
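The sed and awk pipeline can be tried without a PDF at hand. The following is only a minimal sketch: the sample text, the 'Example Corp' pattern, and the extract_structure helper are all made up for illustration, and it assumes the line you want always sits exactly three lines below the recurring pattern, which real pdftotext output may not guarantee.

```shell
#!/bin/sh
# Hypothetical stand-in for pdftotext output: every slide carries the same
# copyright line, and the slide title sits three lines below it.
sample='(c) 2013 Example Corp
1

Introduction
Some body text
(c) 2013 Example Corp
2

Methods
More body text
(c) 2013 Example Corp
3

Introduction
Repeated title'

# extract_structure PATTERN (hypothetical helper)
# Reads text on stdin, prints the third line after each line matching
# PATTERN, then drops repeated lines so only the first occurrence remains.
# PATTERN is used as a sed regex, so it must not contain a "/".
extract_structure() {
    # The ";" before "}" is not needed by GNU sed but keeps BSD sed happy.
    sed -n "/$1/{n;n;n;p;}" | awk '!seen[$0]++'
}

result=$(printf '%s\n' "$sample" | extract_structure 'Example Corp')
printf '%s\n' "$result"
```

On this sample, sed emits Introduction, Methods, Introduction, and awk drops the second Introduction. If the offset differs in your file, change the number of n commands to match.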
Source: https://habr.com/ru/post/1444833/