How to extract significant text content from a LaTeX document

I need to extract only the text from my dissertation, written in LaTeX, so that it can be run through an automatic plagiarism check. I only know about the "draft" option, and that is not enough.

I have to omit:

  • images
  • tables and other figures
  • equations
  • captions and footnotes.

It would also be nice to remove all links. The output should be a plain text file (UTF-8).

Is there an easy way to do this? I would really rather not copy it manually, page by page.

5 answers

You can try using the comment package (or one of a dozen alternatives) to turn the equation, figure, table, etc. environments into comment environments, and \renewcommand\footnote[1]{} to remove the footnotes. \pagestyle{empty} should remove the page headers and the like, so running pdftotext on the result should get you started on what you want.
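
A minimal sketch of such a preamble, assuming the thesis uses the standard figure, table and equation environments and that these lines come after the other packages are loaded. \excludecomment is the command the comment package provides for discarding an environment's contents; which environments to list is a guess about the document, so others (align, a tabular outside a float, and so on) would need their own line.

```latex
% Text-only build: discard floats and displayed equations.
% The environment names below are assumptions about the thesis.
\usepackage{comment}
\excludecomment{figure}    % drops images together with their captions
\excludecomment{table}     % drops tables together with their captions
\excludecomment{equation}  % drops numbered displayed equations

% Swallow the footnote text instead of typesetting it.
\renewcommand\footnote[1]{}

% No running headers or page numbers in the PDF.
\pagestyle{empty}
```

The comment package expects the \begin{...} and \end{...} of an excluded environment to each sit on a line of their own, and inline math such as $x$ is left untouched, so some manual cleanup will probably remain. After compiling, a command along the lines of pdftotext -enc UTF-8 thesis.pdf thesis.txt (file names hypothetical) should write the UTF-8 text file.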


Yes: untex, a simple C program. You can also look at detex.


You can use a document converter such as pandoc, or convert the output PDF to plain text with something like Calibre.


Usually the text itself needs to be processed by LaTeX. Say you have

\newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}

...

I spend a lot of time on \SO, blah blah ....

Simply stripping the TeX markup from this paragraph (detexing it) will not give the intended text, because the paragraph contains a macro that has to be expanded first.

So trying to extract text directly from a *.tex file usually leaves much to be desired. It is better to work from the output of LaTeX. I would recommend converting LaTeX to HTML and then HTML to text. You will probably need some manual cleanup, but I think the result should be relatively close.


detex has already been mentioned, but there is another project that aims to improve on it. It is called opendetex; take a look!


Source: https://habr.com/ru/post/1337593/

