How to extract significant text content from a LaTeX document

Question

I need to extract only the text from my dissertation, written in LaTeX, so that it can be run through an automatic plagiarism checker. I only know about the "draft" option, and that is not enough.

I have to omit:

images
tables and other numbers
equations
captions and footnotes

It would be nice to delete all links. The output should be a plain text file (UTF-8).

Is there an easy way to do this? I would really rather not copy it out by hand, page by page.

+4

latex

odiroot Jan 29 '11 at 13:43

5 answers

Yes: untex, a simple C program. You can also look at detex.

+1

huitseeker Jan 29 '11 at 2:04

You can use a document converter such as pandoc, or convert the output PDF to plain text with something like Calibre.

+1

frabjous Feb 01 '11 at 20:42

Usually you need LaTeX itself to process the text. Say you have

\newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}
...
I spend a lot of time on \SO, blah blah ....

Simply stripping the markup from the paragraph here will not give the intended text, because the paragraph contains a macro.

Trying to extract text directly from a *.tex file therefore usually leaves much to be desired, and it is better to work from LaTeX's output. I would recommend converting LaTeX to HTML and then HTML to text. You will probably need some manual cleanup, but I think the result should be relatively close.

+1

hlovdal Feb 01 '11 at 10:34

Detex has already been mentioned, but there is another project that aims to improve on it. It is called opendetex; check it out!

+1

Joel Berger Feb 04 '11 at 3:03

Ulrich Schwarz · Accepted Answer · 2011-01-29T14:07:47+0000

You can try using the comment package (or one of a dozen alternatives) to turn the equation, figure, table, etc. environments into comment environments, and \renewcommand\footnote[1]{} to remove the footnotes. \pagestyle{empty} should remove page headers etc., so running pdftotext on the result should give you a good start on what you want.
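
A minimal sketch of what such a text-only build might look like, assuming the comment package's \excludecomment is used to redefine the unwanted environments. The report class and the body text are placeholders, not taken from the question. Note that the comment package expects each \begin{...} and \end{...} of an excluded environment on a line of its own, and that inline math and starred or other display environments are not covered by this sketch.

```latex
% Text-only build: an illustrative sketch, not a tested recipe for any given thesis.
\documentclass{report}   % assumption: substitute the thesis's own class
\usepackage{comment}

% Redefine the environments that should not reach the output as comments.
% Captions inside figure and table disappear together with them.
\excludecomment{figure}
\excludecomment{table}
\excludecomment{equation}

\renewcommand\footnote[1]{}  % swallow the footnote text
\pagestyle{empty}            % no running headers or page numbers
                             % (chapter-opening pages may still use the plain style)

\begin{document}
Body text that should survive.\footnote{This footnote is dropped.}
\begin{equation}
E = mc^2
\end{equation}
More body text.
\end{document}
```

After compiling to PDF, something like pdftotext -enc UTF-8 thesis.pdf thesis.txt (thesis being a hypothetical file name) should produce the plain UTF-8 text. Because the document is actually compiled, user-defined macros are expanded before the text is extracted.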
