Extracting PDF Text Using iText

We are conducting research in the field of information extraction, and we would like to use iText.

We are in the process of learning iText. According to the literature we reviewed, iText is the best tool for this. Is it possible to extract text from a PDF into a string with iText? I read a related question on Stack Overflow, but it only covers reading the text, not extracting it. Can someone help me with my problem? Thanks.

+6
2 answers

As Theodore said, you can extract text from a PDF, and as Chris said, this works

as long as it is actually text (not outlines or bitmaps).

The best thing to do is to buy Bruno Lowagie's book iText in Action. In the second edition, chapter 15 covers text extraction.

You can also look at the examples on his site: http://itextpdf.com/examples/iia.php?id=279

You can parse a PDF to create a plain text file. Here is some sample code:

    /*
     * This class is part of the book "iText in Action - 2nd Edition"
     * written by Bruno Lowagie (ISBN: 9781935182610)
     * For more info, go to: http://itextpdf.com/examples/
     * This example only works with the AGPL version of iText.
     */
    package part4.chapter15;

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintWriter;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
    import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

    public class ExtractPageContent {

        /** The original PDF that will be parsed. */
        public static final String PREFACE = "resources/pdfs/preface.pdf";
        /** The resulting text file. */
        public static final String RESULT = "results/part4/chapter15/preface.txt";

        /**
         * Parses a PDF to a plain text file.
         * @param pdf the original PDF
         * @param txt the resulting text
         * @throws IOException
         */
        public void parsePdf(String pdf, String txt) throws IOException {
            PdfReader reader = new PdfReader(pdf);
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            PrintWriter out = new PrintWriter(new FileOutputStream(txt));
            TextExtractionStrategy strategy;
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
                out.println(strategy.getResultantText());
            }
            reader.close();
            out.flush();
            out.close();
        }

        /**
         * Main method.
         * @param args no arguments needed
         * @throws IOException
         */
        public static void main(String[] args) throws IOException {
            new ExtractPageContent().parsePdf(PREFACE, RESULT);
        }
    }

Pay attention to the license:

This example only works with the AGPL version of iText.

If you look at the other examples, they show how to leave out parts of the text or how to extract only parts of a PDF.

Hope this helps.

+13

iText allows you to do this, but there is no guarantee about the granularity of the text blocks; that depends on the PDF producers that were used to create your documents.

It is possible that each word or even each letter has its own text block. The blocks also need not be in reading order; for reliable results, you may need to reorder the text blocks based on their coordinates. You may also need to figure out whether to insert spaces between text blocks.
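The reordering and spacing described above can be sketched roughly as follows. This is only an illustration in plain Java and does not call iText: the ChunkOrdering and Chunk names, the coordinate fields, and the two threshold values are all assumptions made for the example. In practice the positions would come from whatever the extraction strategy reports for each text block, and the thresholds would need tuning for the font sizes in your documents.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of iText): rebuilds reading order from positioned chunks. */
final class ChunkOrdering {

    /** A text block with the left end of its baseline and its width, in PDF units (y grows upward). */
    static final class Chunk {
        final String text;
        final float x, y, width;

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Both thresholds are assumed values for illustration only.
    static final float LINE_TOLERANCE = 2f; // max baseline difference within one line
    static final float SPACE_GAP = 1.5f;    // horizontal gap that counts as a space

    static String join(List<Chunk> chunks) {
        // 1. Sort from the top of the page to the bottom.
        List<Chunk> byHeight = new ArrayList<>(chunks);
        byHeight.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        // 2. Group chunks whose baselines are close to the first chunk of the current line.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : byHeight) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= LINE_TOLERANCE) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        // 3. Within each line, go left to right and insert a space across large gaps.
        StringBuilder out = new StringBuilder();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            Chunk prev = null;
            for (Chunk c : line) {
                if (prev != null && c.x - (prev.x + prev.width) > SPACE_GAP) {
                    out.append(' ');
                }
                out.append(c.text);
                prev = c;
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("World", 60f, 700f, 30f));
        chunks.add(new Chunk("Second line", 10f, 680f, 60f));
        chunks.add(new Chunk("lo", 28f, 700.5f, 12f));
        chunks.add(new Chunk("Hel", 10f, 700f, 18f));
        System.out.print(join(chunks));
    }
}
```

Grouping by baseline first and only then sorting by x avoids a comparator that treats "almost equal" y values as equal, which would not be a consistent ordering. As far as I know, iText 5 also ships a LocationTextExtractionStrategy in the same parser package that orders text by position, so it is worth trying that before writing your own.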

+3

Source: https://habr.com/ru/post/905718/

