How to remove headers and footers from a PDF using iText in Java

I use the iText PDF library to convert PDF to text.

Below is my code for converting PDF to text file using Java.

public class PdfConverter {

/** The original PDF that will be parsed. */
public static final String pdfFileName = "jdbc_tutorial.pdf";
/** The resulting text file. */
public static final String RESULT = "preface.txt";

/**
 * Parses a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));

    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
        System.out.println(strategy.getResultantText());
    }
    out.flush();
    out.close();
    reader.close();
}

/**
 * Main method.
 * @param    args    no arguments needed
 * @throws IOException
 */
public static void main(String[] args) throws IOException {
    new PdfConverter().parsePdf(pdfFileName, RESULT);
}
}

The above code works to extract PDF to text. But my requirement is to ignore the header and footer and only extract content from the PDF file.

+4
source share
2 answers

pdf , ( , ). , ParseTaggedPdf. ExtractPageContentArea, ParseTaggedPdf . , .

. , apache API, PdfBox, tika , PDFTextStream. , , , iText . PdfBox PDFTextStripperByArea PDFTextStripper. JavaDoc , , .

+3

IText, http://what-when-how.com/itext-5/parsing-pdfs-part-2-itext-5/

, , .

PdfReader reader = new PdfReader(pdf);
PrintWriter out= new PrintWriter(new FileOutputStream(txt));
//Creating the rectangle
Rectangle rect=new Rectangle(70,80,420,500);
//creating a filter based on the rectangle
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for(int i=1;i<=reader.getNumberOfPages();i+){
    //setting the filter on the text extraction strategy
    strategy= new FilteredTextRenderListener(
      new LocationTextExtractionStrategy(),filter);
    out.println(PdfTextExtractor.getTextFromPage(reader,i,strategy));
}
out.flush();out.close();

-, , PDF .

0

Source: https://habr.com/ru/post/1570746/


All Articles