iText: Import text style and information from an existing PDF file

Question

I create PDF files with iText and it works great. But now I need to import HTML content by way of an existing PDF file. I know I could simply use the XMLWorker class to generate text directly from HTML in my own document, but I'm not sure it supports all the HTML features I need, so I'm looking for a workaround: a PDF is created from the HTML using XSLT, and the contents of that PDF should then be copied into my document.

The book ("iText in Action") describes two ways to do this. The first parses the PDF and retrieves the text (or other information) from it using PdfReaderContentParser and a TextExtractionStrategy. It looks like this:

PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = parser.processContent(i, new LocationTextExtractionStrategy());
    document.add(new Chunk(strategy.getResultantText()));
}

But this only adds plain, unstyled text to the document. There are several other extraction strategies, and maybe one of them does exactly what I want, but I have not found it.

The second way is to add each page of the PDF to the document as a com.itextpdf.text.Image. This is obviously not a good idea, because it adds the entire page to the document, even if the existing PDF contains only a single line of text. It is done like this:

PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
PdfReader reader = new PdfReader(pdf);
PdfImportedPage page;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    page = writer.getImportedPage(reader, i);
    document.add(Image.getInstance(page));
}

As I said, this also copies all the blank space at the end of the PDF page, but I need to continue my own text right after the last line of imported text. If I could convert this com.itextpdf.text.Image to a java.awt.image.BufferedImage, I could use getSubimage() together with information extracted from the PDF to cut off the blank lines. But I could not find a way to do this.
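For reference, the cropping step mentioned above could look roughly like the following sketch. It assumes the page has already been rasterized into a BufferedImage by some other tool (as far as I know, iText itself does not render pages to bitmaps, and an Image created from a PdfImportedPage wraps vector content rather than pixels). The class name BlankRowCropper and the idea of passing a background color are hypothetical choices made for illustration only.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of iText: trims blank rows from the bottom of a rasterized page.
final class BlankRowCropper {

    /** Returns the index of the last row containing a non-background pixel, or -1 if the image is blank. */
    static int lastContentRow(BufferedImage image, int backgroundRgb) {
        int background = backgroundRgb & 0xFFFFFF; // compare color only, ignore alpha
        for (int y = image.getHeight() - 1; y >= 0; y--) {
            for (int x = 0; x < image.getWidth(); x++) {
                if ((image.getRGB(x, y) & 0xFFFFFF) != background) {
                    return y;
                }
            }
        }
        return -1;
    }

    /** Crops everything below the last content row; returns the image unchanged if it is entirely blank. */
    static BufferedImage cropBottom(BufferedImage image, int backgroundRgb) {
        int last = lastContentRow(image, backgroundRgb);
        if (last < 0) {
            return image;
        }
        // getSubimage returns a view that shares pixel data with the original image
        return image.getSubimage(0, 0, image.getWidth(), last + 1);
    }
}
```

A call such as BlankRowCropper.cropBottom(pageImage, 0xFFFFFF) would then drop the trailing white rows. Note that this works on pixels, so the text would end up as a bitmap in the target document rather than as real, selectable text.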

So my question is: is there a way to import text together with its style information from an existing PDF using iText?

java html pdf itext

moli 13 . '15 6:29

mkl · Accepted Answer · 2015-08-14T10:45:09+0000

XSLT PDF , , .

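You can avoid importing the blank space by cropping each page of the existing PDF to its actual content before importing it with PdfWriter.getImportedPage.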

To crop the pages held by a PdfReader to their content, you can do this:

static void cropPdf(PdfReader reader) throws IOException
{
    int n = reader.getNumberOfPages();
    for (int i = 1; i <= n; i++)
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        MarginFinder finder = parser.processContent(i, new MarginFinder());
        Rectangle rect = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());

        PdfDictionary page = reader.getPageN(i);
        page.put(PdfName.MEDIABOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
    }
}

(Source: ImportPageWithoutFreeSpace.java)

The MarginFinder class used here determines the bounding box of the content on a page. Its source is in MarginFinder.java.
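The core idea of such a margin finder can be sketched without iText: collect the rectangle of every piece of content on the page and keep the smallest rectangle that encloses them all. The following is only an illustration under that assumption, not the actual MarginFinder code; the class name BoundingBoxSketch and its input list are hypothetical, whereas the real class receives its rectangles from iText's content parser.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

// Hypothetical illustration; the real MarginFinder is fed by iText render callbacks.
final class BoundingBoxSketch {

    /** Returns the smallest rectangle enclosing all given rectangles, or null if the list is empty. */
    static Rectangle2D.Float enclose(List<Rectangle2D.Float> pieces) {
        Rectangle2D.Float box = null;
        for (Rectangle2D.Float piece : pieces) {
            if (box == null) {
                // start with a copy so the caller's rectangle is not modified
                box = new Rectangle2D.Float(piece.x, piece.y, piece.width, piece.height);
            } else {
                box.add(piece); // Rectangle2D.add grows box to the union of both rectangles
            }
        }
        return box;
    }
}
```

The getMinX(), getMinY(), getMaxX() and getMaxY() values of the result play the role of getLlx(), getLly(), getUrx() and getUry() in cropPdf above: they become the new media box of the page.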

PdfReader readerText = new PdfReader(docText);
cropPdf(readerText);
PdfReader readerGraphics = new PdfReader(docGraphics);
cropPdf(readerGraphics);
try (FileOutputStream fos = new FileOutputStream(new File(RESULT_FOLDER, "importPages.pdf")))
{
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, fos);
    document.open();
    document.add(new Paragraph("Let's import 'textOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
    document.add(Image.getInstance(writer.getImportedPage(readerText, 1)));
    document.add(new Paragraph("and now 'graphicsOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
    document.add(Image.getInstance(writer.getImportedPage(readerGraphics, 1)));
    document.add(new Paragraph("That's all, folks!", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));

    document.close();
}
finally
{
    readerText.close();
    readerGraphics.close();
}

(Unit test testImportPages in ImportPageWithoutFreeSpace.java)

docText

docGraphics

