Determine if the PDF page contains text or just an image

How can I determine, using Java, whether a PDF page contains text or is purely an image?

I searched a lot of forums and sites, but haven’t found an answer yet.

Can I extract the text from a PDF and also find out whether a given page is in image or text format?

    PdfReader reader = new PdfReader(INPUTFILE);
    PrintWriter out = new PrintWriter(new FileOutputStream(OUTPUTFILE));
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // here I want to test the structure of the page, if that is possible
        out.println(PdfTextExtractor.getTextFromPage(reader, i));
    }
    out.close();     // flush the extracted text to OUTPUTFILE
    reader.close();  // release the input file
1 answer

There is no foolproof way to do what you want.

Text can be added to a PDF file in different ways. For example, all the glyphs could be drawn as shapes using graphics state operators instead of text state operators. (Sorry if this sounds like Chinese to you, but I can assure you it is proper PDF terminology.)
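To make the difference concrete, here is a rough sketch (not how iText works internally) that lists the operators in a page content stream. It assumes the stream has already been decompressed into a String, it ignores inline images, and the class and method names are made up for illustration. Text shown with text operators produces Tj, TJ, ' or "; glyphs drawn as outlines produce only path operators such as m, l, c and f.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: lists the operators in an already-decoded PDF content stream. */
final class ContentStreamScan {

    private static final List<String> TEXT_SHOWING = Arrays.asList("Tj", "TJ", "'", "\"");
    private static final List<String> PATH_PAINTING =
            Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*");

    /** Returns every token that is not a string, name or number (keywords like true/null are not filtered). */
    static List<String> operators(String stream) {
        List<String> ops = new ArrayList<>();
        int n = stream.length();
        int i = 0;
        while (i < n) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c) || ")>[]{}".indexOf(c) >= 0) {
                i++;                                  // separators and closing delimiters
            } else if (c == '(') {
                i = skipLiteralString(stream, i);     // (text) may nest and contain escapes
            } else if (c == '<') {
                if (i + 1 < n && stream.charAt(i + 1) == '<') {
                    i += 2;                           // "<<" opens a dictionary
                } else {
                    int close = stream.indexOf('>', i);
                    i = close < 0 ? n : close + 1;    // <hex string>
                }
            } else if (c == '%') {                    // comment runs to the end of the line
                while (i < n && stream.charAt(i) != '\n' && stream.charAt(i) != '\r') i++;
            } else {
                int j = i + 1;                        // a name (/F1), a number or an operator
                while (j < n && !Character.isWhitespace(stream.charAt(j))
                        && "()<>[]{}/%".indexOf(stream.charAt(j)) < 0) {
                    j++;
                }
                String token = stream.substring(i, j);
                boolean operand = token.startsWith("/")
                        || token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
                if (!operand) ops.add(token);
                i = j;
            }
        }
        return ops;
    }

    /** Returns the index just past the ')' that closes the string starting at 'start'. */
    private static int skipLiteralString(String s, int start) {
        int depth = 0;
        int i = start;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') { i += 2; continue; }      // skip the escaped character
            if (c == '(') depth++;
            else if (c == ')' && --depth == 0) return i + 1;
            i++;
        }
        return s.length();                            // unterminated string
    }

    static boolean showsText(List<String> ops) {
        return ops.stream().anyMatch(TEXT_SHOWING::contains);
    }

    static boolean paintsPaths(List<String> ops) {
        return ops.stream().anyMatch(PATH_PAINTING::contains);
    }

    public static void main(String[] args) {
        String asText = "BT /F1 12 Tf 72 720 Td (Hello (Tj) world) Tj ET";
        String asPaths = "q 1 0 0 1 72 720 cm 0 0 m 10 0 l 10 20 l h f Q";
        System.out.println(operators(asText));              // [BT, Tf, Td, Tj, ET]
        System.out.println(showsText(operators(asPaths)));  // false
    }
}
```

The second stream could draw something a human reads as a letter, yet no text operator ever appears in it, which is why a text-based check cannot catch every case.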

If an ad hoc solution that covers the most common situations and occasionally gets an exotic PDF file wrong is good enough for you, then you already have a good first workaround.

In your code, you loop over all the pages and ask iText whether there is any text on each page. That is already a good indication.
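One detail worth handling is that the extracted string may be non-empty and still contain nothing but line breaks or spaces. A minimal sketch of such a check, assuming the string comes from PdfTextExtractor.getTextFromPage(reader, i) as in your loop (the helper class and method name are hypothetical):

```java
/** Illustration only: decides whether an extracted string holds any visible character. */
final class PageTextCheck {

    /**
     * Returns true if the string contains at least one character that is not
     * whitespace, a Unicode space (such as a non-breaking space) or a control character.
     */
    static boolean hasVisibleText(String extracted) {
        if (extracted == null) {
            return false;
        }
        return extracted.codePoints().anyMatch(cp ->
                !Character.isWhitespace(cp)
                        && !Character.isSpaceChar(cp)
                        && !Character.isISOControl(cp));
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleText("Hello"));     // true
        System.out.println(hasVisibleText(" \n\t"));     // false
    }
}
```

Inside the loop this would be called as hasVisibleText(PdfTextExtractor.getTextFromPage(reader, i)); a false result is a hint, not proof, that the page is an image.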

Internally, your code uses the RenderListener interface. iText parses the content of the page and invokes the methods of a specific RenderListener implementation. MyTextRenderListener is an example of a custom implementation; it is used in the ParsingHelloWorld example.

The interface also has a renderImage() method (see, for example, MyImageListener). If this method is invoked, you can be 100% sure that the page also contains an image, and you can use the ImageRenderInfo object to get the position, width and height of the image (that is, if you know how to interpret the Matrix returned by the getImageCTM() method).
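As for interpreting that matrix: in PDF an image is painted into the unit square, and the current transformation matrix [a b 0; c d 0; e f 1] maps that square onto the page. The following sketch computes the bounding box from the six values. It is plain Java with hypothetical class and method names; with iText 5 the values would be read with ctm.get(Matrix.I11), Matrix.I12, Matrix.I21, Matrix.I22, Matrix.I31 and Matrix.I32, in that order.

```java
/** Illustration only: bounding box of an image placed with the CTM [a b 0; c d 0; e f 1]. */
final class ImagePlacement {

    final double x;       // lower-left corner of the bounding box, in user-space units
    final double y;
    final double width;   // size of the bounding box, in user-space units
    final double height;

    private ImagePlacement(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    static ImagePlacement fromCtm(double a, double b, double c, double d, double e, double f) {
        // A point (u, v) of the unit square maps to (a*u + c*v + e, b*u + d*v + f).
        double[] xs = { e, a + e, c + e, a + c + e };   // corners (0,0) (1,0) (0,1) (1,1)
        double[] ys = { f, b + f, d + f, b + d + f };
        double minX = xs[0], maxX = xs[0], minY = ys[0], maxY = ys[0];
        for (int i = 1; i < 4; i++) {
            minX = Math.min(minX, xs[i]);
            maxX = Math.max(maxX, xs[i]);
            minY = Math.min(minY, ys[i]);
            maxY = Math.max(maxY, ys[i]);
        }
        return new ImagePlacement(minX, minY, maxX - minX, maxY - minY);
    }

    public static void main(String[] args) {
        // Unrotated image, 200 x 100 units, lower-left corner at (50, 60).
        ImagePlacement p = fromCtm(200, 0, 0, 100, 50, 60);
        System.out.println(p.x + " " + p.y + " " + p.width + " " + p.height);  // 50.0 60.0 200.0 100.0
    }
}
```

For an image that is neither rotated nor skewed this reduces to width = a, height = d and position (e, f); mapping all four corners also handles rotated images.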

Using all these elements, you can already go a long way toward what you need, but keep in mind that there will always be exotic PDF files that escape all your checks.
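Putting the two signals together could look roughly like the sketch below. It assumes you have recorded, per page, whether any text was reported and how many times renderImage() was called; the enum and class names are hypothetical. The last category is where the exotic files end up.

```java
/** Illustration only: possible outcomes of the heuristic. */
enum PageKind { TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGE, EMPTY_OR_UNKNOWN }

/** Illustration only: combines the text signal and the image signal for one page. */
final class PageClassifier {

    /**
     * @param textSeen   true if text extraction returned visible text for the page
     * @param imageCount number of images reported for the page (for example, renderImage() calls)
     */
    static PageKind classify(boolean textSeen, int imageCount) {
        boolean imageSeen = imageCount > 0;
        if (textSeen && imageSeen) {
            return PageKind.TEXT_AND_IMAGE;
        }
        if (textSeen) {
            return PageKind.TEXT_ONLY;
        }
        if (imageSeen) {
            return PageKind.IMAGE_ONLY;
        }
        // Neither signal fired: a blank page, or content the checks cannot see,
        // such as glyphs drawn as vector paths.
        return PageKind.EMPTY_OR_UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(false, 1));  // IMAGE_ONLY
        System.out.println(classify(true, 0));   // TEXT_ONLY
    }
}
```

Note that a scanned page that has been run through OCR usually carries an invisible text layer on top of the image, so it would come out as TEXT_AND_IMAGE rather than IMAGE_ONLY.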


Source: https://habr.com/ru/post/945083/

