How to extract text layer and background layer from pdf?

Question

In my project, I have to make a PDF Viewer in HTML5 / CSS3, and the application should allow the user to add comments and annotations. In fact, I have to do something very similar to crocodoc.com.

At first, I thought of creating images from the PDF and letting users mark areas and post comments on those areas. Unfortunately, the client also wants to navigate within the PDF and attach comments only to permitted elements (for example, paragraphs or selected text).

So now my problem is how to extract the text, and what the best way to do that is. If anyone has any clues about how I can achieve this, I would appreciate it.

I tried pdftohtml, but the output does not look like the original document, which is really complex (example document). Even this one does not really reproduce the original, but it is much better than pdftohtml.

I am open to any solution, with a preference for the command line under Linux.

linux html5 php pdf ghostscript

yvan Sep 08 '11 at 18:30

5 answers

Tom · Answer 1 · 2011-09-17T14:00:30+0000

I was on the same path as you, with even more complex tasks.

After testing everything, I ended up using C# under Mono (which works on Linux) with iTextSharp.

Even with a very complete library such as iTextSharp, some tasks require trial and error :)

Retrieving the text of a page is easy (see the snippet below). However, if you also want to keep the text coordinates, fonts, and sizes, you will have more work to do.

    using iTextSharp.text.pdf;

    int pdf_page = 5;
    string page_text = "";
    PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
    // Tokenize the raw content stream of the chosen page.
    PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
    while (token.NextToken()) {
        if (token.TokenType == PRTokeniser.TokType.STRING) {
            // String operands carry the text that is shown.
            page_text += token.StringValue;
        } else if (token.StringValue == "Tj") {
            // "Tj" is the show-text operator; separate runs with a space.
            page_text += " ";
        }
    }

Call Console.WriteLine(token.StringValue) on every token to see how paragraphs of text are structured inside a PDF file. That way you can work out coordinates, font, font size, and so on.
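
To illustrate what that token dump tends to look like and how position and font can be tracked from it, here is a minimal Python sketch. It is not iTextSharp code: the sample stream and the `text_runs` helper are made up for illustration, and it assumes an uncompressed content stream that uses only the BT, Tf, Td, and Tj operators (no Tm, TJ, escaped parentheses, or transformation matrices).

```python
import re

# Simplified tokenizer: literal strings, names, numbers, operators.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"        # literal string such as (Hello)
    r"|/[^\s/()<>\[\]]+"           # name such as /F1
    r"|[-+]?(?:\d+\.?\d*|\.\d+)"   # number such as 72 or -14
    r"|[A-Za-z*]+"                 # operator such as BT, Tf, Td, Tj
)

# Hypothetical content stream, written by hand for this example.
SAMPLE = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) Tj ET"


def text_runs(stream):
    """Return each shown string with the position and font active at that point."""
    operands = []
    font, size = None, None
    x = y = 0.0
    runs = []
    for tok in TOKEN.findall(stream):
        if tok[0] in "(/+-." or tok[0].isdigit():
            operands.append(tok)          # operands precede their operator
            continue
        if tok == "BT":                   # begin text object: reset position
            x = y = 0.0
        elif tok == "Tf":                 # /Font size Tf
            font, size = operands[-2], float(operands[-1])
        elif tok == "Td":                 # tx ty Td: move relative to line start
            x += float(operands[-2])
            y += float(operands[-1])
        elif tok == "Tj":                 # (string) Tj: show text
            runs.append({"text": operands[-1][1:-1], "x": x, "y": y,
                         "font": font, "size": size})
        operands.clear()                  # every operator consumes its operands
    return runs


if __name__ == "__main__":
    for run in text_runs(SAMPLE):
        print(run)
```

Real documents are considerably messier (compressed streams, kerning arrays, font encodings), which is where the trial and error mentioned above comes in.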

Addition:

Given the assignment you must complete, I have a suggestion for you:

Extract the text together with its coordinates, font families, and sizes, that is, all the information about each paragraph. Then convert the PDF pages to images and, in your web viewer, overlay invisible selectable text on top of the paragraphs in the image where necessary.

That way your users can select parts of the text where needed, without you having to rebuild the entire PDF file in HTML :)
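
The overlay idea can be sketched roughly as follows. This is only an illustration in Python: the `overlay_html` function, the shape of the `runs` dictionaries, and the image file name are assumptions, not part of any library. It assumes each run has a baseline position in PDF points with the origin at the bottom-left of the page, and that the page image was rendered at `scale` pixels per point.

```python
from html import escape


def overlay_html(image_url, page_w_pt, page_h_pt, runs, scale=1.5):
    """Build a page <div>: the rendered image plus transparent, selectable text."""
    spans = []
    for r in runs:
        left = r["x"] * scale
        # PDF y grows upward from the bottom; CSS top grows downward.
        # y is the baseline, so subtract the font size to approximate the line top.
        top = (page_h_pt - r["y"] - r["size"]) * scale
        spans.append(
            f'<span style="position:absolute;left:{left:.1f}px;top:{top:.1f}px;'
            f'font-size:{r["size"] * scale:.1f}px;color:transparent;'
            f'white-space:pre">{escape(r["text"])}</span>'
        )
    width, height = page_w_pt * scale, page_h_pt * scale
    return (
        f'<div style="position:relative;width:{width:.0f}px;height:{height:.0f}px">'
        f'<img src="{escape(image_url)}" width="{width:.0f}" height="{height:.0f}" alt="">'
        + "".join(spans)
        + "</div>"
    )


if __name__ == "__main__":
    demo = [{"text": "Hello", "x": 72, "y": 720, "size": 12}]
    print(overlay_html("page-5.png", 612, 792, demo))
```

The browser's font will rarely match the PDF font's metrics exactly, so in practice the span widths need adjusting before the selection highlight lines up with the image.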

Timothy Allyn Drake · Answer 2 · 2011-09-21T12:36:37+0000

I recently researched this and came up with my own PHP approach that relies only on FOSS. The FPDI PHP class can be used to import a PDF document for use with the TCPDF or FPDF PHP classes, both of which provide functionality for creating, reading, updating, and writing PDF documents. Personally, I prefer TCPDF because, compared with FPDF, it offers a wider range of features, a richer API, more usage examples, and a more active community forum.

Choose one of these classes, or another one, to process PDF documents programmatically. Keeping in mind both current and possible future deliverables, as well as the desired user experience, decide where (for example, server-side PHP, client-side JavaScript, or both) and to what extent (feature by feature) your interactive logic should be implemented.

Personally, I would use a TCPDF instance obtained by importing the PDF document via FPDI to iteratively inspect the document, translate it to a common format (XML, JSON, etc.), and store the resulting representation in relational tables designed to persist data at the desired level of document hierarchy and detail. The necessary level of detail is usually dictated by the specification document and what it says about both current and possible future deliverables.

Note: In this case I strongly recommend translating documents and storing them in a common format, to create a layer of abstraction and transparency. For example, a possible and unforeseen future deliverable might be to provide the same functionality for users who upload Microsoft Word documents. If uploaded documents were not translated and stored in a common format, you would almost certainly need to update the web service API and the business logic that depends on it. That ultimately leads to storing bloated, suboptimal data and to an inefficient use of development resources on designing, developing, and supporting several translators. It would also be an inefficient use of server resources to translate outgoing data on every request, as opposed to translating incoming data into the optimal format only once.

Next, I would extend the base document tables by designing and relating additional tables for persisting data about specific document assets, such as:
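
As a rough sketch of that abstraction layer (in Python rather than PHP, and with every name and field invented for illustration), each input type gets its own translator, all translators emit the same structure, and the JSON result is what gets stored once at upload time:

```python
import json

# Registry of translators keyed by file extension (hypothetical design).
TRANSLATORS = {}


def translator(extension):
    """Decorator that registers a function as the translator for one file type."""
    def register(func):
        TRANSLATORS[extension] = func
        return func
    return register


def _block(value, x=None, y=None):
    # The one block shape every translator must produce.
    return {"type": "text", "value": value, "x": x, "y": y}


@translator(".pdf")
def from_pdf(pages):
    # Assumed input: a list of pages, each a list of {"text", "x", "y"} runs.
    return {"pages": [
        {"number": n, "blocks": [_block(r["text"], r["x"], r["y"]) for r in runs]}
        for n, runs in enumerate(pages, start=1)
    ]}


@translator(".docx")
def from_docx(paragraphs):
    # Assumed input: a flat list of paragraph strings without coordinates.
    return {"pages": [{"number": 1, "blocks": [_block(p) for p in paragraphs]}]}


def to_common_json(extension, extracted):
    """Translate once on upload; the returned JSON is what would be persisted."""
    return json.dumps(TRANSLATORS[extension](extracted), sort_keys=True)


if __name__ == "__main__":
    print(to_common_json(".pdf", [[{"text": "Hello", "x": 72, "y": 720}]]))
    print(to_common_json(".docx", ["Hello"]))
```

With this shape, supporting a new upload type means adding one translator, while the API and the stored data stay the same.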

Supported Versions / Editing / Deleting

What
- Page header
- Text
  - Original value
  - New value
- Picture
  - Page (one, many, or all)
  - Location (relative: text anchor; absolute: x/y coordinates)
  - File (relative or absolute directory, or URL)
- Brush (drawing)
  - Page (one, many, or all)
  - Location (relative: text anchor; absolute: x/y coordinates)
  - Shape (x/y coordinates for redrawing a line, square, circle, custom shape, etc.)
  - Type (pen, pencil, marker, etc.)
  - Weight (1px, 3px, 5px, etc.)
  - Color
- Annotation
  - Page
  - Location (relative: text anchor; absolute: x/y coordinates)
  - Shape (line, square, circle, custom, etc.)
  - Value (annotation text)
- Comment
  - Target (page, another text/image/brush/annotation object, or a parent comment, for threads)
  - Value (comment text)
When
- Date
- Time
Who
- User
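
One possible way to lay this out relationally is sketched below using Python's built-in sqlite3 module. The table and column names are assumptions made for the example, several asset types are collapsed into a single table for brevity, and a real design would likely split them up and add versioning.

```python
import sqlite3

SCHEMA = """
CREATE TABLE document (
    id            INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    common_format TEXT NOT NULL            -- translated document (e.g. JSON)
);
CREATE TABLE asset (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES document(id),
    kind        TEXT NOT NULL CHECK (kind IN
                ('header', 'text', 'picture', 'brush', 'annotation', 'comment')),
    page        INTEGER,                   -- what: where it sits
    x           REAL,
    y           REAL,
    value       TEXT,                      -- what: annotation or comment text
    parent_id   INTEGER REFERENCES asset(id),  -- target of a comment (threads)
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- when
    user_name   TEXT NOT NULL              -- who
);
"""


def open_db(path=":memory:"):
    """Create the tables in a new SQLite database and return the connection."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_asset(db, document_id, kind, user_name,
              page=None, x=None, y=None, value=None, parent_id=None):
    """Insert one asset row and return its id."""
    cur = db.execute(
        "INSERT INTO asset (document_id, kind, page, x, y, value, parent_id, user_name)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (document_id, kind, page, x, y, value, parent_id, user_name),
    )
    return cur.lastrowid


def replies(db, asset_id):
    """Return the text of every comment attached directly to an asset."""
    rows = db.execute(
        "SELECT value FROM asset WHERE parent_id = ? ORDER BY id", (asset_id,))
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = open_db()
    doc = db.execute("INSERT INTO document (title, common_format) VALUES (?, ?)",
                     ("Example", "{}")).lastrowid
    note = add_asset(db, doc, "annotation", "alice", page=5, x=72, y=720,
                     value="Check this paragraph")
    add_asset(db, doc, "comment", "bob", value="Agreed", parent_id=note)
    print(replies(db, note))
```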

Once some, all, or more of the document and its asset data has been persisted, I would design, document, and develop a PHP web service API that exposes CRUD and PDF upload functionality to the user interface while enforcing business rules. At that point, the remaining work is on the client side: I have relational tables that persist both the document and its asset data, as well as an API that exposes sufficient functionality to its consumer, in this case client-side JavaScript.

Now I can design and develop the client application using the latest web technologies, such as HTML5, JavaScript, and CSS3. I can upload and request PDF documents through the web service API and easily render the returned common format in the browser however I decide (probably as HTML in this case). I can then use 100% native JavaScript and/or third-party libraries for DOM manipulation, for creating vector graphics to provide the drawing and annotation features, and for accessing and controlling the functional and stylistic attributes of the currently selected document text and/or image. I can provide a real-time collaborative experience using WebSockets (in which case the web service API mentioned above does not apply), or a slightly delayed but still fairly smooth experience using XMLHttpRequest.

From now on, the sky is the limit, and the ball is in your court!

cweiske · Answer 3 · 2011-09-15T08:23:31+0000

This is a difficult task you are trying to accomplish.

To read text from a PDF, have a look at the code of the PEAR PDF_Reader proposal.

Lars · Answer 4 · 2011-09-15T09:28:54+0000

There is also very extensive documentation on Zend_PDF, which also lets you load and parse a PDF document. The various elements of the PDF can be iterated over and thus converted to HTML5 or whatever you like. You can even embed posts from your site into PDF files and vice versa.

Still, the task you have been given is not an easy one. Good luck.

robermorales · Answer 5 · 2011-09-20T09:02:00+0000

pdftk is a very good tool for this kind of thing (though I don't know whether it can do exactly this).

http://www.pdflabs.com/docs/pdftk-cli-examples/
