C# PDFsharp: examples of how to strip text from a PDF?

I have a fairly simple task: I need to read a PDF file and write out its image content while ignoring its text content. In essence, I need the complement of "save as text."

Ideally, I would prefer to avoid any recompression of the image content, but if that is not possible, that is acceptable too.

Are there examples of how to do this?

Thanks!

+6
3 answers

Extracting text from a PDF file using PDFsharp is not an easy task.

This topic has been discussed recently: fooobar.com/questions/267494 / ...

+4

Extracting text from a PDF using PDFsharp can be very simple, depending on the type of document and what you intend to do with it. If the text is stored in the document as text, not as an image, and you do not care about position or formatting, then it is quite simple. This code gets all the text of the first page in the PDF files I'm working with:

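    using PdfSharp.Pdf.IO;

    // Open the document and read the first content stream of the first page as a raw string.
    var doc = PdfReader.Open(docPath);
    string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();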

doc.Pages.Count gives you the total number of pages, and you can access each page by index through doc.Pages. I do not recommend using foreach or LINQ here, as the collection interfaces are not well implemented. The index passed to GetDictionary selects one element of the page's content array; how many elements there are may vary depending on how the document was created. If you don’t get the text you are looking for, try looping through all the elements.
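A minimal sketch of that loop, using plain index-based for loops over the pages and over each page's content elements. The class and method names (RawPageText, ReadAllContentStreams) are made up for illustration, and the null checks are an assumption about what may come back for unusual documents:

```csharp
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class RawPageText
{
    // Concatenates the raw content streams of every page, using index loops only.
    public static string ReadAllContentStreams(string docPath)
    {
        var sb = new StringBuilder();
        using (PdfDocument doc = PdfReader.Open(docPath))
        {
            for (int p = 0; p < doc.Pages.Count; p++)
            {
                var elements = doc.Pages[p].Contents.Elements;
                for (int i = 0; i < elements.Count; i++)
                {
                    // Skip elements that are not stream dictionaries.
                    PdfDictionary dict = elements.GetDictionary(i);
                    if (dict == null || dict.Stream == null) continue;
                    sb.AppendLine(dict.Stream.ToString());
                }
            }
        }
        return sb.ToString();
    }
}
```

The result is still raw content-stream text, operators included, so it needs further filtering before it is readable.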

The resulting text will be full of PDF formatting operators. If all you need is the strings, you can pick out the ones you want with Regex or any other suitable string-search code. If you need to do something with formatting or positioning, then good luck; from what I can tell, you will need it.
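As a rough illustration of the Regex approach, the sketch below pulls literal strings out of the Tj and TJ text-showing operators in a content-stream string. The names ContentStreamText and ExtractStrings are hypothetical. It assumes the text is stored as parenthesized literal strings in a simple encoding; hex strings, nested unescaped parentheses, the ' and " operators, and fonts with custom glyph encodings are not handled:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class ContentStreamText
{
    // A literal PDF string: "(", then escaped chars or non-parenthesis chars, then ")".
    const string Literal = @"\((?:\\.|[^\\()])*\)";

    // Matches either "(text) Tj" or "[(te) -20 (xt)] TJ".
    static readonly Regex ShowOp = new Regex(
        @"(?<one>" + Literal + @")\s*Tj|\[(?<many>(?:" + Literal + @"|[^\]()])*)\]\s*TJ",
        RegexOptions.Singleline);

    static readonly Regex LiteralRx = new Regex(Literal, RegexOptions.Singleline);

    public static List<string> ExtractStrings(string content)
    {
        var result = new List<string>();
        foreach (Match m in ShowOp.Matches(content))
        {
            if (m.Groups["one"].Success)
            {
                result.Add(Unwrap(m.Groups["one"].Value));
            }
            else
            {
                // Join the string pieces of a TJ array; the numbers between them adjust spacing.
                var sb = new StringBuilder();
                foreach (Match part in LiteralRx.Matches(m.Groups["many"].Value))
                    sb.Append(Unwrap(part.Value));
                result.Add(sb.ToString());
            }
        }
        return result;
    }

    // Strips the outer parentheses and resolves only the \( \) \\ escapes.
    static string Unwrap(string literal)
    {
        string inner = literal.Substring(1, literal.Length - 2);
        return Regex.Replace(inner, @"\\([()\\])", "$1");
    }
}
```

It could be called as ContentStreamText.ExtractStrings(pageText), with pageText being the string read from the page's content stream.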

+1

An example that uses the PDFsharp library to extract images from a .pdf file:

link

library

EDIT:

Then, if you want to extract text from an image, you need to use an OCR library.

There are two good OCR libraries: tessnet and MODI.
Stack Overflow reference
I can fully recommend MODI, which I am using now. Some examples are on CodeProject.

EDIT 2:

If you do not want to read the text from the extracted images, you should write a new PDF document and put everything in it. For writing PDF files, I use MigraDoc. The library is easy to use.

0

Source: https://habr.com/ru/post/910116/

