Extracting text from a PDF with PdfSharp can be very simple, depending on the type of document and what you want to do with it. If the text is stored in the document as text, not as an image, and you do not care about position or formatting, it is straightforward. This code gets all the text of the first page in the PDF files I'm working with:
using PdfSharp.Pdf.IO; // PdfReader lives in this namespace
var doc = PdfReader.Open(docPath);
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();
doc.Pages.Count gives you the total number of pages, and you access each page through the doc.Pages collection by index. I do not recommend using foreach or LINQ here, as the collection interfaces are not well implemented. The index passed to GetDictionary selects one element of the page's contents; which element holds the text may vary depending on how the document was created. If you don't get the text you are looking for, try looping through all the elements.
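The suggestion to loop through all the elements could be sketched roughly like this. RawPageContent and its Read method are hypothetical names introduced for illustration, not part of PdfSharp; the sketch only reuses the calls from the snippet above plus Elements.Count, and it assumes an element without a stream should simply be skipped.

```csharp
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class RawPageContent
{
    // Hypothetical helper: concatenates the raw content of every
    // element in the page's Contents, instead of only element 0.
    public static string Read(PdfPage page)
    {
        var sb = new StringBuilder();
        int count = page.Contents.Elements.Count;
        for (int i = 0; i < count; i++)
        {
            PdfDictionary element = page.Contents.Elements.GetDictionary(i);
            // Assumption: skip anything that is not a dictionary with a stream.
            if (element?.Stream != null)
                sb.AppendLine(element.Stream.ToString());
        }
        return sb.ToString();
    }

    // Reads every page by index, avoiding foreach and LINQ on doc.Pages.
    public static string ReadAllPages(string docPath)
    {
        PdfDocument doc = PdfReader.Open(docPath);
        var sb = new StringBuilder();
        for (int p = 0; p < doc.Pages.Count; p++)
            sb.AppendLine(Read(doc.Pages[p]));
        return sb.ToString();
    }
}
```

Calling RawPageContent.ReadAllPages(docPath) should then return the same kind of raw string as the one-liner, only for every element of every page.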
The resulting text will be full of PDF formatting codes. If all you need is string extraction, you can pick out the parts you want with Regex or any other suitable string-searching code. If you need to do something with formatting or positioning, then good luck - from what I can tell, you will need it.
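As a rough illustration of the Regex approach, here is a minimal sketch that pulls the parenthesised literal strings out of the raw content, which is where text shown by the Tj and TJ operators usually sits. ContentStreamText and ExtractLiteralStrings are hypothetical names, and the pattern is a simplification: it ignores hex strings written as <...>, does not handle nested unescaped parentheses, leaves escape sequences untouched, and assumes the font encoding makes the bytes readable as text.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamText
{
    // Matches "(...)" where the body is either an escaped character
    // (backslash + any char) or any char other than a backslash or parenthesis.
    private static readonly Regex LiteralString =
        new Regex(@"\((?:\\.|[^\\()])*\)", RegexOptions.Singleline);

    // Returns the contents of each literal string, without the outer parentheses.
    public static List<string> ExtractLiteralStrings(string content)
    {
        var result = new List<string>();
        foreach (Match m in LiteralString.Matches(content))
            result.Add(m.Value.Substring(1, m.Value.Length - 2));
        return result;
    }
}
```

For example, passing "BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET" yields the three strings Hello, Wor and ld. Joining the pieces back into words is left to the caller, since the numbers between them in a TJ array are spacing adjustments.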
Mason