Extract data from nested tables in a PDF

  • I have several PDF files that were created from Word or Excel files.

  • I need to get the information contained in the tables.

  • The text in the documents is real text, not an image, so I can extract it using tools such as PDFBox.

  • Once I have the text, I have no way of knowing which table cell it belongs to, because I do not know where the table borders are.

  • I've tried several desktop tools such as ABBYY or Solid PDF Converter. They convert the files into nicely formatted text documents, but that does not suit my needs, because I need to do this programmatically in C#.

  • Some of the tables contain nested tables, which I think makes this a bit more complicated.

I would appreciate your help.

+3
1 answer

The difficulty here is that the text in a PDF is not contained in any table structure. It may look that way on the page, but under the surface it is not.
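To make that concrete, here is a minimal sketch of one common heuristic: treat the extracted text as loose fragments with page coordinates and guess the rows and columns from the positions alone. It is an illustration under stated assumptions, not a method the answer prescribes. The `Fragment` type, the function names and the tolerance values are all hypothetical, the sample assumes `y` grows downward, and it assumes your extraction library can report a position for each piece of text. It is written in Python for brevity, but the logic is plain enough to port to C#.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of extracted text plus its position (hypothetical type)."""
    text: str
    x: float  # left edge of the fragment, in page units
    y: float  # vertical position of the text line, in page units


def group_into_rows(fragments, y_tolerance=2.0):
    """Group fragments whose y positions are close together into rows."""
    rows = []
    for frag in sorted(fragments, key=lambda f: f.y):
        if rows and abs(rows[-1][-1].y - frag.y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order the fragments left to right.
    return [sorted(row, key=lambda f: f.x) for row in rows]


def column_starts(fragments, x_tolerance=5.0):
    """Guess column left edges by clustering the fragments' x positions."""
    starts = []
    for x in sorted(f.x for f in fragments):
        if not starts or x - starts[-1] > x_tolerance:
            starts.append(x)
    return starts


def to_grid(fragments, y_tolerance=2.0, x_tolerance=5.0):
    """Build a list of rows, each a list of cell strings ('' if empty)."""
    starts = column_starts(fragments, x_tolerance)
    grid = []
    for row in group_into_rows(fragments, y_tolerance):
        cells = [""] * len(starts)
        for frag in row:
            # Assign the fragment to the nearest guessed column edge.
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - frag.x))
            cells[col] = (cells[col] + " " + frag.text).strip()
        grid.append(cells)
    return grid


# Made-up sample data: a two-column table with one empty cell.
sample = [
    Fragment("Name", 50.0, 100.0),
    Fragment("Qty", 200.0, 100.5),
    Fragment("Apple", 50.3, 120.0),
    Fragment("3", 201.0, 120.0),
    Fragment("Pear", 50.0, 140.0),
]

for row in to_grid(sample):
    print(row)
```

A position-only heuristic like this breaks down on merged cells, right-aligned or centered columns, and nested tables, which is why none of the options is fully satisfactory.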

There are several options I can think of, but none of them will be as satisfactory as you would like.

  • Several companies offer SDKs for converting PDF to Excel or Word; Investintech and Iceni are two examples. These solutions are not free, however.

+1

Source: https://habr.com/ru/post/1760684/

