Identify and extract a table from a PDF using Java

I have different types of PDFs that contain text, tables, and other content. A table can appear anywhere on a page (top, middle, bottom). I want to extract only the table data (the number of columns, the number of rows, and the cell contents) from these PDFs using Java, without passing in the table's location.

What I have tried so far:

1. I used the iText API to read the file and extract its content, with this call:

PdfTextExtractor.getTextFromPage

but it only returns the data as plain text. It gives me no hint about where a table is in the PDF or how to extract data from it.

2. I also tried the PDFBox API, but it did not solve my problem either.

3. I also followed this link: extracting a PDF table. It does not give me the expected result, because the algorithm needs the line positions and similar details as input.

I cannot determine where the table is in the PDF.

Can someone tell me how to solve this with the iText or PDFBox APIs, or is there an open source API that can help?

Or can we convert the PDF to HTML, so that we can identify the table and read the data from the table tags? ;)

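A PDF is not like an HTML document. An HTML document has structure: a table is marked up as "rows" and "cells". A PDF (in most cases) has no such structure. It only contains drawing instructions, such as "draw this text at this position" or "draw a line from here to there".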

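Things are easier if the PDF is a tagged PDF. Tagged PDFs carry structure information, so a table can be identified from the tags instead of being guessed from the layout.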

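With iText 7, you can implement IEventListener. Its eventOccurred() method is called for every rendering event on the page (text, paths, images, ..). Heuristics on top of those events are still needed to decide which text and lines together form a table.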
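To give an idea of what such a heuristic could look like once positioned text is available (for example, collected in an iText 7 IEventListener), here is a minimal sketch. It uses plain Java only and does not call iText. `TableGridSketch`, `Chunk`, `toGrid` and the two tolerance parameters are hypothetical names made up for this illustration, and the approach assumes left-aligned columns, one text line per row, and no merged cells.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of iText): arranges positioned text into a grid. */
class TableGridSketch {

    /** Hypothetical value type: a piece of text and the x/y where its baseline starts. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;
        Chunk(String text, double x, double y) { this.text = text; this.x = x; this.y = y; }
    }

    /** Sorts the values and opens a new cluster at every gap wider than tol. */
    static List<Double> clusterStarts(List<Double> values, double tol) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Double> starts = new ArrayList<>();
        double last = 0;
        for (double v : sorted) {
            if (starts.isEmpty() || v - last > tol) starts.add(v);
            last = v;
        }
        return starts;
    }

    /** Index of the last cluster whose start is not greater than v. */
    static int indexOf(List<Double> starts, double v) {
        int i = 0;
        while (i + 1 < starts.size() && starts.get(i + 1) <= v) i++;
        return i;
    }

    /** Builds rows from similar y values and columns from similar x values. */
    static List<List<String>> toGrid(List<Chunk> chunks, double yTol, double xTol) {
        List<Double> xs = new ArrayList<>();
        List<Double> ys = new ArrayList<>();
        for (Chunk c : chunks) { xs.add(c.x); ys.add(c.y); }
        List<Double> colStarts = clusterStarts(xs, xTol);
        List<Double> rowStarts = clusterStarts(ys, yTol);

        // Start with an empty cell for every row/column combination.
        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rowStarts.size(); r++) {
            grid.add(new ArrayList<>(Collections.nCopies(colStarts.size(), "")));
        }
        // Visit chunks left to right so text sharing a cell is joined in reading order.
        List<Chunk> leftToRight = new ArrayList<>(chunks);
        leftToRight.sort(Comparator.comparingDouble((Chunk c) -> c.x));
        for (Chunk c : leftToRight) {
            // PDF y grows upwards, so the largest y is the first row.
            int row = rowStarts.size() - 1 - indexOf(rowStarts, c.y);
            int col = indexOf(colStarts, c.x);
            List<String> cells = grid.get(row);
            String old = cells.get(col);
            cells.set(col, old.isEmpty() ? c.text : old + " " + c.text);
        }
        return grid;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
            new Chunk("Name", 50, 700), new Chunk("Qty", 200, 700),
            new Chunk("Bolt", 50, 680.5), new Chunk("12", 200, 680),
            new Chunk("Nut", 50, 660), new Chunk("7", 201, 660));
        for (List<String> row : toGrid(chunks, 2.0, 10.0)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The number of rows is `grid.size()` and the number of columns is the size of any row. This does not tell you whether a group of chunks is a table at all; in practice you would probably also use the ruling lines reported by the listener to find the table's bounding box first, and only feed the chunks inside it to something like this.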

iText also has the pdf2Data add-on, which is meant for extracting data from PDFs.

Tabula is a tool for extracting tables from PDF files. tabula-java is its Java library.

Alternatively, there are PDFBox and Apache Tika.

Source: https://habr.com/ru/post/1533233/

