As a newbie to PDFBox, I plan to extract data from tables, but tables with special layouts, for example merged column headers, have to be processed using the table borders. Therefore, I need to extract the coordinates of the text and at least the horizontal borders of the table.
To extract the text of a table, I used PDFTextStripper to get a list of TextPosition objects; to extract the horizontal lines from the same page, I used PDFGraphicsStreamEngine to get a list of GeneralPath objects, and inside each stored GeneralPath object there is a corresponding Rectangle2D object representing the line (height = 0). But it seems that the coordinates of the TextPosition objects and the coordinates of the GeneralPath objects are not in the same quadrant: their Y axes run in different directions.
According to my investigation, the origin of the TextPosition coordinates is the upper left corner, while the origin of the Rectangle2D coordinates is the lower left corner, and their Y axes point in opposite directions.
First, I would like to confirm that my investigation is correct. If so, I'd like some hints on how to bring the Rectangle2D and TextPosition coordinates into the same quadrant.
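To make the question concrete, this is roughly the conversion I have in mind, assuming my understanding of the two origins is right. It is only a sketch: the class name `CoordFlip`, the method `toTopLeft` and the page height of 842 (A4 in points) are made up for illustration. In PDFBox I assume the real height would come from the page itself, for example its crop box, and this ignores page rotation and crop box offsets.

```java
import java.awt.geom.Rectangle2D;

class CoordFlip {

    /**
     * Converts a rectangle from a bottom-left origin (Y grows upward)
     * to a top-left origin (Y grows downward).
     * pageHeight is assumed to be in the same units as the rectangle.
     */
    static Rectangle2D toTopLeft(Rectangle2D r, double pageHeight) {
        // The top edge in the Y-up system is getMaxY(); its distance
        // from the top of the page becomes the new Y.
        double newY = pageHeight - r.getMaxY();
        return new Rectangle2D.Double(r.getX(), newY, r.getWidth(), r.getHeight());
    }

    public static void main(String[] args) {
        // A horizontal line (height = 0) at y = 700 from the bottom of the page.
        Rectangle2D line = new Rectangle2D.Double(50, 700, 200, 0);
        Rectangle2D flipped = toTopLeft(line, 842);
        System.out.println(flipped.getY()); // prints 142.0
    }
}
```

After such a flip I would expect to be able to compare the line's Y value directly with the Y values of the TextPosition objects, but I am not sure this is all that is needed.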
Thanks in advance