Extract columns of text from a PDF file using iText

I need to extract text from pdf files using iText.

The problem is that some PDF files contain two columns, and when I extract the text, the columns end up merged in the output file (i.e. text from both columns appears on the same line).

This is the code:

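// Imports assume iText 5.x package names.
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class pdf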
{
    private static String INPUTFILE = "http://www.revuemedecinetropicale.com/TAP_519-522_-_AO_07151GT_Rasoamananjara__ao.pdf" ;
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException, IOException {
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
        document.open();

        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();

        PdfImportedPage page;

        // Go through all pages
        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }

        document.close();

        PdfReader readerN = new PdfReader(OUTPUTFILE);
        for (int i = 1; i <= n; i++) {
            String myLine = PdfTextExtractor.getTextFromPage(readerN,i);
            System.out.println(myLine);

            try {
                // Append this page's text to the output file.
                FileWriter fw = new FileWriter("c:/yo.txt", true);
                fw.write(myLine);
                fw.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
    }
}

Could you help me with this task?

+3
6 answers

I am the author of the iText text extraction subsystem. What you need to do is develop your own text extraction strategy (if you look at how PdfTextExtractor.getTextFromPage is implemented, you will see that you can plug in your own strategy).

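How you determine where columns start and stop is really up to you, and it is a hard problem. PDF has no concept of columns (it does not even have a concept of words; just assembling the text that the default strategy returns is quite tricky). If you know in advance where the columns are, you can use a region filter on the text render listener callbacks (there is code for this in the iText library, and the latest edition of the iText In Action book gives a detailed example).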
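When the column boundaries are known in advance, the region idea comes down to keeping only the text that starts inside a given rectangle. The sketch below shows this in plain Java and does not call iText at all. `Chunk`, `Region` and `textInRegion` are hypothetical names invented for the illustration, and the coordinates are made up. iText 5's parser package has `RegionTextRenderFilter` and `FilteredTextRenderListener` for doing this against real pages.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; these types are hypothetical and not part of iText.
class RegionFilterSketch {

    /** A piece of text and the point where it starts on the page. */
    static final class Chunk {
        final String text;
        final float x, y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** An axis-aligned rectangle in page coordinates (max edges are exclusive). */
    static final class Region {
        final float xMin, yMin, xMax, yMax;

        Region(float xMin, float yMin, float xMax, float yMax) {
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }

        boolean contains(Chunk c) {
            return c.x >= xMin && c.x < xMax && c.y >= yMin && c.y < yMax;
        }
    }

    /** Joins, in input order, the chunks that start inside the region. */
    static String textInRegion(List<Chunk> chunks, Region region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (region.contains(c)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Chunks in the interleaved order a line-by-line extraction would produce.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("left-1", 50, 700), new Chunk("right-1", 320, 700),
                new Chunk("left-2", 50, 680), new Chunk("right-2", 320, 680));
        Region leftColumn = new Region(40, 0, 300, 842);
        Region rightColumn = new Region(300, 0, 560, 842);
        System.out.println(textInRegion(chunks, leftColumn));  // left-1 left-2
        System.out.println(textInRegion(chunks, rightColumn)); // right-1 right-2
    }
}
```

Running the extraction once per region and concatenating the results gives the columns one after the other instead of merged line by line.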

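If you need to find columns in arbitrary data, you have some algorithm work ahead of you (if you get something working, I would love to take a look). Some ideas on how to approach it:

  • Use an algorithm similar to the one in the default text extraction strategy (LocationAware...) to obtain a list of words and their X/Y positions (be sure to account for rotation).
  • For each word, draw an imaginary line running the full height of the page. Scan for all other words that start at the same X.
  • While scanning, also look for words that intersect that X (but do not start at X). These give you candidates for the start/stop Y positions of the column on the page.
  • Once you have the column's X and Y, you can fall back on the region filtering approach.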

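Another approach that could be just as feasible is to analyze the drawing operations and look for long horizontal and vertical lines (assuming the columns are demarcated in a table-like layout). Right now the iText content parser has no callbacks for these operations, but adding them would not be too difficult.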
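The word-scanning heuristic from the list can be sketched in plain Java, independent of iText. This is only a rough illustration under several assumptions. `ColumnSplitter`, `Word`, `findColumnStarts` and `extract` are hypothetical names. Word positions are assumed to have been collected already (for example by a custom render listener). An X position counts as a column start when at least `minStarts` words begin there and no word straddles it, which is stricter than using the straddling words to bound the column's Y range. Lines are grouped by exact baseline equality.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of iText.
class ColumnSplitter {

    /** A word and its horizontal extent; y is the baseline (larger y = higher on the page). */
    static final class Word {
        final String text;
        final float xStart, xEnd, y;

        Word(String text, float xStart, float xEnd, float y) {
            this.text = text;
            this.xStart = xStart;
            this.xEnd = xEnd;
            this.y = y;
        }
    }

    /** Returns sorted X positions where at least minStarts words begin and no word crosses. */
    static List<Float> findColumnStarts(List<Word> words, float tolerance, int minStarts) {
        List<Float> starts = new ArrayList<>();
        for (Word candidate : words) {
            float x = candidate.xStart;
            boolean alreadyAccepted = false;
            for (float known : starts) {
                if (Math.abs(known - x) <= tolerance) {
                    alreadyAccepted = true;
                }
            }
            if (alreadyAccepted) {
                continue;
            }
            int begins = 0;
            boolean crossed = false;
            for (Word w : words) {
                if (Math.abs(w.xStart - x) <= tolerance) {
                    begins++;
                } else if (w.xStart < x && w.xEnd > x) {
                    crossed = true; // a word straddles the imaginary vertical line
                }
            }
            if (begins >= minStarts && !crossed) {
                starts.add(x);
            }
        }
        Collections.sort(starts);
        return starts;
    }

    /** Emits the words column by column, each column top to bottom and left to right. */
    static String extract(List<Word> words, float tolerance, int minStarts) {
        List<Float> starts = findColumnStarts(words, tolerance, minStarts);
        StringBuilder out = new StringBuilder();
        for (int c = 0; c < starts.size(); c++) {
            float left = starts.get(c) - tolerance;
            float right = c + 1 < starts.size() ? starts.get(c + 1) - tolerance : Float.MAX_VALUE;
            List<Word> column = new ArrayList<>();
            for (Word w : words) {
                if (w.xStart >= left && w.xStart < right) {
                    column.add(w);
                }
            }
            column.sort(Comparator.comparingDouble((Word w) -> -w.y)
                    .thenComparingDouble(w -> w.xStart));
            for (int i = 0; i < column.size(); i++) {
                if (i > 0) {
                    // Same baseline means same line; exact comparison is a simplification.
                    out.append(column.get(i).y == column.get(i - 1).y ? ' ' : '\n');
                }
                out.append(column.get(i).text);
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up positions for a page with two columns of two lines each.
        List<Word> words = Arrays.asList(
                new Word("Left", 50, 80, 700), new Word("one", 85, 105, 700),
                new Word("Right", 300, 335, 700), new Word("one", 340, 360, 700),
                new Word("Second", 50, 95, 680), new Word("row", 100, 120, 680),
                new Word("Other", 300, 330, 680), new Word("row", 335, 355, 680));
        System.out.print(extract(words, 1f, 2));
    }
}
```

With the sample data the detected column starts are 50 and 300, and the output lists "Left one" and "Second row" before "Right one" and "Other row". On real pages `minStarts` would need to be large enough to rule out coincidental alignments inside a column, and headings that span both columns would need the Y-range handling that the sketch leaves out.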

+23

PdfBox, PDF - .

+1

, . PDF. , , .

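/**
 * Get plain text from a specific page in a PDF file.
 *
 * @param pdfPath    path to the PDF file
 * @param pageNumber page to extract, starting at 1
 * @return the extracted text, or an empty string if extraction ran out of memory
 * @throws IOException if the file cannot be read
 */
public static String getPageContent(String pdfPath, int pageNumber) throws IOException
{
    PdfReader reader = new PdfReader(pdfPath);
    StringWriter output = new StringWriter();

    try {
        output.append(PdfTextExtractor.getTextFromPage(reader, pageNumber, new SimpleTextExtractionStrategy()));
    } catch (OutOfMemoryError e) {
        // Very large pages can exhaust the heap; log it and return what we have.
        e.printStackTrace();
    } finally {
        // Release the resources held by the reader.
        reader.close();
    }

    return output.toString();
}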

, 1 , . , , ( ). , . . , , . .

public static String getPageContent(String pdfPath, int pageNumber) throws IOException
{
    PDDocument pdDoc = PDDocument.load(pdfPath);
    try {
        // Look up the requested page (pageNumber starts at 1).
        List allPages = pdDoc.getDocumentCatalog().getAllPages();
        PDPage page = (PDPage) allPages.get(pageNumber - 1);

        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);

        // One region for the whole page; the 25.4f factor just oversizes it.
        float width = page.getMediaBox().getWidth() * 25.4f;
        float height = page.getMediaBox().getHeight() * 25.4f;
        Rectangle rect = new Rectangle(0, 0, Math.round(width), Math.round(height));
        stripper.addRegion("class1", rect);
        stripper.extractRegions(page);

        return stripper.getTextForRegion("class1");
    } finally {
        pdDoc.close();
    }
}

+1

PDFTextStream - ! , . iText . .

api , . . . ( iText).

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class PDFText {
    public static void main(String[] args) throws java.io.IOException {
        String pdfFilePath = "xyz.pdf";

        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
    }
}

stackoverflow!

+1

, , . , , . .

? OCR , .

0

Tables do not exist as structures in a PDF unless the file uses structured content. Do you understand what a PDF file is? I wrote a blog article explaining text extraction issues at http://www.jpedal.org/PDFblog/?p=228

0

Source: https://habr.com/ru/post/1771583/

