Extract columns of text from a pdf file using iText

Question

I need to extract text from PDF files using iText.

The problem is that some PDF files contain two columns, and when I extract the text, the columns are merged in the resulting text file (i.e. the text from both columns ends up on the same line).

This is the code:

public class pdf
{
    private static String INPUTFILE = "http://www.revuemedecinetropicale.com/TAP_519-522_-_AO_07151GT_Rasoamananjara__ao.pdf" ;
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException, IOException {
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
        document.open();

        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();

        PdfImportedPage page;

        // Go through all pages
        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }

        document.close();

        PdfReader readerN = new PdfReader(OUTPUTFILE);
        for (int i = 1; i <= n; i++) {
            String myLine = PdfTextExtractor.getTextFromPage(readerN,i);
            System.out.println(myLine);

            try {
                FileWriter fw = new FileWriter("c:/yo.txt", true);
                fw.write(myLine);
                fw.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
    }
}

Could you help me with this task?

+3

java pdf itext text-extraction

Rim Oct 26 '10 at 21:37

6 answers

PdfBox, PDF - .

+1

mark stephens Oct 27 '10 at 7:28

, . PDF. , , .

/**
 * Get plain text from a specific page in a PDF file.
 * @param pdfPath path to the PDF file
 * @param pageNumber 1-based number of the page to extract
 * @return the extracted text of that page
 * @throws IOException if the file cannot be read
 */
public static String getPageContent(String pdfPath, int pageNumber) throws IOException
{
    PdfReader reader = new PdfReader(pdfPath);

    StringWriter output = new StringWriter();

    try {
        output.append(PdfTextExtractor.getTextFromPage(reader, pageNumber, new SimpleTextExtractionStrategy()));
    } catch (OutOfMemoryError e) {
        e.printStackTrace();
    } finally {
        // Release the file handle held by the reader
        reader.close();
    }

    return output.toString();
}

, 1 , . , , ( ). , . . , , . .

public static String getPageContent(String pdfPath, int pageNumber) throws IOException
{
    PDDocument pdDoc = PDDocument.load(pdfPath);
    PDPage specPage = (PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0);

    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);
    float width = (specPage.getMediaBox().getHeight()) * 25.4f;
    float height = (specPage.getMediaBox().getWidth()) * 25.4f;
    Rectangle rect = new Rectangle(0, 0, Math.round(width), Math.round(height));
    stripper.addRegion("class1", rect);
    List allPages = pdDoc.getDocumentCatalog().getAllPages();
    PDPage firstPage = (PDPage) allPages.get(pageNumber - 1);
    stripper.extractRegions(firstPage);

    return stripper.getTextForRegion("class1");
}

+1

PhDeveloper 13 . '14 17:19

PDFTextStream - ! , . iText . .

api , . . . ( iText).

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class PDFText {
    public static void main(String[] args) throws java.io.IOException {
        String pdfFilePath = "xyz.pdf";

        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
   }
}

stackoverflow!

+1

Darpan27 06 . '16 16:39

, , . , , . .

? OCR , .

0

Andrew Cash Oct 27 '10 at 5:30

Tables do not exist as structures in a PDF unless the file uses structured content. Do you understand what a PDF file is? I wrote a blog article explaining text extraction issues at http://www.jpedal.org/PDFblog/?p=228

0

mark stephens Oct 27 '10 at 7:27

Kevin Day · Accepted Answer · 2010-10-27T07:04:25+0000

I am the author of the iText text extraction subsystem. What you need to do is develop your own text extraction strategy (if you look at how PdfTextExtractor.getTextFromPage is implemented, you will see that you can plug in your own strategy).
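To make the idea of a custom strategy concrete, here is a minimal sketch in plain Java, with no iText dependency, of what a column-aware strategy might do once it has positioned text. The `Chunk` class, the `extractRegion` method and the column boundary at x = 300 are all assumptions made up for illustration; they are not part of the iText API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnRegionSketch {

    // Hypothetical stand-in for a positioned piece of text; not an iText class.
    static final class Chunk {
        final String text;
        final float x; // left edge in PDF user space
        final float y; // baseline in PDF user space (origin at bottom-left)

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the text of all chunks whose left edge lies in [minX, maxX), read top to bottom. */
    static String extractRegion(List<Chunk> chunks, float minX, float maxX) {
        List<Chunk> inside = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.x >= minX && c.x < maxX) {
                inside.add(c);
            }
        }
        // A larger y is higher on the page: sort by y descending, then x ascending.
        inside.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed()
                .thenComparingDouble(c -> c.x));

        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : inside) {
            if (prev != null) {
                // Chunks on (nearly) the same baseline belong to the same line.
                sb.append(Math.abs(c.y - prev.y) < 0.5f ? ' ' : '\n');
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> page = new ArrayList<>();
        // Two lines in each of two columns, in the interleaved order a naive extractor sees.
        page.add(new Chunk("Left", 50, 700));
        page.add(new Chunk("Right", 320, 700));
        page.add(new Chunk("one", 80, 700));
        page.add(new Chunk("one", 360, 700));
        page.add(new Chunk("Left two", 50, 685));
        page.add(new Chunk("Right two", 320, 685));

        // Assumed column boundary at x = 300 on a 600-unit-wide page.
        System.out.println(extractRegion(page, 0, 300));
        System.out.println(extractRegion(page, 300, 600));
    }
}
```

Extracting the left region first and the right region second gives the text in reading order instead of merged lines. In iText 5 itself the region filtering is, as far as I know, done by wrapping a strategy in a FilteredTextRenderListener with a RegionTextRenderFilter, so treat the code above as a model of the idea and not as the library's implementation.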

, , - . PDF ( , - , , ). , , text render listener ( iText, iText In Action ).

, - ( - , ). , :

, (LocationAware...), X/Y ( ).
, . , X.
, X ( X). / .
, X Y, .
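One possible way to work from X positions when the column layout is not known in advance, sketched in plain Java: count how many words start at each X, and discard any X that some other word runs across. The `Word` class, the `candidateColumnStarts` method and the `minStarts` threshold are hypothetical names chosen for this sketch, not iText types, and real pages (full-width headings, justified text, rotation) would need more care.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ColumnStartSketch {

    // Hypothetical positioned word (horizontal extent only); not an iText type.
    static final class Word {
        final float startX;
        final float endX;

        Word(float startX, float endX) {
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Returns the rounded X positions where at least minStarts words begin
     * and which no other word crosses, in ascending order.
     */
    static List<Integer> candidateColumnStarts(List<Word> words, int minStarts) {
        // Count how many words start at each X, rounded to the nearest unit.
        Map<Integer, Integer> startsAtX = new TreeMap<>();
        for (Word w : words) {
            startsAtX.merge(Math.round(w.startX), 1, Integer::sum);
        }

        List<Integer> candidates = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : startsAtX.entrySet()) {
            int x = e.getKey();
            if (e.getValue() < minStarts) {
                continue; // too few words start here to suggest a column edge
            }
            boolean crossed = false;
            for (Word w : words) {
                // A word that begins left of x and ends right of it spans the position.
                if (w.startX < x - 0.5f && w.endX > x + 0.5f) {
                    crossed = true;
                    break;
                }
            }
            if (!crossed) {
                candidates.add(x);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<Word> words = new ArrayList<>();
        // Left column: three lines starting at x = 50, plus a mid-line word.
        words.add(new Word(50, 90));
        words.add(new Word(50, 120));
        words.add(new Word(50, 100));
        words.add(new Word(95, 140));
        // Right column: three lines starting at x = 320, plus a mid-line word.
        words.add(new Word(320, 360));
        words.add(new Word(320, 380));
        words.add(new Word(320, 350));
        words.add(new Word(365, 420));

        System.out.println(candidateColumnStarts(words, 3)); // prints [50, 320]
    }
}
```

The candidate X values can then serve as region boundaries for a per-column extraction. Note that a single heading spanning both columns would cross x = 320 and suppress that candidate, so a real implementation would probably tolerate a small number of crossings.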

, , ( , ). iText , .
