As Theodore said, you can extract text from a PDF, and as Chris pointed out,
this only works as long as the content is actually text (not outlines or bitmaps).
The best thing to do is to buy Bruno Lowagie's book, iText in Action. In the second edition, chapter 15 covers text extraction.
You can also look at the examples on his site: http://itextpdf.com/examples/iia.php?id=279
One of them parses a PDF and writes the result to a simple text file. Here is the sample code:
    package part4.chapter15;

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintWriter;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
    import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

    public class ExtractPageContent {

        public static final String PREFACE = "resources/pdfs/preface.pdf";
        public static final String RESULT = "results/part4/chapter15/preface.txt";

        // Extracts the text of every page of the PDF and writes it to a text file.
        public void parsePdf(String pdf, String txt) throws IOException {
            PdfReader reader = new PdfReader(pdf);
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            PrintWriter out = new PrintWriter(new FileOutputStream(txt));
            TextExtractionStrategy strategy;
            // Page numbers in iText start at 1.
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
                out.println(strategy.getResultantText());
            }
            reader.close();
            out.flush();
            out.close();
        }

        public static void main(String[] args) throws IOException {
            new ExtractPageContent().parsePdf(PREFACE, RESULT);
        }
    }
Pay attention to the license.
This example only works with the AGPL version of iText.
If you look at the other examples on that site, they show how to leave out parts of the text or how to extract only parts of the PDF.
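As a rough idea of what extracting only part of a page can look like, here is a minimal sketch that uses the region filter from the iText 5 parser package. It is not taken from this answer: the class name ExtractRegionSketch and the helper extractArea are made up for illustration, and the rectangle coordinates are arbitrary and would have to be adapted to your own document.

```java
import java.io.IOException;

import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractRegionSketch {

    // Hypothetical helper: collects the text that falls inside `area` on every page.
    public static String extractArea(PdfReader reader, Rectangle area) throws IOException {
        // Only text chunks located inside the rectangle pass this filter.
        RenderFilter filter = new RegionTextRenderFilter(area);
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            // Use a fresh strategy per page so earlier pages are not repeated.
            TextExtractionStrategy strategy =
                    new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            text.append(PdfTextExtractor.getTextFromPage(reader, i, strategy)).append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: ExtractRegionSketch <file.pdf>");
            return;
        }
        PdfReader reader = new PdfReader(args[0]);
        try {
            // Arbitrary example area in PDF user units: (llx, lly, urx, ury).
            Rectangle area = new Rectangle(70, 80, 490, 580);
            System.out.print(extractArea(reader, area));
        } finally {
            reader.close();
        }
    }
}
```

Text outside the rectangle (headers, footers, page numbers) is simply skipped, which is one way to "leave out" parts of a page. The same AGPL license remark applies to this sketch.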
Hope this helps.