How to convert PDF to HTML?

Question

What good libraries exist in any common language for converting PDF to HTML?

+25

html pdf pdf-scraping

user178644 Oct 28 '09 at 17:52

8 answers

John Thorhauer · Answer 1 · 2009-11-23 17:47

Apache PDFBox can extract HTML from a PDF. http://pdfbox.apache.org/
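
PDFBox also ships a standalone command-line app, so a quick trial does not require writing any Java. A minimal sketch, assuming the PDFBox 2.x app jar and its `ExtractText` tool with the `-html` option. The `PDFBOX_JAR` path and the `pdfbox_to_html` helper name are placeholders, and other releases may name the subcommand differently, so check the usage output of your version.

```shell
# Hypothetical location of the standalone PDFBox app jar; adjust to your download.
PDFBOX_JAR=${PDFBOX_JAR:-pdfbox-app.jar}

# Convert one PDF to HTML with PDFBox's ExtractText tool (2.x-style CLI).
# Usage: pdfbox_to_html input.pdf [output.html]
pdfbox_to_html() {
  pdfbox_src=$1
  # Default output name: same path with .pdf replaced by .html
  pdfbox_dst=${2:-${pdfbox_src%.pdf}.html}
  java -jar "$PDFBOX_JAR" ExtractText -html "$pdfbox_src" "$pdfbox_dst"
}
```

For example, `pdfbox_to_html manual.pdf` would write `manual.html` next to the input file.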

William Daniel · Answer 2 · 2009-10-29 19:01

If you are working in a Windows environment, I think Amyuni has a library for this. Their PDF Document Converter is available as a DLL, can be used from most of the languages supported by Visual Studio, and can convert to RTF, HTML, Excel, JPEG and TIFF.

AZ_ · Answer 3 · 2009-10-30 04:26

http://www.lowagie.com/iText/ An open source library for Java and C#.

Ether · Answer 4 · 2009-10-28 18:07

In Perl, you can use the SWISH::Filter plugin SWISH::Filters::Pdf2HTML. (Requires the xpdf package.)

For the opposite (HTML to PDF) see this question.

Russ Bradberry · Answer 5 · 2009-10-28 18:22

If you only need to convert PDF to HTML once or twice, I recommend Adobe Online Conversion.

If it's an API you are after, http://www.pdfonline.com/ has an SDK that should suit your needs.

If it's a library you are after, let us know which server language you prefer.

Karim · Answer 6 · 2009-10-30 02:04

Given the vagueness of the original question, I'm going to give a solution that works in any language that can run applications from the command line. Although it can be a little tricky to configure, OpenOffice can be run headless on a server and, using jodconverter, can convert any file format to any other file format (well, any format conversion that OpenOffice can handle, that is).

Here are a few links that help with customization:

Karsten W. · Answer 7

The pdftohtml program converts PDF to HTML and XML and preserves text position information, which is useful for scraping tables.

It seems to be based on the xpdf library and has a Windows binary.
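
For table scraping, the XML output is the interesting part, since each text fragment carries its position on the page. A minimal sketch, assuming poppler's `pdftohtml` is on the PATH; `pdf_to_xml` is a hypothetical helper name, and the `-i` flag (skip images) is optional.

```shell
# Write <name>.xml next to <name>.pdf, with position data on each text fragment.
# Usage: pdf_to_xml input.pdf
pdf_to_xml() {
  xml_src=$1
  xml_base=${xml_src%.pdf}   # output base name, without extension
  # -xml: emit XML instead of HTML; -i: ignore images
  pdftohtml -xml -i "$xml_src" "$xml_base"
}
```

In the resulting file, each `text` element has `top`, `left`, `width` and `height` attributes, so fragments that share a `top` value can be grouped into table rows.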

Zon · Answer 8

On Linux, install pdftohtml. To batch-convert all files in a folder, use:

for f in *.pdf; do pdftohtml "$f"; done

This creates an HTML site with all the links and images from the original documents, with each page in a separate HTML file. It is very useful for converting project documentation so that you can find files by phrase with the ordinary system file search.
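
Because each document produces several files, it can help to give every PDF its own output folder. A minimal sketch, assuming poppler's `pdftohtml`, which accepts an output base name as its second argument; `convert_all` and the default `html` output directory are hypothetical names chosen for illustration.

```shell
# Convert every PDF in a directory, giving each document its own output folder.
# Usage: convert_all [source_dir] [output_dir]
convert_all() {
  src_dir=${1:-.}
  out_dir=${2:-html}
  for f in "$src_dir"/*.pdf; do
    [ -e "$f" ] || continue            # skip when the glob matched nothing
    name=$(basename "$f" .pdf)
    mkdir -p "$out_dir/$name"
    # The second argument is the base name for the generated files.
    pdftohtml "$f" "$out_dir/$name/$name"
  done
}
```

For example, `convert_all docs site` would place the pages for `docs/guide.pdf` under `site/guide/`. Quoting `"$f"` keeps file names with spaces intact.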
