How to convert PDF to HTML?

What good libraries exist in any common language for converting PDF to HTML?

+25
html pdf pdf-scraping
Oct 28 '09 at 17:52
8 answers

Apache PDFBox can extract the contents of a PDF as HTML. http://pdfbox.apache.org/

+5
Nov 23 '09 at 17:47

If you are working on Windows, I think Amyuni has a library for this. Their PDF Document Converter is available as a DLL, can be used from the languages supported by Visual Studio, and can convert to RTF, HTML, Excel, JPEG and TIFF.

+3
Oct 29 '09 at 19:01

http://www.lowagie.com/iText/ An open source library for Java and C#.

+1
Oct 30 '09 at 4:26

In Perl, you can use the SWISH::Filter plugin SWISH::Filters::Pdf2HTML. (Requires the xpdf package.)

For the opposite (HTML to PDF), see this question.

0
28 Oct '09 at 18:07

If you are looking for a way to convert PDF to HTML once or twice, I recommend Adobe Online Conversion.

If it is an API you are after, http://www.pdfonline.com/ has an SDK that should suit your needs.

If it is a library you are after, let us know which server language you prefer.

0
Oct 28 '09 at 18:22

Given the vagueness of the original question, I'm going to go ahead and give a solution that will work in any language that can run applications from the command line. Although it can be a little tricky to configure, OpenOffice can be run headless on a server and, using jodconverter, can convert any file format to any other file format (well, any conversion that OpenOffice itself can handle, that is).

Here are a few links that help with configuration:

0
Oct 30 '09 at 2:04

The pdftohtml program converts PDF to HTML and XML and preserves text position information, which is useful for scraping tables.

It seems to be based on the xpdf library and has a Windows binary.
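As a rough sketch of how that position information can help with tables: the Python below groups text fragments into rows by their vertical position and orders the cells in each row from left to right. The XML string is a hand-written sample in the shape that `pdftohtml -xml` output is assumed to have (`page` elements containing `text` elements with `top` and `left` attributes). It is not real output, and `rows_from_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample, assumed to resemble `pdftohtml -xml` output.
# Real files contain more attributes and additional elements.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="30" height="12" font="0">Qty</text>
<text top="120" left="200" width="10" height="12" font="0">4</text>
<text top="120" left="50" width="45" height="12" font="0"><b>Bolt</b></text>
</page>
</pdf2xml>"""


def rows_from_xml(xml_text):
    """Group text fragments into rows keyed by (page, top), cells sorted by left."""
    root = ET.fromstring(xml_text)
    rows = defaultdict(list)
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for text in page.iter("text"):
            top = int(text.get("top"))
            left = int(text.get("left"))
            # itertext() also collects text nested in inline tags such as <b>.
            content = "".join(text.itertext()).strip()
            rows[(page_no, top)].append((left, content))
    # Rows in page/top order; within a row, cells in left-to-right order.
    return [[cell for _, cell in sorted(cells)]
            for _, cells in sorted(rows.items())]


if __name__ == "__main__":
    for row in rows_from_xml(SAMPLE_XML):
        print(row)
```

Grouping on the exact `top` value is the simplest possible rule; real documents often need a small tolerance, because fragments on the same visual line can differ by a pixel or two.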

0
Oct 04 '10 at 7:56

On Linux, install pdftohtml. To batch convert all files in a folder, use:

for f in *.pdf; do pdftohtml "$f"; done

This creates an HTML site with all the links and images from the original documents, with each page in a separate HTML file. It is very useful for converting project documentation so that you can find files by phrase with the system's ordinary file search.
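If you would rather drive the batch conversion from a script, a minimal Python sketch might look like the following. It assumes `pdftohtml` is on the PATH and is called with just the input file, as above. The function names `build_commands` and `convert_all` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(folder):
    """Return one pdftohtml argument list per PDF in folder, sorted by name."""
    return [["pdftohtml", str(pdf)]
            for pdf in sorted(Path(folder).glob("*.pdf"))]


def convert_all(folder):
    """Run pdftohtml on every PDF in folder, stopping at the first failure."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml was not found on PATH")
    for argv in build_commands(folder):
        # Passing a list (no shell) keeps file names with spaces intact.
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    convert_all(".")
```

Keeping the command construction separate from the execution makes it easy to print and inspect the commands before running anything.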

0
Apr 10 '14 at 4:51


