How to convert PDF to clean HTML?

Is there a website or piece of software that can cleanly convert a PDF document into an HTML document without producing a lot of gibberish HTML?

+4
2 answers

The problem is that PDF is a layout language, not a semantic language the way HTML is.

This means that when converting to HTML with any hope of the result remaining readable for the end user, you must force HTML to do layout by positioning individual words (and sometimes individual letters). The semantic structure is often distorted or lost along the way, hence the gibberish.
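To make the trade-off concrete, here is a minimal Python sketch. The word boxes in `WORDS` are made up for illustration (they are not the output of any particular converter), and `layout_html` and `semantic_html` are hypothetical helper names. The first function keeps the layout by positioning every word, which is roughly the kind of markup that reads as gibberish; the second keeps the text readable but throws the layout away.

```python
from html import escape

# Hypothetical word boxes as a PDF extractor might report them:
# (text, x, y) in points, measured from the top-left corner of the page.
WORDS = [
    ("Hello,", 72.0, 90.0),
    ("world.", 108.5, 90.0),
    ("Second", 72.0, 104.0),
    ("line.", 112.3, 104.0),
]

def layout_html(words):
    """Emit one absolutely positioned <span> per word (keeps the layout)."""
    spans = [
        f'<span style="position:absolute;left:{x}pt;top:{y}pt">{escape(text)}</span>'
        for text, x, y in words
    ]
    return "<div>\n" + "\n".join(spans) + "\n</div>"

def semantic_html(words):
    """Emit a single paragraph and discard the positions (loses the layout)."""
    return "<p>" + escape(" ".join(text for text, _, _ in words)) + "</p>"

print(layout_html(WORDS))
print(semantic_html(WORDS))
```

Four words already need four styled spans in the layout-preserving version; a full page needs hundreds.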

You can get a feel for the problem by opening almost any PDF file that represents a text document in a text editor and trying to find, by eye, the words or paragraphs of the text in its raw source.

Compare this with an HTML document, which is often readable directly from its source.
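A rough illustration of that difference, as a Python sketch. `PDF_FRAGMENT` is hand-written in the style of a PDF content stream and is not taken from a real file (in real files these streams are usually compressed as well). `Td` moves the text position and `Tj` draws a string. The helper names are hypothetical, and the regular expressions only handle this simplified input.

```python
import re

# Hand-written fragment in the style of a PDF content stream.
# Note how one sentence is split into separately positioned pieces.
PDF_FRAGMENT = """BT
/F1 12 Tf
72 700 Td (Conv) Tj
24.1 0 Td (erting PDF) Tj
-24.1 -14 Td (to HTML) Tj
ET"""

HTML_FRAGMENT = "<p>Converting PDF to HTML</p>"

def pdf_strings(stream):
    """Return the literal strings drawn by Tj operators, in stream order."""
    return re.findall(r"\(([^)]*)\)\s*Tj", stream)

def html_text(markup):
    """Strip the tags from a trivial HTML fragment."""
    return re.sub(r"<[^>]+>", "", markup)

print(pdf_strings(PDF_FRAGMENT))
print(html_text(HTML_FRAGMENT))
```

The PDF side yields disconnected pieces that break in the middle of a word, with nothing marking them as one paragraph, while the HTML side carries both the full sentence and its role.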

+3

Gibberish HTML is usually caused by the PDF file itself, not by the software used to convert it. You can use any number of packages to convert PDF to HTML. Some options include PDFMiner, pdftohtml, and I believe pdftk. Whether or not you end up with gibberish HTML is not so clear-cut.
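As one possible starting point, here is a minimal sketch using pdfminer.six, the maintained fork of PDFMiner. It assumes the package is installed (`pip install pdfminer.six`); `pdf_to_html` is a hypothetical wrapper name. The quality of the output still depends on how the PDF itself was produced.

```python
from io import StringIO

def pdf_to_html(pdf_path):
    """Convert a PDF file to an HTML string with pdfminer.six."""
    # Imported lazily so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    out = StringIO()
    with open(pdf_path, "rb") as pdf_file:
        # codec=None keeps the output as str so it can be written to a StringIO.
        extract_text_to_fp(pdf_file, out, laparams=LAParams(),
                           output_type="html", codec=None)
    return out.getvalue()
```

Calling `pdf_to_html("input.pdf")` (the file name is a placeholder) returns the markup as a string, so you can inspect how heavily positioned it is before deciding whether it is clean enough.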

0

Source: https://habr.com/ru/post/1433460/

