I maintain a database of HTML-formatted articles. Unfortunately, the editors who wrote the articles did not know proper HTML, so they often wrote things like:
<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>
I tried using HTML::TreeBuilder to parse this HTML, but after parsing it and dumping the resulting tree, everything inside <div class="highlight">...</div> had disappeared. All I have left is <div class="highlight"></div>.
The editors also often did things like:
<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>
Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.
Any ideas on how to handle this broken HTML? Is it even possible to parse it without losing the content?