Saving file offsets when parsing HTML using the DOM?

Question

I want to change the src attributes of <img> tags in not-too-mangled HTML (WordPress posts). I know I could take the easy route and use regular expressions, but I'm afraid that people in blue fluffy suits will come for me in my dreams.

If I use a DOM parser to read the HTML and modify the <img> tags, I'm afraid I won't be able to write the post back exactly as it was (with only my modification), because the DOM parser will probably clean up too much and may delete important data. A SAX parser probably can't handle invalid XML, so that won't work either.

So, is there a middle way, where I can use a DOM parser that knows where each element started in the source, so I can do a string replacement or something similar from there? I know that some nodes in the DOM tree will not exist in the source document (<b>Some <i>bizarre</b> formatting</i> will probably cause that), but does that mean it is always impossible? I see that PHP 5.3 has a DOMNode::getLineNo() method, but I am using 5.2.x.

+3

dom php html-parsing

Jan Fabry Nov 11 '10 at 14:15

1 answer

Pekka 웃 · Answer 1 · 2010-11-11T14:24:41+0000

If PHP's DOM extension cleans up the output too much, you could try the string-based SimpleHTMLDOM.

DOMNode::getLineNo() ...
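For the "middle way" the question asks about, one possible approach is to skip tree building altogether: scan the original string for <img> tags, record the byte offsets of each src value, and replace only those bytes. The following is a minimal sketch of that idea, not something taken from SimpleHTMLDOM or the DOM extension. The names rewrite_img_src and to_cdn and the cdn.example.com host are made up for illustration. It uses a small regular expression, but only to split the attributes of a single tag that has already been isolated, and it passes the src value through raw (entities are not decoded).

```php
<?php
/**
 * Hypothetical helper (not part of any library): rewrite the src value of
 * every <img> tag while leaving every other byte of $html untouched.
 *
 * $rewrite is a callable that takes the raw src value and returns the new one.
 */
function rewrite_img_src($html, $rewrite)
{
    $edits = array();   // each entry: array(offset, length, replacement)
    $len   = strlen($html);
    $pos   = 0;

    while (($start = stripos($html, '<img', $pos)) !== false) {
        $i   = $start + 4;
        $pos = $i;
        // Skip look-alikes such as <imgx>: the tag name must end here.
        if ($i >= $len || strpos(" \t\r\n/>", $html[$i]) === false) {
            continue;
        }
        // Find the closing '>' of the tag, ignoring any '>' inside quotes.
        $quote = null;
        for (; $i < $len; $i++) {
            $c = $html[$i];
            if ($quote !== null) {
                if ($c === $quote) {
                    $quote = null;
                }
            } elseif ($c === '"' || $c === "'") {
                $quote = $c;
            } elseif ($c === '>') {
                break;
            }
        }
        if ($i >= $len) {
            break;      // unterminated tag: leave the rest of the input alone
        }
        $bodyStart = $start + 4;
        $body      = substr($html, $bodyStart, $i - $bodyStart);
        $pos       = $i + 1;

        // Tokenize the attributes in order, capturing byte offsets.
        // Group 1 is the name, group 2 the value (quoted or unquoted).
        preg_match_all(
            '/([^\s=\/>"\']+)(?:\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+)))?/',
            $body,
            $attrs,
            PREG_SET_ORDER | PREG_OFFSET_CAPTURE
        );
        foreach ($attrs as $a) {
            $hasValue = isset($a[2]) && $a[2][1] >= 0;
            if ($hasValue && strtolower($a[1][0]) === 'src') {
                $edits[] = array(
                    $bodyStart + $a[2][1],
                    strlen($a[2][0]),
                    call_user_func($rewrite, $a[2][0]),
                );
            }
        }
    }

    // Apply the edits back to front so earlier offsets stay valid.
    foreach (array_reverse($edits) as $e) {
        $html = substr_replace($html, $e[2], $e[0], $e[1]);
    }
    return $html;
}

// Usage: point root-relative image paths at a hypothetical CDN host.
function to_cdn($src)
{
    return 'https://cdn.example.com/' . ltrim($src, '/');
}

$result = rewrite_img_src('<p>Hi <img alt="a > b" src="/up/a.png"></p>', 'to_cdn');
// $result: <p>Hi <img alt="a > b" src="https://cdn.example.com/up/a.png"></p>
```

Because nothing is re-serialized, mis-nested markup elsewhere in the post passes through unchanged. The trade-off is that the scanner knows nothing about comments, <script> blocks or CDATA, so an <img that appears inside one of those would be rewritten as well.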
