I developed an HTML parser and filter package for PHP that can be used for this purpose.
It consists of a set of classes that can be combined to perform a series of parsing, filtering, and conversion operations on HTML/XML code.
It was designed to work with real-world pages, so it can cope with malformed tags and data structures while preserving as much of the original document as possible.
One of the filter classes that comes with it can validate a document against a DTD. Another can strip insecure HTML tags and CSS to prevent XSS attacks. Another simply extracts all the links in a document.
All of these filter classes are optional. You can chain together only the ones you need, in whatever order you want.
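To make the idea of chaining optional filters concrete, here is a minimal sketch built on PHP's bundled DOM extension rather than on the package itself. The names `HtmlFilter`, `StripUnsafeTags`, `CollectLinks`, and `runFilters` are hypothetical and only illustrate the pattern; the package's real class names and API are in its documentation. Removing a handful of tags is also far from a complete XSS defense, so treat this as an illustration of the structure only.

```php
<?php
// Hypothetical sketch: none of these names come from the package described above.

interface HtmlFilter
{
    public function apply(DOMDocument $doc): void;
}

// Removes elements that commonly carry active content.
final class StripUnsafeTags implements HtmlFilter
{
    private array $tags;

    public function __construct(array $tags = ['script', 'iframe', 'object', 'embed'])
    {
        $this->tags = $tags;
    }

    public function apply(DOMDocument $doc): void
    {
        foreach ($this->tags as $tag) {
            // Copy the live node list first; removing nodes while iterating it skips elements.
            foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
                $node->parentNode->removeChild($node);
            }
        }
    }
}

// Records the href of every <a> element it sees.
final class CollectLinks implements HtmlFilter
{
    public array $links = [];

    public function apply(DOMDocument $doc): void
    {
        foreach ($doc->getElementsByTagName('a') as $a) {
            if ($a->hasAttribute('href')) {
                $this->links[] = $a->getAttribute('href');
            }
        }
    }
}

// Parses the markup once, then runs each filter in the order given.
function runFilters(string $html, HtmlFilter ...$filters): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // tolerate malformed markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($filters as $filter) {
        $filter->apply($doc);
    }
    return $doc;
}

// Usage: strip unsafe tags, then collect the remaining links.
$links = new CollectLinks();
$doc = runFilters(
    '<p>Hi <a href="/docs">docs</a><script>alert(1)</script>',
    new StripUnsafeTags(),
    $links
);
// $links->links now holds ['/docs'] and $doc no longer contains a <script> element.
```

Because every filter shares one small interface, adding a new step means writing one more class and passing it to `runFilters`; the order of the arguments is the order in which the filters run.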
So, to solve your problem: I don't think PHP already has a specific solution for this, but a specialized filter class could be developed for it. Take a look at the package. It is fully documented.
If you need help, just check my profile and write to me. I could even develop a filter that does exactly what you need, possibly inspired by solutions that exist for other languages.