Regular expression to match closing HTML tags

I am working on a small Python script to clear HTML documents. It works by accepting a tag list for KEEP, and then parsing HTML tags that are not in the list that I used for regular expressions, and I was able to match the opening tags and the self-closing tags but not closing the tags. The pattern I experimented to match the closing tags is - </(?!a)>. It seems logical to me, so why doesn't it work? (?!a)should coincide with everything that is NOT an anchor tag (not that "a" can be anything - it's just an example).

Edit: AGG! I think the regex did not show!

+3
source share
3 answers
<TAG\b[^>]*>(.*?)</TAG> 

Matches the opening and closing pairs of a specific HTML tag.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

Will match the opening and closing pairs of any HTML tag.

See here .

+5
source

HTML. .

Use an XML parser instead. Try BeautifulSoup or lxml .

+3
source

Source: https://habr.com/ru/post/1760602/


All Articles