Regular expression to remove doctype

I am looking for a regex to remove the following doctype declarations from a set of XML documents:

<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"
          "http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">

This is a fairly common question on Stack Overflow and elsewhere, but none of the answers I found actually handles both cases.

My naive approach, <!DOCTYPE((.|\n|\r)*?)(\"|])>, correctly matches the second case but fails on the first: it stops at the first "> and leaves %mathent; ]> unmatched. If I try to make the regex greedier, it consumes the whole document.
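To make the failure concrete, here is a minimal Python sketch. Python is an assumption chosen for illustration (the question does not name a language), and the helper name leftover is hypothetical. It applies the naive pattern to both declarations and returns whatever survives the removal.

```python
import re

# The naive pattern from the question, copied as a Python raw string.
NAIVE = re.compile(r'<!DOCTYPE((.|\n|\r)*?)(\"|])>')

# First case: DOCTYPE with an internal subset in [...].
internal_subset = (
    '<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>'
)

# Second case: DOCTYPE with a public identifier spread over two lines.
public_id = (
    '<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"\n'
    '          "http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">'
)


def leftover(doctype: str) -> str:
    """Return what remains after removing the first naive match (hypothetical helper)."""
    return NAIVE.sub('', doctype, count=1)


if __name__ == "__main__":
    print(repr(leftover(public_id)))        # nothing should remain
    print(repr(leftover(internal_subset)))  # the tail of the internal subset remains
```

The lazy quantifier stops at the earliest "> it can find, which in the first case is the end of the nested ENTITY declaration rather than the end of the DOCTYPE.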

Full test cases:

1 answer

EDIT: , TheFiddler

This pattern should handle both cases:

<!DOCTYPE[^>[]*(\[[^]]*\])?>

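It matches the literal <!DOCTYPE, then anything up to a > or [, then an optional section enclosed in [], and finally the closing >.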


Piece by piece:

<!DOCTYPE     -- matches the string <!DOCTYPE
[^>[]*        -- matches anything up to a > or [
(\[[^]]*\])?  -- matches an optional section surrounded by []
>             -- matches the string >
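
As a hedged usage sketch, the pattern can be tried in Python. Python is an assumed language and strip_doctype is a hypothetical helper name. The brackets inside the character classes are escaped here; in Python this should be equivalent to the pattern above, and it avoids depending on how a particular engine reads [^]] (JavaScript, for one, treats [^] as a complete class on its own). The sketch also assumes the internal subset contains no ] character.

```python
import re

# Same structure as the pattern above, with [ and ] escaped inside the
# character classes so the meaning does not depend on the regex flavour.
DOCTYPE = re.compile(r'<!DOCTYPE[^>\[]*(\[[^\]]*\])?>')


def strip_doctype(xml: str) -> str:
    """Remove the first DOCTYPE declaration from xml (hypothetical helper)."""
    return DOCTYPE.sub('', xml, count=1)


# First case from the question: internal subset in [...].
doc_internal = (
    '<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>\n'
    '<refentry/>'
)

# Second case from the question: public identifier over two lines.
doc_public = (
    '<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"\n'
    '          "http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">\n'
    '<book/>'
)

if __name__ == "__main__":
    print(repr(strip_doctype(doc_internal)))
    print(repr(strip_doctype(doc_public)))
```

Negated character classes match newlines as well, so no multiline or dot-all flag is needed for the two-line declaration.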

Source: https://habr.com/ru/post/1534054/

