Remove simple HTML tags from String in Oracle through RegExp. Explanation needed

I don't understand why my reg1 and reg2 columns remove "bbb" from my row and only reg3 works as expected.

 WITH t AS (SELECT 'aaa <b>bbb</b> ccc' AS teststring FROM dual)
 SELECT teststring,
        regexp_replace(teststring, '<.+>')  AS reg1,
        regexp_replace(teststring, '<.*>')  AS reg2,
        regexp_replace(teststring, '<.*?>') AS reg3
 FROM t

 TESTSTRING          REG1     REG2     REG3
 aaa <b>bbb</b> ccc  aaa ccc  aaa ccc  aaa bbb ccc

Thanks a lot!

2 answers

Because regex quantifiers are greedy by default. That is, the expressions .* and .+ try to match as many characters as possible. Therefore, <.+> spans from the first < to the last >. Make the quantifier lazy by appending the lazy operator ? to it:

 regexp_replace(teststring, '<.+?>') 

or

 regexp_replace(teststring, '<.*?>') 

Now the search for > stops at the first > it encounters.

Note that . also matches >, so the greedy variant (without ?) swallows every > except the last one.
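
As a quick check of the greedy versus lazy behaviour described above, REGEXP_SUBSTR can show what each pattern matches instead of removing it. This is only a sketch that reuses the teststring from the question; the column aliases greedy_match and lazy_match are illustrative names, not part of the original query.

```sql
-- Return the first substring each pattern matches, rather than deleting it.
WITH t AS (SELECT 'aaa <b>bbb</b> ccc' AS teststring FROM dual)
SELECT regexp_substr(teststring, '<.+>')  AS greedy_match,  -- <b>bbb</b>
       regexp_substr(teststring, '<.+?>') AS lazy_match     -- <b>
FROM t;
```

The greedy match covers the whole span from the first < to the last >, which is why reg1 and reg2 lose "bbb" along with the tags.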


Because the first and second patterns both match the whole of <b>bbb</b>: in this case, b>bbb</b matches both .* and .+

The third one will not do what you need either. You are looking for something like this: <[^>]*> . But you also need to replace all matches with an empty string.
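
A minimal sketch of the negated character class approach, again assuming the teststring from the question (the alias stripped is illustrative). [^>]* can never run past a >, so each tag is matched on its own regardless of greediness.

```sql
-- Remove every <...> tag; [^>]* stops before the first closing >.
WITH t AS (SELECT 'aaa <b>bbb</b> ccc' AS teststring FROM dual)
SELECT regexp_replace(teststring, '<[^>]*>', '') AS stripped  -- aaa bbb ccc
FROM t;
```

With the default occurrence setting, REGEXP_REPLACE replaces every match, so both <b> and </b> are removed in a single call.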


Source: https://habr.com/ru/post/988850/

