Repeating numbered capture groups in Perl

Question

Imagine trying to parse the following HTML with a Perl regular expression:

<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p> <h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p> <p>num4</p>

using the following regular expression:

 <h4>([\w\s]*)</h4>(?:<p>([\w\s]+)</p>)+

How will the groups be numbered in Perl? $1 will obviously contain the text of the <h4>, but when the capture group is repeated, will the captured <p> tags go into $2, $3 and $4? Is there a good way to capture all the <p> tags in an array? Does Perl even support this? Or am I forced to write one regex for the <h4>, then another for the <p> tags?

(I know that I could use HTML::Tree or something similar to parse HTML, but this is just a simplified example to help describe the question. I'm really only interested in how repeated numbered capture groups work in Perl.)

regex perl

user782161 May 28 '13 at 19:44

1 answer

melwil · Accepted Answer · 2013-05-28T19:48:48+0000

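When you repeat a capture group, only the last repetition that matched is saved in the match variables. Earlier repetitions are overwritten, so the <p> tags will not be spread across $2, $3 and $4.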
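To make this concrete, here is a minimal sketch using the first section of the sample input from the question. One assumption: `\s*` has been added inside the repeated group, because the pattern as posted does not allow for the whitespace between tags and would not match the sample text.

```perl
use strict;
use warnings;

# First section of the sample input from the question.
my $html = '<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p>';

my ($heading, $para);

# The question's pattern, with \s* added (an assumption) to skip
# the whitespace between tags.
if ($html =~ m{<h4>([\w\s]*)</h4>(?:\s*<p>([\w\s]+)</p>)+}) {
    ($heading, $para) = ($1, $2);
    print "heading: $heading\n";   # heading: test
    print "para: $para\n";         # para: num3 (only the last repetition)
}
```

The repeated group matches three times, but there is still only one group 2, so `$2` ends up holding just `num3`.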

If you want to get every match from a repeated group, you can use a replaceAll-style substitution with a callback function, or iterate over the matches one after another.

Most languages also have a “match all” operation, which I don’t know how to do in Perl. It usually saves all the matches in an array for you, but a repeated group inside a single match is still only saved as its last repetition.
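In Perl, the "match all" behaviour is available through the `/g` modifier: in list context, `m//g` returns every capture from every match. The sketch below is one possible way to combine that with the iteration idea. It captures the whole run of <p> tags as a single group, then runs a second `/g` match over that run. The `\s*` between tags and the names `@sections`, `$block` and `@paras` are illustrative assumptions, not part of the original question.

```perl
use strict;
use warnings;

# Sample input from the question.
my $html = '<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p> '
         . '<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p> <p>num4</p>';

my @sections;

# Outer loop: in scalar context, /g resumes after the previous match,
# so each iteration handles one <h4> and the run of <p> tags after it.
while ($html =~ m{<h4>([\w\s]*)</h4>((?:\s*<p>[\w\s]+</p>)+)}g) {
    my ($heading, $block) = ($1, $2);

    # Inner match: in list context, /g returns all captures as a list.
    my @paras = $block =~ m{<p>([\w\s]+)</p>}g;

    push @sections, { heading => $heading, paras => \@paras };
}

for my $s (@sections) {
    print "$s->{heading}: @{ $s->{paras} }\n";
}
```

With the sample input this collects three paragraphs for the first heading and four for the second. For real HTML a parser such as HTML::Tree, as mentioned in the question, remains the safer choice.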
