A regular expression to match every newline character (\n) inside a <content> tag

Question


I am looking for a regular expression that matches every newline character (\n) inside an XML <content> tag, or inside any tag nested within <content>, for example:

    <blog>
    <text>
    (Do NOT match new lines here)
    </text>
    <content>
    (DO match new lines here)
    <p>
    (Do match new lines here)
    </p>
    </content>
    (Do NOT match new lines here)
    <content>
    (DO match new lines here)
    </content>

+45

regex

Moayad Mardini Jul 13 '09 at 5:19


2 answers

 <content>(?:[^\n]*(\n+))+</content>

+4

Ross Light Jul 13 '09 at 5:22


Tom · Accepted Answer · 2009-07-13 05:40

Actually ... you can't use a simple regular expression here, at least not a single one. You probably need to worry about comments! Someone might write:

 <!-- <content> blah </content> -->

There are two approaches you can take here:

Strip out all the comments first, then use the regex approach.
Skip regular expressions and use a context-sensitive parser that can track whether or not you are inside a comment.

Be careful.
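To make the comment pitfall concrete, here is a minimal sketch in Python. The sample string and the names CONTENT, COMMENT, naive and careful are made up for illustration, and the comment pattern assumes comments are not nested.

```python
import re

# Assumed patterns: non-greedy, with DOTALL so "." can cross newlines.
CONTENT = re.compile(r"<content>(.*?)</content>", re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Hypothetical input: one commented-out block and one live block.
sample = "<!-- <content> blah </content> -->\n<content>\nreal\n</content>\n"

# Naive search: also picks up the block hidden inside the comment.
naive = CONTENT.findall(sample)

# Strip comments first, then search: only the live block remains.
careful = CONTENT.findall(COMMENT.sub("", sample))

print(naive)
print(careful)
```

The naive search reports two blocks, while the comment-stripped search reports only the one that is really part of the document.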

I'm also not sure you can match all the newlines at once. @Quartz suggested the following:

 <content>([^\n]*\n+)+</content>

This will match any content tags that have a newline character RIGHT BEFORE the closing tag ... but I'm not sure what you mean by matching all newlines. Do you want to be able to access every matched newline character? If so, your best bet is to grab all the content tags and then search for the newline characters nested inside them. Something more like this:

 <content>.*</content>

BUT THERE IS ONE CAVEAT: regular expressions are greedy, so this one will match from the first opening tag through to the last closing tag. Instead, you have to make the regular expression non-greedy. In languages like Python, you can do this with the "?" regex symbol.
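A small sketch of the difference, using a made-up two-block sample. The re.DOTALL flag is my addition so that "." can match across newlines; without it neither pattern would span a multi-line block.

```python
import re

# Hypothetical input with two separate <content> blocks.
sample = "<content>\none\n</content>\nskip\n<content>\ntwo\n</content>"

# Greedy: ".*" runs to the LAST closing tag, swallowing "skip" as well.
greedy = re.findall(r"<content>.*</content>", sample, re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST closing tag it can reach.
lazy = re.findall(r"<content>.*?</content>", sample, re.DOTALL)

print(len(greedy), len(lazy))
```

The greedy pattern returns one match covering the whole sample, and the non-greedy pattern returns each block separately.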

I hope this shows you some of the pitfalls and helps you figure out how you want to proceed. You are probably better off using an XML parsing library and then iterating over all the content tags.
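As a rough sketch of the parsing-library route, the standard library's xml.etree.ElementTree can do the iteration. This assumes the input is well-formed XML with a single root element; the function name count_content_newlines and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def count_content_newlines(xml_text):
    """Return the number of newlines inside each <content> element."""
    root = ET.fromstring(xml_text)  # the default parser discards comments
    counts = []
    for elem in root.iter("content"):
        # itertext() yields the element's text plus the text and tails of
        # everything nested inside it, so newlines within <p> are included.
        text = "".join(elem.itertext())
        counts.append(text.count("\n"))
    return counts

# Hypothetical document modelled on the question's example.
doc = (
    "<blog>\n<text>\nno\n</text>\n"
    "<!-- <content>\n\n</content> -->\n"
    "<content>\na\n<p>\nb\n</p>\n</content>\n"
    "<content>\nc\n</content>\n</blog>"
)
print(count_content_newlines(doc))
```

Because the parser handles comments and nesting, no separate comment-stripping step is needed. Note that a <content> nested inside another <content> would be visited twice by root.iter.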

I know I may not be offering the best solution, but at least I hope you will see the difficulties here and why other answers may not be right ...

UPDATE 1:

Let me summarize a bit more and add some detail to my answer. I am going to use Python's regex syntax because it is what I am most used to (forgive me ahead of time ... you may need to escape some characters ... comment on my post and I will correct it):

To strip out comments, use this regex: <!--.*?--> Notice that the "?" suppresses the .* to make it non-greedy.

Similarly, to search for content tags, use: <content>.*?</content>

Alternatively, you may be able to try this and access each newline character through the match object's groups():

 <content>(.*?(\n))+.*?</content>

I know my escaping is off, but it captures the idea. This last example probably won't work, but I think it is my best attempt at expressing what you want. My suggestion remains: either grab all the content tags and do it yourself, or use a parsing library.
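A quick check of the group-based idea, as a sketch: in Python's re module a capturing group that repeats keeps only its last repetition, so groups() cannot list every newline. The sample string below is made up for illustration.

```python
import re

# Hypothetical sample; any <content> block with several lines would do.
sample = "<content>\n<p>\n haha!\n</p>\n</content>"

m = re.search(r"<content>(.*?(\n))+.*?</content>", sample)

# The outer and inner groups each repeated four times, but only the
# final repetition of each is kept on the match object.
last_line, last_newline = m.groups()

# Counting on the whole match still sees every newline.
newline_count = m.group(0).count("\n")

print(repr(last_line), repr(last_newline), newline_count)
```

So the pattern can tell you that a block contains newlines, but reaching each one individually still means searching inside the matched block.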

UPDATE 2:

So here is some Python code that should work. I'm still not sure what you mean by "finding" all the newlines. Do you want the actual lines, or just a count of how many lines there are? To get the actual lines, try:

    #!/usr/bin/env python3

    import re

    def FindContentNewlines(xml_text):
        # May want to compile these regexes elsewhere, but I do it here for brevity
        comments = re.compile(r"<!--.*?-->", re.DOTALL)
        content = re.compile(r"<content>(.*?)</content>", re.DOTALL)
        # No re.DOTALL here: "." must stop at newlines so each match is one line
        newlines = re.compile(r"^(.*?)$", re.MULTILINE)

        # strip comments: this actually may not be reliable for "nested comments"
        # How does xml handle <!-- <!-- --> -->. I am not sure. But that COULD
        # be trouble.
        xml_text = re.sub(comments, "", xml_text)

        result = []
        all_contents = re.findall(content, xml_text)
        for c in all_contents:
            result.extend(re.findall(newlines, c))

        return result

    if __name__ == "__main__":
        example = """

    <!-- This stuff
         ought to be omitted
    <content>
     omitted
    </content>
    -->

    This stuff is good
    <content>
    <p>
     haha!
    </p>
    </content>

    This is not found
    """
        print(FindContentNewlines(example))

This program prints the result:

  ['', '<p>', ' haha!', '</p>', '']

The first and last empty strings come from the newline characters immediately preceding the <p> and immediately following the </p>. All in all, this (for the most part) does the trick. Experiment with this code and refine it for your needs. Print out the intermediate results so you can see what the regular expressions are and are not matching.
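If what you are after is the newline characters themselves rather than the lines, one possible variation is to return the position of each newline. This is only a sketch: the function name content_newline_offsets is hypothetical, and blanking out comments with spaces (instead of deleting them) is my own choice so that the offsets still refer to the original string.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
CONTENT = re.compile(r"<content>(.*?)</content>", re.DOTALL)

def content_newline_offsets(xml_text):
    """Return the index of every newline found inside a <content> block."""
    # Replace each comment with spaces of equal length, so positions in
    # the masked text line up with positions in the original text.
    masked = COMMENT.sub(lambda m: " " * len(m.group(0)), xml_text)
    offsets = []
    for block in CONTENT.finditer(masked):
        start, end = block.span(1)
        offsets.extend(i for i in range(start, end) if masked[i] == "\n")
    return offsets

# Hypothetical input: a commented-out block followed by a live one.
text = "a\n<!-- <content>\n</content> -->\n<content>\nx\n</content>\n"
for i in content_newline_offsets(text):
    print(i, repr(text[i]))
```

Like the regex approach above, this assumes comments are not nested and that <content> blocks do not contain other <content> blocks.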

Hope this helps :-).

PS - I never really managed to test the regex from my first update for capturing all the newlines ... let me know if you do.
