Support for parsing strings

Question


I have a string like the following:

$string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> ";

I would like to split this string into an array containing the text found between the <paragraph></paragraph> tags. For example, something like this:

 $string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> ";
 $paragraphs = splitParagraphs($string);
 /* $paragraphs now contains:
    $paragraphs[0] = apples are red...
    $paragraphs[1] = john is a boy...
    $paragraphs[2] = this is dummy text...
 */

Any ideas?

PS: it must be case insensitive; <paragraph>, <PARAGRAPH>, and <Paragraph> should all be treated the same way.

Edit: This is not XML, there are many things here that will ruin the XML structure, so I cannot use SimpleXML, etc. I need a regex that parses this.

+4

string php regex

Click upvote Mar 25 '10 at 21:07


7 answers

If it is a simple structure, without nesting:

 preg_split("#</?paragraph>#i", $string);

To ignore empty tokens:

 preg_split("#</?paragraph>#i", $string, -1, PREG_SPLIT_NO_EMPTY);

Source: http://php.net/manual/en/function.preg-split.php
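
Note that with the sample string from the question, the pieces between adjacent tags are single spaces rather than empty strings, so PREG_SPLIT_NO_EMPTY alone may leave whitespace-only tokens in the result. A minimal sketch of one way to drop them, assuming the whitespace around each paragraph is not significant (the helper name splitOnTags is hypothetical, not part of PHP):

```php
<?php
// Hypothetical helper: split on the tags, trim each piece, and drop blanks.
function splitOnTags($string)
{
    $pieces = preg_split("#</?paragraph>#i", $string);
    $pieces = array_map('trim', $pieces);
    // array_filter preserves keys, so reindex with array_values afterwards.
    return array_values(array_filter($pieces, function ($piece) {
        return $piece !== '';
    }));
}

$string = " <paragraph>apples are red...</paragraph> <PARAGRAPH>john is a boy..</PARAGRAPH> ";
print_r(splitOnTags($string));
```

Like the split above, this still assumes a simple structure without nesting, and any text outside the tags ends up in the result as well.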

+2

Kobi Mar 25 '10 at 21:12


If you really are parsing XML, there is the PHP DOM extension. Your example above may be trivial, but if you are parsing XML, I would use a dedicated XML API.

0

Brian Agnew Mar 25 '10 at 21:11


It looks an awful lot like XML. If it is, you should use SimpleXMLElement or any other XML-parsing PHP tool.

 $xml = new SimpleXMLElement('<root>' . $string . '</root>');
 foreach ($xml->paragraph as $paragraph) {
     // do stuff to $paragraph; its strval is the contents of the paragraph
 }

0

zneak Mar 25 '10 at 21:11


Well, you should use an XML parser like SimpleXML or XMLReader.

However, if you want to hack something, the following will work:

 $string = str_replace("<paragraph>", "", $string);
 $string = str_replace("</paragraph>", "", $string);
 $paragraphs = explode("\n", $string);

This will work as long as you have one item per line. If you have everything on one line, replace the second line of code above with:

 $string = str_replace("</paragraph>", "\n", $string);

Good luck

0

Mike Cialowicz Mar 25 '10 at 21:13


So, assuming you have some things in the paragraphs that would break the XML format, or you just want to learn a bit more about parsing with regular expressions, this should do the job for the example you posted. It is not particularly pretty, but that is why people like to use XML: it has a formal syntax that makes parsing simple, or simpler, anyway. In particular, this solution depends on the string being parsed starting with an opening paragraph tag and ending with a closing paragraph tag, with nothing but whitespace between each pair of paragraphs. This is a very literal solution to your example problem. But since the example is the only existing specification for your custom data format, this was the best I could do :)

 $string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> ";
 $paragraphs = preg_replace(
     '/(^\s*<paragraph>|<\/paragraph>\s*$)/',
     '',
     preg_split('/(?<=<\/paragraph>)\s*(?=<paragraph>)/', $string)
 );

What happens here is that zero-width lookaround assertions in the preg_split call find the boundary between the end of one paragraph and the start of the next, and preg_replace then strips the tags from the beginning and end of each piece. You will end up with the following contents in $paragraphs:

 array (
   0 => 'apples are red...',
   1 => 'john is a boy..',
   2 => 'this is dummy text......',
 )
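
To see what the two steps do separately, here is a small sketch that runs them one at a time on a shortened version of the sample string. The variable names $pieces and $stripped are only for illustration:

```php
<?php
$string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> ";

// Step 1: split at the position between a closing tag and the next opening
// tag. The lookbehind and lookahead consume nothing, so only the whitespace
// between the tags is removed and each piece keeps its own tags.
$pieces = preg_split('/(?<=<\/paragraph>)\s*(?=<paragraph>)/', $string);

// Step 2: strip the leading opening tag and the trailing closing tag
// (plus surrounding whitespace) from every piece.
$stripped = preg_replace('/(^\s*<paragraph>|<\/paragraph>\s*$)/', '', $pieces);

var_export($pieces);
echo "\n";
var_export($stripped);
```

Because preg_replace accepts an array as its subject, the second step is applied to each piece without an explicit loop.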

0

intuited Mar 25 '10 at 21:30


Following your edits (case insensitivity, and markup too messy for an XML parser to handle), the following should work:

 $paragraphs = array();
 $exploded = explode("</", $string);
 unset($exploded[count($exploded) - 1]); // remove the useless, final "paragraph>" item
 foreach ($exploded as $item) {
     // strip the leftover "paragraph>" (absent on the first item), any whitespace,
     // and the opening tag, ignoring case
     array_push($paragraphs, preg_replace('#^(paragraph>)?\s*<paragraph>#i', '', $item));
 }

0

Mike Cialowicz Mar 25 '10 at 21:32


Mark Byers · Accepted Answer · 2010-03-25T21:15:25+0000

If this is actually XML, I agree with the other answers. But if it is not valid XML, just something that looks like XML, then you should not try to parse it with an XML parser. Instead, you can use a regex:

 $matches = array();
 preg_match_all(":<paragraph>(.*?)</paragraph>:is", $string, $matches);
 $result = $matches[1];
 print_r($result);

Output:

 Array
 (
     [0] => apples are red...
     [1] => john is a boy..
     [2] => this is dummy text......
 )

Note that the i modifier makes the match case insensitive, and the s modifier lets . match newlines in the text. Any text that is not enclosed in paragraph tags is ignored.
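
As a rough sketch, the same call can be wrapped in the splitParagraphs() function the question asks for. The input below is made up for illustration: it mixes tag casing, puts a newline inside a paragraph, and leaves stray text between paragraphs.

```php
<?php
// Sketch of splitParagraphs(), built on the preg_match_all() approach.
function splitParagraphs($string)
{
    $matches = array();
    // i: tags match regardless of case; s: "." also matches newlines.
    preg_match_all(":<paragraph>(.*?)</paragraph>:is", $string, $matches);
    // $matches[1] holds the text captured by the first group of each match.
    return $matches[1];
}

$mixed = "<paragraph>apples are red...</paragraph> junk <PARAGRAPH>john\nis a boy..</Paragraph>";
print_r(splitParagraphs($mixed));
```

The non-greedy .*? matters here: a greedy .* would run from the first opening tag to the last closing tag and return a single match.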
