Support for parsing strings

I have a string like the following:

$string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> "; 

I would like to split this string into an array containing the text found between the <paragraph></paragraph> tags. For example, something like this:

    $string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> ";
    $paragraphs = splitParagraphs($string);
    /* $paragraphs now contains:
       $paragraphs[0] = apples are red...
       $paragraphs[1] = john is a boy..
       $paragraphs[2] = this is dummy text......
    */

Any ideas?

PS: it must be case-insensitive; <paragraph>, <PARAGRAPH>, and <Paragraph> should all be treated the same way.

Edit: This is not XML, there are many things here that will ruin the XML structure, so I cannot use SimpleXML, etc. I need a regex that parses this.

+4
7 answers

If this is actually XML, I agree with the other answers. But if it is not valid XML, just something that resembles XML, then you should not try to parse it with an XML parser. Instead, you can use a regex:

    $matches = array();
    preg_match_all(":<paragraph>(.*?)</paragraph>:is", $string, $matches);
    $result = $matches[1];
    print_r($result);

Output:

    Array
    (
        [0] => apples are red...
        [1] => john is a boy..
        [2] => this is dummy text......
    )

Note that the i modifier makes the match case-insensitive, and the s modifier lets the dot match newlines inside the text. Any text that is not enclosed in paragraph tags is ignored.
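To tie this back to the splitParagraphs($string) call in the question, here is a minimal sketch that wraps the same pattern in a function. The function body and the mixed-case sample input are my own assumptions, added only to exercise the i and s modifiers. Nested paragraph tags are not handled.

```php
<?php
// Hypothetical helper, named after the function the question asks for.
function splitParagraphs(string $string): array
{
    // i: tag names match in any case; s: "." also matches newlines.
    // The lazy (.*?) stops at the first closing tag it meets.
    preg_match_all(':<paragraph>(.*?)</paragraph>:is', $string, $matches);

    // $matches[1] holds the text captured by the group for every match.
    return $matches[1];
}

// Sample input with mixed-case tags and a newline inside one paragraph.
$string = " <paragraph>apples are red...</paragraph> <PARAGRAPH>john is\na boy..</Paragraph> <Paragraph>this is dummy text......</paragraph> ";
print_r(splitParagraphs($string));
```

If the input contains no paragraph tags at all, the function returns an empty array.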

+5

If it is a simple structure, without nesting:

 preg_split("#</?paragraph>#i", $string); 

To ignore empty tokens:

 preg_split("#</?paragraph>#i", $string, -1, PREG_SPLIT_NO_EMPTY); 

Source: http://php.net/manual/en/function.preg-split.php
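One caveat when applying this to the string from the question: the tags there are separated by spaces, and PREG_SPLIT_NO_EMPTY only drops zero-length pieces, so whitespace-only tokens remain. A possible workaround, sketched under the assumption that surrounding whitespace in each paragraph is not significant, is to trim and filter the result:

```php
<?php
$string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> ";

// PREG_SPLIT_NO_EMPTY removes only zero-length pieces, so " " tokens survive.
$tokens = preg_split('#</?paragraph>#i', $string, -1, PREG_SPLIT_NO_EMPTY);

// Trim each piece, drop the ones that end up empty, and reindex from 0.
$paragraphs = array_values(array_filter(array_map('trim', $tokens), 'strlen'));

print_r($paragraphs);
```

Unlike the preg_match_all approach, any non-whitespace text that sits outside the tags would still show up as a token here.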

+2

If you are really parsing XML, there is the PHP DOM. Your example above may be trivial, but if you are parsing XML, I would use a dedicated XML API.

0

This looks awfully like XML. If it is, you should use SimpleXMLElement or any other XML-parsing PHP tool.

    $xml = new SimpleXMLElement('<root>' . $string . '</root>');
    foreach ($xml->paragraph as $paragraph) {
        // do stuff to $paragraph; its strval() is the contents of the paragraph
    }
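As a rough illustration of why this only helps when the input really is XML, here is a sketch that returns null when the markup is not well-formed. The helper name paragraphsViaSimpleXml and both sample inputs are hypothetical. XML element names are also case-sensitive, so $xml->paragraph would not pick up <PARAGRAPH> elements.

```php
<?php
// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: returns the paragraph texts,
// or null if the wrapped input is not well-formed XML.
function paragraphsViaSimpleXml(string $string): ?array
{
    $xml = simplexml_load_string('<root>' . $string . '</root>');
    if ($xml === false) {
        return null;
    }

    $result = [];
    foreach ($xml->paragraph as $paragraph) {
        // Casting a SimpleXMLElement to string yields its text content.
        $result[] = (string) $paragraph;
    }
    return $result;
}

var_dump(paragraphsViaSimpleXml('<paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph>'));
var_dump(paragraphsViaSimpleXml('<paragraph>an unclosed <b>tag</paragraph>'));
```

The second call fails to parse because of the mismatched tags, which is the situation the question's edit describes.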
0

Well, you should use an XML parser like SimpleXML or XMLReader.

However, if you want to hack something, the following will work:

    $string = str_replace("<paragraph>", "", $string);
    $string = str_replace("</paragraph>", "", $string);
    $paragraphs = explode("\n", $string);

This will work as long as you have one item per line. If you have everything on one line, replace the second line of the code above with:

 $string = str_replace("</paragraph>", "\n", $string); 
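Since the question asks for case-insensitive tags and str_replace is case-sensitive, a variant of this hack could use str_ireplace instead. This is only a sketch: the sample input is made up, and it assumes that no paragraph contains a newline of its own.

```php
<?php
$string = "<Paragraph>apples are red...</PARAGRAPH> <paragraph>john is a boy..</paragraph>";

// str_ireplace is the case-insensitive counterpart of str_replace.
$string = str_ireplace("<paragraph>", "", $string);
$string = str_ireplace("</paragraph>", "\n", $string);

// Split on the inserted newlines, trim each piece, and drop empty leftovers.
$paragraphs = array_values(array_filter(array_map('trim', explode("\n", $string)), 'strlen'));

print_r($paragraphs);
```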

Good luck

0

So, assuming you have some things in the paragraphs that would break XML well-formedness, or you just want to learn a little more about parsing with regular expressions, this should do the job for the example you posted. It is not particularly pretty, but that is why people like to use XML: it has a formal syntax that makes parsing simple, or simpler, anyway. In particular, this solution depends on the string starting with an opening paragraph tag and ending with a closing paragraph tag, with nothing but whitespace between each pair of paragraphs. This is a very literal solution to your example problem. But since the example is the only existing specification for your custom data format, this was the best I could do :)

    $string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> ";
    $paragraphs = preg_replace(
        '/(^\s*<paragraph>|<\/paragraph>\s*$)/',
        '',
        preg_split('/(?<=<\/paragraph>)\s*(?=<paragraph>)/', $string)
    );

What happens here is that zero-width lookaround assertions in the preg_split call find each boundary between the end of one paragraph and the start of the next, and preg_replace then strips the tags from the beginning and end of each piece. You end up with the following contents in $paragraphs:

    array (
      0 => 'apples are red...',
      1 => 'john is a boy..',
      2 => 'this is dummy text......',
    )
0

After your edits (case insensitivity, and markup that an XML parser cannot process), the following should work:

    $paragraphs = array();
    $exploded = explode("</", $string);
    unset($exploded[count($exploded) - 1]); // remove the useless, final "paragraph>" item
    $exploded[0] = str_replace("<paragraph>", "", $exploded[0]); // first item is a special case
    foreach ($exploded as $item) {
        array_push($paragraphs, str_replace("paragraph>\n<paragraph>", "", $item));
    }
0

Source: https://habr.com/ru/post/1305153/

