Parsing an XML file in PHP

Question

I am trying to use SimpleXML to retrieve data from http://rates.fxcm.com/RatesXML. With simplexml_load_file() I got errors, because the site always puts stray strings/numbers before and after the XML document. Example:

 2000<?xml version="1.0" encoding="UTF-8"?> <Rates> <Rate Symbol="EURUSD"> <Bid>1.27595</Bid> <Ask>1.2762</Ask> <High>1.27748</High> <Low>1.27385</Low> <Direction>-1</Direction> <Last>23:29:11</Last> </Rate> </Rates> 0

I then decided to fetch the document with file_get_contents(), strip the stray strings before and after it with substr(), and parse the result as a string with simplexml_load_string(). However, sometimes random strings also appear between the nodes:

 <Rate Symbol="EURTRY"> <Bid>2.29443</Bid> <Ask>2.29562</Ask> <High>2.29841</High> <Low>2.28999</Low> 137b <Direction>1</Direction> <Last>23:29:11</Last> </Rate>

My question is: is there any way to handle all of these random strings with a regular expression, no matter where they appear? (I think that would be a better idea than contacting the site to get them to serve proper XML.)

+4

php regex parsing preg-replace simplexml

Michael lam Nov 19 '12 at 4:44


1 answer

Martin ender · Answer 1 · 2012-11-19T08:46:36+0000

I believe that preprocessing XML with regular expressions might be just as bad as parsing it with them.

But here is a preg_replace call that removes all non-whitespace characters at the beginning of the string, at the end of the string, and after closing or self-closing tags:

    $string = preg_replace(
        '~
        (?|           # start of alternation where capturing group count starts from
                      # 1 for each alternative
          ^[^<]*      # match non-< characters at the beginning of the string
        |             # OR
          [^>]*$      # match non-> characters at the end of the string
        |             # OR
          (           # start of capturing group $1: closing tag
            </[^>]++> # match a closing tag; note the possessive quantifier (++); it
                      # suppresses backtracking, which is a convenient optimization,
                      # the following bit is mutually exclusive anyway (this will be
                      # used throughout the regex)
            \s++      # and the following whitespace
          )           # end of $1
          [^<\s]*+    # match non-<, non-whitespace characters (the "bad" ones)
          (?:         # start subgroup to repeat for more whitespace/non-whitespace
                      # sequences
            \s++      # match whitespace
            [^<\s]++  # match at least one "bad" character
          )*          # repeat
                      # note that this kind of pattern keeps all whitespace
                      # before the first and after the last "bad" character
        |             # OR
          (           # start of capturing group $1: self-closing tag
            <[^>/]+/> # match a self-closing tag
            \s++      # and the following whitespace
          )
          [^<]*+(?:\s++[^<\s]++)*
                      # same as before
        )             # end of alternation
        ~x',
        '$1',
        $input);

The replacement then simply writes back the closing or self-closing tag, if one was matched.
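As a rough usage sketch, the pattern can be applied to a string modelled on the excerpts in the question and the result handed to simplexml_load_string(). The helper name strip_stray_tokens() and the sample string are assumptions made for illustration; the pattern is the one from the answer, condensed onto one line without the x flag and comments.

```php
<?php
// Hypothetical helper (name is an assumption): applies the answer's pattern,
// written here without the x flag, whitespace, or comments.
function strip_stray_tokens(string $input): string
{
    $pattern = '~(?|^[^<]*|[^>]*$'
             . '|(</[^>]++>\s++)[^<\s]*+(?:\s++[^<\s]++)*'
             . '|(<[^>/]+/>\s++)[^<]*+(?:\s++[^<\s]++)*)~';

    // $1 is the closing or self-closing tag plus its trailing whitespace;
    // it is empty for the start-of-string and end-of-string alternatives.
    return preg_replace($pattern, '$1', $input);
}

// Sample input assembled from the two excerpts in the question.
$input = ' 2000<?xml version="1.0" encoding="UTF-8"?> <Rates> '
       . '<Rate Symbol="EURTRY"> <Bid>2.29443</Bid> <Ask>2.29562</Ask> '
       . '<High>2.29841</High> <Low>2.28999</Low> 137b '
       . '<Direction>1</Direction> <Last>23:29:11</Last> </Rate> </Rates> 0';

$clean = strip_stray_tokens($input);

// Collect libxml errors instead of emitting warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($clean);

if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
} else {
    foreach ($xml->Rate as $rate) {
        printf("%s bid=%s ask=%s\n", $rate['Symbol'], $rate->Bid, $rate->Ask);
    }
}
```

Note that the pattern only looks at the start of the string, the end of the string, and the text following closing or self-closing tags, so a stray token directly after an opening tag would not be removed by it.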

One reason this approach is unsafe is that closing or self-closing tags can also occur inside comments or attribute strings. But I can hardly suggest that you use an XML parser instead, since your XML parser cannot parse this XML either.
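To illustrate the caveat about comments, here is a small sketch with a made-up, well-formed input (the element names and comment text are assumptions) in which a closing tag appears inside a comment. It uses the same pattern in condensed form and compares whether the document parses before and after the replacement.

```php
<?php
// Same pattern as in the answer, condensed (no x flag, no comments).
$pattern = '~(?|^[^<]*|[^>]*$'
         . '|(</[^>]++>\s++)[^<\s]*+(?:\s++[^<\s]++)*'
         . '|(<[^>/]+/>\s++)[^<]*+(?:\s++[^<\s]++)*)~';

// Hypothetical input: well-formed XML whose comment contains "</b>".
$input  = '<a><!-- </b> note --><c>1</c></a>';
$result = preg_replace($pattern, '$1', $input);

// Collect libxml errors instead of emitting warnings.
libxml_use_internal_errors(true);
$parsesBefore = simplexml_load_string($input) !== false;
$parsesAfter  = simplexml_load_string($result) !== false;

var_dump($result, $parsesBefore, $parsesAfter);
```

Here the text after the "</b>" inside the comment is treated as "bad" characters, which swallows the end of the comment, so the cleaned string should no longer parse even though the original did.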