Python regex issue
I am trying to match some HTML with a regex, and the regex works fine when the data is defined in the script:

    import re

    reg = r";!--\"\'<[a-i0-9]{8}>=&\{\(\)\}"
    html_data = "some html data"
    if re.search(reg, html_data):
        print("Match")

But when the HTML data comes from reading a local file or from the Internet, the match fails. I downloaded the HTML page from the server, copied the source into a script, and it works fine. Reading directly from a file or from the server does not work.
I also checked the local file in a hex editor to make sure there is no special character tripping me up.
Example string to match:

    <input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">

Where ;!--\"\'<a41cgb04>=&{()} is what should be matched.
In my opinion, your problem comes from a wrong interpretation of this:

    <input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">

You think the backslashes before " and ' are in the source code. But I think that one of the two is actually a display artifact: it is not actually present in the HTML code.
I do not know how you obtained the displayed sequence of characters.
But I think this is the same phenomenon as when using repr():
the display shows backslashes that help you understand what is in the sequence of characters, but those backslashes are not actually part of the value of the displayed string.
You will understand what I mean with this:

    a = "abc ' def "
    b = ' ABC " DEF'
    print repr(a + b)

result

    'abc \' def  ABC " DEF'
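To check this for yourself, you can compare what repr() shows with what the string really contains. This is only a minimal sketch, written for Python 3 (the snippets in this answer use Python 2 print statements); the names s and shown are just illustrative.

```python
a = "abc ' def "
b = ' ABC " DEF'
s = a + b

# repr() returns the text the interpreter would display for the string.
shown = repr(s)
print(shown)

# The displayed form contains a backslash, added to escape the single quote...
print("\\" in shown)
# ...but the string value itself contains no backslash at all.
print("\\" in s)

# The display is longer than the value: two enclosing quotes plus one backslash.
print(len(s), len(shown))
```

If a regex is written to match the backslash seen in the display, it is looking for a character that the data does not contain.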
Update
An example is the following web page:
http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/
Viewing the source of this page in a browser shows the following as line 13:

    <meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia" />

Now, executing the following code

    from urllib import urlopen

    url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'
    sock = urlopen(url)
    srce = sock.read()
    sock.close()

    li = srce.splitlines(True)

    print 'Displayed normally:\n-------------------\n'
    print '\n'.join(li[12:14])
    print
    print 'Displayed with the help of repr():\n----------------------\n'
    print '\n'.join(map(repr,li[12:14]))
    print
    print 'Displayed in a list:\n--------------------\n'
    print li[12:14]

gives the result:

    Displayed normally:
    -------------------

    <meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia" />

    <meta name="allow-search" content="YES" />


    Displayed with the help of repr():
    ----------------------

    '<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n'
    '<meta name="allow-search" content="YES" />\n'

    Displayed in a list:
    --------------------

    ['<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n', '<meta name="allow-search" content="YES" />\n']

When the source code is displayed normally, special characters such as '\n', '\r' and '\t' are not visible, which makes writing a regular expression pattern harder. This is why analysing HTML source is easier when the strings are displayed without interpretation.
So, displaying the source code with repr() or in a list shows every character explicitly.
The only inconvenience is that the ' characters in the middle of a string are sometimes shown escaped, because that is how they must be written in a string literal delimited by ' at the beginning and at the end. When a list is displayed, its elements are shown using repr(), so print li[12:14] displays the elements in the same form as print '\n'.join(map(repr,li[12:14])). In effect, repr() shows a string the way it would have to be written in source code in order to have that value.
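The point about lists can be verified without fetching any page. A small sketch in Python 3, using a made-up sample line (the variable names are illustrative, not from the code above):

```python
# A sample line containing both kinds of quote and a trailing newline.
line = "content=\"Bergenia 'Lunar Glow'\"\n"

as_text = str(line)      # what print(line) writes
in_list = str([line])    # what print([line]) writes

print(as_text)
print(in_list)

# The list display is the repr() of each element placed inside brackets.
print(in_list == "[" + repr(line) + "]")

# The escaped quote appears only in the display, never in the value.
print("\\'" in in_list, "\\" in line)
```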
Finally, I want to emphasize this: if someone writes a regular expression pattern with "\\\\'" or r"\\'" because they believe there is a \ in front of the ' character, having seen the source code displayed with repr(), the pattern will be wrong.
The following code shows this more clearly, I hope:

    import re
    from urllib import urlopen

    url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'
    sock = urlopen(url)
    srce = sock.read()
    sock.close()

    pat = '<meta name="abstract" content="(Heronswood Bergenia (\'Lunar Glow\')? [a-zA-Z]+\d+ .*?)" />'
    regx = re.compile(pat)
    print regx.search(srce).groups()

    pat = "<meta name=\"abstract\" content=\"(Heronswood Bergenia (\\\\'Lunar Glow\\\\')? [a-zA-Z]+\d+ .*?)\" />"
    regx = re.compile(pat)
    print regx.search(srce).groups()

result

    ("Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia", "'Lunar Glow'")
    Traceback (most recent call last):
      File "I:\trez.py", line 18, in <module>
        print regx.search(srce).groups()
    AttributeError: 'NoneType' object has no attribute 'groups'

Perhaps this http://docs.python.org/library/htmlparser.html will be more useful to you than trying to use a regex. I tend to agree with Mark Pilgrim that using a regular expression gives you two problems: the regular expression and the original problem.
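As a rough illustration of the parser route, here is a sketch for Python 3, where the module linked above is named html.parser. The class name MetaContentParser and the two-line sample page are assumptions made for this example; only HTMLParser and its handle_starttag hook come from the standard library.

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collect name -> content for every <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if "name" in attrs and "content" in attrs:
            self.meta[attrs["name"]] = attrs["content"]


# Sample markup copied from the two meta lines shown earlier in the thread.
page = (
    '<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n'
    '<meta name="allow-search" content="YES" />\n'
)

parser = MetaContentParser()
parser.feed(page)
parser.close()

abstract = parser.meta["abstract"]
print(abstract)
```

The parser hands over the attribute value itself, so there is no question of which quotes or backslashes belong to the display and which to the data.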
I would change your regex, as you are in backslash hell. This expression works with a file:

    reg = ";!--....<[a-i0-9]{8}>=&\{\(\)\}"

Breaking your expression into parts:

    reg = ";!--"

matches, but

    reg = ";!--\\"

throws an error regarding a bogus end-of-line escape. That string holds a single trailing backslash, and the re module does not accept a pattern that ends in a lone backslash.
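The difference between the two patterns can be tried out without a file. The sketch below is for Python 3 and rests on an assumption: that the data read from the file contains literal backslashes, while the copy pasted into a script loses them when Python interprets the string literal. The variable names are illustrative.

```python
import re

# The question's pattern: inside a regex, \" and \' simply mean " and '.
reg_question = r";!--\"\'<[a-i0-9]{8}>=&\{\(\)\}"
# The pattern from this answer: any four characters after ";!--".
reg_dots = r";!--....<[a-i0-9]{8}>=&\{\(\)\}"

# Assumed inputs: the same payload with and without literal backslashes.
with_backslashes = r'value=";!--\"\'<a41cgb04>=&{()}"'
without_backslashes = 'value=";!--"\'<a41cgb04>=&{()}"'

results = {
    "question/with": bool(re.search(reg_question, with_backslashes)),
    "question/without": bool(re.search(reg_question, without_backslashes)),
    "dots/with": bool(re.search(reg_dots, with_backslashes)),
    "dots/without": bool(re.search(reg_dots, without_backslashes)),
}
for name, matched in results.items():
    print(name, matched)

# A pattern ending in a lone backslash is rejected when it is compiled.
try:
    re.compile(";!--\\")
    trailing_backslash_rejected = False
except re.error:
    trailing_backslash_rejected = True
print(trailing_backslash_rejected)
```

Under that assumption, the question's pattern matches only the text without backslashes, and the four-dot pattern matches only the text with them, which would explain why one works in a script and the other works on a file.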
As the saying goes:
The developer has a problem and thinks, "I will solve it with regular expressions."
Now the developer has two problems.