
Python TypeError in regex

So, I have this code:

    url = 'http://google.com'
    linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
    m = urllib.request.urlopen(url)
    msg = m.read()
    links = linkregex.findall(msg)

But then python returns this error:

    links = linkregex.findall(msg)
    TypeError: can't use a string pattern on a bytes-like object

What have I done wrong?

+51
python regex typeerror
Mar 03 '11 at 17:50
6 answers

TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

You used a string pattern on a bytes object. Use a bytes pattern instead:

    linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                           ^ Add the b there; it makes the pattern a bytes object

(PS:

    >>> from disclaimer include dont_use_regexp_on_html
    "Use BeautifulSoup or lxml instead."

)
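To illustrate the point about not using regular expressions on HTML, here is a minimal sketch that collects link targets with a parser instead. It uses the standard library's `html.parser` rather than BeautifulSoup or lxml, purely to avoid third-party dependencies; the `LinkCollector` class, the `extract_links` helper, and the sample markup are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in <a> tags of html_text (a str)."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up sample bytes standing in for what m.read() would return.
sample = b'<p><a href="http://example.com/a">A</a> <a class="x" href=\'/b\'>B</a></p>'

# The parser wants text, so the bytes are decoded first (UTF-8 is an assumption).
links = extract_links(sample.decode("utf-8"))
print(links)
```

Unlike the regex, this does not care about attribute order or quoting style, which is the main reason a parser tends to be more robust here.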

+70
Mar 03 '11 at 19:23

If you are using Python 2.6, there is no "request" in "urllib". So the third line becomes:

 m = urllib.urlopen(url) 

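In Python 3, findall() with a string pattern expects a str, but "msg" is a bytes object. Converting it with str() avoids the TypeError:

    links = linkregex.findall(str(msg))

Note, though, that str(msg) returns the printable representation of the bytes (b'...', with non-ASCII bytes shown as escape sequences) rather than the decoded text. It is usually better to decode using the correct encoding. For example, if the encoding is "latin1", then: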

 links = linkregex.findall(msg.decode("latin1")) 
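The difference between str() and decode() is easy to miss, so here is a small sketch of it. The sample bytes and the simplified pattern are made up for illustration, and UTF-8 is an assumed encoding; a real page may use a different one.

```python
import re

# Made-up sample standing in for the bytes returned by m.read().
msg = "<a href='/caf\u00e9'>caf\u00e9</a>".encode("utf-8")

# Simplified string pattern, for illustration only.
linkregex = re.compile(r'<a\s+href=[\'"](.*?)[\'"]')

as_repr = str(msg)             # the repr of the bytes, escapes included
as_text = msg.decode("utf-8")  # the actual text

links_from_repr = linkregex.findall(as_repr)
links_from_text = linkregex.findall(as_text)

print(as_repr)
print(as_text)
print(links_from_repr)
print(links_from_text)
```

Both calls run without a TypeError, but only the decoded version yields the link as it appears in the page; the str() version carries the escaped byte sequences along.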
+3
Mar 03 '11 at 17:55

Well, my version of Python has no "request" attribute in urllib, and if I use "urllib.urlopen(url)" I do not get a string back; I get an object. Passing that object to the regex is a type error.

+1
Mar 03 2018-11-11T00:

The Google URL you used didn't work for me, so I replaced it with http://www.google.com/ig?hl=en, which does.

Try the following:

    import re
    import urllib.request

    url = "http://www.google.com/ig?hl=en"
    linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
    m = urllib.request.urlopen(url)
    msg = m.read()  # returns bytes
    links = linkregex.findall(str(msg))  # convert so the string pattern can search it
    print(links)

Hope this helps.

+1
03 Mar. '11 at 18:04

The regex pattern and string must be of the same type. If you match a regular string, you need a string pattern. If you are matching a byte string, you need a byte pattern.

In this case, m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are Unicode strings, and you need the b prefix to make the pattern literal a byte string:

 linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>') 
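A short sketch of the "same type" rule, using made-up sample bytes in place of a real download. The pattern here is a simplified, raw-literal variant of the one above, written only for this illustration.

```python
import re

# Made-up sample standing in for the bytes returned by m.read().
html_bytes = b'<a href="http://example.com/">home</a>'

str_pattern = re.compile(r'<a\s+href=[\'"](.*?)[\'"]')     # str pattern
bytes_pattern = re.compile(rb'<a\s+href=[\'"](.*?)[\'"]')  # bytes pattern

# Mixing the two types raises TypeError.
try:
    str_pattern.findall(html_bytes)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = str(exc)

# Matching types work: bytes with bytes, str with str.
bytes_links = bytes_pattern.findall(html_bytes)
str_links = str_pattern.findall(html_bytes.decode("ascii"))

print(mismatch_error)
print(bytes_links)
print(str_links)
```

Note that a bytes pattern returns bytes matches, so the results still need decoding if text is wanted later.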
+1
May 7 '13 at 14:54

This worked for me in Python 3. Hope this helps:

    import urllib.request
    import re

    urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
    i = 0
    regex = '<title>(.+?)</title>'
    pattern = re.compile(regex)

    while i < len(urls):
        htmlfile = urllib.request.urlopen(urls[i])
        htmltext = htmlfile.read()
        titles = re.search(pattern, str(htmltext))
        print(titles)
        i += 1

Here is a second version, in which I added b before the regex to turn it into a bytes pattern:

    import urllib.request
    import re

    urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
    i = 0
    regex = b'<title>(.+?)</title>'
    pattern = re.compile(regex)

    while i < len(urls):
        htmlfile = urllib.request.urlopen(urls[i])
        htmltext = htmlfile.read()
        titles = re.search(pattern, htmltext)
        print(titles)
        i += 1
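Both versions print the match object rather than the title itself. As a possible refinement, the sketch below pulls out the captured group and decodes it. The `extract_title` helper, the example.com URLs, and the sample page bodies are hypothetical, and the bodies stand in for what `htmlfile.read()` would return so the example runs without network access. UTF-8 is an assumed encoding.

```python
import re

# Bytes pattern, since the downloaded page bodies are bytes.
TITLE_RE = re.compile(rb"<title>(.+?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_bytes, encoding="utf-8"):
    """Return the page title as text, or None if no <title> element is found."""
    match = TITLE_RE.search(html_bytes)
    if match is None:
        return None
    # group(1) is the bytes captured between the tags; decode it to str.
    return match.group(1).decode(encoding, errors="replace").strip()


# Made-up page bodies keyed by made-up URLs.
pages = {
    "https://example.com/one": b"<html><head><title>First page</title></head></html>",
    "https://example.com/two": b"<html><body>no title here</body></html>",
}

# Iterating over the items directly avoids the manual index counter.
titles = {url: extract_title(body) for url, body in pages.items()}
for url, title in titles.items():
    print(url, title)
```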
0
Jul 16 '16 at 18:15


