Python TypeError in regex

Question

So, I have this code:

 url = 'http://google.com'
 linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
 m = urllib.request.urlopen(url)
 msg = m.read()
 links = linkregex.findall(msg)

But then python returns this error:

 links = linkregex.findall(msg)
 TypeError: can't use a string pattern on a bytes-like object

What have I done wrong?

+51

python python-3.x regex typeerror

kamikaze_pilot Mar 03 '11 at 17:50

6 answers

If you are using Python 2.6, there is no "request" in "urllib". So the third line becomes:

 m = urllib.urlopen(url)

And in version 3 you should use this:

 links = linkregex.findall(str(msg))

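This is needed because "msg" is a bytes object, not the string that findall() expects when the pattern is a string. Note that str(msg) produces the printable representation of the bytes (it starts with b' and keeps escapes such as \n as literal text), so it is usually better to decode using the correct encoding. For example, if the encoding is "latin1", then: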

 links = linkregex.findall(msg.decode("latin1"))
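To make the difference between str() and decode() concrete, here is a minimal sketch. The sample bytes value is made up for illustration and stands in for whatever m.read() returns; the pattern is a simplified raw-string version of the one in the question.

```python
import re

# Hypothetical sample standing in for the bytes returned by m.read().
msg = b'<a href="http://example.com/">caf\xe9</a>\n'

linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

# str() on bytes gives the repr: it starts with b' and keeps escapes as text.
as_repr = str(msg)

# decode() turns the bytes into real text using the named encoding.
as_text = msg.decode("latin1")

links_from_repr = linkregex.findall(as_repr)
links_from_text = linkregex.findall(as_text)

print(as_repr)
print(as_text)
print(links_from_repr, links_from_text)
```

Both calls find the link in this sample, but only the decoded text contains the real characters. The repr version carries a leading b' and literal backslash escapes, which can end up inside matches on less tidy input.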

+3

Morten Kristensen Mar 03 '11 at 17:55

Well, my version of Python does not have a urllib with the request attribute, but if I use urllib.urlopen(url), I do not get a string back; I get an object. That is a type error.

+1

Jeremy Whitlock Mar 03

The URL that you gave didn't work for me, so I replaced it with http://www.google.com/ig?hl=en, which works for me.

Try the following:

 import re
 import urllib.request
 url = "http://www.google.com/ig?hl=en"
 linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
 m = urllib.request.urlopen(url)
 msg = m.read()  # returns bytes in Python 3
 links = linkregex.findall(str(msg))  # convert to str so the string pattern applies
 print(links)

Hope this helps.

+1

John 03 Mar. '11 at 18:04

The regex pattern and string must be of the same type. If you match a regular string, you need a string pattern. If you are matching a byte string, you need a byte pattern.

In this case, m.read() returns a bytes string, so you need a bytes pattern. In Python 3, regular strings are Unicode strings, and you need the b prefix to write a bytes literal:

 linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
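The same-type rule can be checked with a short sketch. The input bytes here are a made-up sample, and the pattern is a simplified form of the one above; note that a bytes pattern returns bytes matches, which you may still want to decode afterwards.

```python
import re

# Hypothetical bytes input, standing in for m.read().
data = b'<a href="http://example.com/">link</a>'

str_pattern = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')
bytes_pattern = re.compile(rb'<a\s*href=[\'"](.*?)[\'"].*?>')

# Mixing a str pattern with bytes input raises TypeError.
try:
    str_pattern.findall(data)
    mixed_error = None
except TypeError as exc:
    mixed_error = str(exc)

# A bytes pattern works on bytes input and yields bytes results.
bytes_links = bytes_pattern.findall(data)

# Decode the matches if text is needed (ASCII is enough for this sample).
text_links = [link.decode("ascii") for link in bytes_links]

print(mixed_error)
print(bytes_links)
print(text_links)
```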

+1

Seppo Enarvi May 7 '13 at 14:54

This worked for me in python3. Hope this helps

 import urllib.request
 import re
 urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
 i = 0
 regex = '<title>(.+?)</title>'
 pattern = re.compile(regex)
 while i < len(urls):
     htmlfile = urllib.request.urlopen(urls[i])
     htmltext = htmlfile.read()
     titles = re.search(pattern, str(htmltext))
     print(titles)
     i += 1

This also works; here I added b before the regex to make it a bytes pattern.

 import urllib.request
 import re
 urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
 i = 0
 regex = b'<title>(.+?)</title>'
 pattern = re.compile(regex)
 while i < len(urls):
     htmlfile = urllib.request.urlopen(urls[i])
     htmltext = htmlfile.read()
     titles = re.search(pattern, htmltext)
     print(titles)
     i += 1
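A third option, sketched here as an assumption rather than something from the answers, is to decode the body with the charset the server declares and then use an ordinary string pattern. The helper name extract_title and the demo values are hypothetical. The demo builds headers by hand so it runs offline; with a live response you would pass htmlfile.headers and htmlfile.read() instead.

```python
import re
from email.message import Message

TITLE_RE = re.compile(r"<title>(.+?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(raw, headers, fallback="utf-8"):
    """Decode raw bytes using the charset named in the headers, then search."""
    # get_content_charset() returns None when no charset is declared.
    charset = headers.get_content_charset() or fallback
    text = raw.decode(charset, errors="replace")
    match = TITLE_RE.search(text)
    return match.group(1).strip() if match else None

# Offline demonstration with hand-built headers (hypothetical values).
demo_headers = Message()
demo_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
demo_title = extract_title(b"<html><title>Caf\xe9 News</title></html>", demo_headers)
print(demo_title)
```

The fallback of utf-8 is a guess for pages that declare no charset; a page may also declare its encoding in a meta tag, which this sketch does not inspect.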

0

user3022012 Jul 16 '16 at 18:15

Lennart Regebro · Accepted Answer · 2011-03-03 19:23

TypeError: can't use a string pattern on a bytes-like object
What did I do wrong?

You used a string pattern on a bytes object. Use a bytes pattern instead:

 linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                        ^
             Add the b there, it makes it into a bytes object

(ps:

  >>> from disclaimer include dont_use_regexp_on_html
  "Use BeautifulSoup or lxml instead."

)
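As a rough illustration of the parser-based approach, here is a sketch that collects links without a regex. It uses the standard library's html.parser rather than BeautifulSoup or lxml, purely to avoid a third-party dependency; the class name LinkCollector and the sample markup are made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag is lowercased by the parser; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# Hypothetical sample markup with mixed case and quote styles.
sample = '<p><a class="x" href="http://example.com/">one</a> <A HREF=\'/two\'>two</A></p>'

collector = LinkCollector()
collector.feed(sample)
collector.close()
found_links = collector.links
print(found_links)
```

The parser expects text, so bytes from m.read() would still need to be decoded before calling feed(). Unlike the regex, it handles attributes in any order and tags in any case.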
