Python TypeError in regex
So, I have this code:
    url = 'http://google.com'
    linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
    m = urllib.request.urlopen(url)
    msg = m.read()
    links = linkregex.findall(msg)

But then Python returns this error:
    links = linkregex.findall(msg)
    TypeError: can't use a string pattern on a bytes-like object

What have I done wrong?
You used a string pattern on a bytes object. Use a bytes pattern instead:
    linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                           ^
Add the b there; it turns the pattern into a bytes object.

(P.S.:

    >>> from disclaimer include dont_use_regexp_on_html
    "Use BeautifulSoup or lxml instead."

)
If you are using Python 2.6, there is no "request" in "urllib", so the third line becomes:

    m = urllib.urlopen(url)

And in version 3 you can use this:

    links = linkregex.findall(str(msg))

because "msg" is a bytes object, not a string, and findall() with a string pattern expects a string. Note that str(msg) yields the printable representation of the bytes (a leading b' and escape sequences such as \n in place of the real characters), so it is usually better to decode using the correct encoding. For example, if the encoding is "latin1", then:

    links = linkregex.findall(msg.decode("latin1"))

Well, my version of Python does not have urllib with the request attribute, but if I use "urllib.urlopen(url)", I do not get back a string, I get an object. That is why there is a type error.
The Google URL you gave didn't work for me, so I replaced it with http://www.google.com/ig?hl=en, which works for me.

Try the following:

    import re
    import urllib.request

    url = "http://www.google.com/ig?hl=en"
    # Raw string, so \s reaches the regex engine unchanged.
    linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')
    m = urllib.request.urlopen(url)
    msg = m.read()  # bytes, not str
    # str() turns the bytes into their printable representation.
    links = linkregex.findall(str(msg))
    print(links)

Hope this helps.
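To make the difference between str(msg) and msg.decode(...) concrete, here is a minimal sketch that needs no network access. The msg value is a made-up stand-in for what m.read() returns, UTF-8 is an assumed encoding, and the pattern is a raw-string variant of the one in the question.

```python
import re

# Hypothetical stand-in for the bytes returned by m.read().
msg = b'<a href="/caf\xc3\xa9">menu</a>\n<a href="/about">about</a>'

# Raw-string variant of the pattern from the question.
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

# str() on bytes gives the repr: it starts with b' and keeps escape
# sequences such as \xc3 as literal backslash text.
as_repr = str(msg)

# decode() gives the real text (UTF-8 is assumed for this sample).
as_text = msg.decode("utf-8")

print(as_repr[:2])
print(linkregex.findall(as_repr))
print(linkregex.findall(as_text))
```

Both calls find two links, but only the decoded text yields the non-ASCII characters as they appear on the page; the repr version returns them as backslash escapes.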
The regex pattern and the string must be of the same type. If you are matching a regular string, you need a string pattern. If you are matching a byte string, you need a bytes pattern.
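The rule can be checked with a small sketch. The sample text and the simplified href pattern are made up for illustration; only the re calls themselves come from the standard library.

```python
import re

text = '<a href="/docs">docs</a>'   # str
data = text.encode("ascii")          # bytes with the same content

str_pattern = re.compile(r'href="(.*?)"')
bytes_pattern = re.compile(rb'href="(.*?)"')

# Matching types work, and the results have the same type as the input.
print(str_pattern.findall(text))
print(bytes_pattern.findall(data))

# Mixed types raise TypeError in either direction.
errors = []
for pattern, subject in ((str_pattern, data), (bytes_pattern, text)):
    try:
        pattern.findall(subject)
    except TypeError as exc:
        errors.append(str(exc))
        print("TypeError:", exc)
```

The first failing combination is the one from the question: a string pattern applied to a bytes object.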
In this case, m.read() returns a bytes string, so you need a bytes pattern. In Python 3, regular strings are Unicode strings, and you need the b prefix to write the pattern as a bytes literal:

    linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')

This worked for me in python3. Hope this helps
    import urllib.request
    import re

    urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
    regex = '<title>(.+?)</title>'
    pattern = re.compile(regex)

    for url in urls:
        htmlfile = urllib.request.urlopen(url)
        htmltext = htmlfile.read()  # bytes
        # str() converts the bytes so the string pattern can be applied.
        titles = re.search(pattern, str(htmltext))
        print(titles)

And also this, in which I added b before the regex to turn it into a bytes pattern:
    import urllib.request
    import re

    urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
    regex = b'<title>(.+?)</title>'
    pattern = re.compile(regex)

    for url in urls:
        htmlfile = urllib.request.urlopen(url)
        htmltext = htmlfile.read()  # bytes, matched directly by the bytes pattern
        titles = re.search(pattern, htmltext)
        print(titles)
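Both scripts print the match object, not the title text. Here is a rough sketch of pulling the title out as a string with the bytes pattern. The extract_title helper and the sample page bodies are hypothetical and stand in for htmlfile.read() so that it runs offline. UTF-8 is an assumed encoding, and the IGNORECASE and DOTALL flags are additions that the scripts above do not use.

```python
import re

# Bytes pattern, since urlopen(...).read() returns bytes.
# DOTALL lets the title span several lines; IGNORECASE accepts <TITLE>.
TITLE_RE = re.compile(rb"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_bytes, encoding="utf-8"):
    """Return the page title as str, or None if no <title> is found."""
    match = TITLE_RE.search(html_bytes)
    if match is None:
        return None
    # group(1) is bytes; decode it and trim surrounding whitespace.
    return match.group(1).decode(encoding, errors="replace").strip()


# Hypothetical page bodies standing in for htmlfile.read().
pages = [
    b"<html><head><title>Google</title></head></html>",
    b"<html><head><TITLE>\n  Breaking News \n</TITLE></head></html>",
    b"<html><body>no title here</body></html>",
]

for body in pages:
    print(extract_title(body))
```

In the loops above, extract_title(htmltext) could take the place of the re.search call, provided the page really is encoded as assumed.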