UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

I am trying to write a scraper, but I am having problems with the encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me that it did not recognize the encoding, even though the string contains no special characters. I don't know whether that is useful information.

My code is as follows:

    from urllib import FancyURLopener
    import os

    class MyOpener(FancyURLopener): #spoofs a real browser on Window
        version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

    print "What is the webaddress?"
    webaddress = raw_input("8::>")

    print "Folder Name?"
    foldername = raw_input("8::>")

    if not os.path.exists(foldername):
        os.makedirs(foldername)

    def urlpuller(start, page):
        while page[start]!= '"':
            start += 1
        close = start
        while page[close]!='"':
            close += 1
        return page[start:close]

    myopener = MyOpener()

    response = myopener.open(webaddress)
    site = response.read()

    nexturl = ''
    counter = 0

    while(nexturl!=webaddress):
        counter += 1
        start = 0

        for i in range(len(site)-35):
            if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
                start = i + 40
                break
        else:
            print "Something broken, chief. Error = 1"

        next = 0

        for i in range(start, 8, -1):
            if site[i:i+8] == u'<a href=':
                next = i
                break
        else:
            print "Something broken, chief. Error = 2"

        nexturl = urlpuller(next, site)

        myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

    print("Retrieval of "+foldername+" completed.")

When I run it against the site I want to scrape, it raises an error:

    Traceback (most recent call last):
      File "yada/yadayada/Python/scraper.py", line 37, in <module>
        if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

When I pointed it at http://google.com, it worked fine. The page I am scraping declares its encoding as:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 

but when I try to decode using utf-8, as you can see, it does not work.

Any suggestions?

2 answers
 site[i:i+35].decode('utf-8') 

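You cannot slice the received bytes at an arbitrary position and then ask Python to decode the piece as UTF-8. UTF-8 is a variable-width encoding: one character takes 1 to 4 bytes (the original design allowed up to 6). If a slice cuts a character in half, decoding it fails with an unexpected end of data error. That is what happens here: 0xc3 is the first byte of a two-byte character, and it sits at position 34, the last byte of your 35-byte slice, so its second byte was cut off.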
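To see the failure in isolation, here is a minimal sketch. The page fragment and the helper name try_decode are made up for illustration; the code is written to run on both Python 2.7 and Python 3.

```python
# Hypothetical page fragment. The character U+00E9 (e with acute accent)
# is encoded in UTF-8 as the two bytes 0xC3 0xA9.
page = u'<p>caf\u00e9</p>'.encode('utf-8')

def try_decode(chunk):
    """Return the decoded text, or the decoder's reason if chunk is not valid UTF-8."""
    try:
        return chunk.decode('utf-8')
    except UnicodeDecodeError as exc:
        return 'error: ' + exc.reason

whole = try_decode(page)      # the complete byte string decodes cleanly
cut = try_decode(page[:7])    # this slice ends on the lead byte 0xC3
```

Here whole holds the decoded text, while cut holds the same "unexpected end of data" reason as in the traceback, because the slice keeps 0xC3 but drops the 0xA9 that completes the character.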

Consider using a tool that was built for this job. BeautifulSoup and lxml are two options.
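As a rough illustration of the parser approach, the sketch below uses the standard library's html.parser as a dependency-free stand-in for BeautifulSoup or lxml, so it is not either of those libraries' APIs. It is written for Python 3 (on 2.7 the module is named HTMLParser). The class name SlideFinder and the sample markup are assumptions; in particular, it assumes the link to the next page is the nearest a tag opened before the image, which is what the backwards search in the question implies.

```python
from html.parser import HTMLParser

class SlideFinder(HTMLParser):
    """Records the src of <img id="imgSized"> and the href of the last <a> opened before it."""

    def __init__(self):
        super().__init__()
        self.last_href = None   # href of the most recently opened <a>
        self.image_src = None   # src of the first matching <img>
        self.next_href = None   # value of last_href when the image was found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a':
            self.last_href = attrs.get('href')
        elif tag == 'img' and attrs.get('id') == 'imgSized' and self.image_src is None:
            self.image_src = attrs.get('src')
            self.next_href = self.last_href

# Hypothetical response body, including a non-ASCII character in the image path.
raw = ('<a href="/page2.html"><img id="imgSized" class="slideImg" '
       'src="/img/caf\u00e9.jpg"></a>').encode('utf-8')

finder = SlideFinder()
finder.feed(raw.decode('utf-8'))  # decode the whole response once, then parse
```

The point is the same one a dedicated library gives you: the bytes are decoded once as a whole, and attributes are read by name, so no fixed offsets or byte slices are involved.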


Instead of your for loop, do something like:

 start = site.decode('utf-8').find('<img id="imgSized" class="slideImg"') + 40 
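Two things to watch with that one-liner: find returns -1 when the marker is missing, and the position it returns is a character index into the decoded text, not a byte index into site. A slightly more defensive sketch, with a made-up helper name and sample input, that runs on both Python 2.7 and Python 3:

```python
MARKER = u'<img id="imgSized" class="slideImg"'

def text_after_marker(raw_bytes, marker=MARKER):
    """Decode the whole page once and return the text following marker, or None if absent."""
    text = raw_bytes.decode('utf-8')
    pos = text.find(marker)
    if pos == -1:
        return None
    # Slice the decoded text, not the original bytes: the two use different indices
    # as soon as the page contains a multi-byte character.
    return text[pos + len(marker):]

# Hypothetical page with a multi-byte character before the marker.
raw = u'<p>caf\u00e9</p><img id="imgSized" class="slideImg" src="/a.jpg">'.encode('utf-8')
rest = text_after_marker(raw)
missing = text_after_marker(b'<p>nothing here</p>')
```

Using len(marker) instead of a hard-coded 40 is a design choice of this sketch; the rest of the scraper would then need to work on the decoded text as well.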

Source: https://habr.com/ru/post/970211/

