UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

Question


I am trying to write a scraper, but I am having problems with encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me that it did not recognize the encoding, even though the string contains no special characters. I do not know whether that is useful information.

My code is as follows:

    from urllib import FancyURLopener
    import os

    class MyOpener(FancyURLopener): #spoofs a real browser on Window
        version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

    print "What is the webaddress?"
    webaddress = raw_input("8::>")

    print "Folder Name?"
    foldername = raw_input("8::>")

    if not os.path.exists(foldername):
        os.makedirs(foldername)

    def urlpuller(start, page):
        while page[start]!= '"':
            start += 1
        close = start
        while page[close]!='"':
            close += 1
        return page[start:close]

    myopener = MyOpener()

    response = myopener.open(webaddress)
    site = response.read()

    nexturl = ''
    counter = 0

    while(nexturl!=webaddress):
        counter += 1
        start = 0

        for i in range(len(site)-35):
            if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
                start = i + 40
                break
        else:
            print "Something broken, chief. Error = 1"

        next = 0

        for i in range(start, 8, -1):
            if site[i:i+8] == u'<a href=':
                next = i
                break
        else:
            print "Something broken, chief. Error = 2"

        nexturl = urlpuller(next, site)

        myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

    print("Retrieval of "+foldername+" completed.")

When I run it against the site I am scraping, it raises this error:

    Traceback (most recent call last):
      File "yada/yadayada/Python/scraper.py", line 37, in <module>
        if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

When pointed at http://google.com, it worked fine. The page I am scraping declares its encoding as UTF-8:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

but when I try to decode using utf-8, as you can see, it does not work.

Any suggestions?

+6

python utf-8 character-encoding decoding

user3701032 Jun 2 '14 at 22:26


2 answers

Instead of your for loop, do something like:

 start = site.decode('utf-8').find('<img id="imgSized" class="slideImg"') + 40
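One thing to watch with this approach: `find` returns -1 when the marker is missing, so adding 40 to it would silently give a wrong offset. Below is a minimal sketch of the same idea with that case handled. The helper name `locate` and the sample markup are made up for illustration; only the marker string comes from the question.

```python
# Hypothetical helper; MARKER is the tag prefix from the question.
MARKER = u'<img id="imgSized" class="slideImg"'


def locate(page_bytes, marker=MARKER, encoding='utf-8'):
    """Decode the whole page once, then search the decoded text.

    Returns (decoded_page, index_of_marker), with index None if absent.
    """
    page = page_bytes.decode(encoding)   # one decode for the full response
    pos = page.find(marker)              # find() returns -1 when not found
    return page, (pos if pos != -1 else None)


# Made-up sample page containing a two-byte character before the marker.
sample = (u'<p>caf\u00e9</p>'
          u'<img id="imgSized" class="slideImg" src="a.jpg">').encode('utf-8')

page, pos = locate(sample)
print(pos)                               # index into the decoded text
print(locate(b'<html></html>')[1])       # marker absent
```

The index counts characters in the decoded text, not bytes in the raw response, so it should be used to slice `page` rather than the original byte string.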

0

Daniel Jun 2 '14 at 10:31


Martin Konecny · Accepted Answer · 2014-06-02T22:32:46+0000

 site[i:i+35].decode('utf-8')

You cannot slice the received bytes at arbitrary offsets and then ask Python to decode each slice as UTF-8. UTF-8 is a multi-byte encoding: a single character takes anywhere from 1 to 4 bytes (the original design allowed up to 6). If a slice ends partway through a character, as yours does when its last byte is the lead byte 0xc3, decoding it raises the "unexpected end of data" error.
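The failure can be reproduced in isolation. This is a minimal sketch with a made-up sample string (the names `text`, `data`, and `chopped` are illustrative and not taken from the page in the question):

```python
# Illustrative values only; none of these come from the scraped site.
text = u'caf\u00e9'              # the last character is U+00E9
data = text.encode('utf-8')      # U+00E9 becomes two bytes: 0xc3 0xa9

# Decoding the complete byte string works.
assert data.decode('utf-8') == text

# Slicing between 0xc3 and 0xa9 leaves a dangling lead byte.
chopped = data[:4]
try:
    chopped.decode('utf-8')
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason
    print(exc)

# Comparing raw bytes with raw bytes needs no decoding of slices at all.
assert data[:3] == b'caf'
```

Here the slice ends on 0xc3, the same lead byte reported in the traceback, so the decoder runs out of input before the character is complete.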

Look into a tool that already does this for you. BeautifulSoup and lxml are two alternatives.
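BeautifulSoup and lxml are third-party packages. As a dependency-free sketch of the same idea, the example below swaps in the standard library's `HTMLParser` instead. The class name `SlideImageFinder`, the sample markup, and the assumption that the image URL lives in a `src` attribute are all hypothetical; the real page may be laid out differently.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


class SlideImageFinder(HTMLParser):
    """Collect the src attribute of every <img id="imgSized"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('id') == 'imgSized':
            self.sources.append(attrs.get('src'))


# Made-up sample markup; assumed, not copied from the site in the question.
sample = (u'<a href="next.html">caf\u00e9</a>'
          u'<img id="imgSized" class="slideImg" src="pic1.jpg">').encode('utf-8')

finder = SlideImageFinder()
finder.feed(sample.decode('utf-8'))      # decode the whole document once
finder.close()
print(finder.sources)
```

The parser works on the fully decoded text and matches tags by name and attribute, so no byte offsets or fixed-width slices are involved.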
