Beautiful Soup and UnicodeDecodeError

I am trying to crawl a page, but I get a UnicodeDecodeError. Here is my code:

 def soup_def(link):
     req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
     usock = urllib2.urlopen(req)
     encoding = usock.headers.getparam('charset')
     page = usock.read().decode(encoding)
     usock.close()
     soup = BeautifulSoup(page)
     return soup

 soup = soup_def("http://www.geekbuying.com/item/Ainol-Novo-10-Hero-II-Quad-Core--Tablet-PC-10-1-inch-IPS-1280-800-1GB-RAM-16GB-ROM-Android-4-1--HDMI-313618.html")

And the error:

 UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 284: invalid start byte 

I checked that a few more users have the same error, but I cannot figure out any solution.
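One common cause is that the server reports no charset, or the wrong one, so the declared encoding fails on the actual bytes. As a sketch of a workaround (not from any library; `decode_best_effort` is a made-up helper name), you could try the declared encoding first and fall back to a few common ones:

```python
def decode_best_effort(raw, declared=None):
    """Try the declared charset first, then a few common fallbacks."""
    for enc in (declared, 'utf-8', 'utf-16', 'latin-1'):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # last resort: never fail, but mark undecodable bytes
    return raw.decode('utf-8', errors='replace')

# A UTF-16 BOM (0xFF 0xFE) makes a plain UTF-8 decode fail with
# "invalid start byte", just like the traceback above:
raw = u'hello'.encode('utf-16')
text = decode_best_effort(raw, declared='utf-8')  # falls back to utf-16
```

This is only a heuristic; if the fallback list is wrong for a given site you will still get mojibake rather than an exception.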

2 answers

Here is what I found on Wikipedia about the byte 0xff , which shows up in the UTF-16 byte order mark:

 UTF-16: In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1. If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1. Programs expecting UTF-8 may show these or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable).

So, I have two thoughts:

(1) The page may need to be decoded as UTF-16 rather than UTF-8.

(2) The error may occur because you are trying to print the entire soup to the screen, which relies on your IDE (Eclipse/PyCharm) being able to render those Unicode characters.

If I were you, I would skip printing the whole soup and extract only the part you actually want. See whether the problem still occurs at that step. If it doesn't, there is no reason to worry about being unable to print the entire soup to the screen.

If you really want to print the soup on screen, try:

 print soup.prettify(encoding='utf-16') 

Another possibility is that you are trying to parse a hidden file, which is very common on Mac computers.

Add a simple if statement so that you only create BeautifulSoup objects from files that are actually HTML:

 for root, dirs, files in os.walk(folderPath, topdown=True):
     for fileName in files:
         if fileName.endswith(".html"):
             soup = BeautifulSoup(open(os.path.join(root, fileName)).read(), 'lxml')

Source: https://habr.com/ru/post/957999/
