Python Unicode special character issues

 #!/usr/bin/env python
 # -*- coding: utf_8 -*-
 import re

 def splitParagraphIntoSentences(paragraph):
     ''' break a paragraph into sentences
         and return a list '''
     # to split by multiple characters,
     # regular expressions are easiest (and fastest)
     # flags belong in re.compile(); the second argument of split() is maxsplit
     sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])', re.UNICODE)
     sentenceList = sentenceEnders.split(paragraph)
     return sentenceList

 if __name__ == '__main__':
     p = "While other species (eg horse mango, M. foetida) are also grown, Mangifera indica – the common mango or Indian mango – Sheffield’s only mango tree is valued at £9.2 billion."
     sentences = splitParagraphIntoSentences(p)
     for s in sentences:
         print s.strip()

Expected result: While other species (eg horse mango, M. foetida) are also grown, Mangifera indica – the common mango or Indian mango – Sheffield’s only mango tree is valued at £9.2 billion.

Result: While other species (eg horse mango, M. foetida) are also grown, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô SheffieldΓÇÖs only mango tree is valued at ┬ú9.2 billion.

Ignore the meaning of the sentence; the point is that the script cannot handle special characters such as "–", "£", "’" and others. I tried setting up a sitecustomize.py file and declaring other encodings in this code, such as ascii, utf-32, cp-500, iso8859_15 and utf-8, but could not solve it. Sorry, I'm new to Python. Thank you in advance.

+2
4 answers

Found a solution.

The following code snippet works very well.

 p = p.encode('utf-8') if isinstance(p,unicode) else p 
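
A slightly more explicit sketch of the same idea follows. The helper name to_utf8_bytes and the text_type alias are illustrative assumptions, not part of the snippet above; the alias only exists so the code also runs on Python 3, where the unicode built-in is gone. The result is UTF-8 bytes, so it will presumably display correctly only on a terminal that decodes UTF-8.

```python
# Hypothetical helper; the names are illustrative, not from the original post.
text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def to_utf8_bytes(p):
    """Return UTF-8 bytes for text input; pass byte strings through unchanged."""
    if isinstance(p, text_type):
        return p.encode('utf-8')
    return p


encoded = to_utf8_bytes(u'\u00a39.2 billion')  # u'\u00a3' is the pound sign
```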
+2

Using Unicode string literals as recommended by Nam is correct, but if your terminal uses the cp437 code page, as your output shows, it will not be able to display some of the Unicode characters you want to use. The Windows console does not support UTF-8, yet UTF-8 bytes are what you send when you declare coding: utf-8 in the source file and do not use Unicode literals. The coding: utf-8 line declares the encoding of your source file, so make sure you actually save the source as UTF-8.
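
The gibberish in the question can be reproduced without a Windows console. The sketch below encodes text as UTF-8 and then decodes the bytes as cp437, which is roughly what the console does with them. The function name as_seen_on_cp437 is made up for illustration, and the characters are written as escapes so the example does not depend on how the file is saved.

```python
def as_seen_on_cp437(text):
    """Encode text as UTF-8, then decode the bytes the way a cp437 console would."""
    return text.encode('utf-8').decode('cp437')


pound = as_seen_on_cp437(u'\u00a39.2 billion')  # U+00A3 POUND SIGN
quote = as_seen_on_cp437(u'Sheffield\u2019s')   # U+2019 RIGHT SINGLE QUOTATION MARK
dash = as_seen_on_cp437(u'\u2013')              # U+2013 EN DASH
```

Each non-ASCII character becomes two or three unrelated cp437 characters, matching the ┬ú, ΓÇÖ and ΓÇô sequences seen in the console output.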

When you use a Unicode literal, Python interprets the source string in the declared encoding and converts it to a Unicode string. When you print a Unicode string, Python encodes it in the terminal's encoding or, if no terminal encoding can be determined, falls back to the ascii default in Python 2.
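
As a rough model of that encoding step, the sketch below encodes text with an explicit encoding, or with whatever sys.stdout reports, falling back to ascii. The helper encode_for_terminal is a hypothetical name, and this only approximates what print does internally.

```python
import sys


def encode_for_terminal(text, encoding=None):
    """Encode text with the terminal encoding, falling back to ascii if unknown."""
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding)


# The pound sign exists in cp437, so this succeeds.
ok = encode_for_terminal(u'\u00a39.2 billion', 'cp437')

# U+2019 (right single quotation mark) is not in cp437, so this raises.
try:
    encode_for_terminal(u'Sheffield\u2019s', 'cp437')
    failed = False
except UnicodeEncodeError:
    failed = True
```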

Example:

 # coding: utf8
 print '£9.2 billion'   # Sends UTF-8 to cp437 terminal (gibberish)
 print u'£9.2 billion'  # Correctly prints on cp437 terminal.
 print 'Sheffield’s'    # Sends UTF-8 to cp437 terminal (gibberish)
 # Replaces Unicode characters that are unsupported in cp437.
 print u'Sheffield’s £9.2 billion'.encode('cp437','xmlcharrefreplace')
 print u'Sheffield’s'   # UnicodeEncodeError.

Output:

 ┬ú9.2 billion
 £9.2 billion
 SheffieldΓÇÖs
 Sheffield&#8217;s £9.2 billion
 Traceback (most recent call last):
   File "C:\Documents and Settings\metolone\Desktop\x.py", line 10, in <module>
     print u'SheffieldΓÇÖs' # UnicodeEncodeError.
   File "C:\dev\python27\lib\encodings\cp437.py", line 12, in encode
     return codecs.charmap_encode(input,errors,encoding_map)
 UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 9: character maps to <undefined>

So don't expect the Windows console to print every Unicode character correctly. Use a Python development environment that supports UTF-8, such as PythonWin (available in the pywin32 extensions).
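
If lossy output is acceptable, the errors argument of encode decides what happens to characters the code page lacks. The sketch below compares three standard handlers on the sentence fragment from the example; the variable names are illustrative.

```python
# U+2019 (right single quotation mark) is missing from cp437; U+00A3 (pound) is not.
text = u'Sheffield\u2019s \u00a39.2 billion'

as_xml = text.encode('cp437', 'xmlcharrefreplace')     # numeric character reference
as_question = text.encode('cp437', 'replace')          # question mark placeholder
as_escape = text.encode('cp437', 'backslashreplace')   # Python-style escape
```

All three avoid the UnicodeEncodeError, at the cost of not showing the original character.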

You need two things to display Unicode characters correctly in the Windows console: an encoding that can represent the characters you want to display, and a font that has glyphs for those characters. For example, if you change the console code page to Windows-1252 (chcp 1252) and change the console font to Consolas or Lucida Console instead of Raster Fonts, your original program will work if you use Unicode literals (p = u"...").
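
The encoding half of that requirement can be checked in code; the font half cannot. The sketch below lists which characters a given code page cannot represent. The helper unencodable_chars is a hypothetical name chosen for this example.

```python
def unencodable_chars(text, encoding):
    """Return the characters in text that the given encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


sample = u'\u2013 \u00a3 \u2019'  # en dash, pound sign, right single quotation mark
missing_cp437 = unencodable_chars(sample, 'cp437')
missing_cp1252 = unencodable_chars(sample, 'cp1252')
```

For this sample, cp437 lacks the en dash and the quotation mark, while cp1252 covers all three characters.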

+2

Your output looks like cp437. Try the following:

 import codecs, sys
 sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
 print u"valued at £9.2 billion."

This works for me in Python 2.6.
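
To see what the wrapper does, the sketch below applies the same codecs.getwriter call to an in-memory byte buffer instead of sys.stdout. The buffer is only a stand-in assumed for illustration; on a real console the bytes still have to match what the terminal expects.

```python
import codecs
import io

raw = io.BytesIO()                       # stand-in for a byte-oriented stdout
writer = codecs.getwriter('utf-8')(raw)  # StreamWriter that encodes on write
writer.write(u'valued at \u00a39.2 billion.')
sent = raw.getvalue()                    # the bytes the terminal would receive
```

The pound sign leaves as the two bytes C2 A3, which a UTF-8 terminal shows as £ and a cp437 terminal shows as ┬ú.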

+1

 p = "While other species..."

should be changed to

 p = u"While other species..." 

Note the u before the opening quote.

What you need is a so-called Unicode literal. In Python 2, string literals are not Unicode by default.
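
A minimal sketch of the question's splitter working on a Unicode literal is shown below. The function name split_sentences and the sample text are assumptions made for illustration; the regular expression is the one from the question, with the flag passed to re.compile.

```python
import re

# Sentence end: ., ! or ?, then 1-2 whitespace characters, then a capital letter.
SENTENCE_END = re.compile(r'[.!?]\s{1,2}(?=[A-Z])', re.UNICODE)


def split_sentences(paragraph):
    """Split a Unicode paragraph into stripped sentences."""
    return [s.strip() for s in SENTENCE_END.split(paragraph)]


text = u'The tree is valued at \u00a39.2 billion. Sheffield\u2019s only mango tree.'
sentences = split_sentences(text)

# One character of text is two bytes once encoded as UTF-8.
text_length = len(u'\u00a3')
utf8_length = len(u'\u00a3'.encode('utf-8'))
```

The length comparison is the heart of the problem: a plain Python 2 literal in a UTF-8 source file holds the two bytes, while the u-prefixed literal holds the single character.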

0

Source: https://habr.com/ru/post/918182/
