Python: Unicode in Windows terminal, which encoding is used?

I am using a Python interpreter in a Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.

I am typing:

 >>> s='ë'
 >>> s
 '\x89'
 >>> u=u'ë'
 >>> u
 u'\xeb'

Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?

I continue and type:

 >>> us=unicode(s)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal not in range(128)
 >>> us=unicode(s, 'latin-1')
 >>> us
 u'\x89'

Question 2: I used the latin-1 encoding on the off chance that it would turn the string into a unicode string (in fact, I tried a bunch of others first, including utf-8). How can I find out which encoding the terminal used to encode my string?

Question 3: How can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, silly me. print(s) does the job.

I have already looked at this related SO question, but found no hints there: Set Python terminal encoding on Windows

+6
8 answers

Unicode is not an encoding. You encode into byte strings and decode into Unicode:

 >>> '\x89'.decode('cp437')
 u'\xeb'
 >>> u'\xeb'.encode('cp437')
 '\x89'
 >>> u'\xeb'.encode('utf8')
 '\xc3\xab'
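
For readers on Python 3, the same round trip can be sketched as follows. This is only an illustration: in Python 3 the byte string is written b'...' and the Unicode string is a plain '...', and the variable names are made up for the example.

```python
# Python 3 sketch: bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.
raw = b'\x89'                        # the byte a cp437 console produces for ë

text = raw.decode('cp437')           # bytes -> Unicode text
print(ascii(text))                   # escaped form, safe on any terminal

cp437_bytes = text.encode('cp437')   # Unicode text -> bytes in the same code page
utf8_bytes = text.encode('utf-8')    # the same character as UTF-8 bytes
print(cp437_bytes, utf8_bytes)
```

The direction is the point: decode goes from bytes to text, encode goes from text to bytes.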

The Windows terminal uses legacy DOS code pages. For US Windows, this is:

 >>> import sys
 >>> sys.stdout.encoding
 'cp437'

Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:

 >>> import sys
 >>> sys.stdout.encoding
 'cp1252'

Your results may vary.
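
To see what your own session reports, a small helper along these lines may be useful. It is a sketch in Python 3 syntax; the function name report_encodings is invented for the example, and the values it prints depend entirely on your system.

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python believes are in effect for this session."""
    return {
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        'preferred': locale.getpreferredencoding(False),
        'default': sys.getdefaultencoding(),
    }

for name, value in sorted(report_encodings().items()):
    print('%-9s %s' % (name, value))
```

Running it once in the terminal and once in IDLE should make the difference between the two environments visible.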

+11

Avoid Windows Terminals

I'm not going out on a limb by saying that the "terminal", or more appropriately the "DOS prompt", that ships with Windows 7 is thoroughly inadequate. It was bad in Windows 95, NT, XP, Vista, and 7. Perhaps they fixed it with PowerShell; I don't know. Either way, it is indicative of the problems that troubled OS development at Microsoft at the time.

Output to file instead

Set the environment variable PYTHONIOENCODING, and then redirect the output to a file.

 set PYTHONIOENCODING=utf-8
 ./myscript.py > output.txt

Then, using Notepad++, you can see the UTF-8 version of your output.
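
The effect of the variable can also be checked from Python itself. The sketch below (Python 3 syntax, helper name run_with_utf8_output invented for the example) starts a child interpreter with PYTHONIOENCODING=utf-8 and captures the raw bytes it writes, which is roughly what the redirect to a file does.

```python
import os
import subprocess
import sys

def run_with_utf8_output(code):
    """Run `code` in a child interpreter with PYTHONIOENCODING=utf-8
    and return the raw bytes it wrote to stdout."""
    env = dict(os.environ, PYTHONIOENCODING='utf-8')
    return subprocess.check_output([sys.executable, '-c', code], env=env)

# The child prints the single character ë (U+00EB).
raw = run_with_utf8_output("print(u'\\xeb')")
print(repr(raw.strip()))
```

The captured bytes are the UTF-8 encoding of ë followed by a line ending, regardless of the console code page.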

Install win-unicode-console

win-unicode-console may solve your problems. You should give it a try:

 pip install win-unicode-console 

If you are interested in a thorough discussion of the issue of Python and command-line output, check out Python issue 1602. Otherwise, just use the win-unicode-console package.

 py -m run script.py 

That runs it per script. Alternatively, you can follow their instructions to apply win_unicode_console.enable() to every invocation by adding it to usercustomize or sitecustomize.
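
A usercustomize-style snippet might look roughly like this. It is a guarded sketch, not the package's official recipe: the wrapper function enable_unicode_console is made up here, and it simply does nothing when the package is missing or the platform is not Windows.

```python
import sys

def enable_unicode_console():
    """Try to switch on win-unicode-console; return True if it was enabled."""
    if sys.platform != 'win32':
        return False                 # the package only targets the Windows console
    try:
        import win_unicode_console   # third-party: pip install win-unicode-console
    except ImportError:
        return False                 # not installed; leave the streams alone
    win_unicode_console.enable()
    return True

enabled = enable_unicode_console()
```

Guarding the import keeps the same file harmless on machines where the package is not installed.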

+3

Read this Python HOWTO on unicode after reading this section from the tutorial

Creating Unicode strings in Python is as simple as creating regular strings:

 >>> u'Hello World !'
 u'Hello World !'

To answer your first question: they are different because only the u'' form creates a Unicode string.

Second question:

 sys.getdefaultencoding() 

returns the default encoding.

But to quote the link:

Python users who are new to Unicode are sometimes drawn to the default encoding returned by sys.getdefaultencoding(). The first thing you need to know about the default encoding is that you don't need to care about it. Its value should be "ascii", and it is used when converting byte strings (StrIsNotAString) to unicode strings.
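
The "ascii" default is what produced the UnicodeDecodeError in the question. A minimal Python 3 sketch of the same situation, assuming the byte came from a DOS code page such as cp850:

```python
raw = b'\x89'                     # byte received from the console

try:
    raw.decode('ascii')           # what the implicit conversion amounts to
    implicit_ok = True
except UnicodeDecodeError as exc:
    implicit_ok = False
    print('ascii decode failed:', exc.reason)

text = raw.decode('cp850')        # naming the actual encoding works
print(ascii(text))
```

The fix is always to name the encoding explicitly rather than to change the default.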

+1
  • Actually, a unicode object has no "encoding". You should read up on Unicode in Python to avoid constant confusion. This presentation looks adequate - http://farmdev.com/talks/unicode/

  • You are on a Russian version of Windows, right? The terminal uses cp1251.

+1

You answered question 1 in the way you asked it: the first string is an encoded byte string, but the second is not an encoding at all; it refers to the Unicode code point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.
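
The code point and its name can be looked up with the standard unicodedata module. A short sketch:

```python
import unicodedata

ch = u'\xeb'
code_point = ord(ch)                        # integer value of the code point
print(hex(code_point), unicodedata.name(ch))
```

This prints the hex value eb alongside the character's official Unicode name, independent of any byte encoding.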

Now, what the first encoding is makes for an interesting question. I would normally expect it to be either utf-8 or, since you're on Windows, ISO-8859-1 or Win-1252 (which is not quite the same, but close enough). However, the normal representation of this letter in utf-8 is c3 ab, and in Win-1252 it matches the Unicode code point, i.e. hex eb. So this is a bit of a mystery.

+1

It seems you are using the CP850 code page, which makes sense, as it is a historical code page for DOS brought forward to the terminal window.

 >>> s
 '\x89'
 >>> us=unicode(s,'CP850')
 >>> us
 u'\xeb'
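
One way to narrow down the code page is to try a handful of candidates and keep those that decode the byte to the expected character. The sketch below is in Python 3 syntax; the helper name codecs_matching and the candidate list are chosen only for illustration.

```python
def codecs_matching(raw, expected, candidates):
    """Return the candidate codec names that decode `raw` to `expected`."""
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except UnicodeDecodeError:
            pass                  # this codec cannot decode the bytes at all
    return matches

candidates = ['cp437', 'cp850', 'cp1252', 'latin-1', 'utf-8']
matches = codecs_matching(b'\x89', u'\xeb', candidates)
print(matches)
```

Note that both cp437 and cp850 map 0x89 to ë, so a single character is not enough to tell them apart; sys.stdout.encoding is the more direct answer.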
+1

As you understand:

 >>> a = ""
 >>> a
 '\xf1'
 >>> print a

Do you open any file when receiving such errors? If yes, try opening it with

 import codecs
 f = codecs.open('filename.txt','r','utf-8')
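
A fuller round trip, as a sketch that also runs on Python 3, could use io.open with an explicit encoding. The temporary path is created only so the example is self-contained.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'filename.txt')   # throwaway file

# Write text, letting the file object encode it as UTF-8.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\xeb\n')

# Read it back as text; the file object decodes the bytes for us.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Read the same file as raw bytes to see what is actually stored on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

print(ascii(text), raw)
```

Opening in text mode with a named encoding keeps decoding at the file boundary, so the rest of the program only handles Unicode strings.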
+1

In case others land on this page when searching: the easiest way is to set the code page in the terminal first.

 CHCP 65001 

then run your program.

Works well for me. For PowerShell, start it with

 powershell.exe -NoExit /c "chcp.com 65001" 
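
To confirm from inside Python that the code page change took effect, the Win32 call GetConsoleOutputCP can be queried through ctypes. This is a sketch; the wrapper name console_output_code_page is invented for the example, and it returns None on anything other than Windows.

```python
import sys

def console_output_code_page():
    """Return the console's output code page number on Windows, else None."""
    if sys.platform != 'win32':
        return None
    import ctypes
    return ctypes.windll.kernel32.GetConsoleOutputCP()

cp = console_output_code_page()
print(cp)
```

After CHCP 65001 the function should report 65001 when run in that same console window.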

It's from: Python: unicode in a Windows terminal, the encoding used?

0

Source: https://habr.com/ru/post/890503/

