Python: Unicode in Windows terminal, encoding used?

Question

I am using a Python interpreter in a Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.

I am typing:

    >>> s='ë'
    >>> s
    '\x89'
    >>> u=u'ë'
    >>> u
    u'\xeb'

Question 1 : Why is the encoding used in the string s different from the one used in the unicode string u ?

I continue and type:

    >>> us=unicode(s)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal not in range(128)
    >>> us=unicode(s, 'latin-1')
    >>> us
    u'\x89'

Question 2: I tried my luck with the latin-1 encoding to turn the string into a unicode string (in fact, I tried a bunch of others first, including utf-8). How can I find out which encoding the terminal used to encode my string?

Question 3: How can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, silly me. print(s) does the job.

I already looked at this related SO question, but found no hints in it: Set Python terminal encoding on Windows

+6

python windows terminal unicode

Rabarberski Jun 14 '11 at 14:09

8 answers

Avoid Windows Terminals

I'm not going out on a limb by saying that the "terminal" (more appropriately, the "DOS prompt") that ships with Windows 7 is thoroughly inadequate. It was bad on Windows 95, NT, XP, Vista, and 7. Perhaps they fixed it with PowerShell; I don't know. In any case, it is indicative of the problems that plagued OS development at Microsoft at the time.

Output to file instead

Set the environment variable PYTHONIOENCODING , and then redirect the output to a file.

    set PYTHONIOENCODING=utf-8
    ./myscript.py > output.txt

Then, using Notepad++, you can view the UTF-8 version of your output.
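To see what PYTHONIOENCODING changes, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2). It starts a child interpreter with the variable set and captures the raw bytes the child writes to stdout. The helper name run_with_io_encoding is hypothetical, made up for this illustration.

```python
import os
import subprocess
import sys

# Child program: writes the single character U+00EB (ë) to stdout.
CHILD_CODE = "import sys; sys.stdout.write(u'\\xeb')"


def run_with_io_encoding(encoding):
    """Run CHILD_CODE with PYTHONIOENCODING set and return the raw stdout bytes."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = encoding
    return subprocess.check_output([sys.executable, "-c", CHILD_CODE], env=env)


if __name__ == "__main__":
    for name in ("utf-8", "cp437"):
        print(name, repr(run_with_io_encoding(name)))
```

With utf-8 the child emits the two bytes c3 ab; with cp437 it emits the single byte 89, which is the byte shown in the question.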

Install win-unicode-console

win-unicode-console may solve your problems. You should give it a try:

 pip install win-unicode-console

If you are interested in a fuller discussion of Python and the Windows command line, check out Python issue 1602. Otherwise, just use the win-unicode-console package.

 py -m run script.py

This enables it for a single script. Alternatively, follow the package's instructions to add win_unicode_console.enable() to every invocation by putting it in usercustomize or sitecustomize.
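As a rough sketch, a usercustomize.py along these lines would call win_unicode_console.enable() on every start-up without breaking machines where the package is missing. The wrapper function enable_unicode_console is a hypothetical name for this example; only win_unicode_console.enable() comes from the answer above.

```python
import sys


def enable_unicode_console():
    """Enable win-unicode-console if it is available; return True on success."""
    if sys.platform != "win32":
        return False  # the package only targets the Windows console
    try:
        import win_unicode_console  # third-party: pip install win-unicode-console
    except ImportError:
        return False  # not installed, leave the console untouched
    win_unicode_console.enable()
    return True


enable_unicode_console()
```

Guarding the import keeps the same file usable on other platforms and in environments where the package has not been installed.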

+3

Cameron Lowell Palmer May 16, '16 at 18:34

Read the Python Unicode HOWTO after you have read this section of the tutorial:

Creating Unicode strings in Python is as simple as creating regular strings:

    >>> u'Hello World !'
    u'Hello World !'

To answer your first question: they are different because you only create a Unicode string when you use u''.

Second question:

 sys.getdefaultencoding()

returns the default encoding.

But to quote the link:

Python users who are new to Unicode are sometimes attracted by the default encoding returned by sys.getdefaultencoding(). The first thing you should know about the default encoding is that you don't need to care about it. Its value should be "ascii", and it is used when converting byte strings (StrIsNotAString) to unicode strings.
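The failure in the question follows directly from that "ascii" default. The sketch below is written in Python 3 syntax, where sys.getdefaultencoding() returns "utf-8", so it names "ascii" explicitly to mimic what Python 2's unicode(s) does implicitly. The cp850 codec used for the explicit decode is the code page suggested in another answer in this thread, not something the quoted text prescribes.

```python
raw = b"\x89"  # the byte the terminal produced for 'ë' in the question

error = None
try:
    raw.decode("ascii")  # what Python 2's unicode(s) attempts by default
except UnicodeDecodeError as exc:
    error = exc
    print("ascii cannot decode it:", exc)

# Naming the encoding explicitly avoids relying on any default.
text = raw.decode("cp850")
print(repr(text))
```

The practical takeaway is to pass the encoding explicitly every time bytes are turned into text, rather than changing the default.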

+1

Fredrik pihl Jun 14 '11 at 14:26

In fact, a unicode object does not have an "encoding". You should read up on Unicode in Python to avoid constant confusion. This presentation looks adequate: http://farmdev.com/talks/unicode/ .
You are on a Russian version of Windows, right? The terminal uses cp1251.

+1

letitbee Jun 14 '11 at 14:26

You have answered question 1 in the way you asked it: the first string is an encoded byte string, but the second is not an encoding at all; it refers to the Unicode code point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.

Now, the question of what the first encoding is, is an interesting one. I would normally expect it to be either UTF-8 or, since you're on Windows, ISO-8859-1 or Win-1252 (which are not quite the same, but close enough). However, the usual representation of this letter in UTF-8 is c3 ab, and in Win-1252 it actually matches the Unicode code point, i.e. hex eb. So this is a bit of a mystery.
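The byte values mentioned here can be checked by encoding the character under each candidate. This is a small sketch in Python 3 syntax; the two DOS code pages in the list (cp437 and cp850) are the ones proposed by other answers in this thread and are included for comparison.

```python
import unicodedata

char = u"\xeb"
print(unicodedata.name(char))

# Encodings discussed in the thread; the list itself is just for illustration.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "cp437", "cp850"]
encoded = {name: char.encode(name) for name in CANDIDATES}
for name in CANDIDATES:
    print("%-8s %r" % (name, encoded[name]))
```

UTF-8 yields c3 ab, latin-1 and cp1252 yield eb, and the two DOS code pages yield 89, which is the byte the question shows.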

+1

Daniel Roseman Jun 14 '11 at 14:26

It seems you are using the CP850 code page, which makes sense, as it is a historical code page for DOS brought forward to the terminal window.

    >>> s
    '\x89'
    >>> us=unicode(s,'CP850')
    >>> us
    u'\xeb'
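When the terminal encoding is unknown, one way to narrow it down is to decode the observed bytes with several candidate codecs and keep those that produce the character that was typed. The following is a sketch in Python 3 syntax; matching_encodings and the candidate list are hypothetical names chosen for this example.

```python
def matching_encodings(raw, expected, candidates):
    """Return the candidate codecs that decode `raw` to `expected`."""
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except UnicodeDecodeError:
            pass  # this codec cannot decode the bytes at all
    return matches


CANDIDATES = ["utf-8", "latin-1", "cp1252", "cp1251", "cp437", "cp850"]
print(matching_encodings(b"\x89", u"\xeb", CANDIDATES))
```

For the byte in the question, both cp437 and cp850 match, so a single character is not enough to tell the two DOS code pages apart.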

+1

Mark ransom Jun 14 '11 at 15:47

As you understand:

    >>> a = ""
    >>> a
    '\xf1'
    >>> print a

Do you open any file when receiving such errors? If yes, try opening it with

    import codecs
    f = codecs.open('filename.txt','r','utf-8')
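A self-contained sketch of that call, in Python 3 syntax: it first writes a small UTF-8 file so there is something to read, then opens it with codecs.open as suggested. The temporary file stands in for 'filename.txt' and is an assumption of this example.

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file to read back; the path stands in for 'filename.txt'.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as raw_file:
    raw_file.write(b"\xc3\xab\n")  # 'ë' encoded as UTF-8, plus a newline

f = codecs.open(path, "r", "utf-8")
try:
    text = f.read()  # decoded text, not raw bytes
finally:
    f.close()
os.remove(path)

print(repr(text))
```

In current Python versions, io.open(path, encoding='utf-8') is an equivalent way to get decoded text from a file.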

+1

tony Jun 14 '11 at 20:48

In case others land on this page when searching: the easiest way is to set the code page in the terminal first.

 CHCP 65001

then run your program.

This works well for me. For PowerShell, start it with

 powershell.exe -NoExit /c "chcp.com 65001"

It's from "Python: unicode in Windows terminal, encoding used?"

0

lxx Mar 30 '15 at 0:45

Mark tolonen · Accepted Answer · 2011-06-14T20:05:28+0000

Unicode is not an encoding. You encode into byte strings and decode into Unicode:

    >>> '\x89'.decode('cp437')
    u'\xeb'
    >>> u'\xeb'.encode('cp437')
    '\x89'
    >>> u'\xeb'.encode('utf8')
    '\xc3\xab'

The Windows terminal uses legacy DOS code pages. For US Windows, it is:

    >>> import sys
    >>> sys.stdout.encoding
    'cp437'

Windows applications use Windows code pages. Python IDLE will show Windows encoding:

    >>> import sys
    >>> sys.stdout.encoding
    'cp1252'

Your results may vary.
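To check this on your own machine, the same attribute can be read for all three standard streams. This is a small sketch in Python 3 syntax; stream_encodings is a hypothetical helper name, and getattr is used defensively in case a stream has been replaced by an object without an encoding attribute.

```python
import sys


def stream_encodings():
    """Return the encodings Python reports for the standard streams."""
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
    }


if __name__ == "__main__":
    for name, encoding in sorted(stream_encodings().items()):
        print("%-6s %s" % (name, encoding))
```

The value reported for stdin is the one to pass to decode() for bytes typed at the prompt, and the value for stdout is what print uses when it encodes text for display.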
