Python strings do not seem to be "born equal"

Question

I'm trying to wrap my brain around text encoding standards. When interpreting a bunch of bytes as "text", you need to know which encoding scheme is used. Possible candidates that I know of:

ASCII: very simple encoding scheme, supports 128 characters.
CP-1252: Windows encoding scheme for the Latin alphabet. Also known as "ANSI".
UTF-8: encoding scheme for the Unicode table (1,114,112 code points). Represents each character with one byte if possible, more bytes if necessary (maximum 4 bytes).
UTF-16: another encoding scheme for the Unicode table (1,114,112 code points). Represents each character with a minimum of 2 bytes and a maximum of 4 bytes.
UTF-32: another encoding scheme for the Unicode table. Represents each character with exactly 4 bytes.
...
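The byte counts in the list above can be checked directly from Python. This is a minimal sketch; the sample characters and the helper name `encoded_lengths` are illustrative choices, not part of the original question.

```python
# Encode a few sample characters and compare the byte counts per encoding.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji


def encoded_lengths(ch):
    """Return the number of bytes needed for ch in each Unicode encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variants omit the byte order mark, so only the
        # character payload is counted.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in samples:
    print(repr(ch), encoded_lengths(ch))
```

UTF-8 grows from 1 to 4 bytes across these samples, UTF-16 uses 2 bytes until a character falls outside the Basic Multilingual Plane, and UTF-32 always uses 4.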

Now, I would expect Python to consistently use one coding scheme for its built-in String type. I did the following test and the result makes me tremble. I am starting to believe that Python does not always adhere to the same coding scheme in order to store its strings inside. In other words: Python strings do not seem to be born equal.

EDIT:

I forgot to mention that I am using Python 3.x. Sorry :-)

1. Test

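I have two files: myAnsi.txt and myUtf.txt. As you can guess, the first is encoded in CP-1252, also known as ANSI. The second is encoded in utf-8. In the test, I open each file, read its content and assign it to a native Python String variable. Then I close the file. After that, I create a new file and write the content of the String variable into it. Here is the code: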

    ##############################
    #    TEST ON THE ANSI-coded  #
    #    FILE                    #
    ##############################
    import os
    file = open(os.getcwd() + '\\myAnsi.txt', 'r')
    fileText = file.read()
    file.close()

    file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
    file.write(fileText)
    file.close()

    # A print statement here like:
    #    >> print(fileText)
    # will raise an exception.
    # But if you're typing this code in a python terminal,
    # you can just write:
    #    >> fileText
    # and get the content printed. In my case, it is the exact
    # content of the file.
    # PS: I use the native windows cmd.exe as my Python terminal ;-)

    ##############################
    #    TEST ON THE Utf-coded   #
    #    FILE                    #
    ##############################
    import os
    file = open(os.getcwd() + '\\myUtf.txt', 'r')
    fileText = file.read()
    file.close()

    file = open(os.getcwd() + '\\outputUtf.txt', 'w')
    file.write(fileText)
    file.close()

    # A print statement here like:
    #    >> print(fileText)
    # will just work fine (at least for me).

    ############# END OF TEST #############

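2. My expectations

Suppose that Python always sticks to one internal encoding scheme - utf-8, for example - for all its Strings. Assigning other content to a String would then trigger some kind of implicit conversion. Under those assumptions, I would expect both output files to be utf-8: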

    outputAnsi.txt   ->   utf-8 encoded
    outputUtf.txt    ->   utf-8 encoded

3. The results

The actual results are:

    outputAnsi.txt   ->   CP-1252 encoded (ANSI)
    outputUtf.txt    ->   utf-8 encoded

Apparently, the String fileText somehow keeps track of the encoding scheme it adheres to.

:

Many people pointed out that it is the open() function that determines the encoding, not the string.

. open() " " - , cp1252 - *.txt , ?

4. Questions

:

(1) , Python ? .

(2) -, Python , Python. Python Strings . ? , Python ?

(3) , Python , ? . , Python (!) .

5. Additional information:

Python version: Python 3.x (from Anaconda)
Operating system: Windows 10
Terminal: Windows cmd.exe
Some asked about printing fileText. As noted, print(fileText) fails for the ANSI file. Simply typing fileText in the python terminal does work.
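Encoding detection: Notepad++, and this tool: https://nlp.fi.muni.cz/projects/chared/
The files outputAnsi.txt and outputUtf.txt do not exist beforehand. They are created when I call open(..) with the 'w' option.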

6. The files:

, , . , . . (PS: , , ?):

myAnsi.txt

/*
******************************************************************************
**
**  File        : LinkerScript.ld
**
**  Author      : Auto-generated by Ac6 System Workbench
**
**  Abstract    : Linker script for STM32F746NGHx Device from STM32F7 series
**
**  Target      : STMicroelectronics STM32
**
**  Distribution: The file is distributed "as is," without any warranty
**                of any kind.
**
*****************************************************************************
** @attention
**
** <h2><center>&copy; COPYRIGHT(c) 2014 Ac6</center></h2>
**
*****************************************************************************
*/

/* Entry Point */
/*ENTRY(Reset_Handler)*/
ENTRY(Default_Handler)

/* Highest address of the user mode stack */
_estack = 0x20050000;    /* end of RAM */

_Min_Heap_Size = 0;      /* required amount of heap  */
_Min_Stack_Size = 0x400; /* required amount of stack */

/* Memories definition */
MEMORY
{
  RAM (xrw)     : ORIGIN = 0x20000000, LENGTH = 320K
  ROM (rx)      : ORIGIN = 0x8000000, LENGTH = 1024K
}

Printing fileText raises an error:

>>> print(fileText)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Anaconda3\lib\encodings\cp850.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 357: character maps to <undefined>

But simply typing the variable name works:

>>> fileText
    ### contents of the file are printed out :-) ###
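The traceback above can be reproduced without any file. The sketch below assumes the console code page is cp850 (as the traceback path suggests) and uses the character from the error message. It also shows the `backslashreplace` error handler, which the interactive prompt falls back to when a repr cannot be encoded for the console; that is one plausible reason why typing the bare variable name works while print() fails.

```python
quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK, the character named in the traceback

# CP-1252 has a slot for this character, so writing it to a CP-1252 file works.
cp1252_bytes = quote.encode("cp1252")

# CP-850 has no slot for it, so encoding for such a console fails.
try:
    quote.encode("cp850")
    cp850_ok = True
except UnicodeEncodeError:
    cp850_ok = False

# With the backslashreplace handler the character is escaped instead of rejected.
escaped = quote.encode("cp850", errors="backslashreplace")

print(cp1252_bytes, cp850_ok, escaped)
```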

myUtf.txt

/*--------------------------------------------------------------------------------------------------------------------*/
/*           _ _ _                                                                                                    */
/*          / -,- \                   __  _            _                                                              */
/*         //  |  \\                 / __\ | ___   ___| | __                   _            _                         */
/*         |   0--,|                / /  | |/ _ \ / __| |/ /    __ ___ _ _  __| |_ __ _ _ _| |_ ___                   */
/*         \\     //               / /___| | (_) | (__|   <    / _/ _ \ ' \(_-<  _/ _` | ' \  _(_-<                   */
/*          \_-_-_/                \____/|_|\___/ \___|_|\_\   \__\___/_||_/__/\__\__,_|_||_\__/__/                   */
/*--------------------------------------------------------------------------------------------------------------------*/


#include "clock_constants.h"
#include "../CMSIS/stm32f7xx.h"
#include "stm32f7xx_hal_rcc.h"


/*--------------------------------------------------------------------------------------------------*/
/*          S y s t e m C o r e C l o c k       i n i t i a l        v a l u e                      */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* This variable is updated in three ways:                                                          */
/*      1) by calling CMSIS function SystemCoreClockUpdate()                                        */
/*      2) by calling HAL API function HAL_RCC_GetHCLKFreq()                                        */
/*      3) each time HAL_RCC_ClockConfig() is called to configure the system clock frequency        */
/*          Note: If you use this function to configure the system clock; then there                */
/*                is no need to call the 2 first functions listed above, since SystemCoreClock      */
/*                variable is updated automatically.                                                */
/*                                                                                                  */
uint32_t SystemCoreClock = 16000000;
const uint8_t AHBPrescTable[16] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 7, 8, 9};


/*--------------------------------------------------------------------------------------------------*/
/*          S y s t e m C o r e C l o c k       v a l u e      u p d a t e                          */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* @brief  Update SystemCoreClock variable according to Clock Register Values.                      */
/*         The SystemCoreClock variable contains the core clock (HCLK), it can                      */
/*         be used by the user application to setup the SysTick timer or configure                  */
/*         other parameters.                                                                        */
/*--------------------------------------------------------------------------------------------------*/

+4

python python-3.x encoding unicode utf-8

K.Mulier Aug 29 '16 19:24

4

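CP-1252 is more or less a byte-for-byte codec; it can decode (nearly) arbitrary bytes, including the bytes of UTF-8. So on Windows, where the default encoding passed to open is your locale's cp-1252, Python does nothing but decode from cp-1252 and encode back to cp-1252. The bytes round-trip without being meaningfully transformed.

For example, if you write a file as UTF-8: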

with open('utf8file.txt', 'w', encoding='utf-8') as f:
    f.write('é')

This writes the bytes C3 A9.

If you then read it back with the cp-1252 default, each byte is decoded as cp-1252:

with open('utf8file.txt') as f:
    data = f.read()

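You'd expect 'é', but you get the cp-1252 interpretation of those bytes instead: "Ã©" (you would see this if you printed it on a console that can display non-ASCII characters).

When you write it back, though, the process is reversed; the (garbage) "Ã©" is encoded as C3 A9 again, so the file round-trips unchanged.

Strictly speaking, cp-1252 cannot decode every byte (in Python's tables, unlike latin-1, a few bytes such as \x8d are unassigned), so the round trip is not guaranteed for arbitrary data, but it works for most text.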
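The round trip can be checked on bytes alone, without touching the filesystem. A minimal sketch; the helper name `decodes` is made up for illustration.

```python
original = "\u00e9".encode("utf-8")       # the UTF-8 bytes C3 A9 for e-acute
misread = original.decode("cp1252")       # decoded with the wrong codec
written_back = misread.encode("cp1252")   # encoding again restores the bytes


def decodes(raw, encoding):
    """Return True if raw can be decoded with the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(repr(misread), written_back == original)
# Byte 0x8d is unassigned in Python's cp1252 table, but valid in latin-1.
print(decodes(b"\x8d", "cp1252"), decodes(b"\x8d", "latin-1"))
```

The string in the middle is garbage, yet the bytes written out match the bytes read in, which is why both output files looked "correctly" encoded.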

+3

ShadowRanger Aug 29 '16 20:00

, .

open(). Python 3. *

open() : open ( , ). 1

, , , 2. ? ,

, str-. - . 3

, , . . Python , , .

+1

msleevi Aug 29 '16 20:06

, , .

, .

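When you write, for example, text = open("myfile.txt").read(), the bytes of the file are decoded into text using a default encoding, which matters as soon as the content is not plain ASCII. Which encoding that is depends on the system (it may be iso-8859-1, for example, or utf-8).

With iso-8859-1, IIRC every byte from 0x00 to 0xFF maps to a character, so decoding never fails.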

, , , .

Python 3 , " ", .

, 8- , . , .

.

, UnicodeDecodeError. .

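For example, if a file contains Città encoded in iso8859-1 and you read it as utf-8 (which is wrong), you get an error, because à in iso8859-1 is the byte 0xe0, which is not valid in utf-8 after t (0x74).

In the opposite case (i.e. a file containing Città encoded in utf-8, read as iso8859-1), reading succeeds, but you get the "wrong" text CittÃ followed by a non-breaking space (i.e. the bytes 0xc3 0xA0 decoded one at a time).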

0

6502 Aug 29 '16 20:12

Andrea Corbellini · Accepted Answer · 2016-08-29T20:02:19+0000

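Note that, when no encoding is given, open() uses a platform-dependent default encoding (in your case, the one configured on Windows).

So these calls: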

file = open(os.getcwd() + '\\myAnsi.txt', 'r')
file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
file = open(os.getcwd() + '\\myUtf.txt', 'r')
file = open(os.getcwd() + '\\outputUtf.txt', 'w')

all use that same default encoding, whether the file is opened for reading or for writing.
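A quick way to see which encoding open() picks on a given machine is to ask the file object. This is a sketch under the assumption that Python's UTF-8 mode is not forced; `probe.txt` is a hypothetical scratch file created in a temporary directory.

```python
import locale
import os
import tempfile

# The locale's preferred encoding; in text mode open() falls back to it
# when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.txt")  # hypothetical scratch file
    with open(path, "w") as f:
        implicit = f.encoding  # whatever the platform default is
    with open(path, "w", encoding="utf-8") as f:
        explicit = f.encoding  # the encoding that was asked for

print("locale default:", default_encoding)
print("open() without encoding:", implicit)
print("open() with encoding='utf-8':", explicit)
```

On a typical Western European Windows setup the first two lines would be expected to report cp1252, but the value depends entirely on the machine.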

If you pass encoding='cp1252' or encoding='utf-8' explicitly, you get the behaviour you expected:

file = open(os.getcwd() + '\\myAnsi.txt', 'r', encoding='cp1252')
file = open(os.getcwd() + '\\outputAnsi.txt', 'w', encoding='cp1252')
file = open(os.getcwd() + '\\myUtf.txt', 'r', encoding='utf-8')
file = open(os.getcwd() + '\\outputUtf.txt', 'w', encoding='utf-8')

(As a side note, on Windows too, you can simply write 'myAnsi.txt' instead of os.getcwd() + '\\myAnsi.txt'.)
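Put together, the test from the question could be written with explicit encodings and `with` blocks. This is only a sketch: it writes to a temporary directory instead of the current one, and the sample text is an assumption chosen because its bytes differ between CP-1252 and UTF-8.

```python
import os
import tempfile

text = "\u201cquoted\u201d \u20ac"  # curly quotes and a euro sign

with tempfile.TemporaryDirectory() as tmp:
    ansi_path = os.path.join(tmp, "outputAnsi.txt")
    utf_path = os.path.join(tmp, "outputUtf.txt")

    # Same string, two different encodings on disk.
    with open(ansi_path, "w", encoding="cp1252") as f:
        f.write(text)
    with open(utf_path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode shows the raw bytes that were written.
    with open(ansi_path, "rb") as f:
        ansi_bytes = f.read()
    with open(utf_path, "rb") as f:
        utf_bytes = f.read()

    # Reading back with the matching encoding yields the same string.
    with open(ansi_path, "r", encoding="cp1252") as f:
        ansi_text = f.read()
    with open(utf_path, "r", encoding="utf-8") as f:
        utf_text = f.read()

print(ansi_bytes)
print(utf_bytes)
print(ansi_text == utf_text == text)
```

The string is identical in both cases; only the bytes on disk differ.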

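Also keep in mind that different encodings can produce the same bytes. For example, the string hello has the same representation in ASCII, CP-1252 and UTF-8. In general, as long as the text contains only ASCII characters, many encodings give identical bytes: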

>>> 'hello'.encode('cp1252')
b'hello'
>>> 'hello'.encode('utf-8')
b'hello'  # different encoding, same byte representation

Conversely, the same bytes, decoded with different encodings, can give different strings:

>>> b'\xe2\x82\xac'.decode('utf-8')
'€'
>>> b'\xe2\x82\xac'.decode('cp1252')
'â‚¬'  # same byte representation, different string

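Internally, Python does not commit to a single one of UTF-8, UTF-16 or UTF-32. CPython picks the most "compact" fixed-width representation for each string (1, 2 or 4 bytes per code point) rather than a variable-width form such as UTF-8 or UTF-16, so that indexing stays O(1).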
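The per-string storage width can be observed indirectly with sys.getsizeof. The sketch below assumes CPython 3.3 or later; other implementations may store strings differently, and `bytes_per_char` is a made-up helper name.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string lengths.

    Subtracting the sizes cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


ascii_width = bytes_per_char("a")
bmp_width = bytes_per_char("\u20ac")         # euro sign
astral_width = bytes_per_char("\U0001F600")  # an emoji

print(ascii_width, bmp_width, astral_width)
```

The width grows with the highest code point in the string, which is an in-memory detail and has no bearing on how a file is encoded when it is written.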

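In any case, the internal representation does not matter here (it is an implementation detail). What matters is the encoding used when reading and writing the files, which is CP-1252 or UTF-8.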
