Python strings do not seem to be "born equal"

I'm trying to wrap my head around text encoding standards. When interpreting a bunch of bytes as "text", you need to know which encoding scheme applies. The candidates I know of:

  • ASCII: a very simple encoding scheme that supports 128 characters.
  • CP-1252: the Windows encoding scheme for the Latin alphabet. Also known as "ANSI".
  • UTF-8: an encoding scheme for the Unicode table (1,114,112 code points). Represents each character with one byte if possible, more bytes if necessary (maximum 4 bytes).
  • UTF-16: another encoding scheme for the Unicode table (1,114,112 code points). Represents each character with a minimum of 2 bytes and a maximum of 4 bytes.
  • UTF-32: another encoding scheme for the Unicode table. Represents each character with 4 bytes.
  • ...
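To make the size claims in the list above concrete, here is a minimal sketch that encodes a few sample characters and counts the resulting bytes. The helper name `encoded_lengths` and the sample characters are illustrative choices, not part of the original test. The `-le` codec variants are used so that no byte order mark is added to the count.

```python
# Byte length of the same character under the Unicode encodings listed above.
samples = ["A", "é", "€", "😀"]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]  # "-le" variants add no BOM


def encoded_lengths(ch):
    """Return {encoding name: number of bytes} for one character."""
    return {enc: len(ch.encode(enc)) for enc in encodings}


for ch in samples:
    # ascii() keeps the output printable on any console encoding
    print(ascii(ch), encoded_lengths(ch))

# CP-1252 is a single-byte encoding with a limited repertoire
cp1252_e_acute = "é".encode("cp1252")
print(cp1252_e_acute)

# ASCII only covers 128 characters, so "é" cannot be encoded at all
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ascii can encode é:", ascii_ok)
```

Note that UTF-16 needs 4 bytes only for characters outside the Basic Multilingual Plane, such as the emoji in the sample list.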

Now, I would expect Python to consistently use one encoding scheme for its built-in String type. I ran the following test, and the result makes me shiver. I am starting to believe that Python does not always stick to the same encoding scheme to store its strings internally. In other words: Python strings do not seem to be "born equal".

EDIT:

I forgot to mention that I am using Python 3.x. Excuse me :-)

1. Test

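I have two files: myAnsi.txt and myUtf.txt. The first is encoded in CP-1252, also known as ANSI. The second is encoded in utf-8. In the test, I open each file, read its contents into a native Python String and close the file. Then I create a new file and write the String to it. The code: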

    ##############################
    #    TEST ON THE ANSI-coded  #
    #    FILE                    #
    ##############################
    import os
    file = open(os.getcwd() + '\\myAnsi.txt', 'r')
    fileText = file.read()
    file.close()

    file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
    file.write(fileText)
    file.close()

    # A print statement here like:
    #    >> print(fileText)
    # will raise an exception.
    # But if you're typing this code in a python terminal,
    # you can just write:
    #    >> fileText
    # and get the content printed. In my case, it is the exact
    # content of the file.
    # PS: I use the native windows cmd.exe as my Python terminal ;-)

    ##############################
    #    TEST ON THE Utf-coded   #
    #    FILE                    #
    ##############################
    import os
    file = open(os.getcwd() + '\\myUtf.txt', 'r')
    fileText = file.read()
    file.close()

    file = open(os.getcwd() + '\\outputUtf.txt', 'w')
    file.write(fileText)
    file.close()

    # A print statement here like:
    #    >> print(fileText)
    # will just work fine (at least for me).

    ############# END OF TEST #############

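2. My expectations

Suppose Python consistently stuck to one internal encoding scheme, for example utf-8, for all of its Strings. Assigning content to a String would then involve some implicit conversion. Under that assumption, I would expect both output files to be utf-8 encoded: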

    outputAnsi.txt   ->   utf-8 encoded
    outputUtf.txt    ->   utf-8 encoded

3. The result

The result I actually get is:

    outputAnsi.txt   ->   CP-1252 encoded (ANSI)
    outputUtf.txt    ->   utf-8 encoded

I conclude that the String fileText does not stick to one encoding scheme, but adapts to the content it is given.

:

, open() , .

. open() " " - , cp1252 - *.txt , ?

4. Questions

My questions are:

(1) , Python ? .

(2) -, Python , Python. Python Strings . ? , Python ?

(3) , Python , ? . , Python (!) .

5. Details (my setup):

  • Python: Python 3.x (Anaconda)
  • OS: Windows 10
  • Terminal: Windows cmd.exe
  • , fileText. -, print(fileText) ANSI. . python fileText .
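  • Encoding detection: I checked the files with Notepad++ and with this tool: https://nlp.fi.muni.cz/projects/chared/
  • The files outputAnsi.txt and outputUtf.txt do not exist before the test. They are created by the open(..) call with the 'w' option.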

6. File contents (test inputs):

, , . , . . (PS: , , ?):

myAnsi.txt

/*
******************************************************************************
**
**  File        : LinkerScript.ld
**
**  Author      : Auto-generated by Ac6 System Workbench
**
**  Abstract    : Linker script for STM32F746NGHx Device from STM32F7 series
**
**  Target      : STMicroelectronics STM32
**
**  Distribution: The file is distributed “as is,” without any warranty
**                of any kind.
**
*****************************************************************************
** @attention
**
** <h2><center>&copy; COPYRIGHT(c) 2014 Ac6</center></h2>
**
*****************************************************************************
*/

/* Entry Point */
/*ENTRY(Reset_Handler)*/
ENTRY(Default_Handler)

/* Highest address of the user mode stack */
_estack = 0x20050000;    /* end of RAM */

_Min_Heap_Size = 0;      /* required amount of heap  */
_Min_Stack_Size = 0x400; /* required amount of stack */

/* Memories definition */
MEMORY
{
  RAM (xrw)     : ORIGIN = 0x20000000, LENGTH = 320K
  ROM (rx)      : ORIGIN = 0x8000000, LENGTH = 1024K
}

Printing fileText fails:

>>> print(fileText)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Anaconda3\lib\encodings\cp850.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 357: character maps to <undefined>

Evaluating it at the interactive prompt works:

>>> fileText
    ### contents of the file are printed out :-) ###
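The traceback above comes from the console encoding, not from the file read. A minimal sketch of what seems to happen, assuming the console uses cp850 as the traceback indicates: byte 0x93 in CP-1252 decodes to U+201C, a curly quote that cp850 has no slot for. The interactive echo does not fail, most likely because the REPL falls back to backslash escapes when the console cannot encode the repr.

```python
# Byte 0x93 in CP-1252 is the left curly quote U+201C
left_quote = b"\x93".decode("cp1252")
print(ascii(left_quote))

# print() has to encode the text for the console; cp850 lacks U+201C
try:
    left_quote.encode("cp850")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("cp850 can encode it:", encodable)

# Two lossy ways to get output anyway: a replacement character, or an
# escape sequence similar to what the interactive echo shows
replaced = left_quote.encode("cp850", errors="replace")
escaped = left_quote.encode("cp850", errors="backslashreplace")
print(replaced, escaped)
```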

myUtf.txt

/*--------------------------------------------------------------------------------------------------------------------*/
/*           _ _ _                                                                                                    */
/*          / -,- \                   __  _            _                                                              */
/*         //  |  \\                 / __\ | ___   ___| | __                   _            _                         */
/*         |   0--,|                / /  | |/ _ \ / __| |/ /    __ ___ _ _  __| |_ __ _ _ _| |_ ___                   */
/*         \\     //               / /___| | (_) | (__|   <    / _/ _ \ ' \(_-<  _/ _` | ' \  _(_-<                   */
/*          \_-_-_/                \____/|_|\___/ \___|_|\_\   \__\___/_||_/__/\__\__,_|_||_\__/__/                   */
/*--------------------------------------------------------------------------------------------------------------------*/


#include "clock_constants.h"
#include "../CMSIS/stm32f7xx.h"
#include "stm32f7xx_hal_rcc.h"


/*--------------------------------------------------------------------------------------------------*/
/*          S y s t e m C o r e C l o c k       i n i t i a l        v a l u e                      */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* This variable is updated in three ways:                                                          */
/*      1) by calling CMSIS function SystemCoreClockUpdate()                                        */
/*      2) by calling HAL API function HAL_RCC_GetHCLKFreq()                                        */
/*      3) each time HAL_RCC_ClockConfig() is called to configure the system clock frequency        */
/*          Note: If you use this function to configure the system clock; then there                */
/*                is no need to call the 2 first functions listed above, since SystemCoreClock      */
/*                variable is updated automatically.                                                */
/*                                                                                                  */
uint32_t SystemCoreClock = 16000000;
const uint8_t AHBPrescTable[16] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 7, 8, 9};


/*--------------------------------------------------------------------------------------------------*/
/*          S y s t e m C o r e C l o c k       v a l u e      u p d a t e                          */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* @brief  Update SystemCoreClock variable according to Clock Register Values.                      */
/*         The SystemCoreClock variable contains the core clock (HCLK), it can                      */
/*         be used by the user application to setup the SysTick timer or configure                  */
/*         other parameters.                                                                        */
/*--------------------------------------------------------------------------------------------------*/

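If you do not specify an encoding, open() uses a default encoding that depends on the platform (on Windows, typically the ANSI code page, such as cp1252).

So in these calls, no encoding is specified: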

file = open(os.getcwd() + '\\myAnsi.txt', 'r')
file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
file = open(os.getcwd() + '\\myUtf.txt', 'r')
file = open(os.getcwd() + '\\outputUtf.txt', 'w')

All four of these calls use the same platform default encoding, whatever the files actually contain.

Pass encoding='cp1252' or encoding='utf-8' explicitly, whichever matches the file:

file = open(os.getcwd() + '\\myAnsi.txt', 'r', encoding='cp1252')
file = open(os.getcwd() + '\\outputAnsi.txt', 'w', encoding='cp1252')
file = open(os.getcwd() + '\\myUtf.txt', 'r', encoding='utf-8')
file = open(os.getcwd() + '\\outputUtf.txt', 'w', encoding='utf-8')

(By the way, even on Windows, you can simply write 'myAnsi.txt' instead of os.getcwd() + '\\myAnsi.txt'.)
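A minimal sketch of the explicit-encoding approach, using `with` blocks so the files are closed automatically. It writes to a temporary directory rather than the real files, and the sample text is an illustrative stand-in. `locale.getpreferredencoding(False)` reports the encoding that open() has traditionally fallen back to in text mode; newer Python versions or UTF-8 mode may behave differently, so treat the printed value as informational.

```python
import locale
import os
import tempfile

# The fallback encoding for text-mode open() when none is given
# (may be overridden by UTF-8 mode on recent Python versions)
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myAnsi.txt")  # stand-in for the real file

    # Write and read with the encoding stated explicitly
    with open(path, "w", encoding="cp1252") as f:
        f.write("\u201cas is\u201d")
    with open(path, "r", encoding="cp1252") as f:
        text = f.read()

    # Binary mode shows the bytes actually stored on disk
    with open(path, "rb") as f:
        raw = f.read()

print(ascii(text), raw)
```

With the encoding stated on both sides, the result no longer depends on which machine runs the script.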


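Also, a sequence of bytes does not record which encoding produced it. For example, hello is encoded identically in ASCII, CP-1252 and UTF-8. In general, text that contains only ASCII characters yields the same bytes in all three: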

>>> 'hello'.encode('cp1252')
b'hello'
>>> 'hello'.encode('utf-8')
b'hello'  # different encoding, same byte representation

Conversely, the same bytes can decode to different strings, depending on which encoding you assume:

>>> b'\xe2\x82\xac'.decode('utf-8')
'€'
>>> b'\xe2\x82\xac'.decode('cp1252')
'â‚¬'  # same byte representation, different string

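As for the internal representation: Python does not expose its strings as UTF-8, UTF-16 or UTF-32. A Python string behaves as an indexable sequence of code points, and variable-width encodings such as UTF-8 and UTF-16 cannot support indexing in O(1).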
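Whatever the interpreter stores internally, a str is observed as a sequence of code points, and bytes only appear once you call encode(). A short sketch with an illustrative sample string:

```python
s = "a€😀"

# len() and indexing count code points, not bytes
length = len(s)
last = s[2]
print(length, ascii(last))

# Byte counts only exist per encoding
utf8_len = len(s.encode("utf-8"))       # 1 + 3 + 4
utf16_len = len(s.encode("utf-16-le"))  # 2 + 2 + 4 (surrogate pair)
utf32_len = len(s.encode("utf-32-le"))  # 4 + 4 + 4
print(utf8_len, utf16_len, utf32_len)
```

So the encoding of an output file is decided by the open() call that writes it, not by any property of the string.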


, , , , ( - ). , , CP-1252 UTF-8.


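CP-1252 is essentially a byte-for-byte codec; it can decode almost any bytes, including those of a UTF-8 file. So, assuming your Windows locale makes cp-1252 the default encoding for open, Python decodes each byte to one character when reading and encodes each character back to the same byte when writing. The round trip is lossless, so the output file contains the same bytes as the input.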

For example, write a UTF-8 file:

with open('utf8file.txt', 'w', encoding='utf-8') as f:
    f.write('é')

This writes the bytes C3 A9.

Now read it back with the default encoding, assumed here to be cp-1252:

with open('utf8file.txt') as f:
    data = f.read()

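You do not get 'é' back, but the cp-1252 interpretation of those two bytes: "Ã©".

If you then write that string out again with cp-1252, "Ã©" is encoded as C3 A9, the same bytes as the original file, so the output still looks like UTF-8.

Note that cp-1252 is not a complete byte-for-byte codec (in Python's Unicode mapping, unlike latin-1, a few bytes such as \x8d are undefined), so this round trip can fail on some input.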
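The round trip described above can be reproduced on any platform by passing cp1252 explicitly instead of relying on the Windows default. This is a sketch under that assumption; the output file name is hypothetical.

```python
import os
import tempfile

original = "é".encode("utf-8")  # the bytes C3 A9

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "utf8file.txt")
    dst = os.path.join(tmp, "output.txt")  # hypothetical output name

    with open(src, "wb") as f:
        f.write(original)

    # Misread the UTF-8 bytes as cp1252: one character per byte
    with open(src, encoding="cp1252") as f:
        data = f.read()

    # Writing with the same codec maps each character back to its byte
    with open(dst, "w", encoding="cp1252") as f:
        f.write(data)
    with open(dst, "rb") as f:
        round_tripped = f.read()

print(ascii(data), round_tripped == original)

# The round trip is not guaranteed: byte 0x8D has no cp1252 mapping in Python
try:
    b"\x8d".decode("cp1252")
    undefined_byte_ok = True
except UnicodeDecodeError:
    undefined_byte_ok = False
print("0x8d decodes as cp1252:", undefined_byte_ok)
```

The string in the middle is wrong, but the file written at the end is byte-identical to the input, which is why an encoding detector still reports it as UTF-8.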


, .

open(). Python 3. *

open() : open ( , ). 1

, , , 2. ? ,

, str-. - . 3

, , . . Python , , .


, , .

, .

, text = open("myfile.txt").read(), , , , , , ASCII. , , ( , iso-8859-1 , utf-8).

, IIRC 0x00 0xFF a, .

, , , .

Python 3 , " ", .

, 8- , . , .

.

, UnicodeDecodeError. .

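For example, if Città is encoded in iso8859-1 and read as utf-8, decoding fails, because à is the single byte 0xe0 in iso8859-1, which is not a complete utf-8 sequence after t (0x74).

In the opposite case (i.e. Città encoded in utf-8 and read as iso8859-1), no error is raised, but you get mojibake: CittÃ followed by a non-breaking space (i.e. Ã for 0xc3, then 0xA0).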
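Both directions of the Città example can be checked directly on bytes, without any files. A minimal sketch:

```python
word = "Città"

latin1_bytes = word.encode("iso8859-1")  # à becomes the single byte 0xE0
utf8_bytes = word.encode("utf-8")        # à becomes the two bytes 0xC3 0xA0
print(latin1_bytes, utf8_bytes)

# iso8859-1 bytes read as utf-8: 0xE0 starts a multi-byte sequence
# that never completes, so decoding raises an error
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print("utf-8 decoding failed:", decode_failed)

# utf-8 bytes read as iso8859-1: every byte is valid, so no error,
# but the text silently turns into mojibake
mojibake = utf8_bytes.decode("iso8859-1")
print(ascii(mojibake))
```

The second case is the more dangerous one in practice, because nothing signals that the text is wrong.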


Source: https://habr.com/ru/post/1652910/

