I'm trying to wrap my head around text encoding standards. When interpreting a bunch of bytes as "text", you need to know which "coding scheme" was used. Candidates that I know of:
- ASCII: very simple encoding scheme, supports 128 characters.
- CP-1252: Windows coding scheme for the Latin alphabet. Also known as "ANSI".
- UTF-8: encoding scheme for the Unicode table (1,114,112 code points). Represents each character with one byte if possible, more bytes if necessary (maximum 4 bytes).
- UTF-16: another encoding scheme for the Unicode table. Represents each character with a minimum of 2 bytes and a maximum of 4 bytes.
- UTF-32: another encoding scheme for the Unicode table. Represents each character with 4 bytes.
- ...
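To make the list above concrete, here is a minimal sketch that encodes a few sample characters under each scheme and shows the resulting bytes. The helper name `encoded_sizes` and the sample characters are my own illustrative choices, not part of any library.

```python
def encoded_sizes(ch):
    """Return {encoding name: encoded bytes, or None if not representable}."""
    result = {}
    for name in ("ascii", "cp1252", "utf-8", "utf-16-le", "utf-32-le"):
        try:
            result[name] = ch.encode(name)
        except UnicodeEncodeError:
            result[name] = None  # this scheme has no code for the character
    return result


# Sample characters: a plain letter, the copyright sign, a curly quote, an emoji.
for ch in ("A", "\u00a9", "\u201c", "\U0001F600"):
    for name, data in encoded_sizes(ch).items():
        # ascii() keeps the output printable on any terminal code page.
        print(ascii(ch), name, data)
```

The `-le` variants are used so that no byte order mark is prepended, which keeps the byte counts easy to compare.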
Now, I would expect Python to consistently use one coding scheme for its built-in String type. I ran the following test, and the result worries me. I am starting to believe that Python does not always use the same coding scheme to store its strings internally. In other words: Python strings do not seem to be created equal.
EDIT:
I forgot to mention that I am using Python 3.x. Sorry :-)
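1. The test
I have two files: myAnsi.txt and myUtf.txt. The first is encoded in CP-1252, also known as ANSI. The second is encoded in utf-8. I read each file into a Python String, then write that String back to a new file: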
# Test A: read the CP-1252 (ANSI) file and write it back unchanged
import os
file = open(os.getcwd() + '\\myAnsi.txt', 'r')
fileText = file.read()
file.close()
file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
file.write(fileText)
file.close()
# Test B: the same round trip for the utf-8 file
import os
file = open(os.getcwd() + '\\myUtf.txt', 'r')
fileText = file.read()
file.close()
file = open(os.getcwd() + '\\outputUtf.txt', 'w')
file.write(fileText)
file.close()
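2. The result I expected
I expected Python to use a single coding scheme, utf-8, for its String type. So both output files should be utf-8 encoded: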
outputAnsi.txt -> utf-8 encoded
outputUtf.txt -> utf-8 encoded
3. The result I got
What I actually got:
outputAnsi.txt -> CP-1252 encoded (ANSI)
outputUtf.txt -> utf-8 encoded
So the String fileText apparently does not use the same coding scheme in both cases.
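The suspicion that a String carries the coding scheme of its source can be tested without any files. The following is a minimal sketch, assuming the `sample` text stands in for the file contents (it is not read from myAnsi.txt or myUtf.txt). If the two decoded strings compare equal, the difference between the output files would have to come from the file I/O layer rather than from the String itself.

```python
# Stand-in for the file contents; the real test files are not touched here.
sample = "\u00a9 COPYRIGHT(c) 2014 Ac6"

raw_cp1252 = sample.encode("cp1252")  # bytes as an ANSI file would hold them
raw_utf8 = sample.encode("utf-8")     # bytes as a utf-8 file would hold them

# Compare the raw bytes produced by the two schemes.
bytes_equal = raw_cp1252 == raw_utf8

# Decode each byte sequence with its own scheme and compare the Strings.
text_from_cp1252 = raw_cp1252.decode("cp1252")
text_from_utf8 = raw_utf8.decode("utf-8")
strings_equal = text_from_cp1252 == text_from_utf8

print("bytes equal:  ", bytes_equal)
print("strings equal:", strings_equal)
```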
EDIT:
Could it be that open() applies a "default" encoding, in my case cp1252, to *.txt files when none is specified?
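Whether open() falls back to a default encoding can be checked directly. This is a minimal sketch, assuming a Python 3 setup where the locale decides the default (UTF-8 mode or a newer Python version may behave differently). The scratch file path is hypothetical and exists only so the real test files stay untouched.

```python
import locale
import os
import tempfile

# The encoding the locale suggests for text files on this machine.
default_encoding = locale.getpreferredencoding(False)
print("locale default:", default_encoding)

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")

# No encoding argument, exactly as in my test above.
with open(path, "w") as f:
    used_implicit = f.encoding

# Explicit encoding, independent of the machine's locale.
with open(path, "r", encoding="utf-8") as f:
    used_explicit = f.encoding

print("open() without encoding used:", used_implicit)
print("open() with encoding='utf-8' used:", used_explicit)
```

If the first two printed values name cp1252 on my machine, that would support the idea that the output encoding is chosen by open(), not by the String.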
4. Questions
(1) , Python ? .
(2) -, Python , Python. Python Strings . ? , Python ?
(3) , Python , ? . , Python (!) .
5. System details (EDIT):
- Python: Python 3.x (Anaconda)
- OS: Windows 10
- Terminal: Windows cmd.exe
About fileText: print(fileText) fails for the ANSI file, as the traceback further down shows. To determine the encoding of the files I used two tools: Notepad++, and chared: https://nlp.fi.muni.cz/projects/chared/
outputAnsi.txt and outputUtf.txt did not exist beforehand. They are created by open(..) in 'w' mode.
6. File contents (EDIT):
The contents of the two files are shown below:
myAnsi.txt
/*
******************************************************************************
**
** File : LinkerScript.ld
**
** Author : Auto-generated by Ac6 System Workbench
**
** Abstract : Linker script for STM32F746NGHx Device from STM32F7 series
**
** Target : STMicroelectronics STM32
**
** Distribution: The file is distributed “as is,” without any warranty
** of any kind.
**
*****************************************************************************
** @attention
**
** <h2><center>© COPYRIGHT(c) 2014 Ac6</center></h2>
**
*****************************************************************************
*/
/* Entry Point */
/*ENTRY(Reset_Handler)*/
ENTRY(Default_Handler)
/* Highest address of the user mode stack */
_estack = 0x20050000; /* end of RAM */
_Min_Heap_Size = 0; /* required amount of heap */
_Min_Stack_Size = 0x400; /* required amount of stack */
/* Memories definition */
MEMORY
{
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 320K
ROM (rx) : ORIGIN = 0x8000000, LENGTH = 1024K
}
Printing fileText in the terminal:
>>> print(fileText)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 357: character maps to <undefined>
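The traceback above can be reproduced without the terminal or the file. A minimal sketch, assuming the failing character is the left curly quote U+201C named in the error message; the helper `try_encode` is a hypothetical name of mine.

```python
def try_encode(text, codec):
    """Encode text with codec; return the bytes, or the exception on failure."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return exc


quoted = "\u201cas is\u201d"  # curly quotes, as in the file header

# cp850 is the code page named in the traceback (encodings\cp850.py).
result_cp850 = try_encode(quoted, "cp850")
# cp1252 does have codes for both curly quotes.
result_cp1252 = try_encode(quoted, "cp1252")

print(type(result_cp850).__name__)
print(result_cp1252)

# Substituting unencodable characters avoids the exception.
print(quoted.encode("cp850", errors="replace"))
```

This suggests the error comes from converting the String to the terminal's code page when printing, which is a separate step from reading the file.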
Evaluating fileText directly in the interpreter:
>>> fileText
myUtf.txt
#include "clock_constants.h"
#include "../CMSIS/stm32f7xx.h"
#include "stm32f7xx_hal_rcc.h"
uint32_t SystemCoreClock = 16000000;
const uint8_t AHBPrescTable[16] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 7, 8, 9};