Why do we need the str type? Why not just byte strings?

Python 3 has Unicode strings (str) and bytes. We already have byte-string literals and methods. Why do we need two different types, rather than just byte strings in different encodings?

+5
4 answers

The answer to your question depends on the meaning of the word "need."

We definitely do not need the str type, in the sense that everything we can compute with the type we can also compute without it (as you know quite well from your well-worded question).

But we can also understand "need" in terms of convenience. Isn't it nice to have a sqrt function? Or log or exp or sin? You could write these yourself, but why bother? A standard library designer will add functions that are useful and convenient.

The same goes for the language itself. Do we "need" a while loop? Not really; we could use tail-recursive functions instead. Do we "need" lists? Tons of things in Python are not primitive. For that matter, do we "need" high-level languages? John von Neumann himself asked "why do you want more than machine language?"

The same is true of str and bytes. The str type, while not strictly needed, is nice, time-saving, and convenient to have. It gives us an interface as a sequence of characters, so that we can manipulate text character by character without:

  • having to write all the encoding and decoding logic ourselves, or
  • bloating the string interface with multiple sets of iterators, such as each_byte and each_char.

As you suspect, we could have a single type that exposes both a byte sequence and a character sequence (as Ruby's String class does). The Python designers wanted to separate the two usages into two distinct types. You can easily convert an object of one type into the other. By having two types, they are saying that separation of concerns (and usages) is more important than having fewer built-in types. Ruby makes a different choice.
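As a rough illustration of the two views and the conversion between them, here is a minimal sketch assuming Python 3 and UTF-8 as the encoding; the variable names are made up for the example:

```python
# Illustrative names only; UTF-8 is an assumption, any codec could be used.
text = "naïve"                    # str: a sequence of characters
data = text.encode("utf-8")       # bytes: a sequence of integers in 0..255

chars = list(text)                # iterating a str yields 1-character strings
byte_values = list(data)          # iterating bytes yields ints

round_trip = data.decode("utf-8") # converting back is a single call

print(len(text), chars)           # 5 characters
print(len(data), byte_values)     # 6 bytes, because 'ï' takes two bytes in UTF-8
```

Each type exposes exactly one kind of sequence, so there is no need for separate per-byte and per-character iterators on the same object.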

TL;DR: This is a matter of preference in language design: separating concerns into distinct types rather than into different methods on the same type.

+9

Because bytes should not be considered strings, and strings should not be considered bytes. Python 3 gets this right, no matter how jarring it feels to a new developer.

In Python 2.6, if I read data from a file and passed the "r" flag, the text would be interpreted in the current locale by default and come back as a string, while passing the "rb" flag would produce a series of bytes. Indexing the data is entirely different, and methods that take a str cannot be sure whether I am using bytes or a str. This gets worse because for ASCII data the two are often synonymous, which means that code that works in simple test cases or English locales will fail when it meets non-ASCII characters.

Thus, there was a conscious effort to ensure that bytes and strings were not the same thing: one is a sequence of raw bytes, and the other is a Unicode string whose internal representation is chosen to preserve O(1) indexing (ASCII, UCS-2, or UTF-32, depending on the data, I think).
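To make the difference concrete, here is a small sketch of how Python 3 keeps the two apart; the names are illustrative:

```python
text = "abc"      # str
data = b"abc"     # bytes with the same ASCII content

first_char = text[0]   # indexing a str gives a str of length 1
first_byte = data[0]   # indexing bytes gives an int

# Mixing the two is an error instead of a silent, locale-dependent guess.
try:
    text + data
    mix_error = None
except TypeError as exc:
    mix_error = exc

# They never compare equal, even when the content is pure ASCII.
same = (text == data)

print(repr(first_char), first_byte, type(mix_error).__name__, same)
```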

In Python 2, the unicode type was used to distinguish text from raw bytes; however, many users still treated str as text.

Or, to quote the Benevolent Dictator:

Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs. In future versions of Python, string objects will be used for holding character data. The bytes object will fulfil the role of a byte container. Eventually the unicode type will be renamed to str and the old str type will be removed.

tl;dr version: Forcing the separation of bytes and str makes coders aware of the difference, at the cost of short-term frustration but with better code in the long run. It is a conscious choice after years of experience: being made aware of the difference up front will save you days in the debugger later.

+5

Byte strings with different encodings are incompatible with each other, but before Python 3 nothing in the language reminded you of that fact. Mixing different character encodings turns out to be a surprisingly common problem in today's world, and it leads to far too many bugs.
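A short sketch of that incompatibility, assuming UTF-8 and Latin-1 as the two encodings (the choice of codecs and the variable names are just for illustration):

```python
word = "café"

as_utf8 = word.encode("utf-8")       # b'caf\xc3\xa9'
as_latin1 = word.encode("latin-1")   # b'caf\xe9'

# Same text, different bytes: the two byte strings are not interchangeable.
different = as_utf8 != as_latin1

# Reading UTF-8 bytes as Latin-1 "works" but silently produces garbage.
misread = as_utf8.decode("latin-1")

# Reading Latin-1 bytes as UTF-8 fails outright.
try:
    as_latin1.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(different, misread, decode_failed)
```

With a single str type, the encoding only matters at the boundary where bytes come in or go out.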

It is also often easier to work with whole characters, without having to worry that you just modified a byte and accidentally turned your 4-byte character into an invalid sequence.
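For instance, here is a minimal sketch of how a single-byte edit can corrupt a 4-byte character, assuming UTF-8; the character and the names are chosen only for illustration:

```python
face = "\U0001F600"               # one character (an emoji)
encoded = face.encode("utf-8")    # four bytes in UTF-8

# Overwrite one byte in the middle of the character.
broken = bytearray(encoded)
broken[1] = ord("A")

try:
    bytes(broken).decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# At the str level the character is one indivisible element.
text = "a" + face + "b"
middle = text[1]

print(len(encoded), still_valid, middle == face)
```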

+2

There are at least two reasons:

  • The str type has the important property of "one element = one character".

  • The str type is independent of the encoding.

Imagine how you would implement a simple operation, for example, reversing a string ( rword = word[::-1] ) if word were a byte string with some encoding.
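A quick sketch of that thought experiment, assuming the byte string is UTF-8 encoded (the sample word is arbitrary):

```python
word = "héllo"
rword = word[::-1]                  # reversing characters: 'olléh'

encoded = word.encode("utf-8")      # 'é' becomes the two bytes c3 a9
reversed_bytes = encoded[::-1]      # naive byte reversal puts a9 before c3

try:
    reversed_bytes.decode("utf-8")
    bytes_reversal_ok = True
except UnicodeDecodeError:
    bytes_reversal_ok = False

print(rword, bytes_reversal_ok)
```

Doing this correctly on bytes would require knowing the encoding and walking character boundaries by hand, which is the work str already does.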

+1

Source: https://habr.com/ru/post/1270091/

