How to port a python 2.6 project to UTF-8?

Question

We are moving from latin1 to UTF-8 and have 100k lines of Python code.

Plus I'm new to Python (ha ha ha!).

I already know that str() fails when it is given Unicode text containing non-ASCII characters, so we should use unicode() instead, which has almost the same effect.

What are the other “dangerous” places in the code?

Are there any basic recommendations / algorithms for switching to UTF-8? Can I write an automatic "code transformer"?

+4

python unicode

Dan Mar 18 '11 at 10:56

3 answers

Can I write an automatic "code transformer"? =)

No. str and unicode are two different types that have different goals. You should not try to replace every occurrence of a byte string with a Unicode string in either Python 2 or Python 3.

Continue to use byte strings for binary data. In particular, everything you write to a file or network socket is bytes. And use Unicode strings for text facing the user.

In between is a gray area of internal ASCII-only strings, which could equally well be bytes or Unicode. In Python 2 these are usually bytes; in Python 3, usually Unicode. If you are happy to limit your code to Python 2.6+, you can mark your definitely-bytes strings as b'' and bytes, your definitely-characters strings as u'' and unicode, and use '' and str for the "whatever string type is the default" strings.
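The three kinds of string described above might look like this in practice. This is only a minimal sketch: the variable names and the sample word are made up for illustration, and it uses only literals and methods that exist in Python 2.6+ as well as Python 3.3+.

```python
# Sketch only: names and sample data are illustrative assumptions.

# Definitely bytes: data as it arrives from a file or socket.
raw_latin1 = b'caf\xe9'              # the word "cafe" with e-acute, in latin1

# Definitely characters: text that will be shown to the user.
greeting = u'caf\xe9'

# Cross the bytes/text boundary explicitly, naming the encoding each time.
decoded = raw_latin1.decode('latin-1')   # bytes -> Unicode text
utf8_bytes = greeting.encode('utf-8')    # Unicode text -> bytes

# "Whatever is default" string: pure ASCII, fine as either type.
key = 'name'
```

Keeping every decode() and encode() call explicit like this makes the places where the old latin1 assumption lives easy to search for.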

+2

bobince Mar 18 '11 at 20:12

One quick way to move Python 2.x to UTF-8 is to set the default encoding to UTF-8. This approach has its drawbacks, chiefly that it changes the encoding for every library as well as for your application, so use it with caution. My company uses this technique in our production applications, and it suits us well. It is also forward-compatible with Python 3, which has UTF-8 as the default encoding. You will still have to change references to str() into unicode(), but you won't need to specify the encoding explicitly with .decode() and .encode().
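The answer does not spell out how the default encoding is set. A commonly used way in Python 2, and an assumption here about what is meant, is sys.setdefaultencoding(), which site.py removes at startup and which therefore needs a reload of sys first. A hedged sketch, guarded so that it does nothing on Python 3:

```python
import sys

# Assumption: "set the default encoding" refers to sys.setdefaultencoding.
# Python 2 only: site.py deletes sys.setdefaultencoding at startup,
# so the sys module has to be reloaded to get the function back.
if sys.version_info[0] == 2:
    reload(sys)  # builtin in Python 2; this branch never runs on Python 3
    sys.setdefaultencoding('utf-8')

# On Python 3 this is already 'utf-8' and cannot be changed.
default_encoding = sys.getdefaultencoding()
```

Because this affects implicit conversions in every imported library too, it is probably best done once, at the very start of the program.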

0

Jason R. Coombs Mar 18 '11 at 12:43

theheadofabroom · Accepted Answer · 2011-03-18T11:37:01+0000

str and unicode are classes, not functions. When you call str(u'abcd'), you are initializing a new string that takes u'abcd' as its argument. It just so happens that str() can be used to convert a string of either type to an ASCII str, which only works as long as the contents are pure ASCII.
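In Python 2, str() applied to a unicode object encodes it implicitly with the default codec, which is ASCII unless it has been changed. The sketch below makes that step explicit so that it behaves the same on Python 2.6+ and Python 3; the variable names are illustrative only.

```python
# Sketch only: reproduces explicitly what Python 2's str(text) does implicitly.
text = u'caf\xe9'                    # contains one non-ASCII character

try:
    text.encode('ascii')             # the implicit step behind str(text) in Python 2
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True              # non-ASCII text cannot become an ASCII str

# Naming the target encoding avoids the failure.
utf8_bytes = text.encode('utf-8')
```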

Other areas to watch are reading from a file or other input, and basically anything you get back as a string from a function that was not written with Unicode in mind.
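For the file-reading case, one option is io.open(), which is available from Python 2.6 and returns Unicode text when given an encoding. The helper names below (read_text, transcode_file) are hypothetical and only sketch how existing latin1 data files could be read and rewritten as UTF-8.

```python
import io


def read_text(path, encoding='utf-8'):
    """Return the file's contents as Unicode text, decoded with `encoding`."""
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


def transcode_file(src, dst, src_encoding='latin-1', dst_encoding='utf-8'):
    """Read `src` in one encoding and write the same text to `dst` in another."""
    text = read_text(src, src_encoding)
    with io.open(dst, 'w', encoding=dst_encoding) as f:
        f.write(text)
```

Calling transcode_file('old.txt', 'new.txt') would decode old.txt as latin1 and write the same characters to new.txt as UTF-8; both file names are placeholders.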

Enjoy :)
