String Identity Comparison in CPython

I recently discovered a potential error in a production system where two strings were compared using the identity operator, for example:

if val[2] is not 's': 

I suppose this often works anyway because, as far as I know, CPython stores short immutable strings in the same place. I have replaced it with !=, but I need to confirm that the data that previously went through this code is correct, so I would like to know whether this always works or only sometimes works.

The Python version has always been 2.6.6, as far as I know, and the code above is the only place where the is operator was used.

Does anyone know if this line will always work as the programmer planned?

Edit: Since this is admittedly very specific and of little use to future readers, let me ask a broader question:

Where should I look to confirm with absolute certainty how a Python implementation behaves? Are the optimizations in the CPython source code easy to digest? Any tips?

+4
5 answers

You can see the CPython code for 2.6.x: http://svn.python.org/projects/python/branches/release26-maint/Objects/stringobject.c

It seems that single-character strings are handled specially, and each distinct single-character string exists only once, so your code is safe. Here is the key excerpt:

    static PyStringObject *characters[UCHAR_MAX + 1];

    PyObject *
    PyString_FromStringAndSize(const char *str, Py_ssize_t size)
    {
        register PyStringObject *op;
        if (size == 1 && str != NULL &&
            (op = characters[*str & UCHAR_MAX]) != NULL)
        {
            Py_INCREF(op);
            return (PyObject *)op;
        }
        ...
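Rather than relying only on reading the C source, the effect of such a cache can be probed empirically on the interpreter that actually runs the code. The following is a minimal sketch, not a proof: the helper names `single_char_paths` and `identity_report` are made up for illustration, the list of construction paths is far from exhaustive, and the identity results depend on the interpreter and version, so no particular output is claimed here.

```python
import sys


def single_char_paths(ch):
    """Build the same one-character string in several different ways."""
    padded = ch + "x"
    return {
        "slice": padded[:1],
        "index": padded[0],
        "join": "".join([ch]),
        "strip": (" " + ch + " ").strip(),
        "chr": chr(ord(ch)),
    }


def identity_report(ch):
    """Map each construction path to an (equal, identical) pair versus ch."""
    return dict(
        (name, (obj == ch, obj is ch))
        for name, obj in single_char_paths(ch).items()
    )


if __name__ == "__main__":
    # Results are only meaningful for the interpreter printed here.
    print(sys.version)
    for name, (equal, identical) in sorted(identity_report("s").items()):
        print("%-6s equal=%s identical=%s" % (name, equal, identical))
```

Every path should report equal=True; any path that reports identical=False is one where an `is` comparison would have given the wrong answer.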
+3

Of course, you should not use the is / is not operators when you only want to compare the values of two objects rather than check whether they are the same object.

Although it would make sense for Python never to create a new string object with the same contents as an existing one (since strings are immutable), which would make equality and identity equivalent, I would not rely on this, especially given how many Python implementations exist.
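To make the distinction concrete, here is a small sketch that compares a string assembled at run time with an equal literal. The variable names are illustrative only. The equality result is guaranteed by the language; the identity result is an implementation detail (on CPython the two are typically distinct objects, but nothing promises that).

```python
parts = ["he", "llo"]
built = "".join(parts)  # assembled at run time
literal = "hello"       # constant from the source code

values_match = (built == literal)  # compares contents: always True here
same_object = (built is literal)   # compares identity: implementation-dependent

print("equal:     %s" % values_match)
print("identical: %s" % same_object)
```

Only the first comparison expresses "these two strings have the same contents", which is what the original condition was meant to test.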

+4

As already noted, this should always be true for strings created in Python itself (or in CPython, anyway), but it won't hold if you use a C extension.

As a quick counterexample:

    import numpy as np

    x = 's'
    y = np.array(['s'], dtype='|S1')

    print x
    print y[0]
    print 'x is y[0] -->', x is y[0]
    print 'x == y[0] -->', x == y[0]

This gives:

    s
    s
    x is y[0] --> False
    x == y[0] --> True

Of course, if nothing ever used a C extension, you would probably be safe, but I would not count on it.

Edit: as an even simpler example, it also fails if anything has been pickled or packed with struct along the way.

For example:

    import pickle

    x = 's'
    pickle.dump(x, file('test', 'w'))
    y = pickle.load(file('test', 'r'))

    print x is y
    print x == y

The same goes for struct (using a different letter for clarity, since "s" is needed for the format string):

    import struct

    x = 'a'
    y = struct.pack('s', x)

    print x is y
    print x == y
+3

This behavior will always apply to empty and single-character strings. From unicodeobject.c:

    PyObject *PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
    {
        .....
        /* Single character Unicode objects in the Latin-1 range are
           shared when using this constructor */
        if (size == 1 && *u < 256) {
            unicode = unicode_latin1[*u];

This snippet is from Python 3, but a similar optimization probably exists in earlier versions.

+2

I suppose this works because of the automatic interning of short strings (and of constants in Python source, such as the literal 's'), but it's pretty silly to use an identity comparison here.

Python is duck-typed: any object that looks like a string can be used, so, for example, the same code gives the wrong result if val[2] is actually u"s".
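The duck-typing point can be sketched without a separate unicode type (for instance on Python 3) by using a str subclass. `Tag` below is a hypothetical class invented for this illustration; it stands in for any string-like value, such as u"s" on Python 2, that compares equal to 's' without being the very same object.

```python
class Tag(str):
    """Hypothetical str subclass; stands in for any string-like value."""


expected = "s"
val = ["a", "b", Tag("s")]

# A Tag instance is a different object from the plain str, so the
# identity test reports a mismatch even though the contents agree.
by_identity = val[2] is not expected  # True
by_value = val[2] != expected         # False

print("is not -> %s" % by_identity)
print("!=     -> %s" % by_value)
```

With `is not`, the branch would be taken for a value that the programmer clearly meant to treat as 's'; with `!=` it is not.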

0

Source: https://habr.com/ru/post/1337808/

