It is true that R stores all strings in a "global string cache", which serves as a central dictionary of every string your program has ever used. This has advantages: the data takes up less memory, and certain algorithms (such as radix sort) can exploit this structure to achieve higher speeds. This is especially true in scenarios like yours, where the number of unique strings is small relative to the size of the vector. On the other hand, it also has drawbacks: a global string cache prevents multithreaded write access to string data.
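To make the low-cardinality intuition concrete, here is a minimal Python sketch (a hypothetical counting_sort_strings helper; this is not the algorithm R or datatable actually implement internally): when only a few distinct values occur, you can sort the distinct keys and expand them by their counts instead of comparing every pair of elements.

    import random
    from collections import Counter

    def counting_sort_strings(xs):
        # Count occurrences of each distinct string: O(n).
        counts = Counter(xs)
        # Sort only the u distinct keys, then expand each by its count:
        # O(n + u log u) overall, versus O(n log n) for a comparison sort.
        out = []
        for key in sorted(counts):
            out.extend([key] * counts[key])
        return out

    data = ["%x" % random.getrandbits(10) for _ in range(10**6)]
    assert counting_sort_strings(data) == sorted(data)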
In Python, afaik, only string literals are interned. For instance:
    >>> 'abc' is 'abc'
    True
    >>> x = 'ab'
    >>> (x + 'c') is 'abc'
    False
In practice, this means that unless the data is embedded directly in the program text as literals, none of it will be interned.
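That said, CPython does let you intern strings manually via sys.intern(). A small sketch of forcing a runtime-built string into the interned pool (the variable names here are just for illustration):

    import sys

    x = 'ab'
    y = x + 'c'                # built at runtime, so not interned
    z = sys.intern('abc')      # canonical interned copy of 'abc'

    print(y is z)              # False: y is a separate object (CPython)
    print(sys.intern(y) is z)  # True: interning maps equal strings to one object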
Now, to your original question: "What is the fastest way to sort strings in Python?" You can achieve very good speed, comparable to R, with the Python datatable package. Here is a benchmark that sorts N = 10**8 strings randomly selected from a set of 1024 unique values:
    import datatable as dt
    import pandas as pd
    import random
    from time import time

    n = 10**8
    # 10**8 strings drawn from at most 1024 distinct hex values
    src = ["%x" % random.getrandbits(10) for _ in range(n)]
    f0 = dt.Frame(src)
    p0 = pd.DataFrame(src)
    f0.to_csv("test1e8.csv")   # saved so the same data can be loaded in R

    t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time() - t0))
    t0 = time(); src.sort(); print("list.sort: %.3fs" % (time() - t0))
    t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time() - t0))
This produces:
    datatable: 1.465s / 1.462s / 1.460s (multiple runs)
    list.sort: 44.352s
    pandas:    395.083s
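If you want to verify that the methods agree before trusting the timings, a quick sanity check on a small slice might look like this (assuming the same session as above, and a datatable version that provides Frame.to_list(), which returns the frame's columns as Python lists):

    # Compare datatable's sort against Python's built-in on a small sample.
    small = src[:1000]
    assert dt.Frame(small).sort(0).to_list()[0] == sorted(small)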
The same dataset in R (v3.4.2):
    > require(data.table)
    > DT = fread("test1e8.csv")
    > system.time(sort(DT$C1, method="radix"))
       user  system elapsed
      6.238   0.585   6.832
    > system.time(DT[order(C1)])
       user  system elapsed
      4.275   0.457   4.738
    > system.time(setkey(DT, C1))