What is the fastest way to sort strings in Python if locale is not a problem?

Question

What is the fastest way to sort strings in Python if locale is not a problem?

I tried to find a quick way to sort strings in Python, and locale is irrelevant, that is, I just want to sort the array lexically according to base bytes. This is perfect for something like radix sorting. Here is my MWE

import numpy as np import timeit # randChar is workaround for MemoryError in mtrand.RandomState.choice # http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice def randChar(f, numGrp, N) : things = [f%x for x in range(numGrp)] return [things[x] for x in np.random.choice(numGrp, N)] N=int(1e7) K=100 id3 = randChar("id%010d", N//K, N) # small groups (char) timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds

As you can see, it took 6.8 seconds, which is almost 10 times slower than the R-radix below.

 N = 1e7 K = 100 id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE) system.time(sort(id3,method="radix"))

I understand that Python .sort() does not use radix sorting, is there an implementation somewhere that allows me to sort strings just like R?

AFAIK both R and Python strings, so any optimizations in R can also be done in Python.

The top google result for "radix python string strings" is this gist , which caused an error while sorting by my test array.

+5

python string sorting

xiaodai Dec 31 '17 at 1:57

source share

1 answer

Pasha · Answer 1 · 2018-03-21T04:44:41+0000

It is true that R puts all lines, that is, it has a "global character cache", which serves as the central dictionary of all lines ever used by your program. This has its advantages: data takes up less memory, and some algorithms (such as radix sorting) can use this structure to achieve higher speeds. This is especially true for scenarios, for example, in your example, where the number of unique lines is small relative to the size of the vector. On the other hand, it also has its drawbacks: a global character cache prevents multithreaded write access to character data.

In Python, afaik, only string literals are interned. For instance:

  >>> 'abc' is 'abc' True >>> x = 'ab' >>> (x + 'c') is 'abc' False

In practice, this means that if you have not embedded the data directly in the program text, nothing will be interned.

Now, for your initial question: "What is the fastest way to sort strings in python"? You can achieve very good speed comparable to R with the python datatable . Here we use a template that sorts N = 10ings strings randomly selected from a set of 1024:

 import datatable as dt import pandas as pd import random from time import time n = 10**8 src = ["%x" % random.getrandbits(10) for _ in range(n)] f0 = dt.Frame(src) p0 = pd.DataFrame(src) f0.to_csv("test1e8.csv") t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0)) t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0)) t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time()-t0))

What produces:

 datatable: 1.465s / 1.462s / 1.460s (multiple runs) list.sort: 44.352s pandas: 395.083s

The same dataset in R (v3.4.2):

 > require(data.table) > DT = fread("test1e8.csv") > system.time(sort(DT$C1, method="radix")) user system elapsed 6.238 0.585 6.832 > system.time(DT[order(C1)]) user system elapsed 4.275 0.457 4.738 > system.time(setkey(DT, C1)) # sort in-place user system elapsed 3.020 0.577 3.600

What is the fastest way to sort strings in Python if locale is not a problem?

More articles: