Accepting the suggestions of others above, I flipped my own efficient code. As noted by Niklas B., the main improvement is a decrease from O (n ^ 3) to O (n ^ 2).
from __future__ import division
import itertools
def jack(a,b):
return len(a & b) / len(a | b)
def jacksort(t):
sts = [(i, it, set(it.split())) for i, it in enumerate(t)]
os = set()
out = []
for index, val1, sval1 in sts:
if not val1 in os:
out.append(val1)
os.add(val1)
for index2, val2, sval2 in itertools.islice(sts, index+1, len(sts)):
if val2 in os: continue
if jack(sval1, sval2) >= 0.5:
out.append(val2)
os.add(val2)
return out
def main(n=100):
t=['bancorp', 'bancorp', 'bancorp ali', 'bancorp puno', 'bancorp amo',
'gas eu', 'gas', 'profeta', 'bancorp america', 'uni', 'gas for',
'gas tr']
t += [" ".join(w.split()) for w in open("/usr/share/dict/words").read().split()]
t = t[:n]
jacksort(t)
n - . /usr/share/dict/words:/usr/share/dict/american-english debian package wamerican version 7.1-1.
:
10: 10 loops, best of 3: 30.6 msec per loop
20: 10 loops, best of 3: 28.9 msec per loop
50: 10 loops, best of 3: 29.6 msec per loop
100: 10 loops, best of 3: 31.7 msec per loop
200: 10 loops, best of 3: 38.7 msec per loop
500: 10 loops, best of 3: 85.1 msec per loop
1000: 10 loops, best of 3: 261 msec per loop
2000: 10 loops, best of 3: 1.01 sec per loop
5000: 10 loops, best of 3: 6.16 sec per loop
10000: 10 loops, best of 3: 25.3 sec per loop
, :
10: 10 loops, best of 3: 34.1 msec per loop
20: 10 loops, best of 3: 34.3 msec per loop
50: 10 loops, best of 3: 33.9 msec per loop
100: 10 loops, best of 3: 43.1 msec per loop
200: 10 loops, best of 3: 74.9 msec per loop
500: 10 loops, best of 3: 415 msec per loop
1000: 10 loops, best of 3: 2.35 sec per loop
2000: 10 loops, best of 3: 14.8 sec per loop
5000: [did not finish while preparing this post]
timeit :
$ for i in 10 20 50 100 200 500 1000 2000 5000 10000; do printf "%5d: " $i; python -mtimeit 'import ojack as jack' 'jack.main('$i')'; done
Python 2.7.5-5 Debian amd64, i5-3320m. , -, , , Big-O.
, "" - , 5-7 . , , , . , n = 30000 393 . , O (n ^ 2) (25.3 * 9 = 227.7), - log n, -, O (1) Python.