The Byte Pair Encoding algorithm has a replacement step, where it merges the most frequent pair of adjacent symbols into a single new symbol.
I.e., given a list of str tuples like this:
[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
And the string tuple: ('i', 's')
How do I process the list so that it iterates through all the tuples and replaces the pair ('i', 's') with the merged string 'is'? That is, the output should look something like this:
[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
I tried this:
>>> cin = [('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
>>> qtuple = ('i', 's')
>>> [tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]
[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
But is there a more efficient way than looping through each word, converting it to a string so that I can replace and split it, and then turning it back into a tuple? Would a regex replacement be faster? And is there a way to work with the list of tuples directly, without going through strings?
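To make the last question concrete, here is a minimal sketch of what I mean by working on the tuples directly; merge_pair is just an illustrative name, not anything from a library, and I have not benchmarked it against the join/replace/split version above:

def merge_pair(word, pair, merged):
    # word:   a tuple of symbols, e.g. ('t', 'h', 'i', 's', '\ue000')
    # pair:   the bigram to merge, e.g. ('i', 's')
    # merged: the replacement symbol, e.g. 'is'
    out = []
    i = 0
    while i < len(word):
        # if the pair starts at position i, emit the merged symbol and skip both elements
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

cout = [merge_pair(word, ('i', 's'), 'is') for word in cin]

Both versions are linear scans over each word, but the string version pushes the inner loop into C, so it is not obvious to me which one wins.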
I have tried the following, and it seems that the str.replace step is not the problem. What really takes time is counting the bigrams and extracting the most frequent one:
import io
from collections import Counter
import time

infile = 'big.txt'  # comes from norvig.com/big.txt
n = 2

with io.open(infile, 'r', encoding='utf8') as fin:
    # mark the original spaces with an unused private-use character
    text = fin.read().lower().replace(u' ', u"\uE000")

for j in range(1, 6400):
    unused_char = unichr(ord(u'\uE001') + j)

    # collect all character bigrams
    start = time.time()
    char_bigrams = zip(*[text[i:] for i in range(n)])
    bigram_time = time.time() - start

    # count the bigrams (skipping those that cross a space or newline)
    # and take the most frequent one
    start = time.time()
    most_freq_bigram = Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]
    max_time = time.time() - start

    # replace every occurrence of that bigram with a fresh unused character
    start = time.time()
    text = text.replace(''.join(most_freq_bigram), unused_char)
    replace_time = time.time() - start

    print j, ''.join(most_freq_bigram), most_freq_bigram, bigram_time, max_time, replace_time

print text
This was tested on norvig.com/big.txt
[out] (the columns are j, the merged bigram, the bigram tuple, bigram_time, max_time, replace_time):
1 th (u't', u'h') 0.896255016327 3.28389787674 0.0253069400787
2 e (u'\ue002', u'e') 1.47053217888 3.16544914246 0.0280749797821
3 in (u'i', u'n') 1.13404297829 3.10529899597 0.0245559215546
4 an (u'a', u'n') 1.20013689995 3.63801002502 0.0242891311646
5 er (u'e', u'r') 1.41387891769 3.13376092911 0.0237591266632
6 on (u'o', u'n') 1.22826981544 3.06997895241 0.0227301120758
7 re (u'r', u'e') 1.21916294098 2.97599196434 0.0238041877747
8 at (u'a', u't') 1.14608097076 2.97988891602 0.0226521492004
9 en (u'e', u'n') 1.20747494698 2.88649988174 0.019054889679
10 ed (u'e', u'd') 1.16296696663 2.8995718956 0.0198271274567
11 is (u'i', u's') 1.17692494392 3.02292394638 0.0228500366211
12 d (u'\ue005', u'd') 1.13779211044 2.85169506073 0.0229239463806
I have already experimented with scikit-learn's CountVectorizer and it does not seem as fast as using zip; see Fast / Optimize N-gram implementation in python.
In addition, the filter inside the Counter step adds even more time; the Counter operation alone takes about 3 seconds per iteration =(

How else can I optimize this operation?
Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]
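For reference, here is an equivalent formulation I have been considering, using a generator expression instead of filter plus a lambda; I assume it produces the same result, and since it still makes a full pass over every bigram I do not expect it to change the overall complexity:

from collections import Counter

# same counting step, but with a generator expression instead of filter + lambda;
# this avoids one Python-level function call per bigram, yet it still scans
# every bigram on every iteration, which is where the ~3 seconds go
counts = Counter(bg for bg in char_bigrams
                 if u"\uE000" not in bg and '\n' not in bg)
most_freq_bigram = counts.most_common(1)[0][0]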