How to count Byte Pair Encoding bigrams and replace them efficiently in Python?

The Byte Pair Encoding algorithm has a replacement step, where it merges a chosen pair of adjacent symbols (separated by spaces) into a single new symbol.

I.e., given a list of str tuples like this:

 [('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')] 

And the string tuple: ('i', 's')

How do I process the list so that it iterates through all the word tuples and replaces the pair ('i', 's') with the merged string 'is'? That is, the output should look something like this:

 [('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')] 

I tried this:

 >>> cin = [('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
 >>> qtuple = ('i', 's')
 >>> [tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]
 [('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]

But is there a more efficient way than looping through each word, converting it to a string, replacing and splitting it, and converting it back to a tuple?

Would a regex replacement be faster? Is there a way to work with the list of tuples directly, without going through strings?
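
To make it concrete, this is the kind of tuple-only approach I have in mind (a rough sketch; merge_pair is just a name I made up, not from any library):

 def merge_pair(word, pair):
     """Merge every occurrence of `pair` inside a single word tuple."""
     merged = ''.join(pair)
     out = []
     i = 0
     while i < len(word):
         # if the next two symbols form the pair, emit the merged symbol
         if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
             out.append(merged)
             i += 2
         else:
             out.append(word[i])
             i += 1
     return tuple(out)

 # e.g. [merge_pair(w, ('i', 's')) for w in cin]
 # gives [('t', 'h', 'is', '\ue000'), ..., ('is', '\ue000'), ...]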


I have timed this, and it seems that the string replacement with str.replace is not the problem. What is really slow is counting the bigrams and extracting the most frequent one:

 import io
 from collections import Counter
 import time

 infile = 'big.txt'  # comes from norvig.com/big.txt
 n = 2

 with io.open(infile, 'r', encoding='utf8') as fin:
     text = fin.read().lower().replace(u' ', u"\uE000")

 for j in range(1, 6400):
     unused_char = unichr(ord(u'\uE001') + j)

     # time building the character bigrams
     start = time.time()
     char_bigrams = zip(*[text[i:] for i in range(n)])
     bigram_time = time.time() - start

     # time finding the most frequent bigram, skipping word boundaries and newlines
     start = time.time()
     most_freq_bigram = Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x,
                                       char_bigrams)).most_common(1)[0][0]
     max_time = time.time() - start

     # time replacing that bigram with a fresh unused character
     start = time.time()
     text = text.replace(''.join(most_freq_bigram), unused_char)
     replace_time = time.time() - start

     print j, ''.join(most_freq_bigram), most_freq_bigram, bigram_time, max_time, replace_time

 print text

This was tested on norvig.com/big.txt

[out]:

 1 th (u't', u'h') 0.896255016327 3.28389787674 0.0253069400787
 2 e (u'\ue002', u'e') 1.47053217888 3.16544914246 0.0280749797821
 3 in (u'i', u'n') 1.13404297829 3.10529899597 0.0245559215546
 4 an (u'a', u'n') 1.20013689995 3.63801002502 0.0242891311646
 5 er (u'e', u'r') 1.41387891769 3.13376092911 0.0237591266632
 6 on (u'o', u'n') 1.22826981544 3.06997895241 0.0227301120758
 7 re (u'r', u'e') 1.21916294098 2.97599196434 0.0238041877747
 8 at (u'a', u't') 1.14608097076 2.97988891602 0.0226521492004
 9 en (u'e', u'n') 1.20747494698 2.88649988174 0.019054889679
 10 ed (u'e', u'd') 1.16296696663 2.8995718956 0.0198271274567
 11 is (u'i', u's') 1.17692494392 3.02292394638 0.0228500366211
 12 d (u'\ue005', u'd') 1.13779211044 2.85169506073 0.0229239463806

I have already experimented with scikit-learn's CountVectorizer and it did not seem as fast as using zip; see Fast/Optimize N-gram implementation in python.

In addition, even without the filter, the Counter step took a lot of time. The Counter operation takes about 3 seconds per iteration =(

How else can you optimize this operation?

 Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0] 
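
One thing I have been meaning to try (not benchmarked, just a sketch; most_freq_bigram is my own name for it): passing a generator expression to Counter instead of filter with a lambda, so no intermediate filtered list has to be built:

 from collections import Counter

 def most_freq_bigram(text, sep=u'\ue000'):
     # count overlapping character pairs, skipping pairs that cross a word
     # boundary or a newline, without building an intermediate filtered list
     counts = Counter(p for p in zip(text, text[1:])
                      if sep not in p and '\n' not in p)
     return counts.most_common(1)[0][0]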
+6
3 answers

If your string tuple is always of length 2, you can do it like this:

 def cons_2(word_list, t):
     j = ''.join(t)
     f = lambda acc, e: acc[:-1] + (j,) if (acc[-1] == t[0] and e == t[1]) else acc + (e,)
     return [reduce(f, i[1:], (i[0],)) for i in word_list]

 print cons_2(cin, ('i', 's'))

No string replacement is required: f is applied to each word i, the value of cin is not changed, and a new list is created and returned.

More details:

  • reduce applies f to each element of the tuple i and accumulates the result in acc.
  • The arguments to reduce:
    • f: the function to apply.
    • i[1:]: the sequence to iterate over, i.e. all elements except the first.
    • (i[0],): the initial value of the accumulator, a tuple containing the first element of the input tuple i.
  • f: a lambda function taking the accumulator acc and the current element e as inputs:
    • If the last element of the accumulator equals the first element of the string tuple, and the current element e equals the second element of the string tuple, return acc[:-1] + (j,); otherwise continue with normal concatenation: acc + (e,). (A short trace follows this list.)
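
As a sanity check, here is a trace of the same fold on a single word (a small sketch; on Python 3 reduce has to be imported from functools):

 from functools import reduce  # built in on Python 2

 t = ('i', 's')
 j = ''.join(t)
 f = lambda acc, e: acc[:-1] + (j,) if (acc[-1] == t[0] and e == t[1]) else acc + (e,)

 word = ('t', 'h', 'i', 's', '\ue000')
 print(reduce(f, word[1:], (word[0],)))
 # ('t', 'h', 'is', '\ue000')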

For string tuples of length > 2 the idea is the same, but we must take the tuple length l into account.

 def cons_n(word_list, t):
     l = len(t)
     j = ''.join(t)
     f = lambda acc, e: acc[:-l] + (j, e,) if acc[-l:] == t or acc[:l] == t else acc + (e,)
     return [reduce(f, i[l:], (i[:l])) for i in word_list]

 print cons_n(cin, ('i', 's'))

This should work with n-length tuples.

More details:

  • The same process as above, but using l: reduce applies f to the remaining elements i[l:], and the initial value of the accumulator is a tuple with the first l elements: (i[:l]).
  • Check whether the last l elements or the first l elements of the accumulator equal the string tuple t; if so, insert the joined symbol: acc[:-l] + (j, e,); otherwise continue with normal concatenation: acc + (e,).
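
For example, calling it with a made-up length-3 sequence (the words here are invented just to show the call):

 words = [('s', 'e', 'n', 't', '\ue000'), ('e', 'n', 't', 'e', 'r', '\ue000')]
 print(cons_n(words, ('e', 'n', 't')))
 # [('s', 'ent', '\ue000'), ('ent', 'e', 'r', '\ue000')]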

This is a functional approach in which the data is not mutated but generated, so using several processes simultaneously should be safe (in theory; I am not an expert on the Python interpreter).
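
As an illustration of that point, a rough sketch only (it assumes cons_n lives in an importable module, and I have not benchmarked whether the process overhead is worth it):

 from functools import partial
 from multiprocessing import Pool

 def cons_n_parallel(word_list, t, processes=4):
     # split the word list into chunks, merge the same pair in each chunk
     # in a separate process, then stitch the results back together
     chunk = max(1, len(word_list) // processes)
     chunks = [word_list[i:i + chunk] for i in range(0, len(word_list), chunk)]
     pool = Pool(processes)
     try:
         parts = pool.map(partial(cons_n, t=t), chunks)
     finally:
         pool.close()
         pool.join()
     return [w for part in parts for w in part]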

If the code above looks too strange for people not used to functional programming, here is a different approach:

 def cons_n_iter(tuple_list, tuple_seq):
     jnt = ''.join(tuple_seq)
     lnt = len(tuple_seq)
     res = []
     for word in tuple_list:
         acc = (word[:lnt])
         for letter in word[lnt:]:
             if acc[-lnt:] == tuple_seq or acc[:lnt] == tuple_seq:
                 acc = acc[:-lnt] + (jnt, letter,)
             else:
                 acc += (letter,)
         res += (acc,)
     return res

 print cons_n_iter(cin, ('i', 's'))

The logic is the same as in the functional approach, with the same use of the accumulator. In this case the res accumulator is explicit, because in the examples above reduce took care of it.

+3

Your original code:

 [tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin] 

I will expand it to make it easier to see what is happening:

 result = []
 qtuple = ("i", "s")
 for i in cin:
     f = " ".join(qtuple)
     r = "".join(qtuple)
     word = ' '.join(i)
     word = word.replace(f, r)
     word = word.split()
     word = tuple(word)
     result.append(word)
 print(result)

Look for things that you can move outside the loop. We can pre-compute the replacement strings instead of computing them for each word:

 find = " ".join(qtuple)
 replacement = "".join(qtuple)
 result = []
 # this will join and split each word once
 for i in cin:
     word = " ".join(i)
     # if you had multiple replacements to do, they should be in an inner loop here
     word = word.replace(find, replacement)
     result.append(tuple(word.split(" ")))
 print(result)

Perhaps someone else can speak to the relative efficiency of str.replace and re.sub. Personally, I tend to avoid regular expressions when a simple replace will do, just for readability.
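
For what it is worth, here is a quick way to compare the two on this workload yourself (a rough sketch with timeit; the numbers will vary by machine):

 import re
 import timeit

 word = ' '.join(('t', 'h', 'i', 's', '\ue000'))
 find, repl = 'i s', 'is'
 pattern = re.compile(re.escape(find))

 print(timeit.timeit(lambda: word.replace(find, repl), number=100000))
 print(timeit.timeit(lambda: pattern.sub(repl, word), number=100000))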

Further efficiency gains can be realized by changing the input data structure. If the replacement symbols are single characters, we can use a string instead of a list of tuples and avoid any joins within the loop:

 result = []
 replacements = [("\ue000", "X"), ("is", "Z")]
 s = "".join(["".join(t) for t in cin])
 for f, r in replacements:
     s = s.replace(f, r)
 print(s)
 # output: thZXcorpusXinXtxtfileXtheXsentenceXbarXandXZXfooXfirstXaX.X

I think the question needs some added requirements to explain why the tuple-based data structure is beneficial. From an efficiency point of view, and in the context of a byte pair encoding algorithm, a string makes much more sense to me.

+4

Is this what you need? Using re:

 import re, ast

 cin = [('t', 'h', "i", 's', '\ue000'), ('c', 'i', 's', 'p')]
 cin = re.sub(r"i'[,\s]+'s", r"is", str(cin))
 cin = ast.literal_eval(cin)
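
For reference, after this runs, cin should come back as a list of tuples with the pair merged (the expected result of the substitution, not measured output):

 print(cin)
 # [('t', 'h', 'is', '\ue000'), ('c', 'is', 'p')]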
+2
