Longest generic buffer prefix?

If I have an input line and an array:

s = "to_be_or_not_to_be" pos = [15, 2, 8] 

I am trying to find the longest common prefix between consecutive elements of the pos array that reference the original s . I am trying to get the following output:

 longest = [3,1] 

How I got this calculates the longest common prefix of the following pairs:

  • s[15:] , which is equal to _be and s[2:] , which is equal to _be_or_not_to_be , giving 3 ( _be )
  • s[2:] , which is equal to _be_or_not_to_be and s[8:] , which is equal to _not_to_be , giving 1 ( _ )

However, if s is huge, I do not want to create multiple copies when I do something like s[x:] . After hours of searching, I found the buffer function that only supports one copy of the input string, but I was not sure what the most efficient way to use it here is in this context. Any suggestions on how to achieve this?

+6
source share
4 answers

Here is a method without buffer that does not copy as it only looks at one character at a time:

 from itertools import islice, izip s = "to_be_or_not_to_be" pos = [15, 2, 8] length = len(s) for start1, start2 in izip(pos, islice(pos, 1, None)): pref = 0 for pos1, pos2 in izip(xrange(start1, length), xrange(start2, length)): if s[pos1] == s[pos2]: pref += 1 else: break print pref # prints 3 1 

I use islice , izip and xrange if you are talking about potentially very long lines.

I also could not resist this "One Liner", which does not even require indexing:

 [next((i for i, (a, b) in enumerate(izip(islice(s, start1, None), islice(s, start2, None))) if a != b), length - max((start1, start2))) for start1, start2 in izip(pos, islice(pos, 1, None))] 

One final method using os.path.commonprefix :

 [len(commonprefix((buffer(s, n), buffer(s, m)))) for n, m in zip(pos, pos[1:])] 
+2
source
 >>> import os >>> os.path.commonprefix([s[i:] for i in pos]) '_' 

Let Python manage memory for you. Do not optimize prematurely.

To get the exact result you could do (as @agf suggested ):

 print [len(commonprefix([buffer(s, i) for i in adj_indexes])) for adj_indexes in zip(pos, pos[1:])] # -> [3, 1] 
+2
source

I think your concern for copies is unfounded. See below:

 >>> s = "how long is a piece of string...?" >>> t = s[12:] >>> print t a piece of string...? >>> id(t[0]) 23295440 >>> id(s[12]) 23295440 >>> id(t[2:20]) == id(s[14:32]) True 

If you do not copy the slices and do not leave links to the copied copies, I would not think that this could cause problems.


edit: there are technical details with the Internet and stuff that I don’t quite understand on myself. But I'm sure a string slice is not always :

 >>> x = 'google.com' >>> y = x[:] >>> x is y True 

I guess the answer I'm trying to give is just to let python manage its memory, for starters, you can look at the memory buffers and views if necessary. And if this is already a real problem for you, please update your question with detailed information about what the real problem is.

+1
source

One way to use buffer is below. However, there can be much faster ways.

 s = "to_be_or_not_to_be" pos = [15, 2, 8] lcp = [] length = len(pos) - 1 for index in range(0, length): pre = buffer(s, pos[index]) cur = buffer(s, pos[index+1], pos[index+1]+len(pre)) count = 0 shorter, longer = min(pre, cur), max(pre, cur) for i, c in enumerate(shorter): if c != longer[i]: break else: count += 1 lcp.append(count) print print lcp 
0
source

Source: https://habr.com/ru/post/901169/


All Articles