Check if a line contains n consecutive words of another line

Say we have line 1, A B C D E F, and line 2, B D E (the letters are for demonstration purposes only; they are actually words). I would like to know whether any n consecutive "words" from line 2 appear in line 1. To convert a line to "words", I would use string.split().

For example, for n equals 2, I would like to check if B D or D E, in this order, is in line 1. B D does not appear in this order in line 1, but D E does.

Does anyone see a pythonic way to do this?

I have a solution for n equals 2, but I realized that I need it for an arbitrary n. It is also not particularly beautiful:

def string_contains_words_of_string(words_str, words_to_check_str):
    words = words_str.split()
    words_to_check = words_to_check_str.split()

    found_word_index = None
    for word in words:
        start = 0 if found_word_index is None else found_word_index + 1
        for i, word_to_check in enumerate(words_to_check[start:]):
            if word_to_check == word:
                if found_word_index is not None:
                    return True
                found_word_index = i
                break
            else:
                found_word_index = None
    return False

One way is a regular expression with a lookahead:

>>> import re
>>> st1='A B C D E F'
>>> st2='B D E'
>>> n=2
>>> pat=r'(?=({}))'.format(r'\s+'.join(r'\w+' for i in range(n)))
>>> print([(s, s in st1) for s in re.findall(pat, st2)])
[('B D', False), ('D E', True)]

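The capture group sits inside a lookahead, so the matches can overlap and every pair of consecutive words is found: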

>>> re.findall('(?=(\\w+\\s+\\w+))', 'B D E')
['B D', 'D E']

For an arbitrary n, build the pattern by joining n copies of \w+ with \s+:

>>> n=2
>>> r'(?=({}))'.format(r'\s+'.join(r'\w+' for i in range(n)))
'(?=(\\w+\\s+\\w+))'

Then, for each substring s that was found, Python's in operator tests whether s occurs in st1.
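The two steps above could be wrapped into helpers along these lines. This is only a sketch: the names word_groups and contains_group are made up for illustration, and the \b at the start of the pattern is an addition so that groups begin at the start of a word when the words have more than one character.

```python
import re

def word_groups(line, n):
    """Return every run of n consecutive words in line, as strings."""
    # The lookahead lets matches overlap; \b anchors each match at a word start.
    pattern = r'(?=\b({}))'.format(r'\s+'.join([r'\w+'] * n))
    return re.findall(pattern, line)

def contains_group(line1, line2, n):
    """True if any n consecutive words of line2 occur verbatim in line1."""
    return any(group in line1 for group in word_groups(line2, n))

print(word_groups('B D E', 2))
print(contains_group('A B C D E F', 'B D E', 2))
```

Note that the test is still a plain substring test, so a group can also match across partial words of line1 (for example 'D E' inside 'AD EF').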


Alternatively, without a regex, split the line and join every slice of n consecutive words:

>>> li=st2.split()
>>> n=2
>>> [(s, s in st1) for s in (' '.join(li[i:i+n]) for i in range(len(li)-n+1))]
[('B D', False), ('D E', True)]

If you also want the position (index) of each match, use str.find:

>>> [(s, st1.find(s)) for s in (' '.join(li[i:i+n]) for i in range(len(li)-n+1)) 
...     if s in st1]
[('D E', 6)]

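With words longer than one character, add word boundaries so that each match starts at the beginning of a word: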

>>> st='wordW wordX wordY wordZ'
>>> re.findall(r'(?=(\b\w+\s\b\w+))', st)
['wordW wordX', 'wordX wordY', 'wordY wordZ']

You can build the ngrams of both lines and intersect them:

a = 'this is an example, whatever'.split()
b = 'this is another example, whatever'.split()

def ngrams(string, n):
    return set(zip(*[string[i:] for i in range(n)]))

def common_ngrams(string1, string2, n):
    return ngrams(string1, n) & ngrams(string2, n)

Usage:

print(common_ngrams(a, b, 2))
{('this', 'is'), ('example,', 'whatever')}

print(common_ngrams(a, b, 1))
{('this',), ('is',), ('example,',), ('whatever',)}
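Applied to the lines from the question, the same idea gives a yes/no answer for any n. A minimal sketch; shares_ngram is a hypothetical helper name, and the argument of ngrams is renamed to words because it is a list of words rather than a string.

```python
def ngrams(words, n):
    """Return the set of all n-tuples of consecutive items in words."""
    return set(zip(*[words[i:] for i in range(n)]))

def shares_ngram(line1, line2, n):
    """True if the two lines have at least one run of n words in common."""
    return bool(ngrams(line1.split(), n) & ngrams(line2.split(), n))

print(shares_ngram('A B C D E F', 'B D E', 2))  # True: ('D', 'E') is in both
print(shares_ngram('A B C D E F', 'B D E', 3))  # False: no common trigram
```

Because whole words are compared as tuples, this variant cannot match on partial words the way a substring test can.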

To explain, the zip call in ngrams

zip(*[string[i:] for i in range(n)])

is equivalent to

zip(string, string[1:], string[2:])

for n = 3.



Here is another approach, using two example lines:

a = 'this is a beautiful day'
b = 'this day is awful'

First, collect the words of b that also occur in a:

x = [x for x in b.split() if x in a.split()]

x now contains (in the order the words appear in b):

['this', 'day', 'is']

Then take every slice of x (from 0 to len(x)) and check it against b:

for i in range(len(x)):
    # start j at i + 1 so the slice is never empty
    for j in range(i + 1, len(x) + 1):
        word = ' '.join(x[i:j])
        if word in b:
            print(word)

This checks the slices against b; to check them against a, change the string in the if condition.


The longest common substring algorithm will work here if you pass it lists of words instead of plain strings, with the added bonus that it also returns the longest common run of characters if you pass the strings unsplit.

def longest_common_substring(s1, s2):
    # m[x][y] is the length of the common run ending at s1[x - 1] and s2[y - 1]
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]
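To answer the original question with this idea, one could compare the length of the longest common run of words with n. The sketch below is a compact variant that keeps only two rows of the table instead of the full matrix; longest_common_run and has_run_of are hypothetical names chosen for illustration.

```python
def longest_common_run(seq1, seq2):
    """Return the longest run of consecutive items shared by seq1 and seq2."""
    best_len, best_end = 0, 0
    prev = [0] * (len(seq2) + 1)
    for x in range(1, len(seq1) + 1):
        cur = [0] * (len(seq2) + 1)
        for y in range(1, len(seq2) + 1):
            if seq1[x - 1] == seq2[y - 1]:
                # extend the run that ended at the previous pair of items
                cur[y] = prev[y - 1] + 1
                if cur[y] > best_len:
                    best_len, best_end = cur[y], x
        prev = cur
    return seq1[best_end - best_len:best_end]

def has_run_of(line1, line2, n):
    """True if the lines share at least n consecutive words."""
    return len(longest_common_run(line1.split(), line2.split())) >= n

print(longest_common_run('A B C D E F'.split(), 'B D E'.split()))  # ['D', 'E']
print(has_run_of('A B C D E F', 'B D E', 2))  # True
```

If a common run of length m exists, every shorter run inside it is common too, so a single comparison against n is enough.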

Source: https://habr.com/ru/post/1536151/

