Python - calculate co-occurrence matrix

I am working on an NLP task, and I need to compute a co-occurrence matrix over documents. The basic formulation is as follows:

Here I have a matrix of shape (n, length), where each row represents a sentence composed of length words. Thus there are n sentences, all of the same length. Then, given a context size, for example window_size = 5, I want to calculate the co-occurrence matrix D, where the entry in the wth row and cth column is #(w,c), the number of times the context word c appears in the context of w.

An example can be found here: How to calculate the co-occurrence between two words in a window of text?

I know that it can be calculated with nested loops, but I want to know whether there is a simpler way or a ready-made function. I have found some answers, but they do not work with a window sliding over a sentence. For example: word-word co-occurrence matrix

So can someone tell me whether there is a function in Python that solves this problem concisely? I think this task is quite common in NLP.

+6
2 answers

It's not that complicated, I think. Why not write a function yourself? First, get the matrix X according to this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage Then, for each sentence, calculate the co-occurrences and add them to a summary variable.

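    import numpy as np

    # X is assumed to be the (n, length) integer array of word ids from the question
    window = 5                    # words considered on each side of the target
    n_words = int(np.max(X)) + 1  # n_words is the count of all distinct words
    m = np.zeros([n_words, n_words])

    def cal_occ(sentence, m):
        for i, word in enumerate(sentence):
            # neighbours within `window` positions, clipped to the sentence bounds
            for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
                if j != i:  # do not count a word as its own context
                    m[word, sentence[j]] += 1

    for sentence in X:
        cal_occ(sentence, m)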
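Since the question asks for something more concise than nested loops, here is a minimal vectorized sketch of the same idea. It assumes X is an (n, length) array of integer word ids and that two positions count as co-occurring when they are at most `window` apart in the same row, with a word never counted as its own context. The name `cooccurrence_matrix` and the toy data are made up for illustration; `np.add.at` is used because plain fancy-index `+=` would drop repeated index pairs.

```python
import numpy as np

def cooccurrence_matrix(X, vocab_size, window):
    """Return D where D[w, c] is how often word id c occurs within
    `window` positions of word id w in the same row of X."""
    X = np.asarray(X)
    D = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    # Loop over offsets only; every row and position is handled at once.
    for d in range(1, min(window, X.shape[1] - 1) + 1):
        left = X[:, :-d].ravel()   # word at position i
        right = X[:, d:].ravel()   # word at position i + d
        # np.add.at accumulates correctly when an index pair repeats
        np.add.at(D, (left, right), 1)  # right word in the context of left
        np.add.at(D, (right, left), 1)  # left word in the context of right
    return D

# Toy data (hypothetical): two sentences of four word ids each
X = np.array([[0, 1, 2, 1],
              [2, 0, 1, 0]])
D = cooccurrence_matrix(X, vocab_size=3, window=1)
print(D)
```

The Python-level loop runs `window` times rather than once per word, which is usually the cheaper direction when sentences are long and the window is small.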
+9

I calculated the co-occurrence matrix with window size = 2

  1. First, write a function that returns the correct neighboring words (here I used GetContext).

  2. Create a matrix and just add 1 whenever a particular word is present in the neighborhood.

Here is the Python code:

    import numpy as np

    # GetContext is the helper shown further down; define it before running this part
    CORPUS = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
    top2000 = ["abc", "pqr", "def"]  # list(set((' '.join(ctxs)).split(' ')))
    a = np.zeros((3, 3), np.int32)   # one row and one column per word in top2000

    for sentence in CORPUS:
        for index, word in enumerate(sentence.split(' ')):
            if word in top2000:
                print(word)
                context = GetContext(sentence, index)
                print(context)
                for word2 in context:
                    if word2 in top2000:
                        a[top2000.index(word)][top2000.index(word2)] += 1
    print(a)

The GetContext function:

    def GetContext(sentence, index):
        words = sentence.split(' ')
        # Up to two words on each side of words[index]. Slicing clips at the
        # sentence boundaries, so very short sentences no longer raise IndexError.
        return words[max(index - 2, 0):index] + words[index + 1:index + 3]

Here is the result:

 array([[0, 3, 3], [3, 0, 2], [3, 2, 0]]) 
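As a hedged variation on the approach above, the sketch below makes the window size a parameter and replaces the repeated `top2000.index(...)` calls with a dictionary lookup, which matters once the vocabulary really has thousands of entries. The names `get_context` and `count_cooccurrences` are hypothetical and not part of the answer; on the same corpus with window=2 it is expected to reproduce the matrix shown above.

```python
import numpy as np

def get_context(words, index, window=2):
    """Return up to `window` words on each side of words[index]."""
    return words[max(index - window, 0):index] + words[index + 1:index + 1 + window]

def count_cooccurrences(corpus, vocab, window=2):
    """Count how often each vocab word appears in the context of another."""
    pos = {w: k for k, w in enumerate(vocab)}  # word -> row/column index
    a = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
    for sentence in corpus:
        words = sentence.split()
        for index, word in enumerate(words):
            if word not in pos:
                continue  # only count rows for words in the vocabulary
            for word2 in get_context(words, index, window):
                if word2 in pos:
                    a[pos[word], pos[word2]] += 1
    return a

CORPUS = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
a = count_cooccurrences(CORPUS, ["abc", "pqr", "def"], window=2)
print(a)
```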
0

Source: https://habr.com/ru/post/1014106/

