K-Skip-N-Gram: generalization of for-loops in R

Question

I have an R function that generates k-skip-n-grams. The full function is on GitHub.

The code does produce the expected k-skip-n-grams:

    > kSkipNgram("Lorem ipsum dolor sit amet, consectetur adipiscing elit.", n=2, skip=1)
     [1] "Lorem dolor"            "Lorem ipsum"            "ipsum sit"
     [4] "ipsum dolor"            "dolor amet"             "dolor sit"
     [7] "sit consectetur"        "sit amet"               "amet adipiscing"
    [10] "amet consectetur"       "consectetur elit"       "consectetur adipiscing"
    [13] "adipiscing elit"

However, I would like to generalize and simplify the following switch statement, which hard-codes one extra level of nested for-loops for each value of n:

    # x    - should be text, sentence
    # n    - n-gram
    # skip - number of skips
    ###################################
    switch(as.character(n),
      "0" = {ngram <- c(ngram, paste(x[i]))},
      "1" = {
        for (j in skip:1) {
          if (i+j <= length(x)) {
            ngram <- c(ngram, paste(x[i], x[i+j]))
          }
        }
      },
      "2" = {
        for (j in skip:1) {
          for (k in skip:1) {
            if (i+j <= length(x) && i+j+k <= length(x)) {
              ngram <- c(ngram, paste(x[i], x[i+j], x[i+j+k]))
            }
          }
        }
      },
      "3" = {
        for (j in skip:1) {
          for (k in skip:1) {
            for (l in skip:1) {
              if (i+j <= length(x) && i+j+k <= length(x) && i+j+k+l <= length(x)) {
                ngram <- c(ngram, paste(x[i], x[i+j], x[i+j+k], x[i+j+k+l]))
              }
            }
          }
        }
      },
      "4" = {
        for (j in skip:1) {
          for (k in skip:1) {
            for (l in skip:1) {
              for (m in skip:1) {
                if (i+j <= length(x) && i+j+k <= length(x) && i+j+k+l <= length(x) && i+j+k+l+m <= length(x)) {
                  ngram <- c(ngram, paste(x[i], x[i+j], x[i+j+k], x[i+j+k+l], x[i+j+k+l+m]))
                }
              }
            }
          }
        }
      }
    )
    }  # these two braces close enclosing code that is not shown here
    }
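Each case in the switch follows the same pattern: one loop variable per gap, every variable running over `skip:1`, and word positions formed as running sums of those offsets. A minimal sketch of that pattern with the nesting depth as a parameter is given here in Python, the language used in the answer. The names `skip_ngrams_by_gap`, `tokens`, `gaps` and `max_step` are hypothetical and do not come from the original function. The sketch assumes `max_step >= 1` and Python 3.8 or later (for `accumulate(..., initial=...)`). The sample output in the question for `n=2, skip=1` looks like `gaps=1, max_step=2` here, which suggests the full function adjusts both arguments before the switch, but that is a guess because the full function is not shown.

```python
from itertools import accumulate, product


def skip_ngrams_by_gap(tokens, gaps, max_step):
    """Generalize the nested loops: one offset per gap, each in max_step..1.

    tokens   -- list of words (the snippet calls this x)
    gaps     -- number of nested loops, i.e. words per gram minus one
    max_step -- largest offset between neighbouring words (the snippet's skip)
    """
    grams = []
    steps = range(max_step, 0, -1)  # same order as R's skip:1 when skip >= 1
    for i in range(len(tokens)):
        # product(...) enumerates (j, k, l, ...) in the order nested loops would
        for offsets in product(steps, repeat=gaps):
            # running sums give the positions i, i+j, i+j+k, ...
            positions = list(accumulate(offsets, initial=i))
            # offsets are positive, so checking the last position is enough
            if positions[-1] < len(tokens):
                grams.append(" ".join(tokens[p] for p in positions))
    return grams


words = "Lorem ipsum dolor sit amet".split()
print(skip_ngrams_by_gap(words, gaps=1, max_step=2))
# ['Lorem dolor', 'Lorem ipsum', 'ipsum sit', 'ipsum dolor',
#  'dolor amet', 'dolor sit', 'sit amet']
```

With `gaps=0` the product yields a single empty tuple, so each word is emitted on its own, which mirrors the `"0"` case of the switch.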

for-loop r switch-statement n-gram

frankenstein Aug 15 '13 at 18:21

1 answer

Beright · Answer 1 · 2014-03-20T22:05:17+0000

I used a recursive solution for generic k-skip-n-grams. I have included it here in Python; I am not familiar with R, but I hope you can translate it. I used the definition from this document: http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

This should probably be optimized with dynamic programming if you intend to use it on long sentences, as it currently does a lot of redundant work (the same sub-grams are computed many times). I also have not tested it thoroughly, so there may be corner cases it gets wrong.

    def kskipngrams(sentence, k, n):
        """Assumes the sentence is already tokenized into a list"""
        if n == 0 or len(sentence) == 0:
            return None
        grams = []
        # every position that leaves room for n tokens can start a gram
        for i in range(len(sentence) - n + 1):
            grams.extend(initial_kskipngrams(sentence[i:], k, n))
        return grams

    def initial_kskipngrams(sentence, k, n):
        """All k-skip-n-grams that start with sentence[0]"""
        if n == 1:
            return [[sentence[0]]]
        grams = []
        # j is the number of tokens skipped right after sentence[0]
        for j in range(min(k + 1, len(sentence) - 1)):
            # the rest of the gram may skip at most k - j more tokens
            kmjskipnm1grams = initial_kskipngrams(sentence[j+1:], k - j, n - 1)
            if kmjskipnm1grams is not None:
                for gram in kmjskipnm1grams:
                    grams.append([sentence[0]] + gram)
        return grams
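One way to cut the redundant work mentioned above is to cache sub-results. The following is only a sketch of that idea, not a tested replacement: it keeps the same recursion but indexes into the sentence instead of slicing it, so that each `(start, budget, length)` combination is computed once. The names `kskipngrams_memo` and `starting_at` are made up for this illustration. Unlike the version above, it returns an empty list rather than `None` for an empty sentence or `n <= 0`.

```python
from functools import lru_cache


def kskipngrams_memo(sentence, k, n):
    """Same grams as the recursive version, with sub-results cached.

    Assumes `sentence` is a list of tokens (already tokenized).
    """
    tokens = tuple(sentence)
    if n <= 0 or not tokens:
        return []

    @lru_cache(maxsize=None)
    def starting_at(start, budget, length):
        # All grams of `length` tokens that begin at tokens[start] and
        # skip at most `budget` tokens in total.
        if length == 1:
            return ((tokens[start],),)
        grams = []
        remaining = len(tokens) - start - 1  # tokens after position `start`
        for j in range(min(budget + 1, remaining)):
            # skipping j tokens here leaves budget - j for the rest
            for tail in starting_at(start + j + 1, budget - j, length - 1):
                grams.append((tokens[start],) + tail)
        return tuple(grams)

    result = []
    for i in range(len(tokens) - n + 1):
        result.extend(list(gram) for gram in starting_at(i, k, n))
    return result


print(kskipngrams_memo(["a", "b", "c"], k=1, n=2))
# [['a', 'b'], ['a', 'c'], ['b', 'c']]
```

The cache stores tuples because `lru_cache` hands back the same object on every hit, and immutable values rule out accidental modification of a cached result. The number of grams can still grow quickly with `k` and `n`, so caching reduces repeated work but does not bound the output size.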
