Create ngrams with Julia

To generate word bigrams in Julia, I could simply zip the original list with a copy that drops the first element, for example:

julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
 "the"  
 "lazy" 
 "fox"  
 "jumps"
 "over" 
 "the"  
 "brown"
 "dog"  

julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

To generate trigrams, I could use the same collect(zip(...)) idiom to get:

julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox")  
 ("lazy","fox","jumps")
 ("fox","jumps","over")
 ("jumps","over","the")
 ("over","the","brown")
 ("the","brown","dog") 

But I have to manually add a third list to zip through. Is there an idiomatic way to do this for n-grams of any order?

E.g. I would like to avoid writing this to extract 5-grams:

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")
3 answers

Here's a clean one-liner for n-grams of any length.

ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))

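This uses a generator comprehension to iterate over k, the number of elements to drop. The splat operator (...) then unpacks the resulting drop iterators into zip, and finally collect turns the zip into an Array.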

julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

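As you can see, this is very similar to your solution; the only addition is a comprehension over the number of elements to drop, so the n-gram length can vary.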
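The transcripts above come from an older Julia release in which drop was available without an import. A minimal sketch of the same one-liner for Julia 1.x, assuming drop is taken from Base.Iterators (the variable names words, bigrams and fivegrams are only illustrative):

```julia
# Assumption: Julia 1.x, where `drop` lives in Base.Iterators.
using Base.Iterators: drop

# Zip n staggered copies of `s`; copy k skips the first k elements.
ngram(s, n) = collect(zip((drop(s, k) for k in 0:n-1)...))

words = split("the lazy fox jumps over the brown dog")
bigrams = ngram(words, 2)     # vector of 2-tuples
fivegrams = ngram(words, 5)   # vector of 5-tuples
```

zip stops at the shortest of its arguments, which is why the result has length(s) - n + 1 entries.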

Another way is to use partition() from Iterators.jl:

ngram(s,n) = collect(partition(s, n, 1))

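By changing the output slightly and using SubArrays instead of Tuples, little is lost, but allocations and memory copying can be avoided. If the underlying word list is static, this is fine, and faster (in my benchmarks, too). The code: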

ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]

and the output:

julia> ngram(s,5)
 SubString{String}["the","lazy","fox","jumps","over"] 
 SubString{String}["lazy","fox","jumps","over","the"] 
 SubString{String}["fox","jumps","over","the","brown"]
 SubString{String}["jumps","over","the","brown","dog"]

julia> ngram(s,5)[1][3]
"fox"
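The trade-off with views is that they share memory with the word list, so a later change to the list shows up in every n-gram that covers it. A small sketch of that behaviour, with the hypothetical name ngram_views and a Vector{String} built via String.(...) for simplicity:

```julia
# Each n-gram is a view into `s`: no copying, but also no isolation from `s`.
ngram_views(s, n) = [view(s, i:i+n-1) for i in 1:length(s)-n+1]

words = String.(split("the lazy fox jumps over the brown dog"))
grams = ngram_views(words, 3)

before = grams[1][3]   # third word of the first trigram
words[3] = "cat"       # mutate the underlying list
after = grams[1][3]    # the existing view now sees the new word
```

If the list may change while the n-grams are still in use, copying each window (for example with collect) would avoid this aliasing at the cost of extra allocations.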


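Also note that a generator lets you process the ngrams faster and with less memory (one by one), which may be enough for the processing you need. For example, use @Gnimuc's solution without the collect, i.e. just partition(s, n, 1).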
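As a rough sketch of that one-by-one style, the following counts n-gram frequencies without ever building the full list of n-grams. It uses a plain generator over views rather than partition so that it needs no packages; the names lazy_ngrams and count_ngrams are made up for this example:

```julia
# Lazily yield each window of n consecutive words as a view.
lazy_ngrams(s, n) = (view(s, i:i+n-1) for i in 1:length(s)-n+1)

# Count how often each n-gram occurs, consuming the generator one item at a time.
function count_ngrams(s, n)
    counts = Dict{Vector{String},Int}()
    for gram in lazy_ngrams(s, n)
        key = collect(gram)   # copy the window so the key owns its data
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

words = String.(split("the lazy fox jumps over the lazy dog"))
counts = count_ngrams(words, 2)
```

Only one window is alive at a time during the loop, so memory use is driven by the number of distinct n-grams rather than the total number of windows.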

Source: https://habr.com/ru/post/1670374/

