Count the number of substrings of 5 characters within a string

Given a string, I want to count how many substrings of length 5 it contains.

For example: Input: "ABCDEFG" Output: 3

I'm not sure what the easiest and fastest way to do this in Python is. Any ideas?

Update:

I only want to count the distinct substrings.

Input: "AAAAAA" Substrings: 2 times "AAAAA" Output: 1

7 answers
    >>> n = 5
    >>> for s in 'ABCDEF', 'AAAAAA':
    ...     len({s[i:i+n] for i in range(len(s)-n+1)})
    ...
    2
    1
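The same set-based idea can be wrapped in a reusable function. This is a minimal sketch; the name `count_distinct_substrings` is an illustrative choice, not part of the answer above, and it assumes `n` is at least 1.

```python
def count_distinct_substrings(s, n=5):
    """Return the number of distinct substrings of length n in s."""
    # range() is empty when n > len(s), so short strings yield 0.
    return len({s[i:i + n] for i in range(len(s) - n + 1)})


print(count_distinct_substrings("ABCDEFG"))  # 3
print(count_distinct_substrings("AAAAAA"))   # 1
```

Building the set costs roughly len(s) * n character copies, which is fine for n = 5 but grows with the substring length.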

To get substrings, you can use NLTK as follows:

    >>> from nltk.util import ngrams
    >>> for gram in ngrams("ABCDEFG", 5):
    ...     print(gram)
    ...
    ('A', 'B', 'C', 'D', 'E')
    ('B', 'C', 'D', 'E', 'F')
    ('C', 'D', 'E', 'F', 'G')

You can apply Counter and then get unique n-grams (and their frequency) as follows:

    >>> from collections import Counter
    >>> Counter(ngrams("AAAAAAA", 5))
    Counter({('A', 'A', 'A', 'A', 'A'): 3})
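If installing NLTK is not an option, plain string slicing can feed `Counter` directly. The sketch below uses only the standard library; `substring_counts` is a hypothetical helper name introduced for illustration.

```python
from collections import Counter


def substring_counts(s, n=5):
    """Map each length-n substring of s to its number of occurrences."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


counts = substring_counts("AAAAAAA")
print(counts)       # Counter({'AAAAA': 3})
print(len(counts))  # 1 distinct substring
```

Here the keys are strings rather than tuples of characters, and `len(counts)` gives the number of distinct substrings.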

Using a list comprehension (code golf):

    findSubs = lambda s, v: [''.join([s[i+j] for j in range(v)]) for i, x in enumerate(s) if i <= len(s)-v]
    findCount = lambda s, v: len(findSubs(s, v))
    print(findSubs('ABCDEFG', 5))   # prints ['ABCDE', 'BCDEF', 'CDEFG']
    print(findCount('ABCDEFG', 5))  # prints 3

Update

For your update, you can convert the list above into a set, turn it back into a list, and then sort the strings.

    findUnique = lambda s, v: sorted(list(set(findSubs(s, v))))
    findUniqueCount = lambda s, v: len(findUnique(s, v))
    print(findUnique('AAAAAA', 5))       # prints ['AAAAA']
    print(findUniqueCount('AAAAAA', 5))  # prints 1

This is just the length minus 4 (or 0 for strings shorter than 5 characters):

    def substrings(s):
        # max() guards against negative results for very short strings
        return max(0, len(s) - 4)

This works because every character from the first up to the fifth-to-last can serve as the first letter of a substring.
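The reasoning above can be checked by listing the substrings explicitly. This is a small sketch, with `all_substrings` as a hypothetical helper name; note that it counts every occurrence, duplicates included.

```python
def all_substrings(s, n=5):
    """List every length-n substring of s, duplicates included."""
    # Valid start positions run from 0 to len(s) - n inclusive.
    return [s[i:i + n] for i in range(len(s) - n + 1)]


subs = all_substrings("ABCDEFG")
print(subs)                             # ['ABCDE', 'BCDEF', 'CDEFG']
print(len(subs) == len("ABCDEFG") - 4)  # True
```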


A general solution could be:

    def count(string, nletters):
        return max(0, len(string) - nletters + 1)

Which, applied to your example, gives:

    print(count("ABCDEFG", 5))  # prints 3

The same formula as a lambda:

    >>> how_much = lambda string, length: max(len(string) - length + 1, 0)
    >>> how_much("ABCDEFG", 5)
    3

Python is probably not the best language for this, but if the substrings you are looking for are not as short as 5 characters but, say, longer than 1000, and your main string is very long, then a linear-time solution is to build a suffix tree; you can read about them online. The suffix tree for a string of length n can be built in O(n) time, and traversing the tree also takes O(n) time. By walking the upper levels of the tree you can count all distinct substrings of a given length, also in O(n) time, regardless of the length of the required substrings.
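A real linear-time suffix tree (for example, built with Ukkonen's algorithm) is too long to show here. As a rough illustration of the "count nodes at a given depth" idea only, the sketch below uses a much simpler depth-limited trie of nested dicts. It is not a suffix tree and runs in O(len(s) * n) time rather than O(len(s)); the function name `count_distinct_with_trie` is a hypothetical choice.

```python
def count_distinct_with_trie(s, n):
    """Count distinct length-n substrings using a depth-limited trie.

    Simplified illustration: O(len(s) * n) time, not the O(len(s))
    that a real suffix tree would give.
    """
    if n <= 0 or n > len(s):
        return 0
    root = {}
    count = 0
    for start in range(len(s) - n + 1):
        node = root
        created = False
        for ch in s[start:start + n]:
            if ch not in node:
                node[ch] = {}
                created = True
            node = node[ch]
        # A newly created node on the path means a new depth-n node,
        # i.e. a substring that has not been seen before.
        if created:
            count += 1
    return count


print(count_distinct_with_trie("ABCDEFG", 5))  # 3
print(count_distinct_with_trie("AAAAAA", 5))   # 1
```

Each distinct substring of length n corresponds to exactly one node at depth n, which is the same quantity a suffix tree traversal would count.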


Source: https://habr.com/ru/post/973728/

